A2A Extensibility: Supporting Multimodal Interaction

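Abstract: The A2A (Agent2Agent) protocol supports multimodal interaction (text, forms, audio and video), giving enterprise AI systems a flexible basis for collaboration. Its dynamic UX negotiation lets agents switch interaction modes at runtime to fit complex scenarios. This article examines A2A's multimodal interaction design, focusing on the interactionModes field of the AgentCard, the dynamic negotiation flow, and support for audio/video streams. Using the demo applications in the GitHub repository, Mermaid diagrams, and code examples, it shows how this extensibility design enables dynamic collaboration between agents.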

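1. Introduction: Why Multimodal Interaction Is Needed

In enterprise AI systems, agents need to interact with users and with other agents in varied ways. A customer support agent, for example, might start with a text chat, switch to a form to collect invoice details, and escalate to an audio or video call to resolve a complex problem. Single-mode communication (such as exchanging JSON over a REST API) cannot meet these dynamic needs. Google's A2A (Agent2Agent) protocol addresses this by supporting multimodal interaction (text, forms, audio and video) within an extensible collaboration framework.

A2A's multimodal interaction builds on the AgentCard's capabilities.interactionModes field and on dynamic UX negotiation, which lets agents agree at runtime on the most suitable interaction mode. This article walks through that mechanism, drawing on the demo applications in the Google A2A GitHub repository.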

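2. Overview of Multimodal Interaction

A2A's multimodal interaction lets an agent support the following modes:

Text: JSON or plain-text exchanges, suited to simple tasks such as status queries.
Form: structured input (such as an HTML form or a JSON Schema), used to collect complex data.
Audio/Video: real-time media streams, used for customer support, education, or remote collaboration.

An agent declares these modes in the interactionModes field of its AgentCard, for example: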

{
  "name": "CustomerSupportAgent",
  "url": "https://example.com/a2a",
  "capabilities": {
    "interactionModes": ["text", "form", "video"],
    "streaming": true,
    "pushNotifications": true
  }
}
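
To make the declaration concrete, here is a minimal sketch of how a client might read the modes from a card shaped like the example above. The helper names supported_modes and supports are hypothetical, and falling back to text when the field is absent is an assumption made for illustration, not something the protocol is known to require.

```python
import json

# Card shaped like the example in this article.
CARD_JSON = """
{
  "name": "CustomerSupportAgent",
  "url": "https://example.com/a2a",
  "capabilities": {
    "interactionModes": ["text", "form", "video"],
    "streaming": true,
    "pushNotifications": true
  }
}
"""


def supported_modes(card: dict) -> list:
    """Return the declared interaction modes (hypothetical helper).

    Assumption: a card without the field is treated as text-only.
    """
    capabilities = card.get("capabilities", {})
    return list(capabilities.get("interactionModes", ["text"]))


def supports(card: dict, mode: str) -> bool:
    """Check whether the card declares the given mode."""
    return mode in supported_modes(card)


card = json.loads(CARD_JSON)
print(supported_modes(card))
print(supports(card, "audio"))
```

A client would typically run a check like this before proposing a mode, so that it never asks for something the remote agent has not declared.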

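2.1 Dynamic UX Negotiation

Dynamic UX negotiation means that, while a task is running, agents agree to switch interaction modes based on what the task needs or on the user's context. For example:

A user submits a text request; the agent finds that it needs an image of the invoice and proposes switching to form mode.
A customer support agent escalates to a video call when the problem is complex.

The negotiation flow relies on the AgentCard and on task status updates, carried over HTTP or WebSocket.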

2.2 Multimodal Interaction Flowchart

The following flowchart shows the multimodal interaction flow:

flowchart TD
    A[Client Request] --> B{Interaction Type}
    B --> C[Text]
    B --> D[Form]
    B --> E[Audio/Video]
    C --> F[Process Text]
    D --> G[Render Form]
    E --> H[Stream Media]
    F --> I[Response]
    G --> I
    H --> I
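
The branching in the flowchart can be sketched as a small dispatch table. Everything here is illustrative: the handler names mirror the flowchart nodes, and the request layout (a mode key plus a payload key) is an assumption rather than a format defined by A2A.

```python
from typing import Callable, Dict


def process_text(payload: dict) -> dict:
    # "Process Text" node: echo the text back as a simple response.
    return {"mode": "text", "result": f"Processed text: {payload.get('text', '')}"}


def render_form(payload: dict) -> dict:
    # "Render Form" node: report which fields would be rendered.
    names = [field["name"] for field in payload.get("fields", [])]
    return {"mode": "form", "result": f"Rendered form with fields: {', '.join(names)}"}


def stream_media(payload: dict) -> dict:
    # "Stream Media" node: a real agent would start stream setup here.
    return {"mode": "media", "result": "Media stream requested"}


# Hypothetical routing table: one handler per interaction type.
HANDLERS: Dict[str, Callable[[dict], dict]] = {
    "text": process_text,
    "form": render_form,
    "audio": stream_media,
    "video": stream_media,
}


def dispatch(request: dict) -> dict:
    """Route a request to the handler for its interaction mode."""
    mode = request.get("mode", "text")
    handler = HANDLERS.get(mode)
    if handler is None:
        raise ValueError(f"Unsupported interaction mode: {mode}")
    return handler(request.get("payload", {}))


print(dispatch({"mode": "text", "payload": {"text": "order status?"}}))
print(dispatch({"mode": "form", "payload": {"fields": [{"name": "invoice"}]}}))
```

Keeping the routing in a table means a new mode only needs a new handler and one extra entry.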

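3. Core Mechanism: The Design of Multimodal Interaction

3.1 The AgentCard's `interactionModes`

The capabilities.interactionModes field of the AgentCard defines the interaction modes an agent supports, as an array of strings (for example ["text", "form", "video"]). Each mode maps to its own handling logic:

Text: handles JSON or plain-text input and returns a simple response.
Form: renders an interactive form from a JSON Schema or an HTML template.
Audio/Video: handles real-time audio and video over WebRTC or another streaming protocol.

Example AgentCard: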

{
  "name": "ExpenseAgent",
  "url": "https://example.com/a2a",
  "capabilities": {
    "interactionModes": ["text", "form", "video"],
    "streaming": true
  },
  "schema": {
    "input": {
      "type": "object",
      "properties": {
        "amount": {"type": "number"},
        "currency": {"type": "string"},
        "invoice": {"type": "string", "format": "uri"}
      }
    }
  }
}

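3.2 The Dynamic Negotiation Flow

Dynamic negotiation of the interaction mode proceeds in four steps:

Discover: the Host Agent fetches the Remote Agent's AgentCard and checks its interactionModes.
Propose: the Host Agent proposes an initial mode (such as text) over HTTP or WebSocket.
Adjust: the Remote Agent suggests an alternative mode (such as form or video) based on what the task needs.
Confirm: both sides agree on a mode and task execution begins.

Sequence diagram of the negotiation: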

sequenceDiagram
    participant H as Host Agent
    participant R as Remote Agent
    H->>R: GET /agentcard
    R-->>H: AgentCard (interactionModes: ["text", "form", "video"])
    H->>R: Propose text mode
    R-->>H: Suggest form (needs invoice)
    H->>R: Agree to form
    H->>R: Submit form data
    R-->>H: Suggest video (complex issue)
    H->>R: Agree to video
    R-->>H: WebRTC stream setup
    H-->>R: Stream video
    R-->>H: Task result
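
The propose, adjust, and confirm steps can be sketched as a single function. This is an illustration under stated assumptions: the negotiate function, its parameters, and the accepted/fallback/rejected status values are hypothetical, and the rule that the Host falls back to its own proposal when it cannot render the suggested mode is one possible policy among several.

```python
from typing import List, Optional


def negotiate(host_modes: List[str], remote_modes: List[str],
              remote_needs: Optional[str] = None) -> dict:
    """Agree on one interaction mode (hypothetical sketch).

    host_modes is ordered by Host preference; remote_modes comes from the
    Remote Agent's AgentCard; remote_needs is the mode the Remote Agent
    would suggest for the current task, if any.
    """
    # Propose: the first Host preference that the Remote Agent declares.
    proposal = next((m for m in host_modes if m in remote_modes), None)
    if proposal is None:
        return {"status": "rejected", "mode": None}

    # Adjust: the Remote Agent may suggest a different mode for this task.
    if remote_needs is not None and remote_needs != proposal:
        if remote_needs in host_modes and remote_needs in remote_modes:
            # Confirm: both sides can handle the suggested mode.
            return {"status": "accepted", "mode": remote_needs}
        # Assumed policy: keep the original proposal if the Host cannot switch.
        return {"status": "fallback", "mode": proposal}

    # Confirm: no adjustment was needed.
    return {"status": "accepted", "mode": proposal}


remote = ["text", "form", "video"]
print(negotiate(["text", "form"], remote))
print(negotiate(["text", "form"], remote, remote_needs="form"))
print(negotiate(["text"], remote, remote_needs="video"))
```

In the sequence diagram the same exchange happens twice, first for form and later for video, so a Host would call a function like this each time the Remote Agent suggests a switch.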

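3.3 Communication Support

Multimodal interaction relies on A2A's communication mechanisms:

HTTP: used for initial negotiation and for text and form interaction (POST /task or GET /form).
WebSocket: used to set up real-time audio/video streams and for dynamic mode switching (the interaction_request event).

Example WebSocket message (mode switch):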

{
  "event": "interaction_request",
  "taskId": "task-001",
  "mode": "form",
  "schema": {
    "type": "object",
    "properties": {
      "invoice": {"type": "string", "format": "uri"}
    }
  }
}
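
As a rough sketch of how such a message could be produced and checked, the following helpers serialize and parse an interaction_request using only the standard library. The function names and the choice of required keys (event, taskId, mode) are assumptions based on the example message above.

```python
import json
from typing import Optional

# Assumption: these three keys are mandatory; "schema" is optional.
REQUIRED_KEYS = {"event", "taskId", "mode"}


def build_interaction_request(task_id: str, mode: str,
                              schema: Optional[dict] = None) -> str:
    """Serialize a mode-switch message like the one shown above."""
    message = {"event": "interaction_request", "taskId": task_id, "mode": mode}
    if schema is not None:
        message["schema"] = schema
    return json.dumps(message)


def parse_interaction_request(raw: str) -> dict:
    """Parse a message and reject anything that is not a mode switch."""
    message = json.loads(raw)
    missing = REQUIRED_KEYS - message.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    if message["event"] != "interaction_request":
        raise ValueError(f"Unexpected event: {message['event']}")
    return message


invoice_schema = {
    "type": "object",
    "properties": {"invoice": {"type": "string", "format": "uri"}},
}
raw = build_interaction_request("task-001", "form", invoice_schema)
parsed = parse_interaction_request(raw)
print(parsed["mode"], list(parsed["schema"]["properties"]))
```

Validating the envelope on receipt keeps malformed messages from reaching the mode-specific handlers.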

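4. Implementation Details: Technical Support for Multimodal Interaction

4.1 Text Interaction

Text interaction is A2A's baseline mode and exchanges task data as JSON. The Remote Agent checks that the input conforms to schema.input and returns a text response.

Example task: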

{
  "taskId": "task-001",
  "type": "expense",
  "data": {
    "amount": 100,
    "currency": "USD"
  }
}
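
The validation step can be illustrated with a deliberately small checker that covers only the type keyword. It is a sketch, not a JSON Schema implementation: validate_input and TYPE_MAP are hypothetical names, and a production agent would more likely use a full validator library.

```python
# Maps the JSON Schema type names used in this article to Python types.
TYPE_MAP = {"string": str, "number": (int, float), "object": dict, "boolean": bool}

# schema.input from the ExpenseAgent card shown earlier.
EXPENSE_SCHEMA = {
    "type": "object",
    "properties": {
        "amount": {"type": "number"},
        "currency": {"type": "string"},
        "invoice": {"type": "string", "format": "uri"},
    },
}


def validate_input(data: dict, schema: dict) -> list:
    """Return a list of type errors; an empty list means the data passed."""
    if not isinstance(data, dict):
        return ["input: expected object"]
    errors = []
    for name, rule in schema.get("properties", {}).items():
        if name not in data:
            continue  # the example schema declares no required fields
        type_name = rule.get("type")
        expected = TYPE_MAP.get(type_name)
        if expected is None:
            continue  # unknown type keyword: skipped in this sketch
        value = data[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if is_bool_mismatch or not isinstance(value, expected):
            errors.append(f"{name}: expected {type_name}")
    return errors


task = {"taskId": "task-001", "type": "expense",
        "data": {"amount": 100, "currency": "USD"}}
print(validate_input(task["data"], EXPENSE_SCHEMA))
print(validate_input({"amount": "100"}, EXPENSE_SCHEMA))
```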

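4.2 Form Interaction

Form interaction is implemented with a JSON Schema or an HTML template. The Remote Agent returns a form definition, and the Host Agent renders a UI (such as a web form) to collect the user's input.

Example form request: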

{
  "event": "interaction_request",
  "taskId": "task-001",
  "mode": "form",
  "form": {
    "title": "Upload Invoice",
    "fields": [
      {
        "name": "invoice",
        "type": "file",
        "label": "Invoice Image"
      }
    ]
  }
}
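
On the Host side, handling this request comes down to rendering the fields and checking the submission before sending it back. The sketch below renders the form as plain text instead of a real UI; render_prompt and missing_fields are hypothetical helpers, and treating every listed field as required is an assumption.

```python
# The "form" object from the interaction_request shown above.
FORM = {
    "title": "Upload Invoice",
    "fields": [
        {"name": "invoice", "type": "file", "label": "Invoice Image"},
    ],
}


def render_prompt(form: dict) -> str:
    """Render the form definition as a plain-text prompt."""
    lines = [form["title"]]
    for field in form.get("fields", []):
        lines.append(f"- {field['label']} ({field['type']})")
    return "\n".join(lines)


def missing_fields(form: dict, submission: dict) -> list:
    """List fields with no value (assumes every field is required)."""
    return [
        field["name"]
        for field in form.get("fields", [])
        if not submission.get(field["name"])
    ]


print(render_prompt(FORM))
print(missing_fields(FORM, {}))
print(missing_fields(FORM, {"invoice": "https://example.com/invoice.jpg"}))
```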

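4.3 Audio/Video Interaction

Audio/video interaction uses WebRTC or a similar protocol, with the media connection set up over WebSocket. streaming: true in the AgentCard indicates that the agent supports real-time streams.

WebRTC setup flow (simplified):

The Host Agent sends an SDP (Session Description Protocol) offer.
The Remote Agent responds with an SDP answer, and the two sides set up a peer-to-peer connection.
Both sides exchange ICE candidates over WebSocket so the connection can find a workable network path.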
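
The ordering of these signaling messages can be sketched without any WebRTC library. The class below only tracks which message is allowed next; it does not create a media connection. The SignalingSession class, its state names, and the message layout (a type key plus sdp or candidate) are assumptions for illustration.

```python
class SignalingSession:
    """Tracks the order of signaling messages for one media session."""

    def __init__(self) -> None:
        self.state = "new"
        self.remote_candidates = []

    def handle(self, message: dict) -> str:
        """Apply one signaling message and return the resulting state."""
        kind = message.get("type")
        if kind == "offer" and self.state == "new":
            # Step 1: the Host Agent's SDP offer arrives.
            self.state = "offer-received"
        elif kind == "answer" and self.state == "offer-received":
            # Step 2: the Remote Agent's SDP answer completes the exchange.
            self.state = "answered"
        elif kind == "candidate" and self.state in ("offer-received", "answered"):
            # Step 3: ICE candidates may arrive once an offer exists.
            self.remote_candidates.append(message["candidate"])
        else:
            raise ValueError(f"Unexpected {kind!r} message in state {self.state!r}")
        return self.state


session = SignalingSession()
messages = [
    {"type": "offer", "sdp": "v=0 ..."},
    {"type": "answer", "sdp": "v=0 ..."},
    {"type": "candidate", "candidate": "candidate:1 1 UDP 2122252543 192.0.2.1 54400 typ host"},
]
states = [session.handle(m) for m in messages]
print(states)
```

Rejecting out-of-order messages early makes signaling bugs visible before any media is negotiated.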

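4.4 Extensibility of Dynamic UX

A2A's dynamic UX negotiation can be extended at runtime, for example:

Mode chaining: a task starts in text mode, then switches to form and then to video.
Context awareness: the mode is chosen to suit the user's device (phone vs. PC).
Third-party integration: plugins add support for new modes (such as AR/VR).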
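
Context awareness can be as simple as a per-device preference order intersected with the agent's declared modes. The preference table below is invented for illustration, as are the choose_mode function and the device labels; a real system would derive these from its own UX requirements.

```python
from typing import List, Optional

# Hypothetical preferences: lighter modes first on mobile, richer modes first on desktop.
DEVICE_PREFERENCES = {
    "mobile": ["text", "form", "video"],
    "desktop": ["video", "form", "text"],
}


def choose_mode(device: str, agent_modes: List[str]) -> Optional[str]:
    """Pick the first preferred mode for the device that the agent supports."""
    for mode in DEVICE_PREFERENCES.get(device, ["text"]):
        if mode in agent_modes:
            return mode
    return None


agent_modes = ["text", "form", "video"]
print(choose_mode("mobile", agent_modes))
print(choose_mode("desktop", agent_modes))
print(choose_mode("desktop", ["form"]))
```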

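5. Code Example: Implementing Multimodal Interaction

The following customer support agent demo, modeled on samples/python/agents/google_adk in the GitHub repository, shows dynamic switching between text, form, and video interaction.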

import asyncio
import json

import aiohttp
from a2a import A2AServer, A2AClient, AgentCard

# Remote Agent: supports multimodal interaction
class CustomerSupportAgent(A2AServer):
    def __init__(self):
        card = AgentCard(
            name="CustomerSupportAgent",
            description="Handles customer support with text, form, and video",
            url="http://localhost:8080/a2a",
            capabilities={
                "interactionModes": ["text", "form", "video"],
                "streaming": True,
                "pushNotifications": True
            },
            schema={
                "input": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "invoice": {"type": "string", "format": "uri"}
                    }
                }
            }
        )
        super().__init__(card=card)

    async def negotiate_interaction(self, proposal: dict) -> dict:
        mode = proposal.get("mode")
        if mode in self.card.capabilities["interactionModes"]:
            return {"status": "accepted", "mode": mode}
        return {"status": "rejected", "suggested": "text"}

    async def handle_task(self, task: dict) -> dict:
        task_id = task["taskId"]
        query = task["data"].get("query", "")

        # Initial text interaction
        await self.notify_status(task_id, "in_progress")
        if query == "upload invoice":
            # Switch to form mode
            await self.send_interaction_request(task_id, {
                "mode": "form",
                "form": {
                    "title": "Upload Invoice",
                    "fields": [
                        {"name": "invoice", "type": "file", "label": "Invoice Image"}
                    ]
                }
            })
            return {"status": "pending", "message": "Awaiting form input"}

        if query == "video support":
            # Switch to video mode
            await self.send_interaction_request(task_id, {
                "mode": "video",
                "webrtc": {"sdp": "v=0\r\no=- 123456789 1 IN IP4 127.0.0.1\r\n..."}
            })
            return {"status": "pending", "message": "Awaiting video setup"}

        return {
            "status": "completed",
            "result": {"message": f"Processed query: {query}"}
        }

# Host Agent: multimodal client
async def support_client(remote_url: str):
    async with aiohttp.ClientSession() as session:
        client = A2AClient(remote_url, session=session)
        agent_card = await client.get_agent_card()
        print(f"Agent: {agent_card['name']}, Modes: {agent_card['capabilities']['interactionModes']}")

        # Text interaction
        task = {
            "taskId": "task-001",
            "type": "support",
            "data": {"query": "upload invoice"}
        }
        response = await client.submit_task(task)
        print(f"Task submitted: {response}")

        # Handle interaction requests
        async for update in client.subscribe_task_updates(task["taskId"]):
            print(f"Update: {update}")
            if update.get("event") == "interaction_request":
                if update["mode"] == "form":
                    # Simulate form submission
                    form_data = {"invoice": "https://example.com/invoice.jpg"}
                    await client.submit_form(task["taskId"], form_data)
                elif update["mode"] == "video":
                    # Simulate the WebRTC answer
                    await client.submit_webrtc_sdp(task["taskId"], {"sdp": "answer_sdp"})
            # interaction_request events may carry no status, so use .get()
            if update.get("status") in ["completed", "failed"]:
                break

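if __name__ == "__main__":
    import sys

    # The server must be listening before the client connects, so run each
    # role in its own process: pass "client" to run the Host Agent, or no
    # argument to start the Remote Agent server.
    if len(sys.argv) > 1 and sys.argv[1] == "client":
        asyncio.run(support_client("http://localhost:8080/a2a"))
    else:
        CustomerSupportAgent().run(port=8080)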