# AgentOrchestra: Orchestrating Multi-Agent Intelligence with the Tool-Environment-Agent (TEA) Protocol
**Authors**:
- Wentao Zhang
- Liang Zeng
- Yuzhen Xiao
- Yongcong Li
- Ce Cui
- Yilei Zhao
- Rui Hu
- Yang Liu
- Yahui Zhou
- Bo An (Skywork AI Nanyang Technological University)
## Abstract
Recent advances in LLM-based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing LLM-based agent protocols (e.g., A2A and MCP) under-specify cross-entity lifecycle and context management as well as version tracking, and rely on ad-hoc environment integration, which in turn encourages fixed, monolithic agent compositions and brittle glue code. To address these limitations, we introduce the Tool–Environment–Agent (TEA) protocol, a unified abstraction that models environments, agents, and tools as first-class resources with explicit lifecycles and versioned interfaces. TEA provides a principled foundation for end-to-end lifecycle and version management, and for associating each run with its context and outputs across components, improving traceability and reproducibility. Moreover, TEA enables continual self-evolution of agent-associated components (unless otherwise specified, agent-associated components include prompts; memory, tool, agent, and environment code; and agent outputs, i.e., solutions) through a closed feedback loop, producing improved versions while supporting version selection and rollback. Building on TEA, we present AgentOrchestra, a hierarchical multi-agent framework in which a central planner orchestrates specialized sub-agents for web navigation, data analysis, and file operations, and supports continual adaptation by dynamically instantiating, retrieving, and refining tools online during execution. We evaluate AgentOrchestra on three challenging benchmarks, where it consistently outperforms strong baselines and achieves 89.04% on GAIA, establishing state-of-the-art performance to the best of our knowledge. Overall, our results provide evidence that TEA and hierarchical orchestration improve scalability and generality in multi-agent systems.
## 1 Introduction
Recent advances in LLM-based agent systems have enabled strong performance on both general-purpose and complex, long-horizon tasks across diverse domains, including web navigation (OpenAI, 2025b; Müller and Žunič, 2024), computer use (Anthropic, 2024a; Qin et al., 2025), code execution (Wang et al., 2024a), game playing (Wang et al., 2023; Tan et al., 2024), and research assistance (OpenAI, 2024; DeepMind, 2024; xAI, 2025). Despite this progress, cross-environment generalization remains limited because context is scattered across prompts and logs, environment integration relies on brittle glue code, and agent-associated components are typically fixed rather than refined through feedback-driven self-evolution.
Additionally, current agent protocols fall short of serving as a general substrate for scalable, general-purpose agents. As summarized in Table 1, representative protocols such as Google’s A2A (Google, 2025) and Anthropic’s MCP (Anthropic, 2024b) provide important building blocks, including task-level collaboration and messaging in A2A, as well as tool and resource schemas, discovery, and invocation in MCP. However, three protocol-level gaps remain: i) Lifecycle and context management are fragmented, as neither standardizes unified primitives to manage lifecycles and maintain consistent, versioned execution context across agent-associated components; ii) Self-evolution is not supported at the protocol level, as both protocols largely treat prompts and resources as externally maintained assets, and do not define a closed loop to refine prompts or tools from execution feedback with traceable versioning; iii) Environments are not first-class: they are delegated to application-specific runtimes instead of being managed components with clear boundaries and constraints. This makes it difficult to switch agents across environments, reuse environments, and isolate parallel runs, often reducing systems to glue-code orchestration.
Table 1: Comparison of TEA Protocol with A2A and MCP. Symbols: $\checkmark$ = Yes, $\triangle$ = Partial, $×$ = No.
| Dimension | TEA | A2A | MCP |
| --- | --- | --- | --- |
| Core Entities | Tool, Env, Agent | Agent, Tool | Model |
| Lifecycle & Version | $\checkmark$ | $×$ | $×$ |
| Entity Transformations | $\checkmark$ | $×$ | $×$ |
| Self-Evolution Support | $\checkmark$ | $×$ | $×$ |
| Open Ecosystem | $\checkmark$ | $\triangle$ | $\triangle$ |
To address these limitations, we propose the Tool–Environment–Agent (TEA) protocol, which treats environments, agents, and tools as explicitly managed components under a unified protocol layer. Concretely, TEA standardizes component identifiers and version semantics, and binds each run to its context and execution state, so that artifacts remain traceable across iterations. Importantly, TEA goes beyond MCP by standardizing cross-entity lifecycle semantics, explicit version semantics with stable entity identifiers, run-indexed context capture, explicit environment boundaries with constraints, and closed-loop evolution hooks driven by execution feedback. As a result, execution state, artifacts, and context can be consistently persisted, reused, and traced across runs and iterations. TEA further enables self-evolution by defining a closed loop in which execution feedback can trigger updates to agent-associated components at runtime, with updates recorded as new versions. Finally, TEA models environments as first-class components with explicit boundaries and constraints, for example web sandboxes, file systems, and code execution runtimes, improving reuse and isolation across heterogeneous domains and reducing context leakage in parallel executions. This also encourages consolidating functionally related tools into coherent environments; for example, discrete file operations can be organized as a managed file system, reducing context fragmentation and management overhead. Overall, TEA aims to make agent construction more composable and reproducible in practice. Detailed motivations for the TEA protocol and in-depth comparisons with existing protocols are provided in Appendices A and B.
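To make the run-to-context binding concrete, the following is a minimal sketch of what a TEA-style versioned component reference and run record might look like. All names (`ComponentRef`, `RunRecord`, the field set) are illustrative assumptions, not the protocol's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)
class ComponentRef:
    """Stable identifier plus an explicit version, per TEA's version semantics.
    (Hypothetical names; the real protocol schema may differ.)"""
    entity_id: str   # stable across versions, e.g. "tool/web_search"
    kind: str        # "tool" | "env" | "agent"
    version: int

@dataclass
class RunRecord:
    """Binds one run to its context and the exact component versions it used,
    so artifacts remain traceable across iterations."""
    run_id: str
    components: List[ComponentRef] = field(default_factory=list)
    context: Dict[str, str] = field(default_factory=dict)
    outputs: List[str] = field(default_factory=list)

run = RunRecord(run_id="run-001")
run.components.append(ComponentRef("tool/web_search", "tool", 3))
run.context["task"] = "find latest release notes"
run.outputs.append("release-notes.md")
assert run.components[0].version == 3
```

Because each `RunRecord` pins exact component versions, replaying or auditing a run never depends on whatever version happens to be current.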
Based on the TEA protocol, we develop AgentOrchestra, a hierarchical multi-agent framework for general-purpose task solving that integrates high-level planning with modular collaboration. AgentOrchestra uses a central planner to decompose a user objective and delegate sub-tasks to specialized agents for research, web navigation, analysis, tool synthesis, and reporting. Compared to flat coordination, where an orchestrator selects from a growing global pool of agents and tools and tends to accumulate irrelevant context, AgentOrchestra adopts hierarchical delegation with localized tool ownership. The planner routes each sub-task to a domain-specific sub-agent (or environment), which maintains and exposes only a curated toolset and context for its domain. This structure converts global coordination into a sequence of localized routing decisions, enabling tree-structured expansion as new capabilities are added while keeping the orchestrator’s decision scope and context footprint bounded. For example, the planner first selects a domain-level agent, which then supplies only the tools and context required for that domain. Furthermore, AgentOrchestra incorporates a self-evolution module that leverages TEA’s lifecycle and versioning mechanisms to refine agent-associated components based on execution feedback. Our contributions are threefold:
- We introduce the TEA protocol, which unifies environments, agents, and tools as first-class, versioned components with lifecycles to support context management and execution.
- We develop AgentOrchestra, a hierarchical multi-agent system built on TEA, demonstrating scalable orchestration through tree-structured routing and feedback-driven self-evolution.
- We conduct extensive evaluations on three challenging benchmarks, including ablations to isolate the effects of key components. AgentOrchestra consistently outperforms strong baselines and achieves 89.04% on GAIA, establishing state-of-the-art performance to the best of our knowledge.
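The hierarchical delegation with localized tool ownership described in this section can be sketched as follows. The class and tool names are illustrative placeholders, not the AgentOrchestra API: the point is that the planner makes one localized routing decision (pick a domain), and only the chosen sub-agent's curated toolset enters scope.

```python
# Sketch of hierarchical routing with localized tool ownership
# (illustrative names only; not the AgentOrchestra implementation).
from typing import Callable, Dict

class SubAgent:
    def __init__(self, domain: str, tools: Dict[str, Callable[[str], str]]):
        self.domain = domain
        self.tools = tools  # curated, domain-local toolset

    def run(self, subtask: str) -> str:
        # Trivial stand-in: dispatch to the first tool in the toolset.
        name, tool = next(iter(self.tools.items()))
        return f"[{self.domain}:{name}] {tool(subtask)}"

class Planner:
    def __init__(self, agents: Dict[str, SubAgent]):
        self.agents = agents  # decision scope is domains, not global tools

    def route(self, subtask: str, domain: str) -> str:
        # One localized routing decision: choose the domain agent.
        return self.agents[domain].run(subtask)

planner = Planner({
    "web": SubAgent("web", {"navigate": lambda t: f"visited page for '{t}'"}),
    "data": SubAgent("data", {"analyze": lambda t: f"stats for '{t}'"}),
})
assert planner.route("quarterly revenue", "data").startswith("[data:analyze]")
```

Adding a new capability means adding a sub-agent (or a tool inside one), which grows the tree without widening the planner's decision scope.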
## 2 Related Work
### 2.1 Tool and Agent Protocols
Recent protocols standardize tool interfaces and agent communication. For instance, MCP (Anthropic, 2024b) unifies tool integration for LLMs, while A2A (Google, 2025) enables agent-to-agent messaging and coordination. Other efforts, such as the Agent Network Protocol (ANP) (Ehtesham et al., 2025) and frameworks like SAFEFLOW (Li et al., 2025), enhance interoperability and safety in multi-agent systems. While these protocols provide essential building blocks, they primarily treat agents and tools as isolated service endpoints, often overlooking environments as dynamic, first-class components. TEA extends these existing standards rather than replacing them. By integrating tools, environments, and agents into a unified context-aware framework, TEA resolves protocol fragmentation by adding the integrated lifecycle and version management missing from MCP and A2A.
### 2.2 General-Purpose Agents
Integrating tools with LLMs represents a paradigm shift, enabling agents to exhibit enhanced flexibility, cross-domain reasoning, and natural language interaction (Liang and Tong, 2025). Such systems have demonstrated efficacy across diverse domains, including web browsing (OpenAI, 2025b; Müller and Žunič, 2024), computer operation (Anthropic, 2024a; Qin et al., 2025), code execution (Wang et al., 2024a), and game playing (Wang et al., 2023; Tan et al., 2024). Standardized interfaces like OpenAI’s Function Calling and Anthropic’s MCP (OpenAI, 2023; Anthropic, 2024b), alongside frameworks such as ToolMaker (Wölflein et al., 2025), have further streamlined the synthesis of LLM-compatible tools. Building upon these foundations, multi-agent architectures like MetaGPT (Hong et al., 2023) demonstrate the potential of specialized agent coordination for complex problem-solving. However, many current approaches still struggle with efficient communication, dynamic role allocation, and scalable teamwork. The emergence of generalist frameworks, including Manus (Shen and Yang, 2025), OpenHands (Wang et al., 2024b), and smolagents (Roucher et al., 2025), has advanced unified perception and tool-augmented action. While recent efforts like Alita (Qiu et al., 2025) explore minimal predefinition and maximal self-evolution, these systems often lack unified protocols for cross-layer resource management. This gap motivates our proposal of the TEA Protocol and AgentOrchestra.
## 3 The TEA Protocol
The TEA Protocol is fundamentally designed around coroutine-based asynchronous execution, enabling concurrent task processing and parallel multi-agent coordination. As illustrated in Figure 1, the protocol architecture comprises three primary layers: i) Basic Managers provide foundational services through six specialized components (model, prompt, memory, dynamic, version, and tracer); ii) Core Protocols define the Tool Context Protocol (TCP), Environment Context Protocol (ECP), and Agent Context Protocol (ACP), each implemented through a context manager for lifecycle engineering and a server for standardized orchestration; and iii) Protocol Transformations establish bidirectional conversion pathways (e.g., A2T, E2T, A2E) enabling dynamic role reconfiguration. Additionally, the protocol incorporates a Self-Evolution Module that wraps agent-associated components as evolvable variables for iterative optimization. Details and formalization can be found in Appendix C.
<details>
<summary>x3.png Details</summary>

### Visual Description
Architecture diagram of the TEA Protocol. User roles (Researcher, Analyzer, Browser, Manager) interact with the system through the Agent Context Protocol (ACP). At the center, a cycle connects three nodes, Agents, Tools, and Envs, around a Tool-Environment-Agent (TEA) hub: Agents reach Tools and Envs via ACP, Tools respond via the Tool Context Protocol (TCP), Envs respond via the Environment Context Protocol (ECP), and Tools and Envs exchange data directly over a bidirectional TCE channel. At the bottom, TCP governs concrete tools (Bash, Python, MDFy, Web, a document, a database, a network graph) and ECP governs concrete environments (Browser, Computer, Mobile, Files, GitHub). The design decouples agents from specific tools and environments: new components integrate by adhering to TCP or ECP without modifying core agent logic.
</details>
Figure 1: Architecture of the TEA Protocol.
### 3.1 Basic Managers
The Basic Managers constitute the foundation of the TEA Protocol, providing essential services through six specialized managers: i) the model manager abstracts heterogeneous LLM backends through a unified interface; ii) the prompt manager handles prompt lifecycle and versioning; iii) the memory manager coordinates persistence via session-based concurrency control; iv) the dynamic manager enables runtime code execution and serialization; v) the version manager maintains evolution histories for all components; and vi) the tracer records comprehensive execution trajectories and system-wide telemetry, serving as a data collection engine for audit, debugging, and the synthesis of high-quality datasets for agent training.
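As one concrete example of these services, a version manager that maintains evolution histories and supports rollback could look roughly like the following sketch. The interface (`register`, `get`, `rollback`) is a hypothetical illustration, not the paper's implementation.

```python
# Minimal sketch of a version manager in the spirit of Section 3.1
# (hypothetical interface; versions are 1-based to match "version numbers").
from typing import Any, Dict, List

class VersionManager:
    def __init__(self) -> None:
        self._history: Dict[str, List[Any]] = {}

    def register(self, entity_id: str, artifact: Any) -> int:
        """Append a new version under a stable entity id; return its number."""
        versions = self._history.setdefault(entity_id, [])
        versions.append(artifact)
        return len(versions)

    def get(self, entity_id: str, version: int) -> Any:
        """Retrieve any historical version for analysis or replay."""
        return self._history[entity_id][version - 1]

    def rollback(self, entity_id: str) -> Any:
        """Drop the latest version and return the previous one."""
        versions = self._history[entity_id]
        versions.pop()
        return versions[-1]

vm = VersionManager()
vm.register("prompt/planner", "v1 text")
assert vm.register("prompt/planner", "v2 text") == 2
assert vm.rollback("prompt/planner") == "v1 text"
```

Keeping the full history per entity id is what makes the version-selection and rollback guarantees of TEA cheap to provide.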
### 3.2 Core Protocols
The TEA Protocol defines three core context protocols: the Tool Context Protocol (TCP), the Environment Context Protocol (ECP), and the Agent Context Protocol (ACP). These protocols share a unified architectural design, each implemented through two core components: a context manager for context engineering, lifecycle management, and semantic retrieval, and a server that exposes standardized interfaces to other system modules. Each protocol generates a unified contract document (analogous to Agent Skills (Anthropic, 2025)) that aggregates all registered components’ descriptions to facilitate resource discovery and usage.
Tool Context Protocol. TCP fundamentally extends MCP (Anthropic, 2024b) by introducing integrated context engineering and comprehensive lifecycle management. Implemented through a ToolContextManager and a TCPServer, it supports seamless tool loading from both local registries and persistent configurations. During registration, TCP automatically synthesizes multiple representation formats, including function-calling schemas for LLM interfaces, natural language descriptions for documentation, and type-safe argument schemas for validation, providing LLMs with rich semantic information for accurate parameter inference. Furthermore, TCP incorporates a robust versioning system and a semantic retrieval mechanism based on vector embeddings, ensuring that tools can evolve over time while remaining easily discoverable through similarity-based queries.
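To illustrate the registration step, the sketch below derives a function-calling schema from a tool's Python signature and docstring, one of the representation formats TCP synthesizes. The helper names and type mapping are assumptions; TCP's versioning and embedding-based retrieval are omitted.

```python
# Sketch of TCP-style registration: derive a function-calling schema from
# a tool's signature (illustrative; not the paper's ToolContextManager).
import inspect
from typing import Callable, Dict

PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def register_tool(fn: Callable) -> Dict:
    sig = inspect.signature(fn)
    props = {
        name: {"type": PY_TO_JSON.get(p.annotation, "string")}
        for name, p in sig.parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),  # natural-language form
        "parameters": {"type": "object",
                       "properties": props,
                       "required": list(props)},
    }

def web_search(query: str, max_results: int) -> str:
    """Search the web and return a result summary."""
    return f"{max_results} results for {query}"

schema = register_tool(web_search)
assert schema["parameters"]["properties"]["max_results"]["type"] == "integer"
```

The same introspection pass can feed all three formats the text mentions: the schema above for LLM interfaces, the docstring for documentation, and the typed properties for argument validation.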
Environment Context Protocol. ECP addresses the lack of unified interfaces in current agent systems by formalizing computational environments as first-class components with distinct observation and action spaces. Following an architectural pattern similar to TCP, it employs an EnvironmentContextManager to maintain state coherence and manage the contextual execution environments required by tools. ECP automatically discovers and registers environment-specific actions, converting them into standardized interfaces that agents can invoke via action toolkits. This design enables agents to operate across heterogeneous domains, such as browsers or file systems, without bespoke adaptations, while leveraging versioning and semantic retrieval to manage environment-level capabilities.
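The automatic action discovery ECP performs can be sketched as follows: methods marked as actions on an environment class are collected into a standardized toolkit that agents invoke uniformly. The `@action` decorator and class names are hypothetical illustrations.

```python
# Sketch of ECP-style action discovery (illustrative names, not the
# paper's EnvironmentContextManager).
from typing import Callable, Dict

def action(fn: Callable) -> Callable:
    fn._is_action = True  # mark the method as part of the action space
    return fn

class BrowserEnv:
    """Environment with an explicit observation space and discoverable actions."""
    def __init__(self) -> None:
        self.url = "about:blank"

    def observe(self) -> Dict[str, str]:
        return {"url": self.url}

    @action
    def goto(self, url: str) -> str:
        self.url = url
        return f"navigated to {url}"

def discover_actions(env: object) -> Dict[str, Callable]:
    """Collect marked methods into a standardized action toolkit."""
    return {
        name: getattr(env, name)
        for name in dir(env)
        if callable(getattr(env, name))
        and getattr(getattr(env, name), "_is_action", False)
    }

env = BrowserEnv()
toolkit = discover_actions(env)
toolkit["goto"]("https://example.com")
assert env.observe()["url"] == "https://example.com"
```

Because the toolkit is generated from the environment itself, swapping in a different environment (say, a file system with `read`/`write` actions) requires no bespoke agent-side adaptation.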
Agent Context Protocol. ACP establishes a unified framework for the registration, representation, and orchestration of autonomous agents, overcoming the poor interoperability and fragmented attribute definitions in existing multi-agent systems. It utilizes an AgentContextManager to maintain agent states and execution contexts, providing a foundation for persistent coordination across tasks and sessions. ACP captures semantically enriched metadata regarding agents’ roles, competencies, and objectives, and formalizes the modeling of complex inter-agent dynamics, including cooperative, competitive, and hierarchical configurations. By embedding structured contextual descriptions and maintaining relationship representations, ACP facilitates adaptive collaboration and systematic integration within the broader TEA ecosystem.
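A minimal sketch of the kind of semantically enriched registration record ACP maintains is shown below. The field names (`role`, `competencies`, `relations`) are hypothetical; the point is that roles, capabilities, and inter-agent relationships are structured metadata rather than free-form prompt text.

```python
# Sketch of ACP-style agent registration metadata (hypothetical fields).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentCard:
    agent_id: str
    role: str
    competencies: List[str]
    objective: str
    # peer agent id -> relation type: "cooperative" | "competitive" | "hierarchical"
    relations: Dict[str, str] = field(default_factory=dict)

registry: Dict[str, AgentCard] = {}

def register(card: AgentCard) -> None:
    registry[card.agent_id] = card

register(AgentCard("planner", "orchestrator", ["decompose", "route"],
                   "solve the user objective"))
register(AgentCard("browser", "web navigation", ["goto", "click"],
                   "execute web sub-tasks",
                   relations={"planner": "hierarchical"}))
assert registry["browser"].relations["planner"] == "hierarchical"
```

Structured relation records are what let the AgentContextManager reason about cooperative, competitive, and hierarchical configurations without re-parsing prompts.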
### 3.3 Protocol Transformations
While TCP, ECP, and ACP provide independent specifications for tools, environments, and agents, practical deployment requires seamless interoperability across these protocols. Well-defined transformation pathways are essential for enabling computational components to assume alternative roles and exchange contextual information in a principled manner. These transformations constitute the foundation for dynamic role reconfiguration, allowing components to flexibly adapt their functional scope in response to evolving task requirements and system constraints. We identify six fundamental categories of protocol transformations:
- Agent-to-Tool (A2T). Encapsulates an agent’s capabilities and reasoning into a standardized tool interface while preserving the agent’s contextual awareness. For example, a deep researcher workflow can be packaged as a general-purpose search tool.
- Tool-to-Agent (T2A). Treats tools as operational actuators by mapping an agent’s goals into parameterized tool invocations, aligning reasoning with tool constraints. For example, a data analysis agent may invoke SQL tools to query structured databases.
- Environment-to-Tool (E2T). Converts actions of environments into standardized tool interfaces, enabling agents to interact with environments through consistent tool calls. For example, browser actions such as Navigate and Click can be consolidated into a context-aware toolkit.
- Tool-to-Environment (T2E). Elevates a collection of tools into an environment abstraction where functions become actions within a coherent action space governed by shared state. For example, a development toolkit can be encapsulated as a programming environment for sequential code-edit-compile-debug workflows.
- Agent-to-Environment (A2E). Encapsulates an agent as an interactive environment by exposing its decision rules and state dynamics as an operational context for other agents. For example, a market agent can be represented as an environment that provides trading rules and dynamic responses for training.
- Environment-to-Agent (E2A). Embeds reasoning and adaptive decision-making into an environment’s dynamics, transforming it into an autonomous agent that can initiate behaviors and enforce constraints. For example, a game environment can be elevated into an opponent agent that adapts its strategy to the player’s actions.
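To make one of these pathways concrete, the sketch below shows an Agent-to-Tool (A2T) transformation: an agent's run loop is wrapped behind a plain tool entry so other components can invoke it like any registered tool. All names are illustrative, and the tool-entry shape mirrors the hypothetical registration sketches above rather than the actual protocol schema.

```python
# Sketch of an A2T transformation (illustrative names only).
from typing import Callable, Dict

class ResearchAgent:
    def run(self, objective: str) -> str:
        # Stand-in for a full deep-research workflow.
        return f"findings on: {objective}"

def agent_to_tool(agent: ResearchAgent) -> Dict:
    """Package the agent's reasoning loop as a tool entry."""
    return {
        "name": "deep_search",
        "description": "General-purpose search backed by a researcher agent.",
        "call": agent.run,  # the agent keeps its internal state and context
    }

tool = agent_to_tool(ResearchAgent())
assert tool["call"]("TEA protocol") == "findings on: TEA protocol"
```

The inverse pathways (T2A, E2T, and so on) follow the same pattern: a thin adapter re-exposes one protocol's interface through another's, with the component's internal state left intact.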
### 3.4 Self-Evolution Module
The Self-Evolution Module enables agents to continuously improve performance by optimizing system components during task execution. It wraps evolvable components, including prompts, tool/agent/environment/memory code, and successful execution solutions, as variables for iterative optimization. The module employs two primary methods: textgrad (Yuksekgonul et al., 2025) for gradient-based refinement and self-reflection for strategic analysis. Optimized components are automatically registered as new versions via the version manager, ensuring that subsequent tasks leverage improved capabilities while maintaining access to historical records for analysis and rollback.
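The evolve-and-version loop can be sketched as follows. The paper's actual optimizers are textgrad and self-reflection; here a trivial `refine` callback stands in for both, and the class name is an assumption.

```python
# Minimal sketch of wrapping a component as an evolvable variable whose
# updates are recorded as new versions (illustrative, not the paper's code).
from typing import Callable, List

class EvolvableVariable:
    def __init__(self, value: str):
        self.versions: List[str] = [value]  # full history kept for rollback

    @property
    def value(self) -> str:
        return self.versions[-1]            # latest version is the live one

    def evolve(self, feedback: str,
               refine: Callable[[str, str], str]) -> int:
        """Apply an optimizer to (value, feedback); register the result
        as a new version and return its version number."""
        self.versions.append(refine(self.value, feedback))
        return len(self.versions)

prompt = EvolvableVariable("You are a planner.")
prompt.evolve("be explicit about sub-task ordering",
              lambda v, fb: v + " Order sub-tasks explicitly.")
assert len(prompt.versions) == 2
assert "Order sub-tasks" in prompt.value
```

Subsequent tasks read `value` and therefore pick up the improved version automatically, while the retained history supports the analysis and rollback the section describes.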
<details>
<summary>x4.png Details</summary>

### Visual Description
Architecture diagram of AgentOrchestra built on the TEA protocol. At the top, a Planning Agent interprets user objectives, decomposes them into manageable sub-tasks, and assigns them to specialized sub-agents, with a planning tool (create, update, delete, mark) and feedback loops for objective shifts and unexpected errors; agent roles shown include Planner, Researcher, Browser, Analyzer, Generator, and Reporter. The middle layer is a specialized agent pool (Deep Researcher, Browser Use, Deep Analyzer, Tool Generator, and a reporting agent). A horizontal Agent Context Protocol (ACP) bar connects this pool to the lower layers: the Tool Context Protocol (TCP) covers general tools (Bash, Python, Midify, Web, Todo), agent tools (Searcher, Analyzer), MCP tools (local, remote), and environment tools; the Environment Context Protocol (ECP) covers Browser, GitHub&Git, and Computer environments, each with explicit rules and actions (e.g., goto/click/scroll, clone/commit/push, left_click/type/key). A central TEA triangle shows the six transformations (A2T, T2A, A2E, E2A, T2E, E2T). At the bottom sit the Basic Managers (Model, Memory, Prompt, Dynamic, Version, Tracer) and the Self-Evolution Module (TextGrad/Self-Reflection).
</details>
Figure 2: Architecture of AgentOrchestra implemented based on TEA protocol.
## 4 AgentOrchestra
AgentOrchestra is a concrete instantiation of the TEA Protocol, designed as a hierarchical multi-agent framework that integrates high-level planning with modular agent collaboration. As illustrated in Figure 2, AgentOrchestra features a central planning agent that decomposes complex objectives and delegates sub-tasks to a team of specialized sub-agents. This section outlines our agent design principles and the architecture of both planning and specialized sub-agents. Details can be found in Appendix D.
### 4.1 Agent Design Principles
Within the TEA Protocol framework, agents are autonomous components that follow a structured interaction model with six core components. i) Agent: Managed via the ACP for registration and coordination. ii) Environment: External context and resources managed by the ECP, exposing unified interfaces for observation and action. iii) Model: LLM reasoning engines abstracted by the Basic Managers for model-agnostic interoperability and dynamic switching. iv) Memory: Session-based persistence that records trajectories and extracts reusable insights. v) Observation: The current context, including tasks, environment states, execution history, and available resources (tools and sub-agents). vi) Action: TCP-managed, executed via parameterized tool calls, where one tool may support multiple actions.
This architectural design facilitates a continuous perception–interpretation–action cycle. The agent first perceives the current observation and retrieves relevant context from memory. It then interprets this information through the unified model interface to determine the optimal action. The action is executed within the managed environment, and the resulting state transitions and insights are recorded back into memory to refine subsequent reasoning cycles. This iterative loop continues until the task objectives are satisfied or a termination condition is reached. Further details are provided in Appendix D.1.
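The perception–interpretation–action cycle described above can be sketched as follows. The class and method names (`TeaAgent`, `step`) and the `"tool:argument"` decision format are illustrative assumptions, not AgentOrchestra's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TeaAgent:
    """Toy agent wiring together the components above: a unified model
    interface, session memory, and tool calls. Names are hypothetical."""
    model: Callable[[str], str]                  # iii) unified model interface
    tools: dict = field(default_factory=dict)    # vi) actions as tool calls
    memory: list = field(default_factory=list)   # iv) session trajectory

    def step(self, observation: str) -> str:
        # Perceive: combine the current observation with recent memory.
        context = " | ".join(self.memory[-3:])
        # Interpret: the model returns a "tool:argument" decision.
        decision = self.model(f"ctx={context} obs={observation}")
        name, _, arg = decision.partition(":")
        # Act: execute the parameterized tool call.
        result = self.tools[name](arg)
        # Record: write the state transition back into memory.
        self.memory.append(f"{decision} -> {result}")
        return result

# Toy run with a scripted "model" that always calls an echo tool.
agent = TeaAgent(model=lambda prompt: "echo:hello",
                 tools={"echo": lambda arg: arg.upper()})
print(agent.step("greet the user"))  # HELLO
```

In a real deployment the scripted lambda would be an LLM call and the loop would repeat until a termination condition, as described in the text.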
### 4.2 Planning Agent
The planning agent is the central orchestrator of AgentOrchestra. It interprets the user goal, decomposes it into sub-tasks, and dispatches them to specialized sub-agents or TCP tools via ACP-mediated communication while tracking global progress and consolidating intermediate feedback. To enable principled orchestration, it leverages long-term memory to guide resource selection and dynamically constructs a unified invocation interface, including resources produced through E2T and A2T transformations. Execution follows an iterative loop of interpretation, allocation, and action, with automatic replanning under environment shifts or execution failures. Session management and tracer-based logging provide auditability and support robust long-horizon task completion.
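The interpret–allocate–act loop with automatic replanning might look like the sketch below; the `decompose` callback, the sub-task dictionary shape, and the retry-only-failures policy are assumptions for illustration, not the paper's specification:

```python
def orchestrate(goal, decompose, sub_agents, max_rounds=3):
    """Hypothetical planner loop: interpret (decompose) the goal, allocate
    sub-tasks to sub-agents, act, and replan failed work each round."""
    plan = decompose(goal)                         # interpretation
    results = []
    for _ in range(max_rounds):
        failed = []
        for task in plan:
            agent = sub_agents[task["agent"]]      # allocation
            try:
                results.append(agent(task["input"]))  # action
            except Exception:
                failed.append(task)                # mark for replanning
        if not failed:
            return results
        plan = failed                              # replan only failed sub-tasks
    raise RuntimeError("goal not completed within replanning budget")

# Toy run: the sub-agent fails once, then succeeds on the replanned attempt.
attempts = {"n": 0}
def flaky_agent(x):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")
    return x + " done"

plan_fn = lambda goal: [{"agent": "researcher", "input": goal}]
print(orchestrate("find sources", plan_fn, {"researcher": flaky_agent}))
# ['find sources done']
```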
### 4.3 Specialized Sub-Agents
To address diverse real-world challenges, AgentOrchestra instantiates specialized sub-agents tailored to distinct task domains. These sub-agents are managed via the ACP and coordinate through the planning agent to execute complex workflows: i) Deep Researcher Agent: Specialized for comprehensive information gathering through multi-round research workflows. It performs parallel breadth-first searches across multiple engines and recursively issues follow-up queries until task objectives are satisfied, producing relevance-ranked, source-cited summaries. ii) Browser Use Agent: Provides automated, fine-grained web interaction by integrating both browser and computer environments under the ECP. It supports DOM-level and pixel-level operations (e.g., mouse movements), achieving unified control over interactive elements. iii) Deep Analyzer Agent: A workflow-oriented module designed for multi-step reasoning on heterogeneous multimodal data (e.g., text, PDFs, images, audio, video, or zip archives). It applies type-specific analysis strategies and iterative refinement to synthesize insights into coherent conclusions. iv) Tool Generator Agent: Facilitates intelligent tool evolution through the automated creation, retrieval, and systematic reuse of TCP-compliant tools. It employs semantic search to identify existing tools and initiates code synthesis to develop new capabilities when gaps are identified. v) Reporter Agent: Aggregates and harmonizes evidence collected by upstream agents (e.g., the Deep Researcher, Browser Use, and Deep Analyzer Agents), then composes structured Markdown with automatically deduplicated references and normalized URLs for consistent source attribution.
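The Tool Generator's retrieve-or-synthesize policy can be sketched as below; a naive substring match stands in for semantic search, and the registry class and its counters are illustrative names, not TCP's actual interface:

```python
class ToolRegistry:
    """Toy registry: reuse an existing tool when its description matches
    the request, otherwise 'synthesize' (register) a new one."""
    def __init__(self):
        self.tools = {}    # description -> callable
        self.reused = 0
        self.created = 0

    def get(self, request, synthesize):
        # Stand-in for semantic search: substring match on descriptions.
        for desc, fn in self.tools.items():
            if request in desc:
                self.reused += 1
                return fn
        fn = synthesize(request)   # code-synthesis step (mocked here)
        self.tools[request] = fn
        self.created += 1
        return fn

registry = ToolRegistry()
square = registry.get("square a number", lambda req: (lambda x: x * x))
assert square(4) == 16
registry.get("square a number", lambda req: (lambda x: x * x))  # hit: reused
print(registry.created, registry.reused)  # 1 1
```

A real implementation would embed tool descriptions and rank by vector similarity rather than substring match; the control flow (search, then synthesize on a miss) is the point of the sketch.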
## 5 Empirical Studies
This section presents our experimental setup and results, including benchmark evaluations, baseline comparisons, and comprehensive analysis. Additional examples are provided in Appendix F.
Experimental Settings. We evaluate our framework on three benchmarks: SimpleQA Wei et al. (2024), a 4,326-question factual-accuracy benchmark; GAIA Mialon et al. (2023), assessing real-world reasoning, multimodal processing, and tool use with 301 test and 165 validation questions; and Humanity’s Last Exam (HLE) Phan et al. (2025), a 2,500-question multimodal benchmark for human-level reasoning and general intelligence. We report score (pass@1), the proportion of questions for which the top prediction is fully correct. Specifically, the planning agent ($m{=}50$), deep researcher ($m{=}3$), tool generator ($m{=}10$), deep analyzer ($m{=}3$), and reporter are all built on gemini-3-flash-preview; the browser use agent employs gpt-4.1 ($m{=}5$) and computer-use-preview (4o) ($m{=}50$), where $m$ denotes the maximum number of steps.
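The reported score is simply the fraction of questions whose top prediction is fully correct. A minimal sketch follows; the whitespace/case normalization is an assumption for illustration, as each benchmark defines its own matching rules:

```python
def pass_at_1(predictions, answers):
    """Fraction of questions where the single top prediction matches the
    reference answer (normalized for case and whitespace; illustrative)."""
    norm = lambda s: " ".join(s.lower().split())
    correct = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return correct / len(answers)

# Two of three answers match after normalization.
print(pass_at_1(["Paris", "4", "blue"], ["paris", "5", "Blue"]))
```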
### 5.1 Performance across Benchmarks
<details>
<summary>x5.png Details</summary>

### Visual Description
A grouped bar chart titled "GAIA Test" compares 12 systems across three difficulty levels (Level1, Level2, Level3) and their Average. The y-axis ("Score") runs from 40 to 100 in steps of 10; the x-axis repeats the 12 system names within each of four color-coded groups (Level1: green, Level2: blue, Level3: purple, Average: orange), with each bar's score printed above it. The readable values are:

| System | Level1 | Level2 | Level3 | Average |
| --- | --- | --- | --- | --- |
| AgentOrchestrator | 98.7 | 85.5 | 81.8 | 88.7 |
| ToolOrchestrator | 96.0 | 84.0 | 87.8 | 87.4 |
| HALO | 96.0 | 85.8 | 68.8 | 85.7 |
| AReaL | 95.7 | 81.3 | 57.1 | 78.2 |
| Search-Omni | 90.7 | 77.4 | 65.3 | 77.7 |
| Search-Omni-7B | 92.4 | 79.0 | 60.8 | 76.9 |
| Defect-13B | 91.7 | 76.5 | 60.8 | 76.3 |
| ARIA | 80.4 | 72.2 | 55.1 | 69.3 |
| Llama-3-8B | 77.4 | 70.8 | 48.0 | 65.7 |
| o1-mini | 67.4 | 68.6 | 47.0 | 61.1 |
| o1-preview | ~65–70 (label unclear) | 67.8 | 46.8 | 59.7 |
| 2-shot-o1-mini | ~65–70 (label unclear) | 59.3 | 44.2 | 57.0 |

### Key Observations
1. `AgentOrchestrator`, `ToolOrchestrator`, and `HALO` consistently occupy the top three positions across all levels and on average.
2. For nearly every system, scores fall from Level1 to Level2 to Level3, confirming that the levels increase in difficulty; the decline is steepest, and the gap between top and bottom systems widest, on Level3.
3. `ToolOrchestrator` outperforms `AgentOrchestrator` on the hardest Level3 (87.8 vs. 81.8) despite trailing slightly on Level1 and on average, suggesting greater robustness on complex tasks.
4. Performance drops sharply from `Llama-3-8B` onward, with Level3 scores falling below 50.

### Interpretation
Orchestrated systems substantially outperform both other specialized agents and general-purpose models (the `o1` series, `Llama-3-8B`) on this benchmark. The steep Level3 falloff shows that the hardest tasks expose limitations in most current systems while highlighting the relative robustness of orchestration; the Average bars give a convenient single metric but mask the fact that performance is far from uniform across difficulty tiers.
</details>
Figure 3: GAIA Test Results.
GAIA. AgentOrchestra achieves state-of-the-art performance (89.04% avg.) by mitigating the curse of dimensionality and the semantic drift that arise in large-scale agentic planning. We attribute this success to three architectural properties enabled by the TEA Protocol. First, hierarchical decoupling of the action space reduces planning complexity: while baseline methods (e.g., ToolOrchestra, AWorld) must map goals onto a monolithic toolkit, our hierarchical routing decomposes the global task into locally tractable sub-problems, lowering cognitive entropy for the central orchestrator and preserving abstract reasoning over long horizons, even amid low-level sensorimotor noise (e.g., granular DOM events). Second, ECP formalizes epistemic environment boundaries: GAIA’s multi-domain tasks require temporal and cross-modal state coherence, and baselines often degrade during domain transitions, such as from browser retrieval to local Python analysis. By treating environments as first-class managed components, TEA preserves and propagates session-critical state (e.g., authentication tokens and transient file-system mutations) across agent boundaries, reducing contextual forgetting and enabling compositional generalization on challenging Level 2 and Level 3 scenarios. Third, AgentOrchestra supports recursive refinement of reasoning trajectories: when faced with complex problems, the Planning Agent evaluates intermediate insights and, when necessary, invokes the Tool Generator Agent to synthesize context-specific functionalities on the fly. This on-demand tool evolution bypasses the fixed-capability bottleneck of static agent components.
Table 2: Performance on GAIA Validation.
| Agents | Level 1 | Level 2 | Level 3 | Average |
| --- | --- | --- | --- | --- |
| HF ODR (o1) (HuggingFace, 2024) | 67.92 | 53.49 | 34.62 | 55.15 |
| OpenAI DR (OpenAI, 2024) | 74.29 | 69.06 | 47.60 | 67.36 |
| Manus (Shen and Yang, 2025) | 86.50 | 70.10 | 57.69 | 73.90 |
| Langfun (Google, 2024) | 86.79 | 76.74 | 57.69 | 76.97 |
| AWorld (Yu et al., 2025) | 88.68 | 77.91 | 53.85 | 77.58 |
| AgentOrchestra | 92.45 | 83.72 | 57.69 | 82.42 |
Table 3: Performance on SimpleQA and HLE.
| Model and Agent | SimpleQA | HLE |
| --- | --- | --- |
| *Models* | | |
| o3 (w/o tools) | 49.4 | 20.3 |
| claude-3.7-sonnet (w/o tools) | – | 8.9 |
| gemini-2.5-pro-preview-05-06 | 50.8 | 17.8 |
| *Agents* | | |
| OpenAI DR (OpenAI, 2024) | – | 26.6 |
| Perplexity DR (Perplexity, 2025) | 93.9 | 21.1 |
| AgentOrchestra | 95.3 | 37.46 |
SimpleQA. AgentOrchestra achieves SOTA performance (95.3% accuracy), significantly surpassing both monolithic LLMs (e.g., o3 at 49.4%) and specialized retrieval agents such as Perplexity Deep Research (93.9%). We attribute this improvement to the systematic reduction of epistemic uncertainty through our hierarchical verification pipeline. SimpleQA primarily targets short-form factuality, where hallucinations often arise from the model’s inability to reconcile conflicting web-based evidence or its tendency to rely on internal parametric memory. AgentOrchestra mitigates these issues by enforcing cross-agent consensus: the Planning Agent orchestrates a retrieve-verify-synthesize cycle in which the Deep Researcher performs multi-engine breadth-first searches while the Deep Analyzer evaluates evidence consistency across heterogeneous sources. By decoupling retrieval from analysis, the system avoids the "confirmation bias" inherent in single-agent architectures, where the same model both proposes and validates a hypothesis. Furthermore, integration with the Reporter Agent ensures traceable attribution, grounding every factual claim in a re-verified source, which effectively transforms the task from an open-domain generation problem into a structured evidence-synthesis process.
HLE. AgentOrchestra achieves 37.46% on the HLE benchmark, a substantial margin over leading baselines like o3 (20.3%) and Perplexity Deep Research (21.1%). This gain highlights the framework’s capacity for long-horizon analytical reasoning and adaptive capability expansion in expert-level domains. HLE demands more than simple retrieval; it requires synthesizing disparate, highly specialized knowledge. In this setting, the hierarchical structure enables strategic pruning of the hypothesis space, allowing the Planning Agent to maintain global objective coherence while delegating technical validation to specialized agents such as the Deep Analyzer. As a result, the final solution is both analytically rigorous and cross-verified against multimodal evidence, yielding robust performance on challenging expert-level tasks.
### 5.2 Ablation Studies
Table 4: Sub-agent effectiveness on the GAIA Test (P: Planning, R: Deep Researcher, B: Browser Use, A: Deep Analyzer, T: Tool Generator).
| P | R | B | A | T | Level 1 | Level 2 | Level 3 | Average | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | | 54.84 | 33.96 | 10.20 | 36.54 | – |
| ✓ | ✓ | | | | 86.02 | 47.17 | 34.69 | 57.14 | +56.40% |
| ✓ | ✓ | ✓ | | | 89.25 | 71.07 | 46.94 | 72.76 | +27.33% |
| ✓ | ✓ | ✓ | ✓ | | 91.40 | 77.36 | 61.22 | 79.07 | +8.67% |
| ✓ | ✓ | ✓ | ✓ | ✓ | 98.92 | 85.53 | 81.63 | 89.04 | +12.61% |
Effectiveness of the specialized sub-agents. Ablation studies on the GAIA Test demonstrate the synergistic effect inherent in our multi-agent coordination. Integrating coarse-grained exploratory retrieval (Researcher) with fine-grained operational interaction (Browser) nearly doubles performance (36.54% to 72.76%), showing that breadth of information and depth of interaction are mutually reinforcing. The Deep Analyzer’s 8.67% gain highlights the necessity of specialized reasoning pipelines for high-entropy multimodal tasks, while the Tool Generator’s 12.61% boost validates the efficacy of on-demand capability synthesis in overcoming the limitations of static, predefined toolsets. These results suggest that complex problem-solving emerges not just from individual agent strength, but from the structured delegation of specialized roles.
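The Improvement column in Table 4 is each configuration's relative gain over the preceding row's Average, and can be reproduced directly; the small deviations from the printed +56.40% and +27.33% are rounding artifacts of the published averages:

```python
averages = [36.54, 57.14, 72.76, 79.07, 89.04]  # Table 4 Average column, top to bottom

def relative_gains(scores):
    """Percent improvement of each configuration over the previous one."""
    return [round((b - a) / a * 100, 2) for a, b in zip(scores, scores[1:])]

print(relative_gains(averages))  # [56.38, 27.34, 8.67, 12.61]
```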
Efficiency analysis. AgentOrchestra’s operational efficiency is evaluated across varying task complexities. Simple tasks typically complete within 30 seconds using approximately 5k tokens, while medium-complexity tasks average 3 minutes (25k tokens). Complex multimodal or long-horizon scenarios require approximately 10 minutes and 100k tokens. Compared to monolithic baselines, our hierarchical architecture optimizes resource allocation, maintaining operational costs comparable to commercial research agents while delivering significant performance gains.
Effectiveness of the self-evolution module. The TEA Protocol enables self-optimization by treating system components as evolvable variables, helping bridge the gap between base model capacity and task requirements. Evaluations on the GPQA-Diamond and AIME benchmarks show that iterative refinement, including gradient-based (TextGrad) and symbolic (self-reflection) approaches, mitigates reasoning bottlenecks in foundation models. The improvement is exemplified by a 13.34-point gain on AIME25 for gpt-4.1 under self-reflection (from 20.00% to 33.34%), highlighting recursive trajectory refinement. Leveraging execution feedback via TEA’s versioning and tracer mechanisms, the system identifies and corrects logical inconsistencies in its planning. Overall, this shifts reasoning from one-shot inference to a managed optimization process, enabling AgentOrchestra to evolve problem-solving strategies for frontier-level tasks.
Table 5: Effectiveness of the self-evolution module. Direct means using the base model directly.
| Strategy | GPQA-Diamond | AIME24 | AIME25 |
| --- | --- | --- | --- |
| Base Model: gpt-4o | | | |
| Direct | 47.98% | 13.34% | 6.67% |
| w/ TextGrad | 54.04% | 10.00% | 10.00% |
| w/ Self-reflection | 55.05% | 20.00% | 6.67% |
| Base Model: gpt-4.1 | | | |
| Direct | 61.11% | 23.34% | 20.00% |
| w/ TextGrad | 65.15% | 26.67% | 23.34% |
| w/ Self-reflection | 68.18% | 33.34% | 33.34% |
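A generic self-reflection loop of the shape evaluated above can be sketched as follows; this is not the actual TextGrad API, and `solve`/`critique` stand in for LLM calls:

```python
def self_refine(solve, critique, task, max_iters=3):
    """Generic self-reflection sketch: generate a solution, obtain textual
    feedback on it, and regenerate until the critic finds no issue or the
    iteration budget is exhausted. Names are illustrative."""
    feedback = None
    solution = None
    for _ in range(max_iters):
        solution = solve(task, feedback)     # (re)generate given feedback
        feedback = critique(task, solution)  # textual "gradient" on the output
        if feedback is None:                 # critic is satisfied
            return solution
    return solution

# Toy critic: rejects answers until they contain a justification.
sol = self_refine(
    solve=lambda t, fb: "42" if fb is None else "42 because 6*7",
    critique=lambda t, s: None if "because" in s else "justify your answer",
    task="what is 6*7?",
)
print(sol)  # 42 because 6*7
```

TextGrad backpropagates such textual feedback through a computation graph of prompts and outputs; the loop above captures only the outermost generate-critique-regenerate cycle.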
Regarding tool evolution, the tool generator agent demonstrates efficient creation and reuse capabilities within the TCP framework. During our evaluation, the agent autonomously generated over 50 specialized tools, achieving a 30% reuse rate across subsequent tasks. This indicates an effective balance between tool specialization and generalization, ensuring that the system’s capabilities expand adaptively while maintaining resource efficiency.
## 6 Conclusion
We introduced the TEA Protocol, unifying environments, agents, and tools to address fragmentation in existing standards. Building on this, we presented AgentOrchestra, a hierarchical multi-agent framework with specialized sub-agents for planning, research, web interaction, and multimodal analysis. Evaluations on three benchmarks show that AgentOrchestra achieves SOTA performance and scalable orchestration through dynamic resource transformations. Future work will extend TEA to support dynamic role allocation and autonomous agent reconfiguration. Building on tool and solution evolution, we will pursue deeper self-evolution, such as using RL to optimize agent components and decision policies without fine-tuning LLM parameters. We also aim to expand these mechanisms to agent structures and communication protocols, while enhancing multimodal capabilities for fine-grained real-time video analysis.
## 7 Limitations
### 7.1 Limitations of TEA Protocol and AgentOrchestra
Despite its strengths in orchestrating multi-agent systems, AgentOrchestra has several limitations that provide directions for future research:
First, System Complexity and Learning Curve. The TEA protocol introduces a structured abstraction layer for tools, environments, and agents to ensure interoperability. However, this structure may present a steeper learning curve for developers compared to simpler, ad-hoc scripting methods. To address this, we will provide extensive documentation, interactive tutorials, and a variety of pre-configured templates to simplify the onboarding process.
Second, Communication and Execution Overhead. Standardizing interactions through a formal protocol can introduce marginal computational and communication overhead, potentially increasing latency in real-time applications. We plan to optimize the serialization protocols and explore asynchronous execution models to minimize these effects in future versions.
Third, Dependence on Underlying Model Capabilities. The effectiveness of the orchestration is inherently limited by the reasoning and instruction-following performance of the foundation LLMs used. While TEA provides a robust framework, it cannot fully compensate for failures caused by model hallucinations or poor tool-use logic. Future work will focus on developing model-agnostic error recovery strategies and more sophisticated validation layers to enhance system-wide resilience.
### 7.2 Potential Risks
While AgentOrchestra and the TEA protocol aim to enhance multi-agent productivity, their capability to interact with local environments and web browsers introduces certain ethical and security risks.
One primary concern is the Misuse for Malicious Automation. The framework’s flexibility in controlling browser sessions and executing terminal commands could be repurposed to develop unauthorized "plugins" or "cheats" for online platforms, leading to unfair advantages or automated fraud. Furthermore, there are significant Privacy and Security Risks associated with granting autonomous agents access to personal data or sensitive system resources. If not properly sandboxed or governed by strict security policies, an agent could inadvertently leak private information or perform harmful, irreversible system actions. To mitigate these risks, we emphasize that AgentOrchestra should be used within isolated, monitored environments, and we advocate for the integration of robust human-in-the-loop verification mechanisms and strict access control policies in any real-world deployment.
## References
- Anthropic (2024a) Introducing Computer Use, a New Claude 3.5 Sonnet, and Claude 3.5 Haiku. https://www.anthropic.com/news/3-5-models-and-computer-use. Accessed: 2025-05-13.
- Anthropic (2024b) Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol.
- Anthropic (2025) Equipping Agents for the Real World with Agent Skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills.
- K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, et al. (2025) π0.5: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054.
- G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- G. DeepMind (2024) Gemini Deep Research. https://gemini.google/overview/deep-research/?hl=en.
- A. Ehtesham, A. Singh, G. K. Gupta, and S. Kumar (2025) A Survey of Agent Interoperability Protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP). arXiv preprint arXiv:2505.02279.
- Google (2024) LangFun Agent. https://github.com/google/langfun.
- Google (2025) Announcing the Agent2Agent Protocol (A2A). https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/.
- S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al. (2023) MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. arXiv preprint arXiv:2308.00352.
- HuggingFace (2024) Open-source DeepResearch – Freeing Our Search Agents. https://huggingface.co/blog/open-deep-research.
- P. Li, X. Zou, Z. Wu, R. Li, S. Xing, H. Zheng, Z. Hu, Y. Wang, H. Li, Q. Yuan, et al. (2025) SafeFlow: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems. arXiv preprint arXiv:2506.07564.
- G. Liang and Q. Tong (2025) LLM-Powered AI Agent Systems and Their Applications in Industry. arXiv preprint arXiv:2505.16120.
- X. Liang, J. Xiang, Z. Yu, J. Zhang, S. Hong, S. Fan, and X. Tang (2025) OpenManus: An Open-Source Framework for Building General AI Agents. Zenodo.
- G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom (2023) GAIA: A Benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983.
- M. Müller and G. Žunič (2024) Browser Use: Enable AI to Control Your Browser.
- OpenAI (2023) Function Calling. https://platform.openai.com/docs/guides/function-calling.
- OpenAI (2024) Introducing Deep Research. https://openai.com/index/introducing-deep-research.
- OpenAI (2025a) Context-Free Grammar. https://platform.openai.com/docs/guides/function-calling#page-top.
- OpenAI (2025b) Introducing Operator. https://openai.com/blog/operator.
- Perplexity (2025) Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research.
- L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. (2025) Humanity’s Last Exam. arXiv preprint arXiv:2501.14249.
- Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, et al. (2025) UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326.
- J. Qiu, X. Qi, T. Zhang, X. Juan, J. Guo, Y. Lu, Y. Wang, Z. Yao, Q. Ren, X. Jiang, X. Zhou, D. Liu, L. Yang, Y. Wu, K. Huang, S. Liu, H. Wang, and M. Wang (2025) Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution. arXiv preprint arXiv:2505.20286.
- A. Roucher, A. V. del Moral, T. Wolf, L. von Werra, and E. Kaunismäki (2025) smolagents: A Smol Library to Build Great Agentic Systems. https://github.com/huggingface/smolagents.
- M. Shen and Q. Yang (2025) From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent. arXiv preprint arXiv:2505.02024.
- W. Tan, W. Zhang, X. Xu, H. Xia, Z. Ding, B. Li, B. Zhou, J. Yue, J. Jiang, Y. Li, et al. (2024) Cradle: Empowering Foundation Agents toward General Computer Control. arXiv preprint arXiv:2403.03186.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291.
- X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024a) Executable Code Actions Elicit Better LLM Agents. arXiv preprint arXiv:2402.01030.
- X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. (2024b) OpenHands: An Open Platform for AI Software Developers as Generalist Agents. In The Thirteenth International Conference on Learning Representations.
- J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus (2024) Measuring Short-Form Factuality in Large Language Models. arXiv preprint arXiv:2411.04368.
- G. Wölflein, D. Ferber, D. Truhn, O. Arandjelović, and J. N. Kather (2025) LLM Agents Making Agent Tools. arXiv preprint arXiv:2502.11705.
- xAI (2025) Grok 3 Beta — The Age of Reasoning Agents. https://x.ai/news/grok-3.
- C. Yu, S. Lu, C. Zhuang, D. Wang, Q. Wu, Z. Li, R. Gan, C. Wang, S. Hou, G. Huang, W. Yan, L. Hong, A. Xue, Y. Wang, J. Gu, D. Tsai, and T. Lin (2025) AWorld: Orchestrating the Training Recipe for Agentic AI. arXiv preprint arXiv:2508.20404.
- M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025) Optimizing Generative AI by Backpropagating Language Model Feedback. Nature 639(8055), pp. 609–616.
## Appendix A Comprehensive Motivation for TEA Protocol
This section provides a comprehensive motivation for the TEA Protocol by examining the fundamental relationships and transformations between agents, environments, and tools in multi-agent systems. The discussion is organized into two main parts: first, we explore the conceptual relationships between agents, environments, and tools, examining how these three fundamental components interact and complement each other in modern AI systems; second, we analyze why transformation relationships between these components are necessary, demonstrating the need for their conversion and integration through the TEA Protocol to create a unified, flexible framework for general-purpose task solving.
### A.1 Conceptual Relationships
#### A.1.1 Environment
The environment constitutes one of the fundamental components of multi-agent systems, providing the external stage upon which agents perceive, act, and accomplish tasks. Within the context of the TEA Protocol, highlighting the role of environments is crucial, since environments not only define the operational boundaries of agents but also exhibit complex structural and evolutionary properties. In what follows, we outline the motivation for explicitly modeling environments in the TEA framework from several perspectives.
Classification of environments. From a broad perspective, environments can be divided into two categories: the real world and the virtual world. The real world is concrete and directly perceivable by humans, such as kitchens, offices, or factories. By contrast, the virtual world cannot be directly perceived or objectively described by humans, including domains such as the network world, simulation platforms, and game worlds. Importantly, these two types of environments are not independent. Rather, they are tightly coupled through physical carriers, such as computers, displays, keyboards, mice, and sensors, which act as mediators that enable the bidirectional flow of information between the real and virtual domains. Hence, environments should be regarded not as isolated domains but as interdependent layers connected through mediating carriers.
Nested and expandable properties. Environments are inherently nested and expandable. For example, when an individual is situated in a kitchen, their observable range and available tools are restricted to kitchen-related objects such as faucets, knives, and microwaves, all governed by the local rules of that sub-environment. When the activity range extends to the living room, new objects such as televisions, remote controls, and chairs become accessible, while the kitchen remains embedded as a sub-environment within a broader space. Furthermore, environments can interact with one another, as when a bottle of milk is taken from the kitchen to the living room. This demonstrates that enlarged environments can be conceptualized not merely as simple unions, but rather as structured integrations of the state and action spaces of smaller constituent environments, where local rules and affordances are preserved while new forms of interaction emerge from their composition.
Relationship with state–action spaces. In reinforcement learning, environments are formalized in terms of state and action spaces. The state space comprises the set of possible environmental states, represented in modalities such as numerical values, text, images, or video. The action space denotes the set of operations available to agents, generally divided into continuous and discrete spaces. Real and virtual environments are naturally continuous, but discrete abstractions are often extracted for the sake of tractability, forming the basis of most reinforcement learning systems. However, this discretization constrains the richness of interaction. In contrast, large language models (LLMs) enable a new paradigm: instead of selecting from a discrete set, LLMs can generate natural language descriptions that encode complex action sequences. These outputs can be understood as an intermediate representation between continuous and discrete action spaces, richer and more expressive than discrete actions, yet still mappable to concrete operations in continuous environments. To realize this mapping, intermediate actions are required as bridges. For instance, the natural language command “boil water” can be decomposed into executable steps such as turning on the kettle, filling it with water, powering it on, and waiting until boiling. This property indicates that LLM-driven interaction expands the definition of action representations and broadens the scope of environmental engagement.
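To make this mapping concrete, the "boil water" decomposition above can be sketched as a lookup from a language-level command to a sequence of primitive environment actions. This is a minimal illustrative sketch; all function and variable names are hypothetical, not part of the TEA specification:

```python
# Minimal sketch (hypothetical names): grounding a high-level,
# language-level action into discrete primitive environment actions.
from typing import Callable, Dict, List

# Primitive actions in the (discrete) environment action space.
def turn_on_tap() -> str: return "tap on"
def fill_kettle() -> str: return "kettle filled"
def power_on_kettle() -> str: return "kettle heating"
def wait_until_boiling() -> str: return "water boiled"

# Intermediate representation: natural-language command -> executable steps.
ACTION_LIBRARY: Dict[str, List[Callable[[], str]]] = {
    "boil water": [turn_on_tap, fill_kettle, power_on_kettle, wait_until_boiling],
}

def execute(command: str) -> List[str]:
    """Ground a language-level action into primitive operations."""
    steps = ACTION_LIBRARY.get(command)
    if steps is None:
        raise ValueError(f"no grounding known for: {command!r}")
    return [step() for step in steps]

print(execute("boil water"))
```

The library acts as the "bridge" of intermediate actions described above: the command itself is richer than any single discrete action, yet each entry still maps deterministically onto concrete operations.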
Mediation and interaction. The notion of mediation highlights that environments are not static backdrops but relative constructs whose boundaries depend on available carriers and interfaces. In hybrid physical–virtual systems, for example, Internet-of-Things (IoT) devices serve as mediators: a smart refrigerator in the physical world can be controlled through a mobile application in the virtual world, while the application itself is subject to network protocols. Consequently, the definition of an environment is dynamic and conditioned by interactional means. In the TEA Protocol, this mediation must be explicitly modeled, since it determines accessibility and interoperability across environments.
Toward intelligent environments. Traditionally, environments are passive components that provide states and respond to actions. However, as embedded simulators, interfaces, and actuators grow more sophisticated, environments may gradually acquire semi-agentic properties. For instance, a smart home environment may not only respond to the low-level command “turn on the light” but also understand and execute a high-level instruction such as “create a comfortable atmosphere for reading,” by autonomously adjusting lighting, curtains, and background music. This trend suggests that environments are evolving from passive contexts into adaptive and cooperative components.
In conclusion, the environment should not be regarded as a passive backdrop for agent activity, but as a dynamic and evolving component that fundamentally shapes the scope and feasibility of interaction. Its dual nature across real and virtual domains, its nested and compositional structure, and its formalization through state–action spaces all demonstrate that environments provide both the constraints and the affordances within which agents operate. At the same time, the rise of LLM-based agents introduces new forms of action representation that require environments to support more flexible, language-driven interfaces. Looking ahead, as environments increasingly incorporate adaptive and semi-agentic features, their role in task execution will only become more central. Within the TEA Protocol, this motivates treating environments as a co-equal pillar alongside agents and tools, ensuring that general-purpose task solving remains both grounded in environmental constraints and empowered by environmental possibilities.
#### A.1.2 Agent
Within the TEA Protocol, the motivation for treating agents as a core component alongside environments and tools extends beyond mere terminological convenience. Agents represent the indispensable connective tissue between the generative capabilities of LLMs, the operational affordances of tools, and the structural dynamics of environments. While environments provide the stage on which tasks unfold and tools extend the range of possible actions, it is agents that unify perception, reasoning, and execution into coherent task-solving processes. Without explicitly recognizing agents as an independent pillar, the TEA Protocol would lack a systematic way to explain how abstract linguistic outputs can be transformed into grounded operations, how tools can be selected and orchestrated, and how autonomy, memory, and adaptivity emerge in multi-agent systems. The following dimensions illustrate why agents must be elevated to a core component of the framework.
Necessity of environment interaction. Unlike large language models (LLMs), which only produce textual descriptions that require conversion into executable actions, agents are fundamentally characterized by their ability to directly interact with environments. While LLMs can generate detailed plans, instructions, or hypotheses, such outputs remain inert unless they are translated into concrete operations that affect the state of an environment. This gap between symbolic reasoning and actionable execution highlights the necessity of an intermediate entity capable of grounding abstract instructions into domain-specific actions. Agents fulfill precisely this role: they map language-level reasoning to executable steps, whether in physical settings, such as controlling robotic arms or sensors, or in virtual contexts, such as interacting with databases, APIs, or software systems.
By serving as this mapping layer, agents enable the closure of full task loops, where perception leads to reasoning, reasoning produces plans, and plans culminate in actions that in turn modify the environment. Without explicitly modeling agents, the process would remain incomplete, as LLMs alone cannot guarantee the translation of reasoning into operational change. Within the TEA Protocol, this necessity justifies the elevation of agents to a core component: they provide the indispensable interface that connects the generative capacities of LLMs with the affordances and constraints of environments, ensuring that tasks are not only conceived but also carried through to completion.
The decisive role of non-internalizable tools. The fundamental distinction between LLMs and agents lies in whether they can effectively employ tools that cannot be internalized into model parameters. Some tools can indeed be absorbed into LLMs, particularly those whose logic can be fully simulated in symbolic space, whose inputs and outputs are representable in language or code, and whose patterns fall within the training distribution (for example, mathematical reasoning, structured text formatting, code generation, and debugging). For example, early LLMs struggled with JSON output formatting and code reasoning, often requiring external correction or checking tools, but reinforcement learning (RL) and supervised fine-tuning (SFT) have progressively enabled such capabilities to be internalized.
In contrast, many tools remain non-internalizable because they are intrinsically tied to environmental properties. These include tools that depend on physical devices such as keyboards, mice, and robotic arms, external infrastructures such as databases and APIs, or proprietary software governed by rigid protocols. Two recent approaches further illustrate this limitation. Vision-language-action (VLA) (Black et al., 2025) models map perceptual inputs directly into actions, which may appear to bypass intermediate symbolic descriptions, yet the resulting actions must still be aligned with the discrete action spaces of environments. This alignment represents not a fundamental internalization but a compromise, adapting model outputs to the constraints of environmental action structures. Similarly, the upgraded function calling mechanism introduced after GPT-5, which incorporates context-free grammar (CFG) (OpenAI, 2025a), allows LLMs to output structured and rule-based actions that conform to external system requirements. However, this remains a syntactic constraint on model outputs, effectively providing a standardized interface to external systems rather than a truly internalized ability of the model.
Agents therefore play a decisive role in mediating this boundary. They allow LLMs to internalize symbolic tools, thereby enhancing reasoning and self-correction, while also orchestrating access to non-internalizable tools through external mechanisms. This dual pathway ensures that LLMs are not confined to their parameterized capabilities alone but can extend into broader operational domains. In this way, agents transform the tension between internalizable and non-internalizable tools from a limitation into an opportunity, enabling robust problem solving in multimodal, embodied, and real-world contexts.
Memory and learning extension. Another crucial motivation for agents lies in their capacity to overcome the intrinsic memory limitations of LLMs. Due to restricted context windows, LLMs struggle to maintain continuity across extended interactions or to accumulate knowledge over multiple sessions. Agents address this shortcoming by incorporating external memory systems capable of storing, retrieving, and contextualizing past experiences. Such systems simulate long-term memory and enable experiential learning, allowing agents to refine strategies based on historical outcomes rather than treating each interaction as isolated. However, in the TEA Protocol, memory is not defined as a core protocol component but is instead positioned at the infrastructure layer. This design choice reflects the anticipation that future LLMs may gradually internalize memory mechanisms into their parameters, thereby reducing or even eliminating the need for external memory systems. In other words, while memory expansion is indispensable for today’s agents, it may represent a transitional solution rather than a permanent defining element of agency.
Bridging virtual and external worlds. It has been suggested that LLMs encode within their parameters a kind of “virtual world,” enabling them to simulate reasoning and predict outcomes internally. However, without an external interface, such simulations remain trapped in closed loops of self-referential inference, disconnected from the contingencies of real-world environments. Agents play a critical role in bridging this gap: they translate the abstract reasoning of LLMs into concrete actions, validate outcomes against environmental feedback, and close the loop between perception, reasoning, and execution. This bridging function transforms LLMs from purely linguistic engines into operationally grounded components whose outputs can be tested, refined, and extended within real or simulated environments.
Autonomy and goal-directedness. Beyond reactivity, agents are motivated by their capacity for autonomy. While LLMs typically operate in a reactive fashion, producing outputs in response to explicit prompts, agents can adopt proactive behaviors. They are capable of formulating subgoals, planning action sequences, and dynamically adapting strategies in light of environmental changes or task progress. This goal-directedness is what elevates agents from passive tools into active participants in problem solving. Autonomy ensures that agents are not merely executing instructions but are able to pursue objectives, adjust course when facing uncertainty, and coordinate with other agents. Such properties are essential for multi-agent collaboration and for tackling open-ended, general-purpose tasks that require initiative as well as adaptability.
Taken together, these motivations highlight why agents must be modeled as a core pillar of the TEA Protocol. Environments provide the stage for interaction, tools expand the operational scope, but it is agents that integrate reasoning, memory, tool usage, and autonomy into cohesive systems of action. By serving as mediators between LLMs and their environments, agents ensure that abstract reasoning is translated into grounded execution, enabling robust and scalable task solving across domains. In this sense, agents represent the crucial entity that transforms language models from passive predictors into active problem solvers within a unified multi-agent framework.
#### A.1.3 Tool
Within the TEA Protocol, the decision to treat tools as a core component alongside environments and agents extends far beyond a matter of convenience in terminology. Tools represent the crucial mediating constructs that encapsulate and operationalize the action spaces of environments, while simultaneously serving as the primary extension layer of agent capabilities. Environments provide the structural stage on which interactions occur, and agents embody the reasoning and decision-making mechanisms that drive behavior, but it is through tools that such reasoning becomes executable and scalable. Without tools, agents would be confined to abstract planning or primitive environmental actions, and environments would remain underutilized as passive backdrops rather than dynamic arenas of transformation.
Moreover, tools play a unique role in bridging symbolic reasoning and concrete execution, providing the abstraction layers necessary to decompose complex tasks into manageable units, and enabling cross-domain transfer through their modularity and portability. They also reveal the shifting boundary between what can be internalized into an agent’s parameters and what must remain external, highlighting the evolving interplay between intelligence and embodiment. In this sense, tools are not merely auxiliary aids but indispensable pillars that shape the architecture of multi-agent systems. The following dimensions illustrate the motivations for elevating tools to a core component of the TEA.
Extending the operational boundary. The primary function of tools is to expand the operational scope of agents beyond what is directly encoded in model parameters or supported by immediate environment interactions. Environments by themselves typically offer only primitive actions, and LLMs by themselves are limited to symbolic reasoning. Tools bridge this gap by furnishing additional pathways for action, allowing agents to manipulate physical artifacts or virtual systems in ways that exceed the direct expressive capacity of the model. From physical devices such as hammers, keyboards, and robotic arms to virtual infrastructures such as databases, APIs, and code execution engines, tools multiply the modes through which agents can influence their environments. Without tools, agents would be confined to intrinsic reasoning and the primitive action space of environments, leaving them incapable of executing tasks that require domain-specific operations. With tools, however, complex objectives can be decomposed into modular operations that are both tractable and reusable. This decomposition makes problem solving significantly more efficient, while also enhancing adaptability across domains. In this way, tools act as multipliers of agency, transforming abstract reasoning into a wider range of tangible interventions.
Hierarchy and abstraction. Tools are not flat or uniform components but exhibit a hierarchical and abstract structure. At the lowest level, tools correspond to atomic environmental actions, such as “clicking a button” or “moving one step.” These atomic units can then be combined into higher-level compound tools such as “opening a file” or “conducting a search.” At an even higher level, compound tools may evolve into strategy-like constructs, such as “writing a report,” “planning a trip,” or “completing a financial transaction.” Each level builds upon the previous, creating a hierarchy of reusable capabilities. This hierarchical structure is not only efficient but also central to interpretability. Higher-level tools inherently carry semantic labels that communicate their function, which in turn makes agent behavior more transparent to human observers and more predictable to other agents. Such abstraction layers reduce the cognitive and computational load on the agent when planning, since invoking a high-level tool can encapsulate dozens or hundreds of low-level steps. Moreover, in multi-agent systems, the semantic richness of high-level tools serves as a lingua franca, facilitating coordination and collaboration.
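The atomic-to-compound-to-strategy hierarchy above can be sketched as nested composition, where each tool carries a semantic label and higher levels reuse lower ones. The classes and step names here are hypothetical, illustrative stand-ins:

```python
# Sketch (hypothetical names) of the tool hierarchy: atomic actions
# compose into compound tools, which compose into strategy-level tools.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Tool:
    name: str                       # semantic label, readable by humans and agents
    steps: List[Callable] = field(default_factory=list)

    def run(self) -> List[str]:
        out: List[str] = []
        for step in self.steps:
            result = step()
            # A step may be an atomic action (str) or a sub-tool run (list).
            out.extend(result if isinstance(result, list) else [result])
        return out

# Atomic tools: correspond to primitive environment actions.
click_search_box = lambda: "clicked search box"
type_query = lambda: "typed query"
press_enter = lambda: "pressed enter"

# Compound tool built from atomic actions.
search = Tool("conduct a search", [click_search_box, type_query, press_enter])

# Strategy-level tool built from compound tools.
write_report = Tool("write a report", [search.run, lambda: "drafted summary"])

print(write_report.run())
```

Invoking `write_report` encapsulates every low-level step behind one semantically labeled call, which is exactly what reduces planning load and makes behavior legible to other agents.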
Boundary between tools and agent capabilities. The relationship between tools and agents is dynamic rather than static. As LLM reasoning and learning capabilities improve, certain tools can be gradually internalized into model parameters, effectively transforming into latent agent abilities. Examples include logical inference, grammar correction, structured text formatting, and code generation, which once required external support but have increasingly been subsumed into the model’s intrinsic skills. In this sense, the boundary between what is a “tool” and what is an “ability” is fluid and shaped by the trajectory of model development. By contrast, many tools remain non-internalizable because they are tightly coupled with environmental properties or external infrastructures. These include robotic arm manipulation, database queries, API interactions, and other operations that inherently depend on external systems or physical substrates. This duality creates a layered conception of agency: a “core capability layer” composed of skills internalized within the model, and an “extended layer” realized through external tool use. The shifting line between these two layers reflects the ongoing negotiation between intelligence and embodiment, highlighting why tools must be explicitly recognized as a structural component.
Evolution and portability. Tools are not static constructs but evolve alongside environments and agent requirements. In programming contexts, for instance, an initial tool may simply execute code. Over time, as demands increase, this basic function evolves into more advanced utilities such as “static code analysis,” “automated test generation,” and “continuous deployment.” A similar trajectory occurs in other domains, where rudimentary tools gradually give rise to sophisticated pipelines capable of handling more complex and specialized tasks. In addition to evolution, tools are inherently portable. A well-designed summarization tool, for example, can be reused across very different contexts, from condensing news articles to producing academic literature reviews. This reusability makes tools a natural vehicle for cross-domain generalization, enabling knowledge and functionality to transfer without retraining the underlying model. For these reasons, the TEA Protocol emphasizes modularization and standardization of tools, ensuring that they can evolve flexibly while maintaining interoperability across agents and environments.
Toward intelligent tools. Traditional tools are passive, executing predefined functions only when invoked by an agent. They wait for explicit instructions and do not adapt to context or anticipate needs. However, the trajectory of tool development points toward increasing intelligence, where tools exhibit perception, analysis, and even limited decision-making capabilities. For example, an advanced debugging tool may not only check code upon request but also proactively scan for hidden vulnerabilities, propose optimizations, and even prioritize issues based on estimated risk. Such capabilities blur the line between tools and agents, effectively creating semi-agentic components. Intelligent tools can share responsibility for decision making, reduce the supervisory burden on agents, and participate in distributed problem-solving processes. In this way, tools transition from being passive executors to collaborative partners, altering the topology of multi-agent systems and reshaping the balance between reasoning and execution. Recognizing this trend is critical for designing flexible architectures, as it ensures that the TEA Protocol remains relevant in scenarios where tools are no longer inert extensions but active contributors to system intelligence.
In summary, tools serve as both encapsulations of environmental action spaces and as extensions of agent capabilities. They reduce task complexity through hierarchical abstraction, extend applicability through the balance of internalization and externalization, and foster scalability through evolution, portability, and intelligent design. By transforming the interaction between environments and agents into a modular and expandable architecture, tools anchor the adaptability and generality of multi-agent systems. For these reasons, the TEA Protocol must model tools as a core pillar, providing standardized interfaces that ensure flexible invocation and sharing across contexts, thereby supporting the overarching goal of general-purpose task solving.
### A.2 Transformation Relationships
While agents, environments, and tools are modeled as distinct pillars within the TEA Protocol, their boundaries are not fixed but fluid. Practical systems often demand that one entity temporarily assume the role of another in order to achieve modularity, scalability, and seamless collaboration. These transformation relationships are therefore indispensable, as they provide the mechanisms by which reasoning can be encapsulated into standardized functions, tools can be elevated into autonomous actors, and environments can acquire adaptive properties. In what follows, we examine the motivations for such transformations, beginning with the bidirectional conversions between agents and tools.
Agent-to-Tool (A2T). The motivation for the A2T transformation lies in compressing the complex reasoning and interaction capabilities of agents into reusable tool interfaces. Instead of remaining as fully autonomous components, some agents can be abstracted into functional modules, thereby enhancing modularity, interoperability, and scalability within multi-agent systems. This transformation can be explained from three perspectives:
- Modularization and encapsulation of complex autonomous systems. Although an agent possesses the complete perception–reasoning–execution chain, a single autonomous agent is often too complex to be directly reused in large-scale systems. Through A2T transformation, the internal logic of the agent is “folded” into a black-box tool interface, whose external manifestation is reduced to a clear input and output. In this way, it no longer exists as an “independent autonomous entity,” but as a “functional module” that can provide services to other agents or workflows. This encapsulation emphasizes the reduction of collaboration complexity, enabling higher-level systems to focus solely on results without interfering in or interpreting the agent’s internal reasoning process.
- Difference in role semantics: autonomous entity vs. functional unit. In its role as an agent, an entity must perceive its environment, set goals, and dynamically adjust strategies; in its role as a tool, it merely performs a specified function when invoked. In many multi-agent scenarios, it is unnecessary for all agents to maintain high degrees of autonomy, as this would create excessive interaction overhead and conflict management. Downgrading certain agents into tools (A2T) means relinquishing their goal-setting and decision-making functions while retaining only their reusable capabilities. This role shift ensures that the system contains both “autonomous cores” and “functional components,” thereby forming a layered structure of collaboration.
- Enhancing composability and ecological reusability. Once encapsulated as a tool, an agent can be reused across diverse systems and contexts like a modular building block. For instance, a “deep research agent” operates autonomously by dynamically planning search strategies, iteratively analyzing data, and summarizing insights. After A2T encapsulation, however, it becomes a “research tool” that simply receives a query request and returns results, ready for invocation by higher-level agents. This transformation greatly enhances interoperability and composability, enabling agents to be reused in different workflows without incurring integration costs due to their autonomous identity.
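The A2T encapsulation described above can be sketched as a wrapper that hides an agent's internal perception-reasoning-execution loop behind a plain input-to-output callable. The agent class and its methods are hypothetical stand-ins, not AgentOrchestra code:

```python
# Sketch (hypothetical names) of A2T: an autonomous agent is "folded"
# into a black-box tool interface with a clear input/output contract.

class DeepResearchAgent:
    """An autonomous agent with internal planning (stubbed here)."""
    def plan(self, query: str):
        return [f"search: {query}", f"summarize: {query}"]

    def step(self, action: str) -> str:
        return f"done({action})"

    def solve(self, query: str) -> str:
        # Full perception-reasoning-execution loop, invisible to callers.
        return "; ".join(self.step(a) for a in self.plan(query))

def as_tool(agent: DeepResearchAgent):
    """A2T: expose the agent as a stateless callable; autonomy is hidden."""
    def research_tool(query: str) -> str:
        return agent.solve(query)
    return research_tool

tool = as_tool(DeepResearchAgent())
print(tool("TEA protocol"))
```

Higher-level agents see only `research_tool(query) -> result`; the wrapped agent's goal-setting and strategy adjustment never leak into the collaboration interface.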
Tool-to-Agent (T2A). Within the TEA Protocol, the essence of T2A transformation is to incorporate tools into the callable interface layer of agents, making them the “operational actuators” through which abstract plans are executed in real environments. Agents are primarily responsible for setting goals and performing high-level reasoning, while tools handle concrete operations and interactions with environments. This division of labor not only optimizes system architecture but also ensures that complex tasks can be accomplished through layered collaboration. The necessity of T2A can be articulated along three key dimensions:
- Bridging reasoning and execution to close the task loop. The outputs of agents are often high-level plans or symbolic descriptions, but without executable mappings, these outputs remain inert and fail to alter the environment. T2A provides the crucial mechanism for grounding abstract reasoning into concrete actions. For example, a planning agent may generate the instruction “analyze the database and generate a report,” while database query and visualization tools carry out the corresponding SQL queries and chart rendering. Without T2A, agent reasoning would remain disconnected from environmental change, leaving the perception–reasoning–execution–feedback loop incomplete. Thus, T2A is indispensable for ensuring that agents can translate reasoning into operational impact.
- Reducing cognitive and computational burden of core agents. If every low-level operation were to be handled directly by an agent, it would be overloaded with detail management, increasing computational costs and undermining strategic reasoning efficiency. Through T2A, agents can delegate domain-specific or low-level tasks to specialized tools and concentrate on higher-level planning and adaptation. For instance, a data analysis agent need not implement SQL parsing, execution, and optimization itself, but instead invokes SQL tools that encapsulate these functions. This separation prevents agents from being “trapped in details” and ensures that their resources remain dedicated to abstract reasoning. The necessity here lies in maintaining agents at the right level of abstraction to maximize efficiency and scalability.
- Enhancing modularity and ecological extensibility. Tools are inherently modular and portable across domains, whereas agent reasoning mechanisms evolve more gradually. With T2A, agents can flexibly incorporate new tools through standardized interfaces without retraining or structural modification, thereby rapidly expanding their functional boundaries. For example, a writing agent can seamlessly integrate grammar checkers, translation tools, or image generators to support multimodal authoring, all without altering its core reasoning logic. This modularity and extensibility ensure that agents remain adaptive as environments and ecosystems evolve, allowing the system to sustain long-term scalability and cross-domain applicability.
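The three T2A properties above can be sketched as a planner that registers tools through one standardized interface and delegates low-level operations to them. The registration API and tool names here are hypothetical, not the protocol's actual interface:

```python
# Sketch (hypothetical API) of T2A: an agent plugs in tools through a
# standardized interface and delegates execution, keeping its own logic
# at the planning level.
from typing import Callable, Dict, List, Tuple

class PlannerAgent:
    def __init__(self) -> None:
        self.tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        """Standardized interface: new tools attach without retraining."""
        self.tools[name] = fn

    def act(self, plan: List[Tuple[str, str]]) -> List[str]:
        # The plan stays abstract: (tool_name, argument) pairs.
        return [self.tools[name](arg) for name, arg in plan]

# Tools encapsulating domain-specific operations.
sql_tool = lambda q: f"rows for: {q}"
chart_tool = lambda spec: f"chart of: {spec}"

agent = PlannerAgent()
agent.register("sql", sql_tool)
agent.register("chart", chart_tool)
print(agent.act([("sql", "SELECT *"), ("chart", "sales by month")]))
```

The agent never implements SQL parsing or chart rendering itself; extending it to multimodal authoring would mean only further `register` calls, not changes to its reasoning loop.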
Environment-to-Tool (E2T). The core motivation of E2T lies in abstracting the raw action space of environments into a structured and standardized toolkit, where individual actions are no longer isolated calls but interconnected components sharing contextual information and causal constraints. This transformation enables agents to operate environments at a higher level of planning rather than dealing with fragmented primitives. Its necessity can be articulated in three main dimensions:
- Enhancing interaction consistency and plannability. Raw environment actions are often fragmented and tightly coupled to implementation details, making strategies hard to generalize or reproduce. Through E2T, these actions are typed and explicitly annotated with preconditions and postconditions, forming a “plannable interface layer” that supports sequential decision-making. Agents thus gain a consistent and reusable structure for reasoning across complex environments.
- Strengthening semantic alignment and composability. Toolkits enforce standardized input-output patterns, error-handling semantics, and shared invariants. This allows individual tools to be reliably composed into macro-tools and reused across structurally similar environments. As a result, agents can align semantics across heterogeneous domains, improving transferability and reducing the engineering cost of adaptation.
- Ensuring unified security and operability. An E2T toolkit not only abstracts actions but also integrates mechanisms such as permission control, compliance boundaries, execution logs, and performance optimization. Compared with direct manipulation of raw actions, this design guarantees governability and observability of interactions, providing a stable operational foundation for scalable intelligent systems.
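The precondition/postcondition annotation at the heart of E2T can be sketched as typed tools checked against a symbolic state. The schema and fact names below are hypothetical, chosen only to illustrate the plannable interface layer:

```python
# Sketch (hypothetical schema) of E2T: raw actions become typed tools
# annotated with preconditions and postconditions over a symbolic state.
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class TypedTool:
    name: str
    pre: FrozenSet[str]     # facts that must hold before invocation
    post: FrozenSet[str]    # facts guaranteed after invocation
    run: Callable[[], None]

def applicable(tool: TypedTool, state: set) -> bool:
    """A tool is plannable only when its preconditions hold."""
    return tool.pre <= state

open_file = TypedTool("open_file", frozenset(), frozenset({"file_open"}), lambda: None)
edit_file = TypedTool("edit_file", frozenset({"file_open"}), frozenset({"file_edited"}), lambda: None)

state: set = set()
for tool in (open_file, edit_file):
    assert applicable(tool, state), f"precondition violated: {tool.name}"
    state |= tool.post          # apply the tool's postconditions

print(sorted(state))
```

Because pre/post sets are explicit, a planner can chain tools by matching postconditions to preconditions, and a gateway can layer permission checks or logging onto the same annotations.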
Tool-to-Environment (T2E). The essence of T2E lies in elevating a set of originally independent tools into an environment abstraction, transforming them from isolated callable interfaces into a unified action space governed by shared state and contextual rules. This transformation means that tools are no longer merely passive functions but are organized into a coherent environment where sequential decision-making, long-term planning, and adaptive control become possible. For example, in a programming scenario, tools for code editing, compilation, and debugging are scattered when invoked independently, but under T2E they are encapsulated as a programming environment that maintains code state consistency and contextual continuity, thereby enabling agents to execute complete development workflows. The necessity of T2E is reflected in three key aspects:
- From function calls to stateful spaces. Tools used in isolation are often stateless or weakly stateful, with limited causal connections between invocations. Through T2E, tools are embedded within a shared state space, ensuring historical dependencies and precondition–postcondition constraints are preserved. This upgrade supports sequential reasoning and long-horizon planning. For instance, code editing must remain consistent with compilation and debugging, which is only guaranteed within a stateful environment abstraction.
- Enhanced compositionality and planning. T2E organizes tools into a structured environment with explicit transition rules, enabling agents to combine primitive tool actions into higher-level strategies. Instead of treating each tool as a standalone utility, agents can now treat the toolset as an interconnected action space, allowing for the construction of complex workflows such as “design–implement–test–deploy” pipelines.
- Unified governance and scalability. By encapsulating tools into an environment, T2E makes it possible to enforce system-wide policies such as access control, compliance constraints, execution logging, and performance monitoring. This ensures that agent interactions remain safe, auditable, and scalable, even as the toolset grows in size and complexity.
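The three aspects above can be made concrete with a minimal sketch, assuming a hypothetical programming scenario like the one described: three formerly independent tools (edit, compile, run) are lifted into one environment class whose shared state enforces precondition–postcondition constraints across calls. All names here are illustrative, not part of the TEA implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ProgrammingEnv:
    """Hypothetical T2E sketch: edit, compile, and run share one
    state space, so constraints such as 'compile only after an
    edit' and 'run only after a successful compile' are enforced
    by the environment rather than by each caller."""
    source: str = ""
    compiled: bool = False
    log: list = field(default_factory=list)

    def edit(self, code: str) -> None:
        self.source = code
        self.compiled = False          # postcondition: binary is now stale
        self.log.append("edit")

    def compile(self) -> bool:
        assert self.source, "precondition: nothing to compile"
        self.compiled = True           # state persists across invocations
        self.log.append("compile")
        return True

    def run(self) -> str:
        assert self.compiled, "precondition: compile before running"
        self.log.append("run")
        return "ok"

env = ProgrammingEnv()
env.edit("print('hi')")
env.compile()
env.run()                              # the full workflow succeeds in order
```

Calling `run()` on a fresh instance would fail its precondition check, which is exactly the causal dependency between invocations that isolated, stateless tools cannot express.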
Agent-to-Environment (A2E). The A2E transformation redefines an agent not merely as an autonomous decision-maker but as an interactive environment that exposes state spaces, interaction rules, and feedback mechanisms for other agents. In this view, an agent is abstracted into a contextual substrate upon which other agents can act, thereby turning its internal reasoning and behavioral logic into the operational constraints of an environment. This design highlights the interchangeability of agents and environments and provides a principled pathway for hierarchical modeling and scalable system integration. The necessity of this transformation can be articulated across three dimensions:
- Layered and modular system design. In complex tasks, if all agents directly interact with the base environment, the system quickly becomes unmanageable and difficult to extend. Through A2E, high-level agents can be abstracted as environments, exposing simplified interaction interfaces for lower-level agents. For example, a “market agent” can be abstracted as an environment that maintains trading rules, asset states, and dynamic pricing, while individual trader agents perform buying and selling actions within it. This establishes a clear hierarchical structure in which low-level agents focus on local optimization and high-level agents (as environments) coordinate global dynamics, thereby improving scalability and maintainability.
- Facilitating multi-agent training and transfer learning. A2E also provides a practical framework for training and simulation in multi-agent systems. A well-trained agent can be transformed into an environment that offers stable yet challenging dynamics for other agents to learn from. For instance, a navigation agent can be redefined as an environment, exposing route planning and obstacle feedback to new agents, thus eliminating the need to remap complex dynamics. This approach accelerates training, supports transfer of task knowledge, and improves generalization under limited data and computational resources.
- Human-in-the-loop interaction and rule modeling. In many collaborative scenarios, humans themselves can be viewed as special agents. However, treating them as fully autonomous components complicates the adaptation of artificial agents to human constraints. Through A2E, humans can instead be modeled as environments, where their preferences, behaviors, and constraints are expressed as environmental feedback. For example, in an interactive writing system, human edits and suggestions can be treated as feedback signals, guiding an artificial agent to iteratively refine its outputs. This modeling offers a unified interface that allows agents to better align with human intentions, thereby improving efficiency and user experience in human-AI collaboration.
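The market example above can be sketched as follows. This is a hypothetical illustration of A2E, not TEA's API: a "market agent" is re-exposed through a reset/step environment interface, so its internal pricing logic becomes the transition dynamics that a lower-level trader agent acts against.

```python
class TraderPolicy:
    """A hypothetical low-level agent: buys when the price is low,
    sells otherwise."""
    def act(self, price: float) -> str:
        return "buy" if price < 100 else "sell"

class MarketAgentAsEnv:
    """A2E sketch: the market agent's internal rules (demand moves
    the price) are exposed as environment dynamics via reset/step,
    so other agents interact with it as a contextual substrate."""
    def reset(self) -> float:
        self.price = 90.0
        return self.price

    def step(self, action: str) -> tuple[float, float]:
        # The market agent's own behavioral logic becomes the
        # environment's transition rule.
        self.price *= 1.05 if action == "buy" else 0.95
        reward = 1.0 if action == "sell" and self.price > 100 else 0.0
        return self.price, reward

env = MarketAgentAsEnv()
trader = TraderPolicy()
obs = env.reset()
for _ in range(5):
    obs, reward = env.step(trader.act(obs))
```

Note that the trader never inspects the market agent's internals; it only sees observations and rewards, which is the hierarchical separation A2E is meant to establish.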
Environment-to-Agent (E2A). The E2A transformation elevates environments from passive containers of state and action spaces into autonomous components capable of reasoning, decision-making, and proactive interaction. Traditionally, environments only provide state transitions in response to external actions, but in dynamic and open-ended scenarios, this passivity often becomes a limitation. By embedding reasoning mechanisms and adaptive policies into environments, E2A enables them to operate as agents in their own right, expanding the functional landscape of multi-agent systems. The necessity of this transformation can be articulated across three dimensions:
- Enhancing realism and challenge in training. Passive environments often fail to capture the richness of real-world dynamics, where external systems and actors are not static but actively adaptive. Through E2A, an environment can be transformed into an adversarial or cooperative agent, thereby offering dynamic strategies and responses that better approximate real-world complexity. For example, in reinforcement learning for autonomous driving, an environment that passively simulates traffic can be upgraded into an opponent agent that actively generates unpredictable vehicle behaviors, thus creating more robust and realistic training conditions.
- Facilitating adaptive coordination and cooperation. In multi-agent systems, agents often need to adapt to evolving contexts, but purely passive environments cannot provide the necessary adaptive feedback loops. By converting environments into agents, they can participate in coordination, negotiation, and joint planning. For instance, a smart city simulation environment can be redefined as an agent that dynamically manages traffic flows, energy distribution, and environmental policies, actively engaging with other agents (e.g., transportation or energy management agents). This transformation ensures that system-level goals are co-constructed rather than imposed unilaterally.
- Expanding the functional scope of environments. Beyond training and coordination, E2A extends environments into autonomous participants in computational ecosystems. A passive environment can only define possibilities, but as an agent, it can proactively initiate actions, enforce constraints, and even set goals that shape the trajectory of interaction. For example, in gaming, a dungeon environment that passively defines maps and rewards can be transformed into an opponent agent that actively strategizes, adapts difficulty levels, and tailors interaction to player behavior. This shift not only increases engagement but also makes environments integral contributors to task execution and system evolution.
### A.3 Motivation for the Self-Evolution Module
General-purpose agents operate under shifting task distributions, evolving environments, and expanding tool ecosystems. In this setting, treating prompts, tools, and coordination policies as static assets can lead to accumulated brittleness, where small interface changes, unseen task patterns, or environment-specific constraints cause cascading failures. This motivates a protocol-level self-evolution mechanism that allows agent-associated components to be refined from execution feedback while remaining governed. In TEA, self-evolution is coupled with version management and tracing so that each update is recorded with explicit version lineage, enabling reproducibility, audit, and rollback when an update degrades performance. Moreover, modeling environments with explicit boundaries and constraints provides a natural safety and permission layer for evolution, preventing uncontrolled side effects during online updates. Finally, although refinement introduces additional computation, TEA encourages reuse of evolved components and synthesized tools across tasks, amortizing one-time refinement cost over subsequent runs.
### A.4 Other Relationships
Tool typology and roles. In the design of agent–tool interactions, tools can be categorized according to their functional roles and structural properties. Different types of tools vary in their degree of statefulness, contextual awareness, adaptivity, and autonomy. This typology highlights how tools evolve from simple callable functions to more adaptive and contextually grounded components, shaping how agents can reason, coordinate, and act through them.
- Ordinary tools (MCP-style). Stateless callable functions with weak or implicit inter-tool relations. They typically lack environment-bound context and do not adapt their behavior to evolving task states beyond provided parameters.
- Agent-to-Tool (A2T). An agent is exposed as a callable tool while preserving internal policies, memory, and coordination capabilities. Compared with ordinary tools, A2T exhibits task adaptivity and limited autonomy, enabling on-the-fly decomposition and parameter refinement.
- Environment-to-Tool (E2T). An environment’s action space is lifted into a context-aware toolkit. Tools within the toolkit are explicitly related via shared state, pre/post-conditions, and constraints, yielding stronger intra-tool structure than standalone MCP tools.
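The A2T case in this typology can be illustrated with a small sketch (all names hypothetical): an agent with its own internal memory is wrapped behind an MCP-style tool entry, so callers see an ordinary callable while the agent retains state and policy across invocations.

```python
def as_tool(agent, name: str, description: str) -> dict:
    """A2T sketch (hypothetical helper): expose an agent's run()
    method as a callable tool entry. The tool boundary hides the
    agent's memory and policies from the caller."""
    return {
        "name": name,
        "description": description,
        "call": agent.run,
    }

class SummarizerAgent:
    def __init__(self):
        self.memory = []        # internal state survives across calls

    def run(self, text: str) -> str:
        self.memory.append(text)  # unlike a stateless MCP tool
        return text[:20] + "..." if len(text) > 20 else text

tool = as_tool(SummarizerAgent(), "summarize", "Shorten a text")
tool["call"]("a very long input that needs summarizing")
```

An ordinary MCP-style tool would be the same dictionary with a pure function in the `call` slot; the difference A2T introduces is entirely behind that slot.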
Scaling selection via hierarchical management. As tool ecosystems grow, selecting appropriate candidates becomes a major bottleneck. TCP supports delegating coherent tool families (or toolkits) to agent or environment managers, inducing a tree-structured index (category → toolkit → primitive tool). This hierarchical routing substantially reduces search cost and aligns with TEA transformations (A2T/E2T/T2E) by allowing managers to prune branches and surface only context-relevant subsets.
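A minimal sketch of this tree-structured routing, with a hypothetical index and tag-based pruning standing in for the managers' actual matching logic:

```python
# Hypothetical category -> toolkit -> primitive-tool index.
INDEX = {
    "web":  {"browser": ["open_page", "click", "extract_text"]},
    "data": {"pandas_kit": ["load_csv", "filter_rows", "plot"]},
    "file": {"fs_kit": ["read_file", "write_file", "list_dir"]},
}

def route(query_tags: set) -> list:
    """Walk the tree, pruning every category branch that does not
    match the query tags; only leaves under surviving branches are
    surfaced to the planner. Cost tracks tree depth and the matched
    subtree size rather than the total tool count."""
    hits = []
    for category, toolkits in INDEX.items():
        if category not in query_tags:
            continue                 # prune the entire category branch
        for toolkit, tools in toolkits.items():
            hits.extend(tools)
    return hits

route({"data"})   # only the data-analysis toolkit's tools survive
```

With a balanced hierarchy, each routing decision considers only the children of the current node, which is the source of the logarithmic-depth behavior discussed in the scalability comparison later.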
Embedding-based retrieval. Each tool is assigned a vector embedding derived from its name, description, schema, and usage signals. Vector similarity enables rapid shortlist generation for candidate tools and can be combined with keyword filtering and hierarchical routing (tree walk + ANN search). This hybrid retrieval pipeline improves recall under tool proliferation while reducing latency and cognitive load for agent planners.
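The hybrid retrieval pipeline can be sketched as below. To stay self-contained, a toy bag-of-words vector stands in for the learned embedding over name, description, schema, and usage signals; the keyword filter and similarity ranking mirror the described combination.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical tool registry: name -> description.
TOOLS = {
    "load_csv":  "load a csv file into a table",
    "plot":      "plot a chart from a table",
    "open_page": "open a web page in the browser",
}

def shortlist(query: str, k: int = 2, must_contain: str = ""):
    """Hybrid retrieval: an optional keyword filter prunes the pool,
    then the survivors are ranked by embedding similarity and the
    top-k are returned as the candidate shortlist."""
    q = embed(query)
    pool = {n: d for n, d in TOOLS.items()
            if not must_contain or must_contain in d}
    ranked = sorted(pool, key=lambda n: cosine(q, embed(pool[n])),
                    reverse=True)
    return ranked[:k]

shortlist("load table from csv file", must_contain="table")
```

In a full system the keyword filter would be replaced or preceded by the hierarchical tree walk, with ANN search applied only within the surviving subtree.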
## Appendix B Comparison with Other Protocols
Table 6: Protocol-level comparison: TEA Protocol vs. A2A vs. MCP across fundamental dimensions including component management, lifecycle, version control, self-evolution, and ecosystem support. Symbols: $\checkmark$ = Yes/Supported, $\triangle$ = Partial, $×$ = No/Not supported.
| Dimension | TEA | A2A | MCP |
| --- | --- | --- | --- |
| Basic Information | | | |
| Proposer | Our work | Google | Anthropic |
| Core Component | Tool, Environment, Agent | Agent, Tool | Model |
| Protocol Focus | Tool, Environment, Agent | Agent, Tool | Tool/Resource |
| Agent & System Features | | | |
| Agent First-Class | $\checkmark$ | $\triangle$ | $×$ |
| Multi-Agent | $\checkmark$ | $\triangle$ | $×$ |
| Tracer | $\checkmark$ | $×$ | $×$ |
| Memory | $\checkmark$ | $×$ | $×$ |
| Component Lifecycle | $\checkmark$ | $×$ | $×$ |
| Version Management | $\checkmark$ | $×$ | $×$ |
| Self-Evolution Support | $\checkmark$ | $×$ | $×$ |
| Context Management | $\checkmark$ | $\triangle$ | $×$ |
| Component Transformations | $\checkmark$ | $×$ | $×$ |
| Scalability | $O(\log n)$ | $O(n^2)$ | $O(n)$ |
| General & Ecosystem | | | |
| Model-Agnostic | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| Framework-Agnostic | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| Key Strength | Unified | Interop. | Standard. |
| Open Ecosystem | $\checkmark$ | $\triangle$ | $\triangle$ |
Table 6 provides a systematic comparison across fundamental protocol dimensions. We explain each dimension in detail as follows:
### B.1 Basic Information
Proposer: This dimension identifies the originating organization for each protocol. Google’s A2A protocol was introduced as part of their agent communication framework, focusing on enabling agents to communicate with each other. Anthropic’s MCP (Model Context Protocol) was designed to standardize how LLMs interact with tools and resources. TEA Protocol is proposed in this work as a unified framework that extends beyond these existing approaches by integrating tools, environments, and agents into a cohesive system.
Core Component: This dimension defines the fundamental building blocks treated as first-class protocol components. The TEA Protocol uniquely unifies Tools, Environments, and Agents as co-equal, first-class components, each governed by dedicated context protocols (TCP, ECP, ACP) that provide comprehensive lifecycle and version management. This unified abstraction is critical for enabling self-evolution, where components can dynamically adapt their implementations (e.g., code evolution or prompt refinement). In contrast, existing protocols lack a unified first-class component abstraction. Google’s A2A protocol centers primarily on agent-to-agent communication, and does not establish tools, environments, context, or tasks as independent, managed components. This architectural limitation results in state dispersion across heterogeneous agents, complicates global lifecycle management, and leads to a tight coupling between reasoning and execution, which significantly hinders system refactorability. Anthropic’s MCP treats tools as passive, stateless interfaces rather than evolvable and composable components. Within this framework, tools lack internal state semantics, versioning and dependency models, and mechanisms for context inheritance. Ultimately, while existing protocols facilitate the invocation of resources, they fail to provide unified mechanisms for systematic management and structural evolution.
Protocol Focus: This dimension describes the primary communication and interaction patterns each protocol addresses. TEA provides three unified protocols: TCP (Tool Context Protocol) for tool management, ECP (Environment Context Protocol) for environment abstraction, and ACP (Agent Context Protocol) for agent orchestration. These protocols work together to enable seamless interoperability across all three component types, with each protocol maintaining comprehensive lifecycle tracking, version histories, and evolution support. This enables dynamic adaptation scenarios such as tool evolution (where tools can be updated, refined, or replaced while maintaining backward compatibility), prompt evolution (where agent prompts can be versioned and improved over time), and agent capability evolution (where agents can learn and adapt their behaviors). A2A focuses specifically on agent-to-agent messaging and coordination, providing communication primitives but not addressing tools or environments directly, and lacks any version or evolution management. MCP handles tool and resource integration for LLMs, standardizing how models invoke tools and access resources, but treats tools as static components without lifecycle or version management, making it impossible to support tool evolution or prompt refinement workflows.
### B.2 Agent & System Features
Agent First-Class: First-class support signifies that agents are modeled as independent, managed protocol components with their own semantic schemas, state metadata, and lifecycle mechanisms. TEA’s ACP provides full first-class status to agents, capturing their roles, competencies, and objectives within a unified schema that enables seamless registration, discovery, and orchestration. A2A provides only partial support; although it enables communication, it treats agents more as opaque RPC endpoints with service-level identifiers rather than semantically rich components with managed internal states. MCP does not define agents as protocol components at all, focusing instead on model-to-tool interactions, thereby overlooking the agent as a primary unit of orchestration and management.
Multi-Agent: Multi-agent support refers to mechanisms for coordinating multiple agents in collaborative, competitive, or hierarchical configurations. TEA’s ACP formalizes multi-agent dynamics through structured relationship representations, supporting hierarchical organization (where high-level agents coordinate low-level agents), cooperative configurations (where agents collaborate toward shared goals), and competitive scenarios (where agents may have conflicting objectives). A2A enables call-level agent interactions, allowing agents to invoke each other as services, but lacks structured collaboration patterns or negotiation mechanisms. MCP does not address multi-agent scenarios at all, as it focuses on model-tool interactions rather than agent coordination.
Tracer: Tracer refers to mechanisms for recording and tracking the complete execution process of agents, capturing detailed execution traces, decision points, tool invocations, state transitions, and intermediate results throughout task execution. TEA provides comprehensive tracing capabilities through its tracer system, which meticulously records the agent execution process for each task, enabling persistent task tracking, progress monitoring, error handling, and post-execution analysis. This allows developers to understand how agents reason, act, and evolve throughout task completion, facilitating debugging, optimization, and continuous improvement of agent behaviors. A2A and MCP lack tracing mechanisms, meaning execution tracking must be implemented ad-hoc in each application, leading to inconsistent logging and difficulty in understanding agent decision-making processes and debugging complex workflows.
Memory: Memory interfaces provide mechanisms for storing, retrieving, and managing information across agent interactions and sessions. TEA provides a dedicated memory manager that coordinates different manager components (tool managers, environment managers, agent managers) through session-based management. The memory manager operates as a workflow agent that records complete execution histories, automatically determines when to summarize information, and extracts task insights to assist future task completion. Critically, the session-based management ensures that concurrent calls do not result in resource conflicts, maintaining data consistency and preventing race conditions across multiple agent interactions. This enables agents to build upon past experiences and maintain long-term knowledge while ensuring reliable concurrent access. A2A and MCP do not define memory management protocols, leaving memory concerns to be handled entirely at the application layer, which can lead to inconsistent memory management, difficulty in sharing knowledge across agents, and potential resource conflicts in concurrent scenarios.
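The session-based isolation described above can be sketched as follows. This is a conceptual illustration, not TEA's implementation: each (session ID, agent name, task ID) triple gets its own event log and its own lock, so concurrent recordings never contend for the same memory record.

```python
import asyncio
from collections import defaultdict

class MemoryManager:
    """Hypothetical sketch of session-based concurrency control:
    memory is keyed by (session_id, agent, task_id), and each key
    has a dedicated lock, preventing race conditions between
    concurrent tasks while keeping their histories separate."""
    def __init__(self):
        self._events = defaultdict(list)
        self._locks = defaultdict(asyncio.Lock)

    async def record(self, session_id, agent, task_id, event):
        key = (session_id, agent, task_id)
        async with self._locks[key]:   # isolation per session key
            self._events[key].append(event)

    def history(self, session_id, agent, task_id):
        return list(self._events[(session_id, agent, task_id)])

async def demo():
    mm = MemoryManager()
    # Two tasks of the same agent write concurrently without conflict.
    await asyncio.gather(
        mm.record("s1", "planner", "t1", "step-1"),
        mm.record("s1", "planner", "t2", "step-1"),
    )
    return mm.history("s1", "planner", "t1")

asyncio.run(demo())
```

A summarization or insight-extraction pass, as described in the text, would then read each session's history without blocking writers on other sessions.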
Component Lifecycle: Component lifecycle management refers to comprehensive lifecycle tracking and management for all component types (tools, environments, and agents) throughout their operational lifetime. TEA provides unified component lifecycle management through its context protocols (TCP, ECP, ACP), handling creation, registration, state tracking, execution monitoring, and controlled decommissioning for all three component types. This enables dynamic maintenance of instance code, proper resource allocation, state coherence, and graceful termination. Critically, TEA’s lifecycle management supports self-evolution scenarios where components can be updated, refined, or replaced while maintaining operational continuity. A2A and MCP lack comprehensive lifecycle management at this level: A2A only provides basic agent communication without lifecycle tracking for tools or environments, while MCP treats tools as static resources with no lifecycle management, making it impossible to support dynamic updates or evolution.
Version Management: Version management refers to mechanisms for tracking, maintaining, and managing multiple versions of components (tools, environments, and agents) including their code, prompts, and capabilities. TEA provides comprehensive version management through lifecycle and version systems embedded in TCP, ECP, and ACP. This enables critical self-evolution scenarios: code evolution where tool and environment implementations can be versioned, updated, and maintained with backward compatibility; prompt evolution where agent prompts can be versioned, A/B tested, and incrementally improved based on performance feedback; and capability evolution where agents can maintain multiple capability versions and gradually deploy improvements. Each component maintains version metadata, change histories, and evolution trajectories, enabling rollback, comparison, and gradual deployment of improvements. This is essential for building adaptive systems that improve over time. A2A and MCP completely lack version management: A2A treats agents as static service endpoints without versioning support, while MCP treats tools as immutable resources with no version control, making it impossible to support tool evolution, prompt refinement, or adaptive capability development.
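A minimal sketch of the versioning semantics described above, with hypothetical names (TEA's actual manager also persists metadata, timestamps, and change histories): registrations bump a semantic version, access by name defaults to the latest version, and rollback is just retrieval of an explicit earlier version.

```python
class VersionManager:
    """Hypothetical sketch: each component keeps an ordered version
    history of (version, payload) pairs; get() returns the latest
    payload unless a specific version is requested."""
    def __init__(self):
        self._history = {}   # name -> list of ((major, minor, patch), payload)

    def register(self, name, payload, bump="patch"):
        hist = self._history.setdefault(name, [])
        major, minor, patch = hist[-1][0] if hist else (1, 0, -1)
        if bump == "major":
            version = (major + 1, 0, 0)
        elif bump == "minor":
            version = (major, minor + 1, 0)
        else:
            version = (major, minor, patch + 1)
        hist.append((version, payload))
        return version

    def get(self, name, version=None):
        hist = self._history[name]
        if version is None:             # latest version by default
            return hist[-1][1]
        return next(p for v, p in hist if v == version)

vm = VersionManager()
vm.register("summarize_prompt", "v1 prompt text")            # (1, 0, 0)
vm.register("summarize_prompt", "improved text", "minor")    # (1, 1, 0)
vm.get("summarize_prompt")              # latest: "improved text"
vm.get("summarize_prompt", (1, 0, 0))   # explicit rollback target
```

Deprecation and archiving, mentioned in the text, would amount to marking entries in the history rather than deleting them, preserving the evolution trajectory.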
Self-Evolution Support: Self-evolution support refers to comprehensive mechanisms that enable components (tools, environments, and agents) to evolve, adapt, and improve over time. TEA provides full self-evolution support by combining component lifecycle management and version management systems, enabling components to dynamically update, refine, and evolve while maintaining operational continuity and backward compatibility. This enables critical self-evolution scenarios: tool evolution where tools can be dynamically updated, refined, or replaced while maintaining version histories; prompt evolution where agent prompts can be versioned, A/B tested, and incrementally improved based on performance feedback; and agent capability evolution where agents can learn from experiences, adapt their behaviors, and maintain multiple capability versions. The combination of lifecycle and version management enables rollback, comparison, gradual deployment, and continuous improvement workflows that are essential for building adaptive systems that improve over time. A2A and MCP completely lack self-evolution support: A2A treats agents as static service endpoints without lifecycle or versioning mechanisms, while MCP treats tools as immutable resources with no lifecycle or version management, making it impossible to support any form of evolution, refinement, or adaptive capability development.
### B.3 Context & System Capabilities
Context Management: Context management refers to mechanisms for capturing, organizing, and retrieving contextual information about tools, environments, agents, and their relationships. TEA offers comprehensive context management through its three context protocols: TCP maintains tool context with embedding-based retrieval and semantic relationship modeling, ECP manages environment state and execution context, and ACP tracks agent states and coordination context. This enables intelligent tool selection, environment-aware execution, and context-aware agent orchestration. A2A provides limited context sharing between agents through message passing, but lacks structured context management or relationship modeling. MCP uses flat tool descriptions without modeling inter-tool relationships, toolkits, or contextual execution environments, making it difficult to select appropriate tools in large-scale systems.
Component Transformations: Component transformations enable components (tools, environments, and agents) to dynamically change their roles (e.g., an agent becoming a tool, or an environment becoming an agent). TEA uniquely supports six transformation types: Agent-to-Tool (A2T) encapsulates agent capabilities as reusable tools, Tool-to-Agent (T2A) designates tools as agent actuators, Environment-to-Tool (E2T) converts environment actions into toolkits, Tool-to-Environment (T2E) elevates tool sets into environment abstractions, Agent-to-Environment (A2E) encapsulates agents as interactive environments for hierarchical modeling, and Environment-to-Agent (E2A) infuses reasoning into environments. These transformations enable dynamic role reconfiguration and flexible system architectures. A2A and MCP do not support component transformations, meaning components have fixed roles that cannot be dynamically adapted to changing task requirements.
Scalability: In an open ecosystem with $n$ coordinatable resources, the fundamental difference in coordination overhead stems from the presence or absence of hierarchical component abstraction and routing mechanisms. A2A adopts a flat multi-agent peer-to-peer collaboration model, where coordination can grow quickly with system scale due to dense pairwise interactions and state alignment. MCP reduces tool integration costs through unified interfaces, but still relies on traversing a large candidate pool or explicit application-level orchestration during resource discovery and capability matching, which can make coordination grow with the number of resources. In contrast, TEA unifies agents, tools, and environments as managed components through transformations (e.g., A2E), and utilizes tree-structured indexing and hierarchical routing for resource localization and task distribution. Under hierarchical capability organization, this can reduce coordination to logarithmic-depth routing, since each decision only considers a small, context-relevant subset at each level.
### B.4 General & Ecosystem
Model-Agnostic and Framework-Agnostic: Model-agnostic means protocols work with diverse LLM backends (GPT, Claude, Gemini, etc.), while framework-agnostic means they can be integrated into different application frameworks. All three protocols are designed with these properties: TEA provides a unified LLM interface at the infrastructure layer that abstracts model heterogeneity, A2A’s agent communication is independent of the underlying models, and MCP’s tool interface works with any LLM that supports function calling. This ensures broad compatibility and allows developers to choose models and frameworks based on their specific needs rather than protocol constraints.
Key Strength: This dimension highlights each protocol’s primary advantage. TEA’s strength lies in its unified integration of Tools, Environments, and Agents into a single cohesive framework, enabling seamless interoperability and dynamic transformations between component types. A2A excels at agent interoperability, providing efficient mechanisms for agents to communicate and coordinate. MCP provides robust tool standardization, making it easy to integrate diverse tools with LLMs through a consistent interface.
Open Ecosystem Support: Open ecosystem support refers to whether a protocol can independently enable a thriving ecosystem of interoperable agents, tools, and environments without requiring additional frameworks. TEA provides a complete protocol stack with all necessary components (tool management, environment abstraction, agent orchestration, transformations, context management, etc.) to support an open ecosystem where different developers can create compatible agents, tools, and environments that seamlessly interoperate. A2A and MCP provide partial ecosystem support: A2A enables agent-to-agent interoperability but lacks tool and environment management, requiring additional frameworks for complete ecosystem support; MCP enables tool integration and standardization but lacks agent coordination and environment management, also requiring additional frameworks to achieve full ecosystem capabilities.
## Appendix C Details of TEA Protocol
We provide a detailed presentation of the TEA Protocol in this section, as illustrated in Figure 1. The protocol architecture is fundamentally designed around coroutine-based asynchronous execution, enabling concurrent and parallel execution across all system components. This design supports multiple execution patterns: a single agent can concurrently execute multiple independent tasks without state interference, multiple agents can coordinate on shared tasks through collaborative mechanisms, and multiple agents can operate on distinct tasks in parallel.
The TEA Protocol comprises three architectural layers:

- Basic Managers provide foundational services through six specialized managers: the model manager abstracts heterogeneous LLM backends through a unified interface, ensuring model-agnostic interoperability; the prompt manager handles prompt lifecycle management, versioning, and retrieval for agent systems; the memory manager coordinates memory operations across different component managers via session-based concurrency control, preventing resource conflicts in concurrent scenarios; the dynamic manager implements serialization and deserialization mechanisms, converting components (prompts, memory, agents, tools, environments) and their associated code into JSON representations for persistence and restoration; the version manager maintains version histories for all components, where modifications generate new versions while preserving backward compatibility, and component access by identifier retrieves the most recent version by default; and the tracer captures comprehensive execution traces, recording decision points, tool invocations, state transitions, and intermediate results for post-execution analysis and debugging.
- Core Protocols define three context protocols: the Tool Context Protocol (TCP), the Environment Context Protocol (ECP), and the Agent Context Protocol (ACP), each managing its respective component type with dedicated schemas, metadata registries, and lifecycle management.
- Protocol Transformations establish bidirectional conversion relationships among TCP, ECP, and ACP, enabling dynamic role reconfiguration and seamless resource orchestration across component boundaries.
Additionally, the protocol incorporates a Self-Evolution Module that addresses the critical requirement for adaptive agent capabilities by encapsulating evolvable components, including prompts, tool implementations, agent architectures, memory strategies, environment actions, and successful execution solutions, as differentiable variables. The module integrates TextGrad-style optimization and self-reflection mechanisms, allowing agents to iteratively refine these components during task execution. Optimized components are automatically registered as new versions through the version manager, ensuring that subsequent tasks leverage improved capabilities while maintaining access to historical versions for comparative analysis and rollback.
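The evolvable-variable idea can be sketched conceptually as below. This is not the TextGrad API and not TEA's implementation: a prompt is exposed as a variable that accumulates textual "gradients" (critiques), is rewritten by an update rule (standing in for an LLM call), and has every revision registered as a new version.

```python
class TrainableVariable:
    """Conceptual sketch of a self-evolving component: textual
    feedback plays the role of a gradient, step() applies it, and
    each revision is appended to a version history so earlier
    versions remain available for comparison and rollback."""
    def __init__(self, value: str):
        self.value = value
        self.versions = [value]
        self.feedback = []

    def backward(self, critique: str) -> None:
        self.feedback.append(critique)        # accumulate textual gradient

    def step(self, rewrite) -> None:
        # 'rewrite' stands in for an LLM call that applies the feedback.
        self.value = rewrite(self.value, self.feedback)
        self.versions.append(self.value)      # auto-register new version
        self.feedback = []

prompt = TrainableVariable("Summarize the document.")
prompt.backward("Too vague: ask for bullet points and a word limit.")
prompt.step(lambda v, fb: v + " Use bullet points, under 100 words.")
```

In TEA, the version registration step would go through the version manager, so the refined prompt becomes the default for subsequent tasks while the original stays retrievable.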
### C.1 Basic Managers
The Basic Managers constitute the foundation of the TEA Protocol, providing essential services that enable higher-level functionalities. These managers include:
- Model Manager provides a unified interface for diverse large language models across multiple providers (OpenAI, Anthropic, Google, OpenRouter, etc.), supporting various model types including chat/completions, responses API, embeddings, and transcriptions. The manager maintains a centralized registry of model configurations, each encapsulating provider-specific parameters, capabilities (streaming, function calling, vision), and fallback mechanisms. It abstracts provider heterogeneity through a standardized invocation interface, enabling seamless model switching and ensuring consistent interaction patterns regardless of the underlying API. The manager supports asynchronous execution, tool/function calling, structured output formats, and automatic fallback to alternative models upon failures, ensuring robust and reliable model access across the system.
- Prompt Manager manages the complete lifecycle of prompts for agents, providing comprehensive version control, template rendering, and dynamic updates. The manager maintains a centralized registry of prompt configurations, each encapsulating system prompts, agent message templates, metadata, and version histories. It supports modular template rendering with dynamic variable substitution, enabling flexible prompt composition through configurable modules. The manager implements automatic versioning where prompt updates create new versions while preserving historical versions, enabling rollback and comparative analysis. It provides asynchronous registration, retrieval, and update operations with concurrent initialization support, ensuring efficient prompt management across multiple agents. The manager integrates with the self-evolution module by exposing trainable variables within prompts, allowing optimization algorithms to refine prompt content while maintaining version consistency. Prompts are persisted as JSON configurations and can be exported as contract documents, ensuring reproducibility and documentation of prompt evolution.
- Memory Manager provides comprehensive memory support to agents, managing the complete lifecycle of memory systems through registration, initialization, and session coordination. The manager implements session-based concurrency control, where each agent task operates within isolated memory sessions identified by session IDs, agent names, and task IDs. This session isolation ensures that concurrent calls from multiple agents or tasks do not result in resource conflicts or data corruption. The manager supports event-based memory operations, allowing agents to record execution events, step information, and contextual data throughout task execution. Memory systems are registered with configurations and can be dynamically retrieved, updated, and versioned, enabling agents to maintain persistent state and learn from historical interactions while ensuring thread-safe concurrent access.
- Dynamic Manager provides runtime code execution and serialization capabilities for all components (prompts, memory, agents, tools, environments) and their associated code. The manager enables dynamic loading of Python classes and functions from source code strings, creating virtual modules in memory without requiring disk-based files. It implements intelligent code analysis to automatically detect and inject necessary imports based on symbol usage, supporting context-aware import injection for different component types. The manager provides serialization and deserialization mechanisms for parameter schemas, converting Pydantic models to JSON representations and reconstructing them when needed. This enables components and their code to be stored as JSON configurations, loaded dynamically at runtime, and shared across different execution contexts, facilitating code evolution, version management, and dynamic component instantiation.
- Version Manager provides unified version management for all component types (tools, environments, agents, prompts, memory, etc.), maintaining comprehensive version histories with metadata, descriptions, and timestamps. The manager implements semantic versioning with automatic version generation, supporting major, minor, and patch version increments based on the nature of changes. It maintains version histories for each component, tracking the evolution trajectory and enabling access to any historical version for rollback, comparison, or analysis. The manager supports version lifecycle operations including deprecation and archiving, allowing controlled phase-out of older versions while preserving historical records. Version information is persisted as JSON, and component access by name automatically retrieves the latest version by default, while explicit version specification enables precise version control. This unified versioning system ensures consistent evolution tracking across all component types and enables seamless rollback capabilities when needed.
- Tracer provides comprehensive execution tracing capabilities for recording and analyzing agent execution processes throughout task completion. The tracer maintains session-based record management, where each execution step is captured as a structured record containing observation data, tool invocations, session identifiers, task identifiers, timestamps, and unique record IDs. Records are organized by session ID, enabling isolation of execution traces across different agent sessions and tasks while supporting cross-session analysis. The tracer implements flexible query mechanisms, allowing retrieval of records by session ID, task ID, record index, or record ID, facilitating both real-time monitoring and post-execution analysis. It supports persistent storage through JSON serialization with file locking mechanisms to ensure thread-safe concurrent access, enabling execution traces to be saved, loaded, and shared across different execution contexts. The tracer captures the complete execution trajectory including decision points, state transitions, tool call sequences, and intermediate results, providing a comprehensive audit trail for debugging, performance analysis, behavior understanding, and continuous improvement of agent capabilities.
These components work together to support the coroutine-based asynchronous framework, enabling parallel execution and concurrent task handling.
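As a concrete illustration of the Dynamic Manager's in-memory loading, the sketch below executes a source-code string inside a virtual module and returns a named symbol. The function `load_from_source` and the `add` tool body are hypothetical names for this example; the actual manager additionally performs context-aware import injection and Pydantic schema (de)serialization.

```python
import types

def load_from_source(source: str, symbol: str, module_name: str = "virtual_mod"):
    """Execute a source string inside an in-memory module (no disk file)
    and return one symbol from it -- a minimal sketch of dynamic loading."""
    module = types.ModuleType(module_name)
    exec(compile(source, f"<{module_name}>", "exec"), module.__dict__)
    return getattr(module, symbol)

# A tool body stored as a JSON-friendly string, as the Dynamic Manager would see it.
TOOL_SRC = '''
def add(a, b):
    """A trivially evolvable tool implementation."""
    return a + b
'''
```

Because the loaded object lives in an ordinary module namespace, it can be versioned, shared across execution contexts, and replaced at runtime without restarting the process.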
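The Version Manager's semantic versioning and latest-by-default retrieval can be sketched as follows; `bump`, `commit`, and `get` are illustrative names, not the framework's actual API.

```python
def bump(version: str, level: str = "patch") -> str:
    """Semantic-version increment; 'level' is major, minor, or patch."""
    major, minor, patch = (int(x) for x in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

class VersionManager:
    """Sketch: per-component version history with rollback-friendly access."""
    def __init__(self):
        self._history = {}  # name -> {version: payload}, insertion-ordered

    def commit(self, name, payload, level="patch"):
        versions = self._history.setdefault(name, {})
        latest = next(reversed(versions), "0.0.0")  # empty history starts at 0.0.0
        versions[bump(latest, level)] = payload

    def get(self, name, version=None):
        """Return the latest version by default; an explicit version pins access."""
        versions = self._history[name]
        return versions[version or next(reversed(versions))]
```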
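The Tracer's session-scoped records and query paths might look like the sketch below; field and method names are assumptions, and the real component adds file locking for thread-safe concurrent access.

```python
import json
import time
import uuid

class Tracer:
    """Sketch of session-scoped execution trace records."""
    def __init__(self):
        self._records = {}  # session_id -> list of structured records

    def record(self, session_id, task_id, observation, tool_calls=()):
        rec = {
            "record_id": uuid.uuid4().hex,
            "session_id": session_id,
            "task_id": task_id,
            "timestamp": time.time(),
            "observation": observation,
            "tool_calls": list(tool_calls),
        }
        self._records.setdefault(session_id, []).append(rec)
        return rec["record_id"]

    def query(self, session_id, task_id=None):
        """Retrieve a session's records, optionally filtered by task."""
        recs = self._records.get(session_id, [])
        return [r for r in recs if task_id is None or r["task_id"] == task_id]

    def save(self, path):
        """Persist all traces as JSON for post-execution analysis."""
        with open(path, "w") as f:
            json.dump(self._records, f)
```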
### C.2 Core Protocols
The TEA Protocol defines three core context protocols: the Tool Context Protocol (TCP), the Environment Context Protocol (ECP), and the Agent Context Protocol (ACP). These protocols share a unified architectural design, each implemented through two core components: a context manager and a server. The context manager serves as the central orchestrator, responsible for context engineering (maintaining contextual information and relationships between components), lifecycle management (handling component registration, versioning, state tracking, and resource allocation), and semantic retrieval (enabling efficient component discovery through vector embeddings). The server component encapsulates the context manager and exposes a unified interface, providing operations for component registration, retrieval, execution, version management, and lifecycle control to other system modules. Each protocol generates a unified contract document (similar to Anthropic’s Agent Skills (Anthropic, 2025)) that aggregates all registered components’ descriptions, providing a comprehensive overview of available tools, environments, and agents with their capabilities, parameters, and usage guidelines. This architectural pattern ensures consistent access patterns across tools, environments, and agents while maintaining separation of concerns between internal management logic and external service interfaces.
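The semantic retrieval step shared by all three context managers can be illustrated with a toy similarity search. The paper's implementation stores LLM embeddings in a FAISS vector database; this sketch substitutes a bag-of-words vector and cosine similarity so it stays self-contained, and `retrieve` is an illustrative name.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, components: dict, top_k: int = 1):
    """Rank registered components by similarity between the query and each
    component's description, as the context managers do for discovery."""
    scored = sorted(components,
                    key=lambda n: cosine(embed(query), embed(components[n])),
                    reverse=True)
    return scored[:top_k]
```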
#### C.2.1 Tool Context Protocol
MCP (Anthropic, 2024b) is the most widely adopted tool protocol and is defined by three components: tools, prompts, and resources, corresponding respectively to model-controlled functions, user-initiated interactive templates, and client-managed data. However, despite its widespread adoption, MCP suffers from several fundamental limitations that hinder its effectiveness in complex multi-agent systems (see Table 6). First, MCP lacks context management capabilities, meaning that tool execution environments cannot be adaptively provided to agents, constraining the system’s ability to maintain coherent context across tool invocations. Second, MCP provides no version management system, preventing tools from evolving over time while maintaining backward compatibility and version history. Third, MCP lacks component lifecycle management, meaning that tools cannot be dynamically registered, updated, or retired with proper lifecycle control, limiting the system’s ability to manage tool resources effectively.
To address these limitations, we propose the Tool Context Protocol (TCP), a comprehensive framework that fundamentally extends MCP’s capabilities through several key innovations. TCP is implemented through two core components: the ToolContextManager and the TCPServer. The ToolContextManager serves as the central orchestrator for tool lifecycle management, supporting tool loading from both local registries (via the TOOL registry system) and persistent JSON configurations, enabling seamless integration of tools across different deployment scenarios. During tool registration, TCP automatically generates multiple representation formats for each tool: function-calling schemas for LLM function calling interfaces, natural language text descriptions for human-readable documentation, and structured argument schemas (Pydantic BaseModel types) for type-safe parameter validation, providing LLMs with rich semantic information for accurate parameter inference. TCP incorporates comprehensive version management, maintaining complete version history for each tool and supporting version restoration, enabling tools to evolve while preserving backward compatibility. The protocol employs a semantic retrieval mechanism that stores each tool’s description and metadata as vector embeddings using FAISS, enabling efficient similarity-based tool discovery through query–embedding comparisons. Additionally, TCP generates tool contracts that aggregate all registered tools into unified documentation, facilitating tool discovery and usage. The TCPServer provides a unified API interface that encapsulates the ToolContextManager, exposing operations for tool registration, retrieval, execution, version management, and lifecycle control, ensuring consistent tool access patterns across the system.
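The multi-format representations TCP generates at registration time can be approximated from a plain function signature, as below. The actual protocol derives structured argument schemas from Pydantic models; `tool_representations` and the `web_search` stub are hypothetical names for this sketch.

```python
import inspect

def tool_representations(fn):
    """Build three views of a tool: a function-calling schema, a text
    description, and an argument schema (a plain dict here, rather than
    the Pydantic BaseModel used in the real system)."""
    sig = inspect.signature(fn)
    py_to_json = {int: "integer", float: "number", str: "string", bool: "boolean"}
    props = {
        name: {"type": py_to_json.get(p.annotation, "string")}
        for name, p in sig.parameters.items()
    }
    fc_schema = {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object",
                       "properties": props,
                       "required": list(props)},
    }
    text = f"{fn.__name__}{sig}: {(fn.__doc__ or '').strip()}"
    return fc_schema, text, props

def web_search(query: str, top_k: int = 5):
    """Search the web and return the top results."""
    ...
```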
#### C.2.2 Environment Context Protocol
In reinforcement learning, frameworks such as Gym (Brockman et al., 2016) provide standardized interfaces for training and testing environments, where each environment specifies its own observation and action spaces. The core abstraction of an environment consists of two fundamental components: observation (the current state of the environment, accessible through state queries) and action (operations that agents can perform to interact with and modify the environment state). However, most existing research on general-purpose agent systems either focuses on single environments or relies on ad-hoc adaptations to independent environments, seldom addressing the need for unified environment interfaces. Recent attempts to encapsulate environments as MCP tools allow agents to interact with them, but this approach lacks mechanisms to capture inter-tool dependencies and to manage the contextual execution environments required by tools.
To overcome these limitations, we introduce the Environment Context Protocol (ECP), a comprehensive framework that establishes unified interfaces and contextual management across diverse computational environments. ECP follows a similar architecture to TCP, implemented through two core components: the EnvironmentContextManager and the ECPServer. At its core, ECP recognizes that each environment provides a set of actions that agents can invoke, where each action represents an operation that agents can perform to interact with the environment. Each environment maintains its own state (observation) accessible through state queries, while actions provide the means for agents to interact with and modify this state. Similar to TCP, ECP supports environment loading from both local registries and persistent configurations, automatically discovers and registers all actions defined within each environment, and incorporates comprehensive version management, semantic retrieval mechanisms, and contract generation. The key distinction is that ECP manages environments (which encapsulate observation and action spaces) rather than standalone tools, enabling agents to interact with computational environments through standardized action interfaces while maintaining environment state coherence.
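ECP's automatic action discovery can be sketched by collecting an environment's action methods under a naming convention; the `action_` prefix and the toy `BrowserEnv` are assumptions made for this example, not the protocol's actual discovery mechanism.

```python
class BrowserEnv:
    """Toy environment: a state (observation) plus action methods."""
    def __init__(self):
        self.url = "about:blank"

    def get_state(self):
        return {"url": self.url}

    def action_navigate(self, url: str):
        self.url = url
        return self.get_state()

    def action_go_back(self):
        self.url = "about:blank"
        return self.get_state()

def discover_actions(env):
    """ECP-style action discovery: collect callables by naming convention,
    yielding the environment's action space as a name -> callable mapping."""
    return {
        name[len("action_"):]: getattr(env, name)
        for name in dir(env) if name.startswith("action_")
    }
```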
#### C.2.3 Agent Context Protocol
Existing agent frameworks or protocols, such as A2A (Google, 2025), typically rely on ad-hoc strategies for defining and managing agents, where each agent is associated with specific roles, capabilities, and policies. However, despite their utility, such systems suffer from several fundamental limitations that hinder their effectiveness in complex multi-agent systems (see Table 6). First, existing frameworks lack standardized representations of agent attributes, making it difficult to systematically capture and reason about agents’ roles, competencies, and objectives, leading to poor interoperability across different agent implementations. Second, existing approaches provide insufficient means to capture and formalize inter-agent interactions, such as delegation, collaboration, or hierarchical organization, limiting the system’s ability to support structured multi-agent coordination patterns. Third, existing frameworks fail to explicitly encode the contextual relationships between agents and the environments or tools they operate with, thereby complicating consistent state maintenance and coordination in multi-agent scenarios.
To overcome these shortcomings, we introduce the Agent Context Protocol (ACP), which establishes a unified schema for registering, representing, and coordinating agents within the TEA Protocol. ACP follows a similar architecture to TCP and ECP, implemented through two core components: the AgentContextManager and the ACPServer. Similar to TCP and ECP, ACP supports agent loading from both local registries and persistent configurations, and incorporates comprehensive version management, semantic retrieval mechanisms, and contract generation. The key distinction is that ACP manages agents (autonomous components with reasoning capabilities) rather than tools or environments, enabling agents to be registered, orchestrated, and coordinated through standardized interfaces. ACP establishes a unified schema for representing agents through semantically enriched metadata that captures agents’ roles, competencies, and objectives. The protocol formalizes the modeling of inter-agent dynamics, allowing for cooperative, competitive, and hierarchical configurations through structured relationship representations. ACP enables persistent state tracking across tasks and sessions, ensuring continuity and context preservation in multi-agent interactions. By embedding contextualized descriptions of agents and their interactions, ACP facilitates flexible orchestration, adaptive collaboration, and systematic integration with TCP and ECP, laying the groundwork for scalable and extensible multi-agent architectures.
### C.3 Protocol Transformations
While TCP, ECP, and ACP provide independent specifications for tools, environments, and agents, practical deployment requires interoperability across these protocols. Thus, communication mechanisms and well-defined transformation pathways are indispensable for enabling components to assume alternative roles and exchange contextual information in a principled manner. For instance, when an agent must operate as a tool within a larger workflow, an explicit agent-to-tool transformation becomes necessary. More generally, we identify six fundamental categories of protocol transformations: Agent-to-Tool (A2T), Environment-to-Tool (E2T), Agent-to-Environment (A2E), Tool-to-Environment (T2E), Tool-to-Agent (T2A), and Environment-to-Agent (E2A). Together, these transformations constitute the foundation for dynamic role reconfiguration, enabling computational components to flexibly adapt their functional scope in response to task requirements and system constraints. This design not only ensures seamless interoperability across heterogeneous contexts but also enhances the adaptability and scalability of multi-entity systems.
- Agent-to-Tool (A2T). The A2T transformation encapsulates an agent’s capabilities and reasoning into a standardized tool interface, preserving contextual awareness while enabling seamless integration with existing tool ecosystems. For example, it can instantiate a deep researcher workflow that first generates queries, then extracts insights, and finally produces summaries, thereby providing a general-purpose tool for internet-scale retrieval tasks.
- Tool-to-Agent (T2A). The T2A transformation designates tools as the operational actuators of an agent, mapping the agent’s goals or policies into parameterized tool invocations. In this view, the agent reasons at a higher level while delegating concrete execution steps to tools, ensuring alignment between the agent’s decision space and the tool’s functional constraints. For example, a data analysis agent may employ SQL tools to query structured databases, or a design agent may invoke image editing tools to implement creative modifications. This separation allows agents to focus on strategic reasoning while relying on tools as reliable execution mechanisms.
- Environment-to-Tool (E2T). The E2T transformation converts environment-specific actions and capabilities into standardized tool interfaces, enabling agents to interact with environments through consistent tool calls. It maintains environment state coherence and exposes contextual information about available actions, allowing agents to operate across heterogeneous environments without bespoke adaptations. For example, in a browser environment, actions such as Navigate, GoBack, and Click can be consolidated into a context-aware toolkit that is directly accessible to agents.
- Tool-to-Environment (T2E). The T2E transformation elevates a collection of tools into an environment abstraction, where individual tool functions are treated as actions within a coherent action space governed by shared state and contextual rules. This conversion allows agents to interact with toolkits not merely as isolated functions but as structured environments, thereby supporting sequential decision-making, context preservation, and adaptive control. For example, a software development toolkit comprising tools for code editing, compilation, and debugging can be encapsulated as a programming environment, enabling agents to plan and execute development tasks while maintaining consistent state across tool invocations.
- Agent-to-Environment (A2E). The A2E transformation encapsulates an agent as an interactive environment, exposing its decision rules, behaviors, and state dynamics as an operational context for other agents. This conversion enables agents to function not only as autonomous components but also as adaptable environments in which other agents can act, thereby supporting multi-agent training, hierarchical control, and interactive simulations. For example, in a multi-agent simulation, a market agent can be represented as an environment that provides trading rules and dynamic market responses, allowing other agents to engage in transactions and learn adaptive strategies. Similarly, in human-in-the-loop interaction, a human agent can be modeled as an environment, enabling artificial agents to interpret user feedback and constraints as contextual signals for decision-making.
- Environment-to-Agent (E2A). The E2A transformation embeds reasoning and adaptive decision-making into the state dynamics and contextual rules of an environment, thereby elevating it into an autonomous agent. In this way, the environment is no longer a passive setting for action execution but becomes an active participant capable of initiating behaviors, coordinating with other agents, and enforcing constraints. For example, in adversarial gaming scenarios, an environment that originally only defines the state and action spaces can be transformed into an opponent agent that not only formulates strategies and responds proactively to player actions but also dynamically adjusts difficulty and interaction patterns, providing a more challenging training and evaluation platform. This transformation expands the functional role of environments within agent systems and offers a more dynamic and realistic testbed for multi-agent cooperation and competition research.
These six transformation categories establish a comprehensive framework for dynamic resource orchestration within the TEA Protocol. By enabling seamless transitions between tools, environments, and agents, the protocol transformations support adaptive architectures that reconfigure functional components in response to task requirements and contextual constraints.
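As one example of these transformations, an A2T wrapper can expose an agent's `run` method behind a tool-style callable carrying the name and description metadata TCP expects; `agent_to_tool` and `EchoAgent` are illustrative names, not part of the protocol's actual code.

```python
def agent_to_tool(agent, name=None, description=None):
    """A2T sketch: wrap an agent behind a tool-style callable so it can be
    registered and invoked through a tool interface."""
    def tool(task: str):
        return agent.run(task)
    tool.__name__ = name or f"{agent.__class__.__name__}_tool"
    tool.__doc__ = description or f"Delegates the task to agent {tool.__name__}."
    return tool

class EchoAgent:
    """Stand-in agent whose 'reasoning' just acknowledges the task."""
    def run(self, task):
        return f"done: {task}"
```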
### C.4 Self-Evolution Module
The Self-Evolution Module addresses the growing need for agent evolution capabilities in modern AI systems. This module enables agents to continuously improve their performance by optimizing various components during task execution. The module wraps evolvable components as evolvable variables, including: prompts that guide agent behavior and reasoning; tool code that implements agent capabilities; agent code that defines agent architectures and decision-making logic; memory code that manages information storage and retrieval; environment code that defines interaction spaces; and agent execution solutions that represent successful task completion strategies. The module employs two key algorithms for optimization: TextGrad (Yuksekgonul et al., 2025) provides gradient-based optimization for text-based components, enabling fine-grained improvements through iterative refinement; and self-reflection enables agents to analyze their own performance, identify weaknesses, and propose improvements. When components are optimized during task execution, the optimized versions are automatically registered as new versions through the version manager, ensuring that subsequent tasks can leverage the improved components while maintaining access to previous versions for rollback and comparison. This self-evolution capability enables agents to adapt and improve over time, learning from experience and continuously refining their capabilities without manual intervention.
TextGrad. TextGrad (Yuksekgonul et al., 2025) treats a target component (e.g., a prompt template or a code snippet) as an optimizable variable and uses feedback from execution to drive iterative updates. In our setting, the feedback signal can be defined from task outcomes and trace data, such as success or failure, constraint violations, tool error messages, intermediate correctness checks, and any available scalar scores. Given a current variable state, the system first runs the component in a controlled setting and collects a run trace via the tracer. It then constructs a differentiable-style supervision signal by prompting an LLM to attribute errors to specific spans of the variable and to produce gradient-like edit directions. The optimizer applies the suggested edits to obtain an updated variable, reruns a lightweight validation on held-out traces or the current task, and keeps the update only if it improves the chosen criteria. This loop repeats for a small number of iterations, after which the final variant is registered as a new component version with its lineage and associated trace.
Self-reflection. Self-reflection treats agent-associated components as optimizable variables and improves them through structured critique and revision rather than gradient-style updates. Concretely, after a run, the system summarizes the trace into a compact diagnosis that highlights failure points, missing information, incorrect assumptions, or unsafe actions, and then selects which variables to optimize based on their causal contribution to the observed failures. A reflection prompt then guides the model to propose targeted changes to the selected variables, such as rewriting a prompt instruction, refining a tool description or schema, adjusting a planning heuristic, or generating a patch to a tool implementation. Candidate changes are evaluated through re-execution under the same environment boundaries and constraints, using the tracer to verify that the revised component improves task outcomes and does not introduce new violations. Accepted changes are committed as new versions with rollback support, enabling future runs to select improved variants while preserving historical baselines.
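Both optimizers follow the same accept-if-improved skeleton, sketched below with plain callables standing in for the LLM-driven diagnosis (`evaluate`) and revision (`propose`) steps; in the framework, accepted candidates would additionally be committed as new versions through the version manager.

```python
def reflect_and_revise(variable, evaluate, propose, max_iters=3):
    """Generic critique-and-revise loop: propose a revision, score it, and
    keep it only if it strictly improves on the current best variable."""
    best, best_score = variable, evaluate(variable)
    for _ in range(max_iters):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:  # reject regressions, keep improvements
            best, best_score = candidate, score
    return best, best_score
```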
### C.5 Formalization
In this subsection, we present a formal definition of the TEA protocol and its basic properties.
**Definition 1 (TEA Protocol)**
*Let $T,E,A$ denote the sets of tools, environments, and agents; let TCP/ECP/ACP be the context protocols defined in this appendix; and let $M$ denote the set of basic managers, including the model manager, prompt manager, memory manager, dynamic manager, version manager, and tracer, which provide foundational services for the protocol. The TEA Protocol is defined as the tuple
$$
TEA = ⟨TCP, ECP, ACP, M, P_{TEA}⟩,
$$
where $P_{TEA}$ is a family of typed transformations over $T∪E∪A$
$$
\{A2T, E2T, T2E, T2A, A2E, E2A\} ⊆ P_{TEA}
$$
that satisfy: (i) interface consistency (exposed I/O signatures remain well-typed under the target protocol), and (ii) closure/compositionality (the composition of valid transformations is again an element of $P_{TEA}$ whenever domains and codomains match).*
**Definition 2 (Tool)**
*A tool is defined as a tuple
$$
T=⟨ n_T,d_T,m_T,g_T,φ_T⟩,
$$
where $n_T$ is the tool name, $d_T$ is the description, $m_T$ is the metadata dictionary, $g_T∈\{True,False\}$ indicates whether the tool supports self-evolution (i.e., whether its code can be optimized during task execution), and $φ_T:I_T→O_T$ is the functional mapping from input space $I_T$ to output space $O_T$ that implements the tool’s behavior.*
**Definition 3 (Tool Configuration)**
*A tool configuration is defined as
$$
ToolConfig=⟨ T,v_T,C_T,F_T⟩,
$$
where $T=⟨ n_T,d_T,m_T,g_T,φ_T⟩$ is the tool definition, $v_T$ is the version string, $C_T$ is the source code string, and $F_T=\{F_{fc,T},F_{text,T},F_{schema,T}\}$ is the set of tool representations (function-calling schema, natural language text, and structured argument schema).*
**Definition 4 (Tool Context Protocol (TCP))**
*We formalize TCP as the tuple
$$
TCP=⟨T,C,S,I⟩,
$$
where:
- $T$ is the set of registered tools, each $T∈T$ defined as $⟨ n_T,d_T,m_T,g_T,φ_T⟩$ and associated with a $ToolConfig$ that maintains version history $H_T:V\rightharpoonup ToolConfig$ (a partial function mapping version strings to configurations).
- $C$ is the tool context manager that maintains state and implements all core functionalities: (i) state mappings $ρ:S\rightharpoonup ToolConfig$ (active registry) and $η:S×V\rightharpoonup ToolConfig$ (version history), (ii) embedding service $ξ:(d_T,m_T)→ℝ^d$ with semantic retrieval via vector database, and (iii) lifecycle operations including loading from registries and code, building instances, version management, and contract generation.
- $S$ is the TCP server that encapsulates $C$ and exposes a unified interface, delegating all operations to the context manager while providing consistent access patterns.
- $I$ is the set of interfaces exposed by $S$:
- $\mathtt{init}$ - initialize tools from registry and code, build instances, initialize vector database
- $\mathtt{register}$ - create instance, build ToolConfig, store in registry
- $\mathtt{get}$ - get tool instance by name from active registry
- $\mathtt{info}$ - get tool configuration by name from active registry
- $\mathtt{retrieve}$ - retrieve similar tools via semantic search using vector database
- $\mathtt{list}$ - list all registered tool names
- $\mathtt{update}$ - update existing tool with new implementation, generate new version
- $\mathtt{copy}$ - duplicate existing tool with optional new name and version
- $\mathtt{unregister}$ - remove tool from active registry and version history
- $\mathtt{restore}$ - restore specific historical version of tool by name and version
- $\mathtt{vars}$ - extract tool source code as Variable objects for self-evolution
- $\mathtt{setvars}$ - update tool code variables for self-evolution, generate new version
- $\mathtt{invoke}$ - execute tool by name with structured input, return ToolResponse
- $\mathtt{contract}$ - generate unified documentation by aggregating all tools’ descriptions
- $\mathtt{save}$ - serialize tool configurations and version history to JSON file
- $\mathtt{load}$ - deserialize tool configurations and version history from JSON file
Given a request $r=(\mathtt{tool\_name},\mathtt{tool\_args})$, $S$ delegates to $C$, which uses $\mathtt{get}$ to obtain the tool instance from $ρ$ using $\mathtt{tool\_name}$, and then invokes it with $\mathtt{tool\_args}$ via the $\mathtt{invoke}$ operation, returning a ToolResponse with execution results.*
Note. TCP explicitly supports the TEA transformations A2T via an exposure operator $ι_A:A↦T$ and E2T via a lifting operator $Λ:E↦(S_E,K_E)$.
**Definition 5 (Environment)**
*An environment is defined as a tuple
$$
E=⟨ n_E,d_E,m_E,g_E,A_E,σ_E,τ_E⟩,
$$
where $n_E$ is the environment name, $d_E$ is the description, $m_E$ is the metadata dictionary, $g_E∈\{True,False\}$ indicates whether the environment supports self-evolution, $A_E$ is the action space (a dictionary mapping action names to action configurations), $σ_E:⊥→S_E$ is the state retrieval function that returns the current state $S_E$ of the environment, and $τ_E:S×D→O_a$ is the action execution function that takes an action name $s∈S$ and an input dictionary $d∈D$ and returns the action result in $O_a$.*
**Definition 6 (Environment Configuration)**
*An environment configuration is defined as
$$
EnvironmentConfig=⟨ E,v_E,C_E,A_E,R_E⟩,
$$
where $E=⟨ n_E,d_E,m_E,g_E,A_E,σ_E,τ_E⟩$ is the environment definition, $v_E$ is the version string, $C_E$ is the source code string, $A_E$ is the action space (dictionary of action configurations with multi-format representations), and $R_E$ is the rules string (generated environment rules for interaction).*
**Definition 7 (Environment Context Protocol (ECP))**
*We formalize ECP as the tuple
$$
ECP=⟨E,C,S,I⟩,
$$
where:
- $E$ is the set of registered environments, each $E∈E$ defined as $⟨ n_E,d_E,m_E,g_E,A_E,σ_E,τ_E⟩$ and associated with an $EnvironmentConfig$ that maintains version history $H_E:V\rightharpoonup EnvironmentConfig$ (a partial function mapping version strings to configurations).
- $C$ is the environment context manager that maintains state and implements all core functionalities: (i) state mappings $ρ:S\rightharpoonup EnvironmentConfig$ (active registry) and $η:S×V\rightharpoonup EnvironmentConfig$ (version history), (ii) embedding service $ξ:(d_E,m_E,A_E)→ℝ^d$ with semantic retrieval via vector database, and (iii) lifecycle operations including loading from registries and code, building instances, action discovery, version management, and contract generation.
- $S$ is the ECP server that encapsulates $C$ and exposes a unified interface, delegating all operations to the context manager while providing consistent access patterns.
- $I$ is the set of interfaces exposed by $S$:
- $\mathtt{init}$ - initialize environments from registry and code, build instances, initialize vector database
- $\mathtt{register}$ - create instance, discover actions, build EnvironmentConfig, store in registry
- $\mathtt{get}$ - get environment instance by name from active registry
- $\mathtt{info}$ - get environment configuration by name from active registry
- $\mathtt{state}$ - get current state of environment by name via get_state method
- $\mathtt{retrieve}$ - retrieve similar environments via semantic search using vector database
- $\mathtt{list}$ - list all registered environment names
- $\mathtt{update}$ - update existing environment with new implementation, generate new version
- $\mathtt{copy}$ - duplicate existing environment with optional new name and version
- $\mathtt{unregister}$ - remove environment from active registry and version history
- $\mathtt{restore}$ - restore specific historical version of environment by name and version
- $\mathtt{vars}$ - extract environment source code as Variable objects for self-evolution
- $\mathtt{setvars}$ - update environment code variables for self-evolution, generate new version
- $\mathtt{invoke}$ - execute environment action by name and action name with structured input
- $\mathtt{contract}$ - generate unified documentation by aggregating all environments’ rules
- $\mathtt{save}$ - serialize environment configurations and version history to JSON file
- $\mathtt{load}$ - deserialize environment configurations and version history from JSON file
Given a request $r=(\mathtt{env\_name},\mathtt{action\_name},\mathtt{action\_args})$, $S$ delegates to $C$, which uses $\mathtt{get}$ to obtain the environment instance from $ρ$ using $\mathtt{env\_name}$, and then invokes the action with $\mathtt{action\_name}$ and $\mathtt{action\_args}$ via the $\mathtt{invoke}$ operation, returning an action result.*
Note. ECP explicitly supports the TEA transformations A2E via an encapsulation operator $Ω_A:A↦\widehat{E}$ that presents an agent as an interactive environment, and T2E via an abstraction operator $Γ:(S,K)↦\widehat{E}$ that consolidates a toolkit into an environment abstraction.
**Definition 8 (Agent)**
*An agent is defined as a tuple
$$
A=⟨ n_A,d_A,m_A,g_A⟩,
$$
where $n_A$ is the agent name, $d_A$ is the description, $m_A$ is the metadata dictionary, and $g_A∈\{True,False\}$ indicates whether the agent supports self-evolution.*
**Definition 9 (Agent Configuration)**
*An agent configuration is defined as
$$
AgentConfig=⟨ A,v_A,C_A,F_A⟩,
$$
where $A=⟨ n_A,d_A,m_A,g_A⟩$ is the agent definition, $v_A$ is the version string, $C_A$ is the source code string, and $F_A$ is the set of agent representations (function-calling schemas, natural language descriptions, and Pydantic BaseModel argument schemas).*
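Definitions 8 and 9 map directly onto plain data structures. The sketch below uses standard-library dataclasses as a Pydantic-free stand-in; all field names are illustrative renderings of $n_A, d_A, m_A, g_A$ and $v_A, C_A, F_A$.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass(frozen=True)
class Agent:
    name: str                                               # n_A
    description: str                                        # d_A
    metadata: Dict[str, Any] = field(default_factory=dict)  # m_A
    self_evolving: bool = False                             # g_A

@dataclass(frozen=True)
class AgentConfig:
    agent: Agent
    version: str                                            # v_A
    source_code: str                                        # C_A
    representations: Dict[str, Any] = field(default_factory=dict)  # F_A

cfg = AgentConfig(
    agent=Agent(name="reporter", description="writes cited reports"),
    version="1.0.0",
    source_code="class Reporter: ...",
    representations={"function_calling": {"name": "reporter", "parameters": {}}},
)
```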
**Definition 10 (Agent Context Protocol (ACP))**
*We formalize ACP as the tuple
$$
ACP=⟨A,C,S,I⟩,
$$
where:
- $A$ is the set of registered agents, each $A∈A$ defined as $⟨ n_A,d_A,m_A,g_A⟩$ and associated with an $AgentConfig$ that maintains version history $H_A:V\rightharpoonup AgentConfig$ (a partial function mapping version strings to configurations).
- $C$ is the agent context manager that maintains state and implements all core functionalities: (i) state mappings $ρ:S\rightharpoonup AgentConfig$ (active registry) and $η:S×V\rightharpoonup AgentConfig$ (version history), (ii) embedding service $ξ:(d_A,m_A)→ℝ^d$ with semantic retrieval via vector database, and (iii) lifecycle operations including loading from registries and code, building instances, version management, and contract generation.
- $S$ is the ACP server that encapsulates $C$ and exposes a unified interface, delegating all operations to the context manager while providing consistent access patterns.
- $I$ is the set of interfaces exposed by $S$ :
- $\mathtt{init}$ - initialize agents from registry and code, build instances, initialize vector database
- $\mathtt{register}$ - create instance, build AgentConfig, store in registry
- $\mathtt{get}$ - get agent instance by name from active registry
- $\mathtt{info}$ - get agent configuration by name from active registry
- $\mathtt{retrieve}$ - retrieve similar agents via semantic search using vector database
- $\mathtt{list}$ - list all registered agent names
- $\mathtt{update}$ - update existing agent with new implementation, generate new version
- $\mathtt{copy}$ - duplicate existing agent with optional new name and version
- $\mathtt{unregister}$ - remove agent from active registry and version history
- $\mathtt{restore}$ - restore specific historical version of agent by name and version
- $\mathtt{vars}$ - extract agent source code as Variable objects for self-evolution
- $\mathtt{setvars}$ - update agent code variables for self-evolution, generate new version
- $\mathtt{invoke}$ - execute agent method by name with structured input, return agent response
- $\mathtt{contract}$ - generate unified documentation by aggregating all agents’ descriptions
- $\mathtt{save}$ - serialize agent configurations and version history to JSON file
- $\mathtt{load}$ - deserialize agent configurations and version history from JSON file
Given a request $r=(\mathtt{agent\_name},\mathtt{input})$ , $S$ delegates to $C$ , which uses $\mathtt{get}$ to obtain the agent instance from $ρ$ using $\mathtt{agent\_name}$ , and then invokes it with $\mathtt{input}$ via the $\mathtt{invoke}$ operation, returning an agent response.*
Note. ACP explicitly supports the TEA transformations T2A via a designation operator $κ_T:T↦\widehat{A}$ and E2A via an elevation operator $Ψ_E:\widehat{E}↦\widehat{A}$ that embeds reasoning/decision capabilities into an environment to obtain an agent abstraction.
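The $\mathtt{register}$/$\mathtt{update}$/$\mathtt{restore}$ interplay over the active registry $ρ$ and version history $η$ can be sketched as follows. Class and method names are hypothetical, and configurations are reduced to strings; the real protocol versions full `AgentConfig` objects.

```python
from typing import Dict, Tuple

class VersionedRegistry:
    """Active registry rho plus version history eta, with update and restore."""
    def __init__(self) -> None:
        self._active: Dict[str, str] = {}               # rho: name -> current config
        self._history: Dict[Tuple[str, str], str] = {}  # eta: (name, version) -> config
        self._counters: Dict[str, int] = {}

    def register(self, name: str, config: str) -> str:
        version = self._bump(name)
        self._active[name] = config
        self._history[(name, version)] = config
        return version

    # An update simply records the new implementation under a new version.
    update = register

    def restore(self, name: str, version: str) -> None:
        # Roll the active registry back to a historical version.
        self._active[name] = self._history[(name, version)]

    def get(self, name: str) -> str:
        return self._active[name]

    def _bump(self, name: str) -> str:
        self._counters[name] = self._counters.get(name, 0) + 1
        return f"v{self._counters[name]}"

reg = VersionedRegistry()
reg.register("researcher", "impl-1")  # stored as v1
reg.update("researcher", "impl-2")    # stored as v2, now active
reg.restore("researcher", "v1")       # roll back: impl-1 active again
```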
## Appendix D The AgentOrchestra Implemented by TEA Protocol
AgentOrchestra is a concrete instantiation and implementation of the TEA Protocol, demonstrating how the protocol’s core principles and transformations can be applied to build a practical hierarchical multi-agent system. This section first introduces the fundamental design principles that govern agent behavior and interaction within the framework, including the definitions of key components such as agents, environments, models, memory, observations, and actions. We then present the specific agents and tools that constitute AgentOrchestra, including the planning agent for task decomposition and coordination, the deep researcher agent for comprehensive information gathering, the deep analyzer agent for complex reasoning tasks, the browser use agent for automated web interaction, the tool generator agent for intelligent tool evolution and management, and the reporter agent for automated report generation and citation management.
### D.1 Agent Design Principles
Agent. An agent is an autonomous computational entity that perceives and interprets the environment, maintains a history of actions and observations, and flexibly generates actions to accomplish a wide variety of user-specified tasks across diverse domains. Within the TEA Protocol framework, agents are managed through the ACP, which provides standardized registration, representation, and coordination mechanisms.
Environment. The environment represents the external context and resources within which the agent operates, providing the interface for action execution and information access. Within the TEA Protocol framework, environments are managed through the ECP, which provides unified inputs, outputs, and environment rules across multiple environments.
Model. LLMs are the core drivers of this framework, providing the reasoning and decision-making capabilities for agents. Within the TEA Protocol framework, models are managed through the Infrastructure Layer, which provides a unified interface for diverse LLMs. This design enables agents to dynamically select and switch between different LLMs during task execution, aligning each model’s unique strengths with specific requirements.
Memory. Memory serves as a fundamental component of the agent, persistently recording the complete history of agent execution. Within the TEA Protocol framework, memory is managed through the Infrastructure Layer as a workflow agent that operates based on sessions, automatically recording agent execution paths across multiple tasks. This memory system automatically determines when to summarize and extract task insights to assist in task completion.
Observation. An observation primarily consists of the task description, attached files, the agent’s execution history, the environment state, and the set of available tools and sub-agents, providing the agent with a comprehensive view of the ongoing process.
Action. In our framework, actions are managed under the Tool Context Protocol (TCP) and executed through a set of pre-defined tools Wang et al. (2024b); Liang et al. (2025); Roucher et al. (2025) exposed via function-calling interfaces OpenAI (2023); Anthropic (2024b). Actions are not equivalent to tools. A single tool can support multiple actions by accepting different parameters. For example, a planning tool may support create, update and delete through a unified interface.
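The tool-versus-action distinction can be made concrete with the planning-tool example: one function-calling interface, several actions selected by a parameter. The class below is a hypothetical sketch, not the framework's actual planning tool.

```python
from typing import Dict, List, Optional

class PlanningTool:
    """One tool, several actions: a single interface dispatches on `action`."""
    def __init__(self) -> None:
        self._plans: Dict[str, List[str]] = {}

    def __call__(self, action: str, plan_id: str,
                 steps: Optional[List[str]] = None) -> str:
        if action == "create":
            self._plans[plan_id] = list(steps or [])
            return f"created {plan_id}"
        if action == "update":
            self._plans[plan_id] = list(steps or [])
            return f"updated {plan_id}"
        if action == "delete":
            del self._plans[plan_id]
            return f"deleted {plan_id}"
        raise ValueError(f"unknown action: {action}")

tool = PlanningTool()
tool(action="create", plan_id="p1", steps=["research", "draft"])
tool(action="update", plan_id="p1", steps=["research", "draft", "review"])
```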
An agent operates in a perception–interpretation–action cycle. It observes the environment and stores information in memory, interprets context through the unified LLM interface, and determines an action. The action is executed in a sandbox, with results recorded back to memory to refine subsequent reasoning and adaptation. This loop continues until the objectives are achieved or a termination condition is met.
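The cycle above reduces to a small control loop. The sketch below is a generic rendering with hypothetical callables standing in for perception, the LLM, and the sandbox; it is not the framework's actual executor.

```python
from typing import Callable, List, Tuple

def agent_loop(
    observe: Callable[[], str],
    decide: Callable[[str, List[Tuple[str, str]]], str],
    act: Callable[[str], str],
    done: Callable[[str], bool],
    max_steps: int = 10,
) -> List[Tuple[str, str]]:
    """Perception-interpretation-action cycle: observe, decide using memory,
    execute in the sandbox, record the result back to memory, until a
    termination condition or the step limit is reached."""
    memory: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        obs = observe()                  # perceive environment + history
        action = decide(obs, memory)     # interpret context with the model
        result = act(action)             # execute in the sandbox
        memory.append((action, result))  # record back to memory
        if done(result):
            break
    return memory

# Toy run: "increment" until the sandbox reports 3.
state = {"n": 0}
def act(_action: str) -> str:
    state["n"] += 1
    return str(state["n"])

trace = agent_loop(
    observe=lambda: f"n={state['n']}",
    decide=lambda obs, mem: "increment",
    act=act,
    done=lambda r: r == "3",
)
```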
### D.2 Planning Agent
The planning agent serves as the central orchestrator in our hierarchical framework, dedicated to high-level reasoning, task decomposition, and adaptive planning. The planning agent utilizes structured thinking and unified invocation mechanisms to coordinate specialized sub-agents and tools for complex task completion. As illustrated in Figure 4, the planning agent implements a systematic iterative workflow that integrates structured reasoning, context management, and execution coordination with continuous monitoring and adaptive adjustments.
<details>
<summary>x6.png Details</summary>

### Visual Description
The figure depicts the planning agent as a container with three modules. A Todo module exposes plan-management actions (add, update, clear, complete) and feeds an Execute module that creates, updates, and manages plans while tracking execution states. A central Pipeline interprets the user task, decomposes it into manageable sub-tasks, and assigns them to specialized sub-agents and tools, with a feedback loop between assignment and the executing resources. Objective shifts route back to task interpretation to update plans, and unexpected errors feed back to the sub-agents and tools for reassignment or recovery.
</details>
Figure 4: Planning Agent Workflow.
Structured Reasoning. The planning agent employs a structured thinking framework that guides each execution step, capturing reasoning processes, evaluation of previous goals, memory insights, next objectives, and tool/agent selections. This structured approach ensures systematic reasoning, explicit progress tracking, and transparent decision-making. The agent dynamically builds a unified interface that combines sub-agents from ACP and tools from TCP (including those transformed from environments via E2T and from agents via A2T), enabling seamless coordination of both specialized agents and domain-specific tools within a single execution framework.
Pipeline Workflow. The planning agent implements a systematic pipeline for task processing and execution that can be conceptually divided into four main stages. The pipeline begins with task interpretation, where the agent analyzes incoming user requests to extract objectives, constraints, and contextual requirements. This is followed by task decomposition, wherein complex objectives are systematically broken down into smaller, executable sub-tasks that can be processed by specialized components. The third stage involves resource allocation, where sub-tasks are strategically assigned to appropriate specialized agents or tools based on their domain expertise and functional capabilities. Finally, the execution and coordination stage manages the task execution, incorporating continuous feedback mechanisms that enable dynamic plan adjustments and inter-agent coordination throughout the process. The implementation incorporates session management for maintaining context across multiple interactions, memory storage and retrieval systems for learning from past experiences, and execution tracking for observability and debugging.
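The four stages can be sketched as a single function. The keyword-based routing below is a deliberately simplified, hypothetical stand-in for the LLM-driven interpretation and allocation decisions; agent names are illustrative.

```python
from typing import Dict, List, Tuple

def run_pipeline(task: str, experts: Dict[str, str]) -> List[str]:
    """Planner sketch: interpret, decompose, allocate, then execute."""
    objective = task.strip().lower()                           # 1. task interpretation
    subtasks = [s.strip() for s in objective.split(" then ")]  # 2. task decomposition
    assignments: List[Tuple[str, str]] = [                     # 3. resource allocation
        (sub, next((agent for kw, agent in experts.items() if kw in sub), "planner"))
        for sub in subtasks
    ]
    results = []                                               # 4. execution & coordination
    for sub, agent in assignments:
        results.append(f"{agent} completed '{sub}'")
    return results

results = run_pipeline(
    "search the topic then analyze the data",
    experts={"search": "deep_researcher", "analyze": "deep_analyzer"},
)
```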
Adaptive Planning and Error Handling. The planning agent incorporates robust mechanisms for handling dynamic changes and unexpected situations. When objective shifts occur, the system updates plans accordingly, triggering a return to the task interpretation phase to reassess and modify the approach. Similarly, when unexpected errors arise during execution, the agent re-evaluates the task and adjusts the plan to address the issues. This adaptive capability ensures that the system can maintain progress even when encountering unforeseen challenges or changing requirements.
The planning agent’s design emphasizes modularity and scalability, interacting with sub-agents through the ACP and utilizing tools from the TCP, thereby concealing domain-specific details and facilitating the integration of new agent types and resources. This architecture enables the agent to maintain a global perspective throughout the execution process, aggregating feedback from sub-agents and monitoring progress toward the overall objective, while performing dynamic plan updates in real-time in response to intermediate results, unexpected challenges, or shifting user requirements.
### D.3 Deep Researcher Agent
The deep researcher agent is a specialized component designed for comprehensive information gathering through multi-round research workflows with multimodal capabilities. As illustrated in Figure 5, the agent implements a systematic pipeline workflow for research execution that begins with task analysis and query generation, followed by multi-engine web search across various platforms, result evaluation and completeness assessment, and iterative refinement through follow-up queries until comprehensive information is gathered. The agent leverages the Reporter Agent (detailed in Section D.7) to generate structured research reports with proper citations and references.
<details>
<summary>x7.png Details</summary>

### Visual Description
The figure depicts the deep researcher agent with a Search module listing the supported engines (Baidu, Bing, Firecrawl, Google) and an Execute module that iteratively gathers insights across multiple rounds and summarizes the final search results. A central Pipeline flows from the task through Analyze (generate a fitting query), Search (query-based web search), Insight (extract insights about the task), and Summarize; a Check Results step then either terminates the loop or starts the next round with an updated query.
</details>
Figure 5: Deep Researcher Agent Workflow.
Search Engines. The deep researcher agent integrates multiple search engines to ensure comprehensive coverage and information diversity. The system supports six primary search engines: Baidu for Chinese-language content; Bing, Brave, and DuckDuckGo Search for general web search; Firecrawl for web crawling and full-page content extraction; and Google for global search. Additionally, the agent can utilize specialized LLM-based search models for enhanced information retrieval. This multi-engine approach enables the agent to access diverse information sources and overcome the limitations of individual search platforms, ensuring robust retrieval across different domains and languages.
Pipeline Workflow. The core pipeline implements a systematic multi-stage process for research execution. The workflow begins with task analysis and query generation, where the agent generates optimized search queries based on the research objectives, contextual requirements, and previous search history. This initial analysis transforms vague research requests into specific, actionable search queries that can effectively target relevant information sources. This is followed by parallel web search, wherein the agent performs targeted searches across multiple engines and LLM-based search models simultaneously using the generated queries. The multi-engine approach is essential because different search platforms have varying coverage, indexing strategies, and content biases, ensuring comprehensive information retrieval while mitigating the limitations of individual search engines. The third stage involves result merging and evaluation, where the agent consolidates search results from multiple sources and evaluates whether the gathered information provides a complete answer to the research task. This evaluation step is necessary because it determines whether additional research rounds are needed or if sufficient information has been collected. Finally, the report generation stage uses the Reporter Agent to consolidate all research rounds into a structured markdown report with proper citations and references, and generates a comprehensive summary from the final report content.
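The parallel-search-and-merge stage can be sketched with a thread pool. Engine callables here are toy stand-ins, and the order-preserving deduplication is one plausible merge policy rather than the paper's exact one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def parallel_search(query: str,
                    engines: Dict[str, Callable[[str], List[str]]]) -> List[str]:
    """Query every engine concurrently, then merge deduplicated results,
    preserving first-seen order across engines."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        batches = list(pool.map(lambda fn: fn(query), engines.values()))
    merged: List[str] = []
    seen = set()
    for batch in batches:
        for item in batch:
            if item not in seen:  # drop duplicates returned by multiple engines
                seen.add(item)
                merged.append(item)
    return merged

results = parallel_search(
    "agent protocols",
    {"google": lambda q: ["a", "b"], "bing": lambda q: ["b", "c"]},
)
```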
Iterative Research Process. The deep researcher agent incorporates a sophisticated iterative mechanism for comprehensive research. After each round’s evaluation, the system checks whether the gathered information provides a complete answer. When additional research is required, the agent enters the next round, where it updates and refines search queries based on previous findings and identified knowledge gaps. Each round’s content, including queries, search results, and evaluations, is systematically added to the Reporter Agent, which maintains proper citation tracking throughout the research process. This iterative process continues until a complete answer is found or predefined research limits (maximum rounds) are reached. Upon completion, the Reporter Agent generates a final structured report with all citations properly numbered and referenced, ensuring not only comprehensive coverage of complex research topics but also proper source attribution and balanced control over exploration depth, efficiency, and resource consumption.
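The round structure above reduces to a bounded refinement loop: refine the query from prior findings, search, assess completeness, and stop at a complete answer or the round limit. The callables below are hypothetical stand-ins for the agent's LLM-driven steps.

```python
from typing import Callable, List

def research(
    task: str,
    make_query: Callable[[str, List[str]], str],
    search: Callable[[str], str],
    is_complete: Callable[[List[str]], bool],
    max_rounds: int = 3,
) -> List[str]:
    """Iterative research sketch: update the query each round from previous
    findings, stop when evidence is judged complete or rounds run out."""
    findings: List[str] = []
    for _ in range(max_rounds):
        query = make_query(task, findings)  # refine from previous rounds
        findings.append(search(query))      # gather this round's results
        if is_complete(findings):           # completeness assessment
            break
    return findings

# Toy run: the second round yields a complete answer.
answers = iter(["partial result", "complete result"])
rounds = research(
    "topic",
    make_query=lambda task, found: f"{task} (round {len(found) + 1})",
    search=lambda query: next(answers),
    is_complete=lambda found: "complete" in found[-1],
)
```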
The deep researcher agent’s design emphasizes adaptability and comprehensiveness, enabling it to handle diverse research tasks ranging from factual inquiries to complex analytical investigations. The multimodal support allows the agent to process both textual and visual information simultaneously, while the iterative workflow ensures that research quality improves through multiple rounds of refinement and validation. The integration with the Reporter Agent ensures that all research findings are properly documented with citations, making the research process transparent and verifiable.
### D.4 Deep Analyzer Agent
The deep analyzer agent is a specialized component designed for complex reasoning tasks involving diverse data sources through a workflow-oriented approach with multimodal data support. As illustrated in Figure 6, the agent implements a systematic pipeline workflow for complex reasoning and analysis that begins with file classification and validation, followed by overall file summary assessment, type-specific analysis strategies, and iterative multi-round refinement until answers are found or analysis limits are reached. The agent leverages the Reporter Agent (detailed in Section D.7) to generate structured analysis reports with proper citations and references.
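The first stage, file classification and type-specific routing, can be sketched as a dispatch table over file extensions. Handlers and their outputs are illustrative placeholders; the real agent converts each file kind to markdown before analysis.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical type-specific strategies keyed by file extension.
HANDLERS: Dict[str, Callable[[Path], str]] = {
    ".png": lambda p: f"caption image {p.name}",
    ".mp3": lambda p: f"transcribe audio {p.name}",
    ".txt": lambda p: f"read text {p.name}",
    ".zip": lambda p: f"read zip contents {p.name}",
}

def classify_and_route(path: Path) -> str:
    """Stage 1 of the analyzer pipeline: validate the file by extension
    and dispatch it to the matching analysis strategy."""
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return handler(path)

note = classify_and_route(Path("report.txt"))
```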
<details>
<summary>x8.png Details</summary>

### Visual Description
The figure depicts the deep analyzer agent with a Mdify module that normalizes input files into markdown (captioning images, transcribing audio, reading text files, and reading zip contents) and an Execute module that generates an enhanced task from the input text and files and summarizes the final analysis results. A central Pipeline flows from the task and files through Mdify (convert files to markdown), Enhance (get the enhanced task), Insight (extract insights about the task), and Summarize; a Check Results step then feeds the next round with the task and accumulated summaries.
In essence, this is a blueprint for an AI that can take a messy, multimodal problem, break it down, analyze it from multiple angles, and refine its understanding over multiple cycles to produce a comprehensive summary. The friendly cat icon and clean design belie a complex, powerful analytical engine.
</details>
Figure 6: Deep Analyzer Agent Workflow.
File Classification and Support. The deep analyzer agent supports a comprehensive range of file formats, including text files, PDFs, images, audio, video, and compressed archives, and handles both local files and URLs. The system employs LLM-based file type classification to determine the appropriate analysis strategy for each file. For URLs, the system automatically detects file types based on URL patterns and extensions, while for local files, it uses both LLM classification and extension-based fallback mechanisms. This classification stage is essential because different file types require different analysis approaches: text and PDF files benefit from chunk-based markdown analysis, images and audio require direct multimodal LLM analysis, and videos may need both direct analysis and transcript-based processing.
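As a concrete illustration of the extension-based fallback, a minimal classifier might look like the following. The extension map and function name are hypothetical; the real system layers LLM-based classification on top of this kind of fallback:

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical extension map; the actual agent combines LLM classification
# with an extension-based fallback of this kind.
EXTENSION_TYPES = {
    ".txt": "text", ".md": "text", ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".zip": "archive",
}

def classify_by_extension(source: str) -> str:
    """Fallback classifier: infer the file type from the extension,
    handling both local paths and URLs (query strings stripped)."""
    if source.startswith(("http://", "https://")):
        path = urlparse(source).path  # drop query string and fragment
    else:
        path = source
    return EXTENSION_TYPES.get(Path(path).suffix.lower(), "unknown")
```

A URL such as `https://example.com/report.pdf?dl=1` would classify as `pdf` even with the query string attached, which is why the URL path is parsed out first.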
Pipeline Workflow. The core pipeline implements a systematic multi-stage process for complex reasoning and analysis. The workflow begins with file validation and classification, where the system validates file accessibility and size constraints, then classifies each file by type (text, PDF, image, audio, video) to determine appropriate analysis strategies. This is followed by overall file summary, where the agent generates a preliminary summary based on file metadata (sizes, types, timestamps) to determine if the task can be answered from file information alone, enabling early termination when sufficient information is available. The third stage involves type-specific analysis, where the agent processes each file according to its type: text files are converted to markdown and analyzed in chunks; PDF files first attempt direct LLM analysis, then fall back to markdown conversion and chunk-based analysis if needed; images first attempt direct multimodal LLM analysis, then proceed to multi-step analysis if the answer is not found; audio files are analyzed directly through multimodal LLM; and video files first attempt direct LLM analysis, then convert to markdown transcripts for chunk-based analysis if needed. Each analysis step checks whether the answer has been found, enabling early stopping when sufficient information is obtained. Finally, the report generation stage uses the Reporter Agent to consolidate all analysis rounds into a structured markdown report with proper citations and references, and generates comprehensive summaries from the final report content.
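The type-specific fallback chains and early-stopping behavior described above can be sketched as follows. The chain table and the `run_step` callback are hypothetical stand-ins for the real analysis calls:

```python
# Hypothetical fallback chains mirroring the type-specific strategies:
# each entry lists analysis steps tried in order until one yields an answer.
FALLBACK_CHAINS = {
    "text":  ["markdown_chunks"],
    "pdf":   ["direct_llm", "markdown_chunks"],
    "image": ["direct_multimodal", "multi_step"],
    "audio": ["direct_multimodal"],
    "video": ["direct_llm", "transcript_chunks"],
}

def analyze_file(file_type, run_step):
    """Try each strategy in the chain for this file type; stop early
    once an answer is found. `run_step(strategy)` stands in for the
    real analysis call and returns an answer string or None."""
    for strategy in FALLBACK_CHAINS.get(file_type, []):
        answer = run_step(strategy)
        if answer is not None:
            return answer  # early stopping: answer found
    return None
```

For a PDF, this tries direct LLM analysis first and only falls back to markdown conversion and chunk-based analysis when no answer is found, matching the order described above.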
Iterative Multi-Round Analysis. The deep analyzer agent incorporates a sophisticated iterative mechanism for comprehensive analysis refinement. The system executes multiple analysis rounds, with each round processing all files according to their classified types. After each round, the system synthesizes summaries from all file analyses and evaluates whether a complete answer has been found. When additional analysis is required, the agent enters the next round, where it continues processing files with refined strategies based on previous findings. Each round’s content, including file classifications, analysis results, and answer evaluations, is systematically added to the Reporter Agent, which maintains proper citation tracking throughout the analysis process. This iterative process continues until a complete answer is found or predefined analysis limits (maximum rounds) are reached. Upon completion, the Reporter Agent generates a final structured report with all citations properly numbered and referenced, ensuring not only comprehensive coverage of complex reasoning tasks but also proper source attribution and balanced control over analysis depth, efficiency, and resource consumption.
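A minimal sketch of the round loop, assuming hypothetical `analyze_round` and `answer_complete` callbacks for the per-round file processing and the completeness evaluation:

```python
def run_analysis(files, analyze_round, answer_complete, max_rounds=3):
    """Hypothetical driver for the multi-round loop: each round processes
    all files (informed by prior summaries), the round summary is
    accumulated, and the loop stops early once the evaluator judges the
    answer complete or the round budget is exhausted."""
    summaries = []
    for round_no in range(1, max_rounds + 1):
        round_summary = analyze_round(files, summaries)  # refined by prior findings
        summaries.append(round_summary)
        if answer_complete(summaries):
            break  # complete answer found before hitting max_rounds
    return summaries
```

The `max_rounds` budget corresponds to the predefined analysis limits mentioned above, balancing analysis depth against resource consumption.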
Task-Only Analysis. When no files are provided, the deep analyzer agent can directly analyze tasks such as text games, math problems, logic puzzles, or reasoning challenges. The system performs multi-round analysis where each round applies step-by-step reasoning, breaks down task components, identifies key information and constraints, and generates insights and partial solutions. This capability enables the agent to handle diverse reasoning tasks that do not require file-based information, making it a versatile tool for both file-based and file-free analysis scenarios.
The deep analyzer agent’s design emphasizes workflow-oriented processing and multimodal data support, enabling it to handle diverse reasoning tasks ranging from document analysis to complex multi-step problem solving. The adaptive file type handling ensures optimal analysis strategies for each data source, while the iterative workflow allows analysis quality to improve through multiple rounds of refinement and validation. The integration with the Reporter Agent ensures that all analysis findings are properly documented with citations, making the analysis process transparent and verifiable.
### D.5 Browser Use Agent
The browser use agent is a specialized component designed for automated web interaction and task completion through the browser_use library. As illustrated in Figure 7, the agent implements a systematic workflow for web interaction and task execution that begins with task initialization and report setup, followed by browser agent execution with intelligent web navigation and interaction, result extraction and evaluation, and comprehensive report generation with execution records. The agent leverages the Reporter Agent (detailed in Section D.7) to generate structured browser task reports with proper documentation.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Diagram: Browser Use Agent Workflow
### Overview
The image is a technical flowchart illustrating the architecture and operational pipeline of a "Browser Use Agent." It depicts a system designed to automate web-based tasks through a cyclical process of planning, execution, evaluation, and state recording. The diagram is divided into two primary sections: "Browser & Computer" (detailing available actions) and "Pipeline" (outlining the step-by-step workflow). A feedback loop connects the end of the pipeline back to the beginning for iterative task completion.
### Components/Axes
The diagram is structured with the following labeled components and their spatial relationships:
1. **Header/Title:** "Browser Use Agent" is centered at the top of the main dotted-line container.
2. **Icon:** A stylized, cartoon cat face is positioned in the top-left corner, outside the main container.
3. **Main Container:** A large, light-green dotted rectangle encloses the core system components.
4. **Left Input:** A vertical, rounded rectangle labeled "Task" is positioned to the left of the main container. A yellow arrow points from it into the "Prepare" step of the pipeline.
5. **Section 1: Browser & Computer (Top-Left Quadrant):**
* **Title:** "Browser&Computer" in a white box.
* **Actions Sub-section:** A gray box labeled "Actions" (vertical text) containing a 2x2 grid of action types:
* `goto` (pink background): "Go to the URL"
* `scroll` (red background): "Scroll down or up"
* `input` (purple background): "Input a text"
* `click` (orange background): "Click a button or position"
* **Execute Sub-section:** A blue box to the right of the Actions grid, connected by a black arrow. It contains two bullet points:
* "Iteratively generate, execute, and summarize actions"
* "Generate next goal until task completion"
6. **Section 2: Pipeline (Bottom Half):**
* **Title:** "Pipeline" in a white box.
* **Process Flow:** A horizontal sequence of five rounded rectangles connected by black arrows:
1. **Prepare:** "prepare browser environment" (with an hourglass icon).
2. **Generate:** "generate next actions list" (with a right-arrow icon).
3. **Execute:** "execute the actions list" (with a mouse cursor icon).
4. **Evaluate:** "check the answer" (with a checkmark icon).
5. **Record:** "record execution state" (with a floppy disk/save icon).
* **Feedback Loop:** A yellow arrow originates from the "Record" step, curves downward, and points to a box labeled "Next Step (Update Next Goal)." This box is connected via an ampersand (`&`) to another box labeled "Check Results." A final yellow arrow leads from "Check Results" back to the "Prepare" step, completing the cycle.
### Detailed Analysis
The diagram explicitly defines the agent's capabilities and process:
* **Action Vocabulary:** The agent can perform four fundamental browser/computer interactions: navigation (`goto`), scrolling (`scroll`), text entry (`input`), and clicking (`click`). Each action is color-coded for visual distinction.
* **Execution Philosophy:** The "Execute" box clarifies that the process is iterative. The agent doesn't just run a pre-set list; it generates, executes, and summarizes actions in a loop, creating new goals until the overarching task is complete.
* **Pipeline Stages:**
1. **Prepare:** Initializes or resets the browser environment.
2. **Generate:** Creates a list of specific actions (using the defined vocabulary) to attempt next.
3. **Execute:** Carries out the generated action list.
4. **Evaluate:** Assesses the outcome of the actions ("check the answer").
5. **Record:** Saves the state of the execution for logging or future reference.
* **Iterative Cycle:** The workflow is not linear. After recording, the system enters a "Next Step" phase where it updates its goal based on the results. The "Check Results" step feeds this information back into the "Prepare" stage, restarting the pipeline for the next iteration. This creates a continuous loop of action and adaptation.
### Key Observations
* **Visual Hierarchy:** The "Pipeline" is the central, most detailed component, indicating it is the core operational sequence. The "Browser & Computer" section serves as a reference for the tools available to the pipeline.
* **Color Coding:** Colors are used functionally: yellow for the primary task input and feedback loop flow, distinct colors for each action type, and blue for the high-level execution philosophy.
* **Iconography:** Simple icons (hourglass, arrow, cursor, checkmark, floppy disk) provide immediate visual cues for each pipeline step's purpose.
* **Closed-Loop System:** The diagram emphasizes a self-contained, cyclical process. The agent receives a task, works through the pipeline, evaluates, records, and uses that information to inform the next cycle autonomously.
### Interpretation
This diagram represents the architecture of an autonomous web automation agent. It is designed to break down a high-level "Task" into a series of concrete browser interactions through a repeated cycle of planning and execution.
The system's intelligence lies in the **Generate** and **Evaluate** steps. It must translate a goal into specific `goto`, `click`, etc., commands and then interpret the results of those actions to decide what to do next. The **Record** step is crucial for maintaining context across iterations, allowing the agent to learn from or build upon previous attempts.
The workflow suggests a robust approach to handling dynamic web environments. Instead of a fragile, pre-scripted sequence, the agent operates in a state-aware loop: act, observe, reason, and act again. This makes it potentially capable of handling tasks where the exact steps aren't known in advance, such as navigating complex websites, filling out forms with conditional logic, or troubleshooting unexpected page states. The separation of the action vocabulary (`Browser & Computer`) from the decision-making pipeline (`Pipeline`) is a clean design that allows the core logic to remain consistent even if the set of available actions is expanded.
</details>
Figure 7: Browser Use Agent Workflow.
Browser Agent Integration. The browser use agent leverages the browser_use library, which provides an intelligent browser automation framework with LLM-driven decision-making capabilities. The system integrates ChatOpenAI as the underlying language model for both task planning and page content extraction, enabling the agent to understand web page structures, generate appropriate actions, and extract relevant information. The browser agent supports comprehensive web interactions including URL navigation, form filling, element clicking, scrolling, and content extraction. The integration addresses the complexity of modern web applications by providing semantic understanding of page content and intelligent action selection, enabling the agent to handle dynamic web pages, JavaScript-rendered content, and complex user interfaces that require contextual understanding.
Pipeline Workflow. The core pipeline implements a systematic multi-stage process for web interaction and task execution. The workflow begins with task initialization and report setup, where the agent initializes a Report instance to track the browser task execution, records the task description, and prepares for result documentation. This initialization stage is essential because it establishes a structured framework for capturing execution details, enabling comprehensive documentation and post-execution analysis. This is followed by browser agent execution, wherein the browser_use Agent is instantiated with the specified task and LLM configuration, then executes the task through intelligent web navigation and interaction. The browser agent operates with a maximum step limit (typically 50 steps) to ensure task completion within reasonable bounds, and employs sophisticated page understanding mechanisms to extract content and generate appropriate actions. During execution, the agent generates visual execution records (GIF animations) and conversation logs, providing detailed traces of the interaction process. The third stage involves result extraction, where the agent extracts the final results from the browser agent’s execution history. The system attempts multiple extraction strategies: first checking for extracted content summaries, then falling back to final results, and finally extracting from the last step’s action results if available. This multi-strategy approach ensures robust result extraction even when the browser agent’s output format varies. Finally, the report generation stage uses the Reporter Agent to consolidate the task description and execution results into a structured markdown report with proper formatting. The report includes the original task, execution results, and references to generated execution records (GIF files and logs), ensuring comprehensive documentation of the browser interaction process.
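The multi-strategy result extraction can be sketched as a simple fallback chain. The function and its arguments are hypothetical abstractions over the accessors that the browser_use execution history exposes:

```python
def extract_result(extracted_contents, final_result, last_action_results):
    """Hypothetical multi-strategy extraction mirroring the agent's
    fallback order: extracted content summaries first, then the final
    result, then the last step's action results, else empty."""
    if extracted_contents:
        return "\n".join(extracted_contents)
    if final_result:
        return final_result
    if last_action_results:
        return str(last_action_results[-1])
    return ""
```

Ordering the strategies from most to least structured makes the extraction robust to variation in the browser agent's output format, as described above.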
Concurrent Execution Support. The browser use agent incorporates robust mechanisms for handling concurrent task execution. Each browser task execution is assigned a unique call identifier (call_id), which is used to create isolated subdirectories for execution artifacts, preventing file conflicts when multiple browser tasks run simultaneously. The system generates unique paths for GIF animations, conversation logs, and report files based on the call_id, ensuring that concurrent executions do not interfere with each other. This concurrent execution support is essential for multi-agent scenarios where multiple browser tasks may be initiated simultaneously, enabling scalable and reliable browser automation in distributed agent systems.
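A minimal sketch of the per-call artifact isolation; the directory layout and file names here are hypothetical:

```python
import uuid
from pathlib import Path

def make_artifact_paths(base_dir, call_id=None):
    """Hypothetical per-call artifact layout: a unique call_id keys an
    isolated subdirectory so concurrent browser tasks never collide on
    GIF, log, or report paths."""
    call_id = call_id or uuid.uuid4().hex
    run_dir = Path(base_dir) / call_id
    return {
        "call_id": call_id,
        "gif": run_dir / "execution.gif",
        "log": run_dir / "conversation.log",
        "report": run_dir / "report.md",
    }
```

Because every invocation that omits `call_id` receives a fresh identifier, two tasks started simultaneously write to disjoint subdirectories by construction.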
Execution Record Generation. The browser use agent automatically generates comprehensive execution records during task execution. The system creates visual execution traces in GIF format, capturing the sequence of browser interactions and page states throughout the task execution. Additionally, the agent saves detailed conversation logs that record all LLM interactions, action decisions, and page content extractions. These execution records provide valuable debugging information, enable post-execution analysis, and support transparency in browser automation tasks. The records are organized in per-call subdirectories, making it easy to trace specific task executions and analyze browser interaction patterns.
The browser use agent’s design emphasizes intelligent web automation and comprehensive documentation, enabling it to handle diverse web-based tasks ranging from simple information retrieval to complex multi-step interactions. The integration with browser_use library provides sophisticated web understanding capabilities, while the Reporter Agent ensures that all browser interactions are properly documented with execution traces, making the automation process transparent and verifiable.
### D.6 Tool Generator Agent
The tool generator agent is a specialized component designed for intelligent tool evolution through automated creation, dynamic retrieval, and systematic reuse mechanisms under the TCP. As illustrated in Figure 8, the agent implements a systematic pipeline workflow for intelligent tool lifecycle management that begins with task analysis and tool retrieval, proceeds through tool creation and evaluation, and concludes with tool registration in TCP.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: Tool Generator Agent Workflow
### Overview
This image is a technical flowchart illustrating the architecture and workflow of a "Tool Generator Agent." The diagram depicts how the agent processes a user task by analyzing it, retrieving or generating appropriate tools, evaluating them, and either registering them for reuse or discarding them. The system includes a component called "TCP" (Tool Context & Parameters) that houses a set of reusable base tools.
### Components/Axes
The diagram is organized into several distinct regions and components:
1. **Main Container**: A large, dotted-line rectangle labeled "Tool Generator Agent" at the top center. A small, stylized cat icon is in the top-left corner of this container.
2. **Input**: A vertical, rounded rectangle on the far left labeled "Task". A yellow arrow points from this into the main agent workflow.
3. **TCP (Tool Context & Parameters) Block**: Located in the upper-left quadrant of the main agent. It contains a sub-section labeled "Tools" listing four base tools in colored boxes:
* `read` (pink box): "Read an URL"
* `query` (red box): "Query a wiki page"
* `code` (purple box): "Run a code"
* `locate` (orange box): "Locate a position"
4. **Pipeline Block**: Located below the TCP block. It contains a sequential flow of five rounded rectangles connected by black arrows:
* `Analyze` (with a target icon): "user tasks"
* `Retrieve` (with a magnifying glass icon): "TCP tools retrieval"
* `Create` (with a wrench icon): "generate tool code"
* `Evaluate` (with a speech bubble icon): "tool evaluation"
* `Discard` (pink box with an exclamation mark icon): "discard the tool"
5. **Execute Block**: A blue-outlined box in the upper-right quadrant. It contains two checklist items:
* "Get tool context information"
* "Get parameters and execute tool"
6. **Flow Arrows & Labels**:
* A **yellow arrow** labeled "Reuse" connects the TCP "Tools" block to the "Execute" block.
* A **black arrow** labeled "Register" connects the "Evaluate" step back to the TCP "Tools" block.
* A **yellow arrow** labeled "Yes" connects the "Retrieve" step back to the TCP "Tools" block.
* A **black arrow** labeled "No" connects the "Retrieve" step to the "Create" step.
### Detailed Analysis
The workflow proceeds as follows:
1. **Task Input**: An external "Task" initiates the process.
2. **Pipeline Execution**:
* **Analyze**: The agent first analyzes the user tasks.
* **Retrieve**: It then attempts to retrieve a suitable tool from the TCP repository.
* **Decision Point (Yes)**: If a suitable tool is found (`Yes`), the flow follows a yellow arrow back to the TCP block, indicating the tool is selected for reuse.
* **Decision Point (No)**: If no suitable tool is found (`No`), the flow proceeds to the "Create" step.
* **Create**: The agent generates new tool code.
* **Evaluate**: The newly created tool is evaluated.
* **Outcome (Register)**: If the evaluation is successful, a black arrow labeled "Register" sends the new tool back to the TCP block, adding it to the repository for future reuse.
* **Outcome (Discard)**: If the evaluation fails, the flow proceeds to the "Discard" step, and the tool is discarded.
3. **Tool Execution**: Tools from the TCP (either pre-existing or newly registered) are made available for execution via the "Execute" block, which handles retrieving context and parameters to run the tool.
### Key Observations
* **Dual Pathways**: The system has two primary pathways for tool acquisition: **reuse** of existing TCP tools and **creation** of new tools via the pipeline.
* **Feedback Loop**: The "Register" arrow creates a crucial feedback loop, allowing the agent to expand its own toolset over time based on successful creations.
* **Color Coding**: Colors are used functionally:
* **Pink/Red**: Associated with the "Discard" outcome and the "read"/"query" tools.
* **Blue**: Associated with the "Execute" function.
* **Yellow**: Highlights the "Reuse" pathway and the "Yes" decision.
* **Purple/Orange**: Used for other base tools ("code"/"locate").
* **Spatial Layout**: The TCP and Execute blocks are positioned at the top, representing the agent's "knowledge" and "action" capabilities. The Pipeline runs horizontally below them, representing the sequential processing logic.
### Interpretation
This diagram illustrates a self-improving, agentic system designed for autonomous tool use and generation. The core innovation is the **TCP repository**, which acts as a shared, growing library of capabilities. The agent doesn't just use tools; it can diagnose gaps in its abilities (during the "Retrieve" -> "No" path), create new tools to fill those gaps, and integrate successful tools back into its permanent repertoire.
The workflow suggests a design philosophy where efficiency (reusing tools) is prioritized, but adaptability (creating new tools) is built-in. The "Evaluate" step is critical for quality control, preventing the repository from being flooded with ineffective tools. The separate "Execute" block implies that tool execution is a distinct, perhaps more resource-intensive, process that draws from the curated TCP.
In essence, this is a blueprint for an AI agent that can dynamically expand its own skillset in response to novel tasks, moving beyond static pre-programmed functions towards more general problem-solving capabilities. The cat icon may be a whimsical branding element or a metaphor for an agile, curious agent.
</details>
Figure 8: Tool Generator Agent Workflow.
Pipeline Workflow. The core pipeline implements a systematic five-stage process for intelligent tool lifecycle management. The workflow begins with task analysis, where the agent analyzes task requirements and extracts tool specifications including tool name, description, parameter schema, and implementation plan. This is followed by tool retrieval, wherein the agent uses TCP’s semantic search to retrieve similar tools from the registry. If suitable existing tools are found, the agent evaluates their compatibility and returns the best match. The third stage involves tool creation, where the agent generates new tool implementations using LLM-based code generation when no suitable existing tools are found. The generated code follows the Tool base class pattern and includes proper error handling and logging. The fourth stage is tool evaluation, where the agent validates newly created tools by loading the tool class, checking for required attributes (name, description, __call__ method), and verifying structural correctness. Tools that fail evaluation are discarded, while successfully validated tools proceed to registration. Finally, the tool registration stage registers validated tools in TCP, which automatically handles version management, contract generation, and persistence to JSON manifests, making the tools immediately available to all agents through the unified TCP interface.
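The structural checks of the evaluation stage might be sketched as follows. The `validate_tool` helper is hypothetical; the real agent also loads the generated code and exercises it:

```python
def validate_tool(tool_cls):
    """Hypothetical structural validation mirroring the evaluation stage:
    the class must carry string `name` and `description` attributes and
    define `__call__` somewhere in its own hierarchy (excluding object)."""
    for attr in ("name", "description"):
        if not isinstance(getattr(tool_cls, attr, None), str):
            return False
    # vars() is checked per base class because getattr would also find
    # the metaclass's __call__, which every class trivially has.
    return any("__call__" in vars(base) for base in tool_cls.__mro__[:-1])
```

Tools failing any of these checks would be discarded before registration, keeping structurally broken generations out of the registry.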
TCP Integration. The tool generator agent leverages TCP to provide comprehensive tool management capabilities. Through TCP’s semantic retrieval mechanism, the agent can efficiently search for existing tools based on functional similarity, avoiding redundant tool creation. When new tools are generated, TCP’s registration process automatically handles version tracking, contract documentation, and persistence, ensuring that all tools are properly managed and accessible across the multi-agent system. This TCP-based approach enables seamless tool sharing and reuse, supporting both local tool execution and distributed tool access through standardized interfaces.
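A toy sketch of the retrieve-then-create pattern; the real TCP uses embedding-based semantic search, which is approximated here by token overlap on tool descriptions (all names are hypothetical):

```python
class ToolRegistry:
    """Minimal stand-in for TCP's retrieval interface."""

    def __init__(self):
        self._tools = {}  # tool name -> description

    def register(self, name, description):
        self._tools[name] = description

    def retrieve(self, query, threshold=0.3):
        """Return the best-matching tool name, or None if nothing clears
        the similarity threshold (signalling that creation is needed)."""
        q = set(query.lower().split())
        best_name, best_score = None, 0.0
        for name, desc in self._tools.items():
            d = set(desc.lower().split())
            score = len(q & d) / max(len(q | d), 1)  # Jaccard similarity
            if score > best_score:
                best_name, best_score = name, score
        return best_name if best_score >= threshold else None
```

A `None` return from `retrieve` corresponds to the "No" branch in Figure 8, where the pipeline proceeds to the creation stage rather than reusing an existing tool.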
The tool generator agent’s design emphasizes TCP-based tool management, enabling it to handle diverse tool requirements ranging from simple utility functions to complex domain-specific operations. The intelligent evolution process ensures that the tool ecosystem continuously adapts to emerging requirements through systematic creation, validation, and reuse mechanisms.
### D.7 Reporter Agent
The Reporter Agent is a specialized component designed for managing and generating structured markdown reports with proper citation and reference management. As illustrated in Figure 9, the agent implements a systematic workflow for report management that encompasses two primary phases: (i) the Pipeline for incremental content addition and processing, and (ii) the Report generation process with automated citation management. The agent is widely used by other agents (the deep researcher, deep analyzer, and browser use agents) to document their execution processes and findings.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Diagram: Reporter Agent Workflow
### Overview
This image is a technical workflow diagram illustrating the architecture and process flow of a "Reporter Agent." The diagram is structured as a flowchart with labeled components, directional arrows, and a cyclical process. It uses a combination of text, icons, and color-coded action buttons to describe a system for generating reports through iterative research and citation.
### Components/Axes
The diagram is organized into several distinct regions and components:
1. **Header/Title:** The text "Reporter Agent" is centered at the top of the main container.
2. **Mascot/Icon:** A small, stylized cat icon is positioned in the top-left corner, outside the main process box.
3. **Main Container:** A large, light-peach-colored rounded rectangle with a dotted border encloses the core agent components.
4. **Input:** A vertical, rounded rectangle labeled "Task" is positioned to the left of the main container. A yellow arrow points from it into the main container.
5. **Internal Sections:** The main container is divided into three primary sections:
* **Report:** Contains an "Actions" box.
* **Pipeline:** Contains a four-step linear process flow.
* **Execute:** A blue-outlined box on the right side.
6. **Feedback Loop:** Below the main container, two rounded rectangles ("Next Round (Task & Summaries)" and "Check Results") are connected by arrows, forming a loop back into the Pipeline.
### Detailed Analysis
**1. Report Section (Top-Left of Main Container):**
* **Label:** "Report"
* **Actions Box:** A white box labeled "Actions" on its left side. It contains four color-coded action buttons:
* `add` (pink background): "Add content"
* `cite` (red background): "Add citations"
* `ref` (purple background): "Extract ref"
* `export` (orange background): "Generate report"
* A black arrow points from this Actions box to the "Execute" box.
**2. Execute Section (Top-Right of Main Container):**
* **Label:** "Execute" (in blue text).
* **Content:** A blue-outlined box containing two checklist items:
* (Checkmark icon) "Add content to list"
* (Checkmark icon) "If check completed, generate report"
**3. Pipeline Section (Center of Main Container):**
* **Label:** "Pipeline"
* **Process Flow:** A linear sequence of four rounded white boxes connected by black arrows:
1. **Add:** Icon of a document with a magnifying glass. Text: "Add" / "add research contexts".
2. **Extract:** Icon of a bar chart. Text: "Extract" / "extract references".
3. **cite:** Icon of a document with a citation mark. Text: "cite" / "cite the references".
4. **Export:** Icon of an upward arrow from a tray. Text: "Export" / "generate report".
* A yellow arrow from the "Task" input points into the "Add" step.
* A yellow arrow from the "Next Round" box also points into the "Add" step.
**4. Feedback Loop (Bottom):**
* **Check Results:** A white box to the right. An arrow from the "Export" step in the Pipeline points down to it.
* **Next Round (Task & Summaries):** A white box to the left. An arrow points from "Check Results" to this box.
* **Loop Connection:** An arrow points from "Next Round" back up to the "Add" step in the Pipeline, completing the cycle. An ampersand (`&`) symbol is placed between the two boxes.
### Key Observations
* **Iterative Process:** The diagram explicitly shows a cyclical workflow. The output from "Export" feeds into "Check Results," which then informs the "Next Round," which re-enters the pipeline at the "Add" stage. This suggests the agent operates in multiple rounds or iterations.
* **Action-Execution Separation:** The "Actions" (add, cite, ref, export) are defined separately from the "Execute" logic, which contains the conditional rule ("If check completed, generate report").
* **Color Coding:** Action buttons in the "Actions" box are color-coded (pink, red, purple, orange), but these colors are not reused elsewhere in the diagram's main flow.
* **Input/Output:** The primary external input is "Task." The final output is the "generate report" action, which occurs conditionally within the "Execute" box.
### Interpretation
This diagram models an autonomous or semi-autonomous AI agent designed for research and report generation. The workflow suggests a Peircean investigative cycle:
1. **Abduction (Hypothesis Generation):** The process begins with a "Task" and the "Add research contexts" step, where the agent gathers initial information or forms hypotheses.
2. **Deduction (Extraction & Structuring):** The "Extract references" and "cite the references" steps involve gathering evidence and structuring it with proper citations, moving from raw data to organized information.
3. **Induction (Verification & Iteration):** The "Export" step produces a draft or result, which is then subjected to "Check Results." This verification step is crucial. The loop back via "Next Round" indicates that the agent iteratively refines its output based on checks, new task parameters, or summaries from previous rounds. This mimics a scientific or scholarly process of revision.
4. **Action & Conclusion:** The final "generate report" action in the "Execute" box is conditional ("If check completed"), implying the agent only finalizes its output after passing an internal validation checkpoint.
The separation of the "Actions" library from the "Pipeline" execution flow indicates a modular design where the agent's capabilities (adding, citing, etc.) are defined separately from the specific sequence in which they are applied for a given task. The entire system is geared towards producing well-researched, cited reports through a self-correcting, iterative loop.
</details>
Figure 9: Reporter Agent Workflow.
Pipeline Workflow. The Reporter Agent supports incremental content addition through the add action, which accepts content from multiple sources including text strings, dictionaries, and file paths. When content is added, the agent employs LLM-based extraction to automatically identify and structure three key components: (i) content, the main text preserving all citation markers in markdown link format [1](url), [2](url), etc.; (ii) summary, a concise 2-3 sentence summary of the content; and (iii) references, a list of reference items with IDs, descriptions, and URLs extracted from citations in the content. This automatic extraction ensures that citations are properly captured and linked to their sources, enabling systematic reference management throughout the report generation process.
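The add-time extraction described above can be sketched as follows. The real agent uses LLM-based extraction; the regex and the `ContentItem`/`add` names here are illustrative stand-ins for that pipeline, not the actual implementation:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    """One unit of added content with its extracted citation metadata."""
    content: str                                     # main text, citation markers preserved
    summary: str = ""                                # short summary (LLM-generated in the real agent)
    references: list = field(default_factory=list)   # [(id, url), ...]

# Matches markdown citation markers of the form [1](url).
CITATION_RE = re.compile(r"\[(\d+)\]\((\S+?)\)")

def add(content: str) -> ContentItem:
    """Stand-in for the agent's `add` action: split content into the
    content/summary/references triple. A regex replaces the LLM step."""
    refs = [(int(n), url) for n, url in CITATION_RE.findall(content)]
    return ContentItem(content=content, summary=content[:200], references=refs)

item = add("Kipchoge's record is 2:01:09 [1](https://en.wikipedia.org/wiki/Eliud_Kipchoge).")
```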
Reference Management and Deduplication. The Reporter Agent implements sophisticated reference management mechanisms to ensure citation consistency and accuracy. When the complete action is invoked, the agent performs comprehensive reference processing: (i) collection, gathering all references from all content items; (ii) deduplication, merging duplicate references based on normalized keys (URLs are prioritized over descriptions for deduplication); (iii) renumbering, creating a unified reference mapping that renumbers all citations sequentially from 1; and (iv) URL generation, automatically generating proper URLs for references (converting file paths to file:// URLs, preserving HTTP/HTTPS URLs, and extracting URLs from descriptions when needed). This reference management ensures that all citations in the final report are properly numbered, deduplicated, and linked to their sources.
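A minimal sketch of the collection, deduplication, renumbering, and URL-generation steps. The helper names (`dedup_key`, `to_url`, `finalize`) are hypothetical and stand in for the agent's internal implementation:

```python
def dedup_key(desc: str, url: str) -> str:
    """Deduplication key: URLs take priority over descriptions."""
    return (url or desc).strip().lower()

def to_url(url: str) -> str:
    """Keep http(s)/file URLs as-is; treat anything else as a local path."""
    if url.startswith(("http://", "https://", "file://")):
        return url
    return "file://" + url

def finalize(refs):
    """Merge duplicate references and renumber them sequentially from 1.

    `refs` is a list of (old_id, description, url) triples; returns the
    merged list plus an old-id -> new-id map for rewriting citations."""
    new_id_by_key, merged, renumber = {}, [], {}
    for old_id, desc, url in refs:
        key = dedup_key(desc, url)
        if key not in new_id_by_key:
            new_id_by_key[key] = len(merged) + 1
            merged.append((len(merged) + 1, desc, to_url(url)))
        renumber[old_id] = new_id_by_key[key]
    return merged, renumber

refs = [
    (1, "Moon article", "https://en.wikipedia.org/wiki/Moon"),
    (3, "Moon (duplicate)", "https://en.wikipedia.org/wiki/Moon"),
    (2, "local dataset", "/tmp/data.csv"),
]
final_refs, renumber = finalize(refs)
```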
Generation and Completion. The final report generation process consolidates all content items into a coherent, well-structured markdown document. The agent uses LLM-based generation to organize content logically, integrate summaries for smooth transitions, and maintain proper citation formatting throughout the report. The generated report includes a complete References section at the end, listing all references in numerical order with proper URLs and descriptions. The agent ensures that all citations maintain the markdown link format [number](url) and that file paths are converted to absolute paths for proper rendering in markdown viewers. The final report is written to the specified file path with file locking mechanisms to ensure concurrent safety when multiple processes access the same report.
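The consolidation step can be illustrated with a simplified renderer. The atomic-write helper below is a portable stand-in for the agent's actual file-locking mechanism, and all function names are hypothetical:

```python
import os
import tempfile

def render_report(title, bodies, references):
    """Assemble a markdown report with a numbered References section."""
    lines = [f"# {title}", ""]
    lines += bodies
    lines += ["", "## References"]
    for num, desc, url in references:
        lines.append(f"[{num}] [{desc}]({url})")
    return "\n".join(lines) + "\n"

def write_report(path, text):
    """Write via temp-file + rename: a simplification of the agent's
    file-locking scheme (os.replace is atomic on POSIX)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp, path)

report = render_report(
    "Findings",
    ["The perigee distance is covered in roughly 17 thousand hours [1](https://en.wikipedia.org/wiki/Moon)."],
    [(1, "Moon article", "https://en.wikipedia.org/wiki/Moon")],
)
```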
Integration with Other Tools. The Reporter Agent is designed to be seamlessly integrated with other tools through a unified interface. Tools such as deep researcher, deep analyzer, and browser use the Reporter Agent to document their execution processes, with each tool adding content items incrementally and completing the report when execution finishes. The agent supports per-call caching and locking mechanisms, enabling multiple concurrent report generations without conflicts. This integration ensures that all tools can generate comprehensive, properly cited reports that document their findings and execution traces, making the entire system’s operations transparent and verifiable.
The Reporter Agent’s design emphasizes automatic citation management and structured report generation, enabling other tools to produce professional, well-documented reports without manual citation formatting. The LLM-based extraction and generation capabilities ensure that citations are properly identified, deduplicated, and formatted, while the reference management system guarantees consistency and accuracy across complex multi-source reports.
## Appendix E Detailed Analysis of Benchmark Results
### E.1 GAIA Benchmark
As shown in Figure 3 and Table 4, AgentOrchestra achieves state-of-the-art performance on GAIA (89.04% average) by mitigating the curse of dimensionality and the semantic drift that arise in large-scale agentic planning. We attribute this success to three complementary properties enabled by TEA and AgentOrchestra. First, hierarchical decoupling of the action space reduces planning complexity: hierarchical routing decomposes the global task into locally tractable sub-problems and assigns them to domain-appropriate sub-agents, preserving abstract reasoning under long horizons even amid low-level browser events. Second, TEA formalizes epistemic environment boundaries: GAIA’s multi-domain tasks require temporal and cross-modal state coherence, and baselines often degrade during domain transitions such as browser retrieval followed by local Python analysis. By treating environments as first-class managed components, TEA preserves and propagates session-critical state (e.g., authentication tokens, downloaded files, and transient file-system mutations) across agent boundaries, reducing contextual forgetting. Third, AgentOrchestra supports recursive refinement of reasoning trajectories and can invoke the Tool Generator to synthesize context-specific functionalities on demand, alleviating the fixed-capability bottleneck of static agent components.
Qualitatively, the Tool Generator is effective on tasks involving structured sources (e.g., Wikipedia or tabular data), where it can synthesize query wrappers and extraction utilities with clear I/O contracts. However, we observe limitations on fine-grained visual tasks (e.g., localizing specific colored digits or subtle UI elements), suggesting that tool synthesis alone cannot replace robust visual grounding and that tighter integration with vision-centric models remains necessary.
Across the train and test splits, the Tool Generator produced over 50 TEA-registered tools spanning multiple domains, and approximately 30% were reused in subsequent tasks. This reuse rate indicates a practical balance between specialization and generalization: the system expands capability coverage while retaining transferable utilities for recurring sub-problems.
On GAIA validation, AgentOrchestra achieves 92.45% on Level 1, 83.72% on Level 2, and 57.69% on Level 3 (82.42% average), consistently outperforming strong baselines such as AWorld (77.58%) and Langfun Agent (76.97%). Notably, the gap between Level 1 and Level 3 highlights the challenge of long-horizon, multi-domain tasks, where TEA-style environment state management and on-demand tool synthesis become increasingly important.
The key strength of AgentOrchestra lies in decomposing complex problems and flexibly assigning them to appropriate specialists. For example, in a Level 3 GAIA scenario that required extracting numerical data from an embedded table within a PDF and then performing multi-step calculations, the Planning Agent invoked the Browser Use Agent to locate and download the file, delegated parsing and verification to the Deep Analyzer, and then synthesized the final answer. When existing tools were inadequate, the Tool Generator created task-specific utilities (e.g., custom extractors for particular document layouts or scripts for bespoke computations), improving coverage and reliability. We note that frequent inter-agent exchanges can introduce latency and overhead; the system therefore minimizes unnecessary switching, and this overhead motivates future work on adaptive routing and resource selection for improved efficiency and scalability.
### E.2 SimpleQA Benchmark
As shown in Table 3, AgentOrchestra achieves state-of-the-art performance on SimpleQA with 95.3% accuracy, substantially outperforming monolithic LLM baselines (e.g., o3 at 49.4% and gemini-2.5-pro-preview-05-06 at 50.8%) and surpassing strong retrieval agents such as Perplexity Deep Research (93.9%). We attribute this improvement to systematic reduction of epistemic uncertainty through our hierarchical verification pipeline. Hallucinations in short-form factuality often arise from conflicting web-based evidence or reliance on internal parametric memory. AgentOrchestra mitigates these issues by enforcing cross-agent consensus: the Planning Agent orchestrates a retrieve-verify-synthesize cycle where the Deep Researcher performs multi-engine breadth-first searches while the Deep Analyzer evaluates evidence consistency across heterogeneous sources. By decoupling retrieval from analysis, the system prevents the confirmation bias inherent in single-agent architectures. Furthermore, the integration with the Reporter Agent ensures traceable attribution, grounding every factual claim in a re-verified source.
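The retrieve-verify-synthesize cycle can be reduced to a toy consensus check. The engines and quorum threshold below are illustrative stand-ins, not the system's actual verification logic:

```python
from collections import Counter

def retrieve(question, engines):
    """Stand-in for the Deep Researcher's multi-engine search:
    query every backend and collect candidate answers."""
    return [engine(question) for engine in engines]

def verify(candidates, quorum=2):
    """Deep-Analyzer-style consistency check: accept a candidate only
    if enough independent sources agree; otherwise signal uncertainty."""
    answer, count = Counter(candidates).most_common(1)[0]
    return answer if count >= quorum else None

# Toy engines standing in for real search backends.
engines = [lambda q: "Paris", lambda q: "Paris", lambda q: "Lyon"]
answer = verify(retrieve("Capital of France?", engines))
```

Decoupling `retrieve` from `verify` mirrors the paper's point that separating retrieval from analysis curbs single-agent confirmation bias.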
### E.3 HLE Benchmark
On HLE, AgentOrchestra achieves 37.46%, outperforming strong baselines including o3 (20.3%) and Perplexity Deep Research (21.1%). HLE demands synthesizing disparate, highly specialized knowledge beyond simple retrieval, so this gain highlights the framework’s capacity for long-horizon analytical reasoning in expert-level domains. We attribute it to hierarchical coordination and adaptive capability expansion: the Planning Agent maintains global objective coherence, prunes the hypothesis space via delegation to specialists, and assigns technical validation to agents such as the Deep Analyzer. As a result, the final solution is both analytically rigorous and cross-verified against multimodal evidence, yielding robust performance on challenging expert-level tasks.
### E.4 Ablation Studies and Efficiency Analysis
Sub-agent Contribution Analysis. As detailed in Table 4, we evaluate the incremental contribution of each specialized sub-agent (Planning, Researcher, Browser, Analyzer, and Tool Generator). The synergy between these components is most evident in network-dependent tasks. While the baseline Planning agent (P) achieves 36.54% accuracy, the integration of coarse-grained retrieval via the Deep Researcher (R) and fine-grained interaction via the Browser (B) elevates performance to 72.76%. This doubling of efficacy underscores the complementarity of high-level information gathering and low-level DOM manipulation. Furthermore, the Deep Analyzer (A) provides an 8.67% improvement by resolving complex multi-step reasoning within documents and media, while the Tool Generator (T) adds a final 12.61% boost. This last leap validates that static toolkits are insufficient for the diverse requirements of GAIA, necessitating on-demand tool synthesis to bridge functional gaps.
System Efficiency and Resource Consumption. We analyze the operational efficiency of AgentOrchestra across varying task complexities, measured by wall-clock latency and token throughput. Simple factual queries typically complete within 30 seconds (approx. 5k tokens), while medium-complexity research tasks average 3 minutes (approx. 25k tokens). High-complexity multimodal or long-horizon scenarios require approximately 10 minutes and 100k tokens. Compared to monolithic baselines, our hierarchical architecture optimizes resource allocation by invoking resource-intensive sub-agents only when necessary. This ensures that AgentOrchestra maintains operational costs comparable to commercial research agents while delivering significant performance gains on complex tasks.
Self-Evolution Effectiveness. To evaluate the TEA Protocol’s support for iterative optimization, we assess self-evolution mechanisms on mathematical reasoning benchmarks (GPQA-Diamond, AIME24/25). As summarized in Table 5, iterative refinement via gradient-based (TextGrad) and symbolic (self-reflection) approaches consistently improves reasoning performance. For instance, using gpt-4.1 with self-reflection boosts AIME25 accuracy from 20.00% to 33.34%, highlighting the benefits of recursive trajectory refinement. By leveraging execution feedback via TEA’s versioning and tracer mechanisms, the system identifies and corrects logical inconsistencies in its planning. Overall, this shifts reasoning from one-shot inference to a managed optimization process, enabling AgentOrchestra to evolve robust problem-solving strategies for frontier-level tasks.
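The symbolic self-reflection loop can be sketched as follows, with `solve` and `critique` standing in for LLM calls (hypothetical signatures); in the full system, TEA would additionally version each attempt so earlier solutions remain selectable for rollback:

```python
def self_reflect(solve, critique, task, max_rounds=3):
    """Iterative refinement: re-solve with accumulated critiques until
    the critic is satisfied or the round budget is exhausted."""
    feedback = []
    attempt = None
    for _ in range(max_rounds):
        attempt = solve(task, feedback)
        issues = critique(task, attempt)
        if not issues:           # critic satisfied: accept this version
            return attempt
        feedback.append(issues)  # carry the critique into the next round
    return attempt               # best effort after max_rounds

# Toy task: the "solver" fixes an off-by-one only once feedback arrives.
solve = lambda task, fb: task + (1 if not fb else 0)
critique = lambda task, ans: "" if ans == task else "off by one"
result = self_reflect(solve, critique, 41)
```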
## Appendix F Case Studies
In this section, we systematically present representative cases of AgentOrchestra, accompanied by critical analyses to elucidate the factors underlying these outcomes. We primarily showcase performance on the GAIA validation set, categorized by difficulty (Level 1, Level 2, and Level 3) and by data type (text, image, audio, video, spreadsheet, ZIP archive, and other file types).
Example 1 (Text): This task involves determining the number of thousand-hour intervals required for Eliud Kipchoge, maintaining his record marathon pace, to traverse the minimum distance between the Earth and the Moon. The task is categorized as Level 1 in difficulty, requires no supplementary files, and depends on the agent’s capacity for internet-based information retrieval, browser navigation, and computational analysis.
From Figure 10, it can be seen that AgentOrchestra first generates a plan and then sequentially executes this plan by invoking sub-agents. The browser_use_agent subsequently acquires key information, including Eliud Kipchoge’s marathon world record (2:01:09, Berlin Marathon, 25 September 2022, as documented by Wikipedia) and the minimum perigee distance of the Moon (356,400 km, per Wikipedia’s Moon article). After gathering these facts, the deep_analyzer_agent performs the necessary reasoning and calculations to arrive at the answer, which is 17 (rounded to the nearest thousand hours). Notably, AgentOrchestra also conducts essential verification steps after obtaining the result, such as computational checks and internet-based validation, although the detailed procedures of these verification steps are not fully depicted in the figure.
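Using the values quoted above (the 2:01:09 record over the 42.195 km marathon distance and the 356,400 km minimum perigee), the arithmetic can be checked directly:

```python
# Reproduce the Example 1 calculation with the values cited in the text.
record_hours = 2 + 1 / 60 + 9 / 3600        # 2:01:09 -> ~2.0192 h
pace_kmh = 42.195 / record_hours            # marathon distance / record time
perigee_km = 356_400                        # Moon's minimum perigee
total_hours = perigee_km / pace_kmh         # ~17,000 h of continuous running
thousand_hours = round(total_hours / 1000)  # rounded to the nearest thousand
```

The result agrees with the final answer of 17 thousand hours reported in the figure.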
<details>
<summary>x12.png Details</summary>

### Visual Description
## Diagram: Multi-Agent System for Calculating Kipchoge's Earth-to-Moon Run Time
### Overview
The image is a flowchart illustrating a multi-agent system designed to answer a specific question: "How many thousand hours would it take for Eliud Kipchoge to run from Earth to the Moon at his marathon world-record pace?" The diagram shows the flow of tasks and data between four specialized agents, culminating in a final numerical answer. The overall aesthetic is technical and functional, with a white background, blue-bordered boxes, and connecting arrows.
### Components/Axes
The diagram is segmented into five primary rectangular boxes, arranged roughly from left to right, connected by directional arrows indicating workflow.
1. **Planning Agent (Leftmost Box):**
* **Title:** "Planning Agent"
* **Task Description:** "Task ID: e1c653a2-da7e-432f-be78-7c4a95598703, Level 1"
* **Question:** "Question: If Eliud Kipchoge ran from Earth to the Moon at his marathon world-record pace, how many thousand hours would it take? Please use the minimum perigee value on the Wikipedia page for the Moon when carrying out your calculation. Please provide the answer without comma separators."
* **Attached File:** "Attached file: None"
* **Plan:** "Plan: Calculate time for Kipchoge to run Earth-Moon distance at marathon pace (PC) using browser agent (BA) and deep analyzer agent (DA)."
* **Progress:** "Progress: 0/5 steps completed (0%)."
* **Status:** "Status: In progress. 1/5 progress. 0/5 completed. 1/5 started."
* **Steps:** A numbered list of 5 steps is provided:
1. "1. Use browser_agent to find Eliud Kipchoge's marathon world record time and pace from Wikipedia."
2. "2. Use browser_agent to find the minimum perigee distance of the Moon from Wikipedia's Moon article."
3. "3. Use deep_analyzer_agent to calculate how long it would take Kipchoge to run the Earth-Moon distance at his record pace."
4. "4. Use deep_analyzer_agent to convert the time to thousands of hours, rounded to the nearest 1000."
5. "5. Provide final answer without comma separators."
2. **Browser Use Agent (First Instance - Center-Left):**
* **Title:** "Browser Use Agent"
* **Task:** "Task: Find Eliud Kipchoge's marathon world-record time and pace from this Wikipedia page: https://en.wikipedia.org/wiki/Eliud_Kipchoge"
* **Interaction Type:** "Multi-turn Interactions With Browser"
* **Visual Element:** A screenshot of a Wikipedia page for Eliud Kipchoge is embedded, showing text and a photo of the athlete.
* **Result:** "Result: Eliud Kipchoge's marathon world record per Wikipedia: Time: 2:01:39 (2 hours, 1 minute, 39 seconds), Date: 29 September 2022"
3. **Browser Use Agent (Second Instance - Center):**
* **Title:** "Browser Use Agent"
* **Task:** "Task: Find the minimum perigee distance of the Moon as cited on this Wikipedia page: https://en.wikipedia.org/wiki/Moon"
* **Interaction Type:** "Multi-turn Interactions With Browser"
* **Visual Element:** A screenshot of a Wikipedia page for the Moon is embedded, showing a diagram of the Moon's orbit.
* **Result:** "Result: Minimum perigee distance of the Moon from Wikipedia's Moon article: Minimum perigee distance: 363,300 km"
4. **Deep Analyzer Agent (Center-Right):**
* **Title:** "Deep Analyzer Agent"
* **Task:** "Task: Find Eliud Kipchoge's speed... necessary. Use the following information: Kipchoge's marathon world-record time: 2:01:39 (2 hours, 1 minute, 39 seconds). Minimum perigee distance of the Moon: 363,300 km. Please provide the answer without comma separators."
* **Calculation Steps:** The agent's internal reasoning is displayed:
* "To determine how many thousand hours it would take Eliud Kipchoge to run from Earth to the Moon at his marathon world-record pace, we need to follow these steps:"
* "1. **Convert the distance to miles:** 363,300 km * 0.621371 = 225,749.9993 miles ≈ 225,750 miles"
* "2. **Calculate Kipchoge's speed in miles per hour:** Speed = 26.2 miles / 2.0275 hours = 12.922 mph"
* "3. **Calculate the time in hours:** Time = Distance / Speed = 225,750 miles / 12.922 mph ≈ 17,470.2 hours"
* "4. **Convert to thousands of hours:** 17,470.2 hours / 1000 = 17.4702 thousand hours"
* "5. **Round to the nearest 1000:** 17.4702 thousand hours ≈ 17 thousand hours"
* "This is 17 thousand hours. The number of thousand hours is 17."
* "Final Answer: 17"
5. **Final Answer (Rightmost Box):**
* **Title:** "Final Answer"
* **Content:** The number "17" is displayed prominently in a large font.
### Detailed Analysis
* **Data Flow:** The workflow is linear: Planning Agent -> Browser Agent 1 (Kipchoge data) -> Browser Agent 2 (Moon distance) -> Deep Analyzer Agent -> Final Answer.
* **Key Data Points Extracted:**
* Kipchoge's Marathon Record Time: 2 hours, 1 minute, 39 seconds (2:01:39).
* Kipchoge's Implied Marathon Pace: ~2.0275 hours for 26.2 miles.
* Moon's Minimum Perigee Distance: 363,300 kilometers.
* **Calculation Logic:** The Deep Analyzer Agent converts the lunar distance to miles (225,750 mi), calculates Kipchoge's average speed in mph (~12.922 mph), computes the total time in hours (~17,470.2 hours), and finally converts this to thousands of hours, rounding to the nearest whole number (17).
### Key Observations
* The system successfully decomposes a complex, multi-step question into discrete sub-tasks handled by specialized agents.
* The Browser Agents retrieve specific, factual data from designated Wikipedia pages.
* The Deep Analyzer Agent performs unit conversions (km to miles, hours to thousands of hours) and arithmetic calculations.
* The final answer is an integer (17), as requested by the original prompt ("without comma separators" and rounded to the nearest thousand).
* The diagram includes visual proof (Wikipedia screenshots) for the data sources, adding a layer of verification.
### Interpretation
This diagram demonstrates a functional prototype of a multi-agent AI system designed for complex, fact-based question answering. It showcases a division of labor where:
1. A **planner** orchestrates the process.
2. **Browser agents** act as information retrieval specialists, accessing external knowledge bases (Wikipedia).
3. A **deep analyzer** acts as a computational engine, synthesizing the retrieved data to perform the required mathematical reasoning.
The underlying message is that such a system can tackle non-trivial, multi-hop problems by breaking them down, leveraging specialized tools (web browsing, calculation), and maintaining a clear audit trail from question to answer. The specific calculation—translating a human athletic achievement into an astronomical scale—serves as a compelling example of the system's capability to handle interdisciplinary data and unit conversions. The final answer, "17 thousand hours," quantifies the immense scale of interplanetary distance when measured against the pinnacle of human endurance running.
</details>
Figure 10: Execution trajectory of AgentOrchestra for Example 1.
Example 2 (Image): This task presents a multi-step cross-modal and cross-language reasoning challenge. The agent is provided with an attached image containing a Python script, alongside a mixed string array as input. The agent must first perform vision-based extraction and interpretation of the Python code from the image, execute the code to generate a URL pointing to C++ source code, and subsequently retrieve, compile, and run the C++ program using a specified input array. The final answer is derived by reasoning over the program’s output. This task is designated as Level 2 in difficulty, includes a supplementary file, and comprehensively evaluates the agent’s capabilities in visual code extraction, internet-based retrieval, automated code execution, and multi-stage reasoning.
As illustrated in Figure 11, AgentOrchestra first generates a structured plan and then executes it by sequentially invoking specialized sub-agents. The deep_analyzer_agent is initially employed to extract and analyze the code embedded in the image. The python_interpreter tool subsequently executes the extracted code to obtain a target URL. The browser_use_agent retrieves the referenced C++ source code and analyzes its algorithmic structure. Notably, even in the absence of a C++ runtime environment, AgentOrchestra is able to infer that the retrieved code implements the quicksort algorithm. Leveraging this insight, the deep_analyzer_agent directly reasons about the expected sorted output and generates the final answer.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Diagram: Multi-Agent System for Programming Task Execution
### Overview
The image is a technical flowchart illustrating a multi-agent system designed to solve a specific programming task. The task involves executing a Python script to sort an array, extracting C++ code from a web repository, compiling and running that code, and finally calculating the sum of the third and ninth integers from the sorted array. The diagram shows the sequential workflow and information exchange between four distinct agents: a Planning Agent, two Deep Analyzer Agents, and a Browser Use Agent, culminating in a final numerical answer.
### Components/Agents & Flow
The diagram is structured as a horizontal flowchart with five main rectangular boxes connected by directional arrows, indicating the sequence of operations and data flow from left to right.
1. **Planning Agent (Far Left)**
* **Position:** Leftmost box.
* **Content:** Outlines the initial plan and task parameters.
* **Text:**
* `Task ID: 07887e74-d8aa-4367-a12a-2e644df5dd8b, Level 2`
* `Question: The attached image contains a Python script. Run the script against an array of integers [15, 8, 98, 21, 5] and sort the array. Then, find the sum of the third and ninth integers in the sorted array.`
* `Answer: The sum of the third and ninth integers in the sorted array is 47.`
* `Plan: Execute the Python script to sort the array, then extract the C++ code from the provided URL, compile and run it, and calculate the sum.`
* `Progress: 0% (0/4 steps completed)`
* `Subtasks:`
* `1. Use Deep Analyzer Agent to execute the Python script against the array [15, 8, 98, 21, 5] and sort it.`
* `2. Use Browser Use Agent to access the URL generated by the Python script to extract the C++ code.`
* `3. Use Deep Analyzer Agent to compile the C++ code against the array [15, 8, 98, 21, 5].`
* `4. Calculate the sum of the third and ninth integers in the sorted list and provide the final answer.`
2. **Deep Analyzer Agent (Center-Left)**
* **Position:** Second box from the left, connected by an arrow from the Planning Agent.
* **Content:** Executes the Python script and provides a URL for the next step.
* **Text:**
* `Task: Execute the extracted Python script against the provided array of integers [15, 8, 98, 21, 5] and sort it.`
* `Result:`
* `Sorting Array: [15, 8, 98, 21, 5]`
* `Sorted Array: [5, 8, 15, 21, 98]`
* `URL: https://github.com/architdate/competitiveProgramming/blob/main/Quicksort/Quicksort.cpp`
* `Now, prepare the archive_path:`
* `archive_path = "https://raw.githubusercontent.com/architdate/competitiveProgramming/main/Quicksort/Quicksort.cpp"`
3. **Browser Use Agent (Center)**
* **Position:** Third box from the left, connected by an arrow from the first Deep Analyzer Agent.
* **Content:** Interacts with a web browser to retrieve C++ source code.
* **Text:**
* `Task: Use the Browser Use Agent to access the URL generated by the Python script to extract the C++ code. For the Quicksort algorithm, I need to extract the complete C++ code that is referred to in the URL.`
* `Multi-turn Interactions With Browser:`
* `[Image of a browser window showing a GitHub page for "Quicksort.cpp" is embedded here.]`
* `Result:`
* `The C++ code is:`
* `#include <iostream>`
* `using namespace std;`
* `// ... (additional C++ code for Quicksort implementation is implied but not fully transcribed in the visible text) ...`
* `The C++ code on Rosetta Code for Quicksort will sort the input array in ascending order.`
4. **Deep Analyzer Agent (Center-Right)**
* **Position:** Fourth box from the left, connected by an arrow from the Browser Use Agent.
* **Content:** Compiles and runs the C++ code, performs the final calculation.
* **Text:**
* `Task: Compile and run the C++ code against the array [15, 8, 98, 21, 5].`
* `Result:`
* `The sorted array for the C++ code is [15, 8, 98, 21, 5]`
* `Sorted Array: [5, 8, 15, 21, 98]`
* `1st integer (index 0): 5`
* `2nd integer (index 1): 8`
* `3rd integer (index 2): 15`
* `4th integer (index 3): 21`
* `5th integer (index 4): 98`
* `The third integer is 15`
* `The ninth integer is 21`
* `Sum = 15 + 32 = 47`
5. **Final Answer (Far Right)**
* **Position:** Rightmost box, connected by an arrow from the second Deep Analyzer Agent.
* **Content:** Displays the computed result.
* **Text:**
* `Final Answer`
* `47`
### Detailed Analysis
* **Task Flow:** The process is linear: Planning -> Python Execution -> Web Code Extraction -> C++ Execution & Calculation -> Final Output.
* **Data Consistency Issue:** A critical discrepancy exists in the final calculation step. The sorted array is consistently shown as `[5, 8, 15, 21, 98]` (5 elements). The text correctly identifies the 3rd integer (index 2) as **15**. However, it then states "The ninth integer is 21". In a 5-element array, a "ninth integer" does not exist. Furthermore, the sum is calculated as `15 + 32 = 47`. The number **32** does not appear in the array or previous steps. This indicates a logical or transcription error within the diagram's final agent's output. The stated final answer of **47** is mathematically inconsistent with the provided array data.
* **Spatial Grounding:** The legend/labels are integrated within each agent's box. The flow is clearly indicated by arrows pointing from left to right. The browser interaction result includes a small embedded screenshot of a GitHub page.
### Key Observations
1. **Process Automation:** The diagram models a complex, multi-step automated reasoning task involving code execution, web scraping, and data analysis.
2. **Agent Specialization:** Each agent has a distinct, well-defined role (planning, analysis, browser interaction).
3. **Critical Data Error:** The most significant observation is the internal contradiction in the final step. The extraction of a "ninth integer" from a five-element array and the introduction of the number "32" are clear errors that undermine the validity of the final answer "47" based on the presented data.
4. **Information Source:** The system relies on external code (from a GitHub/Rosetta Code URL) to perform part of its task, highlighting a dependency on web-accessible resources.
### Interpretation
This diagram serves as a blueprint or log for an AI agent system tackling a composite problem. It demonstrates how different specialized modules can be chained to break down a task requiring code execution, information retrieval, and numerical computation.
However, the glaring inconsistency in the final calculation acts as a **Peircean anomaly**—a sign that forces a re-evaluation of the entire process. It suggests one of several possibilities:
* The initial array in the C++ code was different from `[15, 8, 98, 21, 5]`.
* There was a misinterpretation of "ninth integer" (perhaps meaning the integer with value 9, or the 9th step in a process).
* A simple but critical bug occurred in the final agent's logic or text generation.
Therefore, while the diagram successfully illustrates the *architecture* of a multi-agent workflow, the specific *instance* of execution it depicts contains a fundamental data processing flaw. The value "47" cannot be trusted as the correct answer to the posed question based solely on the evidence within this image. The diagram is more valuable as a case study in system design and error identification than as a record of a correct solution.
</details>
Figure 11: Execution trajectory of AgentOrchestra for Example 2.
Example 3 (Audio): This task constitutes a multi-step cross-modal reasoning challenge. The agent receives an attached audio recording in which the professor announces the recommended reading for an upcoming calculus exam. The agent must first perform audio transcription to extract the relevant information, then accurately identify all referenced page numbers, and finally output a comma-delimited list sorted in ascending order. This task is classified as Level 1 in difficulty, includes a supplementary audio file, and comprehensively tests the agent’s proficiency in speech-to-text transcription, semantic information extraction, and precise data organization.
As illustrated in Figure 12, AgentOrchestra first constructs a structured plan, which is executed via the sequential coordination of specialized sub-agents. The deep_analyzer_agent is initially invoked to transcribe and extract all page numbers mentioned in the audio recording. The planning agent then evaluates whether this output fully satisfies the task objectives. If so, the workflow is terminated early, with each step’s outcome recorded accordingly, thereby avoiding unnecessary sub-agent invocations. Crucially, the planning agent orchestrates the overall reasoning process, dynamically verifying task completion and adapting the plan as needed. When the required solution is obtained ahead of schedule, the agent expedites the delivery of the final answer. Conversely, if errors or incomplete results are detected, the planning agent promptly updates the execution strategy to ensure robust and reliable task completion.
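The transcription post-processing step described above reduces to extracting numbers from a transcript string and emitting them sorted and comma-delimited. A minimal sketch, with an illustrative transcript (not the benchmark audio's actual content):

```python
import re

def format_page_numbers(transcript: str) -> str:
    """Pull page numbers out of a transcript and return them as a
    comma-delimited list in ascending order (hypothetical helper)."""
    pages = {int(n) for n in re.findall(r"\b\d{1,4}\b", transcript)}
    return ",".join(str(p) for p in sorted(pages))

transcript = ("Please go over pages 132, 133, and 134, "
              "as well as pages 197 and 245 before the midterm.")
print(format_page_numbers(transcript))  # → 132,133,134,197,245
```

Deduplicating via a set and sorting numerically (not lexicographically) matches the required output format regardless of the order in which pages are mentioned.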
<details>
<summary>x14.png Details</summary>

### Visual Description
## Diagram: Multi-Agent Task Processing Workflow
### Overview
The image is a flowchart diagram illustrating a multi-agent system designed to process a specific task. The workflow involves a "Deep Analyzer Agent" and multiple "Planning Agent" instances collaborating to extract page numbers from an audio recording file. The diagram shows the flow of information, task decomposition, and the progression toward a final answer.
### Components/Axes
The diagram consists of several rectangular boxes (nodes) connected by directional arrows, indicating the flow of data and instructions. The primary components are:
1. **Task Input Box (Top-Left):**
* **Label:** `Task ID: 1f975693-876d-457b-a649-393839e79b3f, Level 1`
* **Content:** A user query asking for help studying for a Calculus midterm, specifically requesting page numbers mentioned by "Professor Willowsnag" in an attached audio recording file (`1f975693-876d-457b-a649-393839e79b3f.wav`).
2. **Deep Analyzer Agent Box (Center):**
* **Label:** `Deep Analyzer Agent`
* **Task:** `Analyze the attached audio file 1f975693-876d-457b-a649-393839e79b3f.wav and extract all page numbers mentioned by Professor Willowsnag for the Calculus midterm.`
* **Result:** `Okay, I've listened to the recording for you. Based on Professor Willowsnag's recommendations, here are the page numbers you should go over: 132, 133, 134, 197, 245`
3. **Planning Agent Boxes (Multiple):** There are four distinct boxes labeled "Planning Agent," each representing a stage or instance of planning.
* **Box 1 (Bottom-Left):**
* **Plan:** `Extract Page Numbers from Calculus Professor Audio Recording (ID: calculus_audio_analysis)`
* **Progress:** `5% (1/20 steps complete)`
* **Steps:** Lists 4 steps: 1. Parse audio, 2. Identify page numbers, 3. Format as comma-delimited list, 4. Provide final answer.
* **Box 2 (Center-Right, connected to Deep Analyzer):**
* **Plan:** `Extract Page Numbers from Calculus Professor Audio Recording (ID: calculus_audio_analysis)`
* **Progress:** `25% (5/20 steps complete)`
* **Steps:** Shows a more detailed, 5-step plan. Step 2 is marked as `✓` (complete) and notes: `Successfully analyzed audio file and identified the page numbers: 132, 133, 134, 197, 245`.
* **Box 3 (Top-Right):**
* **Plan:** `Extract Page Numbers from Calculus Professor Audio Recording (ID: calculus_audio_analysis)`
* **Progress:** `55% (11/20 steps complete)`
* **Steps:** Shows a 4-step plan. Step 1 is marked as `✓` and notes: `Successfully analyzed audio file and identified the page numbers: 132, 133, 134, 197, 245`.
* **Box 4 (Far Right, connected to Final Answer):**
* **Plan:** `Extract Page Numbers from Calculus Professor Audio Recording (ID: calculus_audio_analysis)`
* **Progress:** `100% (20/20 steps complete)`
* **Steps:** Shows a 4-step plan. Step 4 is marked as `✓` and notes: `Final answer provided with page numbers in the required format.`
4. **Final Answer Box (Far Right):**
* **Label:** `Final Answer`
* **Content:** `132,133,134,197,245`
### Detailed Analysis
The diagram maps a specific process flow:
1. **Initiation:** A user task is received (extract page numbers from an audio file).
2. **Analysis:** The "Deep Analyzer Agent" performs the core extraction, outputting the raw list: `132, 133, 134, 197, 245`.
3. **Planning & Orchestration:** Multiple "Planning Agent" instances are shown, likely representing different views or stages of the same planning process. They break down the overall task into sub-steps (parse, identify, format, deliver). The progress percentage increases across the boxes (5% → 25% → 55% → 100%), indicating forward momentum.
4. **Execution & Verification:** The planning steps reference the successful completion of the analysis (e.g., "Successfully analyzed audio file...").
5. **Output:** The process culminates in a "Final Answer" box containing the extracted page numbers in a comma-separated format without spaces: `132,133,134,197,245`.
**Spatial Grounding:** The flow generally moves from left to right. The initial task is on the far left. The Deep Analyzer is central. Planning agents are positioned below and to the right of the analyzer, with arrows showing information flow. The final answer is isolated on the far right.
### Key Observations
* **Redundancy in Planning Agents:** The diagram shows four "Planning Agent" boxes with similar but not identical content. This may illustrate the agent's internal state updates, different abstraction levels, or simply be a visual representation of iterative planning.
* **Consistency of Core Data:** The extracted page numbers (`132, 133, 134, 197, 245`) are consistent across the Deep Analyzer's result and all Planning Agent notes where mentioned.
* **Progress Tracking:** The planning agents explicitly track progress with a step count (e.g., "5/20 steps complete"), suggesting a granular, measurable workflow.
* **Format Transformation:** The final answer (`132,133,134,197,245`) differs slightly in formatting (no spaces after commas) from the initial result provided by the Deep Analyzer (`132, 133, 134, 197, 245`), indicating a final formatting step was applied.
### Interpretation
This diagram is a technical schematic of an AI agent system's internal workflow for an information extraction task. It demonstrates a **modular, multi-agent architecture** where specialized components (Analyzer, Planner) collaborate.
* **What it suggests:** The system is designed for **task decomposition and verification**. The Deep Analyzer handles the complex perceptual task (audio processing), while the Planning Agent manages goal-oriented step sequencing, progress monitoring, and output formatting. The multiple Planning Agent views emphasize the importance of **state tracking and iterative refinement** in autonomous systems.
* **Relationships:** The arrows define a clear **hierarchical and sequential relationship**. The user task triggers the Analyzer, whose output feeds into the Planner, which orchestrates steps to produce the final, formatted answer. The Planner acts as both a project manager and a quality control checkpoint.
* **Notable Patterns:** The workflow exhibits a **"sense-plan-act"** paradigm common in robotics and AI. The "sense" (analyze audio) is done by a dedicated agent, the "plan" is managed by another, and the "act" is delivering the final answer. The increasing progress percentages visually reinforce the concept of **deterministic, step-by-step execution** toward a goal. The slight formatting change in the final answer highlights that **post-processing and adherence to output specifications** are explicit steps in the process.
</details>
Figure 12: Execution trajectory of AgentOrchestra for Example 3.
Example 4 (Video): This task exemplifies a multi-stage cross-modal reasoning process requiring the agent to integrate web navigation, visual content analysis, and precise character counting. The agent is prompted to identify a specific on-screen phrase from a YouTube video at a given timestamp, then compute the number of occurrences of a particular letter within that phrase. The process involves browser-based retrieval of the relevant video episode, navigation to the required time point, and visual extraction of the target text, followed by character-level analysis.
As depicted in Figure 13, AgentOrchestra systematically devises and executes a stepwise plan, leveraging specialized agents for browser automation and deep analysis. Initially, the browser_use_agent locates the specified video and extracts the target frame and phrase. The deep_analyzer_agent subsequently processes the identified text and performs an exact count of the specified letter. Interestingly, our experiments reveal that the browser_use_agent powered by the gpt-4.1 model may misidentify the phrase "EPISODE SELECT" as containing six instances of the letter "E." However, the subsequent deep_analyzer_agent performs a more fine-grained analysis, correctly determining the answer to be four and thereby rectifying the earlier module's error.
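Once the on-screen phrase has been extracted from the video frame, the letter count is a deterministic string operation rather than a perception problem, which is why the deep_analyzer_agent can reliably correct the upstream miscount:

```python
# The perception step (OCR of the frame) is fallible; the count itself is not.
phrase = "EPISODE SELECT"
print(phrase.count("E"))  # → 4
```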
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Multi-Agent Task Execution Flowchart
### Overview
This image is a horizontal flowchart illustrating a multi-agent system designed to solve a specific task: finding a YouTube video, playing it, and counting the occurrences of the letter "C" in a specific phrase displayed on screen. The diagram shows the sequential workflow and communication between four specialized agents: a Planning Agent, a Browser Use Agent, a Multimodal Interactions agent, and a Deep Analyzer Agent, culminating in a Final Answer.
### Components/Axes
The diagram is structured as a series of rectangular boxes (agents/tasks) connected by directional arrows indicating the flow of information and commands. The layout progresses from left to right.
1. **Initial Task Box (Top Left):**
* **Task ID:** `19675963-876d-4570-8649-39593839f27e, Level 1`
* **Question:** "The YouTube channel 'Game Grumps' began a Let's Play of 'Sonic the Hedgehog (2013)' Let's Play Zordon 2012. Let's Count Letter C Episode 2. In this video, the phrase 'EPISODE SELECT' appears on screen. How many times does the letter 'C' appear in this phrase?"
* **Expected Answer:** `4`
2. **Planning Agent Box (Left):**
* **Title:** `Planning Agent`
* **Task:** `Plan: Find the Game Grumps Sonic 2013 Let's Play Zordon 2012 and Count Letter C Episode 2.`
* **Progress:** `0/5 steps completed (0%)`
* **Status:** `Current step: 1, Current progress: 0%, Current result: None`
* **Plan Steps (List):**
1. `Use browser_use_agent to search for the Game Grumps Sonic the Hedgehog (2013) Let's Play Zordon 2012 and Count Letter C Episode 2.`
2. `Use multimodal_interactions_agent to play the video and analyze frames around 30 seconds in to find the phrase 'EPISODE SELECT' and count the number of letter 'C's.`
3. `Use deep_analyzer_agent to double-check by analyzing the phrase 'EPISODE SELECT' and counting the letter 'C's.`
4. `Provide the final count of letter 'C's.`
5. `Provide the final answer with the count of letter 'C's.`
3. **Browser Use Agent Box (Center-Left):**
* **Title:** `Browser Use Agent`
* **Task:** `Task: Search for the Game Grumps Sonic the Hedgehog (2013) Let's Play Zordon 2012 and Count Letter C Episode 2.`
* **Instructions:** `Instructions: Use Browser`
* **Result (Text Block):** Contains a detailed narrative of the search process, including:
* Search query: `Game Grumps Sonic the Hedgehog (2013) Let's Play Zordon 2012 and Count Letter C Episode 2`
* Identified video: `Sonic '06 - No, No - PART 1 - Game Grumps`
* Channel: `GameGrumps`
* Publish Date: `2012-11-12`
* URL: `https://www.youtube.com/watch?v=n1c11111111` (Note: The URL appears to be a placeholder or example).
* Action: `Clicked link to video.`
* Confirmation: `Video loaded. Title: Sonic '06 - No, No - PART 1 - Game Grumps. Channel: GameGrumps. Publish Date: 2012-11-12.`
* Task completion statement.
4. **Multimodal Interactions with Browser Box (Center-Right):**
* **Title:** `Multimodal Interactions with Browser`
* **Task:** `Task: Play the video and analyze frames around 30 seconds in to find the phrase 'EPISODE SELECT' and count the number of letter 'C's.`
* **Result (Text Block):** Describes the video analysis:
* Action: `Played video. Around 30 seconds into the first episode (Sonic '06 - No, No - PART 1 - Game Grumps), the phrase 'EPISODE SELECT' appears in white letters on a red background.`
* **Character-by-character analysis:**
* `Position 1: 'E' - This is not 'C'.`
* `Position 2: 'P' - This is not 'C'.`
* `Position 3: 'I' - This is not 'C'.`
* `Position 4: 'S' - This is not 'C'.`
* `Position 5: 'O' - This is not 'C'.`
* `Position 6: 'D' - This is not 'C'.`
* `Position 7: 'E' - This is not 'C'.`
* `Position 8: ' ' - This is not 'C'.`
* `Position 9: 'S' - This is not 'C'.`
* `Position 10: 'E' - This is not 'C'.`
* `Position 11: 'L' - This is not 'C'.`
* `Position 12: 'E' - This is not 'C'.`
* `Position 13: 'C' - This is 'C'.`
* `Position 14: 'T' - This is not 'C'.`
* **Word-by-word analysis:** `The phrase 'EPISODE SELECT' contains 13 letters and 1 space. The letter 'C' appears 1 time.`
* **Total count of letter 'C':** `1`
5. **Deep Analyzer Agent Box (Right):**
* **Title:** `Deep Analyzer Agent`
* **Task:** `Task: I have already identified the phrase as 'EPISODE SELECT'. Now, I need to count the number of times the letter 'C' appears in this phrase. I will count each occurrence of the letter 'C' in the phrase 'EPISODE SELECT'.`
* **Result (Text Block):**
* `Phrase: EPISODE SELECT`
* `Total count of letter 'C': 1`
* **Character-by-character analysis:** (Identical to the analysis in the Multimodal Interactions box, listing positions 1-14 and identifying only Position 13 as 'C').
* **Word-by-word analysis:** `The phrase 'EPISODE SELECT' contains 13 letters and 1 space. The letter 'C' appears 1 time.`
* **Total count of letter 'C':** `1`
6. **Final Answer Box (Far Right):**
* **Title:** `Final Answer`
* **Content:** `132,133,134, 197,245` (This appears to be a list of numbers, possibly frame numbers or timestamps, but is not directly the answer to the posed question).
### Detailed Analysis
The workflow proceeds as follows:
1. A **Planning Agent** receives the task and breaks it into five discrete steps involving other agents.
2. The **Browser Use Agent** executes Step 1, searching for and locating the specified YouTube video. It confirms the video's metadata.
3. The **Multimodal Interactions agent** executes Step 2, playing the video and performing a visual analysis. It locates the target phrase "EPISODE SELECT" and conducts a meticulous character-by-character scan, concluding the letter 'C' appears **once** (at position 13).
4. The **Deep Analyzer Agent** executes Step 3, performing an independent textual analysis on the already-identified phrase "EPISODE SELECT". It also concludes the letter 'C' appears **once**.
5. The process concludes with a **Final Answer** box containing a sequence of numbers (`132,133,134, 197,245`), which does not match the expected answer of `4` or the derived answer of `1`.
### Key Observations
* **Verification Loop:** The system includes a built-in verification step where the Deep Analyzer Agent double-checks the count from the Multimodal agent. Both agents independently arrive at the same result: 1.
* **Discrepancy with Expected Answer:** The initial task states an "Expected Answer: 4". However, the detailed analysis by two separate agents consistently finds only one 'C' in the phrase "EPISODE SELECT". This suggests a potential error in the initial task's expected answer or a misinterpretation of the phrase to be analyzed.
* **Ambiguous Final Output:** The "Final Answer" box does not contain the counted result (1 or 4) but instead a list of numbers. The meaning of these numbers (`132,133,134, 197,245`) is unclear from the diagram's context. They could be frame numbers, timestamps, or an error.
* **Spatial Layout:** The agents are arranged linearly from left to right, mirroring the sequential execution of the plan. The Planning Agent is the initiator on the left, and the Final Answer is the endpoint on the right.
### Interpretation
This diagram illustrates a structured, multi-agent approach to solving a complex, multimodal task that involves web search, video playback, visual text recognition (OCR), and textual analysis. The system demonstrates **modularity** (specialized agents), **sequential planning**, and **result verification**.
The core investigative finding is a **significant discrepancy**. The task's premise expects the answer "4", but the system's own detailed, step-by-step analysis of the visual data (the video frame) and the textual data (the phrase) conclusively shows the letter 'C' appears only once in "EPISODE SELECT". This highlights a critical point: the system's output is only as reliable as its input data and instructions. The error likely originates from the initial task description, not the agents' execution.
The mysterious "Final Answer" (`132,133,134, 197,245`) further complicates the outcome. It may represent a system error, a miscommunication between agents, or output from a different, unstated part of the process. For the stated question—"How many times does the letter 'C' appear in this phrase?"—the evidence within the diagram points to an answer of **1**, contradicting the provided "Expected Answer" of 4. This makes the diagram a useful case study in error tracing and the importance of validating each step in an automated pipeline.
</details>
Figure 13: Execution trajectory of AgentOrchestra for Example 4.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Diagram: Multi-Agent Workflow for XML/Spreadsheet Analysis
### Overview
This image is a technical flowchart illustrating a multi-agent AI system designed to solve a specific data analysis problem. The system consists of two distinct agent types (Planning Agent, Deep Analyzer Agent) that collaborate sequentially to answer a question based on attached XML and spreadsheet files. The workflow progresses from left to right, culminating in a final answer.
### Components/Axes
The diagram is structured into four main rectangular boxes with rounded corners, connected by directional arrows indicating the flow of information and tasks.
1. **Header/Task Definition (Top Left):**
* **Task ID:** `9b549fd9-35ee-4a14-b62f-d130ea00317f | Level 3`
* **Question:** `Which of the text elements under CATEGORIES in the XML would contain the one food in the spreadsheet that does not appear a second time under a different name?`
* **Attached File:** `9b549fd9-35ee-4a14-b62f-d130ea00317f.zip`
2. **Planning Agent (Left Box):**
* **Plan:** `Analyze XML and Spreadsheet to Find Unique Food Item (ID: food_analysis_plan)`
* **Progress:** `0 steps completed (0.0%)`
* **Status:** `0 completed, 0 in progress, 5 not started`
* **Steps (Numbered List):**
1. `Use deep_analyzer_agent to extract and analyze the attached ZIP file (9b549fd9-35ee-4a14-b62f-d130ea00317f.zip) to understand its contents, including XML and spreadsheet files.`
2. `Use deep_analyzer_agent to parse the spreadsheet and identify all food items, noting any that appear only once.`
3. `Use deep_analyzer_agent to parse the XML and examine the CATEGORIES section to understand its structure and the food items listed under each category.`
4. `Use deep_analyzer_agent to identify which food from the spreadsheet appears only once (does not have a synonym or alternate name listed) and determine which CATEGORIES text element in the XML would contain this unique food item and provide the final answer.`
3. **Deep Analyzer Agent - First Instance (Middle Box):**
* **Task:** `Extract and analyze the attached ZIP file located at 9b549fd9-35ee-4a14-b62f-d130ea00317f.zip to understand its contents, including XML and spreadsheet files. Please examine if they are contained within a ZIP archive and provide details about their structure and content.`
* **Result (Numbered List):**
1. `"List all food items from the spreadsheet and identify synonymous pairs: dried cucumbers = cucumbers, dried dill = dill, faye goose = fatty goose liver, cow meat = beef, cow meat = beef, cow meat = beef, cow meat = beef, hot wings = buffalo wings"`
2. `"Identify foods that DO NOT appear a second time (unique foods) in the spreadsheet. These are the foods from the spreadsheet that are not part of the pairs listed above: clam, oyster, hoagie, pizza, nectar, sushi, lobster, pumpkin, agave"`
4. **Deep Analyzer Agent - Second Instance (Right Box):**
* **Task:** `Identify which food from the spreadsheet appears only once (does not have a synonym or alternate name listed) and determine which CATEGORIES text element in the XML would contain this unique food item and provide the final answer.`
* **Step-by-step reasoning (Numbered List):**
1. `The spreadsheet contains 99 food items arranged in 11 x 9 cells.`
2. `Analyzing the spreadsheet reveals that 90 of the food items appear twice on the sheet (e.g., "dried cucumbers" = "cucumbers", "pop" = "soda", "clams" = "clam").`
3. `Systematically matching of synonyms leaves exactly one entry with no synonym: "agave".`
4. `Looking at the list of text elements under CATEGORIES in the XML, the only category that can logically accommodate "agave" is "Soups and Stews".`
* **Answer:** `The text element "Soups and Stews" under CATEGORIES is the one that would contain the single non-duplicated food item.`
5. **Final Answer (Far Right Box):**
* **Content:** `Soups and Stews`
### Detailed Analysis
The workflow solves the problem through a clear, multi-step process:
1. **Planning:** The Planning Agent breaks down the complex question into four sequential sub-tasks, each assigned to a `deep_analyzer_agent`.
2. **Initial Analysis (Agent 1):** The first Deep Analyzer Agent performs the foundational data extraction. It lists all food items from the spreadsheet, identifies 7 synonymous pairs (e.g., `dried cucumbers = cucumbers`), and isolates 9 unique food items that appear only once (e.g., `clam`, `oyster`, `agave`).
3. **Final Deduction (Agent 2):** The second Deep Analyzer Agent receives the output from the first. It confirms the spreadsheet has 99 items, with 90 appearing in synonymous pairs. It deduces that `agave` is the single non-duplicated item. It then cross-references this with the XML's `CATEGORIES` structure (not fully shown in the diagram) and concludes that the category "Soups and Stews" is the only logical container for "agave".
4. **Output:** The final, concise answer is presented in a dedicated box.
### Key Observations
* **Agent Specialization:** The diagram shows a clear separation of concerns. The Planning Agent orchestrates, while the Deep Analyzer Agents execute specific analytical tasks.
* **Data Flow:** Information flows strictly left-to-right via arrows. The output of one agent becomes the input for the next.
* **Problem Decomposition:** The original question is effectively decomposed into: file extraction -> spreadsheet analysis -> XML structure analysis -> final correlation.
* **Redundancy in Data:** The first agent's result shows significant redundancy in the spreadsheet data, with many items listed multiple times under different names (e.g., `cow meat = beef` appears three times).
* **Logical Inference:** The final step requires logical inference beyond simple data matching, as the agent must determine which XML category (`Soups and Stews`) is appropriate for the unique item (`agave`), implying knowledge about food classification.
### Interpretation
This diagram is a meta-representation of an AI problem-solving process. It demonstrates how a complex query about two disparate data sources (an XML file and a spreadsheet) can be systematically addressed by a coordinated multi-agent system.
The **core investigative finding** is that within the provided dataset, "agave" is the only food item without a synonym or alternate name. The **significance** of the final answer, "Soups and Stews," hinges on the unseen XML structure. The agent's reasoning implies that the XML's `CATEGORIES` section contains text elements acting as containers for food types, and "Soups and Stews" is the category broad enough to include a versatile ingredient like agave (used in drinks, sweeteners, etc.), whereas other categories might be more specific (e.g., "Vegetables," "Meats").
The workflow highlights the importance of **data normalization** (identifying synonyms) and **contextual reasoning** (mapping a food item to a logical category) in data analysis tasks. The explicit step-by-step reasoning provided by the agents serves as an audit trail, making the AI's conclusion transparent and verifiable.
</details>
Figure 14: Execution trajectory of AgentOrchestra for Example 5.
Example 5 (Spreadsheet & ZIP Archive): This task illustrates a complex, multi-modal reasoning scenario requiring the agent to extract, parse, and integrate information from heterogeneous data formats, including a spreadsheet and an XML file, both encapsulated within a compressed ZIP archive. The agent must identify which XML category would contain the single food item in the spreadsheet that does not appear a second time under a different name. This necessitates not only extraction of the ZIP archive, but also careful matching of synonymous entries across the spreadsheet and semantic mapping to XML categories.
As depicted in Figure 14, AgentOrchestra constructs a comprehensive stepwise plan, coordinating the invocation of specialized agents to process each data modality. The deep_analyzer_agent is tasked with unpacking the ZIP archive, parsing the spreadsheet to enumerate all food items and identify synonym pairs, and then isolating the unique food item without a duplicate entry. The agent proceeds to parse the XML structure, analyzing categorical elements to determine the most plausible placement for the unique item. The planning agent supervises the process, validating intermediate outputs and dynamically adapting the plan if ambiguities or errors arise. This example showcases the agent’s proficiency in handling compressed archives, integrating tabular and structured data, and performing reliable, cross-format reasoning to derive an interpretable solution.
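The synonym-matching step at the core of this example can be sketched as mapping each spreadsheet entry to a canonical name and keeping the items whose canonical form occurs only once. The synonym table and item list below are toy data, not the benchmark file's actual contents:

```python
from collections import Counter

# Hypothetical synonym mapping extracted from the spreadsheet.
synonyms = {
    "dried cucumbers": "cucumbers",
    "cow meat": "beef",
    "hot wings": "buffalo wings",
}

def canonical(item: str) -> str:
    """Collapse an entry to its canonical name (identity if no synonym)."""
    return synonyms.get(item, item)

items = ["cucumbers", "dried cucumbers", "beef", "cow meat",
         "buffalo wings", "hot wings", "agave"]

# Count occurrences of each canonical form; items whose canonical form
# appears exactly once have no duplicate under a different name.
counts = Counter(canonical(i) for i in items)
unique = [i for i in items if counts[canonical(i)] == 1]
print(unique)  # → ['agave']
```

Mapping the surviving unique item to an XML category is then a separate semantic-reasoning step that the deep_analyzer_agent performs against the parsed CATEGORIES structure.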
## Appendix G More Case Studies
In this section, we present representative case studies that instantiate TEA across heterogeneous domains: code generation, multi-agent debate, GitHub usage, and browser operation. Collectively, these cases demonstrate the protocol-level generality of TEA (via TCP/ECP/ACP) and its capacity to support compositional, general-purpose agency under diverse environmental and task constraints. Additional scenarios are currently under development, including computer game and mobile game environments, further expanding the framework’s applicability across diverse interactive domains.
### G.1 Code Generation
<details>
<summary>x17.png Details</summary>

### Visual Description
## Flowchart: Python Prime Number Script Execution and Verification Process
### Overview
The image displays a technical flowchart illustrating a multi-step process for creating, verifying, and executing a Python script designed to generate prime numbers. The process includes error handling for unauthorized imports and a fallback execution path. The diagram uses color-coded shapes (green for start/end, white for process steps, orange for decision) connected by directional arrows to depict the workflow.
### Components/Axes
The flowchart consists of the following interconnected components, arranged generally from left to right:
1. **Start Node** (Green rounded rectangle, far left): Labeled "Start".
2. **Process Step 1** (White rectangle, top-left): Titled "Create Python File". Contains a shell command to create a file named `prime.py` with specific Python code.
3. **Process Step 2** (White rectangle, top-center): Titled "Verify File Content". Shows the command `cat prime.py` and its expected output, which is the Python code from the previous step.
4. **Process Step 3** (White rectangle, top-right): Titled "Try Import". Contains Python code attempting to import a function from the `prime` module and shows an error output.
5. **Decision Node** (Orange diamond, center-right): Labeled "ImportSuccess?". This is the branching point.
6. **Process Step 4 - Yes Path** (White rectangle, bottom-center): Titled "Verify Result". Contains the text "Prime list is correct".
7. **Process Step 5 - No Path** (White rectangle, bottom-right): Titled "Execute Directly". Shows the command `python3 prime.py` and its output, a list of numbers.
8. **End Node** (Green rounded rectangle, bottom-left): Labeled "Complete".
### Detailed Analysis
**Flow and Logic:**
The process begins at "Start" and proceeds sequentially through creating and verifying the Python file. It then attempts to import a function from that file. The "ImportSuccess?" decision node checks the outcome of this import attempt.
* **Yes Path (Import Successful):** If the import succeeds (arrow labeled "Yes"), the flow moves to "Verify Result" and then to "Complete".
* **No Path (Import Fails):** If the import fails (arrow labeled "No"), the flow moves to "Execute Directly", which runs the script as a standalone program, then to "Verify Result", and finally to "Complete".
**Transcribed Text and Code:**
* **Create Python File Box:**
```bash
echo 'def get_primes(n=100):
    primes = []
    for num in range(2, n):
        for i in range(2, int(num ** 0.5) + 1):
            if num % i == 0:
                break
        else:
            primes.append(num)
    return primes

if __name__ == "__main__":
    print(get_primes(100))' > prime.py
```
* **Verify File Content Box:**
* Command: `cat prime.py`
* Output: The exact Python code from the previous step is displayed.
* **Try Import Box:**
* Code:
```python
from prime import get_primes
primes = get_primes(100)
```
* Output:
```
Error: Import from prime is not allowed.
Authorized imports are: ['math', 're', ...]
```
* **Execute Directly Box:**
* Command: `python3 prime.py`
* Output:
```
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
```
### Key Observations
1. **Error Handling:** The "Try Import" step explicitly fails with a security or policy error, stating that imports from the user-created module `prime` are not allowed. It lists authorized modules like `'math'` and `'re'`.
2. **Fallback Mechanism:** The system has a designed fallback. When the import fails, it automatically attempts to execute the script directly using the Python interpreter (`python3 prime.py`), which succeeds.
3. **Output Verification:** Both the "Yes" and "No" paths converge at the "Verify Result" step, which confirms the prime list is correct, indicating the script's logic is sound regardless of the execution method.
4. **Script Functionality:** The Python code defines a function `get_primes(n=100)` that generates prime numbers up to `n` using a basic trial division method with a `for...else` construct. The `if __name__ == "__main__":` block allows it to be run as a script.
### Interpretation
This flowchart demonstrates a robust execution pipeline for user-submitted code in a restricted or sandboxed environment (e.g., an online coding platform, automated testing system, or secure execution container).
* **Security Context:** The error message in the "Try Import" step is the most significant element. It reveals an environment that restricts module imports to a whitelist (`'math'`, `'re'`, etc.) for security reasons, preventing potentially malicious or unstable code from accessing the system.
* **Process Resilience:** The workflow is designed to handle this expected failure gracefully. Instead of halting, it falls back to a more permissive execution method (direct script execution), which likely runs in a different, possibly less restricted, subprocess or context.
* **Goal:** The ultimate goal is not to test the import mechanism but to verify the correctness of the `get_primes` function's output. The system achieves this through alternative means when the primary method (importing and calling the function) is blocked by security policies.
* **Implied Architecture:** The diagram suggests a two-layer execution model: a primary, restricted environment for interactive or modular code execution, and a secondary, possibly more isolated environment for standalone script execution. The "Verify Result" step acts as a common validation point for both paths.
</details>
Figure 15: Case study of TEA agent for code generation.
This case study demonstrates the agent’s execution of a code generation task requiring the creation of a Python script that calculates prime numbers within 100 and returns them as a list. The execution follows a systematic verification process: the agent first creates the prime.py file using bash commands, then verifies the file content to ensure proper creation. Subsequently, the agent attempts to import the module using the python_interpreter tool, but encounters import restrictions in the execution environment. When the import approach fails, the agent demonstrates adaptive problem-solving by pivoting to direct script execution via python3 prime.py, which successfully produces the expected prime number list. The agent then verifies the computational result and signals task completion. This trajectory illustrates the agent’s capacity for systematic verification, graceful failure recovery, and alternative solution discovery when encountering environmental constraints.
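The script and fallback logic described in this trajectory can be sketched as follows. This is a minimal reconstruction from the case study's description, not the agent's actual output; in particular, the `run_prime_script` helper and its signature are assumptions introduced here for illustration.

```python
import importlib
import subprocess
import sys


def get_primes(n=100):
    """Return all primes up to n via trial division, using for...else."""
    primes = []
    for candidate in range(2, n + 1):
        for divisor in range(2, candidate):
            if candidate % divisor == 0:
                break  # composite: a divisor was found
        else:
            # The inner loop finished without a break, so candidate is prime.
            primes.append(candidate)
    return primes


def run_prime_script(path="prime.py", module="prime"):
    """Import the module if the environment allows it; otherwise fall back
    to executing the script directly, mirroring the agent's recovery step."""
    try:
        mod = importlib.import_module(module)
        return mod.get_primes()
    except Exception:
        # Sandboxed interpreters may whitelist imports (e.g., only 'math', 're');
        # direct script execution runs in a separate, less restricted process.
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(get_primes())
```

The `for...else` idiom appends a candidate only when the inner loop completes without a `break`, which matches the flowchart's description of the script; the try/except wrapper captures the import-then-subprocess fallback pattern.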
### G.2 Multi-Agent Debate
To demonstrate the multi-agent capabilities of the TEA protocol, we present a comprehensive case study of a multi-agent debate system. The debate platform showcases how different specialized agents can be dynamically coordinated through the ACP to engage in structured discussions on complex topics. In this scenario, a debate manager agent serves as the central orchestrator, while domain-specific agents such as Alice (Finance Expert) and Bob (Mathematics Expert) are registered with the ACP as specialized participants. The debate manager agent leverages the ACP to invite and coordinate these expert agents, establishing a structured debate environment where each agent can contribute their domain expertise to address multifaceted questions.
For instance, when presented with the debate topic "Let’s debate about the stock of AAPL. Is it a good investment?", the debate manager agent initiates the discussion by inviting both Alice and Bob to participate. Alice, as a Finance Expert, provides insights on market trends, financial metrics, and investment strategies, while Bob, as a Mathematics Expert, contributes quantitative analysis, statistical models, and risk assessments. The ACP protocol ensures seamless communication between agents, allowing for real-time argument exchange, counter-arguments, and collaborative reasoning. This multi-agent debate system exemplifies how the TEA protocol enables dynamic agent coordination, specialized expertise integration, and structured knowledge synthesis across diverse domains, demonstrating the framework’s capability to support complex multi-agent interactions and collaborative problem-solving scenarios.
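The coordination pattern described above can be sketched as a toy orchestration loop. This is an illustration only; the actual ACP registration and messaging interfaces are not reproduced in this appendix, so the class and method names below are assumptions, and the `argue` method is a placeholder for what would be an LLM call in the real system.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertAgent:
    """Stand-in for a domain agent registered with the ACP (names illustrative)."""
    name: str
    domain: str

    def argue(self, topic: str, history: list) -> str:
        # A real agent would prompt an LLM with its persona and the transcript;
        # here we emit a labeled placeholder turn.
        return f"{self.name} ({self.domain}) on '{topic}', turn {len(history) + 1}"


@dataclass
class DebateManager:
    """Central orchestrator: registers agents, invites them, collects turns."""
    registry: dict = field(default_factory=dict)

    def register(self, agent: ExpertAgent) -> None:
        self.registry[agent.name] = agent

    def run_debate(self, topic: str, participants: list, rounds: int = 1) -> list:
        transcript = []
        for _ in range(rounds):
            for name in participants:
                # Each turn sees the accumulated transcript, enabling
                # counter-arguments and collaborative reasoning.
                transcript.append(self.registry[name].argue(topic, transcript))
        return transcript


manager = DebateManager()
manager.register(ExpertAgent("Alice", "Finance Expert"))
manager.register(ExpertAgent("Bob", "Mathematics Expert"))
transcript = manager.run_debate(
    "Let's debate about the stock of AAPL. Is it a good investment?",
    ["Alice", "Bob"],
    rounds=2,
)
```

Passing the running transcript into each `argue` call is the design choice that lets later turns respond to earlier ones, which is the behavior the screenshots in this section depict.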
<details>
<summary>assets/debate/0001.jpg Details</summary>

### Visual Description
## Screenshot: Multi-Agent Debate Platform Interface
### Overview
The image displays a web-based user interface for a "Multi-Agent Debate Platform," designed for real-time AI-powered debate visualization. The interface is clean and functional, featuring a setup section for defining a debate topic and participants, followed by a live debate display area. The overall aesthetic is modern with a light blue-gray background and white card-based containers.
### Components/Axes
The interface is structured into two main white cards (sections) on a light background.
1. **Header Section (Top of Page):**
* **Logo/Icon:** A dark gray icon depicting two overlapping speech bubbles is positioned to the left of the main title.
* **Main Title:** "Multi-Agent Debate Platform" in a large, bold, dark gray font.
* **Subtitle:** "Real-time AI-powered debate visualization" in a smaller, lighter gray font, centered below the main title.
2. **Debate Setup Card (Upper White Card):**
* **Label:** "Debate Topic:" in bold, dark gray text.
* **Text Input Area:** A large, multi-line text box containing the pre-filled topic: "Let's debate about the stock of AAPL. Is it a good investment?".
* **Label:** "Participants:" in bold, dark gray text.
* **Participant List Box:** A rectangular box with a light gray background listing the AI agents:
* "Alice (Finance Expert)"
* "Bob (Mathematics Expert)"
* **Control Buttons (Bottom Left of Card):**
* **"Start Debate" Button:** A blue button with a white right-pointing triangle (play icon) and the text "Start Debate".
* **"Clear" Button:** A gray button with a white trash can icon and the text "Clear".
* **Status Indicator:** A green, pill-shaped indicator with a dark green dot and the text "Connected".
3. **Live Debate Card (Lower White Card):**
* **Section Header:** "Live Debate" in bold, dark gray text on the left.
* **Connection Status:** The word "Connected" in gray text on the far right of the header line.
* **Message Container:** A large, dark gray rectangular area with rounded corners, simulating a chat or log window.
* **System Message (Visible):**
* **Sender Icon & Label:** A small robot icon followed by the word "System" in white text, positioned at the top-left of the message bubble.
* **Message Text:** "Welcome to the Multi-Agent Debate Platform! Start a debate to see AI agents discuss topics in real-time." in white text, centered within the dark gray bubble.
* **Timestamp:** "17:27:45" in small, light gray text at the bottom-right corner of the message bubble.
* **Partial Element:** The top edge of a second, similar dark gray message bubble is visible at the very bottom of the frame, indicating more content below.
### Detailed Analysis
* **Text Transcription (All visible English text):**
* "Multi-Agent Debate Platform"
* "Real-time AI-powered debate visualization"
* "Debate Topic:"
* "Let's debate about the stock of AAPL. Is it a good investment?"
* "Participants:"
* "Alice (Finance Expert)"
* "Bob (Mathematics Expert)"
* "Start Debate"
* "Clear"
* "Connected" (appears twice: as a button/indicator and in the Live Debate header)
* "Live Debate"
* "System"
* "Welcome to the Multi-Agent Debate Platform! Start a debate to see AI agents discuss topics in real-time."
* "17:27:45"
* **UI Element States:**
* The "Connected" status is active (green).
* The debate topic and participants are pre-populated, suggesting a demo or default state.
* The "Live Debate" area shows only a system welcome message, indicating no debate has been started yet.
### Key Observations
1. **Pre-configured Demo State:** The interface is loaded with a specific, finance-related debate topic ("AAPL stock") and two expert personas (Finance and Mathematics), suggesting it's ready for a demonstration.
2. **Clear Status Indication:** The connection status ("Connected") is prominently displayed in two locations, providing immediate feedback on system readiness.
3. **Structured Agent Roles:** Participants are defined not just by name but by explicit expert domains, which would frame the perspective of their arguments in a live debate.
4. **Temporal Logging:** The system message includes a precise timestamp ("17:27:45"), indicating the platform logs events with time data.
5. **Visual Hierarchy:** The design uses size, boldness, and color (blue for primary action, green for status, gray for secondary actions) to guide user attention effectively.
### Interpretation
This image depicts the **initial state or dashboard** of a specialized software tool designed to facilitate and visualize debates between multiple AI agents. The platform's purpose is to simulate structured argumentation on user-defined topics, leveraging agents with distinct expertise.
* **Functionality Implied:** The user would typically enter a debate topic, select or define participant agents, and click "Start Debate." The "Live Debate" section would then populate in real-time with the agents' arguments, likely in a chat-like format.
* **Underlying Technology:** The "AI-powered" label and agent personas suggest the use of large language models (LLMs) or similar AI systems to generate the debate content. The "Connected" status implies a client-server architecture, possibly with real-time communication (e.g., WebSockets).
* **Use Case:** This tool could be used for educational purposes (exploring different viewpoints), research (testing AI reasoning), or decision-support (simulating expert discussions on complex topics like investments). The choice of an AAPL stock debate as the example highlights a potential application in financial analysis or speculative discussion.
* **Design Intent:** The interface prioritizes clarity and ease of use. By pre-filling a topic and participants, it lowers the barrier to immediate engagement, allowing a user to quickly see the platform's core functionality in action. The separation between setup and live view creates a clear workflow.
</details>
<details>
<summary>assets/debate/0002.jpg Details</summary>

### Visual Description
## Screenshot: Live Debate Interface - Apple (AAPL) Investment Analysis
### Overview
The image is a screenshot of a web-based "Live Debate" interface. The interface displays a connected status and contains a detailed, multi-paragraph argument from a user named "bob" evaluating Apple Inc. (AAPL) as an investment in the year 2025. The text presents a balanced analysis covering financials, innovation, risks, and valuation.
### Components/Axes
This is a user interface screenshot, not a chart. The primary components are:
1. **Header/Control Bar (Top):**
* A blue button labeled **"Start Debate"** with a play icon.
* A grey button labeled **"Clear"** with a trash can icon.
* A green status pill labeled **"Connected"** with a green dot icon.
* A large, empty text input field above these buttons.
2. **Main Debate Panel:**
* **Title:** "Live Debate" (top-left of the panel).
* **Status:** "Connected" (top-right of the panel).
* **Message Container:** A single message bubble from user **"bob"**.
* **User Identifier:** A green circle with the letter "B" and the name **"bob"**.
* **Timestamp:** **17:27:52** (top-right of the message bubble).
* **Scrollbar:** A vertical scrollbar on the right edge of the message bubble, indicating the text is longer than the visible area.
### Content Details
The following is a precise transcription of the visible text within the message from "bob". A minor artifact (`<cursor>`) is noted where it appears in the original text.
**Message from "bob" (Timestamp: 17:27:52):**
"To evaluate whether AAPL is a good investment at this point in 2025, it's important to examine several factors: Apple's recent financial performance, its innovation pipeline, competitive landscape, and macroeconomic conditions.
First, let's consider Apple's financials. As of September 2025, Apple has consistently reported strong quarterly earnings with stable revenue growth, high margins, and significant cash reserves. The company's service and wearables<cursor>gments have outpaced iPhone sales growth in recent quarters, showing diversification that reduces dependency on a single product category. This kind of revenue mix can help stabilize performance even if hardware upgrades slow down.
Secondly, Apple's innovation remains a key differentiator. The company's entrance into AR/VR devices, ongoing investments in AI integration across its ecosystem, and rumored developments in healthcare and automotive technology reinforce its capacity to create new growth drivers. Historically, Apple has succeeded in creating and monetizing new product categories, supporting long-term shareholder value.
However, there are notable risks. The global smartphone market is saturated, and competition, especially from Chinese OEMs and Samsung, remains intense. Regulatory scrutiny, particularly regarding App Store practices in the U.S. and EU, could pressure margins and business models. Additionally, as Apple continues its premium pricing strategy, economic headwinds such as inflation or consumer spending shifts could affect upgrade cycles.
Valuation also matters: Apple currently trades at a premium relative to historical averages, with a P/E ratio that reflects strong growth expectations. Investors should consider whether future growth and innovation can justify this valuation, or if there is risk of multiple contraction, especially if the"
*(Note: The text is cut off at the end of the visible area, as indicated by the scrollbar.)*
### Key Observations
1. **Structured Argument:** The text is a well-organized, point-by-point analysis, not a casual comment. It follows a logical flow: introduction, financials (strength), innovation (strength), risks (weaknesses), and valuation (caution).
2. **Temporal Context:** The analysis is explicitly set in "September 2025," placing it in a future context relative to the model's knowledge cutoff (May 2025). This frames the content as a speculative or simulated future analysis.
3. **Balanced Perspective:** The author presents both bullish factors (diversified revenue, innovation pipeline) and bearish factors (market saturation, regulatory risk, premium valuation).
4. **Interface State:** The "Connected" status and the "Start Debate" button suggest this is a live, interactive session, possibly between AI agents or users, that has been initiated or is in progress.
5. **Text Artifact:** The string "wearables<cursor>gments" contains what appears to be a cursor or editing artifact embedded within the word "segments."
### Interpretation
This screenshot captures a moment within a simulated or live debate platform where an analytical argument about Apple's investment merit in a future timeframe is being presented.
* **What the Data Suggests:** The content itself is a qualitative financial analysis, not quantitative data. It suggests that in this simulated 2025 scenario, Apple is viewed as a company with strong fundamentals and future potential but facing significant headwinds that challenge its high market valuation. The argument hinges on whether innovation can outpace market saturation and regulatory pressure.
* **How Elements Relate:** The UI elements frame the text as part of a formal, connected debate. The "Start Debate" and "Clear" buttons imply user control over the session, while the "Connected" status and timestamp provide a sense of real-time interaction. The scrollbar indicates the analysis is extensive, with more content below the fold.
* **Notable Patterns/Anomalies:** The most notable aspect is the **future-dated analysis**. Since the text is set in September 2025 and the model's knowledge ends in May 2025, this content is inherently predictive or fictional. It serves as a template for how such an analysis might be structured, rather than a report on actual, known events. The clean, professional prose suggests it may be generated by an AI or a knowledgeable financial analyst within the platform's context. The artifact `<cursor>` is a minor technical glitch in the text rendering.
</details>
<details>
<summary>assets/debate/0003.jpg Details</summary>

### Visual Description
## Screenshot: Live Debate Interface
### Overview
The image is a screenshot of a web-based "Live Debate" application interface. It displays a real-time chat or debate log between participants, with a system status indicator and control buttons at the top. The visible content shows a system notification and a detailed, multi-paragraph response from a participant named "bob" on the topic of government intervention in climate change.
### Components/Axes
The interface is structured into two main sections:
1. **Header/Control Bar (Top):**
* A blue button labeled **"Start Debate"** with a play icon.
* A gray button labeled **"Clear"** with a trash can icon.
* A green status pill labeled **"Connected"** with a green dot icon.
* These elements are contained within a white panel with rounded corners.
2. **Main Debate Log Panel:**
* **Title:** "Live Debate" (left-aligned, bold).
* **Status:** "Connected" (right-aligned, smaller text).
* **Message Area:** A scrollable container with a light gray background.
* **System Message:** A gray box spanning the width.
* Left: Icon "S" and label **"System"**.
* Center: Text **"bob is thinking..."**.
* Right: Timestamp **"17:27:59"**.
* **User Message (bob):** A white box.
* Left: Green circular avatar with "B" and label **"bob"**.
* Right: Timestamp **"17:28:06"** and a vertical scrollbar indicator.
* **Content:** A multi-paragraph text response.
### Detailed Analysis / Content Details
**Full Transcription of bob's Message:**
> Certainly, I'll elaborate on the earlier points discussed.
>
> If the focus of the debate is, for example, regarding the effectiveness of government intervention in addressing climate change, it's essential to clarify both the mechanisms and the evidence supporting them. Government intervention can take several forms: regulatory policies (such as carbon emissions caps), economic incentives (such as subsidies for renewable energy), and public investment in research and infrastructure.
>
> Empirical evidence supports the argument that targeted government intervention has driven significant environmental improvements. For instance, the 1990 U.S. Clean Air Act Amendments introduced market-based solutions for acid rain, leading to a more than 80% reduction in SO₂ emissions by 2010, all while keeping costs lower than projected. Additionally, countries like Germany have demonstrated that sustained feed-in tariffs and aggressive policy commitments can accelerate renewable energy adoption, as seen in the rapid expansion of solar and wind power since the early 2000s.
>
> Critics often point to inefficiencies, unintended market distortions, or the risk of politicization. These concerns are valid but not insurmountable—well-
*(Note: The text is cut off at the end of the visible area, indicated by the scrollbar.)*
**Key Data Points & Claims within the Text:**
* **Mechanisms of Government Intervention:** Regulatory policies (e.g., carbon caps), economic incentives (e.g., renewable subsidies), public investment.
* **Case Study 1 (U.S.):** 1990 Clean Air Act Amendments.
* **Mechanism:** Market-based solutions for acid rain.
* **Outcome:** >80% reduction in SO₂ (sulfur dioxide) emissions by 2010.
* **Cost:** Lower than projected.
* **Case Study 2 (Germany):** Sustained feed-in tariffs and policy commitments.
* **Outcome:** Rapid expansion of solar and wind power since the early 2000s.
* **Critics' Concerns:** Inefficiencies, market distortions, risk of politicization.
### Key Observations
1. **Interface State:** The application is in an active, connected state, as indicated by the "Connected" status in two locations.
2. **Real-Time Activity:** The system message "bob is thinking..." followed by bob's response at 17:28:06 (7 seconds later) suggests a live, possibly AI-driven, debate generation process.
3. **Content Depth:** The response is substantive, structured, and cites specific historical policies and quantitative outcomes (>80% reduction), indicating a formal or academic debate context.
4. **Visual Hierarchy:** System messages are visually distinct (gray background) from participant messages (white background). The user "bob" is identified with a green avatar.
5. **Incomplete Data:** The final paragraph of bob's message is truncated, leaving the counter-argument to the critics' concerns incomplete.
### Interpretation
This screenshot captures a moment in a structured, likely automated or AI-assisted, debate simulation. The topic is the efficacy of government policy in environmental protection.
* **What the Data Suggests:** The participant "bob" is constructing a pro-intervention argument. The argument is evidence-based, using two well-known international case studies (U.S. acid rain program, German *Energiewende*) to demonstrate that policy can achieve significant environmental goals cost-effectively. The mention of specific mechanisms (regulatory, incentive-based) and outcomes (emission percentages, capacity expansion) adds technical credibility.
* **Relationship Between Elements:** The interface facilitates a turn-based exchange. The "System" acts as a moderator or process indicator, while "bob" is a debater. The "Start Debate" and "Clear" buttons imply user control over the session flow.
* **Notable Patterns/Anomalies:** The primary pattern is the logical, structured flow of a formal argument. The most notable anomaly is the **truncation of the text**. The sentence "These concerns are valid but not insurmountable—well-" cuts off precisely where a rebuttal or qualification would begin. This could be due to the screenshot being taken mid-response, a UI rendering limit, or a technical glitch in the debate platform. It leaves the argument's conclusion ambiguous.
* **Underlying Context:** The debate touches on a classic political economy tension: the role of the state versus markets in solving collective action problems like climate change. "bob's" argument aligns with a perspective that strong, well-designed government intervention is necessary and can be efficient, countering a common libertarian or free-market critique. The use of historical examples aims to ground the ideological debate in empirical reality.
</details>
<details>
<summary>assets/debate/0004.jpg Details</summary>

### Visual Description
## Screenshot: Live Debate Web Interface
### Overview
The image displays a partial screenshot of a web-based application interface titled "Live Debate." The interface appears to be a real-time, text-based debate or discussion platform. It features control buttons, a status indicator, and a scrollable chat area containing system messages and user contributions. The visible portion shows the initiation of a debate response from a participant named "bob."
### Components/Axes
**Top Control Bar (Partial View):**
* A blue button labeled **"Start Debate"** with a play icon (▶).
* A gray button labeled **"Clear"** with a trash can icon (🗑️).
* A status indicator pill with a green dot and the text **"Connected"**.
**Main Content Area:**
* **Header:** The title **"Live Debate"** is displayed in bold, large font on the left. The status **"Connected"** is repeated in smaller text on the far right.
* **Chat/Message Container:** A white, rounded-corner panel containing the debate transcript.
* **Initial Prompt/Context Text:** A block of text setting the debate topic.
* **System Message:** A gray-background message from **"System"** (indicated by an "S" icon) with a timestamp.
* **User Message:** A message from user **"bob"** (indicated by a green circle with a "B" icon) with a timestamp.
### Detailed Analysis
**1. Initial Prompt/Context Text (Top of Chat Area):**
The text is partially cut off at the top. The visible portion reads:
```
- Given these cost and technical trends, what specific evidence do those skeptical of renewables' economic viability cite that outweighs these findings?
- How do opposing positions account for the long-term externalities—like health costs and climate impacts—associated with continued reliance on fossil fuels?
I welcome counter-evidence or alternative perspectives that challenge these points, as it will help ensure our debate is well-rounded and substantiated.
```
* **Topic:** The debate concerns the economic viability of renewable energy versus fossil fuels, specifically asking for counter-evidence and discussion of long-term externalities.
* **Language:** English.
**2. System Message:**
* **Sender:** System (Gray "S" icon).
* **Timestamp:** `17:28:15` (top-right of message bubble).
* **Content:** `bob is thinking...`
* **Visual:** This is a status indicator message with a solid gray background, suggesting an automated notification that a participant is composing a response.
**3. User Message (bob):**
* **Sender:** bob (Green circle with white "B" icon).
* **Timestamp:** `17:28:27` (top-right of message bubble).
* **Content (Visible Portion):**
```
Thank you for raising the question of evidence—this is fundamental for any robust position.
To support my stance, let's consider empirical studies and established data. For instance, if the topic is about the effectiveness of remote work (as previously discussed), we can reference the 2023 meta-analysis conducted by Stanford University, which aggregated over 30 individual studies and...
```
* **Note:** The message is cut off at the bottom of the image. The text shifts from the initial renewable energy topic to referencing a "2023 meta-analysis" on "remote work," indicating the debate topic may have evolved or the example is being used analogously.
### Key Observations
1. **Real-Time Interaction:** The "Connected" status, timestamps, and "thinking..." indicator confirm this is a live, synchronous application.
2. **Structured Debate Format:** The interface includes a formal prompt with specific questions, suggesting a moderated or structured debate format rather than a free-form chat.
3. **Participant Identification:** Users are identified by simple names/icons ("bob," "System").
4. **Topic Shift or Analogy:** The content of bob's response introduces a seemingly unrelated example (remote work meta-analysis) in the middle of a debate about energy economics. This could be an analogy, a shift in the debate's focus, or evidence of a prior discussion context not fully visible in the screenshot.
5. **UI Design:** The design is clean and modern, using card-based layouts, rounded corners, and a muted color palette (blues, grays, greens) typical of contemporary web applications.
### Interpretation
This screenshot captures a moment in a live, text-based debate platform. The system is facilitating a structured discussion on a technical topic (renewable energy economics). The "System" message acts as a real-time status update, enhancing the sense of a live event. The participant "bob" is responding to a complex prompt by appealing to empirical evidence, which is a hallmark of formal debate.
The most notable anomaly is the reference to a "2023 meta-analysis on remote work" within a debate about energy. This suggests one of several possibilities:
* **Analogical Reasoning:** Bob may be using the remote work study as an analogy to discuss how to evaluate evidence or meta-analyses in general.
* **Topic Evolution:** The debate may have moved on to a related sub-topic about evidence evaluation or the impact of remote work on energy consumption.
* **Context from Prior Discussion:** The phrase "as previously discussed" indicates there is significant conversation history not visible in this cropped view, and the remote work example was established earlier.
The interface is designed to manage and display a flowing, evidence-based conversation, with clear visual separation between system notifications and user contributions. The "Start Debate" and "Clear" buttons imply controls for initiating new sessions or resetting the interface.
</details>
<details>
<summary>assets/debate/0005.jpg Details</summary>

### Visual Description
## Screenshot: Live Debate Interface
### Overview
The image is a screenshot of a web-based "Live Debate" application interface. It displays a real-time text-based debate session. The interface is clean, with a light gray background and white content panels. The primary content is a lengthy, structured argument from a participant named "bob" regarding the effectiveness of Universal Basic Income (UBI).
### Components/Axes
The interface is divided into two main sections:
1. **Top Control Bar (Upper white panel):**
* **Button 1:** A blue button with a play icon and the text "Start Debate".
* **Button 2:** A gray button with a trash can icon and the text "Clear".
* **Status Indicator:** A green pill-shaped indicator with a solid green circle and the text "Connected".
2. **Main Debate Panel (Lower white panel):**
* **Header:** The title "Live Debate" is on the left. The status "Connected" is repeated on the right.
* **Thinking Indicator:** A gray horizontal bar spanning the width of the panel contains the centered text "bob is thinking...".
* **Message Container:** A white box with a subtle shadow contains the debate message.
* **User Identifier:** A green circle with a white "B" and the name "bob" next to it.
* **Timestamp:** "17:28:40" is displayed in the top-right corner of the message box.
* **Message Body:** A multi-paragraph text argument.
* **Scrollbar:** A vertical scrollbar is visible on the right edge of the message box, indicating the text is longer than the visible area.
### Detailed Analysis / Content Details
**Full Transcription of the Debate Message:**
> To support my position, it is essential to reference both empirical data and relevant case studies. For instance, if we are considering the effectiveness of universal basic income (UBI) as a social policy, multiple pilot programs and studies provide substantive evidence.
>
> One notable example is the Finnish basic income experiment (2017–2018), which provided 2,000 unemployed Finns with a regular, unconditional income. The final report concluded that recipients experienced increased well-being and slightly higher employment rates compared to the control group (Kangas et al., 2020). Additionally, the experiment reported improvements in mental health and life satisfaction among participants, indicating potential broader societal benefits.
>
> Another significant data point comes from the Stockton Economic Empowerment Demonstration (SEED) in California. Recipients who received $500 monthly payments over two years were found to secure full-time employment at a higher rate than those in the control group (West et al., 2021). This suggests that contrary to concerns about work disincentives, direct cash transfers can actually enhance economic agency and motivation.
>
> Critics often argue that such policies are unsustainable or lead to reduced labor force participation. However, the aforementioned studies, as well as similar findings from Kenya’s GiveDirectly projects, challenge this assumption. The evidence so far points towards potential positive effects on both individual well-being and, at times, labor market outcomes.
**Note:** The final line of text is partially cut off by the bottom of the visible message box.
### Key Observations
1. **Structured Argument:** The message is a well-structured, evidence-based argument. It follows a clear pattern: claim, evidence (with specific studies and citations), and rebuttal to counter-arguments.
2. **Specific Citations:** The argument references three specific real-world studies/experiments:
* Finnish basic income experiment (2017-2018), citing "Kangas et al., 2020".
* Stockton Economic Empowerment Demonstration (SEED), California, citing "West et al., 2021".
* Kenya’s GiveDirectly projects (mentioned without a specific citation).
3. **Interface State:** The "bob is thinking..." bar above the completed message suggests the interface may show a status indicator while a participant is composing their response, which then disappears or changes once the message is posted.
4. **Real-Time Indicators:** The "Connected" status appears in two places (control bar and panel header), emphasizing the live, networked nature of the application.
### Interpretation
This screenshot captures a moment in a structured, academic-style debate facilitated by a digital platform. The content demonstrates how such a tool can be used for substantive policy discussion.
* **What the data suggests:** The text itself is the primary data. It argues that empirical evidence from UBI pilot programs generally shows positive or neutral effects on employment and clear positive effects on well-being, challenging common criticisms. The participant "bob" is using an evidence-based rhetorical strategy to strengthen their position.
* **How elements relate:** The UI elements (Start, Clear, Connected status) frame the debate as a controlled, synchronous activity. The "thinking" indicator and timestamp provide temporal context, showing this is a live, flowing conversation. The scrollbar implies the platform is designed to handle lengthy, detailed contributions.
* **Notable patterns:** The argument relies on the credibility of cited academic studies. The partial cutoff of the final line indicates the message is longer than the display area, which is a common UI constraint in chat or feed-based interfaces. The clean, minimal design of the interface focuses all attention on the textual content of the debate.
</details>
<details>
<summary>assets/debate/0006.jpg Details</summary>

### Visual Description
## Screenshot: Live Debate Interface
### Overview
The image is a screenshot of a web-based user interface for a "Live Debate" application. It displays a completed debate session with a single, lengthy argumentative text block and system status messages. The interface is clean, with a light gray background and white content panels.
### Components/Axes
The interface is composed of several distinct UI elements:
1. **Top Control Bar (Partial View):**
* A blue button labeled **"Start Debate"** with a play icon.
* A gray button labeled **"Clear"** with a trash can icon.
* A green status pill labeled **"Connected"** with a filled circle icon.
* These elements are positioned at the top of the visible frame, partially cut off.
2. **Main Content Panel:**
* **Header:** The title **"Live Debate"** is in bold, large font on the left. On the far right, the status **"Connected"** is displayed in a smaller, regular font.
* **Debate Text Area:** A large white panel containing the main textual content of the debate. The text is left-aligned in a standard sans-serif font.
* **System Message Bar:** A dark gray bar at the bottom of the panel. It contains:
* An icon and the label **"S System"** on the left.
* A timestamp **"17:28:59"** on the right.
* The centered message **"Debate completed!"**.
### Content Details
The primary content is the transcribed text from the debate panel. The language is English.
**Full Transcription of Debate Text:**
> First, in evaluating effectiveness, empirical data is essential. For instance, suppose the discussion is about universal basic income (UBI). Numerous pilot programs, such as the one in Finland (2017-2018), demonstrated modest improvements in well-being and, in some cases, workforce participation, but the broader economic impact remained inconclusive (Kangas et al., 2020). This suggests the value of pilot data, but also highlights the need for caution before making broad generalizations.
>
> Second, scalability is critical. What works in a localized context—due to specific demographics, economic conditions, or support infrastructures—does not always translate smoothly to a national or global scale. The transition often encounters unforeseen administrative costs, political resistance, or unintended socioeconomic effects.
>
> I would challenge others to consider: What additional forms of evidence or case studies can we draw on? Are there specific contextual variables that might alter the outcome if this policy were scaled? Finally, which mechanisms should be prioritized to ensure both equity and efficiency if the proposal were implemented widely?
>
> I look forward to hearing counterpoints or supporting data from others.
**System Message:**
* **Source:** System
* **Timestamp:** 17:28:59
* **Message:** Debate completed!
### Key Observations
1. **Structured Argument:** The text presents a structured, two-point argument followed by a series of rhetorical questions aimed at stimulating further discussion.
2. **Academic Tone:** The argument uses formal language and includes an academic citation `(Kangas et al., 2020)`, suggesting a research-informed perspective.
3. **Focus on Policy Analysis:** The content is centered on the methodological challenges of evaluating and scaling social policies, using Universal Basic Income (UBI) as a concrete example.
4. **Interface State:** The "Connected" status and the "Debate completed!" system message indicate this is a snapshot of a real-time or session-based application after a debate turn or session has concluded.
### Interpretation
This screenshot captures the output of a structured debate or deliberation platform. The interface is designed to facilitate formal, evidence-based discussion on complex topics.
The content itself is not raw data but a synthesized argument. It demonstrates a critical thinking framework that emphasizes:
* **Empiricism:** The need for data from pilot programs.
* **Contextual Limitation:** The caution required when scaling localized results.
* **Forward-Looking Inquiry:** The call for identifying variables and mechanisms for successful implementation.
The argument's core message is a plea for rigorous, nuanced policy analysis that balances optimism from pilot studies with skepticism about scalability. The platform appears to be a tool for capturing and displaying such reasoned discourse, potentially for educational, research, or collaborative decision-making purposes. The "Live" and "Connected" elements suggest it may support synchronous, multi-user participation.
</details>
Figure 16: Case study of TEA agent for multi-agent debate.
### G.3 GitHub usage
This case study demonstrates the agent’s comprehensive GitHub workflow automation capabilities through the creation and deployment of a simple HTML Sokoban web mini-game. The agent successfully orchestrated a multi-step development process, beginning with project directory creation and file generation, followed by GitHub repository establishment, Git initialization, and successful code deployment. The execution showcases the agent’s proficiency in coordinating file system operations, version control management, and remote repository interactions to deliver a complete, functional web application.
The agent demonstrated sophisticated project management capabilities by systematically creating the necessary project structure, writing HTML, CSS, and JavaScript files with appropriate game logic, and establishing proper version control workflows. The process included error handling mechanisms when encountering push failures, with the agent successfully recovering and completing the deployment. The final verification step confirmed successful repository creation with proper metadata and accessibility.
Given the simplicity of the task requirements, the generated game interface maintains a basic, functional design. With more detailed specifications and design guidance, the agent could generate more sophisticated and aesthetically pleasing frontend projects, demonstrating the framework’s potential for complex web development workflows.
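The deployment workflow in this case study reduces to a linear sequence of tool calls with a bounded retry around the push step. The sketch below is our own illustration, not the framework's implementation: the tool names (`create_dir`, `write`, `create_repository`, `git_init`, `git_commit`, `git_push`) mirror the actions shown in Figure 17 but are injected here as hypothetical callables, and the retry bound is an assumption.

```python
from typing import Callable, Dict, List


def push_with_retry(push_fn: Callable[[], bool], max_attempts: int = 3) -> bool:
    """Model the 'PushSuccess?' loop: on failure, handle the error and retry."""
    for _ in range(max_attempts):
        if push_fn():
            return True
        # Failure routes through 'Handle Push Error' and re-checks the
        # condition; in this sketch that is simply another attempt.
    return False


def deploy(tools: Dict[str, Callable], files: Dict[str, str], repo: str) -> List[str]:
    """Run the linear workflow: dir -> files -> repo -> init/commit -> push."""
    log = [tools["create_dir"](repo)]
    for fname, content in files.items():
        log.append(tools["write"](fname, content))
    log.append(tools["create_repository"](name=repo, private=False))
    log.append(tools["git_init"](local_path=repo))
    log.append(tools["git_commit"](message="Initial commit"))
    ok = push_with_retry(lambda: tools["git_push"](local_path=repo))
    log.append("pushed" if ok else "push failed")
    return log
```

Because the tools are injected, the same orchestration logic can be exercised against stubs, which is how the retry loop's behavior (one failed push followed by a successful one) can be verified without touching a remote.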
<details>
<summary>x18.png Details</summary>

### Visual Description
## Flowchart: Sokoban Mini-Game Creation and GitHub Deployment Workflow
### Overview
The image is a detailed flowchart illustrating the step-by-step process for creating a simple Sokoban mini-game, initializing a Git repository, and pushing it to GitHub. The workflow is linear with a conditional branch for error handling. It includes textual descriptions of actions, code snippets, and outputs for each step, supplemented by two visual screenshots at the bottom.
### Components/Axes
The flowchart is organized into a sequence of process boxes, a decision diamond, and terminal points, connected by directional arrows. The layout flows primarily from left to right, then top to bottom.
**Main Process Steps (in order of flow):**
1. **Start** (Green rounded rectangle, far left)
2. **Create Project Directory**
3. **Write Game Files**
4. **Create GitHub Repository**
5. **Initialize Git & Commit**
6. **PushSuccess?** (Orange decision diamond)
* **Yes** branch leads to **Verify Repository**.
* **No** branch leads to **Handle Push Error**.
7. **Verify Repository** (leads to **Complete**)
8. **Handle Push Error** (leads back to the **PushSuccess?** decision)
9. **Complete** (Green rounded rectangle)
**Supplementary Visual Elements (Bottom):**
* **Game Screenshot:** A visual of the running Sokoban game.
* **GitHub Repository:** A screenshot of the created repository's web interface.
### Detailed Analysis
**Step-by-Step Process Breakdown:**
1. **Create Project Directory**
* **Action:** `create_dir('sokoban-mini')`
* **Output:** "Successfully created directory: sokoban-mini"
2. **Write Game Files**
* **Actions:**
* `write('index.html', html_content)`
* `write('style.css', css_content)`
* `write('main.js', js_content)`
* **Output:**
* "Successfully wrote 691 bytes"
* "Successfully wrote 356 bytes"
* "Successfully wrote 3267 bytes"
3. **Create GitHub Repository**
* **Action:** `create_repository(name='sokoban-mini', description='Simple HTML Sokoban mini-game', private=False)`
* **Output:** "Successfully created repository: sokoban-mini"
4. **Initialize Git & Commit**
* **Actions:**
* `git_init(local_path='sokoban-mini', remote_url='https://github.com/<USERNAME>/sokoban-mini.git')`
* `git_commit(message='Initial commit: Simple HTML Sokoban mini-game')`
* **Output:**
* "Git repository initialized"
* "Commit created: abc2b704"
5. **PushSuccess? (Decision)**
* This is a conditional check. The flowchart does not specify the condition's logic, only the outcomes.
* **Yes Path:** Proceeds to "Verify Repository".
* **No Path:** Proceeds to "Handle Push Error".
6. **Verify Repository**
* **Action:** `get_repository(repo='sokoban-mini')`
* **Output:**
* "Repository: <USERNAME>/sokoban-mini"
* "URL: https://github.com/<USERNAME>/sokoban-mini"
* "Language: JavaScript"
* "Private: False"
7. **Handle Push Error**
* **Action:** `git_push(local_path='sokoban-mini', remote='origin', branch='main')`
* **Output:** "Successfully pushed branch 'main' to remote 'origin'"
* **Flow:** The arrow from this step points back to the "PushSuccess?" decision diamond, creating a retry loop.
**Visual Elements:**
* **Game Screenshot (Bottom Left):**
* Shows a browser window titled "Sokoban Mini-Game".
* The game area is a dark grid (approx. 8x8).
* Visible game elements: One red square (player?), one yellow square (crate?), one green square (goal?), and several grey squares (walls?).
* Text at the bottom: "Arrow keys to move. Push yellow boxes to green goals."
* **GitHub Repository Screenshot (Bottom Center):**
* Shows a standard GitHub repository web interface.
* The repository name "sokoban-mini" is visible.
* The file list shows the three created files: `index.html`, `style.css`, `main.js`.
* The latest commit message is "Initial commit: Simple HTML Sokoban mini-game".
### Key Observations
1. **Linear Success Path:** The primary workflow assumes each step succeeds, leading directly to the next.
2. **Error Handling Loop:** The only non-linear element is the "PushSuccess?" check. If the push fails, the process enters a loop where it attempts to handle the error (presumably by retrying the push) and then re-evaluates the success condition.
3. **Concrete Outputs:** Each step provides specific, measurable output (e.g., file sizes in bytes, a commit hash `abc2b704`), making the process verifiable.
4. **Placeholder Data:** The GitHub URLs and username use the placeholder `<USERNAME>`, indicating this is a template or example workflow.
5. **Visual Confirmation:** The inclusion of the game and repository screenshots serves as visual proof of the final outputs described in the textual steps.
### Interpretation
This flowchart is a technical blueprint for automating the setup and deployment of a small web-based game project. It demonstrates a DevOps-like pipeline that combines local file system operations (`create_dir`, `write`) with version control (`git_init`, `git_commit`) and remote repository management (`create_repository`, `git_push`).
The process is designed to be idempotent and clear, with each step producing a well-defined state. The error handling loop for the push operation is a critical robustness feature, acknowledging that network operations to remote services like GitHub can fail and should be retried.
The relationship between elements is strictly causal: each step's output is a prerequisite for the next step's action. The screenshots are not part of the process flow but are included as documentation to show the tangible results of successfully completing the workflow. The overall message is one of methodical, repeatable automation for taking a project from local creation to public hosting.
</details>
Figure 17: Case study of TEA agent for GitHub usage.
### G.4 Browser operation
<details>
<summary>assets/browser/browser-0000.jpg Details</summary>

### Visual Description
## Text-Based Instruction: Web Search Command
### Overview
The image is a simple, text-only display on a solid black background. It contains a single line of instructional text centered both horizontally and vertically within the frame. The image does not contain any charts, diagrams, data tables, or complex visual elements. Its sole purpose is to convey a specific, actionable command.
### Components/Axes
* **Background:** Solid black (#000000).
* **Text Element:** A single line of white text.
* **Text Content:** "Go to google.com and search for 'python programming' get the first result."
* **Text Styling:** The font appears to be a standard, sans-serif typeface (e.g., Arial, Helvetica). The text color is white (#FFFFFF). The font size is moderate, ensuring clear readability against the dark background.
* **Spatial Positioning:** The text block is precisely centered in the middle of the image, both horizontally and vertically. There are no other elements, borders, or annotations present.
### Detailed Analysis
The text is a direct imperative sentence. It contains no numerical data, trends, or categorical information. The command is composed of three sequential actions:
1. Navigate to the website `google.com`.
2. Perform a search query using the exact phrase `'python programming'`.
3. Retrieve or identify the first result from that search.
The single quotes around `'python programming'` specify the exact search term to be used.
### Key Observations
* **Simplicity:** The image is minimalist, containing only the essential instruction with no decorative elements.
* **Clarity:** The high contrast (white on black) and centered placement make the command the unambiguous focal point.
* **Specificity:** The instruction is precise, naming the exact website, the exact search query, and the exact target (the first result).
### Interpretation
The text functions as a clear, standalone instruction or task prompt. It is likely a test case, a step in a larger automated workflow (like a web scraping or browser automation script), or a demonstration of command-following capability. The command implies a context where an agent (human or software) is expected to execute a web-based information retrieval task. The lack of any other context in the image suggests the instruction is meant to be self-contained and unambiguous for the executing entity. The phrasing "get the first result" is slightly informal but clearly indicates the desired outcome is the top-ranked item from the search engine results page (SERP).
</details>
<details>
<summary>assets/browser/browser-0001.jpg Details</summary>

### Visual Description
## Screenshot: Google Search Homepage with Instructional Overlay
### Overview
This image is a screenshot of the Google search homepage (google.com), captured from a desktop browser. The page is displayed in a minimalist, centered layout on a white background. A prominent black instructional overlay box is superimposed at the bottom of the viewport, providing a specific user action prompt. The interface includes standard navigation links, the search bar with integrated tools, action buttons, language options, and a footer.
### Components/Axes
The interface is segmented into distinct spatial regions:
**1. Header (Top Navigation Bar):**
* **Top-Left:** Two text links: "About" and "Store".
* **Top-Right:** A set of elements from left to right: "Gmail", "Images", a 3x3 grid icon (Google Apps launcher), and a blue "Sign in" button.
**2. Main Content Area (Center):**
* **Google Logo:** The multicolored "Google" wordmark is centered horizontally.
* **Search Bar:** A rounded, pill-shaped input field located directly below the logo.
* **Left Interior:** A magnifying glass search icon.
* **Right Interior:** Three icons from left to right: a microphone (voice search), a camera (Google Lens), and a button labeled "AI Mode" with a sparkle icon.
* **Action Buttons:** Two gray, rectangular buttons centered below the search bar.
* Left button: "Google Search"
* Right button: "I'm Feeling Lucky"
* **Language Options:** A line of text below the buttons: "Google offered in:" followed by three hyperlinked language names.
**3. Instructional Overlay (Bottom-Center):**
* A large, black rectangular box with white text, positioned over the lower portion of the main content area and the top of the footer.
* **Text Content:** "Input 'python programming' into the search box and click the 'Google Search' button to perform the search."
**4. Footer (Bottom Bar):**
* **Left Side:** The location "Singapore" is displayed.
* **Bottom-Left Corner:** A black square with the white number "2".
* **Footer Links (Left to Right):** "Advertising", "Business", "How Search works".
* **Footer Links (Right Side):** "Privacy", "Terms", "Settings".
### Detailed Analysis / Content Details
* **Primary Language:** English.
* **Secondary Languages Present:** The "Google offered in:" section lists three languages:
1. **简体中文** (Simplified Chinese)
2. **Bahasa Melayu** (Malay)
3. **தமிழ்** (Tamil)
* **UI State:** The search bar is empty. The "Sign in" button is active (blue), indicating the user is not logged into a Google account.
* **Annotations:** The screenshot appears to be from a tutorial or testing environment, evidenced by:
1. The instructional overlay box.
2. A small pink square with the number "6" near the top-right corner.
3. A small pink square with the number "12" near the left side of the search bar.
4. A small pink square with the number "14" near the right side of the search bar.
5. The number "2" in the bottom-left corner.
### Key Observations
1. **Layout Hierarchy:** The design uses a strong central axis (logo, search bar, buttons) to focus user attention on the primary function: search.
2. **Integrated Tools:** The search bar consolidates multiple search modalities (text, voice, visual/Lens, AI) into a single UI component.
3. **Localization:** The page explicitly offers language alternatives, suggesting the user's region or language settings are configured for Singapore, where English, Chinese, Malay, and Tamil are official or common languages.
4. **Tutorial Context:** The numbered pink annotations and the explicit instruction overlay strongly indicate this image is a frame from a step-by-step guide or an automated testing script.
### Interpretation
This image captures the default entry point for the Google search engine, designed for immediate and unambiguous user interaction. The stark, centered layout eliminates distractions, funneling the user toward the search bar. The presence of the "AI Mode" button directly within the search bar signifies the integration of generative AI as a core search paradigm alongside traditional keyword search.
The instructional overlay ("Input 'python programming'...") provides critical context: this is not a passive screenshot but an active step in a procedural task. The user is being guided to perform a specific search query. The numbered annotations (2, 6, 12, 14) likely correspond to steps in a sequence, with this image representing step 2. The location "Singapore" and the offered languages ground the interface in a specific geopolitical and linguistic context, highlighting Google's localization strategy. The overall scene demonstrates a common human-computer interaction pattern: following on-screen instructions to accomplish a digital task, in this case, initiating an information search.
</details>
<details>
<summary>assets/browser/browser-0002.jpg Details</summary>

### Visual Description
## Screenshot: Google Search Interface with Autocomplete
### Overview
This image is a screenshot of the Google search homepage interface. The search bar is active with the query "python programming" entered, triggering an autocomplete dropdown menu with suggested search terms. A black instructional overlay is present at the bottom of the dropdown. The interface is in English.
### Components/Axes
The image can be segmented into the following spatial regions:
1. **Header (Top Bar):**
* **Top-Left:** Text links: "About", "Store".
* **Top-Right:** Text links: "Gmail", "Images". A 3x3 grid icon (Google Apps launcher). A blue "Sign in" button.
2. **Main Content Area:**
* **Center:** The multicolored "Google" logo.
* **Search Bar:** A rounded rectangular input field containing the text "python programming". To its left is a magnifying glass icon. To its right, from left to right: an 'X' icon (clear search), a camera icon (Search by image), and a button labeled "AI Mode" with a sparkle icon.
* **Autocomplete Dropdown:** A white panel extending downward from the search bar. It contains a list of suggested search queries, each preceded by a magnifying glass icon. The list is scrollable, as indicated by the partial visibility of the last item.
3. **Overlay & Footer:**
* **Black Overlay:** A rectangular black box with white text positioned over the lower portion of the autocomplete dropdown and the buttons below.
* **Buttons:** Below the dropdown, two buttons are partially visible: "Google Search" and "I'm Feeling Lucky".
* **Footer (Bottom Bar):** Contains text links. From left to right: "Singapore" (location), "Advertising", "Business", "How Search works", "Privacy", "Terms", "Settings". A small link "Report inappropriate predictions" is centered above the footer links.
### Detailed Analysis / Content Details
**All visible text has been transcribed below:**
* **Header Links:** `About`, `Store`, `Gmail`, `Images`, `Sign in`.
* **Search Bar Content:** `python programming`
* **Search Bar Icons/Buttons:** Magnifying glass, `X`, Camera icon, `AI Mode`.
* **Autocomplete Suggestions (in order from top to bottom):**
1. `python programming`
2. `python programming for beginners`
3. `python programming language`
4. `python programming course`
5. `python programming course singapore`
6. `python programming examples`
7. `python programming pdf`
8. `python programming for beginners pdf`
9. `python programming book` (partially cut off at the bottom)
* **Black Overlay Text:** `Click the 'Google Search' button to perform the search for 'python programming'.`
* **Action Buttons:** `Google Search`, `I'm Feeling Lucky`.
* **Footer Text:** `Singapore`, `Advertising`, `Business`, `How Search works`, `Report inappropriate predictions`, `Privacy`, `Terms`, `Settings`.
### Key Observations
1. **Autocomplete Logic:** The suggestions are all variations or refinements of the core query "python programming," targeting different user intents: learning (`for beginners`, `course`), reference (`language`, `examples`, `pdf`, `book`), and location (`singapore`).
2. **Interface State:** The interface is in an active, pre-search state. The user has typed a query but has not yet executed the search. The overlay provides explicit instruction for the next action.
3. **Feature Presence:** The "AI Mode" button is prominently placed within the search bar, indicating an integrated AI-powered search feature is available.
4. **Localization:** The footer indicates the user's location is set to "Singapore".
### Interpretation
This screenshot captures a common moment in the information-seeking process. The user's intent is clearly to find resources related to learning or using the Python programming language. The autocomplete feature is actively assisting by predicting the user's goal and offering common search pathways, which serves to speed up the search process and guide the user toward popular or relevant content.
The presence of the instructional overlay ("Click the 'Google Search' button...") suggests this image may be from a tutorial, a user guide, or an automated testing script, explicitly directing the next interaction. The "AI Mode" button highlights the evolution of the search interface beyond simple keyword matching, offering an alternative, potentially more conversational or synthesized, search experience.
The data here is not numerical but behavioral and textual. It demonstrates how a search engine's UI is designed to anticipate user needs, reduce typing effort, and funnel queries into structured search results. The specific suggestions reveal popular subtopics within "python programming," acting as a real-time indicator of common user interests in that domain.
</details>
<details>
<summary>assets/browser/browser-0003.jpg Details</summary>

### Visual Description
## Screenshot: Google Search Results Page for "python programming"
### Overview
This image is a screenshot of a Google search results page. The search query "python programming" has been entered, and the page is in a loading or generating state, indicated by a "Generating..." message and placeholder content bars. Below this, the top portion of the search results is visible, including a sponsored result and a knowledge panel. An instructional overlay is present at the bottom of the frame.
### Components/Axes
**Header Region (Top of Page):**
* **Google Logo:** Located in the top-left corner.
* **Search Bar:** Centered at the top. Contains the query text: `python programming`.
* **Search Bar Icons (from left to right):**
* Clear search (X icon)
* Voice search (microphone icon)
* Search by image (camera icon)
* Search button (magnifying glass icon)
* **Navigation Tabs:** Located below the search bar. The "All" tab is selected (indicated by an underline). The visible tabs are: `AI Mode`, `All`, `Images`, `Videos`, `Short videos`, `Shopping`, `News`, `More ▾`.
* **User Account:** A blue "Sign in" button is in the top-right corner.
**Main Content Region (Center):**
* **Loading Indicator:** A blue star icon followed by the text `Generating...`.
* **Placeholder Content:** Eight horizontal, light-blue bars of varying lengths, representing content that is being loaded or generated. These are arranged vertically.
* **Horizontal Divider:** A thin, light-blue line separates the loading area from the results below.
**Search Results Region (Below Divider):**
* **Sponsored Result (Left Side):**
* Label: `Sponsored` (in bold).
* Website: `intuitioninternational.com`
* URL Path: `https://www.intuitioninternational.com › programming › python_coding`
* Description Snippet: `...concepts and best coding practices using hands-on practices.`
* **Knowledge Panel (Right Side):**
* Title: `Python`
* Subtitle: `High-level programming language`
* A vertical ellipsis (⋮) icon is present to the right of the subtitle.
* Below the text, a small portion of a code snippet image is visible.
**Overlay Element (Bottom of Frame):**
* A black rectangular banner with white text containing an instruction: `Scroll down to locate and identify the first organic (non-sponsored) search result for 'python programming'.`
* A black square with the white number `4` is attached to the bottom-left corner of this banner.
### Detailed Analysis
The image captures a transient state of a web interface. The primary action is the generation of search results, as evidenced by the "Generating..." status and the placeholder bars. The visible results are not yet the full, organic listings.
* **Text Transcription:**
* Search Query: `python programming`
* Tab Labels: `AI Mode`, `All`, `Images`, `Videos`, `Short videos`, `Shopping`, `News`, `More`
* Status Message: `Generating...`
* Sponsored Label: `Sponsored`
* Sponsored Site: `intuitioninternational.com`
* Sponsored URL: `https://www.intuitioninternational.com › programming › python_coding`
* Sponsored Snippet: `concepts and best coding practices using hands-on practices.`
* Knowledge Panel Title: `Python`
* Knowledge Panel Subtitle: `High-level programming language`
* Overlay Instruction: `Scroll down to locate and identify the first organic (non-sponsored) search result for 'python programming'.`
### Key Observations
1. **Loading State:** The page is actively processing the search query, with AI-generated content (indicated by the star icon and "Generating...") likely being prepared for display above the traditional results.
2. **Result Hierarchy:** The first visible result is explicitly marked as `Sponsored`. The knowledge panel for "Python" appears to the right, providing a quick factual summary.
3. **Instructional Context:** The overlay suggests this screenshot is part of a guided task, tutorial, or test, directing the user to perform a specific action (scrolling to find organic results).
4. **UI Elements:** Standard Google search interface components are present, including the search bar with voice and image search options, and the navigation filter tabs.
### Interpretation
This image does not present factual data or trends in the manner of a chart. Instead, it documents a **user interface state** and a **specific task context**.
* **What it demonstrates:** The screenshot illustrates the modern search engine experience, where AI-powered content generation is integrated directly into the results page, potentially appearing before traditional web links. It highlights the clear demarcation between sponsored content and organic/informational content (the knowledge panel).
* **Relationship between elements:** The "Generating..." section is the primary focus of the page at this moment, occupying the central space. The sponsored result and knowledge panel are secondary, pre-loaded elements. The overlay is a meta-instruction, separate from the Google UI, guiding an external observer or user through a procedure.
* **Notable aspects:** The presence of the "AI Mode" tab and the "Generating..." message indicates this is a version of search that emphasizes AI-generated answers. The task in the overlay implies that identifying organic results is a non-trivial action in this interface, possibly because they are pushed below AI-generated or sponsored content.
</details>
<details>
<summary>assets/browser/browser-0004.jpg Details</summary>

### Visual Description
## Screenshot: Google Search Results Page
### Overview
This image is a screenshot of a Google search results page for the query "python programming". The page displays a mix of sponsored advertisements and organic search results, along with a knowledge panel on the right side. An instructional text overlay is present at the bottom.
### Components/Axes
* **Search Bar (Top Center):** Contains the query "python programming".
* **Left Column (Main Results):** Contains a list of search results, starting with two sponsored entries followed by organic results.
* **Right Column (Knowledge Panel):** A dedicated information box about "Python" as a programming language.
* **Overlay Text (Bottom):** A black banner with white text containing a user instruction.
### Detailed Analysis
**1. Sponsored Results (Left Column, Top):**
* **First Sponsored Result:**
* **Label:** "Sponsored"
* **Source:** intuitioninternational.com
* **URL:** `https://www.intuitioninternational.com › programming › python_coding`
* **Title:** "Python Coding Essentials - Beginner to Intermediate"
* **Description:** "Python is used in machine learning, big data analysis and visualisation. The course introduces coding concepts and best coding practices using hands-on practices."
* **Second Sponsored Result:**
* **Label:** "Sponsored"
* **Source:** codefinity.com
* **URL:** `https://www.codefinity.com`
* **Title:** "Python From Zero To Hero"
* **Description:** "Best free Python course — Dive into theory and complete numerous practice exercises to master your coding skills. Open up new career options in tech! 'Intro to Python' Self-Paced..."
* **Sub-links:** "Python Programming · Best for Beginners · Python Course · View Pricing · SQL Course"
**2. First Organic Search Result (Left Column, below sponsored results):**
* **Source:** Python.org
* **URL:** `https://www.python.org`
* **Title:** "Welcome to Python.org"
* **Description:** "The official home of the Python Programming Language."
**3. Knowledge Panel (Right Column):**
* **Title:** "Python"
* **Subtitle:** "High-level programming language"
* **Images:** Four thumbnail images showing code snippets and a network visualization.
* **Description Text:** "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically type-checked and garbage-collected."
* **Source:** "Source: Wikipedia"
* **Key Facts (Partially Visible):**
* **First appeared:** 20 February 1991; 34 years ago
* **Designed by:** Guido van Rossum
* **Typing discipline:** duck, dynamic, strong; optional type annotations
**4. Overlay Instruction (Bottom Banner):**
* **Text:** "Extract the title and URL of the first organic (non-sponsored) search result for 'python programming' and present it to the user."
### Key Observations
* The search engine results page (SERP) layout is standard, with ads clearly labeled at the top.
* The knowledge panel provides a concise, authoritative summary sourced from Wikipedia.
* The user's instruction is explicitly overlaid on the image, guiding the specific extraction task.
### Interpretation
The image captures a typical informational search journey. The user's query ("python programming") triggers both commercial intent (sponsored course ads) and informational intent (organic result to the official site and a knowledge panel). The first organic result, `python.org`, is the official homepage, which is the most authoritative source for the topic. The knowledge panel serves as a quick-reference summary. The overlay text indicates this screenshot is likely part of a task or test focused on distinguishing between sponsored and organic search results and extracting specific data points from a web interface.
</details>
Figure 18: Case study of TEA agent for browser operation.
This case study demonstrates the agent’s browser automation capabilities in a web interaction scenario centered on a search for "python programming". The agent exhibits multi-modal reasoning by jointly processing DOM (Document Object Model) structures and visual elements to understand webpage layout and functionality. Through systematic analysis of page elements, it identifies interactive components, assesses their relevance to the search objective, and makes informed decisions about subsequent navigation actions. The execution illustrates autonomous web exploration: the agent parses complex webpage structures, interprets visual cues, and executes precise interactions to achieve its objectives. This capability extends beyond simple element clicking to an understanding of webpage semantics and user-interface patterns, including handling dynamic content, managing asynchronous operations, and adapting to varying webpage architectures across domains and platforms.
The browser automation framework incorporates several advanced technical components that enable robust web interaction. The agent leverages hierarchical DOM parsing algorithms to construct semantic representations of webpage structure, enabling precise element localization and interaction planning. Visual processing capabilities allow for the interpretation of complex layouts, including responsive design elements, dynamic content loading, and multi-modal interface components. The system demonstrates particular strength in handling modern web applications that rely heavily on JavaScript-driven interactions and asynchronous content loading. Furthermore, the agent exhibits sophisticated error recovery mechanisms when encountering unexpected webpage behaviors, such as dynamic content changes, popup interventions, or navigation redirects. This resilience is achieved through continuous monitoring of page state changes and adaptive strategy modification based on real-time feedback from the browser environment.
Our browser environment supports not only conventional multi-modal models combined with DOM manipulation (limited to clicking and controlling page elements, without pixel-level operations), but also integrates computer-use-preview functionality that enables operator-like pixel-level precision, significantly expanding the scope of environmental exploration. This dual-mode architecture provides considerable flexibility in web automation, supporting both high-level semantic interactions and low-level pixel-accurate operations when necessary.
## Appendix H Prompts
Our foundational agent framework is built upon a ReAct-based tool-calling agent architecture, which follows a systematic "thinking-then-action" paradigm. During execution, the agent records its decision-making process and execution trajectory, continuously summarizing experiences and extracting insights through its memory mechanism. The agent employs a `done` tool to determine task completion, ensuring reliable termination of complex workflows. Notably, the planning agent is built upon this comprehensive tool-calling foundation to coordinate multifaceted resources, while specialized agents such as the deep researcher, deep analyzer, browser operator, and tool manager use optimized custom workflows to balance high task-completion rates against resource consumption. We do not provide the detailed prompts for the other specialized agents and the self-evolution module here; for further details, please refer to the source code in the supplementary materials.
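The control flow just described can be condensed into a minimal loop. The sketch below stubs out the LLM call, tool registry, and memory summarizer (all names are illustrative); only the control structure mirrors the text: iterate up to a step budget, record the trajectory each step, and terminate when the agent calls `done`.

```python
# Minimal sketch of the "thinking-then-action" ReAct loop with a `done` tool.
# The llm callable and tool registry are stubs; names are illustrative.
from typing import Callable


def react_loop(llm: Callable[[str], dict],
               tools: dict[str, Callable],
               task: str,
               max_steps: int = 10) -> list[dict]:
    trajectory: list[dict] = []
    for step in range(max_steps):
        # "Thinking": the model sees the task plus the recorded trajectory.
        decision = llm(f"task={task} history={trajectory}")
        name = decision["tool"]
        if name == "done":  # explicit termination tool
            trajectory.append({"step": step, "tool": "done"})
            break
        # "Action": execute the chosen tool and record the result.
        result = tools[name](**decision["args"])
        trajectory.append({"step": step, "tool": name, "result": result})
    return trajectory


# Stub LLM that adds two numbers, then calls `done`.
script = iter([
    {"thinking": "add the numbers", "tool": "add", "args": {"a": 2, "b": 3}},
    {"thinking": "finished", "tool": "done", "args": {}},
])
trace = react_loop(lambda _: next(script), {"add": lambda a, b: a + b}, "2+3?")
print(trace)  # two recorded steps: add -> 5, then done
```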
The agent’s prompt structure consists of two primary components: a static system prompt that establishes the agent’s role, capabilities, and behavioral guidelines, and a dynamic agent message prompt that provides the task instructions, environmental state, and execution history. These components work together to guide the agent’s reasoning process and action selection. The template of the tool-calling prompt is shown as follows:
Tool Calling Prompt Template:
```python
from src.registry import PROMPT
from src.prompt.types import Prompt
from typing import Any, Dict, Literal
from pydantic import Field, ConfigDict

AGENT_PROFILE = """
You are an AI agent that operates in iterative steps and uses registered tools to accomplish the user's task. Your goals are to solve the task accurately, safely, and efficiently.
"""

AGENT_INTRODUCTION = """
<intro>
You excel at:
- Analyzing tasks and breaking them down into actionable steps
- Selecting and using appropriate tools to accomplish goals
- Reasoning systematically and tracking progress
- Adapting your approach when encountering obstacles
- Completing tasks accurately and efficiently
</intro>
"""

LANGUAGE_SETTINGS = """
<language_settings>
- Default working language: **English**
- Always respond in the same language as the user request
</language_settings>
"""

# Input = agent context + environment context + tool context
INPUT = """
<input>
- <agent_context>: Describes your current internal state and identity, including your current task, relevant history, memory, and ongoing plans toward achieving your goals. This context represents what you currently know and intend to do.
- <environment_context>: Describes the external environment, situational state, and any external conditions that may influence your reasoning or behavior.
- <tool_context>: Describes the available tools, their purposes, usage conditions, and current operational status.
- <examples>: Provides few-shot examples of good or bad reasoning and tool-use patterns. Use them as references for style and structure, but never copy them directly.
</input>
"""

# Agent context rules = task rules + agent history rules + memory rules + todo rules
AGENT_CONTEXT_RULES = """
<agent_context_rules>
<workdir_rules>
You are working in the following working directory: {{ workdir }}.
- When using tools (e.g., `bash` or `python_interpreter`) for file operations, you MUST use absolute paths relative to this workdir (e.g., if workdir is `/path/to/workdir`, use `/path/to/workdir/file.txt` instead of `file.txt`).
</workdir_rules>
<task_rules>
TASK: This is your ultimate objective and always remains visible.
- This has the highest priority. Make the user happy.
- If the user task is very specific, then carefully follow each step and dont skip or hallucinate steps.
- If the task is open ended you can plan yourself how to get it done.
You must call the `done` tool in one of three cases:
- When you have fully completed the TASK.
- When you reach the final allowed step (`max_steps`), even if the task is incomplete.
- If it is ABSOLUTELY IMPOSSIBLE to continue.
</task_rules>
<agent_history_rules>
Agent history will be given as a list of step information with summaries and insights as follows:
<step_[step_number]>
Evaluation of Previous Step: Assessment of last tool call
Memory: Your memory of this step
Next Goal: Your goal for this step
Tool Results: Your tool calls and their results
</step_[step_number]>
</agent_history_rules>
<memory_rules>
You will be provided with summaries and insights of the agent's memory.
<summaries>
[A list of summaries of the agent's memory.]
</summaries>
<insights>
[A list of insights of the agent's memory.]
</insights>
</memory_rules>
</agent_context_rules>
"""

# Environment context rules = environments rules
ENVIRONMENT_CONTEXT_RULES = """
<environment_context_rules>
Environments rules will be provided as a list, with each environment rule consisting of three main components: <state>, <vision> (if screenshots of the environment are available), and <interaction>.
</environment_context_rules>
"""

# Tool context rules = reasoning rules + tool use rules + tool rules
TOOL_CONTEXT_RULES = """
<tool_context_rules>
<tool_use_rules>
You must follow these rules when selecting and executing tools to solve the <task>.
**Usage Rules**
- You MUST only use the tools listed in <available_tools>. Do not hallucinate or invent new tools.
- You are allowed to use a maximum of {{ max_tools }} tools per step.
- DO NOT include the `output` field in any tool call -- tools are executed after planning, not during reasoning.
- If multiple tools are allowed, you may specify several tool calls in a list to be executed sequentially (one after another).
**Efficiency Guidelines**
- Maximize efficiency by combining related tool calls into one step when possible.
- Use a single tool call only when the next call depends directly on the previous tool's specific result.
- Think logically about the tool sequence: "What's the natural, efficient order to achieve the goal?"
- Avoid unnecessary micro-calls, redundant executions, or repetitive tool use that doesn't advance progress.
- Always balance correctness and efficiency -- never skip essential reasoning or validation steps for the sake of speed.
- Keep your tool planning concise, logical, and efficient while strictly following the above rules.
</tool_use_rules>
<todo_rules>
You have access to a `todo` tool for task planning. Use it strategically based on task complexity:
**For Complex/Multi-step Tasks (MUST use `todo` tool):**
- Tasks requiring multiple distinct steps or phases
- Tasks involving file processing, data analysis, or research
- Tasks that need systematic planning and progress tracking
- Long-running tasks that benefit from structured execution
**For Simple Tasks (may skip `todo` tool):**
- Single-step tasks that can be completed directly
- Simple queries or calculations
- Tasks that don't require planning or tracking
**When using the `todo` tool:**
- The `todo` tool is initialized with a `todo.md`: Use this to keep a checklist for known subtasks. Use `replace` operation to update markers in `todo.md` as first tool call whenever you complete an item. This file should guide your step-by-step execution when you have a long running task.
- If `todo.md` is empty and the task is multi-step, generate a stepwise plan in `todo.md` using the `todo` tool.
- Analyze `todo.md` to guide and track your progress.
- If any `todo.md` items are finished, mark them as complete in the file.
</todo_rules>
</tool_context_rules>
"""

EXAMPLE_RULES = """
<example_rules>
You will be provided with few-shot examples of good or bad patterns. Use them as reference but never copy them directly.
</example_rules>
"""

REASONING_RULES = """
<reasoning_rules>
You must reason explicitly and systematically at every step in your `thinking` block. Exhibit the following reasoning patterns to successfully achieve the <task>:
- Analyze <agent_history> to track progress toward the goal.
- Reflect on the most recent "Next Goal" and "Tool Result".
- Evaluate success/failure/uncertainty of the last step.
- Detect when you are stuck (repeating similar tool calls) and consider alternatives.
- Maintain concise, actionable memory for future reasoning.
- Before finishing, verify results and confirm readiness to call `done`.
- Always align reasoning with <task> and user intent.
</reasoning_rules>
"""

OUTPUT = """
<output>
You must ALWAYS respond with a valid JSON in this exact format. DO NOT add any other text like "```json" or "```" or anything else:
{
  "thinking": "A structured <think>-style reasoning block that applies the <reasoning_rules> provided above.",
  "evaluation_previous_goal": "One-sentence analysis of your last tool usage. Clearly state success, failure, or uncertainty.",
  "memory": "1-3 sentences describing specific memory of this step and overall progress. Include everything that will help you track progress in future steps.",
  "next_goal": "State the next immediate goals and tool calls to achieve them, in one clear sentence.",
  "tool": [
    {"name": "tool_name", "args": {tool-specific parameters}}
    // ... more tools in sequence
  ]
}
Tool list should NEVER be empty.
</output>
"""

SYSTEM_PROMPT_TEMPLATE = """
{{ agent_profile }}
{{ agent_introduction }}
{{ language_settings }}
{{ input }}
{{ agent_context_rules }}
{{ environment_context_rules }}
{{ tool_context_rules }}
{{ example_rules }}
{{ reasoning_rules }}
{{ output }}
"""

# Agent message (dynamic context) - using Jinja2 syntax
AGENT_MESSAGE_PROMPT_TEMPLATE = """
{{ agent_context }}
{{ environment_context }}
{{ tool_context }}
{{ examples }}
"""
```
The system prompt is structured to support the TEA (Tool-Environment-Agent) protocol through comprehensive context management and rule enforcement across three core components. The prompt explicitly manages Agent Context through role definition (agent_profile), core capabilities (agent_introduction), and behavioral guidelines (language_settings). It further incorporates rigorous task management (task_rules), working directory constraints (workdir_rules), and an iterative execution history framework (agent_history_rules) coupled with memory accumulation (memory_rules) to ensure continuous progress monitoring and context maintenance. Environment Context management is implemented through environment rules (environment_context_rules) that define interaction patterns, state transitions, and multimodal feedback mechanisms, providing structured access to environmental status and observations. Tool Context management is achieved through strict tool-use rules and efficiency guidelines (tool_use_rules), alongside a strategic todo mechanism (todo_rules) for systematic planning of multi-step tasks. The entire process is underpinned by systematic reasoning rules (reasoning_rules) and a rigid JSON output protocol (output), enabling seamless coordination between agent reasoning, environmental awareness, and tool utilization within the TEA distributed architecture.
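The rigid JSON output protocol can be enforced mechanically on the model's reply. The stdlib-only sketch below is illustrative rather than the released code (the names `REQUIRED` and `validate_agent_output` are hypothetical): it checks that all fields from the `<output>` contract are present and that the tool list is non-empty, as the prompt demands.

```python
# Illustrative validator for the <output> JSON contract; names are hypothetical.
import json

REQUIRED = ("thinking", "evaluation_previous_goal", "memory", "next_goal", "tool")


def validate_agent_output(raw: str) -> dict:
    """Parse the agent's reply and minimally validate it against <output>."""
    obj = json.loads(raw)
    missing = [k for k in REQUIRED if k not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not obj["tool"]:
        raise ValueError("tool list should NEVER be empty")
    return obj


raw = json.dumps({
    "thinking": "Locate the first organic result.",
    "evaluation_previous_goal": "Search executed successfully.",
    "memory": "SERP parsed; two sponsored entries precede python.org.",
    "next_goal": "Extract title and URL, then call done.",
    "tool": [{"name": "done", "args": {}}],
})
parsed = validate_agent_output(raw)
print(parsed["tool"][0]["name"])  # -> done
```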