# SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning
Ruohao Li *1 , Hongjun Liu *2,4 , Leyi Zhao 5 , Zisu Li 3 , Jiawei Li 1 ,
Jiajun Jiang 1 , Linning Xu 6 , Chen Zhao 2,4 , Mingming Fan †1,3 ,
Chen Liang †1
1 The Hong Kong University of Science and Technology (Guangzhou) 2 New York University 3 The Hong Kong University of Science and Technology 4 NYU Shanghai 5 Indiana University 6 The Chinese University of Hong Kong
* Equal contribution. † Co-corresponding authors.
Correspondence:
rli777@connect.hkust-gz.edu.cn, lh3862@nyu.edu
## Abstract
Large language model (LLM) agents have shown remarkable reasoning abilities. However, existing multi-agent frameworks often rely on fixed roles or centralized control, limiting scalability and adaptability in long-horizon reasoning. We introduce SwarmSys, a closed-loop framework for distributed multi-agent reasoning inspired by swarm intelligence. Coordination in SwarmSys emerges through iterative interactions among three specialized roles (Explorers, Workers, and Validators) that continuously cycle through exploration, exploitation, and validation. To enable scalable and adaptive collaboration, we integrate adaptive agent and event profiles, embedding-based probabilistic matching, and a pheromone-inspired reinforcement mechanism, supporting dynamic task allocation and self-organizing convergence without global supervision. Across symbolic reasoning, research synthesis, and scientific programming tasks, SwarmSys consistently outperforms baselines, improving both accuracy and reasoning stability. These findings highlight swarm-inspired coordination as a promising paradigm for scalable, robust, and adaptive multi-agent reasoning, suggesting that coordination scaling may rival model scaling in advancing LLM intelligence.
## 1 Introduction
The strong reasoning and planning capabilities of Large Language Models (LLMs) have spurred interest in multi-agent systems. These systems use collaboration to enhance reasoning diversity and reliability. Frameworks such as AutoGen (Wu et al., 2023) enable agent-to-agent dialogue for multi-agent applications, while CAMEL (Li et al., 2023) leverages role-playing and inception prompting to facilitate autonomous cooperation. AutoGen Studio further provides a no-code interface for designing and debugging agent workflows (Dibia et al., 2024). However, most systems use fixed roles and
Figure 1: Comparison between paradigm-dependent multi-agent systems and SwarmSys. While existing methods rely on fixed, domain-specific agent paradigms, SwarmSys achieves scalable self-organization and crossdomain adaptability.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Diagram: Multi-Agent System Paradigms Comparison
### Overview
The image is a technical diagram comparing two architectural approaches for multi-agent systems. The top section illustrates a "Fixed Number of Backbone Agents" paradigm, while the bottom section presents a "Scalable Number of Backbone Agents" paradigm, specifically highlighting a system named "SwarmSys." The diagram uses a flowchart style with boxes, arrows, and labels to show the relationship between agent paradigms and task domains.
### Components/Axes
The diagram is divided into two distinct horizontal sections, each with a title bar.
**Top Section: Fixed Number of Backbone Agents**
* **Title Bar:** "Fixed Number of Backbone Agents" (white text on a teal background).
* **Paradigm Layer:** Three light blue boxes, each representing a specific paradigm:
1. "Cot+ Exam Paradigm" (left)
2. "Cot+ Research Paradigm" (center)
3. "Cot+ Coding Paradigm" (right)
* **Task Layer:** Three white boxes representing task domains:
1. "Exam" (left)
2. "Research" (center)
3. "Science Coding" (right)
* **Connections:** Dotted black lines connect each paradigm box to all three task boxes, indicating a many-to-many relationship where any paradigm can be applied to any task.
* **Legend/Label:** Below the task boxes, centered text reads: "Paradigm-dependent multi-agent systems" (in teal).
**Bottom Section: Scalable Number of Backbone Agents**
* **Title Bar:** "Scalable Number of Backbone Agents" (white text on a dark blue background).
* **Agent Cluster Layer:** Four boxes depicting possible configurations of agent swarms. The first three contain illustrations of interconnected nodes (circles) in red, blue, and green, labeled with small text (e.g., "Cot", "Tool", "Ref"). They are separated by the word "Or". The fourth box contains an ellipsis ("..."), indicating other possible configurations.
* **System Box:** A single blue box labeled "SwarmSys" is positioned below the agent cluster layer.
* **Task Layer:** The same three white task boxes from the top section are repeated: "Exam", "Research", "Science Coding".
* **Connections:** Solid black arrows flow from the agent cluster layer down to the "SwarmSys" box. From "SwarmSys", three solid black arrows point down to each of the three task boxes.
* **Legend/Label:** In the bottom-right corner, the text "SwarmSys" appears in orange.
### Detailed Analysis
The diagram presents a clear visual contrast between two system architectures.
**Top Section (Fixed Paradigm):**
* **Structure:** Hierarchical and rigid. A fixed set of three specialized paradigms (Exam, Research, Coding) is defined upfront.
* **Flow:** The dotted lines suggest a flexible but predetermined assignment. The system selects one of the fixed paradigms to apply to a given task (Exam, Research, or Science Coding). The paradigm is dependent on the task type.
* **Visual Trend:** The connections form a dense, crisscrossing web, emphasizing the combinatorial but fixed nature of the approach.
**Bottom Section (Scalable SwarmSys):**
* **Structure:** Dynamic and centralized. Instead of predefined paradigms, the system starts with a variable, scalable collection of backbone agents (represented by the clusters).
* **Flow:** These agents are organized or managed by a central system called "SwarmSys." SwarmSys then directs the collective capability towards the same three task domains. The "Or" connectors between agent clusters imply that the composition and number of agents can change.
* **Visual Trend:** The flow is linear and funnel-like: from many possible agent configurations -> to one managing system -> to multiple tasks. This suggests a unified, adaptive system that can reconfigure its agent pool to handle different tasks.
### Key Observations
1. **Task Consistency:** Both architectures are designed to address the exact same three task domains: "Exam," "Research," and "Science Coding." This provides a direct basis for comparison.
2. **Paradigm vs. System:** The top approach is defined by its *paradigms* (methodologies like "Cot+ Exam"). The bottom approach is defined by a *system name* ("SwarmSys"), implying the paradigm or strategy may be emergent or managed internally.
3. **Connection Semantics:** The use of dotted lines (top) versus solid arrows (bottom) is significant. Dotted lines often indicate a logical or potential connection, while solid arrows indicate a direct flow of control or data. This visually reinforces the fixed vs. scalable, assignment vs. management dichotomy.
4. **Scalability Representation:** Scalability is depicted not just by the title but by the "Or" sequence and the ellipsis box, explicitly showing that the agent pool is not fixed to three types but can vary in composition and number.
### Interpretation
This diagram argues for the advantages of a scalable, swarm-based multi-agent system (SwarmSys) over a traditional paradigm-dependent approach.
The **fixed paradigm model** is presented as a library of specialized tools. You pick the "Exam" tool for an exam task, the "Research" tool for a research task, etc. Its weakness, implied by the rigid structure, is a lack of flexibility; it requires pre-defining and maintaining separate paradigms for each domain, and may not adapt well to novel tasks outside these three.
The **SwarmSys model** proposes a more holistic and adaptive solution. Instead of pre-packaged paradigms, it maintains a dynamic pool of versatile backbone agents. The central "SwarmSys" layer acts as an orchestrator or meta-system that can dynamically configure these agents into an effective team for the task at hand, whether it's an exam, research, or science coding problem. The key innovation suggested is **unification and scalability**: one system (SwarmSys) manages a variable agent workforce to tackle multiple problem domains, potentially offering greater robustness, adaptability, and efficiency than maintaining three separate, fixed-paradigm systems. The diagram positions SwarmSys as a more advanced and flexible architectural evolution.
</details>
static communication, which limits their adaptability. This rigidity leads to redundant exploration and inefficiency, especially in long-horizon or dynamic tasks.
To address these limitations, recent work has explored adaptive or self-improving agent collaboration. AutoAgent enables natural-language-defined behaviors instead of hand-coded roles (Tang et al., 2025), while Mixture-of-Agents introduces layered coordination for more robust reasoning (Wang et al., 2024). However, these systems still rely on centralized orchestration or manually designed topologies, limiting scalability and long-term stability. Inspired by the decentralized intelligence of natural swarms, where simple signals regulate cooperation and task allocation (Bonabeau et al., 1999; Dorigo and Stützle, 2004), we propose SwarmSys,
Figure 2: Overall workflow of the SwarmSys collaborative reasoning process. Each task is decomposed into subevents handled by specialized agents: Explorers propose solution paths, Workers execute subtasks, and Validators ensure consistency. Agents iteratively perform debate-consensus cycles that update event profiles and reinforce effective reasoning strategies until convergence.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Diagram: SwarmSys Collaborative Reasoning Process
### Overview
The image is a detailed flowchart illustrating a multi-agent collaborative reasoning system named "SwarmSys." It depicts a process where different types of agents (Explorer, Worker, Validator) interact through initialization, task decomposition, parallel computation, debate, consensus, and validation to solve a problem, ultimately producing a final result. The diagram combines process flow arrows, agent icons, mathematical equations, and descriptive text blocks.
### Components/Axes
The diagram is organized into several interconnected regions:
1. **Left Column (Initialization & Input):**
* **Top-Left:** A cluster of agent icons labeled "# Agents" undergoes "Agent Random Initialization" to form "Agent Profiles."
* **Middle-Left:** A "Match" process connects the "Agent Profiles" to "Event Profiles."
* **Bottom-Left:** A box labeled "Target & Requirement" feeds into "E0 And E1(A List) Initialization," which also connects to the "Event Profiles." A separate box labeled "Exam" is positioned below this.
2. **Central Processing Flow:**
* A "Sub-Event" triggers the main reasoning loop.
* **Agent Types (Legend - Bottom Right):**
* **Explorer (Blue icon):** Responsible for task decomposition and geometric reasoning.
* **Worker (Red icon):** Performs computational tasks and independent derivations.
* **Validator (Cyan icon):** Validates results and checks cross-modal consistency.
* The flow shows an Explorer agent initiating a process, followed by Worker agents (001, 002) producing results. A "Debate & Consensus" phase occurs, leading to updated "Agent Profiles" and "Event Profiles."
3. **Right Column (Detailed Reasoning Steps):**
* A series of text blocks connected by arrows describe the sequential reasoning steps, involving mathematical equations and logical checks.
4. **Bottom Flow (Optimization & Output):**
* An "Optimization Loop" with "Execute" steps leads to a "Final Result," which is then checked by a "Validate" step from a Validator agent.
### Detailed Analysis
**Process Flow & Agent Roles:**
1. **Initialization:** Multiple agents are randomly initialized into profiles. These are matched with event profiles derived from a target/requirement and an "Exam" input.
2. **Task Decomposition (Explorer):** An Explorer agent suggests decomposing the task into two goals: "Find intersection, and locus." It formulates parameters `(t, θ)` to guide reasoning and initiates collaboration.
3. **Parallel Computation (Workers):**
* **Worker 001:** Evaluates the Explorer's plan and verifies its parametric form: `x = 1 + t/2, y = (√3/2)t`. It executes computation to form `1 + t/2 = cos θ`, `(√3/2)t = sin θ`, confirming that solving for `θ` gives the intersection points.
* **Worker 002:** Provides another geometric perspective by interpreting `C₁` as a line through `(1,0)` with direction `(cos α, sin α)`. It executes an independent geometric derivation for `α`, establishing `(1 + t/2)² + ((√3/2)t)² = 1`.
4. **Validation & Consensus (Validator & Debate):**
* A Validator cross-modally validates and refines the exploration, checking that the geometric result coincides with the analytic one.
* It validates cross-modal consistency by transforming Worker 001's analytic equation and checking alignment with Worker 002's geometric coordinates: `cos θ - (sin θ)/√3 = 1`, `(1,0)` and `(-1/2, √3/2)`.
* The system confirms both results and the debate are consistent, synthesizes agreement, and finalizes the current round.
5. **Output:** The process enters an optimization loop, executing steps until a "Final Result" is produced and validated.
**Mathematical Content Transcribed:**
* Parametric form: `x = 1 + t/2`, `y = (√3/2)t`
* Derived equations: `1 + t/2 = cos θ`, `(√3/2)t = sin θ`
* Geometric derivation: `(1 + t/2)² + ((√3/2)t)² = 1`
* Consistency check equation: `cos θ - (sin θ)/√3 = 1`
* Coordinate points: `(1,0)` and `(-1/2, √3/2)`
### Key Observations
* **Structured Collaboration:** The system explicitly separates roles (exploration, computation, validation) to manage complexity.
* **Cross-Modal Validation:** A critical step involves verifying that results from different reasoning methods (analytic vs. geometric) are consistent, enhancing reliability.
* **Iterative Refinement:** The process includes debate, consensus, and an optimization loop, indicating it is not a single-pass solution but an iterative, self-correcting system.
* **Mathematical Foundation:** The example problem being solved is geometric in nature, involving parametric lines, circles, and trigonometric identities to find intersections.
### Interpretation
This diagram outlines a sophisticated framework for distributed problem-solving using specialized AI agents. The "SwarmSys" model mimics human collaborative reasoning by dividing a problem, allowing parallel exploration of different solution paths (analytic and geometric), and then rigorously debating and validating the results before synthesis.
The core innovation appears to be the formalized **cross-modal validation** step, which acts as a robust error-checking mechanism. By requiring consistency between independently derived results, the system reduces the risk of localized errors propagating. The use of an "Exam" as an input suggests this system could be designed for educational or testing environments, where it might solve and explain complex problems.
The process emphasizes **explainability** and **verification** over mere output generation. Each step is documented and checked, making the reasoning trace transparent. This is particularly valuable for applications requiring high reliability, such as technical tutoring, scientific computing, or complex decision support systems, where understanding the "why" is as important as the "what." The framework is generalizable; while the example uses geometry, the agent roles and validation logic could be applied to other domains requiring multi-faceted reasoning.
</details>
a closed-loop framework that enables LLM agents to coordinate through lightweight, pheromone-like traces encoding contextual utility. This mechanism fosters self-organized collaboration, dynamically balancing exploration and convergence without centralized control.
Unlike debate or tree-based systems, SwarmSys allows coordination to emerge organically through iterative interaction and adaptive matching. Agents assume three roles (Explorers, Workers, and Validators), mirroring the division of labor in natural ant colonies. Explorers expand hypotheses, Workers refine and execute subtasks, and Validators ensure consistency, together forming continuous exploration-exploitation-validation cycles that drive decentralized convergence. A core innovation is the use of profiles as adaptive memory. Agent and event profiles evolve with ability embeddings, workload, and context, enabling embedding-based reallocation and balanced participation, analogous to ants redistributing across foraging sites. Moreover, SwarmSys employs a pheromone-inspired reinforcement process: validated traces strengthen future compatibility, while ineffective ones decay, forming a decentralized optimization loop that enhances efficiency and stability over time.
Evaluated across symbolic reasoning, research synthesis, and scientific programming tasks, SwarmSys outperforms baselines such as GPTSwarm (Zhuge et al., 2024), achieving up to 10.7% higher accuracy and 9.9% better sub-task correctness. Remarkably, a swarm of GPT-4o-based agents approaches GPT-5 performance, showing that scaling coordination can substitute for model scaling. Qualitative analyses reveal emergent behaviors (knowledge diffusion, specialization balance, and self-regularization) that are hallmarks of collective intelligence.
In summary, our contributions are threefold: (1) SwarmSys Framework: a closed-loop distributed multi-agent reasoning framework, inspired by swarm intelligence, that converges without centralized control. (2) Adaptive Coordination Mechanism: an embedding-based matching and pheromone-inspired reinforcement process enabling dynamic agent-event allocation, self-organized collaboration, and stable long-horizon reasoning. (3) Comprehensive Evaluation: extensive experiments across diverse reasoning tasks reveal consistent gains and emergent collective intelligence, showing that scaling coordination can rival scaling model capacity.
## 2 Methodology
## 2.1 Overview of SwarmSys
We present SwarmSys, a closed-loop collaborative framework for distributed problem solving. Unlike centralized orchestration or static task assignment, it converges through iterative matching → collaboration → update cycles, as shown in Figure 3. Upon receiving a new task, event profiles are instantiated, and candidate agents are retrieved via
Figure 3: Iterative cycle in SwarmSys: minimize f(θ) through the matching → collaboration → update cycle.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Diagram: Multi-Agent Collaborative Optimization Cycle
### Overview
The image displays a flowchart illustrating a cyclical, four-stage process for a multi-agent system engaged in an optimization task. The diagram uses a rectangular layout with four process boxes connected by directional arrows, indicating a continuous loop. The process involves agent matching, collaborative computation, profile updates, and task minimization.
### Components/Axes
The diagram consists of four rectangular boxes with rounded corners, each containing text, and four labeled arrows indicating the flow between them.
**1. Top-Left Box (Light Purple Fill)**
* **Text Content:** "Matching Retrieve agents {A, B, C}"
* **Position:** Top-left quadrant of the diagram.
**2. Top-Right Box (Light Green Fill)**
* **Text Content:** "Collaboration Compute θ_{t+1}"
* **Position:** Top-right quadrant of the diagram.
**3. Bottom-Right Box (Light Orange Fill)**
* **Text Content:** "Update Profiles → JSON"
* **Position:** Bottom-right quadrant of the diagram.
**4. Bottom-Left Box (Light Gray Fill)**
* **Text Content:** "Task: minimize f(θ) = θ² + 3 sin θ"
* **Position:** Bottom-left quadrant of the diagram.
**Flow Arrows & Labels:**
* **Arrow 1:** From the "Matching Retrieve agents" box to the "Collaboration Compute" box. **Label:** "propose".
* **Arrow 2:** From the "Collaboration Compute" box to the "Update Profiles" box. **Label:** "validate".
* **Arrow 3:** From the "Update Profiles" box to the "Task: minimize" box. **Label:** "update".
* **Arrow 4:** From the "Task: minimize" box back to the "Matching Retrieve agents" box. **Label:** "new round".
### Detailed Analysis
The diagram outlines a specific, repeating workflow:
1. **Agent Matching & Retrieval:** The cycle begins with a set of agents, explicitly labeled as {A, B, C}, being matched or retrieved for a task.
2. **Proposal & Collaboration:** These agents "propose" their inputs to a collaborative phase. Here, a computation is performed to determine a parameter for the next time step, denoted as θ_{t+1} (theta sub t-plus-1).
3. **Validation & Profile Update:** The result of the computation is "validated". Following validation, agent profiles are updated, with the output format specified as JSON.
4. **Task Execution & Iteration:** The updated information is used to "update" the core task. The task is a mathematical optimization problem: to minimize the function f(θ) = θ² + 3 sin θ. The completion of this task triggers a "new round," returning the process to the initial agent matching stage.
### Key Observations
* **Cyclical Nature:** The process is explicitly designed as an infinite loop, suggesting an iterative learning or optimization algorithm.
* **Multi-Agent System:** The use of "agents {A, B, C}" indicates a system where multiple entities collaborate.
* **Parameter Evolution:** The notation θ_{t+1} clearly indicates that the parameter θ evolves over discrete time steps (t, t+1, etc.).
* **Concrete Objective:** The task is not abstract; it is a well-defined, non-convex mathematical function (θ² + 3 sin θ) that the system aims to minimize.
* **Data Persistence:** The mention of "Profiles → JSON" implies that agent states or learned parameters are serialized and stored between cycles.
### Interpretation
This diagram represents the architecture of a **collaborative, iterative optimization algorithm**. It combines elements of multi-agent systems with numerical optimization.
* **What it demonstrates:** The system uses multiple agents (A, B, C) to explore or propose solutions for minimizing a complex function. Their proposals are aggregated or processed collaboratively to compute an updated parameter (θ_{t+1}). This update is validated and used to refine the agents' internal "profiles," which likely encode their strategy or knowledge. The core optimization task is then re-evaluated with this new parameter, and the loop continues, presumably converging toward a minimum of the function f(θ).
* **Relationships:** The flow shows a clear dependency: agent proposals drive collaborative computation, which informs profile updates, which in turn improve the execution of the main task. The "new round" arrow closes the loop, making the system adaptive and self-improving over time.
* **Notable Implications:** The function f(θ) = θ² + 3 sin θ has multiple local minima due to the sinusoidal term. This suggests the multi-agent approach may be designed to avoid local minima through diverse proposals. The JSON output indicates a focus on interoperability and logging, suitable for analysis or deployment in a larger software pipeline. The diagram abstracts away the specific algorithms for "Matching," "Collaboration," and "Validation," focusing instead on the high-level data and control flow between these modules.
</details>
embedding-based matching. Agents then enter collaborative rounds: Explorers propose decompositions, Workers execute subtasks after consensus, and Validators verify intermediate results. After each round, both agent and event profiles are updated and stored for transparency and traceability. These evolving profiles feed subsequent iterations, enabling self-organization and adaptation. As shown in our framework (see Figure 2), through repeated cycles we achieve high-quality solutions without a central controller, relying instead on distributed collaboration and profile-driven adaptation.
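The cycle described above can be sketched in a few lines of toy Python. Everything here is an illustrative assumption, not the paper's implementation: agents are reduced to scalar "skills", execution and validation are stubbed, and matching simply picks the highest-skill agent for a role.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    role: str                      # "explorer" | "worker" | "validator"
    skill: float                   # toy stand-in for a competence embedding
    history: list = field(default_factory=list)

def match(agents, role):
    """Retrieve the most compatible agent for a role (toy: highest skill)."""
    return max((a for a in agents if a.role == role), key=lambda a: a.skill)

def run_round(agents, subtasks):
    """One matching -> collaboration -> update pass over the sub-events."""
    results = []
    for sub in subtasks:
        worker = match(agents, "worker")      # matching
        answer = worker.skill * sub           # collaboration (stubbed)
        validator = match(agents, "validator")
        ok = validator.skill > 0.5 and answer > 0   # validation (stubbed)
        worker.history.append(ok)             # profile update
        results.append((sub, answer, ok))
    return results
```

In the real system, each stubbed step (decomposition, execution, validation) is performed by an LLM agent, and the profile updates feed the next round's matching.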
## 2.2 Profiles as Adaptive Memory Units
A core innovation of SwarmSys is the use of profiles as adaptive memory units that record both context and accumulated experience. Each agent profile maintains identifiers, role information, ability descriptions, workload status, and longitudinal performance history; the profile format is provided in Appendix A.4.1. Profiles evolve dynamically: embeddings and workload indicators are refreshed after each round, while historical success or failure influences future task matching. Agents thus behave as adaptive collaborators rather than stateless executors. Each event profile serves as a dynamic record of problem-solving. It contains task descriptions, dependency structures, metadata (e.g., composite or leaf type), progress logs, and participating agent lists. Over time, static task specifications grow into rich traces of reasoning.
Together, these profiles constitute SwarmSys's distributed memory: agent profiles encode competence and reliability, while event profiles track task evolution. Their joint updates ensure collaboration remains adaptive, interpretable, and transparent.
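To make the two profile types concrete, one plausible JSON-compatible shape is sketched below. Every field name is an illustrative assumption, not the paper's actual schema (which is given in Appendix A.4.1).

```python
import json

# Hypothetical agent profile: competence, availability, and history.
agent_profile = {
    "id": "agent-007",
    "role": "worker",
    "abilities": ["symbolic algebra", "numerical integration"],
    "workload": 0.3,                 # fraction of capacity currently in use
    "history": [{"event": "E12", "success": True}],
}

# Hypothetical event profile: a dynamic record of one sub-task.
event_profile = {
    "id": "E12",
    "description": "Find the intersection of C1 and C2",
    "type": "leaf",                  # "composite" or "leaf"
    "dependencies": ["E10"],
    "progress": "validated",
    "participants": ["agent-007"],
}

print(json.dumps(agent_profile, indent=2))
```

Serializing profiles to JSON after every round is what makes the reasoning trace inspectable and replayable.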
## 2.3 Embedding-Based Matching with Exploration-Exploitation Dynamics
The second key innovation is our embedding-based agent-event matching algorithm, which balances expertise, workload, and exploration.
Agent embeddings. Each agent A_i is represented by two embeddings capturing its ability and availability. The competence embedding is derived from the agent's declared abilities and historical performance, processed through an instruction-guided encoder that contextualizes the agent's prior experience for the current task. The availability embedding reflects workload and readiness, obtained from the agent's status signals. These two vectors are summed to form the agent's overall representation, combining long-term expertise with short-term availability cues; the detailed process is provided in Appendix A.4. This design allows the system to favor agents who are both skilled and currently underutilized.
Event embeddings. Each event E_j is encoded through an instruction-conditioned embedding that integrates its textual description, dependency relations, progress state, and milestone metadata; the method is described in Appendix A.1. This representation captures both the semantic meaning of the task and its structural role within the reasoning process. By embedding events and agents in a shared latent space, SwarmSys can estimate their compatibility in a continuous and scalable manner.
Compatibility and decision dynamics. The compatibility between an agent and an event is measured by normalized cosine similarity between their embeddings, ensuring a value between 0 and 1 for interpretability. To prevent premature convergence and encourage exploration, SwarmSys adopts a dynamic ε-greedy policy. Each agent explores new matches with probability ε_i and exploits high-compatibility matches otherwise. The exploration rate ε_i adapts to recent performance: agents with high average success explore less, while underperforming agents explore more. Empirically, we initialize ε_i around 0.15 to maintain minimal randomness and allow it to fluctuate within a small range (up to 0.35) depending on recent success. The values are designed based on natural ant behaviors (Lecheval et al., 2024). This ensures that exploration gradually decreases as the system stabilizes. During exploration, matches are sampled proportionally to similarity, enabling serendipitous but plausible pairings. During exploitation, a sigmoid-weighted sampling function emphasizes strong compatibility, controlled by a sharpness factor γ that modulates selectivity. The detailed behavioral derivation process is provided in Appendix A.4. This mechanism enables three properties: adaptivity through evolving embeddings, stability through probabilistic sampling, and robustness by balancing exploration with exploitation.
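A minimal sketch of this matching policy follows. The ε bounds (0.15 to 0.35) come from the text; the embeddings are stubbed with random vectors, and the linear ε-adaptation rule, the mean-centered sigmoid, and the value of γ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def compatibility(agent_vec, event_vec):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    cos = agent_vec @ event_vec / (
        np.linalg.norm(agent_vec) * np.linalg.norm(event_vec))
    return (cos + 1.0) / 2.0

def adapt_eps(success_rate, lo=0.15, hi=0.35):
    """Successful agents explore less; struggling agents explore more."""
    return lo + (hi - lo) * (1.0 - success_rate)

def pick_event(agent_vec, event_vecs, success_rate, gamma=8.0):
    """Dynamic epsilon-greedy match: explore proportionally to similarity,
    exploit via sigmoid-weighted sampling sharpened by gamma."""
    sims = np.array([compatibility(agent_vec, e) for e in event_vecs])
    if rng.random() < adapt_eps(success_rate):
        probs = sims / sims.sum()                     # exploration
    else:
        w = 1.0 / (1.0 + np.exp(-gamma * (sims - sims.mean())))
        probs = w / w.sum()                           # exploitation
    return rng.choice(len(event_vecs), p=probs)
```

Because both branches sample from a distribution rather than taking a hard argmax, even exploitation retains the bounded stochasticity that the text credits for stability.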
## 2.4 Pheromone-Inspired Optimization
Finally, SwarmSys incorporates a pheromone-inspired optimization process to refine allocation and solution quality. Each validated contribution reinforces the compatibility between an agent and an event, updating embeddings and increasing the likelihood of similar matches in future rounds. Idle or invalid matches, by contrast, receive no reinforcement and gradually decline in competitiveness as other profiles evolve, mimicking pheromone evaporation without explicit decay.
This implicit reinforcement-evaporation dynamic complements the exploration-exploitation policy. Exploration guarantees diversity and prevents deadlock, exploitation prioritizes promising matches, and bounded probabilities ensure stability. As Explorers, Workers, and Validators collaborate across rounds, SwarmSys converges to high-quality solutions while maintaining flexibility and resilience in search dynamics.
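One way to realize the implicit reinforcement described above is to nudge a validated agent's embedding toward the event it served, so similar events score higher in the next round; unreinforced pairs lose ground only relatively, which mimics evaporation without an explicit decay term. The update rule and learning rate below are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def reinforce(agent_vec, event_vec, validated, lr=0.1):
    """Pull the agent embedding toward the event embedding on validation.

    Invalid or idle matches are left untouched: their relative
    competitiveness erodes as other embeddings move (implicit evaporation).
    """
    if validated:
        return agent_vec + lr * (event_vec - agent_vec)
    return agent_vec
```

Because the update is a convex step toward the event vector, repeated validated matches monotonically raise the agent-event cosine similarity, which is exactly the "trail strengthening" effect the text describes.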
## 3 Experiment
## 3.1 Experiment Setting
We evaluate SwarmSys across three reasoning categories that collectively span symbolic computation, open-domain research synthesis, and scientific programming. All evaluations use dataset-specific metrics following their official definitions to ensure fair comparison.
Baselines Since agents show domain-specific strengths, we select the strongest baseline for each task category instead of using a uniform set. For exam-style reasoning, we compare against GPT-4o-based IO (direct LLM invocation) (OpenAI et al., 2024), CoT (Wei et al., 2022), CoT-SC (Wang et al., 2022), Self-Refine (Madaan et al., 2023), MultiPersona (Wang et al., 2023b), and GPTSwarm (Zhuge et al., 2024). For research tasks, we include general-purpose baselines (IO, CoT, Self-Refine) and deep research agents (Grok Deeper Search, DeepResearchAgent). For scientific programming, we use both general-purpose reasoning systems (Self-Refine, CoT) and domain-specific agents (GPTSwarm, DeepResearchAgent). We also report results from GPT-5 as an upper bound for single-agent performance.
Dataset Table 1 summarizes the four benchmarks, covering quantitative, analytical, and procedural reasoning. This diversity ensures that SwarmSys is evaluated across both discrete symbolic reasoning and open-ended research generation settings. More details of our dataset settings are shown in Appendix A.2.
Table 1: Overview of the four reasoning benchmarks used in our experiments. Each dataset differs in domain focus, reasoning type, and data format.
| Dataset | Focus | Reasoning | #Samples |
|-----------------------------------|----------------------------|-------------|------------|
| GaoKao Bench (Zhang et al., 2024) | Quantitative & Cross-domain | Symbolic | 800 |
| Omni-Math (Gao et al., 2024a) | Hard-Level Quantitative | Conceptual | 300 |
| DeepResearch (Du et al., 2025) | Scientific QA | Analytical | 200 |
| SciCode (Tian et al., 2024) | Computational | Procedural | 338 |
Metrics We evaluate SwarmSys on three reasoning categories, using the original domain-specific metrics and protocols from each dataset. (1) Exam-style reasoning tasks (Math Exam, STEM Mix, and Olympic Math) use Accuracy and Knowledge Coverage as defined in prior benchmark releases. For Math Exam and STEM Mix, we use subsets of GAOKAO Bench. For Olympic Math, we use a subset of Omni-Math, rearranged to match real-world exam formats. (2) Research-level reasoning tasks (DeepResearch Bench) employ composite metrics including RACE (Comprehensiveness, Depth, Instruction Following, Readability) and FACT (Citation Accuracy, Effective Citation). (3) Science Coding tasks (SciCode) are evaluated using Pass@Main and Pass@Sub metrics, capturing correctness at both task and sub-task levels. All metrics follow their dataset definitions to ensure faithful comparison and reproducibility.
## 3.2 Overall Results
Exam-style Reasoning. Table 2 shows that SwarmSys consistently outperforms all GPT-4o-based multi-agent baselines on both single- and multi-subject exams, achieving an average improvement of +12.5% Accuracy and +10.8% Coverage over GPTSwarm. While GPT-5 achieves the strongest absolute scores, SwarmSys-8 narrows the gap by over 70%, demonstrating that swarm-level cooperation can approach next-generation model performance without access to stronger backbones. Qualitatively, we observe that SwarmSys agents exhibit complementary specialization: Explorers diversify problem-solving strategies, while Validators efficiently prune redundant reasoning chains, yielding higher coverage with reduced variance.
Table 2: Performance comparison on Exam-style tasks (single-, multi-subject, and Olympic-level). Metrics: Accuracy (Acc.) and Knowledge Coverage (Cov.). SwarmSys-8 denotes a system of 8 agents (1 Explorer, 6 Workers, 1 Validator).
| Method | Math Exam | Math Exam | STEM Mix | STEM Mix | Olympic Math | Olympic Math | Average | Average |
|-----------------------|-------------|-------------|------------|------------|----------------|----------------|-----------|-----------|
| | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. |
| IO (GPT-4o) | 46.3 | 45.7 | 57.4 | 55.2 | 23.5 | 57.6 | 42.4 | 52.8 |
| CoT (GPT-4o) | 52.6 | 49.0 | 60.8 | 59.7 | 30.0 | 66.3 | 47.8 | 58.3 |
| CoT-SC (5-shot) | 63.2 | 62.4 | 64.7 | 62.8 | 28.3 | 51.7 | 52.1 | 59.0 |
| Self-Refine (GPT-4o) | 79.3 | 9.8 | 70.5 | 79.0 | 16.6 | 45.0 | 55.5 | 44.6 |
| MultiPersona (GPT-4o) | 52.4 | 51.4 | 60.9 | 62.3 | 35.0 | 68.3 | 49.4 | 60.7 |
| GPTSwarm † | 65.5 | 70.3 | 69.6 | 71.4 | 40.0 | 73.3 | 58.4 | 71.7 |
| GPT-5 | 87.2 | 90.6 | 87.7 | 89.1 | 31.2 | 70.5 | 68.7 | 83.4 |
| SwarmSys-8 (Ours) | 76.2 | 80.2 | 78.7 | 81.3 | 42.4 | 73.2 | 65.8 | 78.2 |
Table 3: Performance comparison on Research tasks (DeepResearch Bench). Metrics: Comprehensiveness, Depth, Instruction Following, and Readability.
| Method | Overall | Comp. | Depth | Inst. | Read. |
|--------------------|-----------|---------|---------|---------|---------|
| IO (GPT-4o-Search) | 30.7 | 27.8 | 20.4 | 41 | 37.6 |
| CoT (GPT-4o) | 29.3 | 29.5 | 22.8 | 33.5 | 36.8 |
| Self-Refine | 35.9 | 35.4 | 27 | 44.1 | 41 |
| DeepResearchAgent | 48.8 | 48.5 | 48.5 | 49.1 | 49.4 |
| Grok Deeper Search | 40.2 | 37.9 | 35.3 | 46.3 | 44 |
| SwarmSys-8 (Ours) | 42.5 | 39.6 | 38 | 50 | 46.3 |
Table 4: Performance comparison on Science Coding tasks (SciCode benchmark). Metrics: Pass@Main and Pass@Sub (percentages). † No longer available; metrics from SciCode report.
| Method | Pass@Main | Pass@Sub |
|-------------------------|-------------|------------|
| IO (GPT-4o) | 2 | 28.3 |
| OpenAI o3-mini-medium † | 9.2 | 34.4 |
| CoT-SC (GPT-4o) | 8.8 | 28.7 |
| Self-Refine | 10 | 33.3 |
| GPTSwarm | 8.6 | 29.2 |
| SwarmSys-14 (Ours) | 12.5 | 45.2 |
Research-level Reasoning. As shown in Table 3, SwarmSys surpasses Grok Deeper Search in overall RACE score (+2.3%) and instruction-following (+3.7%), reflecting the benefit of distributed role specialization in literature synthesis and factual consolidation. SwarmSys achieves especially large gains in comprehensiveness and readability, suggesting that swarm debates improve global coherence even in open-ended research generation tasks.
Scientific Programming. In Table 4, SwarmSys demonstrates notable improvements on SciCode: +2.5% Pass@Main and +11.9% Pass@Sub over the best GPT-4o baselines. These results indicate that dynamic role coordination and progressive refinement effectively decompose complex computational problems. Interestingly, Pass@Sub improvements are more pronounced, supporting our hypothesis that swarm collaboration benefits from modular code generation and localized validation.
## 3.3 Ablation Study
We ablate the design of SwarmSys along three axes: (a) removing dynamic profile updates and embedding-based matching (Rand-Roles); (b) removing both roles and matching, leaving homogeneous random assignment (Rand-NoRoles); and (c) varying the number of agents A from 4 to 32 (Table 5), while ensuring that each event contains at least one Explorer, one Worker, and one Validator. All variants use identical backbones and decoding parameters. Results reveal three trends: (1) Removing roles leads to a substantial decline in both accuracy and coverage (e.g., 56.3% → 43.2% accuracy and 52.4% → 41.4% coverage at A=4), confirming that cooperative debate and functional specialization are essential for maintaining diversity without redundancy. (2) Adaptive matching and profiling further enhance performance: embedding-based matching increases coverage by up to +27.3% by aligning agent capabilities with task semantics. (3) Scaling saturates around A=14: while both metrics improve with more Workers, gains plateau beyond 14 agents, indicating that agent capacity saturates once the subtask granularity is fully covered.
## 3.4 Swarm Effect: Emergent Collective Intelligence
Our design philosophy centers on the Swarm Effect: a collection of properly coordinated, limited
Table 5: Ablation on Exam task under varying numbers of Agents ( A ). Metrics: Accuracy (Acc., %), Knowledge Coverage (Cov., %).
| Method | A =4 | A =4 | A =8 | A =8 | A =14 | A =14 | A =20 | A =20 | A =32 | A =32 |
|--------------|--------|--------|--------|--------|---------|---------|---------|---------|---------|---------|
| | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. |
| Rand-NoRoles | 43 . 2 | 41 . 4 | 42 . 9 | 42 . 1 | 44 . 1 | 43 . 4 | 44 . 3 | 43 . 2 | 43 . 8 | 42 . 3 |
| Rand-Roles | 56 . 3 | 52 . 4 | 58 . 2 | 54 . 8 | 58 . 5 | 56 . 1 | 58 . 3 | 56 . 0 | 57 . 3 | 55 . 9 |
| SwarmSys | 74.3 | 79.7 | 76.2 | 80.2 | 77.3 | 81.0 | 77.2 | 81.3 | 76.5 | 80.8 |
Figure 4: Swarm reasoning trajectory on MathExam. Explorers initiate sub-tasks, Workers debate and revise alternative methods, and Validators enforce cross-checks across rounds. The swarm collectively converges to consistent solutions through debate-driven consensus formation.
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Diagram: Multi-Stage Collaborative Decision-Making Process
### Overview
The image is a technical diagram illustrating a three-stage, iterative process for collaborative problem-solving or method validation. It depicts a network of agents (represented by nodes) that engage in debate, revision, and verification to reach a consensus. The flow progresses from left to right, indicated by bold black arrows between stages.
### Components/Axes
The diagram is composed of three distinct stages, each containing a network graph of nodes and associated text boxes.
**Node Types & Labels:**
* **E (Blue Node):** Appears in all stages. Labeled with a blue text box: "Potential sub-task & working direction".
* **W1, W2, W3 (Red Nodes):** Appear in all stages. Associated with pink text boxes describing methods and debate.
* **V (Teal Node):** Appears in all stages. Associated with a teal text box describing verification and debate.
**Text Box Labels (by Color and Position):**
* **Blue (Top-Left of each stage):** "Potential sub-task & working direction"
* **Pink (Various positions):**
* Stage 1: "Output method 1 & debate", "Method 2 (perceived wrong) & debate", "Method 3 (actually wrong) & debate"
* Stage 2: "Revised method1 & debate", "Revised method2 & debate", "Agreed on other methods"
* Stage 3: (No standalone pink boxes; text is integrated into the yellow consensus box)
* **Teal (Top-Right of Stages 1 & 2):**
* Stage 1: "Check & participate in debate"
* Stage 2: "Verify Method 1&2 are right & debate"
* **Yellow (Bottom-Right of Stage 3):** "Consensus reached after this round, output method1 & result"
**Flow Indicators:**
* **Solid Black Arrows:** Connect the three major stages, indicating the overall process flow from left to right.
* **Dashed Lines:** Connect nodes within each stage, representing communication, debate, or influence pathways between agents.
### Detailed Analysis
**Stage 1 (Left): Initial Proposal and Debate**
* **Network Structure:** Node E is connected to W1, W2, W3, and V. Nodes W1, W2, and W3 are also interconnected. V is connected to E and W1.
* **Process State:** An initial sub-task (from E) is presented. Three methods are proposed by the W-nodes. Method 1 is output for debate. Method 2 is perceived as wrong, and Method 3 is identified as actually wrong, both subject to debate. Node V's role is to check and participate in the debate.
**Stage 2 (Middle): Revision and Verification**
* **Network Structure:** The network becomes denser. All nodes (E, W1, W2, W3, V) are now interconnected with each other, forming a near-complete graph.
* **Process State:** Methods are revised based on debate. "Revised method1" and "Revised method2" are under debate. There is agreement on "other methods." Node V's role shifts to actively verifying that Methods 1 and 2 are correct, continuing the debate.
**Stage 3 (Right): Consensus**
* **Network Structure:** The nodes are visually clustered tightly together, with many overlapping connections, symbolizing unified agreement.
* **Process State:** A consensus has been reached. The final output is "method1 & result," as stated in the yellow box.
### Key Observations
1. **Increasing Connectivity:** The density of dashed connections between nodes increases dramatically from Stage 1 to Stage 2, visually representing the intensification of collaboration and debate.
2. **Role Evolution:** The function of node V evolves from passive participation ("Check & participate") to active verification ("Verify... are right").
3. **Method Refinement:** The process explicitly shows methods being labeled as wrong (Stage 1), then revised (Stage 2), leading to a final agreed-upon output (Stage 3).
4. **Spatial Grounding of Consensus:** The final consensus stage is marked not only by text but by a distinct visual change—the tight clustering of all nodes and the unique yellow color of the outcome box, setting it apart from the process-oriented pink and teal boxes.
### Interpretation
This diagram models a **structured deliberation protocol** for multi-agent systems or collaborative teams. It demonstrates a Peircean investigative process where:
* **Abduction:** Initial hypotheses (methods) are generated from a problem statement (sub-task from E).
* **Deduction:** These hypotheses are logically challenged and tested through debate and verification (roles of W-nodes and V).
* **Induction:** Through iterative revision and agreement, a reliable conclusion (consensus on method1 & result) is formed.
The underlying principle is that truth or the best solution emerges from the crucible of adversarial collaboration. The "perceived wrong" and "actually wrong" labels in Stage 1 highlight the system's ability to distinguish between subjective disagreement and objective error through debate. The final output isn't necessarily the first or most popular method, but the one that survives rigorous, collective scrutiny. This process is critical for high-stakes decision-making, complex problem-solving, or scientific validation where individual bias must be overcome by structured group intelligence.
</details>
agents can collectively approximate or surpass a stronger single-agent model. Figure 4 visualizes how SwarmSys (GPT-4o backbone) gradually approaches the performance of stronger models such as GPT-5 and DeepResearchAgent as the swarm size increases.
Emergent Performance Scaling. Quantitatively, as the number of active agents increases from A=4 to A=14, both Accuracy and Knowledge Coverage improve consistently (e.g., from 74.3/79.7 to 77.3/81.0 on the Exam task). However, further expansion to A=20 or A=32 yields negligible gains (within <1% difference), indicating that agent capacity saturates once subtask diversity and the knowledge space have been sufficiently covered. This scaling curve mirrors real-world swarm systems, where the marginal utility of additional workers decreases after local niches become saturated.
Collaborative Dynamics. Unlike conventional multi-agent ensembles that rely on static voting, SwarmSys agents interact through pheromone-inspired event matching and debate-driven validation. Explorers dynamically propose new subgoals, Workers attempt partial solutions, and Validators consolidate outcomes based on collective memory. This dynamic feedback loop creates a self-organizing division of labor where each agent adapts its behavior not from external commands but through the evolving swarm state. Empirically, this leads to: (i) higher diversity in reasoning paths, and (ii) smoother convergence across reasoning rounds.
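The match-then-reinforce loop described above can be sketched in a few lines. This is an illustrative sketch only: the function names (`match_agent`, `reinforce`), the softmax temperature `tau`, and the evaporation/deposit constants are our assumptions, not the paper's implementation.

```python
import math
import random

def _cosine(u, v):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_agent(event_emb, agents, pheromone, tau=1.0):
    """Probabilistically assign an event to one agent.

    Scores combine embedding similarity with a pheromone weight;
    a softmax over scores keeps assignment stochastic, so less
    reinforced agents are still occasionally explored.
    `agents` is a list of (agent_id, embedding) pairs.
    """
    scores = [_cosine(event_emb, emb) * pheromone[aid]
              for aid, emb in agents]
    weights = [math.exp(s / tau) for s in scores]
    r, acc = random.random() * sum(weights), 0.0
    for (aid, _), w in zip(agents, weights):
        acc += w
        if r <= acc:
            return aid
    return agents[-1][0]

def reinforce(pheromone, agent_id, accepted, rho=0.1, deposit=0.5):
    """Ant-colony-style update: evaporate all trails, then deposit
    on the chosen agent if its contribution was accepted."""
    for aid in pheromone:
        pheromone[aid] *= (1.0 - rho)
    if accepted:
        pheromone[agent_id] += deposit
```

Evaporation bounds each trail, so early winners cannot dominate forever; deposits on accepted contributions bias, but do not fix, future matching.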
Knowledge Diffusion and Self-Regularization. We further observe that intermediate reasoning traces in SwarmSys display emergent knowledge diffusion: factual entities, equations, or hypotheses discovered by one agent are reused, revised, or even corrected by others without explicit synchronization. This effect increases factual precision while maintaining interpretability. In qualitative analyses, the system exhibits an implicit regularization behavior: weaker agents' errors are diluted by consensus mechanisms, preventing local hallucinations from dominating the global output.
From Coordination to Intelligence. The Swarm Effect demonstrates that collective intelligence is not a linear function of model size, but an emergent property of structured interaction. While individual agents are limited by GPT-4o, the swarm collectively builds a distributed memory and decision space, enabling generalization beyond any single agent. This property highlights a promising direction for future large-scale reasoning systems: scaling through coordination rather than parameter count.
Figure 5: Evolution of communication topology in SwarmSys. The system evolves from a centralized hub-spoke structure to a distributed small-world mesh, where workers and validators interconnect for efficient consensus and information reuse.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Diagram: Two-Stage System Architecture Process Flow
### Overview
The image is a technical diagram illustrating a two-stage process flow for a distributed system. It depicts a transition from a centralized exploration phase to a distributed consensus phase, using icons and labeled components to represent different roles and their interactions. The diagram is divided horizontally into two main sections, each with a distinct title bar.
### Components/Axes
**Title Bars:**
- Top Section Title: "Stage I: Centralized Exploration Phase" (white text on a dark blue-gray background).
- Bottom Section Title: "Stage II: Distributed Consensus Phase" (white text on a dark blue-gray background).
**Icons & Labels (with approximate color descriptions):**
- **Explorer:** Represented by a blue circular icon containing a stylized ant. Appears in both stages.
- **Event 1, Event 2, Event 3:** Represented by rectangular boxes with dashed borders. Appear in both stages.
- **Worker Cluster:** Represented by a red circular icon containing a stylized ant. Appears only in Stage I.
- **Validator_V1:** Represented by a teal circular icon containing a stylized ant. Appears only in Stage I.
- **Worker Mesh:** Represented by a red circular icon containing a stylized ant. Appears only in Stage II.
- **V1, V2, V3:** Represented by teal circular icons containing a stylized ant. Appear only in Stage II.
- **Explorers (plural):** Represented by a blue circular icon containing a stylized ant. Appears only in Stage II.
**Flow Indicators:**
- Solid black arrows indicate the direction of process flow or data transmission.
- Double-headed arrows (↔) indicate bidirectional communication or interaction.
### Detailed Analysis
**Stage I: Centralized Exploration Phase**
1. **Flow:** A single "Explorer" (blue) at the top connects via downward arrows to three parallel "Events" (Event 1, Event 2, Event 3).
2. The three Events then connect via downward arrows that converge into a single "Worker Cluster" (red).
3. The Worker Cluster connects via a single downward arrow to a single "Validator_V1" (teal).
4. **Spatial Layout:** The Explorer is centered at the top. The three Events are arranged horizontally in a row below it. The Worker Cluster is centered below the Events, and the Validator_V1 is centered at the bottom of this stage.
**Stage II: Distributed Consensus Phase**
1. **Flow:** The three "Events" (Event 1, Event 2, Event 3) are arranged horizontally and connected to each other with double-headed arrows (↔), indicating peer-to-peer communication.
2. The three Events connect via downward arrows that converge into a single "Worker Mesh" (red).
3. The Worker Mesh connects via multiple downward arrows to four components arranged in a horizontal row at the bottom: "V1" (teal), "V2" (teal), "V3" (teal), and "Explorers" (blue).
4. **Bidirectional Communication:** Double-headed arrows (↔) connect V1↔V2, V2↔V3, and V3↔Explorers, indicating a mesh network of communication among the validators and explorers.
5. **Spatial Layout:** The three interconnected Events are centered at the top of this stage. The Worker Mesh is centered below them. The four bottom components (V1, V2, V3, Explorers) are evenly spaced in a horizontal row.
### Key Observations
1. **Structural Shift:** The architecture evolves from a linear, top-down hierarchy in Stage I (Explorer → Events → Cluster → Single Validator) to a networked, peer-inclusive model in Stage II (Interconnected Events → Mesh → Multiple Validators & Explorers).
2. **Role Proliferation:** A single "Validator_V1" in Stage I becomes three distinct validators (V1, V2, V3) in Stage II. The singular "Explorer" also becomes a plural "Explorers" entity in the final layer of Stage II.
3. **Communication Pattern:** Communication changes from strictly unidirectional (downward arrows) in Stage I to a mix of unidirectional (Events to Mesh) and bidirectional (between Events, and among Validators/Explorers) in Stage II.
4. **Component Renaming:** The central processing unit changes from "Worker Cluster" to "Worker Mesh," suggesting a change in internal architecture from a clustered to a mesh-based topology.
### Interpretation
This diagram illustrates a conceptual model for scaling and decentralizing a system, likely for tasks involving event processing, validation, and exploration (e.g., in distributed computing, blockchain, or multi-agent systems).
- **Stage I** represents a **centralized, exploratory bootstrapping phase**. A single explorer identifies or generates events, which are processed by a centralized cluster and validated by a single entity. This is efficient for initial setup or data gathering but creates single points of failure (the Explorer, the Cluster, the Validator).
- **Stage II** represents the **mature, decentralized operational phase**. The events themselves form a communicative network. The processing layer becomes a "Mesh," implying more resilient and distributed coordination. Most critically, validation and exploration are distributed across multiple nodes (V1, V2, V3, Explorers) that communicate directly with each other. This creates a robust consensus mechanism where no single entity is authoritative, enhancing fault tolerance and security.
The progression suggests a design philosophy where a system begins with centralized control for efficient exploration and learning, then transitions to a distributed consensus model for resilient, scalable, and trustless operation. The persistence of the "Explorer" role in both stages, and its integration into the final validator mesh, indicates that continuous discovery or data input remains a core function even in the decentralized state.
</details>
## 4 Qualitative Analysis
We further analyze how reasoning quality emerges and where it breaks down within SwarmSys. This section covers two aspects: (1) Agent Behavior Analysis, examining coordination patterns via profile dynamics and interaction topology; and (2) Error Analysis, identifying failure modes of pheromone-based optimization and consensus.
## 4.1 Agent Behavior
To understand how coordination arises from the matching-collaboration-update cycle, we analyze profile adaptation, interaction topology, and contribution balance using our experiment's event logs.
Profile Adaptation. We track the drift of each agent's competence embedding $v_a^{(i)}$. The mean cosine shift per round is 0.14 ± 0.03, showing steady self-adjustment as agents gain experience. Explorers exhibit the largest variance, indicating ongoing hypothesis exploration, while Validators remain stable, preserving reasoning coherence. This confirms that profile updates act as a distributed memory enabling specialization.
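The drift statistic above can be computed directly from per-round embedding snapshots. A minimal sketch, assuming drift is measured as the cosine distance between consecutive rounds (the exact definition is not given in the text):

```python
import math

def _cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_cosine_shift(trajectory):
    """Mean per-round drift of one agent's competence embedding,
    measured as 1 - cos(v_t, v_{t+1}) over consecutive rounds.
    `trajectory` is a list of embedding vectors, one per round.
    """
    shifts = [1.0 - _cosine(a, b)
              for a, b in zip(trajectory, trajectory[1:])]
    return sum(shifts) / len(shifts)
```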
Interaction Topology. Figure 5 illustrates the evolution of communication topology. During the Centralized Exploration Phase, agents form
Table 6: Failure type distribution on tasks. Estimated from 15 randomly sampled cases.
| Failure Type | Description | % |
|------------------------|-------------------------------------------------------------|-----|
| Premature Consensus | Early validator fixes one branch too soon | 16 |
| Reinforcement Bias | Over-strengthening of an early path signal | 20 |
| Mode Collapse | All explorers converge to one reasoning mode | 14 |
| Constraint Omission | Missing symbolic or geometric integration | 22 |
| Communication Deadlock | Agents misrecognize roles, causing communication deadlock | 28 |
a hub-spoke pattern centered on high-similarity validators. As reasoning progresses, pheromone reinforcement promotes denser cross-links, leading to a Distributed Consensus Phase characterized by a small-world structure with higher local clustering (0.28 → 0.47) and shorter global paths. This transition demonstrates self-organized coordination emerging without explicit central control.
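The local-clustering numbers above (0.28 → 0.47) follow the standard Watts-Strogatz definition, which can be computed from the interaction graph without extra libraries. A sketch under that assumption (node and edge structure here are illustrative):

```python
def avg_clustering(adj):
    """Average local clustering coefficient of an undirected graph,
    given as {node: set_of_neighbors}. For each node with degree k,
    the coefficient is (edges among neighbors) / (k choose 2);
    denser local triangles -> values closer to 1.
    """
    coeffs = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count edges between pairs of this node's neighbors.
        links = sum(1 for u in nbrs for v in nbrs
                    if u < v and v in adj[u])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)
```

A pure hub-spoke graph scores 0 (spokes share no edges), while cross-linking workers and validators raises the score, matching the reported transition.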
Contribution Balance. Normalized entropy of accepted contributions is
$$H_c = -\frac{1}{\log A} \sum_{i=1}^{A} p_i \log p_i, \quad (1)$$
where $p_i$ is each agent's contribution share, computed from its contribution acceptance rate, and $A$ is the total number of agents. SwarmSys attains $H_c = 0.72$, surpassing GPTSwarm (0.41), indicating balanced participation driven by the exploration-exploitation policy. Overall, adaptive profiles and pheromone feedback jointly yield a decentralized yet structured division of labor.
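Equation (1) is the Shannon entropy of the contribution distribution, normalized by its maximum $\log A$. A direct implementation for checking the statistic:

```python
import math

def contribution_entropy(accepted_counts):
    """Normalized entropy H_c of accepted contributions (Eq. 1).

    `accepted_counts[i]` is the number of accepted contributions
    from agent i; shares p_i are counts normalized to sum to 1.
    Returns 1.0 for perfectly balanced participation and 0.0 when
    a single agent dominates. Zero-count agents contribute no
    entropy (the p log p term vanishes as p -> 0).
    """
    total = sum(accepted_counts)
    A = len(accepted_counts)
    shares = [c / total for c in accepted_counts if c > 0]
    return -sum(p * math.log(p) for p in shares) / math.log(A)
```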
## 4.2 Error Analysis and Case Study
Despite strong overall results, SwarmSys occasionally fails on tasks requiring strict temporal or symbolic alignment. As shown in Table 6, we identify five main failure types: (a) Premature Consensus, (b) Reinforcement Bias, (c) Mode Collapse, (d) Constraint Omission, and (e) Communication Deadlock. A typical case in our study: two Explorers propose algebraic and geometric solutions, but an early Validator accepts one branch too soon, reinforcing it and suppressing valid alternatives. Such reinforcement bias leads to overfitting on partial evidence. Future improvements may include uncertainty-weighted reinforcement, scheduled resampling of low-confidence paths, and meta-level arbitration to maintain epistemic diversity.
Table 7: Comparison of SwarmSys with representative reasoning and multi-agent systems. † MetaGPT incorporates SOP-style verification steps but not decentralized validators. § Voyager maintains skill profiles and self-checking, but only within a single-agent setting.
| System | Decentralized | Explicit Roles | Debate | Dynamic Profiling | Decentralized Matching | Multi-Event | Validator | Stigmergy |
|------------------------------------------------|-----------------|------------------|----------|---------------------|--------------------------|---------------|-------------|-------------|
| CAMEL (Li et al., 2023) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| AutoGen (Wu et al., 2023) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MetaGPT (Hong et al., 2024) | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ † | ✗ |
| MAD (Liang et al., 2024) | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| ToT/GoT (Yao et al., 2023; Besta et al., 2024) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Voyager (Wang et al., 2023a) | ✗ | ✗ | ✗ | ✓ § | ✗ | ✗ | ✓ § | ✗ |
| SWE-agent (Yang et al., 2024) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Self-Refine (Madaan et al., 2023) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| GPTSwarm (Zhuge et al., 2024) | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ |
| DeepResearchAgent (Huang et al., 2025) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| ROBIN (Ghareeb et al., 2025) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| SwarmSys (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
## 5 Related Works
LLM-based Multi-Agent Systems. Recent progress in large language models has led to a proliferation of multi-agent frameworks that decompose reasoning and problem-solving into interactive roles (Liu et al., 2025). Early systems such as CAMEL (Li et al., 2023) and AutoGen (Wu et al., 2023) introduced explicit role structures (e.g., user-assistant pairs) and dialog-based task decomposition, showing that inter-agent communication improves reasoning diversity. Subsequent works like MetaGPT (Hong et al., 2024), AgentScope (Gao et al., 2024b), and DeepResearchAgent (Huang et al., 2025) further formalized role hierarchies for domain-specific workflows such as software engineering or literature review. However, their centralized orchestration and fixed pipelines limit scalability and adaptability. SwarmSys differs by using a fully decentralized structure in which coordination emerges from role-guided interaction and adaptive matching, without a global controller.
Collaborative Reasoning and Debate. Another line of work explores multi-round reasoning via self-consistency and debate. Methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) reasoning (Yao et al., 2023; Besta et al., 2024) extend single-agent reflection through structured deliberation, while MAD (Liang et al., 2024), Longagent (Zhao et al., 2024), and ROBIN (Ghareeb et al., 2025) model inter-agent debates to enhance diversity and correctness. These approaches improve intermediate reasoning quality but generally depend on fixed communication topologies and lack mechanisms for dynamic adaptation or memory persistence across reasoning episodes. In contrast, SwarmSys incorporates debate as a local coordination primitive within a self-organizing swarm, where roles evolve through continuous profiling and pheromone-like reinforcement, enabling sustained reasoning across multiple concurrent events.
Adaptive Coordination and Swarm-inspired Reasoning. A growing body of work introduces adaptive or self-organizing strategies for LLM agents. GPTSwarm (Zhuge et al., 2024) represents one of the few attempts at decentralized coordination, leveraging graph-based optimization and stigmergic feedback. Meanwhile, systems like Voyager (Wang et al., 2023a), Self-Refine (Madaan et al., 2023), and SwarmAgentic (Zhang et al., 2025) employ profile-like adaptation and iterative self-improvement, but remain task-specific. SwarmSys advances this line of research by integrating dynamic profiling, embedding-based matching, and pheromone-inspired reinforcement into a unified framework that scales to multi-event, multi-agent reasoning. This design allows distributed agents to self-allocate across evolving tasks, maintaining robustness and efficiency without centralized scheduling.
## 6 Conclusion
We presented SwarmSys, a swarm-intelligence-inspired framework for decentralized multi-agent reasoning. Through role-specialized collaboration, dynamic profiling, and pheromone-inspired reinforcement, SwarmSys enables scalable, self-organizing coordination without centralized control. Across diverse reasoning and research domains, it consistently outperforms strong baselines and reveals emergent collective behaviors, demonstrating that scaling coordination can rival scaling model size. Our findings suggest a new paradigm for reasoning: intelligence emerges from structured interaction among distributed agents, not from larger models.
## Limitations
Despite its strong performance, SwarmSys still faces several limitations. First, while decentralized coordination improves adaptability, it also increases communication overhead, which may reduce efficiency in latency-sensitive settings. Second, agent profiling currently relies on text-based embeddings and heuristic updates; future work could explore learnable or gradient-based mechanisms for more precise skill modeling. Third, our experiments focus primarily on reasoning and research-oriented tasks; extending SwarmSys to embodied or real-time interactive environments remains an open direction. We hope these insights inspire future research on large-scale, self-organizing multi-agent systems that combine symbolic structure with emergent intelligence.
## Ethics Statement
This work introduces SwarmSys, a distributed multi-agent reasoning framework inspired by swarm intelligence. The research involves no human subjects, personal data, or sensitive content; all experiments use public or synthetic datasets. We recognize potential ethical issues in LLM-based multi-agent systems, such as bias propagation and unreliable autonomous coordination. SwarmSys mitigates these risks through closed-loop validation and transparent agent interactions, ensuring that all reasoning processes remain interpretable and auditable. Our goal is to advance the scientific understanding of scalable reasoning rather than deploy autonomous agents in real-world decision-making. We advocate responsible use of SwarmSys with proper human oversight, fairness, and accountability in future applications.
## Acknowledgements
We are grateful to our collaborators for their valuable discussions on multi-agent coordination and large language model reasoning. This research was supported in part by institutional computational resources and open-source communities that enabled large-scale experimentation.
## References
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence , 38(16):17682-17690.
Eric Bonabeau, Marco Dorigo, and Guy Theraulaz. 1999. Swarm Intelligence: From Natural to Artificial Systems . Oxford University Press.
Victor Dibia, Jingya Chen, Gagan Bansal, Suff Syed, Adam Fourney, Erkang Zhu, Chi Wang, and Saleema Amershi. 2024. Autogen studio: A no-code developer tool for building and debugging multi-agent systems. Preprint , arXiv:2408.15247.
Marco Dorigo and Thomas Stützle. 2004. Ant Colony Optimization . MIT Press.
Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. 2025. Deepresearch bench: A comprehensive benchmark for deep research agents. Preprint , arXiv:2506.11763.
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. 2024a. Omni-math: A universal olympiad level mathematic benchmark for large language models. Preprint , arXiv:2410.07985.
Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, and 1 others. 2024b. Agentscope: A flexible yet robust multi-agent platform. arXiv preprint arXiv:2402.14034 .
Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. 2025. Robin: A multi-agent system for automating scientific discovery. Preprint , arXiv:2505.13400.
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. Metagpt: Meta programming for a multi-agent collaborative framework. Preprint , arXiv:2308.00352.
Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, and Jun Wang. 2025. Deep research agents: A systematic examination and roadmap. Preprint , arXiv:2506.18096.
Valentin Lecheval, Elva J.H. Robinson, and Richard P. Mann. 2024. Random walks with spatial and temporal resets may underlie searching movements in ants. bioRxiv .
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large language model society. Preprint , arXiv:2303.17760.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2024. Encouraging divergent thinking in large language models through multi-agent debate. Preprint , arXiv:2305.19118.
Hongjun Liu, Yinghao Zhu, Yuhui Wang, Yitao Long, Zeyu Lai, Lequan Yu, and Chen Zhao. 2025. Medmmv: A controllable multimodal multi-agent framework for reliable and verifiable clinical reasoning. Preprint , arXiv:2509.24314.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. Preprint , arXiv:2303.17651.
OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. Gpt-4o system card. Preprint , arXiv:2410.21276.
Jiabin Tang, Tianyu Fan, and Chao Huang. 2025. Autoagent: A fully-automated and zero-code framework for llm agents. Preprint , arXiv:2502.05957.
Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, and 11 others. 2024. Scicode: A research coding benchmark curated by scientists. Preprint , arXiv:2407.13168.
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a. Voyager: An open-ended embodied agent with large language models. Preprint , arXiv:2305.16291.
Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. 2024. Mixture-of-agents enhances large language model capabilities. Preprint , arXiv:2406.04692.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 .
Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2023b. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. arXiv preprint arXiv:2307.05300 .
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems , NIPS '22, Red Hook, NY, USA. Curran Associates Inc.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. Autogen: Enabling next-gen llm applications via multi-agent conversation. Preprint , arXiv:2308.08155.
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Preprint , arXiv:2405.15793.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Preprint , arXiv:2305.10601.
Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2024. Evaluating the performance of large language models on gaokao benchmark. Preprint , arXiv:2305.12474.
Yao Zhang, Chenyang Lin, Shijie Tang, Haokun Chen, Shijie Zhou, Yunpu Ma, and Volker Tresp. 2025. Swarmagentic: Towards fully automated agentic system generation via swarm intelligence. arXiv preprint arXiv:2506.15672 .
Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Longagent: scaling language models to 128k context through multi-agent collaboration. arXiv preprint arXiv:2402.11550 .
Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. 2024. Language agents as optimizable graphs. Preprint , arXiv:2402.16823.
## A Example Appendix
## A.1 Prompts
## Worker Prompt
PROMPT = """
You are a worker agent with expertise in problem-solving, computation, and mathematical reasoning.
Your task is to receive a specific subproblem and then (1) solve the subproblem accurately with transparent justification for each step, (2) share your solution with other agents, (3) participate in solution comparison and structured debate, and (4) defend or revise your work if challenged.
Ensure your output (1) presents calculations in a clear, annotated format, (2) cites mathematical rules or theorems when applying them, and (3) is responsive to feedback and adaptive in collaborative revision. """
## Validator Prompt
PROMPT = """
You are a validator agent dedicated to checking solution correctness and consistency.
Your task is to: (1) check each worker's solution for logic, accuracy, and completeness, (2) participate in debates to reconcile discrepancies, (3) contribute to confirming a group consensus, and (4) shift roles if the validation queue is empty and other roles are under high pressure.
Ensure your output (1) provides precise validation with explanations of correctness or error, (2) uses formal mathematical checks where appropriate, (3) helps mediate conflicts in results with clarity and neutrality, and (4) explicitly states if consensus is confirmed: {TERMINATE: The answer is: [correct answer]} """
## Explorer Prompt
PROMPT = """
You are an Explorer agent with strong capabilities in identifying subproblems and analyzing solution paths.
Your task is to receive a mathematical problem set and then (1) search and interpret the overall problem, (2) decompose it into logically coherent subproblems, (3) identify entry points and strategies for solution, and (4) monitor the workload distribution across roles and dynamically reassign yourself to higher-pressure roles if necessary.
Ensure your output (1) maintains traceability between subproblems and the original task, (2) logically documents role-switch decisions, and (3) ensures role-switch includes resetting your goal and behavior policy. """
## Task Context
PROMPT = """
Describe and analyze the following mathematical problem carefully. The problem statement is presented below. If this task depends on other sub-tasks or prior events, ensure logical consistency and continuity with those results.
Task profile. (1) If the task involves proving, demonstrating, or establishing statements - This is a proof-oriented task. Emphasize logical rigor, step-by-step justification, and clear argumentation. (2) If the task involves calculating, evaluating, integrating, or deriving quantities - This is a computation-oriented task. Focus on symbolic manipulation, clear intermediate steps, and verification of results. (3) If the task involves maximizing, minimizing, or optimizing a quantity - This is an optimization task. Identify objective functions and constraints, analyze conditions for optimality, and interpret results precisely. (4) If the task involves conditional or logical reasoning ('if', 'then', 'otherwise') - This is a conditional reasoning task. Separate cases clearly, verify implications, and maintain logical completeness. (5) Otherwise - This is a general reasoning task. Apply systematic mathematical analysis and adjust reasoning depth to the complexity of the problem.
Objectives: (1) Analyze the problem thoroughly and build a shared understanding. (2) Generate multiple possible solution paths and compare their validity. (3) Execute reasoning or calculations with clarity, precision, and justification. (4) Resolve disagreements through structured logical debate, not assertion. (5) Arrive at a verified, consensus-based final answer.
Round & Turn-taking policy: (1) Each debate round follows this strict order: explorer → all workers → validator. (2) The explorer opens each round by outlining context, progress, and a proposed plan. (3) Workers present or refine solutions, show derivations, and discuss intermediate results. (4) The validator closes each round by checking correctness and summarizing consensus. (5) The debate terminates only when the validator announces: TERMINATE: The answer is: <final answer>. """
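The round and turn-taking policy described in the Task Context prompt can be sketched as a simple loop. This is a hedged illustration, not the SwarmSys implementation: `run_debate` and the callable-agent interface (each agent is a function mapping the shared transcript to an utterance) are assumed names, and the termination string matches the one the prompt specifies.

```python
import re

def run_debate(explorer, workers, validator, max_rounds=10):
    """Illustrative debate loop: explorer -> all workers -> validator per round.

    Each agent is assumed to be a callable taking the transcript so far and
    returning its next utterance as a string.
    """
    transcript = []
    for _ in range(max_rounds):
        # (1)-(2) The explorer opens the round with context and a plan.
        transcript.append(("explorer", explorer(transcript)))
        # (3) Each worker presents or refines a solution in turn.
        for i, worker in enumerate(workers):
            transcript.append((f"worker-{i}", worker(transcript)))
        # (4) The validator closes the round and may announce consensus.
        verdict = validator(transcript)
        transcript.append(("validator", verdict))
        # (5) Terminate only on the validator's explicit announcement.
        match = re.search(r"TERMINATE: The answer is: (.+)", verdict)
        if match:
            return match.group(1).strip(), transcript
    return None, transcript  # no consensus within the round budget
```

With stub agents that return canned strings, the loop terminates as soon as the validator emits the announcement and returns the extracted answer.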
## Instruction Embedding Template
PROMPT = """
You are evaluating the mathematical and reasoning competence of an agent participating in a collaborative debate system.
The goal is to generate a semantic embedding that represents how capable this agent is at solving mathematical, logical, and analytical tasks.
Consider two sources of information: (1) The agent's declared abilities, describing what it is designed or trained to do. (2) The agent's historical activity performance, summarizing how it has previously executed reasoning, computation, or validation tasks. Integrate both perspectives to form a single competence representation capturing: (1) conceptual depth and mathematical reasoning skills, (2) problem-solving strategy diversity, (3) accuracy and self-correction ability, and (4) communication and collaboration quality.
Agent Ability Description: {ability text} Agent Performance History: {history text} Use the combined content above as the context for competence evaluation. Output representation should capture the overall ability state of the agent, balancing potential skill with observed performance. """
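To illustrate how the template above could feed an embedding model, the following toy sketch combines the ability description and performance history into one instruction string and embeds it. The real system would call an instruction-tuned embedding model; here `embed` is a hash-based stand-in (an assumption made so the example runs without any API), and `DIM` is an arbitrary illustrative dimension.

```python
import hashlib

DIM = 64  # illustrative embedding dimension, not the system's actual size

def embed(text: str) -> list[float]:
    """Toy stand-in for an instruction-tuned embedder: hash tokens into a
    fixed-size bag-of-words vector and L2-normalize it."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]

def competence_embedding(ability_text: str, history_text: str) -> list[float]:
    # Mirror the template: one instruction integrating both information sources.
    instruction = (
        "Agent Ability Description: " + ability_text + "\n"
        "Agent Performance History: " + history_text
    )
    return embed(instruction)
```

The design point is that declared abilities and observed history enter a single representation, so matching can weigh potential skill against demonstrated performance.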
## A.2 Dataset Settings
We evaluate the reasoning capability and knowledge coverage of our multi-agent system based on three categories of tasks:
Exam We rearrange existing benchmarks into exam-like formats, covering both single-subject and multi-subject settings. This design mimics the structure of real-world examination papers, where agents must solve a coherent set of questions rather than isolated items. Organizing benchmarks in this way is motivated by several factors: (i) exams naturally exhibit varying levels of difficulty across questions; (ii) they involve heterogeneous knowledge types and differentiated scoring schemes, which make the dataset inherently diverse; and (iii) each exam itself constitutes a complex task that can be decomposed into multiple interrelated and unrelated sub-tasks.
Such properties align well with the characteristics of our ant-colony-inspired system. In real ant
colonies, individuals continuously explore, evaluate, and participate in different sub-tasks based on local pheromone signals. Similarly, in exam scenarios, our agents can search across different questions, dynamically decide whether to participate in a specific sub-task, and collectively optimize the global solution through local collaboration. Therefore, exam-style benchmarks provide not only a realistic and challenging evaluation setting for reasoning and knowledge coverage, but also a natural testbed for demonstrating the strengths of swarm-based multi-agent systems.
Research To evaluate research-oriented reasoning, we adopt DeepResearch Bench, which contains PhD-level research tasks spanning multiple domains. Each task provides a research topic or open-ended problem statement (rather than a completed study), requiring agents to perform literature exploration, knowledge recall, and synthesis of coherent research reports. Outputs are evaluated using RACE (Reference-based Adaptive Criteria-driven Evaluation). Such settings mirror real-world academic research, where researchers must jointly survey prior work, generate new ideas, and design proof-of-concept implementations. These properties align well with our swarm-based system: individual agents can specialize in literature recall, hypothesis generation, or code prototyping, and through local debates and coordination, the swarm collectively develops more robust and creative research outcomes.
Science Coding We further evaluate reasoning and knowledge grounding through scientific programming tasks drawn from SciCode. The benchmark is organized into main problems, each of which is decomposed into multiple sub-problems, making it particularly suitable for swarm-based evaluation. Such a structure allows agents to collaboratively assign sub-problems, iteratively generate and refine code, and verify correctness against scientific principles or test cases. By reaching swarm-level consensus, the system enhances both accuracy and coverage. This setting therefore captures the precision and collaborative robustness required for scientific reasoning and computation.
## Output Research Article Example

Figure 6: Output example. A SwarmSys-generated research report, "Fusing Innovation and Tradition: A Hybrid Model Approach to Asset Allocation with Mean-Variance, Black-Litterman, and Deep Learning," comprising an Abstract, Introduction, Core Models (Mean-Variance, Black-Litterman, and Deep Learning), a Comparative Analysis (risk measurement, return prediction, and asset allocation), a Conclusion, and an Advantages, Limitations & Implications section, with over 30 citations and more than 4,000 words. (Only the section structure is reproduced here.)
## A.3 SwarmSys Output Example
## A.4 Profile Embedding and Matching
## A.4.1 Profile Format
## A.4.2 Embedding and Matching
Agent embeddings. Each agent A_i is represented by a competence embedding and an availability embedding. The competence embedding v_{ah}^{(i)} is derived from declared abilities T_a^{(i)} and historical performance T_h^{(i)}, guided by task-specific instructions:

$$v _ { ah } ^ { ( i ) } = \phi _ { instruct } ( Instruction ( T _ { a } ^ { ( i ) } , T _ { h } ^ { ( i ) } ) ) .$$

The availability embedding v_s^{(i)} reflects workload and readiness, derived from status T_s^{(i)}:

$$v _ { s } ^ { ( i ) } = \phi _ { instruct } ( Instruction ( T _ { s } ^ { ( i ) } ) ) .$$

The final representation is the sum:

$$v ^ { ( i ) } = v _ { ah } ^ { ( i ) } + v _ { s } ^ { ( i ) } .$$
Figure 7: Profile Format. Agent profiles are JSON records of the form { "id": "agent_id", "role": "role" }, and event profiles follow the same structure (e.g. { "id": "event_id", ... }).
Event embeddings. Each event E_j is encoded as v_e^{(j)} by integrating its description, dependencies, progress state, and milestone:

$$v _ { e } ^ { ( j ) } = \phi _ { instruct } ( Instruction _ { E _ { j } } ) .$$
Compatibility and decision dynamics. Agent-event compatibility is computed as normalized cosine similarity:

$$C ^ { ( i , j ) } = \frac { \cos ( v ^ { ( i ) } , v _ { e } ^ { ( j ) } ) + 1 } { 2 } . \quad ( 6 )$$
To avoid stagnation, SwarmSys employs a dynamic ε-greedy policy. Each agent explores with probability ε_i or exploits with probability 1 - ε_i, where

$$\varepsilon _ { i } = 0 . 1 5 + ( 0 . 5 - \bar { S } _ { i } ) \cdot 0 . 2 , \quad ( 7 )$$

and S̄_i denotes the agent's recent average success. Exploration samples potential matches proportionally to similarity:

$$D ^ { ( i , j ) } \sim Bernoulli ( 0 . 1 + 0 . 9 \cdot C ^ { ( i , j ) } ) , \quad ( 8 )$$

while exploitation emphasizes high-compatibility matches:

$$p ^ { ( i , j ) } = \sigma ( \gamma ( C ^ { ( i , j ) } - 0 . 5 ) ) , \quad ( 9 )$$

where σ is the sigmoid function and γ controls sharpness. Both branches unify as:

$$p ^ { ( i , j ) } = \varepsilon _ { i } ( 0 . 1 + 0 . 9 \, C ^ { ( i , j ) } ) + ( 1 - \varepsilon _ { i } ) \, \sigma ( \gamma ( C ^ { ( i , j ) } - 0 . 5 ) ) , \quad D ^ { ( i , j ) } \sim Bernoulli ( p ^ { ( i , j ) } ) . \quad ( 1 0 )$$
This mechanism enables three properties simultaneously: adaptivity through evolving embeddings, stability through probabilistic sampling, and robustness by balancing exploration with exploitation.
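The compatibility score and ε-greedy decision rule of this subsection (Eqs. (6)-(9)) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are ours, and the sharpness value γ = 8 is an assumed placeholder, since the text leaves γ unspecified; the other constants (0.15, 0.2, 0.1, 0.9, 0.5) follow the equations.

```python
import math
import random

def compatibility(v_agent, v_event):
    """Eq. (6): cosine similarity mapped from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(v_agent, v_event))
    na = math.sqrt(sum(a * a for a in v_agent))
    ne = math.sqrt(sum(b * b for b in v_event))
    cos = dot / (na * ne) if na and ne else 0.0
    return (cos + 1.0) / 2.0

def engage_probability(c, avg_success, gamma=8.0):
    """Blend exploration and exploitation into one engagement probability."""
    eps = 0.15 + (0.5 - avg_success) * 0.2            # Eq. (7): adaptive epsilon
    explore = 0.1 + 0.9 * c                            # Eq. (8): similarity-proportional
    exploit = 1.0 / (1.0 + math.exp(-gamma * (c - 0.5)))  # Eq. (9): sigmoid sharpening
    return eps * explore + (1.0 - eps) * exploit       # unified branch mixture

def decide(v_agent, v_event, avg_success, rng=random.random):
    """Sample the binary engagement decision D ~ Bernoulli(p)."""
    p = engage_probability(compatibility(v_agent, v_event), avg_success)
    return rng() < p
```

Note how a low recent success rate S̄_i raises ε_i, pushing an underperforming agent toward broader exploration, while a high-compatibility match is amplified by the sigmoid in the exploitation branch.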
## A.5 Cost
Table 8: Average per-question model cost comparison for the baselines and systems used in the main experiments, grouped by task category.

| Task | Model | Instr.-based Cost ($) | Code-based Cost ($) |
|----------------|-------------------------|-----------------------|---------------------|
| Exam | IO (GPT-4o) | 0.005 | - |
| Exam | CoT (GPT-4o) | 0.012 | - |
| Exam | CoT-SC (GPT-4o, 5-shot) | 0.051 | - |
| Exam | Self-Refine (GPT-4o) | 0.068 | - |
| Exam | MultiPersona (GPT-4o) | 0.043 | - |
| Exam | GPTSwarm | 0.077 | - |
| Exam | GPT-5 | 0.014 | - |
| Exam | SwarmSys-8 (Ours) | 0.071 | - |
| Research | IO (GPT-4o-Search) | 0.15 | - |
| Research | CoT (GPT-4o) | 0.2 | - |
| Research | Self-Refine | 1.53 | - |
| Research | DeepResearchAgent | 2.82 | - |
| Research | Grok Deeper Search | 2.85 | - |
| Research | SwarmSys-8 (Ours) | 2.63 | - |
| Science Coding | IO (GPT-4o) | 0.08 | 0.013 |
| Science Coding | CoT-SC (GPT-4o) | 0.11 | 0.017 |
| Science Coding | Self-Refine | 0.36 | 0.026 |
| Science Coding | GPTSwarm | 0.41 | 0.024 |
| Science Coding | SwarmSys-14 (Ours) | 0.44 | 0.019 |