# SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning
## SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning
Ruohao Li *1 , Hongjun Liu *2,4 , Leyi Zhao 5 , Zisu Li 3 , Jiawei Li 1 ,
Jiajun Jiang 1 , Linning Xu 6 , Chen Zhao 2,4 , Mingming Fan †1,3 ,
Chen Liang †1
1 The Hong Kong University of Science and Technology (Guangzhou) 2 New York University 3 The Hong Kong University of Science and Technology 4 NYU Shanghai 5 Indiana University 6 The Chinese University of Hong Kong
* Equal contribution. † Co-corresponding authors.
Correspondence:
rli777@connect.hkust-gz.edu.cn, lh3862@nyu.edu
## Abstract
Large language model (LLM) agents have shown remarkable reasoning abilities. However, existing multi-agent frameworks often rely on fixed roles or centralized control, limiting scalability and adaptability in long-horizon reasoning. We introduce SwarmSys, a closedloop framework for distributed multi-agent reasoning inspired by swarm intelligence. Coordination in SwarmSys emerges through iterative interactions among three specialized roles, Explorers, Workers, and Validators, that continuously cycle through exploration, exploitation, and validation. To enable scalable and adaptive collaboration, we integrate adaptive agent and event profiles, embedding-based probabilistic matching, and a pheromone-inspired reinforcement mechanism, supporting dynamic task allocation and self-organizing convergence without global supervision. Across symbolic reasoning, research synthesis, and scientific programming tasks, SwarmSys consistently outperforms baselines, improving both accuracy and reasoning stability. These findings highlight swarm-inspired coordination as a promising paradigm for scalable, robust, and adaptive multi-agent reasoning, suggesting that coordination scaling may rival model scaling in advancing LLM intelligence.
## 1 Introduction
The strong reasoning and planning capabilities of Large Language Models (LLMs) have spurred interest in multi-agent systems. These systems use collaboration to enhance reasoning diversity and reliability. Frameworks such as AutoGen (Wu et al., 2023) enable agent-to-agent dialogue for multiagent applications, while CAMEL (Li et al., 2023) leverages role-playing and inception prompting to facilitate autonomous cooperation. AutoGen Studio further provides a no-code interface for designing and debugging agent workflows (Dibia et al., 2024). However, most systems use fixed roles and
Figure 1: Comparison between paradigm-dependent multi-agent systems and SwarmSys. While existing methods rely on fixed, domain-specific agent paradigms, SwarmSys achieves scalable self-organization and crossdomain adaptability.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Diagram: Multi-Agent System Paradigms
### Overview
The image presents a diagram illustrating two approaches to multi-agent systems: one with a fixed number of backbone agents and another with a scalable number of backbone agents. Both approaches are linked to three paradigms: Exam, Research, and Science Coding. The diagram outlines the flow and relationships between these components.
### Components/Axes
* **Top Section Title:** Fixed Number of Backbone Agents
* **Top Level Paradigms (light blue boxes):**
* Cot+ Exam Paradigm
* Cot+ Research Paradigm
* Cot+ Coding Paradigm
* **Middle Level Paradigms (white boxes):**
* Exam
* Research
* Science Coding
* **Middle Section Title:** Paradigm-dependent multi-agent systems
* **Bottom Section Title:** Scalable Number of Backbone Agents
* **Agent Representations:** Three boxes containing graphical representations of agents (red and blue circles with arrows) and a fourth box with an ellipsis (...). The boxes are connected by the word "Or".
* **SwarmSys (light blue box):** Represents the SwarmSys system.
* **Bottom Level Paradigms (white boxes):**
* Exam
* Research
* Science Coding
* **Bottom Right Label:** SwarmSys
### Detailed Analysis
* **Fixed Number of Backbone Agents:**
* The "Fixed Number of Backbone Agents" section starts with a single node that branches out to three paradigms: "Cot+ Exam Paradigm", "Cot+ Research Paradigm", and "Cot+ Coding Paradigm".
* Each of these paradigms is connected to all three of the "Exam", "Research", and "Science Coding" paradigms via dashed lines.
* Each of the "Cot+" paradigms is also connected to its corresponding "Exam", "Research", or "Science Coding" paradigm via a solid line.
* **Scalable Number of Backbone Agents:**
* The "Scalable Number of Backbone Agents" section starts with a single node that branches out to four boxes. The first three boxes contain graphical representations of agents, and the fourth box contains an ellipsis (...).
* These boxes are connected to the "SwarmSys" node.
* The "SwarmSys" node is connected to the "Exam", "Research", and "Science Coding" paradigms.
### Key Observations
* The diagram highlights two distinct approaches to multi-agent systems: one with a fixed number of backbone agents and another with a scalable number of backbone agents.
* The "Fixed Number of Backbone Agents" approach uses "Cot+" paradigms, while the "Scalable Number of Backbone Agents" approach uses "SwarmSys".
* Both approaches ultimately connect to the same three paradigms: "Exam", "Research", and "Science Coding".
* The dashed lines in the "Fixed Number of Backbone Agents" section suggest a more complex relationship between the "Cot+" paradigms and the "Exam", "Research", and "Science Coding" paradigms.
### Interpretation
The diagram illustrates two different architectural approaches to designing multi-agent systems. The "Fixed Number of Backbone Agents" approach seems to suggest a more traditional, pre-defined structure where the "Cot+" paradigms act as high-level strategies that influence the lower-level "Exam", "Research", and "Science Coding" tasks. The dashed lines indicate a complex interplay, possibly representing different strategies being applied to different tasks.
The "Scalable Number of Backbone Agents" approach, on the other hand, suggests a more dynamic and adaptive system where the number of agents can change based on the needs of the task. The "SwarmSys" system likely represents a swarm intelligence approach where agents collaborate to solve problems. The ellipsis (...) suggests that the number of agents can scale indefinitely.
The fact that both approaches ultimately connect to the same three paradigms ("Exam", "Research", and "Science Coding") suggests that these paradigms represent the core tasks or applications that the multi-agent systems are designed to address. The choice of architecture (fixed vs. scalable) likely depends on the specific requirements of the application.
</details>
static communication, which limits their adaptability. This rigidity leads to redundant exploration and inefficiency, especially in long-horizon or dynamic tasks.
To address limitations, recent work has explored adaptive or self-improving agent collaboration. AutoAgent enables natural-language-defined behaviors instead of hand-coded roles (Tang et al., 2025), while Mixture-of-Agents introduces layered coordination for more robust reasoning (Wang et al., 2024). However, these systems still rely on centralized orchestration or manually designed topologies, limiting scalability and long-term stability. Inspired by the decentralized intelligence of natural swarms, where simple signals regulate cooperation and task allocation (Bonabeau et al., 1999; Dorigo and Stützle, 2004), we propose SwarmSys,
Figure 2: Overall workflow of the SwarmSys collaborative reasoning process. Each task is decomposed into subevents handled by specialized agents: Explorers propose solution paths, Workers execute subtasks, and Validators ensure consistency. Agents iteratively perform debate-consensus cycles that update event profiles and reinforce effective reasoning strategies until convergence.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Diagram: SwarmSys Collaborative Reasoning Process
### Overview
The image is a flowchart illustrating the SwarmSys Collaborative Reasoning Process. It outlines the steps involved in a multi-agent system where agents collaborate to solve a task. The process involves initialization, matching, debate, optimization, and validation, with different types of agents (Explorers, Workers, Validators) playing specific roles.
### Components/Axes
The diagram consists of the following components:
1. **Title:** "SwarmSys Collaborative Reasoning Process" (top-left)
2. **Agent Random Initialization:**
* "# Agents": A cluster of gray agents.
* "Agent Profiles": A cluster of agents with different colors (red, blue, teal).
3. **Match:** An arrow pointing from "Agent Profiles" to a matching process, where agents are aligned and compared.
4. **Target & Requirement:**
* "Exam": A box representing the initial problem or task.
* "E0 And E1(A List) Initialization": An agent (blue) initializing the event profiles.
5. **Event Profiles:** A list of event profiles.
6. **Sub-Event:** A small process box leading to a blue agent.
7. **Decomposition and Geometric Reasoning:** A series of text boxes describing the steps taken by the agents:
* "Suggest to decompose the task into two goals: Find intersection, and locus."
* "Formulates potential parameters (t, θ) to guide geometric reasoning."
* "Initiates collaborative solving by delegating sub-tasks to workers."
8. **Explorer's Plan Evaluation:** A text box describing the Explorer's role:
* "Evaluates the Explorer's plan by... and verifying its parametric form: x = 1 + t/2, y = (√3/2)t."
* "Executes computation independently by ... to form 1 + t/2 = cos θ, (√3/2)t = sin θ. 1 + (sin θ/√3) = cos θ."
* "confirming that solving for θ gives the intersection points."
9. **Debate & Consensus:**
* A series of red agents labeled 0, 1, and 2.
* A plus sign (+) combining the red agents with a blue agent.
10. **Geometric Perspective:** A text box describing the geometric interpretation:
* "Provides another geometric perspective by interpreting C₁ as a line through (1,0) with direction (cos α, sin α)."
* "Executes an independent geometric derivation for α, establishing (1 + t/2)² + ((√3/2)t)² = 1."
* "Validates cross-modally and refines explorer that the geometric result coincides with..."
11. **Cross-Modal Consistency:** A text box describing the cross-modal validation:
* "Validates cross-modal consistency by transforming Worker 001's analytic equation"
* "Checks alignment with Worker 002's geometric coordinates cos θ - (sin θ/√3) = 1,"
* "Confirming both results and debate are consistent. (1,0) and (-1/2, -√3/2)"
* "Synthesizes agreement, finalizes current round."
12. **Agent Profiles (Updated):** A cluster of agents with different colors (red, blue, teal).
13. **Event Profiles (Updated):** A list of updated event profiles.
14. **Optimization Loop:** An iterative process labeled "Optimization Loop" with "Execute" steps.
15. **Final Result:** A box representing the final outcome.
16. **Agent Roles:**
* "Explorer" (blue agent)
* "Worker" (red agent)
* "Validator" (teal agent)
### Detailed Analysis or ### Content Details
* **Agent Initialization:** The process starts with a group of agents, which are then profiled based on their characteristics.
* **Matching:** The agent profiles are matched against the target and requirements.
* **Event Profiles:** The matching process generates event profiles, which are then initialized.
* **Sub-Event Decomposition:** The task is decomposed into sub-events, and geometric reasoning is applied.
* **Explorer's Plan:** The Explorer agent evaluates a plan and verifies its parametric form.
* **Debate & Consensus:** Agents debate and reach a consensus on the solution.
* **Geometric Perspective:** Another geometric perspective is provided, and an independent geometric derivation is executed.
* **Cross-Modal Consistency:** Cross-modal consistency is validated by transforming equations and checking alignment.
* **Optimization Loop:** The process enters an optimization loop, where the solution is refined iteratively.
* **Final Result:** The final result is validated and presented.
### Key Observations
* The diagram illustrates a complex collaborative reasoning process involving multiple agents with different roles.
* The process involves both geometric reasoning and cross-modal validation.
* The optimization loop ensures that the solution is refined iteratively.
### Interpretation
The diagram depicts a sophisticated multi-agent system designed for collaborative problem-solving. The SwarmSys Collaborative Reasoning Process leverages the strengths of different agent types (Explorers, Workers, Validators) to achieve a robust and validated solution. The process emphasizes geometric reasoning, cross-modal consistency, and iterative optimization, suggesting a focus on problems that can be approached from multiple perspectives and require continuous refinement. The use of debate and consensus-building mechanisms highlights the importance of collaboration and agreement among the agents. The overall process aims to ensure that the final result is both accurate and reliable.
</details>
a closed-loop framework that enables LLM agents to coordinate through lightweight, pheromone-like traces encoding contextual utility. This mechanism fosters self-organized collaboration, dynamically balancing exploration and convergence without centralized control.
Unlike debate or tree-based systems, SwarmSys allows coordination to emerge organically through iterative interaction and adaptive matching. Agents assume three roles, Explorers, Workers, and Validators, mirroring the division of labor in natural ant colonies. Explorers expand hypotheses, Workers refine and execute subtasks, and Validators ensure consistency, together forming continuous exploration-exploitation-validation cycles driving decentralized convergence. A core innovation is the use of profiles as adaptive memory. Agent and event profiles evolve with ability embeddings, workload, and context, enabling embedding-based reallocation and balanced participation-analogous to ants redistributing across foraging sites. Moreover, SwarmSys employs a pheromone-inspired reinforcement process: validated traces strengthen future compatibility, while ineffective ones decay, forming a decentralized optimization loop that enhances efficiency and stability over time.
Evaluated across symbolic reasoning, research synthesis, and scientific programming tasks, SwarmSys outperforms baselines such as GPTSwarm (Zhuge et al., 2024), achieving up to 10.7% higher accuracy and 9.9% better sub- task correctness. Remarkably, a swarm of GPT4o-based agents approaches GPT-5 performance, showing that scaling coordination can substitute for model scaling. Qualitative analyses reveal emergent behaviors, knowledge diffusion, specialization balance, and self-regularization, hallmarks of collective intelligence.
In summary, our contributions are threefold: (1) SwarmSys Framework: A closed-loop distributed multi-agent reasoning framework inspired by swarm intelligence for convergence. (2) Adaptive Coordination Mechanism: An embeddingbased matching and pheromone-inspired reinforcement process enabling dynamic agent-event allocation, self-organized collaboration, and stable longhorizon reasoning. (3) Comprehensive Evaluation: Extensive experiments across diverse reasoning tasks reveal consistent gains and emergent collective intelligence, showing that scaling coordination can rival scaling model capacity.
## 2 Methodology
## 2.1 Overview of SwarmSys
We present SwarmSys, a closed-loop collaborative framework for distributed problem solving. Unlike centralized orchestration or static task assignment, it converges through iterative matching → collaboration → update cycles, as is shown in Figure 3. Upon receiving new task, event profiles are instantiated, and candidate agents are retrieved via
Figure 3: Iterative cycle in SwarmSys: minimize f ( θ ) through matching → collaboration → update cycle.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Diagram: Agent Collaboration Workflow
### Overview
The image depicts a cyclical workflow diagram illustrating the process of agent collaboration. The diagram consists of four stages represented by rounded rectangles, connected by arrows indicating the flow of the process.
### Components/Axes
* **Nodes:**
* **Matching Retrieve agents {A, B, C}:** (Top-left, blue) Represents the stage where agents are matched and retrieved. The agents are denoted as A, B, and C.
* **Collaboration Compute θt+1:** (Top-right, green) Represents the collaboration stage where a value θt+1 is computed.
* **Update Profiles → JSON:** (Bottom-right, orange) Represents the stage where profiles are updated and converted to JSON format.
* **Task: minimize f(θ) = θ² + 3 sin θ:** (Bottom-left, gray) Represents the task of minimizing the function f(θ) = θ² + 3 sin θ.
* **Edges (Arrows):**
* **propose:** Connects "Matching Retrieve agents {A, B, C}" to "Collaboration Compute θt+1".
* **validate:** Connects "Collaboration Compute θt+1" to "Update Profiles → JSON".
* **update:** Connects "Update Profiles → JSON" to "Task: minimize f(θ) = θ² + 3 sin θ".
* **new round:** Connects "Task: minimize f(θ) = θ² + 3 sin θ" to "Matching Retrieve agents {A, B, C}".
### Detailed Analysis
The workflow starts with the "Matching Retrieve agents {A, B, C}" stage. The agents A, B, and C are retrieved and matched. The process then moves to the "Collaboration Compute θt+1" stage via the "propose" action. In this stage, the agents collaborate to compute a value θt+1. The computed value is then validated, leading to the "Update Profiles → JSON" stage. Here, the profiles are updated and converted into JSON format. Finally, the profiles are updated, and the process loops back to the "Task: minimize f(θ) = θ² + 3 sin θ" stage via the "update" action, where the task is to minimize the function f(θ) = θ² + 3 sin θ. The process then loops back to the "Matching Retrieve agents {A, B, C}" stage via the "new round" action, restarting the cycle.
### Key Observations
* The diagram represents a closed-loop system.
* The process involves matching agents, collaboration, updating profiles, and minimizing a function.
* The arrows indicate the direction of the workflow.
### Interpretation
The diagram illustrates an iterative process of agent collaboration aimed at minimizing a given function. The agents are matched and retrieved, they collaborate to compute a value, the profiles are updated based on this value, and the process repeats with the goal of minimizing the function f(θ). The conversion to JSON suggests that the profiles are being stored or transmitted in a standardized data format. The cyclical nature of the diagram indicates a continuous optimization process.
</details>
embedding-based matching. Agents then enter collaborative rounds: explorers propose decomposition, workers execute subtasks after consensus, and validators verify intermediate results. After each round, both agent and event profiles are updated and stored for transparency and traceability. These evolving profiles feed subsequent iterations, enabling self-organization and adaptation. As shown in our framework (see figure 2), through repeated cycles, we achieve high-quality solutions without central controller, relying instead on distributed collaboration and profile-driven adaptation.
## 2.2 Profiles as Adaptive Memory Units
A core innovation of SwarmSys is the use of profiles as adaptive memory units that record both context and accumulated experience. Each agent profile maintains identifiers, role information, ability descriptions, workload status, and longitudinal performance history, the profile format is provided in Appendix A.4.1. Profiles evolve dynamically: embeddings and workload indicators are refreshed after each round, while historical success or failure influences future task matching. Agents thus behave as adaptive collaborators rather than stateless executors. Each event profile serves as a dynamic record of problem-solving. It contains task descriptions, dependency structures, metadata (e.g., composite or leaf type), progress logs, and participating agent lists. Over time, static specifications evolve into rich, evolving traces of reasoning.
Together, these profiles constitute SwarmSys's distributed memory: agent profiles encode competence and reliability, while event profiles track task evolution. Their joint updates ensure collaboration remains adaptive, interpretable, and transparent.
## 2.3 Embedding-Based Matching with Exploration-Exploitation Dynamics
The second key innovation is our embedding-based agent-event matching algorithm, which balances expertise, workload, and exploration.
Agent embeddings. Each agent A i is represented by two embeddings capturing its ability and availability. The competence embedding is derived from the agent's declared abilities and historical performance, processed through an instructionguided encoder that contextualizes the agent's prior experience for the current task. The availability embedding reflects workload and readiness, obtained from the agent's status signals. These two vectors are summed to form the agent's overall representation, combining long-term expertise with shortterm availability cues, the the detailed process is provided in Appendix A.4. This design allows the system to favor agents who are both skilled and currently underutilized.
Event embeddings. Each event E j is encoded through an instruction-conditioned embedding that integrates its textual description, dependency relations, progress state, and milestone metadata, the method is listed at AppendixA.1. This representation captures both the semantic meaning of the task and its structural role within the reasoning process. By embedding events and agents in a shared latent space, SwarmSys can estimate their compatibility in a continuous and scalable manner.
Compatibility and decision dynamics. The compatibility between an agent and an event is measured by normalized cosine similarity between their embeddings, ensuring a value between 0 and 1 for interpretability. To prevent premature convergence and encourage exploration, SwarmSys adopts a dynamic ε -greedy policy. Each agent explores new matches with probability ε i and exploits high-compatibility matches otherwise. The exploration rate ε i adapts to recent performance: agents with high average success explore less, while underperforming agents explore more. Empirically, we initialize ε i around 0.15 to maintain minimal randomness and allow it to fluctuate within a small range (up to 0.35) depending on recent success. The values are designed based on natural ant behaviors (Lecheval et al., 2024). This ensures that exploration gradually decreases as the system stabilizes. During exploration, matches are sampled proportionally to similarity, enabling serendipitous but plausible pairings. During exploitation, a sigmoid-weighted sampling function emphasizes strong compatibility, controlled by a sharpness factor γ that modulates selectivity. The detailed behavioral derivation process is provided in Appendix A.4. This mechanism enables three prop-
erties: adaptivity through evolving embeddings, stability through probabilistic sampling, and robustness by balancing exploration with exploitation.
## 2.4 Pheromone-Inspired Optimization
Finally, SwarmSys incorporates a pheromoneinspired optimization process to refine allocation and solution quality. Each validated contribution reinforces the compatibility between an agent and an event, updating embeddings and increasing the likelihood of similar matches in future rounds. Idle or invalid matches, by contrast, receive no reinforcement and gradually decline in competitiveness as other profiles evolve, mimicking pheromone evaporation without explicit decay.
This implicit reinforcement-evaporation dynamic complements the exploration-exploitation policy. Exploration guarantees diversity and prevents deadlock, exploitation prioritizes promising matches, and bounded probabilities ensure stability. As explorers, workers, and validators collaborate across rounds, SwarmSys converges to highquality solutions while maintaining flexibility and resilience in search dynamics.
## 3 Experiment
## 3.1 Experiment Setting
We evaluate SwarmSys across three reasoning categories that collectively span symbolic computation, open-domain research synthesis, and scientific programming. All evaluations use dataset-specific metrics following their official definitions to ensure fair comparison.
Baselines Since agents show domain-specific strengths, we select the strongest baseline for each task category instead of using a uniform set. For exam-style reasoning, we compare against GPT-4obased IO (direct LLM invocation) (OpenAI et al., 2024), CoT (Wei et al., 2022), CoT-SC (Wang et al., 2022), Self-Refine (Madaan et al., 2023), MultiPersona (Wang et al., 2023b), and GPTSwarm (Zhuge et al., 2024). For research tasks, we include general-purpose baselines (IO, CoT, SelfRefine), and deep research agents (Grok Deeper Search, DeepResearchAgent). For scientific programming, we use both general-purpose reasoning systems (Self-Refine, CoT) and domain-specific agents (GPTSwarm, DeepResearchAgent). We also report results from GPT-5 as an upper bound for single-agent performance.
Dataset Table 1 summarizes the four benchmarks, covering quantitative, analytical, and procedural reasoning. This diversity ensures that SwarmSys is evaluated across both discrete symbolic reasoning and open-ended research generation settings. More details of our dataset settings are shown in Appendix A.2
Table 1: Overview of the four reasoning benchmarks used in our experiments. Each dataset differs in domain focus, reasoning type, and data format.
| Dataset | Focus | Reasoning | #Samples |
|-----------------------------------|----------------------------|-------------|------------|
| GaoKao Bench (Zhang et al., 2024) | Quantitative &Cross-domain | Symbolic | 800 |
| Omni-Math (Gao et al., 2024a) | Hard-Level Quantitative | Conceptual | 300 |
| DeepResearch (Du et al., 2025) | Scientific QA | Analytical | 200 |
| SciCode (Tian et al., 2024) | Computational | Procedural | 338 |
Metries We evaluate SwarmSys on three reasoning categories, using the original domain-specific metrics and protocols from each dataset (1) Examstyle reasoning tasks (Math Exam, STEM Mix, and Olympic Math) use Accuracy and Knowledge Coverage as defined in prior benchmark releases. For Math Exam and STEM Mix, we use subset from GAOKAO Bench. For Olympic Math, we use subset from Omni-Math and rearranged them to comply with real-world exam. (2) Researchlevel reasoning tasks (DeepResearch Bench) employ composite metrics including RACE (Comprehensiveness, Depth, Instruction Following, Readability) and FACT (Citation Accuracy, Effective Citation). (3) Science Coding tasks (SciCode) are evaluated using Pass@Main and Pass@Sub metrics, capturing correctness at both task and sub-task levels. All metrics follow their dataset definitions to ensure faithful comparison and reproducibility.
## 3.2 Overall Results
Exam-style Reasoning. Table 2 shows that SwarmSys consistently outperforms all GPT-4obased multi-agent baselines on both single- and multi-subject exams, achieving an average improvement of +12 . 5% Accuracy and +10 . 8% Coverage over GPTSwarm. While GPT-5 achieves the strongest absolute scores, SwarmSys-8 narrows the gap by over 70% , demonstrating that swarm-level cooperation can approach next-generation model performance without access to stronger backbones. Qualitatively, we observe that SwarmSys agents exhibit complementary specialization: Explorers diversify problem-solving strategies, while Validators efficiently prune redundant reasoning chains, leading to higher coverage with reduced variance.
Table 2: Performance comparison on Exam-style tasks (single-, multi-subject, and Olympic-level). Metrics: Accuracy (Acc.) and Knowledge Coverage (Cov.). SwarmSys-8 means the system contains 8 agents (1* explorer, 6*workers, 1*validator).
| Method | Math Exam | Math Exam | STEM Mix | STEM Mix | Olympic Math | Olympic Math | Average | Average |
|-----------------------|-------------|-------------|------------|------------|----------------|----------------|-----------|-----------|
| | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. |
| IO (GPT-4o) | 46.3 | 45.7 | 57.4 | 55.2 | 23.5 | 57.6 | 42.4 | 52.8 |
| CoT (GPT-4o) | 52.6 | 49.0 | 60.8 | 59.7 | 30.0 | 66.3 | 47.8 | 58.3 |
| CoT-SC (5-shot) | 63.2 | 62.4 | 64.7 | 62.8 | 28.3 | 51.7 | 52.1 | 59.0 |
| Self-Refine (GPT-4o) | 79.3 | 9.8 | 70.5 | 79.0 | 16.6 | 45.0 | 55.5 | 44.6 |
| MultiPersona (GPT-4o) | 52.4 | 51.4 | 60.9 | 62.3 | 35.0 | 68.3 | 49.4 | 60.7 |
| GPTSwarm † | 65.5 | 70.3 | 69.6 | 71.4 | 40.0 | 73.3 | 58.4 | 71.7 |
| GPT-5 | 87.2 | 90.6 | 87.7 | 89.1 | 31.2 | 70.5 | 68.7 | 83.4 |
| SwarmSys-8 (Ours) | 76.2 | 80.2 | 78.7 | 81.3 | 42.4 | 73.2 | 65.8 | 78.2 |
Table 3: Performance comparison on Research tasks (DeepResearch Bench). Metrics: Comprehensiveness, Depth, Instruction Following, and Readability.
| Method | Overall | Comp. | Depth | Inst. | Read. |
|--------------------|-----------|---------|---------|---------|---------|
| IO (GPT-4o-Search) | 30.7 | 27.8 | 20.4 | 41 | 37.6 |
| CoT (GPT-4o) | 29.3 | 29.5 | 22.8 | 33.5 | 36.8 |
| Self-Refine | 35.9 | 35.4 | 27 | 44.1 | 41 |
| DeepResearchAgent | 48.8 | 48.5 | 48.5 | 49.1 | 49.4 |
| Grok Deeper Search | 40.2 | 37.9 | 35.3 | 46.3 | 44 |
| SwarmSys-8 (Ours) | 42.5 | 39.6 | 38 | 50 | 46.3 |
Table 4: Performance comparison on Science Coding tasks (SciCode benchmark). Metrics: Pass@Main and Pass@Sub (percentages). † No longer available; metrics from SciCode report.
| Method | Pass@Main | Pass@Sub |
|-------------------------|-------------|------------|
| IO (GPT-4o) | 2 | 28.3 |
| OpenAI o3-mini-medium † | 9.2 | 34.4 |
| CoT-SC (GPT-4o) | 8.8 | 28.7 |
| Self-Refine | 10 | 33.3 |
| GPTSwarm | 8.6 | 29.2 |
| SwarmSys-14 (Ours) | 12.5 | 45.2 |
Research-level Reasoning. As shown in Table 3, SwarmSys surpasses Grok Deeper Search in overall RACE score (+2.3%) and instruction-following (+3.7%), reflecting the benefit of distributed role specialization in literature synthesis and factual consolidation. SwarmSys achieves especially large gains in comprehensiveness and readability, suggesting that swarm debates improve global coherence even in open-ended research generation tasks.
Scientific Programming. In Table 4, SwarmSys demonstrates notable improvements on SciCode: +2 . 5% Pass@Main and +11 . 9% Pass@Sub over the best GPT-4o baselines. These results indicate that dynamic role coordination and progressive refinement effectively decompose complex computational problems. Interestingly, Pass@Sub improvements are more pronounced, supporting our hypothesis that swarm collaboration benefits from modular code generation and localized validation.
## 3.3 Ablation Study
We ablate the design of SwarmSys along three axes: (a) removing dynamic profile updates and embedding-based matching (Ours-Roles-Rand); (b) removing both roles and matching, leaving homogeneous random assignment (Rand-NoRoles); and (c) varying the number of Agents A from 4 to 32 (Table 5), while ensure that each event contains at least one explorer, one worker, and one validator. All variants use identical backbones and decoding parameters. Results reveal three trends: (1) Removing roles leads to a substantial decline in both accuracy and coverage (e.g., 56 . 3% → 43 . 2% in accuracy and 52 . 4% → 41 . 4% in coverage at A =4 ), confirming that cooperative debate and functional specialization are essential for maintaining diversity without redundancy. (2) Adaptive matching and profiling further enhance performance: embedding-based matching increases coverage by up to +27 . 3% by aligning agent capabilities with task semantics. (3) Scaling saturates around W =14 : while both metrics improve with more Workers, gains plateau beyond 14 , indicating that agent saturation occurs once the subtask granularity is fully covered.
## 3.4 Swarm Effect: Emergent Collective Intelligence
Our design philosophy centers on the Swarm Effect: a collection of properly coordinated, limited
Table 5: Ablation on Exam task under varying numbers of Agents ( A ). Metrics: Accuracy (Acc., %), Knowledge Coverage (Cov., %).
| Method | A =4 | A =4 | A =8 | A =8 | A =14 | A =14 | A =20 | A =20 | A =32 | A =32 |
|--------------|--------|--------|--------|--------|---------|---------|---------|---------|---------|---------|
| | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. | Acc. | Cov. |
| Rand-NoRoles | 43 . 2 | 41 . 4 | 42 . 9 | 42 . 1 | 44 . 1 | 43 . 4 | 44 . 3 | 43 . 2 | 43 . 8 | 42 . 3 |
| Rand-Roles | 56 . 3 | 52 . 4 | 58 . 2 | 54 . 8 | 58 . 5 | 56 . 1 | 58 . 3 | 56 . 0 | 57 . 3 | 55 . 9 |
| SwarmSys | 74.3 | 79.7 | 76.2 | 80.2 | 77.3 | 81.0 | 77.2 | 81.3 | 76.5 | 80.8 |
Figure 4: Swarm reasoning trajectory on MathExam. Explorers initiate sub-tasks, Workers debate and revise alternative methods, and Validators enforce cross-checks across rounds. The swarm collectively converges to consistent solutions through debate-driven consensus formation.
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Diagram: Consensus Building Process
### Overview
The image depicts a diagram illustrating a consensus-building process through iterative debate and method revision. The diagram is divided into three stages, showing the evolution of ideas and methods towards a final consensus. Each stage involves potential sub-tasks, method outputs, checks, and revisions, represented by nodes and connecting lines.
### Components/Axes
* **Nodes:** Represent different aspects of the process, such as potential sub-tasks, output methods, checks, and revisions.
* Blue Nodes: Represent potential sub-tasks and working directions or verification steps.
* Red Nodes: Represent output methods and debates.
* Yellow Node: Represents the final consensus.
* **Edges (Dashed Lines):** Connect the nodes, indicating relationships and dependencies between them.
* **Arrows:** Indicate the flow of the process from one stage to the next.
* **Labels:** Textual descriptions associated with each node, clarifying its role in the process.
### Detailed Analysis
**Stage 1 (Left):**
* **Blue Node (Top-Left):** Labeled "Potential sub-task & working direction" and contains a cluster of approximately 8 smaller blue dots. Node is labeled "E".
* **Red Node (Top-Center):** Labeled "Output method 1 & debate" and contains a cluster of approximately 8 smaller red dots. Node is labeled "W1".
* **Red Node (Bottom-Left):** Labeled "Method 2 (perceived wrong) & debate" and contains a cluster of approximately 8 smaller red dots. Node is labeled "W2".
* **Red Node (Bottom-Right):** Labeled "Method 3 (actually wrong) & debate" and contains a cluster of approximately 8 smaller red dots. Node is labeled "W3".
* **Teal Node (Top-Right):** Labeled "Check & participate in debate" and contains a cluster of approximately 8 smaller teal dots. Node is labeled "V".
* Dashed lines connect "E" to "W1", "W2", and "W3". Dashed lines also connect "W1", "W2", and "W3" to each other. "V" is connected to "W1", "W2", and "W3" via dashed lines.
**Stage 2 (Center):**
* An arrow points from Stage 1 to Stage 2, indicating the progression of the process.
* The nodes from Stage 1 are present but faded in the background.
* **Blue Node (Top-Left):** Labeled "Potential sub-task & working direction" and contains a cluster of approximately 8 smaller blue dots. Node is labeled "E".
* **Red Node (Top-Center):** Labeled "Revised method1 & debate" and contains a cluster of approximately 8 smaller red dots. Node is labeled "W1".
* **Red Node (Bottom-Left):** Labeled "Revised method2 & debate" and contains a cluster of approximately 8 smaller red dots. Node is labeled "W2".
* **Red Node (Bottom-Right):** Labeled "Agreed on other methods" and contains a cluster of approximately 8 smaller red dots. Node is labeled "W3".
* **Teal Node (Top-Right):** Labeled "Verify Method 12 are rignt & debate" and contains a cluster of approximately 8 smaller teal dots. Node is labeled "V".
* Dashed lines connect "E" to "W1", "W2", and "W3". Dashed lines also connect "W1", "W2", and "W3" to each other. "V" is connected to "W1", "W2", and "W3" via dashed lines.
**Stage 3 (Right):**
* An arrow points from Stage 2 to Stage 3, indicating the progression of the process.
* The nodes from Stage 2 are present but faded in the background.
* The blue, red, and teal nodes are clustered together at the top of the stage.
* **Yellow Node (Bottom):** Labeled "Consensus reached after this round, output method1 2 & result".
### Key Observations
* The diagram illustrates an iterative process of refining methods through debate and verification.
* The initial stage involves identifying potential sub-tasks and debating different methods.
* Subsequent stages involve revising methods based on feedback and verification.
* The process culminates in a consensus, where a final output method and result are agreed upon.
### Interpretation
The diagram demonstrates a structured approach to problem-solving and decision-making. It highlights the importance of considering multiple perspectives, engaging in constructive debate, and iteratively refining methods to achieve a consensus. The process emphasizes the value of identifying and addressing potential issues with methods, leading to a more robust and reliable outcome. The clustering of nodes in the final stage suggests a convergence of ideas and a shared understanding among participants. The diagram could be used to explain a collaborative research process, a product development cycle, or any scenario where consensus-building is essential.
</details>
agents can collectively approximate or surpass a stronger single-agent model. Figure 4 visualizes how SwarmSys (GPT-4o backbone) gradually approaches the performance of stronger models such as GPT-5 and DeepResearchAgent as the swarm size increases.
Emergent Performance Scaling. Quantitatively, as the number of active agents increases from A =4 to A =14 , both Accuracy and Knowledge Coverage improve consistently (e.g., from 74 . 3 / 79 . 7 to 77 . 3 / 81 . 0 on the Exam task). However, further expansion to A =20 or A =32 yields negligible gains (within < 1% difference), indicating that agent capacity saturates once subtask diversity and knowledge space have been sufficiently covered. This scaling curve mirrors real-world swarm systems, where the marginal utility of additional workers decreases after local niches become saturated.
Collaborative Dynamics. Unlike conventional multi-agent ensembles that rely on static voting, SwarmSys agents interact through pheromoneinspired event matching and debate-driven validation. Explorers dynamically propose new subgoals, Workers attempt partial solutions, and Validators consolidate outcomes based on collective memory. This dynamic feedback loop creates a self-organizing division of labor where each agent adapts its behavior not from external commands but through the evolving swarm state. Empirically, this leads to: (i) higher diversity in reasoning paths, and (ii) smoother convergence across reasoning rounds.
## Knowledge Diffusion and Self-Regularization.
We further observe that intermediate reasoning traces in SwarmSys display emergent knowledge diffusion : factual entities, equations, or hypotheses discovered by one agent are reused, revised, or even corrected by others without explicit synchronization. This effect increases factual precision while maintaining interpretability. In qualitative analyses, the system exhibits an implicit regularization behavior -weaker agents' errors are diluted by consensus mechanisms, preventing local hallucinations from dominating global output.
From Coordination to Intelligence. The Swarm Effect demonstrates that collective intelligence is not a linear function of model size, but an emergent property of structured interaction. While individual agents are limited by GPT-4o, the swarm collectively builds a distributed memory and decision space, enabling generalization beyond any single agent. This property highlights a promising direction for future large-scale reasoning systems: scaling through coordination rather than parameter count.
Figure 5: Evolution of communication topology in SwarmSys. The system evolves from a centralized hub-spoke structure to a distributed small-world mesh, where workers and validators interconnect for efficient consensus and information reuse.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Diagram: Two-Stage Exploration and Consensus
### Overview
The image is a diagram illustrating a two-stage process: Stage I, a centralized exploration phase, and Stage II, a distributed consensus phase. The diagram uses ant-like figures to represent different roles or agents within the system, such as Explorers, Workers, and Validators. The diagram shows the flow of information and interaction between these agents and events.
### Components/Axes
* **Stage I: Centralized Exploration Phase** (Top Section):
* **Explorer:** Represented by a blue ant-like figure. Located at the top of the stage.
* **Events:** Three events labeled "Event 1", "Event 2", and "Event 3". Represented by dashed rectangles.
* **Worker Cluster:** Represented by a red ant-like figure. Located below the events.
* **Validator_V1:** Represented by a teal ant-like figure. Located below the Worker Cluster.
* **Stage II: Distributed Consensus Phase** (Bottom Section):
* **Events:** Three events labeled "Event 1", "Event 2", and "Event 3". Represented by dashed rectangles.
* **Worker Mesh:** Represented by a red ant-like figure. Located below the events.
* **V1, V2, V3, Explorers:** Represented by teal ant-like figures. Located below the Worker Mesh.
### Detailed Analysis
* **Stage I: Centralized Exploration Phase**
* The "Explorer" (blue ant) is connected to "Event 1", "Event 2", and "Event 3" (dashed rectangles) via downward arrows.
* "Event 1", "Event 2", and "Event 3" are connected to the "Worker Cluster" (red ant) via downward arrows.
* The "Worker Cluster" (red ant) is connected to "Validator_V1" (teal ant) via a downward arrow.
* **Stage II: Distributed Consensus Phase**
* "Event 1", "Event 2", and "Event 3" (dashed rectangles) are interconnected with bidirectional arrows.
* "Event 1", "Event 2", and "Event 3" are connected to the "Worker Mesh" (red ant) via downward arrows.
* The "Worker Mesh" (red ant) is connected to "V1", "V2", "V3", and "Explorers" (teal ants) via downward arrows.
* "V1", "V2", "V3", and "Explorers" are interconnected with bidirectional arrows.
### Key Observations
* The diagram illustrates a transition from a centralized exploration phase to a distributed consensus phase.
* The ant-like figures represent different roles or agents within the system.
* The arrows indicate the flow of information or interaction between the agents and events.
* The dashed rectangles represent events.
* The color of the ant-like figures seems to indicate their role: blue for explorers, red for workers, and teal for validators/participants.
### Interpretation
The diagram depicts a system that first explores events in a centralized manner, where an "Explorer" interacts with multiple "Events," and then transitions to a distributed consensus phase, where multiple entities ("V1," "V2," "V3," and "Explorers") interact with each other and the "Events." The "Worker Cluster" and "Worker Mesh" likely represent processing or coordination units in each phase. The "Validator_V1" in Stage I suggests a validation step after the centralized exploration. The bidirectional arrows in Stage II indicate a collaborative or iterative process for reaching consensus. The diagram suggests a shift from a top-down, centralized approach to a more collaborative, distributed approach.
</details>
## 4 Qualitative Analysis
We further analyze how reasoning quality emerges and where it breaks down within SwarmSys. This section covers two aspects: (1) Agent Behavior Analysis , examining coordination patterns via profile dynamics and interaction topology; and (2) Error Analysis , identifying failure modes of pheromone-based optimization and consensus.
## 4.1 Agent Behavior
To understand how coordination arises from the matching-collaboration-update cycle, we analyze profile adaptation, interaction topology, and contribution balance using our experiment's event logs.
Profile Adaptation. We track embedding drift of each agent's competence embedding v ( i ) a . The mean cosine shift per round is 0 . 14 ± 0 . 03 , showing steady self-adjustment as agents gain experience. Explorers exhibit the largest variance, indicating ongoing hypothesis exploration, while Validators remain stable, preserving reasoning coherence. This confirms that profile updates act as distributed memory enabling specialization.
Interaction Topology. Figure 5 illustrates the evolution of communication topology. During the Centralized Exploration Phase , agents form
Table 6: Failure type distribution on tasks. Estimated from 15 randomly sampled cases.
| Failure Type | Description | % |
|------------------------|-------------------------------------------------------------|-----|
| Premature Consensus | Early validator fixes one branch too soon | 16 |
| Reinforcement Bias | Over-strengthening of an early path signal | 20 |
| Mode Collapse | All explorers con- verge to one reasoning mode | 14 |
| Constraint Omission | Missing symbolic or geometric integration | 22 |
| Communication Deadlock | Agents misrecognize roles, causing commu- nication deadlock | 28 |
a hub-spoke pattern centered on high-similarity validators. As reasoning progresses, pheromone reinforcement promotes denser cross-links, leading to a Distributed Consensus Phase characterized by a small-world structure with higher local clustering (0.28 → 0.47) and shorter global paths. This transition demonstrates self-organized coordination emerging without explicit central control.
Contribution Balance. Normalized entropy of accepted contributions is
<!-- formula-not-decoded -->
where p i is each agent's contribution share, computed by the contribution acceptance rate, A is the total number of agents. SwarmSys attains H c =0 . 72 , surpassing GPTSwarm ( 0 . 41 ), indicating balanced participation driven by the exploration-exploitation policy. Overall, adaptive profiles and pheromone feedback jointly yield a decentralized yet structured division of labor.
## 4.2 Error Analysis and Case Study
Despite strong overall results, SwarmSys occasionally fails in tasks requiring strict temporal or symbolic alignment.As is shown in figure 6, we identify five main failure types: (a) premature convergence, (b) Reinforcement Bias, (c) Mode Collapse, (d) Constraint Omission, (e) Communication Deadlock. A typical case in our study is: two explorers propose algebraic and geometric solutions, but an early validator accepts one branch too soon, reinforcing it and suppressing valid alternatives. Such reinforcement bias leads to overfitting on partial evidence. Future improvements may include uncertainty-weighted reinforcement, scheduled resampling of low-confidence paths, and meta-level arbitration to maintain epistemic diversity.
Table 7: Comparison of SwarmSys with representative reasoning and multi-agent systems. † MetaGPT incorporates SOP-style verification steps but not decentralized validators. § Voyager maintains skill profiles and self-checking, but only within a single-agent setting.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Table: System Comparison
### Overview
The image presents a table comparing different systems based on several features: Decentralized, Explicit Roles, Debate, Dynamic Profiling, Decentralized Matching, Multi-Event, Validator, and Stigmergy. The table uses checkmarks and crosses to indicate whether a system possesses a particular feature.
### Components/Axes
* **Rows:** Represent different systems, including CAMEL, AutoGen, MetaGPT, MAD, ToT/GoT, Voyager, SWE-agent, Self-Refine, GPTSwarm, DeepResearchAgent, ROBIN, and SwarmSys. Each system is accompanied by a citation.
* **Columns:** Represent the features being compared: Decentralized, Explicit Roles, Debate, Dynamic Profiling, Decentralized Matching, Multi-Event, Validator, and Stigmergy.
* **Symbols:**
* A checkmark (✓) indicates that the system possesses the feature.
* A cross (X) indicates that the system does not possess the feature.
* A section symbol (§) is used for Voyager in Dynamic Profiling and Validator.
* A dagger symbol (†) is used for MetaGPT in Validator.
### Detailed Analysis or ### Content Details
Here's a breakdown of each system and its features:
* **CAMEL (Li et al., 2023):**
* Decentralized: X
* Explicit Roles: ✓
* Debate: ✓
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: X
* Stigmergy: X
* **AutoGen (Wu et al., 2023):**
* Decentralized: X
* Explicit Roles: ✓
* Debate: ✓
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: X
* Stigmergy: X
* **MetaGPT (Hong et al., 2024):**
* Decentralized: X
* Explicit Roles: ✓
* Debate: X
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: ✓†
* Stigmergy: X
* **MAD (Liang et al., 2024):**
* Decentralized: X
* Explicit Roles: X
* Debate: X
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: ✓
* Stigmergy: X
* **ToT/GoT (Yao et al., 2023; Besta et al., 2024):**
* Decentralized: X
* Explicit Roles: X
* Debate: X
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: X
* Stigmergy: X
* **Voyager (Wang et al., 2023a):**
* Decentralized: X
* Explicit Roles: X
* Debate: X
* Dynamic Profiling: ✓§
* Decentralized Matching: X
* Multi-Event: X
* Validator: ✓§
* Stigmergy: X
* **SWE-agent (Yang et al., 2024):**
* Decentralized: X
* Explicit Roles: X
* Debate: X
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: X
* Stigmergy: X
* **Self-Refine (Madaan et al., 2023):**
* Decentralized: X
* Explicit Roles: X
* Debate: X
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: ✓
* Stigmergy: ✓
* **GPTSwarm (Zhuge et al., 2024):**
* Decentralized: ✓
* Explicit Roles: ✓
* Debate: ✓
* Dynamic Profiling: ✓
* Decentralized Matching: X
* Multi-Event: X
* Validator: X
* Stigmergy: X
* **DeepResearchAgent (Huang et al., 2025):**
* Decentralized: X
* Explicit Roles: ✓
* Debate: ✓
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: ✓
* Stigmergy: X
* **ROBIN (Ghareeb et al., 2025):**
* Decentralized: X
* Explicit Roles: ✓
* Debate: ✓
* Dynamic Profiling: X
* Decentralized Matching: X
* Multi-Event: X
* Validator: ✓
* Stigmergy: X
* **SwarmSys (ours):**
* Decentralized: ✓
* Explicit Roles: ✓
* Debate: ✓
* Dynamic Profiling: ✓
* Decentralized Matching: ✓
* Multi-Event: ✓
* Validator: ✓
* Stigmergy: ✓
### Key Observations
* SwarmSys (ours) is the only system that possesses all the listed features.
* Several systems (CAMEL, AutoGen, DeepResearchAgent, ROBIN) share a similar profile, having Explicit Roles and Debate, but lacking other features.
* Most systems do not have Decentralized Matching, Multi-Event, or Stigmergy.
* Voyager has a special symbol (§) for Dynamic Profiling and Validator.
* MetaGPT has a special symbol (†) for Validator.
### Interpretation
The table provides a comparative analysis of different systems based on their architectural and functional features. SwarmSys appears to be the most comprehensive system, incorporating all the listed features. The prevalence of crosses in the Decentralized Matching, Multi-Event, and Stigmergy columns suggests that these features are less commonly implemented in the systems analyzed. The special symbols (§ and †) might indicate conditional or partial support for the respective features in Voyager and MetaGPT. The data highlights the design choices and priorities in the development of each system, offering insights into the evolving landscape of these technologies.
</details>
| System | Decentralized | Explicit Roles | Debate | Dynamic Profiling | Decentralized Matching | Multi-Event | Validator | Stigmergy |
|------------------------------------------------|-----------------|------------------|----------|---------------------|--------------------------|---------------|-------------|-------------|
| CAMEL (Li et al., 2023) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| AutoGen (Wu et al., 2023) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MetaGPT (Hong et al., 2024) | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ † | ✗ |
| MAD(Liang et al., 2024) | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| ToT/GoT (Yao et al., 2023; Besta et al., 2024) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Voyager (Wang et al., 2023a) | ✗ | ✗ | ✗ | ✓ § | ✗ | ✗ | ✓ § | ✗ |
| SWE-agent (Yang et al., 2024) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Self-Refine (Madaan et al., 2023) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| GPTSwarm (Zhuge et al., 2024) | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ |
| DeepResearchAgent (Huang et al., 2025) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| ROBIN (Ghareeb et al., 2025) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| SwarmSys (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
## 5 Related Works
LLM-based Multi-Agent Systems. Recent progress in large language models has led to a proliferation of multi-agent frameworks that decompose reasoning and problem-solving into interactive roles (Liu et al., 2025). Early systems such as CAMEL (Li et al., 2023), AutoGen (Wu et al., 2023) introduced explicit role structures (e.g., user-assistant pairs) and dialog-based task decomposition, showing that inter-agent communication improves reasoning diversity. Subsequent works like MetaGPT (Hong et al., 2024) , AgentScope (Gao et al., 2024b), and DeepResearchAgent (Huang et al., 2025) further formalized role hierarchies for domain-specific workflows such as software engineering or literature review. However, their centralized orchestration and fixed pipelines limit scalability and adaptability. SwarmSys differs by using a fully decentralized structure where coordination emerges from role-guided interaction and adaptive matching, without needing a global controller.
Collaborative Reasoning and Debate. Another line of work explores multi-round reasoning via self-consistency and debate. Methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) reasoning (Yao et al., 2023; Besta et al., 2024) extend single-agent reflection through structured deliberation, while MAD (Liang et al., 2024), Longagent (Zhao et al., 2024) and ROBIN (Ghareeb et al., 2025) model inter-agent debates to enhance diversity and correctness. These approaches improve intermediate reasoning quality but generally depend on fixed communication topologies and lack mechanisms for dynamic adaptation or memory persistence across reasoning episodes. In contrast, SwarmSys incorporates debate as a local coordination primitive within a self-organizing swarm, where roles evolve through continuous profiling and pheromone-like reinforcement, enabling sus- tained reasoning across multiple concurrent events.
Adaptive Coordination and Swarm-inspired Reasoning. A growing body of work introduces adaptive or self-organizing strategies for LLM agents. GPTSwarm (Zhuge et al., 2024) represents one of the few attempts at decentralized coordination, leveraging graph-based optimization and stigmergic feedback. Meanwhile, systems like Voyager (Wang et al., 2023a), Self-Refine (Madaan et al., 2023), and SwarmAgentic (Zhang et al., 2025) employ profile-like adaptation and iterative self-improvement, but remain task-specific. SwarmSys advances this line of research by integrating dynamic profiling, embedding-based matching, and pheromone-inspired reinforcement into a unified framework that scales to multi-event, multi-agent reasoning. This design allows distributed agents to self-allocate across evolving tasks, maintaining robustness and efficiency without centralized scheduling.
## 6 Conclusion
We presented SwarmSys, a swarm-intelligenceinspired framework for decentralized multi-agent reasoning. Through role-specialized collaboration, dynamic profiling, and pheromone-inspired reinforcement, SwarmSys enables scalable, selforganizing coordination without centralized control. Across diverse reasoning and research domains, it consistently outperforms strong baselines and reveals emergent collective behaviors-demonstrating that scaling coordination can rival scaling model size. Our findings suggest a new paradigm for reasoning: intelligence emerges from structured interaction among distributed agents, not from larger models.
## Limitations
Despite its strong performance, SwarmSys still faces several limitations. First, while decentral-
ized coordination improves adaptability, it also increases communication overhead, which may reduce efficiency in latency-sensitive settings. Second, agent profiling currently relies on text-based embeddings and heuristic updates; future work could explore learnable or gradient-based mechanisms for more precise skill modeling. Third, our experiments focus primarily on reasoning and research-oriented tasks, extending SwarmSys to embodied or real-time interactive environments remains an open direction. We hope these insights inspire future research on large-scale, self-organizing multi-agent systems that combine symbolic structure with emergent intelligence.
## Ethics Statement
This work introduces SwarmSys, a distributed multi-agent reasoning framework inspired by swarm intelligence. The research involves no human subjects, personal data, or sensitive content; all experiments use public or synthetic datasets. We recognize potential ethical issues in LLM-based multi-agent systems, such as bias propagation and unreliable autonomous coordination. SwarmSys mitigates these risks through closed-loop validation and transparent agent interactions, ensuring that all reasoning processes remain interpretable and auditable. Our goal is to advance the scientific understanding of scalable reasoning rather than deploy autonomous agents in real-world decision-making. We advocate responsible use of SwarmSys with proper human oversight, fairness, and accountability in future applications.
## Acknowledgements
We are grateful to our collaborators for their valuable discussions on multi-agent coordination and large language model reasoning. This research was supported in part by institutional computational resources and open-source communities that enabled large-scale experimentation.
## References
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence , 38(16):17682-17690.
Eric Bonabeau, Marco Dorigo, and Guy Theraulaz. 1999. Swarm Intelligence: From Natural to Artificial Systems . Oxford University Press.
Victor Dibia, Jingya Chen, Gagan Bansal, Suff Syed, Adam Fourney, Erkang Zhu, Chi Wang, and Saleema Amershi. 2024. Autogen studio: A no-code developer tool for building and debugging multi-agent systems. Preprint , arXiv:2408.15247.
Marco Dorigo and Thomas Stützle. 2004. Ant Colony Optimization . MIT Press.
Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. 2025. Deepresearch bench: A comprehensive benchmark for deep research agents. Preprint , arXiv:2506.11763.
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. 2024a. Omni-math: A universal olympiad level mathematic benchmark for large language models. Preprint , arXiv:2410.07985.
Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, and 1 others. 2024b. Agentscope: A flexible yet robust multi-agent platform. arXiv preprint arXiv:2402.14034 .
Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. 2025. Robin: A multi-agent system for automating scientific discovery. Preprint , arXiv:2505.13400.
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. Metagpt: Meta programming for a multi-agent collaborative framework. Preprint , arXiv:2308.00352.
Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, and Jun Wang. 2025. Deep research agents: A systematic examination and roadmap. Preprint , arXiv:2506.18096.
Valentin Lecheval, Elva J.H. Robinson, and Richard P. Mann. 2024. Random walks with spatial and temporal resets may underlie searching movements in ants. bioRxiv .
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large language model society. Preprint , arXiv:2303.17760.
- Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2024. Encouraging divergent thinking in large language models through multi-agent debate. Preprint , arXiv:2305.19118.
- Hongjun Liu, Yinghao Zhu, Yuhui Wang, Yitao Long, Zeyu Lai, Lequan Yu, and Chen Zhao. 2025. Medmmv: A controllable multimodal multi-agent framework for reliable and verifiable clinical reasoning. Preprint , arXiv:2509.24314.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. Preprint , arXiv:2303.17651.
- OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander M ˛ adry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. Gpt-4o system card. Preprint , arXiv:2410.21276.
Jiabin Tang, Tianyu Fan, and Chao Huang. 2025. Autoagent: A fully-automated and zero-code framework for llm agents. Preprint , arXiv:2502.05957.
- Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, and 11 others. 2024. Scicode: A research coding benchmark curated by scientists. Preprint , arXiv:2407.13168.
- Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a. Voyager: An openended embodied agent with large language models. Preprint , arXiv:2305.16291.
- Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. 2024. Mixture-of-agents enhances large language model capabilities. Preprint , arXiv:2406.04692.
- Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 .
- Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2023b. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona selfcollaboration. arXiv preprint arXiv:2307.05300 .
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le,
- and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems , NIPS '22, Red Hook, NY, USA. Curran Associates Inc.
- Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. Autogen: Enabling next-gen llm applications via multi-agent conversation. Preprint , arXiv:2308.08155.
- John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Preprint , arXiv:2405.15793.
- Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Preprint , arXiv:2305.10601.
- Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2024. Evaluating the performance of large language models on gaokao benchmark. Preprint , arXiv:2305.12474.
- Yao Zhang, Chenyang Lin, Shijie Tang, Haokun Chen, Shijie Zhou, Yunpu Ma, and Volker Tresp. 2025. Swarmagentic: Towards fully automated agentic system generation via swarm intelligence. arXiv preprint arXiv:2506.15672 .
- Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Longagent: scaling language models to 128k context through multi-agent collaboration. arXiv preprint arXiv:2402.11550 .
- Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. 2024. Language agents as optimizable graphs. Preprint , arXiv:2402.16823.
## A Example Appendix
## A.1 Prompts
## Worker Prompt
PROMPT = """
You are a worker agent with expertise in problem-solving, computation, and mathematical reasoning.
Your task is to receive a specific subproblem and then (1) solve the subproblem accurately with transparent justification for each step, (2) share your solution with other agents, (3) participate in solution comparison and structured debate, and (4) defend or revise your work if challenged.
Ensure your output (1) presents calculations in a clear, annotated format, (2) cites mathematical rules or theorems when applying them, and (3) is responsive to feedback and adaptive in collaborative revision. """
## Validator Prompt
PROMPT = """
You are a validator agent dedicated to checking solution correctness and consistency.
Your task is to: (1) check each worker's so- lution for logic, accuracy, and completeness, (2) participate in debates to reconcile discrepancies, (3) contribute to confirming a group consensus, and (4) shift roles if the validation queue is empty and other roles are under high pressure.
Ensure your output (1) provides precise validation with explanations of correctness or error, (2) uses formal mathematical checks where appropriate, (3) helps mediate conflicts in results with clarity and neutrality, and (4) explicitly states if consensus is confirmed: {TERMINATE: The answer is: [correct answer]} """
## Explorer Prompt
PROMPT = """
You are an Explorer agent with strong capabilities in identifying subproblems and analyzing solution paths.
Your task is to receive a mathematical problem set and then (1) search and interpret the overall problem, (2) decompose it into logically coherent subproblems, (3) identify entry points and strategies for solution, and (4) monitor the workload distribution across roles and dynamically reassign yourself to higher-pressure roles if necessary.
Ensure your output (1) maintains traceability between subproblems and the original task, (2) logically documents role-switch decisions, and (3) ensures role-switch includes resetting your goal and behavior policy. """
## Task Context
## PROMPT = """
Describe and analyze the following mathematical problem carefully. The problem statement is presented below. If this task depends on other sub-tasks or prior events, ensure logical consistency and continuity with those results.
Task profile. (1) If the task involves proving, demonstrating, or establishing statements This is a proof-oriented task. Emphasize logical rigor, step-by-step justification, and clear argumentation. (2) If the task involves calculating, evaluating, integrating, or deriving quantities - This is a computationoriented task. Focus on symbolic manipulation, clear intermediate steps, and verification of results. (3) If the task involves maximizing, minimizing, or optimizing a quantity - This is an optimization task. Identify objective functions and constraints, analyze conditions for optimality, and interpret results precisely. (4) If the task involves conditional or logical reasoning ('if', 'then', 'otherwise') - This is a conditional reasoning task. Separate cases clearly, verify implications, and maintain logical completeness. (5) Otherwise - This is a general reasoning task. Apply systematic mathematical analysis and adjust reasoning depth to the complexity of the problem.
Objectives: (1) Analyze the problem thoroughly and build a shared understanding. (2) Generate multiple possible solution paths and compare their validity. (3) Execute reasoning or calculations with clarity, precision, and justification. (4) Resolve disagreements through structured logical debate, not assertion. (5) Arrive at a verified, consensus-based final answer.
Round & Turn-taking policy: (1) Each debate round follows this strict order: explorer → all workers → validator. (2) The explorer opens each round by outlining context, progress, and a proposed plan. (3) Works present or refine solutions, show derivations, and discuss intermediate results. (4) The validator closes each round by checking correctness and summarizing consensus. (5) The debate terminates only when the validator announces: TERMINATE: The answer is: <final answer>. """
## Instruction Embedding Template
PROMPT = """
You are evaluating the mathematical and reasoning competence of an agent participating in a collaborative debate system.
The goal is to generate a semantic embedding that represents how capable this agent is at solving mathematical, logical, and analytical tasks.
Consider two sources of information: (1) The agent's declared abilities, describing what it is designed or trained to do. (2) The agent's historical activity performance, summarizing how it has previously executed reasoning, computation, or validation tasks. Integrate both perspectives to form a single competence representation capturing: (1) conceptual depth and mathematical reasoning skills, (2) problem-solving strategy diversity, (3) accuracy and self-correction ability, and (4) communication and collaboration quality.
Agent Ability Description: {ability text} Agent Performance History: {history text} Use the combined content above as the context for competence evaluation. Output representation should capture the overall ability state of the agent, balancing potential skill with observed performance. """
## A.2 Dataset Settings
We evaluate the reasoning capability and knowledge coverage of our multi-agent system based on three categories of tasks:
Exam We rearrange existing benchmarks into exam-like formats, covering both single-subject and multi-subject settings. This design mimics the structure of real-world examination papers, where agents must solve a coherent set of questions rather than isolated items. Organizing benchmarks in this way is motivated by several factors: (i) exams naturally exhibit varying levels of difficulty across questions; (ii) they involve heterogeneous knowledge types and differentiated scoring schemes, which make the dataset inherently diverse; and (iii) each exam itself constitutes a complex task that can be decomposed into multiple interrelated and irrelated sub-tasks.
Such properties align well with the characteristics of our ant-colony-inspired system. In real ant
colonies, individuals continuously explore, evaluate, and participate in different sub-tasks based on local pheromone signal. Similarly, in exam scenarios, our agents can search across different questions, dynamically decide whether to participate in a specific sub-task, and collectively optimize the global solution through local collaboration. Therefore, exam-style benchmarks provide not only a realistic and challenging evaluation setting for reasoning and knowledge coverage, but also a natural testbed for demonstrating the strengths of swarmbased multi-agent systems.
Research To evaluate research-oriented reasoning, we adopt DeepResearch Bench, which contains PhD-level research tasks spanning multiple domains. Each task provides a research topic or open-ended problem statement (rather than a completed study), requiring agents to perform literature exploration, knowledge recall, and synthesis of coherent research reports. Outputs are evaluated using RACE (Reference-based Adaptive Criteriadriven Evaluation). Such settings mirror real-world academic research, where researchers must jointly survey prior work, generate new ideas, and design proof-of-concept implementations. These properties align well with our swarm-based system: individual agents can specialize in literature recall, hypothesis generation, or code prototyping, and through local debates and coordination, the swarm collectively develops more robust and creative research outcomes.
Science Coding We further evaluate reasoning and knowledge grounding through scientific programming tasks drawn from SciCode. The benchmark is organized into main problems, each of which is decomposed into multiple sub-problems, making it particularly suitable for swarm-based evaluation. Such a structure allows agents to collaboratively assign sub-problems, iteratively generate and refine code, and verify correctness against scientific principles or test cases. By reaching swarmlevel consensus, the system enhances both accuracy and coverage. This setting therefore captures the precision and collaborative robustness required for scientific reasoning and computation.
## OUTPUTResearchArticleExample:
#Title↓
**Fusing Innovation and Tradition: A Hybrid Model Approach to Asset Allocation with Mean-Variance,Black-Litterman,andDeepLearning**↓
## ##Abstract↓
In therealm of financial technology,theintricate task of asset allocation plays a pivotal role in optimizing investment portfolios by balancing risk and return.Traditional asset
## ##Introduction
Asset allocation is a fundamental pillar of investment strategy,playing a critical role in maximizing returns while managing risk.As an essential process across global financial oo
## ##CoreModels↓
## ###Mean-Variance(MV) Model↓
TheMean-Variance model,introduced byHarry Markowitz in1952,stands as a cornerstone
## ###Black-Litterman (BL) Model↓
The Black-Litterman model emerged as an innovative leap forward from traditional MV analysis by addressing its sensitivity to input estimates.Developed by Fischer Black and Robert Litterman in the early 1990s,the BL model introduces the notion of blending
## ### Deep Learning (DL) Models↓
::
Deep Learning represents a paradigm shift in financial model analysis.Unlike traditional models that operate under simplified assumptions,DL models employ neural networks——·o·
## ##Comparative Analysis↓
###1.RiskMeasurement↓
The primary purpose of risk measurement in asset allocation is to ensure that the investor achieves the desired balance between risk and reward.Let's delve deeperinto how each
## ###2.Return Prediction
Return prediction models are vital in assessing expected asset performance and optimizing allocationstrategies accordingly.
## ###3.Asset Allocation↓
Asset allocation involves distributing investments among various assets to meet specific risk-reward profiles.Each model brings unique contributions and limitations.
## ###Conclusion
The comparative strengths and flaws elucidated by MV, BL,and DL models underline their potential synergy and integration into a consolidated hybrid framework.Capitalizing on
##Advantages,Limitations&Implications↓ The realm of FinTech hinges profoundly on the optimization of asset allocation processes, wherein computational models play a vital role.Analyzing the distinct advantages,
###Mean-Variance (MV) Model↓
## **Advantages**↓
The MV model stands as a paragon of portfolio theory due to its straightforward
## ###Black-Litterman (BL)Model↓
The BLmodel advances traditional asset allocationby balancing objectivemarket data with subjective market views.This dual-data framework can potentiate a more sophisticated ::
30 citations, and over 4000+word
Figure 6: Output Example
## A.3 SwarmSys Output Example
## A.4 Profile Embedding and Matching
## A.4.1 Profile Format
## A.4.2 Embedding and Matching
Agent embeddings. Each agent A i is represented by a competence embedding and an availability embedding. The competence embedding v ( i ) ah is derived from declared abilities T ( i ) a and historical performance T ( i ) h , guided by task-specific instructions:
<!-- formula-not-decoded -->
The availability embedding v ( i ) s reflects workload and readiness, derived from status T ( i ) s :
<!-- formula-not-decoded -->
The final representation is the sum:
<!-- formula-not-decoded -->
**Advantages***↓
::
::
::
:
::
Figure 7: Profile Format
Event embeddings. Each event E j is encoded as v ( j ) e by integrating its description, dependencies, progress state, and milestone:
<!-- formula-not-decoded -->
Compatibility and decision dynamics. Agentevent compatibility is computed as normalized cosine similarity:
<!-- formula-not-decoded -->
To avoid stagnation, SwarmSys employs a dynamic ε -greedy policy. Each agent explores with probability ε i or exploits with probability 1 -ε i , where
<!-- formula-not-decoded -->
and ¯ S i denotes the agent's recent average success. Exploration samples potential matches proportionally to similarity:
<!-- formula-not-decoded -->
while exploitation emphasizes high-compatibility matches:
<!-- formula-not-decoded -->
where σ is the sigmoid function and γ controls sharpness. Both branches unify as:
<!-- formula-not-decoded -->
This mechanism enables three properties simultaneously: adaptivity through evolving embeddings, stability through probabilistic sampling, and robustness by balancing exploration with exploitation.
## A.5 Cost
Table 8: Average per-question model cost comparison corresponding to baselines and systems used in main experiments.
| Model | Instr.-based Cost ($) | Code-based Cost ($) |
|-------------------------|-------------------------|-----------------------|
| IO (GPT-4o) | 0.005 | - |
| CoT (GPT-4o) | 0.012 | - |
| CoT-SC (GPT-4o, 5-shot) | 0.051 | - |
| Self-Refine (GPT-4o) | 0.068 | - |
| MultiPersona (GPT-4o) | 0.043 | - |
| GPTSwarm | 0.077 | - |
| GPT-5 | 0.014 | - |
| SwarmSys-8 (Ours) | 0.071 | - |
| IO (GPT-4o-Search) | 0.15 | - |
| CoT (GPT-4o) | 0.2 | - |
| Self-Refine | 1.53 | - |
| DeepResearchAgent | 2.82 | - |
| Grok Deeper Search | 2.85 | - |
| SwarmSys-8 (Ours) | 2.63 | - |
| IO (GPT-4o) | 0.08 | 0.013 |
| CoT-SC (GPT-4o) | 0.11 | 0.017 |
| Self-Refine | 0.36 | 0.026 |
| GPTSwarm | 0.41 | 0.024 |
| SwarmSys-14 (Ours) | 0.44 | 0.019 |