# Group Deliberation Oriented Multi-Agent Conversational Model for Complex Reasoning
**Authors**: Zheyu Shi, Dong Qiu, Shanlong Yu
Zheyu Shi* Brown University, Providence, USA zheyu\_shi@brown.edu
Dong Qiu New England College, Henniker, USA DQiu\_GPS@nec.edu
Shanlong Yu Georgia Institute of Technology, Atlanta, USA joesyu779@outlook.com
Abstract - This study proposes a group deliberation multi-agent dialogue model that addresses the limitations of single language models on complex reasoning tasks. The model constructs a three-level role-division architecture of "generation - verification - integration": an opinion-generating agent produces differentiated reasoning perspectives, an evidence-verifying agent matches external evidence and quantifies factual support, and a consistency-arbitrating agent integrates logically coherent conclusions. A self-game mechanism is incorporated to expand the reasoning paths, and a retrieval enhancement module supplements dynamic knowledge. A composite reward function is designed, and an improved proximal policy optimization algorithm is used for collaborative training. Experiments show that the model improves multi-hop reasoning accuracy by 16.8%, 14.3%, and 19.2% on the HotpotQA, 2WikiMultihopQA, and MeetingBank datasets, respectively, and improves consistency by 21.5%. Its reasoning efficiency surpasses mainstream multi-agent models, achieving a balance between accuracy, stability, and efficiency and providing an efficient technical solution for complex reasoning.
Keywords - Multi-agent dialogue; group deliberation; complex reasoning; role division; self-game mechanism; retrieval enhancement
## I. INTRODUCTION
In real-world complex reasoning tasks (such as multi-hop question answering and group decision-making), multi-agent collaboration is a core requirement for overcoming the depth bottleneck of single-model reasoning. These tasks require integrating multi-dimensional information and verifying multi-source facts. Prior work has shown that multi-agent interaction such as debate can improve factuality and reasoning robustness, implicitly addressing failure modes of single-model reasoning. Furthermore, a single model's factual accuracy relies on pre-trained knowledge, making it difficult to dynamically supplement external information and resulting in insufficient stability and reliability on complex tasks. This study proposes a group deliberation multi-agent dialogue model: constructing a collaborative reasoning closed loop through role-based LLM agents (viewpoint generation, evidence verification, consistency arbitration), introducing a self-game mechanism that generates multi-path reasoning chains to expand perspectives, combining a retrieval enhancement module that dynamically supplements external knowledge to strengthen factual accuracy, and designing a reward model based on factual consistency and logical coherence, with proximal policy optimization used for multi-agent collaborative training. Multi-agent reinforcement learning has been extensively studied across a wide range of collaborative decision-making tasks.
## II. GROUP DELIBERATION MULTI-AGENT DIALOGUE MODEL DESIGN
## A. Role-Based LLM Agent Architecture
This model constructs a three-level collaborative architecture of "generation - verification - integration" (Figure 1), following the emerging paradigm of role-based multi-agent language model systems [1-2]. Through the division of labor and cooperation among LLM agents with differentiated functions, the architecture enables structured collaboration, similar to recent communicative agent frameworks [3]. The architecture starts with task input, and through a closed-loop process of opinion generation, evidence verification, and consistency arbitration, it outputs reasoning results that are diverse, factual, and logical.
Figure 1. Role-based Multi-agent Collaborative Reasoning Architecture.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Flowchart: Collaborative Knowledge Verification and Coherence System
### Overview
The diagram illustrates a cyclical process for generating, verifying, and refining knowledge through multiple specialized agents. The system integrates external knowledge validation with logical coherence mechanisms to produce consistent conclusions.
### Components/Axes
1. **Central Loop Structure**:
- Circular flow connecting five core components
- Arrows indicate sequential processing direction
- Blue circular path represents iterative refinement cycle
2. **Key Components**:
- **Top**: Task Input (rectangular box)
- **Bottom**: Consistency Arbitration Output (rectangular box)
- **Left**: Viewpoint Generation Agent (document icon)
- **Right**: Consistency Arbitration Agent (person icon)
- **Center**: Evidence Verification Agent (magnifying glass icon)
3. **Intermediate Elements**:
- Taskpoint Collaborpoint Generation (diamond shape)
- Fact matching score (S_fact)
- Logical coherence score (S_cohe)
### Detailed Analysis
1. **Component Descriptions**:
- **Viewpoint Generation Agent**:
- Employs diversity constraint mechanism (K)
- Uses self-game mechanism
- Includes retrieval augmentation module
- **Evidence Verification Agent**:
- Matches facts against external knowledge base
- Calculates fact matching score (S_fact)
- **Consistency Arbitration Agent**:
- Integrates verified viewpoints
- Produces logically coherent conclusion
- Generates logical coherence score (S_cohe)
2. **Flow Dynamics**:
- Task Input → Viewpoint Generation Agent
- Viewpoint → Taskpoint Collaborpoint Generation
- Collaborpoint → Evidence Verification Agent
- Verified facts → Consistency Arbitration Output
- Output → Consistency Arbitration Agent
- Final output loops back to Viewpoint Generation Agent
### Key Observations
1. **Iterative Nature**: The circular flow indicates continuous refinement of knowledge through verification and arbitration
2. **Score Integration**: Two quantitative metrics (S_fact and S_cohe) suggest quantitative evaluation of intermediate steps
3. **Human-AI Collaboration**: Person icon suggests human oversight in final arbitration stage
4. **External Knowledge Integration**: Explicit mention of external knowledge base verification
### Interpretation
This system demonstrates a sophisticated knowledge management framework combining:
1. **Diversity Preservation**: Through constraint mechanisms in viewpoint generation
2. **Evidence Validation**: Via external knowledge base cross-referencing
3. **Logical Coherence**: Through arbitration processes that integrate multiple perspectives
4. **Human-AI Synergy**: Final arbitration stage incorporates human judgment
The cyclical nature suggests an AI system designed for continuous learning and improvement, where each iteration enhances both factual accuracy (through S_fact) and logical consistency (through S_cohe). The inclusion of human arbitration implies a hybrid intelligence approach where machine processing is complemented by human oversight for final decision-making.
</details>
In Figure 1, arrows from each agent to the "Task Input" block indicate persistent read-only access, not reverse data flow. Each agent relies on the full task input throughout reasoning: the Viewpoint Generation Agent uses it to guide diverse trajectories, the Evidence Verification Agent aligns retrieved facts with the original context, and the Consistency Arbitration Agent ensures semantic coherence in final outputs. This context-preserving design maintains factual grounding and consistency in multi-agent collaboration.
## 1) Viewpoint Generation Agent
The core function of this agent is to generate differentiated reasoning viewpoints based on the task input, avoiding the limitations of a single perspective. The generation of multiple differentiated viewpoints helps mitigate single-path reasoning bias, which is consistent with findings from self-consistency based reasoning methods [4]. Its generation process introduces a diversity constraint mechanism, mathematically expressed as:
$$V_k = \mathrm{LLM}_V\big(\mathrm{Emb}(Q) \odot \omega_k\big), \quad \omega_k \sim \mathcal{N}(\mu, \Sigma), \quad k = 1, \dots, K \quad (1)$$

Wherein, LLM_V represents a dedicated LLM for opinion generation (such as a fine-tuned Llama 2), and ω_k is the viewpoint weight vector, drawn from a multivariate normal distribution with mean μ and covariance matrix Σ to control the direction of opinion differentiation. This distribution is used for three reasons. First, the multivariate normal distribution offers a continuous, symmetric space around the reasoning center μ, supporting diverse yet coherent viewpoint generation without directional bias. Second, its covariance matrix Σ allows control over inter-factor correlations, enabling structured variation across reasoning dimensions. Third, Gaussian parameters align well with gradient-based learning and self-game updates, ensuring stable exploration and convergence. Thus, the normal distribution acts as an effective inductive bias balancing diversity, control, and stability, rather than assuming a fixed probabilistic form. Here, Emb(Q) is the task input embedding, ⊙ denotes element-wise multiplication, and weight modulation lets each V_k attend to different reasoning aspects. The self-game mechanism explores varied reasoning paths, akin to tree-structured deliberation.
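As an illustrative sketch, the weight sampling ω_k ~ N(μ, Σ) and the modulation Emb(Q) ⊙ ω_k described above can be written as follows. The LLM call itself is abstracted away; the toy dimensions, the identity covariance, and the random stand-in for Emb(Q) are assumptions, not the paper's actual configuration.

```python
import numpy as np

def sample_viewpoint_weights(mu, sigma, k, rng):
    """Draw K viewpoint weight vectors from a multivariate normal N(mu, Sigma)."""
    return rng.multivariate_normal(mu, sigma, size=k)

def modulate_embedding(task_emb, weights):
    """Element-wise modulation Emb(Q) ⊙ ω_k for each of the K viewpoints."""
    return task_emb[None, :] * weights  # shape (K, embed_dim)

rng = np.random.default_rng(0)
dim, K = 8, 3                            # toy sizes (K = 3 matches the paper)
mu, sigma = np.zeros(dim), np.eye(dim)   # assumed toy Gaussian parameters
task_emb = rng.normal(size=dim)          # stands in for Emb(Q)
weights = sample_viewpoint_weights(mu, sigma, K, rng)
views = modulate_embedding(task_emb, weights)  # each row would feed one LLM_V call
```

Each row of `views` emphasizes a different direction of the task embedding, which is what lets the K generated viewpoints diverge.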
## 2) Evidence Verification Agent
This agent is responsible for matching factual evidence with each candidate opinion Vk and verifying its reasonableness. This retrieval-enhanced verification process is inspired by retrieval-augmented generation frameworks that integrate external knowledge to improve factual grounding in language models [5]. Its core function is to calculate the factual matching degree between the opinion and the evidence:
$$S_{fact}(V_k, \mathcal{E}_k) = \frac{1}{|\mathcal{E}_k|} \sum_{e \in \mathcal{E}_k} \frac{\mathrm{Tr}\big(\mathrm{Emb}(V_k)^{\top}\,\mathrm{Emb}(e)\big)}{\|\mathrm{Emb}(V_k)\|_F\,\|\mathrm{Emb}(e)\|_F} \quad (2)$$

In the formula, ℰ_k is the set of evidence related to V_k retrieved from the external knowledge base 𝒦, and |ℰ_k| is the number of evidence items (the first 5 are taken by default); Emb(⋅) is the embedding function based on SentenceBERT, Tr(⋅) is the matrix trace, and ‖⋅‖_F is the Frobenius norm, so the score is essentially an improved cosine similarity. S_fact ∈ [0, 1], and a larger value indicates stronger factual support for the viewpoint [4]. A viewpoint enters the next stage only when S_fact(V_k, ℰ_k) ≥ τ (τ = 0.75 is the preset threshold).
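A minimal sketch of this verification gate, treating the score as an averaged cosine similarity with the τ = 0.75 threshold. The rescaling of raw cosine values into [0, 1] is an assumption made so the toy score matches the stated range; the paper's actual score uses SentenceBERT embeddings and the trace/Frobenius formulation.

```python
import numpy as np

TAU = 0.75  # fact-matching threshold from the paper

def s_fact(view_emb, evidence_embs):
    """Mean cosine similarity between a viewpoint embedding and its evidence
    embeddings, rescaled from [-1, 1] to [0, 1] (rescaling is an assumption)."""
    sims = []
    for e in evidence_embs:
        cos = np.dot(view_emb, e) / (np.linalg.norm(view_emb) * np.linalg.norm(e))
        sims.append((cos + 1.0) / 2.0)
    return float(np.mean(sims))

def passes_verification(view_emb, evidence_embs, tau=TAU):
    """A viewpoint advances only if its factual support reaches the threshold."""
    return s_fact(view_emb, evidence_embs) >= tau
```

A viewpoint whose evidence embeddings point the same way scores near 1 and passes; one contradicted or unsupported by its evidence scores low and is filtered out before arbitration.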
## 3) Consistency Arbitration Agent
Its responsibility is to integrate the verified viewpoints and output a logically coherent unified conclusion. Its logical coherence evaluation formula is:
$$S_{cohe} = \sigma\big(w_{cohe}^{\top}\,\mathrm{LLM}_C(\mathrm{Prompt}_{cohe}, \{V_k\}) + b_{cohe}\big) \quad (3)$$

Where LLM_C is the arbitration-specific LLM, Prompt_cohe is the logical evaluation prompt, and w_cohe and b_cohe are linear transformation parameters; σ(⋅) is the sigmoid function, mapping the evaluation result to the [0, 1] interval, and a higher S_cohe indicates a more coherent conclusion.
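The scoring head itself is just a sigmoid over a linear transform, which can be sketched directly. The feature vector standing in for the arbitration LLM's evaluation output is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s_cohe(llm_features, w_cohe, b_cohe):
    """Logical-coherence score: sigmoid of a linear transform over the
    arbitration LLM's evaluation features (here a plain feature vector)."""
    return float(sigmoid(np.dot(w_cohe, llm_features) + b_cohe))
```

With zero weights and bias the score sits at the neutral midpoint 0.5; training w_cohe and b_cohe pushes coherent conclusions toward 1 and incoherent ones toward 0.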
## B. Self-Game Mechanism
To enrich the diversity of reasoning paths, a self-game mechanism between agents is designed to generate multi-path reasoning chains through viewpoint confrontation:
$$\omega_k \leftarrow \omega_k + \eta\,\nabla_{\omega_k} \sum_{j \neq k} \big(S_{fact}(V_k) - S_{fact}(V_j)\big)^2 \quad (4)$$

In the formula, η is the learning rate (default 0.01), and ∇_{ω_k} is the gradient with respect to the weight vector ω_k. The update adjusts the viewpoint weights by maximizing the difference in factual matching between the current viewpoint and the other viewpoints, thereby promoting the generation of more diverse and effective reasoning paths.
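One plausible reading of this update (a sketch, not the paper's implementation: the squared-gap objective and the externally supplied gradient function are assumptions consistent with the description above) is a gradient-ascent step on a diversity objective:

```python
import numpy as np

ETA = 0.01  # learning rate from the paper

def diversity_objective(s_facts, k):
    """Sum of squared gaps between viewpoint k's fact score and the others'.
    Larger values mean viewpoint k is more distinct from its peers."""
    return sum((s_facts[k] - s_facts[j]) ** 2
               for j in range(len(s_facts)) if j != k)

def self_game_step(omega_k, grad_fn, eta=ETA):
    """One ascent update ω_k ← ω_k + η ∇_{ω_k} J_k, with grad_fn supplying
    the gradient of the diversity objective with respect to ω_k."""
    return omega_k + eta * grad_fn(omega_k)
```

In the full system the gradient would flow through S_fact back to the weight vector; here `grad_fn` is a placeholder for that chain.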
## C. Reward Model
A composite reward function integrating factual consistency and logical coherence is constructed to guide multi-agent collaborative optimization:
$$R = \lambda\,S_{fact} + (1 - \lambda)\,S_{cohe} - \gamma\,KL\big(P(V_k)\,\|\,P_{ref}(V)\big) \quad (5)$$

Where λ = 0.6 is the factual consistency weight and (1 - λ) is the logical coherence weight; γ = 0.1 is the regularization coefficient, and KL(⋅‖⋅) is the KL divergence, used to constrain the difference between the opinion distribution P(V_k) and the reference distribution P_ref(V) and avoid excessive divergence of opinions; a larger R indicates better agent collaboration.
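The composite reward is a direct weighted sum and can be sketched as below. Treating the opinion and reference distributions as discrete probability vectors, and the small epsilon for numerical stability, are assumptions of this sketch.

```python
import numpy as np

LAMBDA, GAMMA = 0.6, 0.1  # weights from the paper

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with epsilon smoothing."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def composite_reward(s_fact, s_cohe, p_view, p_ref, lam=LAMBDA, gamma=GAMMA):
    """R = λ·S_fact + (1-λ)·S_cohe - γ·KL(P(V_k) || P_ref(V))."""
    return lam * s_fact + (1 - lam) * s_cohe - gamma * kl_divergence(p_view, p_ref)
```

When the opinion distribution matches the reference, the KL penalty vanishes and the reward reduces to the weighted blend of the two scores; as opinions drift from the reference, the penalty grows.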
## D. Collaborative Training Strategy
Multi-agent collaborative training is achieved using Improved Proximal Policy Optimization (PPO) to avoid inference collapse and loop generation. The objective function is:
$$L_{PPO} = \mathbb{E}\big[\min\big(r_t(\theta)A_t,\ \mathrm{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)A_t\big)\big] + \beta\,H(P(\theta)) \quad (6)$$

The use of PPO for collaborative optimization is motivated by its demonstrated effectiveness in cooperative multi-agent reinforcement learning settings [6]. In the formula, r_t(θ) = π_θ(a_t∣s_t) / π_θold(a_t∣s_t) is the policy update ratio, where π_θ is the current policy and π_θold is the old policy; A_t is the advantage function, ϵ = 0.2 is the clipping coefficient, H(P(θ)) is the policy entropy, and β = 0.05 is the entropy regularization coefficient. The entropy term encourages policy exploration and avoids inference loops caused by local optima. Figure 2 compares the performance of traditional PPO-RLHF (GPT-3.5) and the proposed multi-agent PPO collaborative training over the same training steps. The x-axis shows PPO steps (in thousands) and the y-axis shows cumulative performance scores. While both start similarly, the proposed model shows more stable and significant gains beyond 50,000 steps, indicating convergence to a superior, stable policy. This highlights the effectiveness of the collaborative mechanism in reducing policy collapse and improving inference stability. Compared to earlier methods such as MADDPG [7-8], PPO-based strategies provide better training stability in cooperative tasks, supporting our optimization design.
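A minimal numeric sketch of the clipped surrogate with entropy bonus, assuming per-action log-probabilities and advantages are already available. The crude per-sample entropy proxy is an assumption of the sketch; a real implementation would compute entropy over the full action distribution.

```python
import numpy as np

EPS, BETA = 0.2, 0.05  # clipping and entropy coefficients from the paper

def ppo_objective(log_probs_new, log_probs_old, advantages, eps=EPS, beta=BETA):
    """Clipped PPO surrogate plus an entropy bonus (an objective to maximize)."""
    ratios = np.exp(log_probs_new - log_probs_old)        # r_t(θ)
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratios * advantages, clipped * advantages)
    probs = np.exp(log_probs_new)
    entropy = -np.mean(probs * log_probs_new)             # crude entropy proxy
    return float(np.mean(surrogate) + beta * entropy)
```

When the new policy equals the old one the ratios are all 1, the clip is inactive, and the surrogate reduces to the mean advantage; the clip only engages once an update moves the ratio outside [1 - ϵ, 1 + ϵ].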
Figure 2: Performance Comparison of PPO Collaborative Training in Multi-Agent Tasks
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Line Graphs: PPO-RLHF vs. Proposed PPO-Collaborative Training Performance
### Overview
The image contains two side-by-side line graphs comparing the training performance of two methods: **PPO-RLHF (GPT-3.5)** and **Proposed PPO-Collaborative Training**. Both graphs plot **Performance Score** (y-axis) against **PPO Training Steps (K)** (x-axis), with data points and trend lines indicating performance progression over training steps.
---
### Components/Axes
1. **Left Graph**:
- **Title**: "PPO-RLHF (GPT-3.5) Training Performance"
- **X-axis**: "PPO Training Steps (K)" (0 to 70K, increments of 10K).
- **Y-axis**: "Performance Score" (0.1 to 0.8, increments of 0.1).
- **Legend**: Blue dots labeled "PPO-RLHF (GPT-3.5)" with a dotted trend line.
2. **Right Graph**:
- **Title**: "Proposed PPO-Collaborative Training Performance"
- **X-axis**: "PPO Training Steps (K)" (0 to 70K, increments of 10K).
- **Y-axis**: "Performance Score" (0.1 to 0.8, increments of 0.1).
- **Legend**: Green dots labeled "Proposed PPO-Collaborative Training" with a dotted trend line.
---
### Detailed Analysis
1. **Left Graph (PPO-RLHF)**:
- **Trend**: The blue dotted line slopes upward gradually, starting near 0.1 at 0K steps and reaching approximately **0.75** at 70K steps.
- **Data Points**: Blue dots cluster tightly around the trend line, showing consistent improvement.
2. **Right Graph (Proposed PPO-Collaborative Training)**:
- **Trend**: The green dotted line slopes upward more steeply, starting near 0.2 at 0K steps and reaching approximately **0.78** at 70K steps.
- **Data Points**: Green dots are slightly more dispersed but follow the trend line closely.
---
### Key Observations
1. Both methods show **monotonic improvement** in performance with increasing training steps.
2. The **proposed PPO-Collaborative Training** (green) consistently outperforms **PPO-RLHF (GPT-3.5)** (blue) across all training steps.
3. The **performance gap widens** as training progresses:
- At 10K steps: Proposed method ≈ 0.3 vs. PPO-RLHF ≈ 0.25.
- At 70K steps: Proposed method ≈ 0.78 vs. PPO-RLHF ≈ 0.75.
4. The **steeper slope** of the proposed method suggests faster convergence.
---
### Interpretation
The data demonstrates that the **proposed PPO-Collaborative Training** method achieves higher performance scores than the baseline PPO-RLHF (GPT-3.5) with the same training effort. This suggests that the collaborative approach may:
- Leverage additional data or feedback mechanisms more effectively.
- Reduce inefficiencies in the training process (e.g., better reward modeling or policy updates).
- Be more scalable
</details>
## E. Retrieval Enhancement Module
External knowledge is dynamically supplemented during the evidence verification stage. The retrieval probability model is as follows:
$$P(e \mid V_k) = \frac{\exp\big(\alpha\,\mathrm{Sim}(\mathrm{Emb}(e),\ \mathrm{Emb}(V_k))\big)}{\sum_{e' \in \mathcal{K}} \exp\big(\alpha\,\mathrm{Sim}(\mathrm{Emb}(e'),\ \mathrm{Emb}(V_k))\big)} \quad (7)$$

Where Sim(⋅,⋅) is the cosine similarity and α = 1.5 is the temperature coefficient used to adjust the confidence of the retrieval; the matching probability of each knowledge item e with opinion V_k is obtained by Softmax normalization, and the top M = 5 items with the highest probabilities are selected as evidence to strengthen the factual support of the opinion. This modular integration strategy is conceptually related to neuro-symbolic systems that combine language models with external tools and knowledge sources [9-10].
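The temperature-scaled softmax over cosine similarities and the top-M selection can be sketched as below; the toy knowledge-base embeddings are assumptions, and a production system would use SentenceBERT vectors and an approximate nearest-neighbor index instead of a dense scan.

```python
import numpy as np

ALPHA, M = 1.5, 5  # temperature and top-M from the paper

def retrieve_evidence(view_emb, kb_embs, alpha=ALPHA, m=M):
    """Softmax over temperature-scaled cosine similarities between a viewpoint
    and every knowledge-base item; return indices and probabilities of the
    top-m matches (sorted by descending probability)."""
    kb = np.asarray(kb_embs, dtype=float)
    sims = kb @ view_emb / (np.linalg.norm(kb, axis=1) * np.linalg.norm(view_emb))
    logits = alpha * sims
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:m]
    return top, probs[top]
```

Raising α sharpens the distribution toward the best-matching items, which is what the text means by adjusting retrieval confidence.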
## III. EXPERIMENTAL DESIGN AND RESULT ANALYSIS
## A. Experimental Datasets
The experiment uses three complex reasoning datasets to ensure result generality: ① HotpotQA: a multi-hop QA benchmark with 113K training and 74K validation samples, requiring reasoning across 2-5 documents; about 60% are "bridging" questions, stressing cross-document logic. ② 2WikiMultihopQA: based on Wikipedia, with 25K training samples and 3.2 average reasoning steps per question, focusing on entity-based long-chain reasoning. ③ MeetingBank: a group dialogue dataset with 5K+ real meeting samples, requiring integration of multi-round discussions to derive consistent conclusions.
The three datasets correspond to "document-level multihop," "entity-level multi-hop," and "dialogue-level integration" scenarios, respectively, comprehensively validating the model's performance across various complex inference tasks.
## B. Baseline Model and Evaluation Metrics
The baseline models are mainstream complex reasoning methods, chosen to ensure a fair comparison: ① single-LLM models: GPT-3.5 and Llama 2-7B (fine-tuned); ② multi-agent models: AutoGPT and MetaGPT; ③ retrieval-augmented models: RAG (Retrieval-Augmented Generation) and REALM. GPT-3.5 is selected as a baseline due to its strong instruction-following ability, trained using reinforcement learning from human feedback [8].
Evaluation metrics focus on core performance and stability: ① Multi-hop reasoning accuracy (Acc): The precise match between the reasoning conclusion and the standard answer, measuring the correctness of the reasoning; ② Consistency index (Cons): The logical consistency rate of the conclusion in multiple rounds of reasoning, calculated by the overlap of the results of 5 consecutive reasoning iterations, measuring stability; ③ Reasoning efficiency (Time): The average time (in seconds) for reasoning per sample, measuring practicality.
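One plausible reading of the consistency index (the overlap of conclusions across the 5 repeated runs) is a majority-agreement rate; this interpretation, sketched below, is an assumption, as the paper does not spell out the exact overlap computation.

```python
from collections import Counter

def consistency_index(run_conclusions):
    """Fraction of repeated reasoning runs whose conclusion matches the
    majority conclusion (one plausible reading of the overlap metric)."""
    counts = Counter(run_conclusions)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(run_conclusions)
```

Five identical conclusions give a Cons of 1.0; one dissenting run out of five gives 0.8, matching the percentage scale used in Table 1.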
## C. Experimental Procedure and Parameter Settings
The experiment includes three steps: ① Data preprocessing: standardize formats and extract questions, context, and answers. ② Baseline prompting: the GPT-3.5 baseline uses sequential prompting without state sharing; the first call takes the raw question and produces an initial reasoning chain, the next call reuses that output as context for extended reasoning, and a final call merges the prior chains into an answer. No memory compression or pruning is used, causing performance drops in long chains due to token overflow and diluted context, underscoring the need for collaborative memory and role division. ③ Performance testing: run each model 10 times on the test set and average the results.
Key parameter configurations are as follows: Number of opinion generation agents K = 3 , Number of evidence retrievals M = 5 , Fact matching threshold τ = 0 . 75 , Reward function weight λ = 0 . 6 , PPO pruning coefficient ϵ = 0 . 2 , Entropy regularization coefficient β = 0 . 05 .
## D. Experimental Results and Analysis
## 1) Performance of the Main Experiment
Table 1 shows the comprehensive performance comparison of each model on the three datasets; the proposed model significantly outperforms the baselines on all core metrics, improving accuracy by 16.8%, 14.3%, and 19.2% on HotpotQA, 2WikiMultihopQA, and MeetingBank, respectively, and improving consistency by 21.5%. Although its inference time is slightly higher than that of a single LLM, it is significantly lower than that of the other multi-agent models, achieving a balance between accuracy, stability, and efficiency.
TABLE I. COMPARISON OF OVERALL PERFORMANCE OF VARIOUS MODELS ON THE THREE DATASETS
| Model | HotpotQA(Acc/%) | 2WikiMultihopQA(Acc/%) | MeetingBank(Acc/%) | Cons/% | Time/s |
|-----------------------------------|-------------------|--------------------------|----------------------|----------|----------|
| GPT-3.5 | 62.3 | 58.7 | 51.2 | 65.8 | 2.1 |
| Llama 2-7B | 59.1 | 55.3 | 48.6 | 63.2 | 1.8 |
| AutoGPT | 68.5 | 63.2 | 57.9 | 70.2 | 4.3 |
| MetaGPT | 70.2 | 65.1 | 59.4 | 72.5 | 4.7 |
| RAG | 71.1 | 66.5 | 60.3 | 72.4 | 3.5 |
| REALM | 72.4 | 67.8 | 61.7 | 73.8 | 3.8 |
| This article's model | 79.1 | 73.0 | 70.4 | 87.3 | 3.9 |
| Relative improvement (vs GPT-3.5) | 16.8 | 14.3 | 19.2 | 21.5 | 1.8 |
Figure 3 compares accuracy (Acc) between the proposed model and GPT-3.5 across reasoning steps on 2WikiMultihopQA. The x-axis shows the number of steps (1-5) and the y-axis shows Acc (%). The box plot reveals that GPT-3.5 accuracy drops sharply to 42.1% at 5 steps, while the proposed model maintains 65.3%, demonstrating strong resistance to long-chain degradation via the multi-agent architecture.
Figure 3. Comparison of Model Accuracy under Different Inference Steps.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Box Plot: Accuracy Comparison of GPT-3.5 and Our Model Across Reasoning Steps
### Overview
The image is a comparative box plot visualizing the accuracy distribution of two models (GPT-3.5 and "Our Model") across varying numbers of reasoning steps (1–5 steps). Accuracy is measured in percentage, with box plots showing median, quartiles, and outliers.
### Components/Axes
- **X-Axis**: "Number of Reasoning Steps" (categories: 1 Step, 2 Steps, 3 Steps, 4 Steps, 5 Steps).
- **Y-Axis**: "Accuracy (%)" (range: 40%–80%).
- **Legend**:
- Blue square: GPT-3.5
- Red square: Our Model
- **Box Plot Elements**:
- Median (horizontal line inside the box).
- Interquartile range (box boundaries).
- Whiskers (extending to min/max excluding outliers).
- Outliers (individual dots beyond whiskers).
### Detailed Analysis
1. **1 Step**:
- GPT-3.5: Median ~79% (blue box), range ~60%–79%.
- Our Model: Not present (no red box).
2. **2 Steps**:
- GPT-3.5: Median ~74.5% (blue box), range ~60%–74.5%.
- Our Model: Median ~70.3% (red box), range ~55%–70.3%.
3. **3 Steps**:
- GPT-3.5: Median ~70.3% (blue box), range ~55%–70.3%.
- Our Model: Median ~67.1% (red box), range ~50%–67.1%.
4. **4 Steps**:
- GPT-3.5: Median ~65.3% (blue box), range ~40%–65.3%.
- Our Model: Median ~65.3% (red box), range ~50%–65.3%.
5. **5 Steps**:
- GPT-3.5: Median ~42.1% (blue box), range ~30%–42.1%.
- Our Model: Median ~65.3% (red box), range ~50%–65.3%.
### Key Observations
- **GPT-3.5**:
- Accuracy declines sharply with increasing steps (79% → 42.1%).
- Outliers at 5 Steps suggest extreme underperformance in some cases.
- **Our Model**:
- Maintains relatively stable accuracy (74.5% → 65.3%) across steps.
- Outliers at 5 Steps are lower than the median but less extreme than GPT-3.5’s drop.
### Interpretation
The data demonstrates that **Our Model** exhibits greater robustness in multi-step reasoning tasks compared to GPT-3.5. While GPT-3.5’s accuracy deteriorates significantly with complexity (e.g., 79% at 1 step vs. 42.1% at 5 steps), Our Model’s performance remains consistent, suggesting better architectural or algorithmic design for handling sequential reasoning. The outliers for Our Model at 5 steps indicate occasional failures but do not negate the overall trend of stability. This implies potential advantages in applications requiring complex, multi-stage problem-solving.
</details>
## 2) Validation of Model Structure Effectiveness
To assess module importance, ablation tests were conducted:
① no self-game; ② no retrieval enhancement; ③ no reward model; ④ single-agent (opinion generation only). Figure 4 shows changes in Cons (left y-axis, %) and Acc (right y-axis, %) across these settings. The x-axis shows the models: full / -self-game / -retrieval / -reward / single-agent. The single-agent setup yields Cons 64.7% and Acc 61.3%. The full model improves Cons by 22.6% and Acc by 17.8%. Without retrieval, Cons drops 12.2%; without self-game, Acc drops 8.9%, confirming each module's necessity.
Figure 4. Comparison of Consistency and Accuracy.
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Diagram: Model Configuration Comparison
### Overview
The diagram illustrates a comparative analysis of different model configurations derived from a "Full Model" framework. It evaluates the impact of removing specific components (e.g., Multi-Path Reasoning, External Retrieval Input, Cooperative Optimization) on two performance metrics: **Consistency (Cons)** and **Accuracy (Acc)**. The configurations are visualized as interconnected components with labeled metrics.
---
### Components/Axes
- **Legend**:
- Blue circles represent **Consistency (Cons)**.
- Red circles represent **Accuracy (Acc)**.
- **Main Configurations**:
1. **Full Model**: Includes all components (Self-Play, Retrieval, Reward Model, View Generation Agent).
2. **Self-Play**: Removes Multi-Path Reasoning.
3. **Reward Model**: Removes External Retrieval Input.
4. **Single Agent**: Removes Cooperative Optimization.
- **Key Components**:
- Self-Play
- Retrieval Augmentation
- Reward Model
- View Generation Agent
- Single-Path Reasoning
- Verification
---
### Detailed Analysis
#### Full Model
- **Consistency (Cons)**: 87.3% (blue dot).
- **Accuracy (Acc)**: 79.1% (red dot).
#### Self-Play (Remove Multi-Path Reasoning)
- **Consistency (Cons)**: 75.1% (blue dot).
- **Accuracy (Acc)**: 70.2% (red dot).
- **Trend**: Both metrics decline compared to the Full Model, with a sharper drop in accuracy (-8.9%).
#### Reward Model (Remove External Retrieval Input)
- **Consistency (Cons)**: 84.7% (blue dot).
- **Accuracy (Acc)**: 61.3% (red dot).
- **Trend**: Consistency remains high, but accuracy drops significantly (-17.8%).
#### Single Agent (Remove Cooperative Optimization)
- **Consistency (Cons)**: 84.7% (blue dot).
- **Accuracy (Acc)**: 64.7% (red dot).
- **Trend**: Consistency matches the Reward Model, but accuracy is slightly higher than the Reward Model (-14.4%).
---
### Key Observations
1. **Multi-Path Reasoning Impact**: Removing it (Self-Play) causes the largest accuracy drop (-8.9%), suggesting it is critical for precise predictions.
2. **External Retrieval Importance**: Removing it (Reward Model) preserves consistency but severely harms accuracy (-17.8%), indicating its role in data quality.
3. **Cooperative Optimization Trade-off**: Removing it (Single Agent) balances consistency (84.7%) and accuracy (64.7%), though both metrics lag behind the Full Model.
4. **Full Model Dominance**: Achieves the highest accuracy (79.1%) but has lower consistency (87.3%) compared to some simplified configurations.
---
### Interpretation
The diagram highlights trade-offs between model complexity and performance:
- **Accuracy vs. Consistency**: The Full Model prioritizes accuracy but sacrifices some consistency. Simplified models (e.g., Reward Model) retain consistency at the cost of accuracy.
- **Component Criticality**: Multi-Path Reasoning and External Retrieval are pivotal for accuracy and consistency, respectively. Their removal disproportionately impacts performance.
- **Practical Implications**: The Single Agent configuration offers a middle ground, potentially useful in resource-constrained scenarios where both metrics need balancing.
This analysis underscores the importance of component-specific contributions in model design, guiding decisions on which elements to retain or optimize based on application needs.
</details>
Further analysis shows that the retrieval enhancement module reduces factual errors by supplementing external facts, improving consistency. The self-game mechanism promotes multi-path reasoning, reducing logical blind spots and enhancing accuracy. The reward model, via multi-objective collaborative optimization, avoids goal conflicts and maintains performance balance. In contrast, single-agent models lack division of labor and coordination, fail to meet multi-dimensional reasoning needs, and struggle with factual verification, resulting in performance drops. This comparison highlights the superiority of the multi-agent role-based architecture, which addresses fact verification, logical integrity, and goal consistency through modular collaboration. Single agents, limited in cognitive scope, cannot balance factual accuracy and logical rigor or verify conclusions from multiple perspectives, explaining their core limitations. The role-based multi-agent design offers a robust solution for complex reasoning, aligning with recent findings on reflective agents emphasizing iterative self-correction [11-12].
## IV. CONCLUSION
The proposed group deliberation multi-agent dialogue model, through a three-level role-based architecture, self-game mechanism, retrieval enhancement module, and collaborative training strategy, effectively solves the problems of logical collapse, cyclic generation, and insufficient factuality in complex reasoning of single models. Experimental verification shows that the model's reasoning accuracy and consistency indicators on three typical datasets are significantly better than the baseline, and its reasoning efficiency is balanced, providing reliable support for scenarios such as multi-hop question answering and group decision-making. In addition to the promising performance, several limitations should be noted. First, the reported improvements are obtained under controlled experimental settings with fixed model configurations and dataset splits, and may not fully generalize to other task distributions or prompt formulations. Moreover, due to computational constraints, key hyperparameters were selected based on preliminary validation rather than exhaustive search. Furthermore, while repeated experiments reduce variance, formal statistical significance testing and detailed qualitative error analysis were not conducted. Future work will focus on automated hyperparameter optimization, rigorous significance evaluation, and systematic analysis of failure cases, particularly in scenarios involving ambiguous or conflicting evidence.
## REFERENCES
- [1] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. Improving Factuality and Reasoning in Language Models through Multi-Agent Debate. International Conference on Learning Representations (ICLR), 2024.
- [2] Song, T., Tan, Y., Zhu, Z., et al. Multi-Agents Are Social Groups: Investigating Social Influence of Multiple Agents in Human-Agent Interactions. Proceedings of the ACM on Human-Computer Interaction, 9(7): 1-33, 2025.
- [3] Li, G., Hammoud, H., Itani, H., Khizbullin, D., & Ghanem, B. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Models. Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [4] Wang, X., Wei, J., Schuurmans, D., et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. International Conference on Learning Representations (ICLR), 2023.
- [5] Lewis, P., Perez, E., Piktus, A., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [6] Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A., & Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [7] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., & Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [8] Ouyang, L., Wu, J., Jiang, X., et al. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [9] Zhang, Z., Liu, Z., Zhou, M., et al. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. Artificial Intelligence, vol. 285, 2020.
- [10] Sun, Y., & Liu, X. Research and Application of a Multi-Agent-Based Intelligent Mine Gas State Decision-Making System. Applied Sciences, 15(2): 968, 2025.
- [11] Shinn, N., et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS Workshop, 2023.
- [12] Yao, S., et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems (NeurIPS), 2023.