## Radar Chart & Bar Charts: AgentFlow Performance Comparison
### Overview
The image compares the AgentFlow model's performance against an ablated baseline (AgentFlow w/o Flow-GRPO) across several benchmarks. The comparison is visualized with a radar chart for an overview and a series of bar charts for detailed results on individual benchmarks. The benchmarks span search-style question answering (2Wiki, HotpotQA, Musique, Bamboogle), agentic tasks (GAIA), math (Math, GameOf24, AMC23, AIME24), and science (GPQA, MedQA).
### Components/Axes
* **Radar Chart:**
* **Axes:** MedQA, Science, GPQA, GameOf24, Math, AMC23, AIME24, GAIA, 2Wiki, HotpotQA, Musique, Bamboogle. These represent the different benchmarks.
* **Scale:** 0 to 80 (approximately).
* **Lines:**
* AgentFlow (w/o Flow-GRPO) - Blue line
* AgentFlow - Red line
* **Legend:** Located in the top-left corner.
* **Bar Charts:**
* **X-axis:** Model names (Qwen-2.5-7B, GPT-4o (~200B), Search-R1 (7B), ReSearch (7B), AutoGen (7B), AgentFlow (7B)).
* **Y-axis:** Accuracy (%) - Scale from 0 to 80 (approximately).
* **Charts:** 2Wiki (Search), HotpotQA (Search), GAIA (Agentic), AIME24 (Math), GameOf24 (Math), GPQA (Science).
* **Legend:** Color-coded bars representing each model.
### Detailed Analysis or Content Details
**Radar Chart Analysis:**
The radar chart displays the performance of AgentFlow with and without Flow-GRPO across 12 benchmarks. The red line represents AgentFlow *with* Flow-GRPO, and the blue line represents AgentFlow *without* Flow-GRPO.
* **MedQA:** AgentFlow: ~80.0%, AgentFlow (w/o Flow-GRPO): ~76.0%
* **Science:** AgentFlow: ~76.0%, AgentFlow (w/o Flow-GRPO): ~69.6%
* **GPQA:** AgentFlow: ~47.0%, AgentFlow (w/o Flow-GRPO): ~37.0%
* **GameOf24:** AgentFlow: ~53.0%, AgentFlow (w/o Flow-GRPO): ~47.4%
* **Math:** AgentFlow: ~61.5%, AgentFlow (w/o Flow-GRPO): ~40.0%
* **AMC23:** AgentFlow: ~61.5%, AgentFlow (w/o Flow-GRPO): ~31.0%
* **AIME24:** AgentFlow: ~17.2%, AgentFlow (w/o Flow-GRPO): ~16.7%
* **GAIA:** AgentFlow: ~58.4%, AgentFlow (w/o Flow-GRPO): ~33.1%
* **2Wiki:** AgentFlow: ~71.2%, AgentFlow (w/o Flow-GRPO): ~60.0%
* **HotpotQA:** AgentFlow: ~57.0%, AgentFlow (w/o Flow-GRPO): ~51.3%
* **Musique:** AgentFlow: ~25.3%, AgentFlow (w/o Flow-GRPO): ~19.2%
* **Bamboogle:** AgentFlow: ~69.6%, AgentFlow (w/o Flow-GRPO): ~60.0%
The chart also annotates headline improvements: +7.0% (GPQA), +19.8% (Math), +15.9% (GAIA), and +10.1% (2Wiki).
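To make the per-axis gaps concrete, the readings above can be transcribed into a small script. This is an illustrative sketch only: the values are approximate chart readings (with AgentFlow taken as the higher series), so the computed gaps will not match the chart's printed +x% annotations exactly.

```python
# Approximate per-axis accuracies (%) transcribed from the radar chart:
# benchmark -> (AgentFlow, AgentFlow w/o Flow-GRPO)
radar = {
    "MedQA": (80.0, 76.0),   "Science": (76.0, 69.6),
    "GPQA": (47.0, 37.0),    "GameOf24": (53.0, 47.4),
    "Math": (61.5, 40.0),    "AMC23": (61.5, 31.0),
    "AIME24": (17.2, 16.7),  "GAIA": (58.4, 33.1),
    "2Wiki": (71.2, 60.0),   "HotpotQA": (57.0, 51.3),
    "Musique": (25.3, 19.2), "Bamboogle": (69.6, 60.0),
}

# Gain of the full model over the ablation, in percentage points.
gains = {bench: round(full - ablated, 1) for bench, (full, ablated) in radar.items()}

# Print benchmarks sorted by gain, largest first.
for bench, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{bench:>10}: {gain:+5.1f} pts")
```

Sorting the gains makes it easy to see where the ablation matters most and where the two lines nearly overlap (e.g., AIME24).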
**Bar Chart Analysis:**
* **2Wiki (Search):** Qwen-2.5-7B: ~49.5%, GPT-4o (~200B): ~72.2%, Search-R1 (7B): ~38.2%, AutoGen (7B): ~44.0%, ReSearch (7B): ~21.0%, AgentFlow (7B): ~23.3%
* **HotpotQA (Search):** Qwen-2.5-7B: ~54.0%, GPT-4o (~200B): ~43.5%, Search-R1 (7B): ~37.0%, AutoGen (7B): ~30.0%, ReSearch (7B): ~3.2%, AgentFlow (7B): ~6.3%
* **GAIA (Agentic):** Qwen-2.5-7B: ~50.0%, GPT-4o (~200B): ~33.1%, Search-R1 (7B): ~17.3%, AutoGen (7B): ~19.1%, ReSearch (7B): ~6.3%, AgentFlow (7B): ~17.3%
* **AIME24 (Math):** Qwen-2.5-7B: ~40.0%, GPT-4o (~200B): ~13.3%, Search-R1 (7B): ~10.0%, AutoGen (7B): ~20.0%, ReSearch (7B): ~6.7%, AgentFlow (7B): ~10.0%
* **GameOf24 (Math):** Qwen-2.5-7B: ~53.0%, GPT-4o (~200B): ~31.0%, Search-R1 (7B): ~33.0%, AutoGen (7B): ~30.0%, ReSearch (7B): ~24.0%, AgentFlow (7B): ~33.0%
* **GPQA (Science):** Qwen-2.5-7B: ~42.0%, GPT-4o (~200B): ~35.0%, Search-R1 (7B): ~34.0%, AutoGen (7B): ~31.0%, ReSearch (7B): ~47.0%, AgentFlow (7B): ~42.0%
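The six bar charts can likewise be tabulated to check which model tops each benchmark. A minimal sketch using the approximate readings above (model names shortened for brevity; `best` is a small helper defined here, not part of any library):

```python
# Approximate accuracies (%) read off the six bar charts.
bars = {
    "2Wiki":    {"Qwen-2.5-7B": 49.5, "GPT-4o": 72.2, "Search-R1": 38.2,
                 "AutoGen": 44.0, "ReSearch": 21.0, "AgentFlow": 23.3},
    "HotpotQA": {"Qwen-2.5-7B": 54.0, "GPT-4o": 43.5, "Search-R1": 37.0,
                 "AutoGen": 30.0, "ReSearch": 3.2, "AgentFlow": 6.3},
    "GAIA":     {"Qwen-2.5-7B": 50.0, "GPT-4o": 33.1, "Search-R1": 17.3,
                 "AutoGen": 19.1, "ReSearch": 6.3, "AgentFlow": 17.3},
    "AIME24":   {"Qwen-2.5-7B": 40.0, "GPT-4o": 13.3, "Search-R1": 10.0,
                 "AutoGen": 20.0, "ReSearch": 6.7, "AgentFlow": 10.0},
    "GameOf24": {"Qwen-2.5-7B": 53.0, "GPT-4o": 31.0, "Search-R1": 33.0,
                 "AutoGen": 30.0, "ReSearch": 24.0, "AgentFlow": 33.0},
    "GPQA":     {"Qwen-2.5-7B": 42.0, "GPT-4o": 35.0, "Search-R1": 34.0,
                 "AutoGen": 31.0, "ReSearch": 47.0, "AgentFlow": 42.0},
}

def best(bench: str) -> str:
    """Return the name of the model with the highest accuracy on a benchmark."""
    scores = bars[bench]
    return max(scores, key=scores.get)

for bench in bars:
    winner = best(bench)
    print(f"{bench:>8}: {winner} ({bars[bench][winner]:.1f}%)")
```

Running this per-benchmark check is what grounds the observations below: the leader varies by benchmark rather than one model dominating throughout.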
### Key Observations
* AgentFlow (with Flow-GRPO) outperforms the ablated baseline on every benchmark in the radar chart, though the margin on AIME24 is small (~0.5 points).
* The largest annotated gains are on Math (+19.8%) and GAIA (+15.9%), followed by 2Wiki (+10.1%) and GPQA (+7.0%).
* In the bar charts, Qwen-2.5-7B posts the highest accuracy on most benchmarks; GPT-4o (~200B) leads only on 2Wiki, and ReSearch (7B) leads on GPQA.
* AgentFlow (7B) trails GPT-4o (~200B) and Qwen-2.5-7B on most bar-chart benchmarks, and is roughly on par with Search-R1 (7B) (tied on GAIA, AIME24, and GameOf24; ahead on GPQA).
* ReSearch (7B) scores very low on HotpotQA (~3.2%), GAIA (~6.3%), and AIME24 (~6.7%).
### Interpretation
The data suggests that the Flow-GRPO component substantially improves the AgentFlow model's performance across a diverse set of tasks. The radar chart gives a holistic view of these gains, while the bar charts offer a more granular comparison against other models. In those comparisons the leader varies by benchmark: Qwen-2.5-7B is strongest on most of them and GPT-4o (~200B) leads only on 2Wiki, so model scale alone does not explain the rankings shown here. AgentFlow (7B)'s relatively low bar-chart scores, set against its clear radar-chart gains over its own ablation, indicate that Flow-GRPO helps but that the 7B system still trails the strongest baselines on several tasks. The sharp drop in ReSearch (7B)'s performance on HotpotQA, GAIA, and AIME24 could point to a specific weakness in that model's architecture or training data for those tasks. Together, the radar and bar charts provide a comprehensive assessment of AgentFlow's capabilities and areas for potential improvement.