# Technical Data Extraction: Performance Comparison Chart (F1 Score vs. Round)
## 1. Image Overview
This image is a combined line and scatter plot showing the F1 score (%) of three optimization frameworks (ADAS, AFlow, and MASS) over 80 rounds. Each framework has a step line that tracks its best-so-far score, while the scatter points show the results of individual iterations at specific rounds.
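
The "best-so-far" step line is a running maximum over the per-round scores. A minimal sketch of that relationship follows; the function name `best_so_far` and the sample scores are hypothetical and are not values read from the chart.

```python
from itertools import accumulate


def best_so_far(scores):
    """Return the running maximum of a sequence of per-round scores."""
    return list(accumulate(scores, max))


# Hypothetical per-round F1 scores (%), for illustration only.
round_scores = [70.0, 68.2, 78.5, 77.1, 76.4, 79.0]
print(best_so_far(round_scores))
# [70.0, 70.0, 78.5, 78.5, 78.5, 79.0]
```

A scatter point that falls below the current maximum leaves the step line flat, which is why the line only ever moves upward.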
## 2. Component Isolation
### A. Header / Legend
* **Location:** Top-left quadrant [x: ~150, y: ~100].
* **Legend Items:**
    * **ADAS:** Blue dotted line with square markers (▪).
    * **AFlow:** Orange dashed line with 'x' markers (x).
    * **MASS (Ours):** Solid light-red/pink line with circular markers (●).
### B. Axis Definitions
* **Y-Axis (Vertical):**
    * **Label:** F1 (%)
    * **Range:** 65 to 95
    * **Major Tick Intervals:** 5 units (65, 70, 75, 80, 85, 90, 95)
* **X-Axis (Horizontal):**
    * **Label:** Round
    * **Range:** 0 to 80
    * **Major Tick Intervals:** 10 units (0, 10, 20, 30, 40, 50, 60, 70, 80)
### C. Annotations (Grey Text with Upward Arrows)
These labels indicate specific architectural or methodological changes occurring at certain rounds for the **MASS (Ours)** series:
1. **"Better Prompt"**: Located at Round ~2.
2. **"1PO: Aggregate"**: Located at Round ~20.
3. **"1PO: Debate"**: Located at Round ~38.
4. **"2TO: More parallel agents involved"**: Located at Round ~49.
5. **"3PO: Workflow-level PO"**: Located at Round ~75.
---
## 3. Data Series Analysis & Trends
### Series 1: MASS (Ours)
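* **Visual Trend:** A solid red step-line that leads the other two methods for nearly the whole run. Going by the approximate values below, only AFlow's jump at Round 15 briefly puts it ahead, until MASS jumps again at Round 20. The line shows rapid initial gains, smaller increments through the middle rounds, and a plateau between rounds 49 and 75 before a final jump.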
* **Key Data Points (Step-line):**
    * **Round 0:** Starts at ~70%.
    * **Round 2:** Jumps to ~78.5% ("Better Prompt").
    * **Round 20:** Jumps to ~84.5% ("1PO: Aggregate").
    * **Round 38:** Jumps to ~85.8% ("1PO: Debate").
    * **Round 49:** Jumps to ~87.2% ("2TO: More parallel agents").
    * **Round 75:** Jumps to ~89.8% ("3PO: Workflow-level PO").
* **Scatter Data:** Most individual runs (circles) stay within the 82%–85% range after round 30, with the step-line tracking the maximum achieved value.
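
The size of each annotated jump follows from the breakpoints listed above. The sketch below computes those gains; the name `stage_gains` and the "Start" label are hypothetical, and the numbers are the approximate chart readings from this section, not exact results.

```python
# Approximate MASS breakpoints (round, best F1 %, annotation) as read from the chart.
MASS_STEPS = [
    (0, 70.0, "Start"),  # "Start" is a placeholder label, not a chart annotation
    (2, 78.5, "Better Prompt"),
    (20, 84.5, "1PO: Aggregate"),
    (38, 85.8, "1PO: Debate"),
    (49, 87.2, "2TO: More parallel agents involved"),
    (75, 89.8, "3PO: Workflow-level PO"),
]


def stage_gains(steps):
    """Pair each annotated stage with its gain over the previous breakpoint."""
    gains = []
    for (_, prev_f1, _), (rnd, f1, label) in zip(steps, steps[1:]):
        gains.append((label, rnd, round(f1 - prev_f1, 1)))
    return gains


for label, rnd, gain in stage_gains(MASS_STEPS):
    print(f"Round {rnd:>2}: {label:<36} +{gain:.1f}")
```

Because every input is an eyeballed reading, the gains are only as precise as the readings themselves (roughly one decimal place).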
### Series 2: AFlow
* **Visual Trend:** An orange dashed step-line. Its first point within the visible axis range appears around Round 10, later than the other two series, and it improves in a few large, infrequent jumps.
* **Key Data Points (Step-line):**
    * **Round 0-10:** No data/Below 65%.
    * **Round 10:** Appears at ~68.5%.
    * **Round 15:** Jumps to ~81.5%.
    * **Round 50:** Jumps to ~85.1%.
* **Scatter Data:** Highly volatile. Includes a high point at round 50 (~85%) but several low-performing outliers near round 40 (~68%) and round 60 (~76%).
### Series 3: ADAS
* **Visual Trend:** A blue dotted step-line. This is the lowest-performing series, showing very gradual improvements and remaining below 72.5% throughout the 80 rounds.
* **Key Data Points (Step-line):**
    * **Round 0:** Starts at ~67%.
    * **Round 3:** Jumps to ~69%.
    * **Round 24:** Jumps to ~70.2%.
    * **Round 51:** Jumps to ~72%.
* **Scatter Data:** Most points (squares) cluster between 66% and 71%.
---
## 4. Summary Table of Best F1 Scores (%)
| Round | ADAS (Blue Dotted) | AFlow (Orange Dashed) | MASS (Ours) (Red Solid) |
| :--- | :--- | :--- | :--- |
| **0** | ~67.0 | N/A | ~70.0 |
| **10** | ~69.0 | ~68.5 | ~78.5 |
| **20** | ~69.0 | ~81.5 | ~84.5 |
| **40** | ~70.2 | ~81.5 | ~85.8 |
| **60** | ~72.0 | ~85.1 | ~87.2 |
| **80** | ~72.0 | ~85.1 | ~89.8 |
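
Each cell in the table should follow from the step-line breakpoints in Section 3: the value at a given round is the most recent breakpoint at or before that round. The sketch below performs that lookup; `STEPS` and `best_at` are hypothetical names, and the numbers are the approximate readings from Section 3.

```python
from bisect import bisect_right

# Approximate step-line breakpoints (round, best F1 %) as listed in Section 3.
STEPS = {
    "ADAS": [(0, 67.0), (3, 69.0), (24, 70.2), (51, 72.0)],
    "AFlow": [(10, 68.5), (15, 81.5), (50, 85.1)],
    "MASS": [(0, 70.0), (2, 78.5), (20, 84.5), (38, 85.8), (49, 87.2), (75, 89.8)],
}


def best_at(steps, rnd):
    """Best-so-far value at `rnd`, or None before the series' first point."""
    rounds = [r for r, _ in steps]
    idx = bisect_right(rounds, rnd) - 1  # last breakpoint at or before rnd
    return steps[idx][1] if idx >= 0 else None


for rnd in (0, 10, 20, 40, 60, 80):
    row = {name: best_at(steps, rnd) for name, steps in STEPS.items()}
    print(rnd, row)
```

A `None` result corresponds to an N/A cell. Any difference between a printed value and a table cell would point to a reading error in one of the two places.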
## 5. Conclusion
The **MASS (Ours)** framework achieves the best final result: approximately **89.8%** F1, ahead of AFlow (~85.1%) by about 4.7 points and ADAS (~72.0%) by about 17.8 points. AFlow crosses the 80% F1 threshold slightly earlier (Round 15 vs. Round 20 for MASS), but MASS starts from a higher baseline and stays ahead from Round 20 onward. The largest single gains for MASS come from "Better Prompt" (about +8.5 points) and "1PO: Aggregate" (about +6.0 points). The later "2TO: More parallel agents involved" and "3PO: Workflow-level PO" stages add about 1.4 and 2.6 points, which accounts for most of the final margin over AFlow (about 4.0 of the 4.7 points).
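
The threshold and margin comparisons can be checked against the Section 3 breakpoints. This is a minimal sketch under the assumption that those approximate readings are accurate; `first_round_reaching` and `final_best` are hypothetical helper names.

```python
# Approximate step-line breakpoints (round, best F1 %) as listed in Section 3.
STEPS = {
    "ADAS": [(0, 67.0), (3, 69.0), (24, 70.2), (51, 72.0)],
    "AFlow": [(10, 68.5), (15, 81.5), (50, 85.1)],
    "MASS": [(0, 70.0), (2, 78.5), (20, 84.5), (38, 85.8), (49, 87.2), (75, 89.8)],
}


def first_round_reaching(steps, threshold):
    """First breakpoint round whose best-so-far F1 meets the threshold, else None."""
    for rnd, f1 in steps:  # breakpoints are in ascending round order
        if f1 >= threshold:
            return rnd
    return None


def final_best(steps):
    """Best-so-far F1 at the end of the run (the last breakpoint)."""
    return steps[-1][1]


for name, steps in STEPS.items():
    print(name, "reaches 80% at round:", first_round_reaching(steps, 80.0))

mass_final = final_best(STEPS["MASS"])
for name in ("AFlow", "ADAS"):
    margin = round(mass_final - final_best(STEPS[name]), 1)
    print(f"MASS final margin over {name}: {margin} points")
```

With these readings, AFlow reaches 80% at Round 15 and MASS at Round 20, while ADAS never does within the 80 rounds.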