## Line Chart: MATH Accuracy Across Tournament Rounds
### Overview
The chart compares MATH accuracy for four configurations across four tournament rounds (0-3): two model sizes (RRM-7B and RRM-32B), each with and without voting. All four lines trend upward, and by round 3 the 32B model and the voting variants reach the higher accuracies.
### Components/Axes
- **X-axis**: Tournament round (0, 1, 2, 3)
- **Y-axis**: MATH accuracy (82-90%)
- **Legend**:
  - Red: RRM-7B with voting
  - Orange: RRM-7B without voting
  - Blue: RRM-32B with voting
  - Purple: RRM-32B without voting
- **Gridlines**: Horizontal at 82, 84, 86, 88, 90
### Detailed Analysis
1. **RRM-32B with voting** (blue):
   - Starts at 82.2 (round 0)
   - Reaches 90.5 (round 3)
   - Steepest average slope (≈2.8 percentage points per round)
2. **RRM-32B without voting** (purple):
   - Starts at 82.5 (round 0)
   - Reaches 89.8 (round 3)
   - Average slope ≈2.4 percentage points per round
3. **RRM-7B with voting** (red):
   - Starts at 82.3 (round 0)
   - Reaches 88.8 (round 3)
   - Average slope ≈2.2 percentage points per round
4. **RRM-7B without voting** (orange):
   - Starts at 82.4 (round 0)
   - Reaches 88.2 (round 3)
   - Average slope ≈1.9 percentage points per round
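
The average slopes can be checked directly from the endpoint values listed above. The following is a minimal sketch, assuming only the round-0 and round-3 readings are available and treating each line as a straight segment between them; the names `endpoints` and `average_slope` are illustrative, not part of the source.

```python
# Endpoint accuracies (round 0, round 3) as listed in the Detailed Analysis.
endpoints = {
    "RRM-32B with voting": (82.2, 90.5),
    "RRM-32B without voting": (82.5, 89.8),
    "RRM-7B with voting": (82.3, 88.8),
    "RRM-7B without voting": (82.4, 88.2),
}

FIRST_ROUND, LAST_ROUND = 0, 3


def average_slope(start, end, first_round=FIRST_ROUND, last_round=LAST_ROUND):
    """Average gain in percentage points per round between two rounds."""
    # Divide by the number of intervals (3), not the number of rounds (4).
    return (end - start) / (last_round - first_round)


slopes = {
    name: round(average_slope(start, end), 1)
    for name, (start, end) in endpoints.items()
}

for name, slope in slopes.items():
    print(f"{name}: {slope} percentage points per round")
```

Rounds 0 through 3 span three intervals, so the total gain is divided by 3. Dividing by 4 (the number of plotted rounds) would understate every slope.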
### Key Observations
- All models show consistent improvement across rounds
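- At round 3, the 32B model outperforms the 7B model by 1.7 percentage points with voting (90.5 vs 88.8) and 1.6 points without voting (89.8 vs 88.2)
- At round 3, voting adds 0.7 percentage points for the 32B model and 0.6 points for the 7B model; at round 0 the voting variants start 0.1-0.3 points lower than their non-voting counterparts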
- RRM-32B with voting achieves 90.5% accuracy (highest value)
- RRM-7B without voting has the lowest round-3 accuracy (88.2%)
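
The size and voting gaps at round 3 follow from simple differences of the final values. A minimal sketch, assuming the round-3 readings from the Detailed Analysis; the dictionary layout and the names `size_gap` and `voting_gap` are illustrative only.

```python
# Round-3 accuracies keyed by (model size, voting mode), from the Detailed Analysis.
round3 = {
    ("32B", "voting"): 90.5,
    ("32B", "no voting"): 89.8,
    ("7B", "voting"): 88.8,
    ("7B", "no voting"): 88.2,
}

# Gap between model sizes, holding the voting mode fixed.
size_gap = {
    mode: round(round3[("32B", mode)] - round3[("7B", mode)], 1)
    for mode in ("voting", "no voting")
}

# Gap between voting and no voting, holding the model size fixed.
voting_gap = {
    size: round(round3[(size, "voting")] - round3[(size, "no voting")], 1)
    for size in ("32B", "7B")
}

print("32B minus 7B:", size_gap)
print("voting minus no voting:", voting_gap)
```

Holding one factor fixed while varying the other keeps the two effects separate. Comparing RRM-32B with voting against RRM-7B without voting would mix both effects into a single 2.3-point difference.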
### Interpretation
The data demonstrates that:
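1. Model size affects performance (the 32B model leads the 7B model by about 1.6-1.7 percentage points at the final round)
2. Voting provides a small but measurable improvement by the final round (0.6-0.7 percentage points)
3. Accuracy rises over the rounds for every configuration; the endpoint values listed here do not by themselves show whether per-round gains grow or shrink in later rounds
4. The combination of large model size and voting yields the highest accuracy (90.5% at round 3)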
The consistent upward trends suggest that both model capacity and ensemble methods (voting) contribute to mathematical reasoning performance in this setup. RRM-32B with voting reaches 90.5% by round 3, the highest value on the chart; whether gains saturate beyond that point cannot be determined from the four rounds shown.