## [Line Charts (Subplots)]: Performance of Four Models Across Mathematics Benchmarks with Increasing Sampled Solutions
### Overview
The image contains four horizontally arranged line charts (subplots), each representing a distinct mathematics benchmark: **AIME**, **MATH**, **Olympiad Bench**, and **College Math**. Each chart plots the performance (y-axis) of four models against the number of sampled solutions (x-axis, values: 2, 4, 8, 16, 32, 64). Models are distinguished by color:
- Green: *rStar-Qwen2.5-Math-7B*
- Red: *rStar-Qwen2.5-Math-1.5B*
- Blue: *rStar-Qwen2-Math-7B*
- Yellow: *rStar-Phi3-mini*
### Components/Axes
- **Legend**: Positioned at the top, listing four models with corresponding colors (green, red, blue, yellow).
- **X-axis (all charts)**: Labeled *“#Sampled solutions”* with tick marks at 2, 4, 8, 16, 32, 64.
- **Y-axis (per chart)**:
- *AIME*: Range ~30–60 (performance metric, e.g., accuracy).
- *MATH*: Range ~85–95.
- *Olympiad Bench*: Range ~60–75.
- *College Math*: Range ~60–70.
### Detailed Analysis (Per Chart)
#### 1. AIME Chart
- **Trend**: All models improve as more solutions are sampled, though with plateaus at intermediate values (e.g., blue holds at ~50 between 8 and 16).
- **Data Points (approximate)**:
- Green (*rStar-Qwen2.5-Math-7B*): ~43 (2), ~50 (4), ~50 (8), ~50 (16), ~57 (32), ~62 (64).
- Red (*rStar-Qwen2.5-Math-1.5B*): ~37 (2), ~43 (4), ~47 (8), ~50 (16), ~53 (32), ~58 (64).
- Blue (*rStar-Qwen2-Math-7B*): ~40 (2), ~47 (4), ~50 (8), ~50 (16), ~57 (32), ~57 (64).
- Yellow (*rStar-Phi3-mini*): ~30 (2), ~37 (4), ~43 (8), ~43 (16), ~53 (32), ~60 (64).
#### 2. MATH Chart
- **Trend**: All models show increasing performance, converging at higher sampled solutions.
- **Data Points (approximate)**:
- Green: ~86 (2), ~90 (4), ~92 (8), ~93 (16), ~94 (32), ~95 (64).
- Red: ~86 (2), ~88 (4), ~91 (8), ~93 (16), ~94 (32), ~95 (64).
- Blue: ~86 (2), ~89 (4), ~91 (8), ~93 (16), ~94 (32), ~95 (64).
- Yellow: ~83 (2), ~87 (4), ~89 (8), ~91 (16), ~93 (32), ~94 (64).
#### 3. Olympiad Bench Chart
- **Trend**: All models show increasing performance, with green, red, blue converging at higher sampled solutions.
- **Data Points (approximate)**:
- Green: ~62 (2), ~66 (4), ~70 (8), ~73 (16), ~74 (32), ~75 (64).
- Red: ~63 (2), ~67 (4), ~70 (8), ~73 (16), ~74 (32), ~75 (64).
- Blue: ~62 (2), ~66 (4), ~70 (8), ~73 (16), ~74 (32), ~75 (64).
- Yellow: ~58 (2), ~62 (4), ~66 (8), ~70 (16), ~72 (32), ~73 (64).
#### 4. College Math Chart
- **Trend**: All models show increasing performance, with yellow (*rStar-Phi3-mini*) slightly outperforming others at 64 sampled solutions.
- **Data Points (approximate)**:
- Green: ~58 (2), ~61 (4), ~63 (8), ~65 (16), ~67 (32), ~69 (64).
- Red: ~59 (2), ~61 (4), ~63 (8), ~65 (16), ~67 (32), ~68 (64).
- Blue: ~59 (2), ~61 (4), ~63 (8), ~65 (16), ~67 (32), ~69 (64).
- Yellow: ~59 (2), ~61 (4), ~63 (8), ~65 (16), ~67 (32), ~70 (64).
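The approximate values listed above can be collected into a short script that re-plots the four subplots. This is a sketch, assuming matplotlib; all numbers are visual estimates from the figure, not published results, and the colors follow the legend described earlier.

```python
# Re-plot the four benchmark subplots from the approximate values
# read off the figure; every number is a visual estimate.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

x = [2, 4, 8, 16, 32, 64]  # #Sampled solutions

data = {
    "AIME": {
        "rStar-Qwen2.5-Math-7B":   [43, 50, 50, 50, 57, 62],
        "rStar-Qwen2.5-Math-1.5B": [37, 43, 47, 50, 53, 58],
        "rStar-Qwen2-Math-7B":     [40, 47, 50, 50, 57, 57],
        "rStar-Phi3-mini":         [30, 37, 43, 43, 53, 60],
    },
    "MATH": {
        "rStar-Qwen2.5-Math-7B":   [86, 90, 92, 93, 94, 95],
        "rStar-Qwen2.5-Math-1.5B": [86, 88, 91, 93, 94, 95],
        "rStar-Qwen2-Math-7B":     [86, 89, 91, 93, 94, 95],
        "rStar-Phi3-mini":         [83, 87, 89, 91, 93, 94],
    },
    "Olympiad Bench": {
        "rStar-Qwen2.5-Math-7B":   [62, 66, 70, 73, 74, 75],
        "rStar-Qwen2.5-Math-1.5B": [63, 67, 70, 73, 74, 75],
        "rStar-Qwen2-Math-7B":     [62, 66, 70, 73, 74, 75],
        "rStar-Phi3-mini":         [58, 62, 66, 70, 72, 73],
    },
    "College Math": {
        "rStar-Qwen2.5-Math-7B":   [58, 61, 63, 65, 67, 69],
        "rStar-Qwen2.5-Math-1.5B": [59, 61, 63, 65, 67, 68],
        "rStar-Qwen2-Math-7B":     [59, 61, 63, 65, 67, 69],
        "rStar-Phi3-mini":         [59, 61, 63, 65, 67, 70],
    },
}

colors = {
    "rStar-Qwen2.5-Math-7B": "green",
    "rStar-Qwen2.5-Math-1.5B": "red",
    "rStar-Qwen2-Math-7B": "blue",
    "rStar-Phi3-mini": "gold",
}

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, (bench, series) in zip(axes, data.items()):
    for model, ys in series.items():
        ax.plot(x, ys, marker="o", color=colors[model], label=model)
    ax.set_title(bench)
    ax.set_xscale("log", base=2)  # ticks 2..64 evenly spaced, as in the figure
    ax.set_xticks(x)
    ax.set_xticklabels([str(v) for v in x])
    ax.set_xlabel("#Sampled solutions")
fig.legend(*axes[0].get_legend_handles_labels(), loc="upper center", ncol=4)
fig.savefig("sampling_curves.png", bbox_inches="tight")
```

Plotting the x-axis on a log-2 scale spaces the doublings evenly, matching the tick layout described for the original charts.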
### Key Observations
- **Performance Trend**: All models improve with more sampled solutions across all benchmarks, with the largest gains occurring at low sample counts (2→8).
- **Model Comparison**:
  - In *AIME*, *rStar-Phi3-mini* (yellow) starts lowest but by 64 overtakes blue and red, approaching green.
- In *MATH*, all models converge to similar high performance at 64.
- In *Olympiad Bench*, green, red, blue converge, while yellow lags slightly.
- In *College Math*, yellow (*rStar-Phi3-mini*) slightly outperforms others at 64.
- **Consistency**: Green (*rStar-Qwen2.5-Math-7B*) and red (*rStar-Qwen2.5-Math-1.5B*) track each other closely, suggesting the 1.5B variant largely matches the 7B variant under this sampling regime.
### Interpretation
The charts demonstrate that increasing the number of sampled solutions from 2 to 64 consistently improves performance across all models and benchmarks, likely because a larger sample pool is more likely to contain a correct, high-quality solution. Convergence at higher sample counts (notably on *MATH* and *Olympiad Bench*) implies diminishing returns or a performance ceiling. The slight edge of *rStar-Phi3-mini* on *College Math* at 64 samples may indicate a benchmark-specific strength, while the *rStar-Qwen2.5-Math* variants perform strongly across all benchmarks. These trends illustrate how model size (7B vs. 1.5B) and base architecture (Qwen2.5-Math vs. Phi3-mini) interact with sampling strategies in mathematical reasoning tasks.
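The diminishing returns are easy to quantify from the approximate data itself. As a minimal sketch, using the estimated *MATH* scores for the green series (*rStar-Qwen2.5-Math-7B*) listed above, the marginal gain per doubling of sampled solutions shrinks steadily:

```python
# Marginal gain per doubling of sampled solutions, using the approximate
# MATH scores for rStar-Qwen2.5-Math-7B (green) estimated from the chart.
math_green = [86, 90, 92, 93, 94, 95]  # scores at 2, 4, 8, 16, 32, 64 samples
gains = [b - a for a, b in zip(math_green, math_green[1:])]
print(gains)  # → [4, 2, 1, 1, 1]: each successive doubling adds less
```

The first doubling (2→4) is worth about as much as the last three doublings combined, which is the convergence pattern visible in the charts.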
(Note: All values are approximate, based on visual estimation of the charts.)