## Chart Type: Grid of Line Charts showing Model Accuracy vs. Number of Solutions
### Overview
This image displays an 8-panel grid of line charts, each illustrating the "Accuracy (%)" on the Y-axis against the "Number of Solutions (N)" on the X-axis. The charts compare the performance of various language models and solution aggregation strategies across four different mathematical reasoning benchmarks: MATH, AMC23, AIME24, and Minerva Math. The top row of four charts evaluates models based on the "Qwen" family, while the bottom row evaluates models based on the "Gemma" family. A common legend at the top defines ten distinct data series, each representing a different model or strategy.
### Components/Axes
**Common Legend (positioned at the top, horizontally across the entire image):**
* **Pass@K**: Gray dotted line with star markers.
* **GenPRM-7B (Pass@1)**: Dark blue solid line with circle markers.
* **GenPRM-7B (Maj@8)**: Green solid line with triangle markers.
* **Qwen2.5-Math-7B**: Purple solid line with square markers.
* **Skywork-7B**: Light pink/magenta solid line with diamond markers.
* **Maj.** (Majority Voting): Light gray solid line with square markers.
* **GenPRM-7B (Maj@4)**: Orange solid line with square markers.
* **Direct GenPRM-7B**: Red solid line with circle markers.
* **Qwen2.5-Math-7B-PRM800K**: Dark brown solid line with circle markers.
* **Skywork-1.5B**: Yellow-green solid line with square markers.
**Common Axes:**
* **X-axis Title (bottom of each chart):** "Number of Solutions (N)"
* X-axis labels are powers of 2: 2^0 (1), 2^1 (2), 2^2 (4), 2^3 (8), 2^4 (16), 2^5 (32), 2^6 (64), 2^7 (128), 2^8 (256). The specific range varies per chart.
* **Y-axis Title (left of each chart):** "Accuracy (%)"
**Sub-Chart Titles and Specific Axis Ranges:**
**Top Row (Qwen Models):**
* **(a) MATH (Qwen)**
* X-axis range: 2^0 to 2^5.
* Y-axis range: 82% to 94%, with major ticks at 82, 84, 86, 88, 90, 92, 94.
* **(b) AMC23 (Qwen)**
* X-axis range: 2^0 to 2^8.
* Y-axis range: 70% to 95%, with major ticks at 70, 75, 80, 85, 90, 95.
* **(c) AIME24 (Qwen)**
* X-axis range: 2^0 to 2^8.
* Y-axis range: 0% to 35%, with major ticks at 5, 10, 15, 20, 25, 30, 35.
* **(d) Minerva Math (Qwen)**
* X-axis range: 2^0 to 2^5.
* Y-axis range: 32.5% to 52.5%, with major ticks at 32.5, 35.0, 37.5, 40.0, 42.5, 45.0, 47.5, 50.0, 52.5.
**Bottom Row (Gemma Models):**
* **(e) MATH (Gemma)**
* X-axis range: 2^0 to 2^5.
* Y-axis range: 82% to 94%, with major ticks at 82, 84, 86, 88, 90, 92, 94.
* **(f) AMC23 (Gemma)**
* X-axis range: 2^0 to 2^8.
* Y-axis range: 60% to 95%, with major ticks at 65, 70, 75, 80, 85, 90, 95.
* **(g) AIME24 (Gemma)**
* X-axis range: 2^0 to 2^8.
* Y-axis range: 15% to 40%, with major ticks at 15, 20, 25, 30, 35, 40.
* **(h) Minerva Math (Gemma)**
* X-axis range: 2^0 to 2^5.
* Y-axis range: 30% to 55%, with major ticks at 30, 35, 40, 45, 50, 55.
### Detailed Analysis
**Chart (a) MATH (Qwen)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~82.5% at 2^0 to ~95% at 2^5.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~82% at 2^0 to ~89% at 2^5.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~82% at 2^0 to ~89.5% at 2^5.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~82% at 2^0 to ~87% at 2^5.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~82% at 2^0 to ~87.5% at 2^5.
* **Maj. (Light gray solid square)**: Increases from ~82% at 2^0 to ~86.5% at 2^5.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~82% at 2^0 to ~88.5% at 2^5.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~82% at 2^0 to ~86% at 2^5.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~82% at 2^0 to ~87% at 2^5.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~82% at 2^0 to ~86% at 2^5.
**Chart (b) AMC23 (Qwen)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~70% at 2^0 to ~96% at 2^8.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~70% at 2^0, peaks at ~80% at 2^3, then fluctuates, ending at ~79% at 2^8.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~70% at 2^0, peaks at ~83% at 2^3, then fluctuates, ending at ~82% at 2^8.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~70% at 2^0, peaks at ~79% at 2^3, then fluctuates, ending at ~77% at 2^8.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~70% at 2^0, peaks at ~77% at 2^3, then fluctuates, ending at ~74% at 2^8.
* **Maj. (Light gray solid square)**: Increases from ~70% at 2^0, peaks at ~77% at 2^3, then fluctuates, ending at ~73% at 2^8.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~70% at 2^0, peaks at ~83% at 2^3, then fluctuates, ending at ~81% at 2^8.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~70% at 2^0, peaks at ~75% at 2^3, then fluctuates, ending at ~71% at 2^8.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~70% at 2^0, peaks at ~78% at 2^3, then fluctuates, ending at ~75% at 2^8.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~70% at 2^0, peaks at ~76% at 2^3, then fluctuates, ending at ~72% at 2^8.
**Chart (c) AIME24 (Qwen)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~2.5% at 2^0 to ~36% at 2^8.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~2.5% at 2^0 to ~12% at 2^3, then fluctuates, ending higher at ~15% at 2^8.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~2.5% at 2^0 to ~13% at 2^3, then fluctuates, ending higher at ~17% at 2^8.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~2.5% at 2^0 to ~11% at 2^3, then fluctuates, ending higher at ~14% at 2^8.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~2.5% at 2^0 to ~10% at 2^3, then fluctuates, ending higher at ~12% at 2^8.
* **Maj. (Light gray solid square)**: Increases from ~2.5% at 2^0 to ~10% at 2^3, then fluctuates, ending higher at ~11% at 2^8.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~2.5% at 2^0 to ~13% at 2^3, then fluctuates, ending higher at ~16% at 2^8.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~2.5% at 2^0 to ~8% at 2^3, then fluctuates, ending higher at ~9% at 2^8.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~2.5% at 2^0 to ~11% at 2^3, then fluctuates, ending higher at ~13% at 2^8.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~2.5% at 2^0 to ~9% at 2^3, then fluctuates, ending higher at ~10% at 2^8.
**Chart (d) Minerva Math (Qwen)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~33% at 2^0 to ~53% at 2^5.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~33% at 2^0 to ~38.5% at 2^5.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~33% at 2^0 to ~39.5% at 2^5.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~33% at 2^0 to ~37% at 2^5.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~33% at 2^0 to ~36.5% at 2^5.
* **Maj. (Light gray solid square)**: Increases from ~33% at 2^0 to ~35.5% at 2^5.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~33% at 2^0 to ~38% at 2^5.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~33% at 2^0 to ~35% at 2^5.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~33% at 2^0 to ~36.5% at 2^5.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~33% at 2^0 to ~35.5% at 2^5.
**Chart (e) MATH (Gemma)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~82.5% at 2^0 to ~95% at 2^5.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~82% at 2^0 to ~88% at 2^5.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~82% at 2^0 to ~89% at 2^5.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~82% at 2^0 to ~86% at 2^5.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~82% at 2^0 to ~86.5% at 2^5.
* **Maj. (Light gray solid square)**: Increases from ~82% at 2^0 to ~85% at 2^5.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~82% at 2^0 to ~87.5% at 2^5.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~82% at 2^0 to ~85% at 2^5.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~82% at 2^0 to ~86% at 2^5.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~82% at 2^0 to ~85% at 2^5.
**Chart (f) AMC23 (Gemma)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~64% at 2^0 to ~96% at 2^8.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~64% at 2^0, peaks at ~80% at 2^3, then fluctuates, ending at ~78% at 2^8.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~64% at 2^0, peaks at ~85% at 2^3, then fluctuates, ending at ~83% at 2^8.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~64% at 2^0, peaks at ~78% at 2^3, then fluctuates, ending at ~76% at 2^8.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~64% at 2^0, peaks at ~75% at 2^3, then fluctuates, ending at ~72% at 2^8.
* **Maj. (Light gray solid square)**: Increases from ~64% at 2^0, peaks at ~75% at 2^3, then fluctuates, ending at ~71% at 2^8.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~64% at 2^0, peaks at ~84% at 2^3, then fluctuates, ending at ~82% at 2^8.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~64% at 2^0, peaks at ~70% at 2^3, then fluctuates, ending at ~68% at 2^8.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~64% at 2^0, peaks at ~76% at 2^3, then fluctuates, ending at ~73% at 2^8.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~64% at 2^0, peaks at ~72% at 2^3, then fluctuates, ending at ~69% at 2^8.
**Chart (g) AIME24 (Gemma)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~16% at 2^0 to ~40% at 2^8.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~16% at 2^0, peaks at ~27% at 2^3, then fluctuates, ending at ~26% at 2^8.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~16% at 2^0, peaks at ~28% at 2^3, then fluctuates, ending at ~27% at 2^8.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~16% at 2^0, peaks at ~25% at 2^3, then fluctuates, ending at ~24% at 2^8.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~16% at 2^0, peaks at ~23% at 2^3, then fluctuates, ending at ~21% at 2^8.
* **Maj. (Light gray solid square)**: Increases from ~16% at 2^0, peaks at ~22% at 2^3, then fluctuates, ending at ~20% at 2^8.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~16% at 2^0, peaks at ~27% at 2^3, then fluctuates, ending at ~26% at 2^8.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~16% at 2^0, peaks at ~20% at 2^3, then fluctuates, ending at ~18% at 2^8.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~16% at 2^0, peaks at ~24% at 2^3, then fluctuates, ending at ~23% at 2^8.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~16% at 2^0, peaks at ~21% at 2^3, then fluctuates, ending at ~19% at 2^8.
**Chart (h) Minerva Math (Gemma)**
* **Pass@K (Gray dotted star)**: Steadily increases from ~30% at 2^0 to ~53% at 2^5.
* **GenPRM-7B (Pass@1) (Dark blue solid circle)**: Increases from ~30% at 2^0 to ~35% at 2^5.
* **GenPRM-7B (Maj@8) (Green solid triangle)**: Increases from ~30% at 2^0 to ~36% at 2^5.
* **Qwen2.5-Math-7B (Purple solid square)**: Increases from ~30% at 2^0 to ~34% at 2^5.
* **Skywork-7B (Light pink solid diamond)**: Increases from ~30% at 2^0 to ~33.5% at 2^5.
* **Maj. (Light gray solid square)**: Increases from ~30% at 2^0 to ~32.5% at 2^5.
* **GenPRM-7B (Maj@4) (Orange solid square)**: Increases from ~30% at 2^0 to ~35.5% at 2^5.
* **Direct GenPRM-7B (Red solid circle)**: Increases from ~30% at 2^0 to ~32% at 2^5.
* **Qwen2.5-Math-7B-PRM800K (Dark brown solid circle)**: Increases from ~30% at 2^0 to ~33.5% at 2^5.
* **Skywork-1.5B (Yellow-green solid square)**: Increases from ~30% at 2^0 to ~32.5% at 2^5.
### Key Observations
1. **Dominance of Pass@K**: Across all eight panels (four benchmarks, each with Qwen and Gemma models), "Pass@K" consistently achieves the highest accuracy and shows the largest, steadiest improvement as the "Number of Solutions (N)" increases. Because the X-axis ticks are powers of 2, its near-straight line corresponds to roughly constant gains per doubling of N. It serves as an upper bound for the selection methods.
2. **General Trend of Increasing Accuracy**: For almost all methods and benchmarks, accuracy generally increases as the "Number of Solutions (N)" increases, indicating that providing more solutions for evaluation or aggregation improves performance.
3. **Performance Plateaus/Fluctuations**: While accuracy generally increases, many methods, especially on AMC23 and AIME24 benchmarks, show a plateau or even slight fluctuations/decreases in accuracy after an initial rapid rise (e.g., around 2^3 or 2^4). This suggests diminishing returns or potential instability with very high numbers of solutions for certain aggregation strategies.
4. **GenPRM Variants Outperform Baselines**: "GenPRM-7B (Maj@8)" (green triangle) is the best non-Pass@K series in every panel, matched by "GenPRM-7B (Maj@4)" (orange square) only at the 2^3 point on AMC23 (Qwen). "GenPRM-7B (Maj@4)" and "GenPRM-7B (Pass@1)" (dark blue circle) follow closely and trade places: Maj@4 leads on AMC23 and on AIME24 (Qwen), while Pass@1 is ahead by about 0.5 points on MATH (both families) and Minerva Math (Qwen). All three GenPRM-7B variants outperform "Maj." (light gray square), "Direct GenPRM-7B" (red circle), and the Qwen and Skywork series.
5. **"Direct GenPRM-7B" is a Lower Performer**: The "Direct GenPRM-7B" (red circle) consistently ranks among the lowest-performing methods across all benchmarks and model families, often performing similarly to or worse than simple majority voting ("Maj.").
6. **Benchmark Difficulty**:
* MATH shows the highest overall accuracies (80-95%).
* AMC23 shows moderate accuracies (60-95%).
* Minerva Math shows lower moderate accuracies (30-55%).
* AIME24 is the most challenging benchmark, with accuracies ranging from very low single digits to around 40%.
7. **Model Family Comparison (Qwen vs. Gemma)**:
* For MATH and Minerva Math, the overall accuracy ranges and relative performance of methods are quite similar between Qwen and Gemma models.
* For AMC23, Gemma models start at a slightly lower baseline (~64% vs ~70% for Qwen) but show similar trends and peak performance.
* For AIME24, the pattern reverses: Gemma models start at a much higher baseline (~16% vs ~2.5% for Qwen) and reach substantially higher accuracies for every non-Pass@K method (roughly 18-27% vs 9-17% at 2^8), suggesting the Gemma-based setup is clearly stronger on this hard task.
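8. **Impact of PRM800K**: Based on the values read above, "Qwen2.5-Math-7B-PRM800K" (dark brown circle) performs on par with or slightly below "Qwen2.5-Math-7B" (purple square): the two tie on MATH, and PRM800K trails by up to about 3 points on the other benchmarks. Both remain below the GenPRM-7B (Maj@X) variants.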
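The rankings in these observations can be cross-checked against the approximate values read off the charts. The snippet below is a minimal sketch that uses only the 2^5 endpoints transcribed for chart (a) MATH (Qwen); the variable names are illustrative, and the numbers are visual estimates rather than exact results.

```python
# Approximate accuracies (%) at N = 2^5 for chart (a) MATH (Qwen),
# as transcribed in the Detailed Analysis section (visual estimates).
final_a = {
    "Pass@K": 95.0,
    "GenPRM-7B (Pass@1)": 89.0,
    "GenPRM-7B (Maj@8)": 89.5,
    "Qwen2.5-Math-7B": 87.0,
    "Skywork-7B": 87.5,
    "Maj.": 86.5,
    "GenPRM-7B (Maj@4)": 88.5,
    "Direct GenPRM-7B": 86.0,
    "Qwen2.5-Math-7B-PRM800K": 87.0,
    "Skywork-1.5B": 86.0,
}

# Sort series from highest to lowest accuracy.
ranked = sorted(final_a.items(), key=lambda kv: kv[1], reverse=True)

# Gap between the Pass@K upper bound and the best selection method.
best_method, best_acc = ranked[1]
gap = final_a["Pass@K"] - best_acc

for name, acc in ranked:
    print(f"{acc:5.1f}  {name}")
print(f"Gap between Pass@K and {best_method}: {gap:.1f} points")
```

The same check can be repeated for the other panels by swapping in their transcribed endpoints.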
### Interpretation
The data strongly suggests that increasing the number of solutions (N) generally improves the accuracy of mathematical reasoning models, but the extent and nature of this improvement depend heavily on the aggregation strategy and the difficulty of the benchmark.
The "Pass@K" metric, which likely represents the accuracy if at least one correct solution is found among K attempts, sets an upper bound on performance. Its consistent, strong upward trend indicates that the underlying models are generating correct solutions, even if the aggregation methods struggle to identify them. The gap between "Pass@K" and other methods highlights the challenge of effectively leveraging multiple generated solutions.
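Under the reading given above (a problem counts as solved if any of its first K sampled solutions is correct), Pass@K can be computed as in the following minimal sketch. The function name and the toy correctness flags are hypothetical and are not taken from the figure.

```python
from typing import Sequence


def pass_at_k(correct_flags: Sequence[Sequence[bool]], k: int) -> float:
    """Percentage of problems with at least one correct solution
    among the first k sampled solutions."""
    if k < 1:
        raise ValueError("k must be >= 1")
    solved = sum(any(flags[:k]) for flags in correct_flags)
    return 100.0 * solved / len(correct_flags)


# Toy data (hypothetical): 4 problems, 4 sampled solutions each.
toy = [
    [False, True, False, False],   # solved once k >= 2
    [False, False, False, True],   # solved once k >= 4
    [True, True, False, True],     # solved at k = 1
    [False, False, False, False],  # never solved
]

for k in (1, 2, 4):
    print(k, pass_at_k(toy, k))
# prints:
# 1 25.0
# 2 50.0
# 4 75.0
```

Because adding samples can only turn an unsolved problem into a solved one, this quantity never decreases with K, which matches the monotone Pass@K curves in every panel.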
Among the practical aggregation strategies, "GenPRM-7B (Maj@8)" is the strongest series in every panel, with "GenPRM-7B (Maj@4)" and "GenPRM-7B (Pass@1)" close behind. "GenPRM" most plausibly denotes a generative process reward model (PRM), that is, a verifier that scores the steps of a candidate solution, rather than a Program-of-Thought method. Under that reading, "Pass@1" would denote a single verification pass per solution and "Maj@4"/"Maj@8" majority voting over 4 or 8 sampled verification passes, although the figure itself does not define these labels. All three variants outperform plain "Maj." (majority voting over final answers without a reward model). "Direct GenPRM-7B" performs poorly, suggesting that applying GenPRM directly, without whatever additional step the other variants use, is considerably less effective.
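The difference between plain majority voting and selection guided by a scoring model can be illustrated with a toy example. This is only a sketch: the function names, answers, and scores are made up for illustration and do not reproduce how the methods in the figure were actually implemented.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the answer of the solution with the highest verifier score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Hypothetical final answers of six sampled solutions to one problem,
# and hypothetical verifier scores for each solution.
answers = ["12", "15", "12", "15", "15", "7"]
scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.4]

print(majority_vote(answers))      # prints 15 (three votes)
print(best_of_n(answers, scores))  # prints 12 (highest score, 0.9)
```

If the correct answer were "12", majority voting would fail here while score-based selection would succeed, which is one way a good verifier can beat the "Maj." baseline.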
The performance differences across benchmarks underscore their varying difficulty levels. Tasks like AIME24, with low absolute accuracies, indicate that even with advanced models and aggregation, these problems remain highly challenging. The diminishing returns observed for some methods at higher N values, particularly on AMC23 and AIME24, could be due to several factors:
1. **Error Accumulation**: With more solutions, there might be more incorrect solutions that confuse the aggregation mechanism.
2. **Redundancy**: Beyond a certain point, additional solutions may not provide novel correct information, leading to a plateau.
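A simple probabilistic toy model shows why Pass@K keeps rising while vote-based aggregation can stall or even decline. The sketch below assumes independent samples, a fixed per-sample success probability `p`, and a single competing wrong answer; these are simplifying assumptions for illustration and are not derived from the figure's data.

```python
from math import comb


def pass_at_k_prob(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def majority_correct_prob(p: float, n: int) -> float:
    """Probability that the correct answer wins a strict majority of n votes,
    assuming every wrong sample gives the same wrong answer (use odd n)."""
    return sum(
        comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(n // 2 + 1, n + 1)
    )


# A hard problem: each sample is correct with probability 0.4 (assumed).
p = 0.4
for n in (1, 3, 7, 15):
    print(
        n,
        round(pass_at_k_prob(p, n), 3),
        round(majority_correct_prob(p, n), 3),
    )
```

In this model, when `p` is below 0.5 the majority-vote success probability shrinks as N grows while the Pass@K probability approaches 1, so the gap between the two widens on hard problems.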