## Line Chart: Multi-Benchmark Performance of AI Models (Scores by Model Number)
### Overview
This image is a line chart comparing the performance of 22 different AI models (numbered 1 through 22) across seven distinct mathematical reasoning benchmarks. The chart plots the score percentage for each model on each benchmark, revealing significant variability in performance both across models and across different types of mathematical tasks.
### Components/Axes
* **X-Axis:** Labeled **"Model Number"**. It is a linear scale with integer markers from **1 to 22**.
* **Y-Axis:** Labeled **"Score (%)"**. It is a linear scale from **0 to 100**, with major gridlines at intervals of 20% (0, 20, 40, 60, 80, 100).
* **Data Series (Legend & Placement):** The legend is integrated directly into the chart area, with labels placed near the end of their respective lines.
1. **MGSM** (Orange line, square markers): Label positioned near the top-left, above its final data point.
2. **MATH** (Blue line, circle markers): Label positioned in the middle-left area, above its line.
3. **MATH-500** (Pink line, circle markers): Label positioned in the upper-middle area, above its line.
4. **MathVista** (Red line, triangle markers): Label positioned in the middle-right area, above its line.
5. **AIME 2024** (Brown line, diamond markers): Label positioned near the top-center, above its line.
6. **AIME 2025** (Yellow-green line, circle markers): Label positioned at the top-right, above its line.
7. **FrontierMath, Tier 1-3** (Cyan line, no markers): Label positioned in the bottom-right corner, above its line.
### Detailed Analysis
**Trend Verification & Data Points (Approximate):**
* **MGSM (Orange, Squares):** Shows a strong upward trend. Starts at ~56% (Model 1), rises to ~74% (Model 2), peaks at ~88% (Model 3), dips slightly to ~87% (Model 4), and ends at ~90% (Model 5).
* **MATH (Blue, Circles):** Shows an overall upward trend with a mid-dip. Starts at ~43% (Model 1), stays flat at ~42% (Model 2), jumps to ~72% (Model 3), dips to ~70% (Model 4), and ends at ~76% (Model 5).
* **MATH-500 (Pink, Circles):** Shows a steep, consistent upward trend. Starts at ~60% (Model 5), rises to ~85% (Model 6), ~90% (Model 7), and peaks at ~95% (Model 8).
* **MathVista (Red, Triangles):** Shows high volatility. Starts at ~58% (Model 3), dips to ~56% (Model 4), rises to ~64% (Model 5), ~70% (Model 6), peaks at ~74% (Model 8), then drops sharply to ~56% (Model 10). It recovers to ~73% (Model 11), holds at ~72% (Models 12, 13), jumps to ~87% (Model 14), dips to ~84% (Model 15), and ends at ~86% (Model 16).
* **AIME 2024 (Brown, Diamonds):** Shows extreme volatility. Starts very low at ~8% (Model 4), rises to ~13% (Model 5), then surges to ~57% (Model 6), ~70% (Model 7), ~83% (Model 8), and peaks at ~86% (Model 9). It then crashes to ~29% (Model 10), recovers to ~50% (Model 11), dips to ~48% (Model 12) and ~37% (Model 13), before a strong recovery to ~87% (Model 14), ~93% (Model 15), ~91% (Model 16), ~93% (Model 17), and ends at ~96% (Model 18).
* **AIME 2025 (Yellow-green, Circles):** Shows a consistent, high-level upward trend. Starts at ~79% (Model 8), rises to ~87% (Model 14), ~93% (Model 15), ~98% (Model 16), and ends at or near a perfect 100% (Model 22).
* **FrontierMath, Tier 1-3 (Cyan, No Markers):** Shows a gradual upward trend from a low baseline. Starts at ~19% (Model 15), dips to ~16% (Model 16), then rises to ~27% (Model 20), dips slightly to ~26% (Model 21), and ends at ~32% (Model 22).
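The approximate values listed above can be captured as a small data structure and re-drawn as a sketch. Note these are rough estimates read off the chart, not exact source data, and the matplotlib styling below (default colors and markers) only approximates the original figure:

```python
# Approximate (model_number, score_pct) pairs read off the chart;
# gaps in model numbers mean no value is listed for that benchmark.
series = {
    "MGSM":        [(1, 56), (2, 74), (3, 88), (4, 87), (5, 90)],
    "MATH":        [(1, 43), (2, 42), (3, 72), (4, 70), (5, 76)],
    "MATH-500":    [(5, 60), (6, 85), (7, 90), (8, 95)],
    "MathVista":   [(3, 58), (4, 56), (5, 64), (6, 70), (8, 74), (10, 56),
                    (11, 73), (12, 72), (13, 72), (14, 87), (15, 84), (16, 86)],
    "AIME 2024":   [(4, 8), (5, 13), (6, 57), (7, 70), (8, 83), (9, 86),
                    (10, 29), (11, 50), (12, 48), (13, 37), (14, 87),
                    (15, 93), (16, 91), (17, 93), (18, 96)],
    "AIME 2025":   [(8, 79), (14, 87), (15, 93), (16, 98), (22, 100)],
    "FrontierMath, Tier 1-3": [(15, 19), (16, 16), (20, 27), (21, 26), (22, 32)],
}

try:  # plotting is optional; the data structure alone is the point of this sketch
    import matplotlib
    matplotlib.use("Agg")  # headless backend, renders to file only
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, points in series.items():
        xs, ys = zip(*points)
        ax.plot(xs, ys, marker="o", label=name)
    ax.set_xlabel("Model Number")
    ax.set_ylabel("Score (%)")
    ax.set_ylim(0, 100)
    ax.set_xticks(range(1, 23))
    ax.legend(fontsize="small")
    fig.savefig("benchmark_scores.png")
except ImportError:
    pass
```

The dictionary also makes the uneven benchmark coverage explicit: each series spans a different range of model numbers.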
### Key Observations
1. **Benchmark Difficulty Spectrum:** There is a clear hierarchy of benchmark difficulty. **AIME 2025** and, for the later models, **AIME 2024** yield the highest scores, while **FrontierMath, Tier 1-3** yields the lowest scores by a significant margin.
2. **Model 10 Anomaly:** Model 10 is a critical outlier, causing a severe performance drop on both **MathVista** (to ~56%) and especially **AIME 2024** (to ~29%). This suggests Model 10 has a specific weakness that these two benchmarks expose.
3. **Performance Volatility:** The **AIME 2024** and **MathVista** series are highly volatile, indicating that model performance on these benchmarks is not stable and can vary dramatically between consecutive model numbers.
4. **Late-Model Dominance:** Models numbered 14 and above generally show strong, high performance across most benchmarks where they are evaluated, particularly on the AIME series.
5. **Evolving Benchmark Coverage:** No single model is plotted on all seven benchmarks. Models 1-5 are tested on MGSM and MATH; models 5-8 on MATH-500; models 3-16 on MathVista; models 4-18 on AIME 2024; models 8-22 on AIME 2025; and models 15-22 on FrontierMath. This suggests the benchmarking focus shifted over successive model generations.
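The volatility claim in observation 3 can be checked numerically: taking the approximate values from the Detailed Analysis, the largest score swing between consecutive listed data points identifies the most unstable series. A minimal sketch (values are chart estimates; models with no listed value are simply skipped, so some "consecutive" pairs span more than one model number):

```python
# Approximate (model_number, score_pct) pairs for the two volatile series.
mathvista = [(3, 58), (4, 56), (5, 64), (6, 70), (8, 74), (10, 56),
             (11, 73), (12, 72), (13, 72), (14, 87), (15, 84), (16, 86)]
aime_2024 = [(4, 8), (5, 13), (6, 57), (7, 70), (8, 83), (9, 86),
             (10, 29), (11, 50), (12, 48), (13, 37), (14, 87),
             (15, 93), (16, 91), (17, 93), (18, 96)]

def max_swing(points):
    """Largest absolute score change between consecutive listed points,
    returned as (delta, from_model, to_model)."""
    return max(
        (abs(b[1] - a[1]), a[0], b[0])
        for a, b in zip(points, points[1:])
    )

print(max_swing(mathvista))  # → (18, 8, 10): MathVista's sharpest move
print(max_swing(aime_2024))  # → (57, 9, 10): AIME 2024's crash into Model 10
```

Both series register their largest single swing landing on Model 10, which is exactly the anomaly flagged in observation 2, and AIME 2024's swing is roughly three times MathVista's.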
### Interpretation
This chart visualizes the progression and specialization of AI models in mathematical reasoning. The data suggests that as model numbers increase (likely representing newer or more advanced versions), performance on challenging competition-style math (AIME) improves dramatically, eventually reaching near-perfect scores on the 2025 version. However, this progress is not linear or universal.
The extreme volatility in series like **AIME 2024** indicates that improvements can be brittle; a model might excel at one set of problems but fail at a slightly different set presented in the next model iteration. The catastrophic drop at **Model 10** is a key investigative point—it may represent a model that was optimized for a different objective, had a training regression, or encountered a specific type of problem it was not equipped to handle.
The consistently low scores on **FrontierMath, Tier 1-3** highlight a persistent challenge. Even the most advanced models (20-22) only achieve scores in the 20-30% range, suggesting this benchmark tests a frontier of mathematical reasoning that remains largely unsolved by current AI. The chart ultimately tells a story of significant but uneven progress, where mastery of one domain (e.g., AIME) does not guarantee mastery of another (e.g., FrontierMath), and where model development involves both leaps forward and occasional, unexplained setbacks.