This image displays two radar charts, labeled (a) and (b), comparing the performance of six different models across various benchmarks and domains. Both charts share a common legend. The radial axes represent performance percentages, ranging from 0% at the center to 100% at the outermost circle, with intermediate markers at 20%, 40%, 60%, and 80%.
The legend, located in the top right quadrant of the overall image, to the right of chart (a) and above chart (b), defines the models and their corresponding line styles and colors:
* `- -` (dashed, dark gray): Qwen2.5-32B
* `- -` (dashed, light blue): Qwen2.5-32B-IT
* `- -` (dashed, orange): ORZ-32B
* `- -` (dashed, green): SimpleRL-32B
* `- -` (dashed, purple): Baseline-32B
* `—` (solid, red): SwS-32B
The SwS-32B polygon is filled with a light red shade, and its solid red line forms the outer boundary of that shaded region, reflecting that SwS-32B generally achieves the highest scores. Numerical values are labeled only for SwS-32B at each axis point; values for the other models are estimated from their proximity to the radial grid lines.
---
### (a) Performance across Benchmarks
This radar chart, titled "(a) Performance across Benchmarks," evaluates the models on seven benchmarks, with GSM8K on the top axis.
**Radial Axes (Benchmarks) and SwS-32B Performance (clockwise from top):**
* GSM8K: 96.3
* MATH 500: 89.4
* Minerva Math: 47.1
* Olympiad Bench: 60.5
* GaoKao 2023: 80.3
* AMC23: 90.0
* AIME @32: 31.2
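
The axis layout listed above can be sketched as polar geometry. This is a minimal illustration, not the code used to draw the figure: it assumes the seven axes are equally spaced, that the first axis points straight up, and that the radius scales linearly from 0% at the center to 100% at the outer circle. The helper name `radar_vertices` is hypothetical.

```python
import math

# SwS-32B scores as labeled in chart (a), clockwise from the top axis.
SWS_BENCHMARKS = [
    ("GSM8K", 96.3),
    ("MATH 500", 89.4),
    ("Minerva Math", 47.1),
    ("Olympiad Bench", 60.5),
    ("GaoKao 2023", 80.3),
    ("AMC23", 90.0),
    ("AIME @32", 31.2),
]


def radar_vertices(scores, max_value=100.0):
    """Return (label, x, y) polygon vertices on a unit-radius radar chart.

    Assumption: axes are equally spaced, the first axis points up,
    and later axes follow clockwise.
    """
    n = len(scores)
    vertices = []
    for i, (label, value) in enumerate(scores):
        # Start at 90 degrees (top) and subtract to move clockwise.
        angle = math.pi / 2 - 2 * math.pi * i / n
        r = value / max_value  # 0.0 at the center, 1.0 at the 100% ring
        vertices.append((label, r * math.cos(angle), r * math.sin(angle)))
    return vertices


vertices = radar_vertices(SWS_BENCHMARKS)
for label, x, y in vertices:
    print(f"{label:15s} x={x:+.3f} y={y:+.3f}")
```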
**Data Series (Model Performance) and Trends:**
* **SwS-32B** (solid red line, light red shaded area):
    * **Trend:** This model consistently demonstrates the highest performance across all benchmarks, forming the outermost boundary of the shaded region. Its performance is particularly strong on GSM8K, AMC23, and MATH 500, while its lowest scores are on AIME @32 and Minerva Math.
    * **Performance Values:** GSM8K: 96.3, MATH 500: 89.4, Minerva Math: 47.1, Olympiad Bench: 60.5, GaoKao 2023: 80.3, AMC23: 90.0, AIME @32: 31.2.
* **Qwen2.5-32B** (dashed dark gray line):
    * **Trend:** This model generally exhibits the lowest performance among all models, with noticeable dips on Olympiad Bench and AIME @32.
    * **Estimated Performance Values:** GSM8K: ~85%, MATH 500: ~75%, Minerva Math: ~35%, Olympiad Bench: ~30%, GaoKao 2023: ~55%, AMC23: ~65%, AIME @32: ~20%.
* **Qwen2.5-32B-IT** (dashed light blue line):
    * **Trend:** This model shows moderate performance, generally above Qwen2.5-32B but below SwS-32B and other models on most benchmarks.
    * **Estimated Performance Values:** GSM8K: ~90%, MATH 500: ~80%, Minerva Math: ~40%, Olympiad Bench: ~45%, GaoKao 2023: ~70%, AMC23: ~80%, AIME @32: ~25%.
* **ORZ-32B** (dashed orange line):
    * **Trend:** This model's performance is competitive, often similar to Baseline-32B and SimpleRL-32B, but generally below SwS-32B. It shows relatively strong performance on Olympiad Bench compared to the Qwen2.5 models.
    * **Estimated Performance Values:** GSM8K: ~88%, MATH 500: ~78%, Minerva Math: ~42%, Olympiad Bench: ~50%, GaoKao 2023: ~75%, AMC23: ~85%, AIME @32: ~28%.
* **SimpleRL-32B** (dashed green line):
    * **Trend:** This model performs strongly, often very close to SwS-32B, particularly on GSM8K and MATH 500. It has a noticeable dip on Minerva Math and Olympiad Bench compared to its other scores.
    * **Estimated Performance Values:** GSM8K: ~95%, MATH 500: ~88%, Minerva Math: ~38%, Olympiad Bench: ~48%, GaoKao 2023: ~78%, AMC23: ~88%, AIME @32: ~29%.
* **Baseline-32B** (dashed purple line):
    * **Trend:** This model shows consistently good performance, often ranking second or third among the models, generally above ORZ-32B and Qwen2.5 models, but below SwS-32B. It performs relatively well on AIME @32.
    * **Estimated Performance Values:** GSM8K: ~92%, MATH 500: ~85%, Minerva Math: ~45%, Olympiad Bench: ~55%, GaoKao 2023: ~79%, AMC23: ~89%, AIME @32: ~30%.
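
To make the comparison between SwS-32B and Baseline-32B concrete, the sketch below computes per-benchmark margins and the SwS-32B mean from the values listed above. The Baseline-32B numbers are the visual estimates from this description, not labeled values, so the resulting margins are rough approximations only. The variable names are illustrative.

```python
# Labeled SwS-32B scores from chart (a).
SWS = {
    "GSM8K": 96.3,
    "MATH 500": 89.4,
    "Minerva Math": 47.1,
    "Olympiad Bench": 60.5,
    "GaoKao 2023": 80.3,
    "AMC23": 90.0,
    "AIME @32": 31.2,
}

# Baseline-32B scores ESTIMATED from the grid lines (approximate).
BASELINE_EST = {
    "GSM8K": 92.0,
    "MATH 500": 85.0,
    "Minerva Math": 45.0,
    "Olympiad Bench": 55.0,
    "GaoKao 2023": 79.0,
    "AMC23": 89.0,
    "AIME @32": 30.0,
}

# Approximate margin of SwS-32B over Baseline-32B, in percentage points.
margins = {name: round(SWS[name] - BASELINE_EST[name], 1) for name in SWS}

# Unweighted mean of the labeled SwS-32B scores.
sws_mean = sum(SWS.values()) / len(SWS)

# Benchmark where the estimated gap is widest.
largest_gap = max(margins, key=margins.get)

for name, gap in margins.items():
    print(f"{name:15s} +{gap:.1f}")
print(f"SwS-32B mean: {sws_mean:.1f}")
print(f"Largest estimated gap: {largest_gap}")
```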
---
### (b) Performance across Domains
This radar chart, titled "(b) Performance across Domains," evaluates the models on seven mathematical domains, with Prealgebra on the top axis.
**Radial Axes (Domains) and SwS-32B Performance (clockwise from top):**
* Prealgebra: 96.3
* Intermediate Algebra: 84.1
* Algebra: 76.6
* Geometry: 60.8
* Counting & Probability: 57.1
* Precalculus: 72.3
* Number Theory: 66.5
**Data Series (Model Performance) and Trends:**
* **SwS-32B** (solid red line, light red shaded area):
    * **Trend:** This model consistently demonstrates the highest performance across all domains, forming the outermost boundary of the shaded region. Its performance is particularly strong on Prealgebra and Intermediate Algebra, while its lowest scores are on Counting & Probability and Geometry.
    * **Performance Values:** Prealgebra: 96.3, Intermediate Algebra: 84.1, Algebra: 76.6, Geometry: 60.8, Counting & Probability: 57.1, Precalculus: 72.3, Number Theory: 66.5.
* **Qwen2.5-32B** (dashed dark gray line):
    * **Trend:** This model generally exhibits the lowest performance among all models across most domains, with significant dips on Geometry and Counting & Probability.
    * **Estimated Performance Values:** Prealgebra: ~85%, Intermediate Algebra: ~70%, Algebra: ~60%, Geometry: ~40%, Counting & Probability: ~35%, Precalculus: ~50%, Number Theory: ~45%.
* **Qwen2.5-32B-IT** (dashed light blue line):
    * **Trend:** This model shows moderate performance, generally above Qwen2.5-32B but below SwS-32B and other models on most domains.
    * **Estimated Performance Values:** Prealgebra: ~90%, Intermediate Algebra: ~75%, Algebra: ~65%, Geometry: ~50%, Counting & Probability: ~45%, Precalculus: ~60%, Number Theory: ~55%.
* **ORZ-32B** (dashed orange line):
    * **Trend:** This model's performance is competitive, often similar to Baseline-32B and SimpleRL-32B, but generally below SwS-32B. It shows relatively strong performance on Geometry and Counting & Probability compared to the Qwen2.5 models.
    * **Estimated Performance Values:** Prealgebra: ~88%, Intermediate Algebra: ~78%, Algebra: ~70%, Geometry: ~55%, Counting & Probability: ~50%, Precalculus: ~65%, Number Theory: ~60%.
* **SimpleRL-32B** (dashed green line):
    * **Trend:** This model performs strongly, often very close to SwS-32B, particularly on Prealgebra and Intermediate Algebra. It has a noticeable dip on Geometry and Counting & Probability compared to its other scores, but still performs better than most other non-SwS models.
    * **Estimated Performance Values:** Prealgebra: ~95%, Intermediate Algebra: ~82%, Algebra: ~75%, Geometry: ~58%, Counting & Probability: ~55%, Precalculus: ~70%, Number Theory: ~65%.
* **Baseline-32B** (dashed purple line):
    * **Trend:** This model shows consistently good performance, often ranking second or third among the models, generally above ORZ-32B and Qwen2.5 models, but below SwS-32B. It performs relatively well across all domains, with its lowest scores on Counting & Probability and Geometry.
    * **Estimated Performance Values:** Prealgebra: ~92%, Intermediate Algebra: ~80%, Algebra: ~72%, Geometry: ~59%, Counting & Probability: ~56%, Precalculus: ~71%, Number Theory: ~66%.
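
The ranking claims for chart (b) can be checked mechanically against the values listed above. The sketch below sorts the six models per domain. Only the SwS-32B row holds labeled values; every other row is a visual estimate from the grid lines, so close orderings (for example Baseline-32B versus SimpleRL-32B) should be treated as tentative. The names `DOMAINS`, `SCORES`, and `ranking` are illustrative.

```python
DOMAINS = [
    "Prealgebra", "Intermediate Algebra", "Algebra", "Geometry",
    "Counting & Probability", "Precalculus", "Number Theory",
]

# Scores per model, in DOMAINS order.
# SwS-32B values are labeled in the chart; all others are ESTIMATES.
SCORES = {
    "Qwen2.5-32B":    [85, 70, 60, 40, 35, 50, 45],
    "Qwen2.5-32B-IT": [90, 75, 65, 50, 45, 60, 55],
    "ORZ-32B":        [88, 78, 70, 55, 50, 65, 60],
    "SimpleRL-32B":   [95, 82, 75, 58, 55, 70, 65],
    "Baseline-32B":   [92, 80, 72, 59, 56, 71, 66],
    "SwS-32B":        [96.3, 84.1, 76.6, 60.8, 57.1, 72.3, 66.5],
}

# For each domain, list the models from highest to lowest score.
ranking = {
    domain: sorted(SCORES, key=lambda model: SCORES[model][i], reverse=True)
    for i, domain in enumerate(DOMAINS)
}

for domain, order in ranking.items():
    print(f"{domain:24s} 1st: {order[0]:14s} 2nd: {order[1]}")
```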