## Scatter Plots: OlymMATH EN vs. ZH Performance, Colored by Model Size
### Overview
The image consists of two scatter plots, each comparing the performance of various language models on the OlymMATH dataset. The left plot shows performance on the "EASY" subset, while the right plot shows performance on the "HARD" subset. The x-axis represents performance on the Chinese version (ZH) of the dataset, and the y-axis represents performance on the English version (EN). Data points are colored according to the model size, with a color gradient ranging from purple (small) to yellow (large). A dashed diagonal line is present in both plots, representing equal performance on both ZH and EN versions.
### Components/Axes
**Left Plot (EASY)**
* **Title:** OlymMATH EN-EASY (pass@1) vs. OlymMATH ZH-EASY (pass@1)
* **X-axis:** OlymMATH ZH-EASY (pass@1)
    * Scale: 0.0 to 0.8, incrementing by 0.2
* **Y-axis:** OlymMATH EN-EASY (pass@1)
    * Scale: 0.2 to 0.8, incrementing by 0.2
* **Data Points:** Represent different language models, colored by model size.
    * Shapes: Circles and Diamonds
* **Diagonal Line:** Dashed line indicating equal performance on ZH and EN.
**Right Plot (HARD)**
* **Title:** OlymMATH EN-HARD (pass@1) vs. OlymMATH ZH-HARD (pass@1)
* **X-axis:** OlymMATH ZH-HARD (pass@1)
    * Scale: 0.0 to 0.6, incrementing by 0.1
* **Y-axis:** OlymMATH EN-HARD (pass@1)
    * Scale: 0.0 to 0.6, incrementing by 0.1
* **Data Points:** Represent different language models, colored by model size.
    * Shapes: Circles and Diamonds
* **Diagonal Line:** Dashed line indicating equal performance on ZH and EN.
**Color Legend (Right Side)**
* **Title:** Model Parameters (Billions)
* **Scale:** 1.5B to 32.0B
    * 1.5B (Purple)
    * 9.1B (Blue)
    * 16.8B (Green)
    * 24.4B (Yellow-Green)
    * 32.0B (Yellow)
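
The five tick labels in the legend are consistent with evenly spaced ticks between 1.5B and 32.0B. The sketch below illustrates that mapping under the assumption that the colorbar is linear; the description does not say whether the scale is linear or logarithmic. `colorbar_position` and `tick_values` are hypothetical helper names, not part of any plotting library.

```python
def colorbar_position(params_b, lo=1.5, hi=32.0):
    """Map a parameter count (in billions) to [0, 1] on an ASSUMED linear colorbar.

    Values outside [lo, hi] are clipped to the ends of the scale.
    """
    clipped = min(max(params_b, lo), hi)
    return (clipped - lo) / (hi - lo)


def tick_values(n_ticks=5, lo=1.5, hi=32.0):
    """Return n_ticks evenly spaced tick values between lo and hi."""
    step = (hi - lo) / (n_ticks - 1)
    return [lo + i * step for i in range(n_ticks)]


ticks = tick_values()
# Formatted to one decimal place for comparison with the legend labels.
print([f"{t:.1f}" for t in ticks])
# Position of a 7B model on the assumed linear scale (0 = purple end, 1 = yellow end).
print(round(colorbar_position(7.0), 2))
```

Under this assumption, a 7B model sits in the lower fifth of the scale, which agrees with the blue color reported for the 7B models.
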
### Detailed Analysis
**Left Plot (EASY)**
* **DS-R1-1.5B:** Located at approximately (0.1, 0.15), purple.
* **STILL-3-1.5B:** Located at approximately (0.15, 0.2), purple.
* **DeepScaler-1.5B:** Located at approximately (0.2, 0.2), purple.
* **Light-R1-DS-7B:** Located at approximately (0.4, 0.5), blue.
* **DS-R1-7B:** Located at approximately (0.35, 0.45), blue.
* **Skywork-OR1-7B:** Located at approximately (0.35, 0.5), blue.
* **OpenThinker2-7B:** Located at approximately (0.5, 0.6), blue.
* **AceMath-RL-7B:** Located at approximately (0.5, 0.65), blue.
* **Skywork-OR1-Math-7B:** Located at approximately (0.4, 0.6), blue.
* **DS-R1-14B:** Located at approximately (0.55, 0.7), light blue.
* **Light-R1-DS-14B:** Located at approximately (0.5, 0.7), light blue.
* **DS-R1-32B:** Located at approximately (0.6, 0.7), yellow-green.
* **Qwen3-4B:** Located at approximately (0.65, 0.8), yellow-green.
* **Light-R1-DS-32B:** Located at approximately (0.6, 0.8), yellow-green.
* **OpenMath-1.5B:** Located at approximately (0.65, 0.4), purple.
* **OpenMath-7B:** Located at approximately (0.7, 0.6), blue.
* **OpenMath-14B:** Located at approximately (0.75, 0.75), light blue.
* **OpenMath-32B:** Located at approximately (0.55, 0.65), yellow-green.
* **Skywork-OR1-32B:** Located at approximately (0.75, 0.85), yellow-green.
* **QWQ-32B:** Located at approximately (0.75, 0.85), yellow-green.
* **Qwen3-30B-A3B:** Located at approximately (0.75, 0.85), yellow-green.
* **Qwen3-235B-A22B:** Located at approximately (0.75, 0.85), yellow-green.
* **o3-mini (high):** Located at approximately (0.75, 0.85), yellow-green.
* **Gemini 2.5 Pro Exp:** Located at approximately (0.75, 0.85), yellow-green.
* **DeepSeek-R1:** Located at approximately (0.75, 0.8), yellow-green.
* **GLM-Z1-AIR:** Located at approximately (0.75, 0.8), yellow-green.
* **OpenThinker2-32B:** Located at approximately (0.7, 0.7), yellow-green.
**Right Plot (HARD)**
* **DS-R1-1.5B:** Located at approximately (0.02, 0.01), purple.
* **STILL-3-1.5B:** Located at approximately (0.05, 0.02), purple.
* **DeepScaler-1.5B:** Located at approximately (0.1, 0.05), purple.
* **Light-R1-DS-7B:** Located at approximately (0.25, 0.05), blue.
* **DS-R1-7B:** Located at approximately (0.05, 0.05), blue.
* **Skywork-OR1-7B:** Located at approximately (0.05, 0.05), blue.
* **OpenThinker2-7B:** Located at approximately (0.15, 0.1), blue.
* **AceMath-RL-7B:** Located at approximately (0.1, 0.1), blue.
* **Skywork-OR1-Math-7B:** Located at approximately (0.25, 0.1), blue.
* **DS-R1-14B:** Located at approximately (0.25, 0.15), light blue.
* **Light-R1-DS-14B:** Located at approximately (0.15, 0.15), light blue.
* **DS-R1-32B:** Located at approximately (0.1, 0.2), yellow-green.
* **Qwen3-4B:** Located at approximately (0.15, 0.15), yellow-green.
* **Light-R1-DS-32B:** Located at approximately (0.15, 0.25), yellow-green.
* **OpenMath-1.5B:** Located at approximately (0.25, 0.05), purple.
* **OpenMath-7B:** Located at approximately (0.35, 0.2), blue.
* **OpenMath-14B:** Located at approximately (0.3, 0.25), light blue.
* **OpenMath-32B:** Located at approximately (0.4, 0.2), yellow-green.
* **Skywork-OR1-32B:** Located at approximately (0.25, 0.3), yellow-green.
* **QWQ-32B:** Located at approximately (0.15, 0.25), yellow-green.
* **Qwen3-30B-A3B:** Located at approximately (0.25, 0.3), yellow-green.
* **Qwen3-235B-A22B:** Located at approximately (0.3, 0.4), yellow-green.
* **o3-mini (high):** Located at approximately (0.35, 0.35), yellow-green.
* **Gemini 2.5 Pro Exp:** Located at approximately (0.6, 0.6), yellow-green.
* **DeepSeek-R1:** Located at approximately (0.2, 0.25), yellow-green.
* **GLM-Z1-AIR:** Located at approximately (0.15, 0.25), yellow-green.
* **OpenThinker2-32B:** Located at approximately (0.2, 0.2), yellow-green.
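
To make the EASY-to-HARD difference concrete, the following sketch computes the per-model drop in pass@1 for three of the models listed above. The coordinates are the approximate values read from the two lists, so the results are only as precise as those readings; the `easy`, `hard`, and `drop` names are illustrative.

```python
# Approximate (ZH, EN) pass@1 coordinates copied from the lists above.
easy = {
    "DS-R1-1.5B": (0.10, 0.15),
    "OpenThinker2-7B": (0.50, 0.60),
    "Gemini 2.5 Pro Exp": (0.75, 0.85),
}
hard = {
    "DS-R1-1.5B": (0.02, 0.01),
    "OpenThinker2-7B": (0.15, 0.10),
    "Gemini 2.5 Pro Exp": (0.60, 0.60),
}


def drop(model):
    """Return the (ZH, EN) decrease in pass@1 from EASY to HARD for one model."""
    zh_easy, en_easy = easy[model]
    zh_hard, en_hard = hard[model]
    return round(zh_easy - zh_hard, 2), round(en_easy - en_hard, 2)


for name in easy:
    zh_drop, en_drop = drop(name)
    print(f"{name}: ZH drop {zh_drop:.2f}, EN drop {en_drop:.2f}")
```
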
### Key Observations
* **Model Size and Performance:** Larger models (yellow-green to yellow) tend to sit toward the top-right of both plots, meaning they score higher on both the EASY and HARD subsets.
* **Difficulty Impact:** Scores are much lower on the HARD subset. EASY scores span roughly 0.1 to 0.85, whereas on HARD every model except Gemini 2.5 Pro Exp scores at or below roughly 0.4 on both axes.
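* **ZH vs. EN Performance:** Most models lie near the diagonal line, indicating broadly similar performance on the Chinese and English versions of the dataset. Some deviate: on EASY, OpenMath-1.5B at approximately (0.65, 0.4) falls well below the line (higher ZH score), while Skywork-OR1-Math-7B at approximately (0.4, 0.6) lies above it (higher EN score).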
* **Outliers:** Gemini 2.5 Pro Exp stands out in the HARD subset at approximately (0.6, 0.6), well ahead of the next-best models, which reach roughly 0.4 on either axis.
* **Model Grouping:** Some related models appear close together. For example, DS-R1-1.5B, STILL-3-1.5B, and DeepScaler-1.5B sit near the lower-left corner of both plots, which may reflect similar size or shared training approaches.
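
A model's deviation from the dashed diagonal can be expressed as the signed gap EN minus ZH. The sketch below classifies three EASY points from the list above; the 0.05 tolerance for calling a model "balanced" is an arbitrary assumption chosen for illustration, and the function names are hypothetical.

```python
def language_gap(zh, en):
    """Signed gap: positive means higher EN score, negative means higher ZH score."""
    return en - zh


def side_of_diagonal(zh, en, tol=0.05):
    """Classify a point relative to the EN = ZH line (tol is an assumed threshold)."""
    gap = language_gap(zh, en)
    if gap > tol:
        return "EN-favoring"
    if gap < -tol:
        return "ZH-favoring"
    return "balanced"


# Approximate (ZH, EN) coordinates from the EASY list.
points = {
    "OpenMath-1.5B": (0.65, 0.40),
    "Skywork-OR1-Math-7B": (0.40, 0.60),
    "OpenThinker2-32B": (0.70, 0.70),
}

for name, (zh, en) in points.items():
    print(f"{name}: gap {language_gap(zh, en):+.2f} -> {side_of_diagonal(zh, en)}")
```
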
### Interpretation
The scatter plots compare language model performance on the OlymMATH dataset along two dimensions: problem difficulty (EASY vs. HARD) and language (ZH vs. EN). Model size appears to be a significant factor, with larger models generally achieving higher pass@1. Size does not explain everything, however: models of similar size land at different positions, which suggests that architecture and training methodology also matter.
The HARD subset highlights the limitations of current models: scores fall into a much lower range, and even the largest models struggle with the more challenging problems. Outliers such as Gemini 2.5 Pro Exp suggest that some models may have specific advantages in complex mathematical reasoning.
The comparison between ZH and EN performance reveals language-specific gaps in some models, which could stem from differences in training data or architectural design. Investigating these gaps further could help improve generalization and cross-lingual transfer.
In summary, the plots show how these models compare on OlymMATH across difficulty and language, and point to model size, architecture, and training methodology as factors in accuracy and robustness.