## Line Chart: Accuracy of Four Language Models Across Mathematical Topics
### Overview
This image is a line chart comparing the performance of four large language models (LLMs) on a wide array of mathematical topics. The chart plots "Accuracy" (y-axis) against a comprehensive list of mathematical concepts (x-axis). Each model is represented by a distinct colored line with markers, showing its accuracy score for each topic. The overall visual impression is one of high variability, with models performing very differently depending on the specific mathematical domain.
### Components/Axes
* **Chart Type:** Multi-line chart with data point markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear scale from 0 to 100.
* **Major Gridlines:** Horizontal dashed lines at 20, 40, 60, 80, and 100.
* **X-Axis:**
* **Label:** None explicit. The axis consists of categorical labels for mathematical topics.
* **Categories (from left to right):** Add & subtract, Arithmetic sequences, Congruence & similarity, Consumer math, Counting principle, Distance between two points, Divide, Domain & range of functions, Equiv measurements, Estimate metric measurements, Exponents & scientific notation, Financial literacy, Fractions & decimals, Geometric sequences, Interpret functions, Linear equations, Linear functions, Lines & angles, Make predictions, Multiply, Nonlinear functions, One-variable statistics, Percent, Perimeter & area, Prime factorization, Prime or composite events, Probability of compound events, Probability of one event, Probability of opposite events, Proportional relationships, Rational & irrational numbers, Scale drawings, Square roots & cube roots, Surface area & volume, Systems of equations, Triangles, Two-variable statistics, Absolute value, Axis, Center & variability, Circle, Factors, Independent & dependent events, Inequalities, Mean, median, mode & range, Opposite integer, Outlier, Polygons, Polyhedra, Radical exps, Transformations, Trapezoids, Variable exprs.
* **Legend:**
* **Position:** Top center, above the plot area.
* **Series:**
1. **InternLM2-20B:** Blue line with circular markers.
2. **Yi-34B:** Orange line with circular markers.
3. **Qwen-72B:** Green line with circular markers.
4. **GPT-3.5:** Red line with circular markers.
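A chart with this structure (four marked line series, a 0-100 "Accuracy" y-axis with dashed gridlines, rotated categorical x-labels, and a legend above the plot) could be sketched in matplotlib roughly as follows. The topic subset and all accuracy values below are invented placeholders for illustration, not values read from the actual figure.

```python
# Minimal sketch of the chart's structure; scores are hypothetical.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

topics = ["Add & subtract", "Linear equations", "Two-variable statistics"]
# Hypothetical per-topic accuracy scores for each series (placeholders).
series = {
    "InternLM2-20B": [65, 25, 20],
    "Yi-34B": [98, 60, 50],
    "Qwen-72B": [100, 40, 35],
    "GPT-3.5": [100, 85, 70],
}

fig, ax = plt.subplots(figsize=(10, 4))
for name, scores in series.items():
    # Each model: one line with circular markers, one point per topic.
    ax.plot(topics, scores, marker="o", label=name)

ax.set_ylabel("Accuracy")
ax.set_ylim(0, 100)
ax.yaxis.grid(True, linestyle="--")  # horizontal dashed gridlines
# Legend centered above the plot area, as in the described chart.
ax.legend(loc="lower center", bbox_to_anchor=(0.5, 1.02), ncol=4)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
fig.tight_layout()
```

In the real chart the x-axis carries all 50+ topic labels, which is why rotated tick labels are essential for legibility.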
### Detailed Analysis
The chart reveals significant performance disparities among the models across the 50+ mathematical topics.
**1. GPT-3.5 (Red Line):**
* **Trend:** Generally the highest-performing model across the chart. Its line is highly volatile, with many peaks at or near 100% accuracy and several deep troughs.
* **Key Points:** Achieves ~100% accuracy on topics like "Add & subtract," "Congruence & similarity," "Counting principle," "Prime or composite events," "Probability of opposite events," "Systems of equations," "Absolute value," "Circle," "Independent & dependent events," "Polygons," and "Variable exprs." Its lowest points appear to be around "Distance between two points" (~75%), "Scale drawings" (~65%), and "Two-variable statistics" (~70%).
**2. Qwen-72B (Green Line):**
* **Trend:** Often the second-best performer, closely following GPT-3.5. It shows a similar pattern of peaks and valleys but generally sits slightly below the red line.
* **Key Points:** Matches or nearly matches GPT-3.5's high scores on several topics (e.g., "Add & subtract," "Congruence & similarity"). It has notable peaks at "Geometric sequences" (~95%), "Prime factorization" (~95%), and "Polyhedra" (~95%). Its performance dips significantly on "Linear equations" (~40%), "One-variable statistics" (~45%), and "Two-variable statistics" (~35%).
**3. Yi-34B (Orange Line):**
* **Trend:** Typically the third-best performer, with its line often situated between the green (Qwen) and blue (InternLM) lines. It exhibits extreme volatility, with some of the highest peaks and lowest valleys on the chart.
* **Key Points:** Reaches near 100% on "Add & subtract" and "Prime or composite events." It suffers severe drops, notably on "Divide" (~45%), "Linear functions" (~30%), "Nonlinear functions" (~55%), and "Two-variable statistics" (~50%).
**4. InternLM2-20B (Blue Line):**
* **Trend:** Consistently the lowest-performing model across almost all topics. Its line is distinctly separated below the others, often fluctuating between 20% and 60% accuracy.
* **Key Points:** Its highest accuracy appears to be on "Add & subtract" (~65%) and "Variable exprs" (~35%). It has numerous points at or below 20%, including "Divide," "Linear functions," "Nonlinear functions," "Probability of compound events," "Scale drawings," "Two-variable statistics," "Center & variability," and "Radical exps."
### Key Observations
* **Universal Strength:** All four models perform best on foundational arithmetic ("Add & subtract"), with accuracies clustering between ~65% (InternLM) and ~100% (GPT-3.5).
* **Universal Challenge:** "Two-variable statistics" appears to be the most difficult topic overall, with all models scoring below 70%, and three models (InternLM, Yi, Qwen) scoring at or below 50%.
* **Performance Gap:** There is a consistent and significant performance gap between the top tier (GPT-3.5, Qwen-72B) and the bottom tier (InternLM2-20B), often spanning 30-50 percentage points on the same topic.
* **Volatility:** All models show high topic-dependent volatility. No model maintains a flat, high accuracy across the board. Performance is highly sensitive to the specific mathematical concept being tested.
* **Model Ranking Consistency:** The relative ranking of the models (GPT-3.5 > Qwen-72B > Yi-34B > InternLM2-20B) is remarkably consistent across the vast majority of topics.
### Interpretation
This chart provides a detailed benchmark of LLM capabilities in mathematical reasoning, revealing that model performance is not monolithic but highly domain-specific.
* **What the data suggests:** The top-performing models (GPT-3.5, Qwen-72B) show a substantially stronger grasp of a wide range of mathematical concepts than the other models tested. However, even the leading models have clear weaknesses in specific areas such as statistics and certain algebraic functions.
* **How elements relate:** The x-axis represents a curriculum of mathematical knowledge. The chart effectively maps each model's "knowledge profile" against this curriculum. The close tracking of the GPT-3.5 and Qwen-72B lines suggests they may have been trained on similar data or have similar architectural strengths for math, while the distinct separation of the InternLM2-20B line indicates a different capability level.
* **Notable anomalies:** The extreme volatility within each model's line is the most striking feature. It indicates that "mathematical ability" in LLMs is not a single skill but a collection of competencies that can be strong in one area (e.g., geometry) and weak in another (e.g., statistics) within the same model. The near-perfect scores on some topics versus sub-50% scores on others for the same model highlight the importance of granular, topic-specific evaluation over aggregate benchmarks.