## Line Chart: Accuracy Comparison of Four Language Models on Mathematical Topics
### Overview
This image is a line chart comparing the performance (accuracy) of four different large language models (LLMs) across a wide range of mathematical topics. The chart displays the accuracy percentage for each model on each topic, allowing for a direct comparison of their strengths and weaknesses in mathematical reasoning. The data is presented as four distinct, jagged lines, each corresponding to a specific model.
### Components/Axes
* **Chart Type:** Multi-line chart.
* **Y-Axis:**
* **Label:** "Accuracy" (written vertically on the left side).
* **Scale:** Linear scale from 0 to approximately 90.
* **Major Gridlines:** Horizontal dashed lines at intervals of 20 (0, 20, 40, 60, 80).
* **X-Axis:**
* **Label:** None explicitly stated. The axis represents discrete mathematical topics.
* **Tick Labels:** A series of mathematical topic names written in Chinese, rotated at a 45-degree angle for readability. The full list of topics (with English translations) is provided in the Detailed Analysis section.
* **Legend:**
* **Position:** Centered at the top of the chart, above the plot area.
* **Content:** Four entries, each with a colored line segment and marker, followed by the model name.
1. **Blue line with circle markers:** `Baichuan2-13B`
2. **Orange line with circle markers:** `LLaMA2-13B`
3. **Green line with circle markers:** `Qwen-14B`
4. **Red line with circle markers:** `InternLM2-Math-20B`
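The layout described above can be sketched with matplotlib. This is an illustrative reconstruction, not the original figure's code: the topic labels are English placeholders for the first three Chinese tick labels, and the values are a few of the chart-read estimates from the Detailed Analysis section.

```python
# Hypothetical sketch of the chart's layout: four colored lines with circle
# markers, dashed horizontal gridlines every 20, rotated x tick labels, and a
# legend centered above the plot area. Data here is placeholder, not ground truth.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

topics = ["Congruent Triangles", "Isosceles Triangles", "Equilateral Triangles"]
series = {
    "Baichuan2-13B": [65, 55, 70],
    "LLaMA2-13B": [35, 20, 15],
    "Qwen-14B": [5, 10, 5],
    "InternLM2-Math-20B": [45, 65, 65],
}
colors = {"Baichuan2-13B": "tab:blue", "LLaMA2-13B": "tab:orange",
          "Qwen-14B": "tab:green", "InternLM2-Math-20B": "tab:red"}

fig, ax = plt.subplots(figsize=(12, 4))
for name, vals in series.items():
    ax.plot(topics, vals, marker="o", color=colors[name], label=name)

ax.set_ylabel("Accuracy")                 # vertical label on the left
ax.set_ylim(0, 90)                        # linear scale, 0 to ~90
ax.set_yticks(range(0, 100, 20))          # major gridlines at 0, 20, 40, 60, 80
ax.yaxis.grid(True, linestyle="--")       # horizontal dashed gridlines
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")  # rotated tick labels
ax.legend(loc="lower center", bbox_to_anchor=(0.5, 1.02), ncol=4)  # legend on top
fig.tight_layout()
```

The categorical x-axis (plotting against a list of strings) mirrors the chart's discrete-topic axis rather than a numeric scale.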
### Detailed Analysis
The chart plots accuracy (as a percentage) for each model across 39 distinct mathematical topics. Below is an approximate data extraction for each model, listed in the order the topics appear on the x-axis (left to right). Values are estimated from the chart's gridlines and carry an uncertainty of roughly ±3-5 percentage points.
**X-Axis Topics (Chinese -> English Translation):**
1. 全等三角形 -> Congruent Triangles
2. 等腰三角形 -> Isosceles Triangles
3. 等边三角形 -> Equilateral Triangles
4. 平行四边形性质 -> Properties of Parallelograms
5. 圆周角定理 -> Inscribed Angle Theorem
6. 弧长和扇形面积 -> Arc Length and Sector Area
7. 点与圆的位置关系 -> Positional Relationship between a Point and a Circle
8. 函数与二元一次方程 -> Function and Linear Equation in Two Variables
9. 函数与一元一次方程 -> Function and Linear Equation in One Variable
10. 函数与一元二次方程 -> Function and Quadratic Equation in One Variable
11. 求一次函数的解析式 -> Finding the Analytic Expression of a Linear Function
12. 二次函数的性质 -> Properties of Quadratic Functions
13. 反比例函数的性质 -> Properties of Inverse Proportional Functions
14. 反比例函数的应用 -> Application of Inverse Proportional Functions
15. 点的坐标特征 -> Coordinate Characteristics of Points
16. 代数式求值 -> Evaluating Algebraic Expressions
17. 同底数幂 -> Powers with the Same Base
18. 约分与通分 -> Reducing Fractions and Finding Common Denominators
19. 十字相乘法 -> Cross Multiplication Method
20. 提公因式法 -> Factoring by Common Factor
21. 流程图 -> Flowcharts
22. 简单的轴对称图形 -> Simple Axially Symmetric Figures
23. 整式的乘法与因式分解 -> Multiplication of Integral Expressions and Factorization
24. 二次根式的乘除 -> Multiplication and Division of Quadratic Radicals
25. 二次根式的加减 -> Addition and Subtraction of Quadratic Radicals
26. 平方根与算术平方根 -> Square Root and Arithmetic Square Root
27. 一元一次方程的应用 -> Application of Linear Equation in One Variable
28. 一元二次方程的解法 -> Solution of Quadratic Equation in One Variable
29. 一元二次方程的应用 -> Application of Quadratic Equation in One Variable
30. 一元一次不等式 -> Linear Inequality in One Variable
31. 一元一次不等式组 -> System of Linear Inequalities in One Variable
32. 解一元二次方程 -> Solving Quadratic Equation in One Variable
33. 分式方程的应用 -> Application of Fractional Equations
34. 分式的化简求值 -> Simplification and Evaluation of Fractions
35. 数据的集中趋势 -> Central Tendency of Data
36. 数据的波动程度 -> Dispersion of Data
37. 频数分布直方图 -> Frequency Distribution Histogram
38. 概率的求法 -> Calculation of Probability
39. 随机事件与概率 -> Random Events and Probability
**Approximate Accuracy Data by Model:**
* **Baichuan2-13B (Blue Line):**
* **Trend:** Highly volatile, with frequent sharp peaks and troughs. Shows strong performance on several algebraic and geometric topics but also significant dips.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 65), (2, 55), (3, 70), (4, 45), (5, 35), (6, 35), (7, 25), (8, 50), (9, 55), (10, 45), (11, 80), (12, 70), (13, 55), (14, 40), (15, 55), (16, 78), (17, 85), (18, 50), (19, 68), (20, 60), (21, 40), (22, 45), (23, 55), (24, 40), (25, 65), (26, 50), (27, 40), (28, 80), (29, 45), (30, 75), (31, 55), (32, 50), (33, 70), (34, 70), (35, 70), (36, 45), (37, 50).
* **LLaMA2-13B (Orange Line):**
* **Trend:** Generally lower accuracy than the other models, with a few notable peaks. Performance is particularly weak on geometry and data statistics topics.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 35), (2, 20), (3, 15), (4, 20), (5, 15), (6, 5), (7, 25), (8, 45), (9, 50), (10, 20), (11, 55), (12, 45), (13, 20), (14, 20), (15, 60), (16, 50), (17, 55), (18, 20), (19, 50), (20, 45), (21, 5), (22, 10), (23, 15), (24, 35), (25, 25), (26, 15), (27, 25), (28, 40), (29, 25), (30, 40), (31, 30), (32, 25), (33, 60), (34, 30), (35, 40), (36, 15), (37, 30), (38, 50).
* **Qwen-14B (Green Line):**
* **Trend:** Shows the most consistent low-to-mid range performance, with very few high peaks. It frequently has the lowest accuracy, especially on geometry and equation-solving topics.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 5), (2, 10), (3, 5), (4, 25), (5, 15), (6, 5), (7, 20), (8, 0), (9, 5), (10, 5), (11, 15), (12, 10), (13, 15), (14, 5), (15, 45), (16, 30), (17, 5), (18, 15), (19, 30), (20, 30), (21, 15), (22, 5), (23, 40), (24, 15), (25, 20), (26, 10), (27, 20), (28, 20), (29, 5), (30, 20), (31, 20), (32, 40), (33, 25), (34, 0), (35, 30), (36, 15), (37, 10), (38, 5), (39, 10).
* **InternLM2-Math-20B (Red Line):**
* **Trend:** Often the top-performing model, with several high peaks above 80%. It shows particular strength in algebra, functions, and probability, but also has significant variability.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 45), (2, 65), (3, 65), (4, 70), (5, 55), (6, 20), (7, 40), (8, 35), (9, 50), (10, 25), (11, 90), (12, 65), (13, 65), (14, 25), (15, 45), (16, 75), (17, 70), (18, 70), (19, 85), (20, 83), (21, 25), (22, 30), (23, 50), (24, 55), (25, 60), (26, 35), (27, 35), (28, 75), (29, 50), (30, 75), (31, 35), (32, 80), (33, 50), (34, 25), (35, 55), (36, 75), (37, 85), (38, 70), (39, 70).
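As a rough sanity check on these extractions, the per-model mean over the first ten sampled values can be computed directly. The numbers are the chart-read estimates above and inherit their ±3-5 point uncertainty, so the means are indicative only.

```python
# Mean of each model's first ten chart-read accuracy estimates (topics #1-#10).
from statistics import mean

samples = {
    "Baichuan2-13B":      [65, 55, 70, 45, 35, 35, 25, 50, 55, 45],
    "LLaMA2-13B":         [35, 20, 15, 20, 15, 5, 25, 45, 50, 20],
    "Qwen-14B":           [5, 10, 5, 25, 15, 5, 20, 0, 5, 5],
    "InternLM2-Math-20B": [45, 65, 65, 70, 55, 20, 40, 35, 50, 25],
}

means = {name: mean(vals) for name, vals in samples.items()}
ranking = sorted(means, key=means.get, reverse=True)
# Baichuan2-13B (48.0) and InternLM2-Math-20B (47.0) form the top tier on this
# slice; LLaMA2-13B (25.0) and Qwen-14B (9.5) trail well behind.
```

On this early-topic slice Baichuan2-13B narrowly leads; over the full topic list the red and blue lines trade the lead, as noted below.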
### Key Observations
1. **Performance Hierarchy:** `InternLM2-Math-20B` (Red) and `Baichuan2-13B` (Blue) are generally the top performers, frequently trading the lead. `LLaMA2-13B` (Orange) and `Qwen-14B` (Green) consistently perform at a lower tier.
2. **Topic Sensitivity:** All models show extreme sensitivity to the specific mathematical topic. Accuracy can swing by 40-60 percentage points between adjacent topics. This suggests the models' mathematical reasoning is not robust or generalized but highly dependent on the specific problem type.
3. **Model-Specific Strengths:**
* `InternLM2-Math-20B` peaks on topics like "Finding the Analytic Expression of a Linear Function" (#11, ~90%) and "Cross Multiplication Method" (#19, ~85%).
* `Baichuan2-13B` excels on "Powers with the Same Base" (#17, ~85%) and "Application of Fractional Equations" (#33, ~70%).
* `LLaMA2-13B` has a notable peak on "Coordinate Characteristics of Points" (#15, ~60%).
* `Qwen-14B` performs best on "Coordinate Characteristics of Points" (#15, ~45%) and "Solving Quadratic Equation in One Variable" (#32, ~40%).
4. **Common Difficult Areas:** Geometry topics (e.g., #5-7: Inscribed Angle Theorem, Arc Length and Sector Area, Point-Circle Relationship) and statistics topics (#35-37) appear challenging for most models; `LLaMA2-13B` and `Qwen-14B` often score near or below 20% in these areas.
5. **Volatility:** The green line (`Qwen-14B`) is the most consistently low, while the red line (`InternLM2-Math-20B`) exhibits the highest peaks but also deep valleys, indicating specialized rather than broad competence.
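The topic sensitivity and volatility noted in observations 2 and 5 can be quantified as the largest accuracy swing between adjacent topics. This sketch uses the chart-read estimates for topics #8-#12 (again subject to the stated ±3-5 point reading error):

```python
# Largest adjacent-topic accuracy swing per model, using the chart-read
# estimates for topics #8-#12 (the function/equation topics through
# Properties of Quadratic Functions).
samples = {
    "Baichuan2-13B":      [50, 55, 45, 80, 70],
    "LLaMA2-13B":         [45, 50, 20, 55, 45],
    "Qwen-14B":           [0, 5, 5, 15, 10],
    "InternLM2-Math-20B": [35, 50, 25, 90, 65],
}

def max_swing(vals):
    """Maximum absolute change between consecutive topics."""
    return max(abs(b - a) for a, b in zip(vals, vals[1:]))

swings = {name: max_swing(vals) for name, vals in samples.items()}
# InternLM2-Math-20B jumps 65 points between topics #10 and #11 on this slice.
```

The red line's 65-point jump into topic #11 illustrates the large swings noted above, while the green line's small swings on this slice reflect uniformly low scores rather than robust competence.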
### Interpretation
This chart provides a granular diagnostic of LLM capabilities in mathematical reasoning, moving beyond aggregate benchmarks. The data suggests that:
1. **Specialization over Generalization:** The high volatility indicates that these models have not achieved a unified "understanding" of mathematics. Instead, they possess a patchwork of competencies, likely reflecting biases in their training data towards certain problem formats or topics. A model may excel at algebraic manipulation but fail at geometric visualization.
2. **The "Math" in Model Names Matters:** The `InternLM2-Math-20B` model, which likely underwent math-specific fine-tuning or training, demonstrates a clear, though not absolute, advantage, especially on complex algebraic tasks. This validates the approach of domain-specific adaptation for technical fields.
3. **Instruction Following vs. Reasoning:** The poor performance on applied topics (e.g., "Application of...") across several models may highlight a gap between procedural knowledge (solving a given equation type) and the deeper reasoning required to translate a word problem into a mathematical formulation.
4. **Implications for Use:** Users cannot assume consistent performance from any single model across a math curriculum. A model strong in algebra may be unreliable for geometry. This underscores the need for topic-aware model selection or ensemble approaches for educational or technical applications.
5. **Data as a Diagnostic Tool:** For developers, the specific topics where a model fails (e.g., `Qwen-14B` on "Congruent Triangles" or `LLaMA2-13B` on "Flowcharts") provide direct targets for improving training data curation or fine-tuning strategies.
In essence, the chart reveals that current LLMs are not monolithic "math solvers" but tools with highly variable and topic-dependent proficiencies. Their performance is a complex function of model architecture, training data composition, and potential specialized tuning, with no single model demonstrating comprehensive mastery.