# ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
## Abstract
This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates the concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with a single average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies. Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. We also introduce an efficient fine-tuning strategy to address the weaknesses of existing LLMs. Finally, we hope ConceptMath can guide developers in understanding the fine-grained mathematical abilities of their models and facilitate the growth of foundation models. The data and code are available at https://github.com/conceptmath/conceptmath.
\* First three authors contributed equally. † Corresponding author: Jiaheng Liu.
## 1 Introduction
Mathematical reasoning is a crucial capability for Large Language Models (LLMs). Recent LLMs, including Claude (Anthropic, 2023), GPT-4 (OpenAI, 2023), and LLaMA (Touvron et al., 2023a), have demonstrated impressive mathematical reasoning, achieving high average accuracies on benchmarks like GSM8K (Cobbe et al., 2021). Although these benchmarks measure the overall mathematical reasoning capabilities of LLMs on average, they fail to probe fine-grained failure modes on specific mathematical concepts. For example, Fig. 1 shows that the performance of LLaMA2-13B varies significantly across concepts and collapses on simple ones like Rational number and Cylinders. Knowing these specific failure modes is crucial, especially in practical applications that depend on particular mathematical abilities. For financial analysts, for instance, calculation and statistics are the concepts of most interest, while others such as geometry matter less.
Moreover, mathematics is by nature fine-grained rather than holistic. It is typically organized into distinct math concepts (see https://en.wikipedia.org/wiki/Lists_of_mathematics_topics), and humans develop comprehensive mathematical capabilities through a concept-by-concept, curriculum-based learning process (Simon, 2011; Fritz et al., 2013). These observations underscore the core motivation of this paper: the need for a fine-grained benchmark that evaluates the concept-wise mathematical reasoning capabilities of LLMs.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Line Chart: Accuracy Comparison of LLaMA2 vs. LLaMA2-FT Across Mathematical Topics
### Overview
This image is a line chart comparing the accuracy percentages of two language models, "LLaMA2" and "LLaMA2-FT" (likely a fine-tuned version), across 13 distinct mathematical topics. The chart is divided into two background regions: a light blue region on the left and a light pink region on the right labeled "Enhancing Weaknesses." The rightmost section, containing the last five topics, is also labeled "Weaknesses" at the bottom. The chart demonstrates how the fine-tuned model (LLaMA2-FT) improves upon the base model's (LLaMA2) performance, particularly in areas identified as weaknesses.
### Components/Axes
* **Chart Type:** Line chart with two data series.
* **Y-Axis:** Labeled "Accuracy". Scale ranges from 0 to 90, with major tick marks every 10 units (0, 10, 20, ..., 90).
* **X-Axis:** Lists 13 categorical mathematical topics. The labels are rotated approximately 45 degrees for readability. From left to right:
1. Powers
2. Numerical exprs
3. Estimation & rounding
4. Decimals
5. Light & heavy
6. Temperature
7. Ratio
8. Patterns
9. Cylinders
10. Perimeter
11. Rational number
12. Polygons
13. Probability
* **Legend:** Located in the bottom-left corner of the plot area.
* A solid green line corresponds to "LLaMA2".
* A solid blue line corresponds to "LLaMA2-FT".
* **Annotations:**
* A light pink shaded rectangle covers the rightmost five topics (Cylinders to Probability), labeled "Enhancing Weaknesses" in the top-right corner.
* The word "Weaknesses" is printed at the bottom of this pink region, aligned with the "Cylinders" tick mark.
* For each of the five topics in the "Weaknesses" region, a black dashed arrow points upward from the green line (LLaMA2) to the blue line (LLaMA2-FT). At the tip of each arrow (on the blue line) is a small, pink, five-pointed star.
### Detailed Analysis
**Data Series Trends:**
* **LLaMA2 (Green Line):** Shows high variability. It starts relatively high (~65%), dips sharply at "Estimation & rounding" (~30%), peaks at "Light & heavy" (~79%), then generally declines, hitting very low points in the "Weaknesses" section (as low as ~11% for "Polygons").
* **LLaMA2-FT (Blue Line):** Follows a similar overall pattern to LLaMA2 but is consistently equal to or higher than the green line. The improvement is most dramatic in the "Weaknesses" section, where the blue line shows a strong upward trend while the green line remains low.
**Approximate Data Points (Topic: LLaMA2 %, LLaMA2-FT %):**
1. **Powers:** ~65%, ~70%
2. **Numerical exprs:** ~65%, ~70%
3. **Estimation & rounding:** ~30%, ~35%
4. **Decimals:** ~65%, ~60% *(Note: LLaMA2-FT is slightly lower here)*
5. **Light & heavy:** ~79%, ~84%
6. **Temperature:** ~68%, ~68% *(Approximately equal)*
7. **Ratio:** ~37%, ~42%
8. **Patterns:** ~42%, ~37% *(Note: LLaMA2-FT is slightly lower here)*
9. **Cylinders:** ~12%, ~47% *(Large improvement, marked with star)*
10. **Perimeter:** ~26%, ~63% *(Large improvement, marked with star)*
11. **Rational number:** ~12%, ~52% *(Large improvement, marked with star)*
12. **Polygons:** ~11%, ~66% *(Large improvement, marked with star)*
13. **Probability:** ~19%, ~75% *(Large improvement, marked with star)*
### Key Observations
1. **Targeted Improvement:** The most significant and consistent gains for LLaMA2-FT occur precisely in the five topics (Cylinders through Probability) highlighted as "Weaknesses" for the base model. The dashed arrows and stars explicitly draw attention to these improvements.
2. **Performance Gap:** In the "Weaknesses" section, the accuracy gap between the two models is substantial, often exceeding 40 percentage points (e.g., Polygons: ~11% vs. ~66%).
3. **General Competence:** In the first eight topics (outside the highlighted weaknesses), the two models perform comparably, with LLaMA2-FT holding a slight edge in most, but not all, cases (e.g., it is slightly worse on "Decimals" and "Patterns").
4. **Peak Performance:** Both models achieve their highest accuracy on the "Light & heavy" topic (~79% and ~84%).
5. **Lowest Performance:** The base LLaMA2 model's lowest accuracy is on "Polygons" (~11%). The fine-tuned model's lowest point is on "Patterns" (~37%), which is still significantly higher than the base model's low points.
### Interpretation
This chart visually argues for the effectiveness of a fine-tuning (FT) process applied to the LLaMA2 model. The core message is that fine-tuning can be strategically used to "enhance weaknesses"—it dramatically improves performance on specific, targeted tasks where the base model struggles, without causing significant degradation on tasks where the base model was already competent.
The design of the chart reinforces this narrative. The shaded "Enhancing Weaknesses" region and the annotation arrows create a clear before-and-after story for the five selected topics. The data suggests the fine-tuning was highly successful, transforming areas of near-failure (10-20% accuracy) into areas of moderate to strong competence (50-75% accuracy). This implies the fine-tuning data or method was particularly well-suited to addressing the specific reasoning challenges posed by topics like geometry (Cylinders, Perimeter, Polygons), number theory (Rational number), and Probability. The chart serves as evidence that model weaknesses are not fixed and can be addressed through targeted post-training.
</details>
Figure 1: The concept-wise accuracies of LLaMA2-13B and the fine-tuned version based on our efficient fine-tuning method (i.e., LLaMA2-FT).
Therefore, we first introduce ConceptMath, the first bilingual (English and Chinese), concept-wise benchmark for measuring mathematical reasoning. ConceptMath gathers math concepts from four educational systems, resulting in four distinct mathematical concept systems: English Elementary, English Middle, Chinese Elementary, and Chinese Middle (abbreviated as Elementary-EN, Middle-EN, Elementary-ZH, and Middle-ZH, respectively). Each concept system organizes around 50 atomic math concepts under a three-level hierarchy, and each concept includes approximately 20 mathematical problems. Overall, ConceptMath comprises 4011 math word problems across 214 math concepts; Fig. 2 shows an overview of ConceptMath.
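Concretely, the three-level hierarchy makes concept-wise scoring straightforward. Below is a minimal sketch (not the benchmark's actual evaluation code) of how per-concept accuracies could be aggregated, assuming each problem is tagged with its concept path in the hierarchy:

```python
from collections import defaultdict

def concept_accuracy(results):
    """Aggregate per-problem correctness into concept-wise accuracies.

    `results` is a list of (concept_path, is_correct) pairs, where
    concept_path is a tuple such as ("Geometry", "Three-Dim Shapes",
    "Cylinders"), following a three-level hierarchy.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for path, ok in results:
        # Credit every prefix of the path so accuracy can be read at any
        # granularity: domain, sub-domain, or atomic concept.
        for depth in range(1, len(path) + 1):
            key = path[:depth]
            totals[key] += 1
            correct[key] += int(ok)
    return {key: correct[key] / totals[key] for key in totals}
```

Reading the same result dictionary at different prefix lengths yields exactly the coarse-to-fine view the benchmark advocates, without re-scoring any problem.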
Second, based on ConceptMath, we perform extensive experiments to assess the mathematical reasoning of existing LLMs, including 2 closed-source LLMs and 17 open-source LLMs. These evaluations are performed in zero-shot, chain-of-thought (CoT), and few-shot settings. To our surprise, even though most of the evaluated LLMs report high average accuracies on traditional mathematical benchmarks (e.g., GSM8K), they fail catastrophically across a wide spectrum of mathematical concepts.
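The three evaluation settings differ only in how the prompt is assembled. The sketch below illustrates this; the instruction wording is an assumption for illustration, not the paper's verbatim template:

```python
def build_prompt(question, shots=(), cot=False):
    """Assemble an evaluation prompt for one of three settings:
    zero-shot (no shots, cot=False), chain-of-thought (cot=True),
    or few-shot (shots is a sequence of (question, answer) exemplars).
    """
    parts = []
    for q, a in shots:  # few-shot exemplars, if any
        parts.append(f"Question: {q}\nAnswer: {a}")
    # CoT elicits intermediate reasoning; plain "Answer:" asks directly.
    instruction = "Let's think step by step." if cot else "Answer:"
    parts.append(f"Question: {question}\n{instruction}")
    return "\n\n".join(parts)
```

For example, `build_prompt(q, cot=True)` yields the CoT variant, while passing exemplars in `shots` produces the few-shot variant of the same question.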
Third, to make targeted improvements on underperforming math concepts, we propose an efficient fine-tuning strategy: we first train a concept classifier and then use it to retrieve a set of samples from large open-source math datasets (Paster et al., 2023; Wang et al., 2023b) for further fine-tuning of LLMs. In Fig. 1, for LLaMA2-FT, we observe that performance on these weak concepts improves substantially after applying this efficient fine-tuning method.
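The data-selection step of such a strategy can be sketched as follows. Here `classify_concept` is a stand-in for the trained concept classifier, and `per_concept_budget` is an assumed knob, not a value from the paper:

```python
def select_finetuning_data(corpus, weak_concepts, classify_concept,
                           per_concept_budget=1000):
    """Keep up to `per_concept_budget` samples per weak concept.

    `corpus` is an iterable of candidate math problems from a large
    open-source dataset; `classify_concept` maps a sample to its
    predicted atomic concept; `weak_concepts` is the set of concepts
    on which the base model underperforms.
    """
    kept = {c: [] for c in weak_concepts}
    for sample in corpus:
        concept = classify_concept(sample)  # predicted atomic concept
        bucket = kept.get(concept)
        if bucket is not None and len(bucket) < per_concept_budget:
            bucket.append(sample)
    # Flatten the per-concept buckets into a single fine-tuning set.
    return [s for bucket in kept.values() for s in bucket]
```

Capping each concept's bucket keeps the fine-tuning mix balanced across the targeted weaknesses instead of being dominated by whichever concept is most common in the corpus.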
In summary, our contributions are as follows:
- We introduce ConceptMath, the first bilingual, concept-wise benchmark for measuring mathematical reasoning. ConceptMath encompasses 4 concept systems, 214 math concepts, and 4011 math word problems, which can guide further improvements in the mathematical reasoning of existing models.
- Based on ConceptMath, we evaluate many LLMs and perform a comprehensive analysis of their results. For example, we observe that most of these LLMs (open-source, closed-source, general-purpose, and math-specialized alike) show significant performance variations across math concepts.
- We also evaluate the contamination rate of ConceptMath and introduce a simple and efficient fine-tuning method to address the weaknesses of existing LLMs.
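Contamination checks of this kind are commonly implemented with an n-gram overlap heuristic; the sketch below shows one such variant, and is not necessarily the protocol used in this paper:

```python
def ngram_contaminated(problem, training_text, n=13):
    """Flag a benchmark problem as potentially contaminated if any
    n-gram of its text also appears verbatim in the training corpus
    (a common heuristic; n=13 is an illustrative default).
    """
    tokens = problem.split()
    grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return any(g in training_text for g in grams)
```

The contamination rate of a benchmark is then just the fraction of its problems for which this predicate fires against the pretraining or fine-tuning corpus.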
<details>
<summary>x2.png Details</summary>

### Visual Description
## Sunburst Diagram: Mathematics Curriculum Knowledge Map
### Overview
The image displays a multi-level circular sunburst chart (also known as a radial treemap or multi-level pie chart) that visually organizes mathematical topics into a hierarchical structure. It represents a taxonomy or curriculum map, likely for K-12 mathematics education, showing how broad domains break down into specific sub-topics. The diagram uses color-coding and radial segmentation to illustrate relationships and relative emphasis.
### Components/Axes
The diagram has three concentric rings or levels, radiating from the center outward:
1. **Inner Circle (Core Domains):** Four primary mathematical domains, each assigned a distinct color.
2. **Middle Ring (Sub-Domains):** Major sub-categories within each core domain.
3. **Outer Ring (Specific Topics):** Granular topics or skills within each sub-domain.
**Color Legend (by Core Domain):**
* **Yellow:** Measurement
* **Orange:** Statistics
* **Blue:** Geometry
* **Green:** Calculate & Properties
**Spatial Layout:**
* The **Measurement** domain (yellow) occupies the bottom-left quadrant.
* The **Statistics** domain (orange) occupies the top-left quadrant.
* The **Geometry** domain (blue) occupies the top-right quadrant.
* The **Calculate & Properties** domain (green) occupies the bottom-right quadrant.
### Detailed Analysis
#### 1. Core Domain: Measurement (Yellow)
* **Sub-Domains (Middle Ring):**
* Basic Knowledge
* Ratio
* Weight
* Money
* **Specific Topics (Outer Ring, clockwise from top):**
* *Under Basic Knowledge:* Size, Length, Area, Volume, Time, Temperature.
* *Under Ratio:* Ratio, Proportional, Percents.
* *Under Weight:* Light & Heavy.
* *Under Money:* Exchanging Money, Coin Names & Value.
#### 2. Core Domain: Statistics (Orange)
* **Sub-Domains (Middle Ring):**
* Data
* Classifying & Sorting
* Probability
* **Specific Topics (Outer Ring, clockwise from top):**
* *Under Data:* Statistics.
* *Under Classifying & Sorting:* Classifying & Sorting.
* *Under Probability:* Probability.
#### 3. Core Domain: Geometry (Blue)
* **Sub-Domains (Middle Ring):**
* Angles
* Two-Dim Shapes
* Three-Dim Shapes
* Coordinate Plane
* **Specific Topics (Outer Ring, clockwise from top):**
* *Under Angles:* Angles.
* *Under Two-Dim Shapes:* Circles, Perimeter, Polygons, Quadrilaterals, Triangles.
* *Under Three-Dim Shapes:* Cones, Cubes, Cylinders, Spheres, Volume of 3D Shapes.
* *Under Coordinate Plane:* Coordinate Plane.
#### 4. Core Domain: Calculate & Properties (Green)
* **Sub-Domains (Middle Ring):**
* Properties
* Calculate
* **Specific Topics (Outer Ring, clockwise from bottom-left):**
* *Under Properties:* Patterns, Estimation & Rounding, Count, Compare, Variable Expressions.
* *Under Calculate:* Subtraction, Rational Number, Powers, Place Value, Numerical Expressions, Multiple, Mixed Operations, Fractions, Equations, Division, Decimals, Add.
### Key Observations
* **Hierarchical Emphasis:** The width of each segment appears to represent relative emphasis or curricular weight. For example, within "Geometry," the "Two-Dim Shapes" segment is significantly wider than "Angles," suggesting it covers more topics or is a larger unit of study.
* **Topic Density:** The "Calculate & Properties" domain has the highest density of specific topics in the outer ring, indicating it encompasses a wide range of fundamental arithmetic and algebraic skills.
* **Color Consistency:** Each sub-domain and its child topics in the outer ring consistently share the same base color as their parent core domain, with slight variations in shade for visual distinction.
* **Radial Flow:** The diagram is organized to be read clockwise within each quadrant, moving from broader concepts near the center to specific skills at the periphery.
### Interpretation
This diagram serves as a **visual curriculum framework or knowledge map for mathematics**. It demonstrates a structured, hierarchical approach to organizing mathematical concepts, moving from foundational domains to discrete, teachable skills.
* **Relationships:** It explicitly shows dependency and categorization. For instance, understanding "Volume of 3D Shapes" is nested under the broader "Three-Dim Shapes" sub-domain, which itself is part of the core "Geometry" domain. This helps educators and learners see how specific skills fit into the larger picture.
* **Pedagogical Insight:** The segmentation suggests a potential learning pathway or unit structure. A teacher might design a unit on "Geometry" that first covers "Angles," then progresses through various "Two-Dim Shapes," before moving to "Three-Dim Shapes."
* **Notable Anomaly:** The "Coordinate Plane" appears both as a sub-domain in the middle ring and as a specific topic in the outer ring directly beneath it. This could indicate it is both a major sub-category and a singular, important skill within itself.
* **Overall Purpose:** The map is a tool for curriculum planning, standards alignment, or student orientation. It answers the question, "What are the components of elementary/middle school mathematics?" by providing a single, comprehensive, and visually organized reference. It emphasizes the interconnectedness of mathematical ideas while clearly delineating discrete areas of study.
</details>
(a) English Elementary (Elementary-EN)
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Mathematics Curriculum Sunburst Chart
### Overview
The image displays a multi-level sunburst chart (a radial hierarchical diagram) that organizes mathematical topics into a structured curriculum. The chart is divided into four primary color-coded categories, each branching into subcategories and specific topics. The design is informational, intended to show the relationships and hierarchy within a mathematics learning framework. All text is in English.
### Components/Axes
The diagram has no traditional axes. It is composed of concentric rings divided into colored segments. The hierarchy flows from the center outward:
- **Innermost Ring (Core Categories):** Four main domains.
- **Middle Ring (Subcategories):** Major topic areas within each domain.
- **Outer Ring (Specific Topics):** Granular skills and concepts.
**Legend/Color Key (Integrated into the chart):**
- **Green (Teal):** Calculate
- **Blue:** Geometry
- **Yellow:** Statistic & Probability
- **Orange:** Exprs, Equations & Functions
### Detailed Analysis
The chart is segmented into four quadrants, each representing a core mathematical domain. Below is a complete transcription of all text, organized by quadrant and hierarchical level.
#### **1. Calculate (Green/Teal Quadrant - Top Right)**
* **Core Category:** Calculate
* **Subcategories & Specific Topics:**
* **Basic Calculate**
* Add & Subtract
* Decimals
* Divide
* Exponents & Scientific Notation
* Fractions & Decimals
* Multiply
* Square Roots & Cube Roots
* **Number Theory**
* Factors
* Prime Factorization
* Prime or Composite
* **Sequences**
* Arithmetic Sequences
* Geometric Sequences
* **Integers**
* Opposite Integers
* Absolute Value
* **Rational & Irrational Numbers**
* **Ratios & Rates**
* **Percents**
* **Financial Literacy**
* **Measurement**
* Estimate Metric Measurements
* **Consumer Math**
#### **2. Geometry (Blue Quadrant - Bottom Right)**
* **Core Category:** Geometry
* **Subcategories & Specific Topics:**
* **Two-Dim Figures**
* Polygons
* Perimeter & Area
* Lines & Angles
* Circle
* Square
* Trapezoids
* Triangle
* **Coordinate Plane**
* Axes
* Distance Between Two Points
* Quadrants
* **Three-Dim Figures**
* Polyhedra
* Surface Area & Volume
* **Congruence & Similarity**
* **Transformations**
* **Scale Drawings**
* **Slope**
#### **3. Statistic & Probability (Yellow Quadrant - Bottom Left)**
* **Core Category:** Statistic & Probability
* **Subcategories & Specific Topics:**
* **Data**
* One-Variable Statistics
* Two-Variable Statistics
* Outlier
* Mean, Median, Mode & Range
* Center & Variability
* **Probability**
* Probability of Simple and Opposite Events
* Probability of One Event
* Probability of Compound Events
* Make Predictions
* Independent & Dependent Events
* Counting Principle
#### **4. Exprs, Equations & Functions (Orange Quadrant - Top Left)**
* **Core Category:** Exprs, Equations & Functions
* **Subcategories & Specific Topics:**
* **Function Concepts**
* Interpret Functions
* Domain & Range of Functions
* Linear Functions
* Nonlinear Functions
* **Exprs**
* Equivalent Exprs
* Radical Exprs
* Variable Exprs
* **Equations**
* Linear Equations
* Systems of Equations
* **Inequalities**
### Key Observations
1. **Hierarchical Structure:** The diagram clearly shows a three-tiered hierarchy: Core Domain -> Major Topic Area -> Specific Skill/Concept.
2. **Topic Density:** The "Calculate" and "Geometry" domains appear to contain the highest number of discrete, specific topics in the outer ring.
3. **Conceptual Grouping:** Related concepts are clustered together. For example, all basic arithmetic operations are under "Basic Calculate," and all shape-related topics are under "Two-Dim Figures."
4. **Visual Organization:** The use of color and radial segmentation effectively isolates each major domain while showing they are part of a unified whole (the complete mathematics curriculum).
### Interpretation
This diagram serves as a comprehensive visual taxonomy of a mathematics curriculum, likely for middle school or early high school level. It demonstrates how broad mathematical disciplines (Calculation, Geometry, Statistics/Probability, Algebra) are broken down into teachable and assessable components.
The structure suggests a pedagogical approach that values both foundational skills (e.g., "Add & Subtract," "Multiply") and higher-order conceptual understanding (e.g., "Interpret Functions," "Make Predictions"). The inclusion of practical applications like "Financial Literacy" and "Consumer Math" indicates an emphasis on real-world relevance.
The radial, interconnected design implies that these domains are not isolated silos but are interrelated. For instance, "Slope" (under Geometry) is fundamentally connected to "Linear Functions" (under Exprs, Equations & Functions). The chart effectively communicates the breadth and depth of a standard mathematics education framework, making it a useful tool for curriculum planning, student guidance, or educational standards mapping.
</details>
(b) English Middle (Middle-EN)
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Hierarchical Taxonomy of Mathematical Problem Types (Sunburst Chart)
### Overview
The image displays a multi-level sunburst chart (a radial treemap) that visually organizes a comprehensive taxonomy of mathematical problem types, likely for educational curriculum or problem-solving classification. The chart is structured in concentric rings, moving from broad categories at the center to specific problem types at the outer edge. All text is in Chinese.
### Components/Axes
* **Chart Type:** Sunburst Chart / Radial Treemap.
* **Structure:** Three visible hierarchical levels radiating from the center.
* **Color Coding:** Each primary category and its descendants share a distinct color family.
* **Orange:** 应用 (Application)
* **Yellow:** 度量与统计 (Measurement and Statistics)
* **Blue:** 几何 (Geometry)
* **Green:** 数与代数 (Number and Algebra)
* **Spatial Layout:** The chart is divided into four main quadrants corresponding to the primary categories. The legend (color key) is inherent in the chart's structure, with each colored segment acting as its own label.
### Detailed Analysis
The following is a complete hierarchical extraction of all textual labels, proceeding clockwise from the top.
**1. Primary Category: 应用 (Application) - Orange Segment (Top-Left Quadrant)**
* **Second Level (Subcategories):**
* 基础问题 (Basic Problems)
* 经典问题 (Classic Problems)
* **Third Level (Specific Problem Types under 基础问题):**
* 和差倍问题 (Sum-Difference-Multiple Problems)
* 基础问题 (Basic Problems) *[Note: Appears as both a subcategory and a specific type]*
* 相遇问题 (Meeting Problems)
* 行程问题 (Travel/Distance Problems)
* 追击问题 (Pursuit Problems)
* 面积问题 (Area Problems)
* 浓度问题 (Concentration Problems)
* 人民币问题 (RMB/Money Problems)
* 时间问题 (Time Problems)
* 温度问题 (Temperature Problems)
* 规律 (Patterns)
* **Third Level (Specific Problem Types under 经典问题):**
* 归一问题 (Unitary Method Problems)
* 归总问题 (Total Method Problems)
* 差倍问题 (Difference-Multiple Problems)
* 周期问题 (Periodic Problems)
* 植树问题 (Tree Planting Problems)
* 折扣问题 (Discount Problems)
* 页码问题 (Page Number Problems)
* 税率问题 (Tax Rate Problems)
* 年龄问题 (Age Problems)
* 利息问题 (Interest Problems)
* 鸡兔同笼问题 (Chicken-Rabbit in the Same Cage Problems)
* 还原问题 (Restoration/Reverse Problems)
* 工程问题 (Work/Engineering Problems)
* 对折问题 (Folding Problems)
**2. Primary Category: 度量与统计 (Measurement and Statistics) - Yellow Segment (Top-Right Quadrant)**
* **Second Level (Subcategories):**
* 度量 (Measurement)
* 统计 (Statistics)
* **Third Level (Specific Problem Types under 度量):**
* *[No further subdivision visible in this segment]*
* **Third Level (Specific Problem Types under 统计):**
* 排列组合 (Permutations and Combinations)
* 统计指标 (Statistical Indicators)
**3. Primary Category: 几何 (Geometry) - Blue Segment (Bottom-Right Quadrant)**
* **Second Level (Subcategories):**
* 立体图形 (Solid Figures)
* 平面图形 (Plane Figures)
* **Third Level (Specific Problem Types under 立体图形):**
* 圆柱 (Cylinder)
* 长方体 (Cuboid)
* 正方体 (Cube)
* 立体图形综合 (Solid Figures Comprehensive)
* **Third Level (Specific Problem Types under 平面图形):**
* 角 (Angles)
* 平行四边形 (Parallelogram)
* 三角形 (Triangle)
* 梯形 (Trapezoid)
* 圆 (Circle)
* 长方形 (Rectangle)
* 正方形 (Square)
* 平面图形综合 (Plane Figures Comprehensive)
**4. Primary Category: 数与代数 (Number and Algebra) - Green Segment (Bottom-Left Quadrant)**
* **Second Level (Subcategories):**
* 因数与倍数 (Factors and Multiples)
* 分数运算 (Fraction Operations)
* 比与比例 (Ratio and Proportion)
* 基础运算 (Basic Operations)
* **Third Level (Specific Problem Types under 因数与倍数):**
* 因数与倍数综合 (Factors and Multiples Comprehensive)
* 质数问题 (Prime Number Problems)
* 因数问题 (Factor Problems)
* 公约数问题 (Common Divisor Problems)
* 公倍数问题 (Common Multiple Problems)
* **Third Level (Specific Problem Types under 分数运算):**
* 最简分数 (Simplest Fraction)
* 分数运算 (Fraction Operations) *[Note: Appears as both a subcategory and a specific type]*
* 分数与小数 (Fractions and Decimals)
* 分数应用 (Fraction Applications)
* **Third Level (Specific Problem Types under 比与比例):**
* 概率问题 (Probability Problems)
* 比例问题 (Proportion Problems)
* 倍数问题 (Multiple Problems)
* 百分率问题 (Percentage Problems)
* **Third Level (Specific Problem Types under 基础运算):**
* 四则运算 (Four Arithmetic Operations)
* 方程问题 (Equation Problems)
* 定义新运算 (Defining New Operations)
* 倒数问题 (Reciprocal Problems)
* 除法问题 (Division Problems)
* 乘法问题 (Multiplication Problems)
### Key Observations
1. **Asymmetrical Distribution:** The "应用" (Application) category is the most extensive, containing the highest number of specific problem types (over 25), suggesting a strong emphasis on contextual, word-based problem solving.
2. **Hierarchical Depth:** Most categories follow a consistent three-level hierarchy (Category -> Subcategory -> Specific Problem Type), though "度量" (Measurement) under "度量与统计" shows no further subdivision in this visualization.
3. **Terminology Consistency:** Some terms, like "基础问题" (Basic Problems) and "分数运算" (Fraction Operations), appear at both the subcategory and specific problem type levels, indicating they serve as both a class and an exemplar.
4. **Comprehensive Scope:** The taxonomy covers a wide spectrum of elementary to middle school mathematics, from basic arithmetic ("四则运算") to combinatorial thinking ("排列组合") and geometric analysis.
### Interpretation
This diagram serves as a **conceptual map or knowledge graph for mathematical pedagogy**. It is not a data chart plotting numerical values but a structural representation of a domain of knowledge.
* **Purpose:** It likely functions as a curriculum framework, a guide for textbook organization, or a taxonomy for tagging and categorizing math problems for practice systems. The sunburst format effectively shows both the breadth (angular size of segments) and depth (number of rings) of each topic area.
* **Relationships:** The radial flow from center to edge illustrates a "general-to-specific" relationship. The color coding creates immediate visual association between a specific problem type (e.g., "鸡兔同笼问题") and its parent category ("应用").
* **Notable Insight:** The significant visual weight given to "应用" (Application) underscores a pedagogical philosophy that prioritizes the application of mathematical concepts to solve concrete, real-world-style problems. In contrast, "度量与统计" (Measurement and Statistics) is the smallest primary segment, which might reflect its relative weight in the specific curriculum this chart represents.
* **Underlying Structure:** The chart reveals an implicit classification system. Problems are grouped not just by mathematical field (Geometry, Algebra) but also by problem-solving strategy (e.g., "还原问题" - Restoration Problems) and real-world context (e.g., "税率问题" - Tax Rate Problems). This multidimensional grouping is efficiently captured in the hierarchical, radial layout.
</details>
(c) Chinese Elementary (Elementary-ZH)
<details>
<summary>x5.png Details</summary>

### Visual Description
## Sunburst Chart: Hierarchical Mathematics Curriculum Map
### Overview
The image displays a multi-level sunburst chart (a radial hierarchical diagram) that visually organizes a comprehensive mathematics curriculum, likely for middle school or early high school level. The chart is structured in concentric rings, with the innermost ring representing broad mathematical domains, the middle ring representing major topic areas, and the outermost ring representing specific sub-topics or skills. All text is in Chinese. The chart uses a distinct color-coding scheme to group related domains.
### Components/Axes
* **Chart Type:** Sunburst Chart (Radial Treemap).
* **Structure:** Three concentric rings radiating from a central point.
* **Color-Coding:** Each major domain in the innermost ring has a unique color, which is maintained in varying shades for its sub-topics in the outer rings.
* **Text Orientation:** Labels are oriented radially, following the curve of their respective segments.
* **Spatial Layout:** The chart is divided into five primary colored sectors, each representing a core mathematical domain.
### Detailed Analysis
The hierarchical content is extracted below, moving from the center outward. The primary language is Chinese, with English translations provided.
#### **Innermost Ring (Core Domains)**
1. **数与式 (Numbers and Expressions)** - Dark Green sector (approx. 7 o'clock to 11 o'clock position).
2. **几何 (Geometry)** - Blue sector (approx. 4 o'clock to 7 o'clock position).
3. **函数 (Functions)** - Orange sector (approx. 1 o'clock to 4 o'clock position).
4. **方程与不等式 (Equations and Inequalities)** - Pink/Red sector (approx. 11 o'clock to 1 o'clock position).
5. **统计与概率 (Statistics and Probability)** - Yellow sector (approx. 11 o'clock, adjacent to the pink sector).
#### **Middle Ring (Major Topic Areas)**
* **Under 数与式 (Numbers and Expressions):**
* **根式 (Radicals)** - Darkest green segment.
* **整式 (Polynomials)** - Medium-dark green segment.
* **代数式 (Algebraic Expressions)** - Medium green segment.
* **分式 (Fractions/Rational Expressions)** - Light green segment.
* **无理数 (Irrational Numbers)** - Lighter green segment.
* **因式 (Factoring)** - Lightest green segment.
* **应用 (Applications)** - Pale green segment.
* **Under 几何 (Geometry):**
* **圆 (Circles)** - Dark blue segment.
* **三角形 (Triangles)** - Medium blue segment.
* **四边形 (Quadrilaterals)** - Light blue segment.
* **立体图形 (3D Shapes/Solid Figures)** - Lightest blue segment.
* **Under 函数 (Functions):**
* **一次函数 (Linear Functions)** - Dark orange segment.
* **反比例函数 (Inverse Proportional Functions)** - Medium orange segment.
* **二次函数 (Quadratic Functions)** - Light orange segment.
* **平面直角坐标系 (Cartesian Coordinate System)** - Lightest orange segment.
* **Under 方程与不等式 (Equations and Inequalities):**
* **一元一次方程 (Linear Equations in One Variable)** - Dark pink segment.
* **一元二次方程 (Quadratic Equations in One Variable)** - Medium pink segment.
* **分式方程 (Fractional Equations)** - Light pink segment.
* **不等式与不等式组 (Inequalities and Systems of Inequalities)** - Lightest pink segment.
* **Under 统计与概率 (Statistics and Probability):**
* **数据分析 (Data Analysis)** - Dark yellow segment.
* **概率 (Probability)** - Light yellow segment.
#### **Outermost Ring (Specific Sub-Topics/Skills)**
* **Under 根式 (Radicals):**
* 二次根式的运算 (Operations with Square Roots)
* 立方根 (Cube Roots)
* 平方根与算术平方根 (Square Roots and Arithmetic Square Roots)
* 同类二次根式 (Like Square Root Terms)
* **Under 整式 (Polynomials):**
* 乘法公式 (Multiplication Formulas)
* 整式的乘除及混合 (Multiplication, Division, and Mixed Operations of Polynomials)
* 整式的加减 (Addition and Subtraction of Polynomials)
* **Under 代数式 (Algebraic Expressions):**
* 代数式求值 (Evaluating Algebraic Expressions)
* 同类项 (Like Terms)
* **Under 分式 (Fractions/Rational Expressions):**
* 约分与通分 (Simplification and Finding Common Denominators)
* 指数幂 (Exponential Powers)
* **Under 无理数 (Irrational Numbers):**
* 判断无理数 (Identifying Irrational Numbers)
* **Under 因式 (Factoring):**
* 十字相乘法 (Cross Multiplication Method)
* 提公因式 (Factoring out the Common Factor)
* **Under 应用 (Applications):**
* 鸽巢问题 (Pigeonhole Principle)
* 流水问题 (Stream/Current Problems)
* **Under 圆 (Circles):**
* 垂径定理 (Perpendicular Chord Theorem)
* 点线圆位置关系 (Positional Relationships between Points, Lines, and Circles)
* 弧长和扇形面积 (Arc Length and Sector Area)
* 圆心角 (Central Angles)
* 圆周角 (Inscribed Angles)
* 正多边形和圆 (Regular Polygons and Circles)
* **Under 三角形 (Triangles):**
* 等边三角形 (Equilateral Triangles)
* 等腰三角形 (Isosceles Triangles)
* 勾股定理 (Pythagorean Theorem)
* 全等三角形 (Congruent Triangles)
* **Under 四边形 (Quadrilaterals):**
* 平行四边形 (Parallelograms)
* 梯形 (Trapezoids)
* **Under 立体图形 (3D Shapes/Solid Figures):**
* 圆锥 (Cones)
* **Under 一次函数 (Linear Functions):**
* 求一次函数解析式 (Finding the Analytic Expression of a Linear Function)
* 函数与二元一次方程组 (Functions and Systems of Linear Equations in Two Variables)
* 函数与一元一次不等式 (Functions and Linear Inequalities in One Variable)
* 函数与一元一次方程 (Functions and Linear Equations in One Variable)
* 正比例函数 (Direct Proportional Functions)
* **Under 反比例函数 (Inverse Proportional Functions):**
* 反比例函数的应用 (Applications of Inverse Proportional Functions)
* 反比例函数的性质 (Properties of Inverse Proportional Functions)
* 反比例函数的定义 (Definition of Inverse Proportional Functions)
* **Under 二次函数 (Quadratic Functions):**
* 抛物线的性质 (Properties of Parabolas)
* 二次函数的应用 (Applications of Quadratic Functions)
* **Under 平面直角坐标系 (Cartesian Coordinate System):**
* 有序数对 (Ordered Pairs)
* 点所在象限 (Quadrants of Points)
* **Under 一元一次方程 (Linear Equations in One Variable):**
* 解一元一次方程 (Solving Linear Equations in One Variable)
* 一元一次方程的应用 (Applications of Linear Equations in One Variable)
* **Under 一元二次方程 (Quadratic Equations in One Variable):**
* 解一元二次方程 (Solving Quadratic Equations in One Variable)
* 一元二次方程的应用 (Applications of Quadratic Equations in One Variable)
* **Under 分式方程 (Fractional Equations):**
* 解分式方程 (Solving Fractional Equations)
* **Under 不等式与不等式组 (Inequalities and Systems of Inequalities):**
* 一元一次不等式的应用 (Applications of Linear Inequalities in One Variable)
* 一元一次不等式组 (Systems of Linear Inequalities in One Variable)
* 解一元一次不等式组 (Solving Systems of Linear Inequalities in One Variable)
* 解一元一次不等式 (Solving Linear Inequalities in One Variable)
* **Under 数据分析 (Data Analysis):**
* 数据的集中趋势 (Central Tendency of Data)
* 数据的波动趋势 (Dispersion/Variation of Data)
* **Under 概率 (Probability):**
* 随机事件与概率 (Random Events and Probability)
* 求概率 (Calculating Probability)
* 概率的应用 (Applications of Probability)
### Key Observations
1. **Hierarchical Depth:** The curriculum is organized into three clear levels of granularity: Domains > Topics > Specific Skills.
2. **Color Consistency:** Each domain's color theme is consistently applied across all its sub-topics, aiding visual grouping.
3. **Content Balance:** The "Numbers and Expressions" and "Geometry" domains appear to have the largest number of specific sub-topics in the outermost ring, suggesting a significant focus on foundational algebraic manipulation and geometric properties.
4. **Skill Progression:** The outermost ring often lists skills in a logical order, such as moving from definitions to properties to applications (e.g., under "Inverse Proportional Functions").
5. **Applied Focus:** Several sub-topics explicitly mention "应用" (Applications), indicating an emphasis on applying theoretical concepts to solve problems (e.g., Applications of Quadratic Functions, Applications of Probability).
### Interpretation
This sunburst chart serves as a comprehensive visual taxonomy of a secondary mathematics curriculum. It effectively communicates the structure and scope of the subject matter.
* **Relationships:** The radial hierarchy clearly shows dependency and belonging. For instance, understanding "Parabola Properties" is a sub-skill within "Quadratic Functions," which itself is a major topic under the core domain of "Functions." This helps educators and students see how discrete skills fit into the larger mathematical landscape.
* **Curriculum Design:** The chart reveals a curriculum that balances pure mathematics (e.g., "Properties of Parabolas") with applied problem-solving (e.g., "Work/Flow Problems"). The inclusion of the "Pigeonhole Principle" under "Applications" suggests an intent to introduce logical reasoning and combinatorial thinking.
* **Learning Pathway:** While not explicitly sequential, the structure implies a potential learning pathway from the center outward: master the core domains, then the major topics within them, and finally the specific skills. The grouping of related skills (like all the "Functions and..." topics under Linear Functions) highlights conceptual clusters that are taught together.
* **Utility:** This diagram is a powerful tool for curriculum mapping, textbook design, or student review. It allows one to quickly identify all components of a subject area, check for coverage, and understand the relationships between different mathematical concepts. The absence of numerical data means its value is purely structural and organizational, not quantitative.
</details>
(d) Chinese Middle (Middle-ZH)
Figure 2: Diagram overview of the four concept systems in ConceptMath. Chinese concept names are provided with English translations (see Appendix A).
## 2 ConceptMath
ConceptMath is the first bilingual, concept-wise benchmark for measuring mathematical reasoning. In this section, we describe the design principles, the data collection process, dataset statistics, and an efficient fine-tuning strategy to enhance the weaknesses identified by ConceptMath.
### 2.1 Design Principle
We created ConceptMath based on the following two high-level design principles:
#### Concept-wise Hierarchical System.
The primary goal of ConceptMath is to evaluate the mathematical reasoning capabilities of language models at different levels of granularity. Therefore, ConceptMath organizes math problems within a three-level hierarchy of mathematical concepts (Fig. 2). This design enables concept-wise evaluation of mathematical reasoning and makes targeted, effective improvements possible.
#### Bilingualism.
Most current mathematical benchmarks focus solely on English, leaving multilingual mathematical reasoning largely unexplored. As an early effort in this direction, we evaluate mathematical reasoning in two languages: English and Chinese. Moreover, since cultures and educational systems vary across languages, common math concepts can differ substantially. We therefore carefully collect concepts in both languages instead of merely translating from one language to the other. For example, measurement units (e.g., currency, length) differ between English and Chinese contexts.
### 2.2 Data Collection
For data collection, we take a two-step approach to operationalize these design principles: first, we recruit experts to delineate a hierarchy of math concepts based on the respective education systems; second, we collect problems for each concept from various sources or design them manually, followed by quality assessment and data cleaning.
#### Math Concept System Construction.
Since the education systems provide a natural hierarchy of math concepts, we recruited four teachers from elementary and middle schools, specializing in either English or Chinese, to organize a hierarchy of math concepts for different education systems. This leads to four concept systems: Elementary-EN, Middle-EN, Elementary-ZH, and Middle-ZH, with each system consisting of a three-level hierarchy of around 50 atomic math concepts (Fig. 2).
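Such a three-level hierarchy lends itself naturally to nested mappings. The sketch below is a minimal illustration (the concept names are a small, illustrative subset drawn from the Middle-ZH system in Fig. 2, not the full system) showing how atomic concepts can be enumerated from the hierarchy:

```python
# A three-level concept hierarchy: domain -> topic -> list of atomic concepts.
# The names below are an illustrative subset of the Middle-ZH system.
middle_zh = {
    "Statistics and Probability": {
        "Data Analysis": ["Central Tendency of Data", "Dispersion of Data"],
        "Probability": ["Random Events and Probability",
                        "Calculating Probability",
                        "Applications of Probability"],
    },
    "Equations and Inequalities": {
        "Fractional Equations": ["Solving Fractional Equations"],
    },
}

def atomic_concepts(system):
    """Flatten a three-level hierarchy into its leaf (atomic) concepts."""
    return [leaf
            for topics in system.values()
            for leaves in topics.values()
            for leaf in leaves]

print(len(atomic_concepts(middle_zh)))  # 6 leaf concepts in this toy subset
```

In the full benchmark, each of the four systems contains around 50 such leaf concepts.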
#### Math Problem Construction.
Then we conducted thorough data acquisition from various sources (including educational websites, textbooks, and search engines queried with specific concepts) to collect math word problems (both questions and answers) for each math concept. To guarantee balance across concepts, approximately 20 problems were gathered for each math concept. Following this, both GPT-4 (OpenAI, 2023) and human experts verified and rectified the categorization and solution of each problem. However, we observed that for some concepts the problem count fell significantly below 20; manual efforts were therefore undertaken to augment these categories, ensuring a consistent 20 problems per concept. Furthermore, to broaden the diversity of the dataset and minimize the risk of data contamination, all gathered problems were paraphrased using GPT-4. The collection and annotation processes were carried out by a team of six members, each holding a university degree in an engineering discipline, to maintain a high level of technical expertise.
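The balance check described above (flagging concepts that fall short of the 20-problem target so they can be manually augmented) can be sketched as follows; the concept labels and counts are hypothetical:

```python
from collections import Counter

# Each collected problem is tagged with its concept label. A balance check
# flags concepts that still need manual augmentation to reach the target.
TARGET = 20

def underfilled(concept_labels, target=TARGET):
    """Return {concept: count} for concepts below the per-concept target."""
    counts = Counter(concept_labels)
    return {c: n for c, n in counts.items() if n < target}

labels = ["Circles"] * 20 + ["Cones"] * 12  # hypothetical collection state
print(underfilled(labels))  # {'Cones': 12}
```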
### 2.3 Dataset Statistics
#### Comparison to Existing Datasets.
As shown in Table 1, ConceptMath differs from related datasets in several aspects: (1) ConceptMath is the first dataset to study fine-grained mathematical concepts, encompassing 4 systems, 214 math concepts, and 4011 math word problems. (2) Problems in ConceptMath are carefully annotated based on the mainstream education systems for English (EN) and Chinese (ZH).
#### Details on the Hierarchical System.
Apart from Fig. 2, we also provide full details of the hierarchical system in Appendix A.
#### Length Distribution.
Fig. 3 shows the length distribution of ConceptMath, where the number of tokens is reported (we use the "cl100k_base" tokenizer from https://github.com/openai/tiktoken). The minimum, average, and maximum token counts of the questions are 4, 41, and 309, respectively, which indicates their lexical richness.
| Benchmark | Language | Fine-grained | Size |
| --- | --- | --- | --- |
| GSM8K | EN | ✗ | 1319 |
| MATH | EN | ✗ | 5000 |
| TabMWP | EN | ✗ | 7686 |
| Dolphin18K | EN | ✗ | 1504 |
| Math23K | ZH | ✗ | 1000 |
| ASDiv | EN | ✗ | 2305 |
| SVAMP | EN | ✗ | 300 |
| SingleOp | EN | ✗ | 159 |
| MMLU-Math | EN | ✗ | 906 |
| ConceptMath | EN&ZH | ✓ | 4011 |
Table 1: A comparison of our ConceptMath with some notable mathematical datasets. Note that the size is the number of samples of the test split.
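The min/average/max token statistics reported for Fig. 3 can be reproduced with a short sketch. A whitespace split stands in for the cl100k_base tokenizer here so the sketch stays dependency-free; the questions are invented examples, so the exact numbers differ from the paper's:

```python
# Length statistics over a question set. The paper uses tiktoken's
# "cl100k_base" encoding; str.split is a dependency-free stand-in.
def length_stats(questions, tokenize=str.split):
    """Return (min, mean, max) token counts over the questions."""
    lengths = [len(tokenize(q)) for q in questions]
    return min(lengths), sum(lengths) / len(lengths), max(lengths)

qs = ["Compute 3 + 4 .",
      "A cone has radius 3 and height 4 ; find its volume ."]
lo, avg, hi = length_stats(qs)
print(lo, avg, hi)  # 5 9.0 13
```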
<details>
<summary>x6.png Details</summary>

### Visual Description
## Histogram: Distribution of Question Lengths in Tokens
### Overview
The image displays a histogram chart illustrating the frequency distribution of question lengths, measured in tokens, within a dataset. The chart shows a right-skewed distribution, indicating that most questions are relatively short, with a long tail of less frequent, longer questions.
### Components/Axes
* **Chart Type:** Histogram (vertical bar chart).
* **X-Axis (Horizontal):**
* **Label:** "Question Length (# Tokens)"
* **Scale:** Linear scale from 0 to 200, with major tick marks and labels every 10 units (0, 10, 20, ..., 200). The final bin is labeled ">200".
* **Y-Axis (Vertical):**
* **Label:** "Number of Questions"
* **Scale:** Linear scale from 0 to 100, with major tick marks and labels every 20 units (0, 20, 40, 60, 80, 100).
* **Data Series:** A single series represented by light blue vertical bars. Each bar's height corresponds to the count of questions falling within a specific token-length bin.
* **Legend:** Not present, as there is only one data series.
* **Language:** All text in the chart is in English.
### Detailed Analysis
The histogram bins appear to have a width of approximately 1-2 tokens, creating a detailed view of the distribution.
* **Trend Verification:** The visual trend shows a rapid increase in frequency from 0 tokens, peaking in the 20-40 token range, followed by a steady, gradual decline as question length increases. The distribution has a long right tail extending past 200 tokens.
* **Peak/Mode:** The highest frequency (mode) occurs in the bin corresponding to approximately **25-30 tokens**. The bar at this peak reaches a height of just over **100 questions** (approx. 102-105).
* **High-Frequency Region:** A dense cluster of high bars exists between roughly **15 and 45 tokens**. Within this range, multiple bars exceed 80 questions.
* **Decline:** After the 45-token mark, the frequency shows a consistent downward trend. By 100 tokens, the count per bin drops below 10 questions.
* **Long Tail:** From 100 to 200 tokens, the bars are very short, often representing fewer than 5 questions per bin, with many bins appearing empty.
* **Final Bin (">200"):** There is a small but distinct bar at the far right for questions longer than 200 tokens. Its height is approximately **10-12 questions**.
### Key Observations
1. **Right-Skewed Distribution:** The data is not symmetric. The mass of the distribution is concentrated on the left (shorter questions), with a tail stretching to the right (longer questions).
2. **Dominant Length Range:** The vast majority of questions in this dataset are between **10 and 60 tokens** long.
3. **Rare Long Questions:** Questions exceeding 100 tokens are uncommon, and those over 200 tokens are rare outliers.
4. **Smooth Decline:** The decrease in frequency after the peak is relatively smooth, without major secondary peaks, suggesting a single underlying pattern in question composition.
### Interpretation
This histogram characterizes the length profile of a collection of questions, likely from a user query dataset, a benchmark, or a conversational AI log.
* **What the data suggests:** The distribution indicates that users or creators predominantly formulate concise questions. The peak at 25-30 tokens suggests a "sweet spot" for question complexity—long enough to be specific but short enough to be easily digestible.
* **How elements relate:** The x-axis (length) directly determines the y-axis (frequency). The shape of the curve is a fundamental property of the dataset, revealing its inherent structure. The ">200" bin acts as a catch-all for extreme outliers, preventing the x-axis from becoming excessively long.
* **Notable patterns/anomalies:**
* The smooth, unimodal shape is typical of many natural language length distributions.
* The small spike at ">200" is notable. It confirms the existence of a non-zero, albeit small, population of very complex or verbose questions that might warrant separate analysis (e.g., for testing model performance on long contexts).
* The absence of questions near 0 tokens is expected, as a meaningful question requires a minimum length.
**In summary, this chart provides a clear quantitative snapshot: the dataset is dominated by short-to-medium length questions, with a predictable and rapidly diminishing frequency for longer questions.** This information is crucial for understanding dataset bias, designing token-efficient systems, or setting appropriate context window limits for processing such questions.
</details>
Figure 3: Length distributions of our ConceptMath.
### 2.4 Efficient Fine-Tuning
Based on ConceptMath, we can identify weaknesses in the mathematical reasoning of LLMs through concept-wise evaluation. In this section, we explore a straightforward approach to enhancing mathematical ability on specific concepts: first training a concept classifier, then curating a set of samples from a large open-source math dataset. Specifically, we collect an extra 10 problems per concept and use them to train a classifier that identifies the concept class of a given question; its backbone is a pretrained bilingual LLM, with a classification head applied to the last hidden state. We then fine-tune LLMs on the curated concept-specific data combined with an existing general math dataset, which helps avoid overfitting to a relatively small dataset. More details are provided in Appendix B.
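A minimal sketch of the classification step, assuming the backbone LLM has already produced a hidden-state vector for the question (the toy 3-dimensional hidden state, weights, and concept labels below are hypothetical stand-ins; in practice the hidden state is high-dimensional and the head is trained):

```python
# Linear classification head over an LLM's last hidden state: score each
# concept as a dot product plus bias, then take the argmax label.
def classify(hidden, weights, biases, concepts):
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, biases)]
    return concepts[max(range(len(logits)), key=logits.__getitem__)]

concepts = ["Circles", "Probability", "Linear Functions"]
hidden = [0.2, -1.0, 0.5]                    # toy 3-dim "hidden state"
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # toy head weights
biases = [0.0, 0.0, 0.0]
print(classify(hidden, weights, biases, concepts))  # Linear Functions
```

The predicted concept label is then used to route candidate training samples from the large open-source math dataset into per-concept buckets.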
## 3 Experiments
In this section, we perform extensive experiments to demonstrate the effect of our ConceptMath.
### 3.1 Experimental Setup
#### Evaluated Models.
We assess the mathematical reasoning of existing advanced LLMs on ConceptMath, including 2 closed-source LLMs (i.e., GPT-3.5/GPT-4 (OpenAI, 2023)) and 17 open-source LLMs (i.e., WizardMath-13B (Luo et al., 2023), MetaMath-13B (Yu et al., 2023), MAmmoTH-13B (Yue et al., 2023), Qwen-14B/72B (Bai et al., 2023b), Baichuan2-13B (Baichuan, 2023), ChatGLM3-6B (Du et al., 2022), InternLM2-7B/20B (Team, 2023a), InternLM2-Math-7B/20B (Ying et al., 2024), LLaMA2-7B/13B/70B (Touvron et al., 2023b), Yi-6B/34B (Team, 2023b), and DeepSeekMath-7B (Shao et al., 2024)). Note that WizardMath-13B, MetaMath-13B, and MAmmoTH-13B are specialized math language models fine-tuned from LLaMA2; InternLM2-Math and DeepSeekMath-7B are specialized math language models fine-tuned from their corresponding base models. More details of these evaluated models can be found in Appendix C.
| Model | Elementary-EN ZS | Elementary-EN ZS-CoT | Elementary-EN FS | Middle-EN ZS | Middle-EN ZS-CoT | Middle-EN FS | Elementary-ZH ZS | Elementary-ZH ZS-CoT | Elementary-ZH FS | Middle-ZH ZS | Middle-ZH ZS-CoT | Middle-ZH FS | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yi-6B | 67.94 | 67.56 | 59.03 | 65.55 | 64.59 | 56.05 | 34.33 | 31.91 | 37.86 | 36.46 | 36.19 | 36.46 | 49.49 |
| ChatGLM3-6B | 60.69 | 63.10 | 53.18 | 51.25 | 60.17 | 51.34 | 46.23 | 43.63 | 40.74 | 44.77 | 43.32 | 40.43 | 49.90 |
| DeepSeekMath-7B | 66.92 | 77.35 | 73.92 | 56.53 | 69.87 | 66.31 | 60.47 | 62.33 | 64.19 | 56.50 | 56.95 | 56.86 | 64.02 |
| InternLM2-Math-7B | 71.12 | 72.01 | 69.59 | 63.44 | 62.96 | 63.05 | 57.30 | 58.23 | 58.60 | 53.79 | 53.16 | 53.88 | 61.43 |
| InternLM2-7B | 68.83 | 69.97 | 66.67 | 37.04 | 65.83 | 55.47 | 47.63 | 49.02 | 53.02 | 45.22 | 45.40 | 44.86 | 54.08 |
| LLaMA2-7B | 36.51 | 42.62 | 38.68 | 34.26 | 39.16 | 33.69 | 15.72 | 17.67 | 17.58 | 30.87 | 32.22 | 27.80 | 30.57 |
| MAmmoTH-13B | 61.32 | 52.42 | 56.49 | 53.93 | 45.20 | 48.08 | 22.33 | 33.30 | 23.81 | 27.98 | 43.05 | 29.15 | 41.42 |
| WizardMath-13B | 41.73 | 44.78 | 34.99 | 36.85 | 37.72 | 45.11 | 10.51 | 11.26 | 18.70 | 12.36 | 15.52 | 22.92 | 27.70 |
| MetaMath-13B | 54.45 | 51.78 | 47.96 | 44.24 | 43.47 | 47.50 | 11.44 | 17.30 | 27.53 | 21.21 | 26.08 | 29.60 | 35.21 |
| Baichuan2-13B | 68.83 | 68.58 | 54.07 | 67.66 | 69.67 | 40.40 | 57.02 | 58.23 | 22.05 | 55.05 | 55.32 | 26.90 | 53.65 |
| LLaMA2-13B | 44.02 | 49.75 | 47.07 | 44.72 | 46.45 | 43.09 | 20.19 | 24.19 | 22.14 | 33.30 | 35.38 | 26.17 | 36.37 |
| Qwen-14B | 46.95 | 65.78 | 72.65 | 38.48 | 59.60 | 67.85 | 28.09 | 65.12 | 64.47 | 22.92 | 58.30 | 62.09 | 54.36 |
| InternLM2-Math-20B | 74.05 | 75.32 | 73.41 | 64.11 | 71.21 | 70.83 | 62.98 | 61.95 | 61.77 | 55.14 | 55.78 | 56.86 | 65.28 |
| InternLM2-20B | 53.31 | 72.52 | 73.28 | 45.11 | 67.47 | 56.72 | 48.19 | 55.53 | 59.81 | 45.13 | 50.63 | 56.68 | 57.03 |
| Yi-34B | 74.68 | 73.66 | 56.36 | 72.26 | 74.66 | 65.83 | 50.05 | 51.16 | 38.79 | 45.40 | 43.95 | 40.97 | 57.31 |
| LLaMA2-70B | 56.11 | 60.31 | 30.53 | 58.06 | 60.94 | 31.67 | 28.65 | 26.70 | 24.37 | 37.64 | 34.30 | 28.43 | 39.81 |
| Qwen-72B | 77.10 | 75.06 | 77.23 | 74.66 | 69.87 | 73.99 | 71.16 | 68.65 | 61.86 | 71.30 | 65.43 | 62.45 | 70.73 |
| GPT-3.5 | 85.75 | 92.37 | 84.35 | 83.88 | 90.12 | 82.73 | 56.47 | 53.21 | 56.93 | 51.90 | 53.52 | 55.69 | 70.58 |
| GPT-4 | 86.77 | 90.20 | 89.57 | 84.26 | 89.83 | 88.68 | 67.91 | 72.28 | 72.00 | 63.81 | 64.26 | 66.61 | 78.02 |
| Avg. | 63.00 | 66.59 | 61.00 | 56.65 | 62.57 | 57.28 | 41.93 | 45.35 | 43.49 | 42.67 | 45.72 | 43.41 | 52.47 |
Table 2: Results of different models on our ConceptMath benchmark. Note that "ZS", "ZS-CoT", and "FS" denote "zero-shot", "zero-shot with chain-of-thought", and "few-shot" prompting, respectively. Models are grouped roughly by model size.
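As a sanity check on the table, the Avg. column is the mean of a model's twelve setting-level accuracies (three prompting settings across four systems); using the Yi-6B row:

```python
# Recompute the Avg. column of Table 2 from the twelve setting-level
# accuracies of the Yi-6B row (ZS / ZS-CoT / FS over the four systems).
yi_6b = [67.94, 67.56, 59.03,   # Elementary-EN
         65.55, 64.59, 56.05,   # Middle-EN
         34.33, 31.91, 37.86,   # Elementary-ZH
         36.46, 36.19, 36.46]   # Middle-ZH
avg = round(sum(yi_6b) / len(yi_6b), 2)
print(avg)  # 49.49, matching the table
```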
#### Evaluation Settings.
We employ three distinct evaluation settings: zero-shot, zero-shot with chain-of-thought (CoT), and few-shot prompting. Zero-shot prompting assesses the models' intrinsic problem-solving abilities without any prior examples. Zero-shot CoT prompting evaluates the models' ability to employ a logical chain of thought. In the few-shot setting, the model is provided with a fixed 5-shot prompt for each system (see Appendix E), consisting of five newly created examples with concise ground-truth targets; this measures in-context learning ability. Besides, following MATH (Hendrycks et al., 2021b), all questions and answers in ConceptMath have been carefully curated, and each problem is evaluated by exact match. Moreover, greedy decoding with a temperature of 0 is used.
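A minimal sketch of such an exact-match check. The normalization rule here (trim whitespace, drop a trailing period, unify number formatting) is an illustrative assumption; the paper specifies exact match but not a particular normalization:

```python
# Exact-match scoring with a simple, assumed normalization step.
def normalize(ans: str) -> str:
    ans = ans.strip().rstrip(".")
    try:
        ans = str(float(ans))  # so "3.50" and "3.5" compare equal
    except ValueError:
        pass                   # non-numeric answers are compared verbatim
    return ans

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

print(exact_match("3.50", "3.5"))  # True
print(exact_match("1/2", "0.5"))   # False: fractions are not normalized here
```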
### 3.2 Results
#### Overall Accuracy.
We present the overall accuracies of different LLMs on our ConceptMath benchmark under various prompt settings in Table 2, and analyze the mathematical abilities of these LLMs in both English and Chinese in Fig. 4. Our analysis led to the following key findings: (1) GPT-3.5/4 showcases the most advanced mathematical reasoning abilities among LLMs in both the English and Chinese systems, and the leading open-source model, Qwen-72B, achieves performance comparable to GPT-3.5. (2) For most existing LLMs, scores on the Chinese systems are substantially lower than on the English systems; for example, GPT-4 scores 63.81 on Middle-ZH versus 84.26 on Middle-EN. (3) Several models fine-tuned from LLaMA2-13B (e.g., WizardMath-13B, MetaMath-13B) achieve slight improvements on the English systems but perform far below LLaMA2-13B on the Chinese systems, which indicates that domain-specific fine-tuning may degrade the generalization abilities of LLMs. (4) The mathematical models obtained by continued pretraining on large-scale math-related data (>=100B tokens), i.e., InternLM2-Math-7B/20B and DeepSeekMath-7B, show substantial improvements over models of similar size, which indicates that large-scale pretraining is effective for improving mathematical reasoning abilities.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Mean Accuracy of Language Models by Language
### Overview
This image is a line chart comparing the mean accuracy of various large language models (LLMs) across three categories: English, Chinese, and an Overall score. The chart plots model names on the x-axis against a "Mean Accuracy" score (ranging from 20 to 90) on the y-axis. The data suggests a general upward trend in model performance from left to right, with English performance consistently outperforming Chinese performance for most models.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Y-Axis:**
* **Title:** "Mean Accuracy"
* **Scale:** Linear, from 20 to 90, with major gridlines at intervals of 10.
* **X-Axis:**
* **Labels (from left to right):** WizardMath-13B, LLaMA2-7B, MetaMath-13B, LLaMA2-13B, LLaMA2-70B, MAmmoTH-13B, Yi-6B, ChatGLM3-6B, Baichuan2-13B, InternLM2-7B, Qwen-14B, InternLM2-20B, Yi-34B, InternLM2-Math-7B, DeepSeekMath-7B, InternLM2-Math-20B, GPT-3.5, Qwen-72B, GPT-4.
* **Legend (Top-Left Corner):**
* **Blue dashed line with circle markers:** "English"
* **Green dashed line with circle markers:** "Chinese"
* **Black solid line with circle markers:** "Overall"
### Detailed Analysis
The following table reconstructs the approximate data points for each model, based on visual inspection of the chart. Values are estimated to the nearest integer.
| Model Name | English Accuracy (Blue) | Chinese Accuracy (Green) | Overall Accuracy (Black) |
| :--- | :--- | :--- | :--- |
| WizardMath-13B | ~40 | ~22 | ~28 |
| LLaMA2-7B | ~38 | ~27 | ~31 |
| MetaMath-13B | ~50 | ~28 | ~35 |
| LLaMA2-13B | ~47 | ~31 | ~36 |
| LLaMA2-70B | ~51 | ~34 | ~40 |
| MAmmoTH-13B | ~56 | ~34 | ~41 |
| Yi-6B | ~65 | ~42 | ~50 |
| ChatGLM3-6B | ~57 | ~46 | ~50 |
| Baichuan2-13B | ~65 | ~48 | ~54 |
| InternLM2-7B | ~61 | ~51 | ~54 |
| Qwen-14B | ~56 | ~54 | ~55 |
| InternLM2-20B | ~61 | ~55 | ~57 |
| Yi-34B | ~69 | ~51 | ~58 |
| InternLM2-Math-7B | ~69 | ~58 | ~61 |
| DeepSeekMath-7B | ~69 | ~62 | ~64 |
| InternLM2-Math-20B | ~72 | ~62 | ~65 |
| GPT-3.5 | ~86 | ~63 | ~70 |
| Qwen-72B | ~76 | ~68 | ~71 |
| GPT-4 | ~88 | ~73 | ~78 |
**Trend Verification:**
* **English (Blue Dashed Line):** Shows a general upward trend with significant volatility. Notable peaks occur at Yi-6B, Baichuan2-13B, and a very sharp increase at GPT-3.5 and GPT-4. There are dips at LLaMA2-7B, LLaMA2-13B, ChatGLM3-6B, and Qwen-14B.
* **Chinese (Green Dashed Line):** Shows a steadier, more consistent upward trend with fewer sharp fluctuations. The growth is relatively smooth from WizardMath-13B to GPT-4.
* **Overall (Black Solid Line):** Follows a smooth, consistent upward trajectory that generally lies between the English and Chinese lines, acting as an average. It shows very few dips.
### Key Observations
1. **Performance Hierarchy:** For nearly every model, the English accuracy score is higher than the Chinese score, which is in turn higher than or equal to the Overall score. The Overall line is a composite metric.
2. **Model Progression:** There is a clear, general trend of increasing accuracy from older/smaller models on the left (e.g., WizardMath-13B) to newer/larger models on the right (e.g., GPT-4).
3. **Significant Outliers:**
* **GPT-3.5** shows a dramatic spike in English accuracy (~86), far above its Chinese (~63) and Overall (~70) scores, creating the largest gap between English and Chinese performance on the chart.
* **Yi-34B** exhibits a notable dip in Chinese accuracy (~51) compared to its neighbors, while its English accuracy remains high (~69).
* **Qwen-14B** is an instance where the Chinese accuracy (~54) nearly matches the English accuracy (~56), showing one of the smallest language performance gaps.
4. **Top Performers:** GPT-4 leads in all three categories (English: ~88, Chinese: ~73, Overall: ~78). Qwen-72B and GPT-3.5 follow in the overall ranking.
### Interpretation
This chart visualizes the progression and comparative capabilities of LLMs, likely on a mathematical or reasoning benchmark given the model names (e.g., "Math"). The data demonstrates two key findings:
1. **The English Advantage:** Most models, especially Western-developed ones like GPT-3.5, show a significant performance advantage in English over Chinese. This suggests potential biases in training data or architectural optimizations favoring English-language tasks.
2. **The Closing Gap:** Newer models, particularly those from Chinese labs (e.g., Qwen-72B, InternLM2-Math-20B), show a narrowing gap between English and Chinese performance. This indicates successful efforts in improving multilingual capabilities, especially for Chinese.
3. **Overall Trend:** The consistent rise of the black "Overall" line signifies broad, across-the-board improvements in model capabilities over successive generations. The volatility in the English line suggests that gains in English performance may be less stable or more sensitive to specific model architectures and training techniques compared to the steadier improvements in Chinese performance.
The chart serves as a benchmark snapshot, highlighting both the current state-of-the-art (GPT-4) and the dynamic landscape of LLM development, where language-specific performance remains a critical differentiator.
</details>
Figure 4: Mean accuracies for English, Chinese, and overall educational systems.
#### Average Concept-wise Accuracy.
In Fig. 5 and Fig. 6, to better analyze the effectiveness of ConceptMath, we further provide the concept-wise accuracies, averaged over the evaluated models, for different mathematical concepts under zero-shot prompting on Middle-EN and Middle-ZH (see Appendix D for results on Elementary-EN and Elementary-ZH). We observe that accuracies vary substantially across concepts for existing LLMs. For example, on Middle-ZH (Fig. 6), around 18% of concepts exhibit an accuracy lower than 30%. Thus, to improve the mathematical abilities of LLMs, these concepts with large room for improvement should be given the highest priority, which further demonstrates the advantage of ConceptMath.
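The share of weak concepts (those below an accuracy threshold) can be computed directly from the concept-wise accuracies; the concept names and values below are hypothetical:

```python
# Fraction of concepts whose average accuracy falls below a threshold,
# computed over a {concept: accuracy} mapping (values are hypothetical).
def weak_fraction(concept_acc, threshold=30.0):
    weak = [c for c, a in concept_acc.items() if a < threshold]
    return len(weak) / len(concept_acc), weak

accs = {"Circles": 28.0, "Cones": 25.5, "Probability": 62.0, "Slope": 75.0}
frac, weak = weak_fraction(accs)
print(frac, weak)  # 0.5 ['Circles', 'Cones']
```

Concepts surfaced this way are exactly the ones the fine-tuning strategy of Section 2.4 targets.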
<details>
<summary>x8.png Details</summary>

### Visual Description
## Bar Chart: Mean Accuracy Across Mathematical Topics
### Overview
This image is a vertical bar chart displaying the "Mean Accuracy" scores for a wide range of mathematical topics. The chart is sorted in ascending order of accuracy, from the lowest-performing topic on the left to the highest-performing topic on the right. The data suggests a performance evaluation, likely from an assessment or learning platform, across numerous specific math skills.
### Components/Axes
* **Chart Type:** Vertical Bar Chart.
* **Y-Axis (Vertical):**
* **Label:** "Mean Accuracy"
* **Scale:** Linear scale from 0 to 80.
* **Major Grid Lines:** Horizontal dashed lines at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70, 80).
* **X-Axis (Horizontal):**
* **Label:** None explicit. The axis contains categorical labels for each bar.
* **Categories:** 62 distinct mathematical topics, listed below in order from left (lowest accuracy) to right (highest accuracy). The labels are rotated approximately 45 degrees for readability.
* **Legend:** Not present. Each bar represents a single data series ("Mean Accuracy").
* **Data Series:** A single series of blue bars. The color is a consistent medium blue across all bars.
### Detailed Analysis
Bar chart: mean concept accuracies on Middle-EN, sorted ascending from left to right. Accuracies span roughly 33% (Circle) to 83% (Estimate metric measurements); low-accuracy concepts include Radical expressions, Exponents & scientific notation, Quadratics, and compound-probability topics, while high-accuracy concepts include basic arithmetic operations, shape identification (Polygons, Trapezoids), Inequalities, and Prime or composite. Values are visual estimates from the figure.
</details>
Figure 5: Mean concept accuracies on Middle-EN.
Figure 6: Mean concept accuracies on Middle-ZH.
#### Concept-wise Accuracy.
Fig. 7 and Fig. 8 show that most existing LLMs, whether open-source or closed-source, general-purpose or math-specialized, exhibit notable differences in their concept accuracies in the zero-shot prompting setting. These disparities may stem from variations in training datasets, strategies, and model sizes, suggesting that, beyond common weaknesses, each model has its own unique areas of deficiency. For brevity, we only show a subset of models on Middle-EN and Middle-ZH. The concept accuracies on Elementary-EN and Elementary-ZH and the full results of all models can be found in Appendix D.
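The concept-wise scores behind these comparisons amount to grouping per-question correctness by concept instead of averaging globally. A minimal Python sketch (the `records` schema here is illustrative, not the benchmark's released format):

```python
from collections import defaultdict

def concept_wise_accuracy(records):
    """Compute per-concept accuracy from (concept, is_correct) pairs.

    A global average over all questions can hide concepts on which a
    model fails; grouping by concept exposes them.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for concept, is_correct in records:
        totals[concept] += 1
        correct[concept] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}

# Toy example: the 75% global average hides a 50% score on "Circle".
records = [
    ("Circle", False), ("Circle", True),
    ("Slope", True), ("Slope", True),
]
acc = concept_wise_accuracy(records)
```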
<details>
<summary>x10.png Details</summary>

Line chart: accuracy (0-100) of MetaMath-13B (blue), LLaMA2-70B (orange), and GPT-4 (green) across Middle-EN concepts. GPT-4 is almost always the top line and rarely dips below ~60%; LLaMA2-70B occupies the middle; MetaMath-13B is lowest and most volatile, falling to near 0% on some concepts (e.g., Rational & irrational numbers). All three models dip sharply on Geometric sequences and Circle. Values are visual estimates from the figure.
</details>
Figure 7: Concept accuracies on Middle-EN.
<details>
<summary>x11.png Details</summary>

Line chart: accuracy of MetaMath-13B (blue), LLaMA2-70B (orange), and GPT-4 (green) across 40 Middle-ZH concepts (e.g., Congruent Triangles, Properties of Quadratic Functions, Solving Quadratic Equations, Methods of Calculating Probability). The hierarchy GPT-4 > LLaMA2-70B > MetaMath-13B holds on almost every concept; GPT-4 mostly scores 60-95, while MetaMath-13B drops to at or near 0 on several concepts (e.g., Reduction and Reduction to a Common Denominator, Addition and Subtraction of Polynomials). Values are visual estimates from the figure.
</details>
Figure 8: Concept accuracies on Middle-ZH.
| Model | Elementary-EN | Middle-EN | Elementary-ZH | Middle-ZH | Avg. $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| Yi-6B | 5.30 / 1.73 | 5.21 / 1.37 | 0.04 / 0.20 | 0.36 / 0.35 | 2.73 / 0.91 |
| ChatGLM3-6B | 7.42 / 0.22 | 7.55 / 0.23 | 0.11 / 0.02 | 0.35 / 0.05 | 3.86 / 0.13 |
| InternLM2-Math-7B | 7.42 / 0.22 | 7.55 / 0.23 | 0.11 / 0.02 | 0.35 / 0.05 | 3.86 / 0.13 |
| InternLM2-7B | 5.36 / 1.03 | 5.27 / 0.84 | 0.01 / 0.37 | 0.33 / 0.49 | 2.74 / 0.68 |
| MAmmoTH-13B | 7.67 / 0.47 | 7.97 / 0.46 | 0.00 / 0.03 | 0.35 / 0.03 | 4.00 / 0.25 |
| WizardMath-13B | 8.41 / 0.35 | 8.23 / 0.34 | 0.00 / 0.02 | 0.55 / 0.02 | 4.30 / 0.18 |
| MetaMath-13B | 7.67 / 0.47 | 7.97 / 0.46 | 0.00 / 0.03 | 0.35 / 0.03 | 4.00 / 0.25 |
| Baichuan2-13B | 7.20 / 1.43 | 6.58 / 1.18 | 0.05 / 0.54 | 0.41 / 0.65 | 3.56 / 0.95 |
| LLaMA2-13B | 6.80 / 0.73 | 6.36 / 0.64 | 0.01 / 0.15 | 0.56 / 0.16 | 3.43 / 0.42 |
| Qwen-14B | 11.04 / 1.58 | 9.73 / 1.08 | 1.43 / 1.27 | 0.70 / 0.93 | 5.73 / 1.22 |
| InternLM2-Math-20B | 5.58 / 1.30 | 5.51 / 0.99 | 0.03 / 0.47 | 0.34 / 0.47 | 2.86 / 0.81 |
| InternLM2-20B | 7.20 / 1.43 | 6.58 / 1.18 | 0.05 / 0.54 | 0.41 / 0.65 | 3.56 / 0.95 |
| GPT-3.5 | 9.48 / - | 9.21 / - | 0.00 / - | 0.31 / - | 4.75 / - |
| GPT-4 | 8.68 / - | 8.24 / - | 0.15 / - | 0.68 / - | 4.44 / - |
Table 3: Data contamination rate of LLMs. We provide two different contamination detection methods. The values in the table represent “Rouge / Prob”. Note that the second method based on output probability distributions can only be applied to the open-source models.
### 3.3 Analysis
#### Contamination.
To determine whether a text appears in the pretraining data of an LLM, we apply two different contamination detection methods (i.e., Rouge-based and Prob-based) to ConceptMath in Table 3. Specifically, for the Rouge-based method, we feed the first 50% of each question to the model and compute the Rouge-L score between the generated continuation and the ground-truth last 50% of the text, where a lower Rouge-L score indicates a lower contamination rate. For the Prob-based method, we follow Shi et al. (2023) and use the MIN-K% probability metric, which first obtains the probability of each token in the text, then selects the K% of tokens with the minimum probabilities and computes their average log-likelihood. If the average log-likelihood is high, the text is likely in the pretraining data. Note that we choose $K=10$ in our setting. In Table 3, we observe that the contamination rates on ConceptMath are very low, which means that ConceptMath can provide a reasonable evaluation for existing LLMs.
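As a rough sketch, the MIN-K% statistic can be computed as follows, assuming per-token log-probabilities have already been extracted from a model's forward pass (the function name and interface are illustrative, not from Shi et al. (2023)):

```python
def min_k_prob(token_log_probs, k=10):
    """MIN-K% probability metric: average log-likelihood of the k%
    least-likely tokens in a text. A higher (less negative) value
    suggests the text was seen during pretraining.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Number of tokens in the bottom k%, at least one.
    n = max(1, int(len(token_log_probs) * k / 100))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n
```

In practice, a text is flagged as likely contaminated when its MIN-K% score exceeds a threshold calibrated on texts known to be outside the pretraining corpus.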
| Concept | Concept Acc | Error Type | Example |
| --- | --- | --- | --- |
| Perimeter | 26.32 | Calculation Error | Q: Find the perimeter of a 16 by 18 feet rectangle. A: The perimeter is 16 + 18 + 16 + 18 = 72 feet. |
| | | Knowledge Error | Q: The perimeter of a rectangular sticker is 14 centimeters. The sticker is 2 centimeters wide. How tall is it? A: The height equals the perimeter subtracts the width. The height of the rectangle is $14-12=12$ centimeters. |
| Polygons | 11.11 | Knowledge Error | Q: A regular polygon has an interior angle that measures 120 degrees. How many sides does this polygon have? A: Let $n$ be the number of sides of the polygon. We have that $n\times 120=360$ . Hence: $n=3$ . |
| Cylinders | 11.76 | Calculation Error | Q: How much soup can a cylinder-shaped can store, if the radius of the can is 3 inches and the height is 5 inches? A: … The volume of the can is V = $\pi\times 3^{2}\times 5=30\pi$ . |
Table 4: Error reasons of several unmastered concepts of LLaMA2-13B.
#### Unmastered Concepts.
We also highlight several unmastered concepts of LLaMA2-13B in Table 4, which shows that ConceptMath is effective in guiding further refinement of existing LLMs.
| Concept | LLaMA2 | LLaMA2 (w/ MMQA) | LLaMA2 (w/ MMQA & CS) |
| --- | --- | --- | --- |
| Cones | 0.00 | 17.65 | 23.53 |
| Spheres | 5.88 | 29.41 | 35.29 |
| Polygons | 11.11 | 61.11 | 66.67 |
| Rational Number | 11.76 | 23.53 | 52.94 |
| Cylinders | 11.76 | 35.29 | 47.06 |
| Angles | 11.76 | 47.06 | 58.82 |
| Probability | 18.75 | 25.00 | 75.00 |
| Perimeter | 26.32 | 42.11 | 63.16 |
| Volume | 27.78 | 38.89 | 66.67 |
| Proportional | 27.78 | 33.33 | 44.44 |
| Avg Acc. (over 10 concepts) | 15.29 | 36.88 | 53.36 |
| Avg Acc. (over 33 concepts) | 51.94 | 58.14 | 60.67 |
| Overall Acc. | 44.02 | 53.94 | 59.29 |
Table 5: Results of fine-tuning models. “MMQA” and “CS” denote MetaMathQA and our constructed Concept-Specific training datasets, respectively. Introducing CS data specifically for the bottom 10 concepts significantly enhances these concepts’ performance, while slightly improving the performance across the remaining 33 concepts.
#### Evaluation Prompting.
Different from few-shot or CoT prompting, which can boost closed-source models, we find in Table 2 that zero-shot prompting is more effective for certain open-source LLMs. This disparity may arise either because these models are not sufficiently powerful to possess mathematical CoT capabilities (Yu et al., 2023; Wei et al., 2022) or because they have already incorporated CoT data during training (Longpre et al., 2023). Consequently, to ensure a comprehensive analysis, we employ all three prompting methods for evaluation.
#### Efficient Fine-tuning.
To show the effect of efficient fine-tuning, we take LLaMA2-13B as an example in Table 5. Specifically, we first select the 10 concepts with the lowest accuracies in Elementary-EN. Then, we crawl 495 samples (about 50 samples per concept) using the trained classifier as the Concept-Specific (CS) training data (see Appendix B for more details). Meanwhile, to avoid overfitting, we introduce the MetaMathQA (MMQA; Yu et al., 2023) data to preserve general mathematical abilities. After that, we fine-tune LLaMA2-13B using MMQA only (i.e., LLaMA2 (w/ MMQA)) or both MMQA and CS data (i.e., LLaMA2 (w/ MMQA & CS)). In Table 5, we observe that LLaMA2 (w/ MMQA & CS) achieves significant improvements on the 10 lowest concepts while preserving performance on the other 33 concepts, which shows the effect of efficient fine-tuning and the advantages of ConceptMath.
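The data-construction step above (pick the weakest concepts, sample roughly 50 CS examples each, then mix with MMQA) can be sketched as follows; the function and the `cs_pool` schema are simplified illustrations, not the exact pipeline:

```python
import random

def build_finetune_set(concept_acc, cs_pool, mmqa,
                       n_weak=10, per_concept=50, seed=0):
    """Select the n_weak lowest-accuracy concepts and mix up to
    per_concept concept-specific (CS) samples each with the general
    MMQA data, to target weaknesses without overfitting to them.

    concept_acc: dict mapping concept -> accuracy
    cs_pool:     dict mapping concept -> candidate CS samples
    mmqa:        general math instruction data
    """
    rng = random.Random(seed)
    # Concepts sorted by accuracy, weakest first.
    weakest = sorted(concept_acc, key=concept_acc.get)[:n_weak]
    cs_data = []
    for c in weakest:
        pool = cs_pool.get(c, [])
        cs_data += rng.sample(pool, min(per_concept, len(pool)))
    mixed = cs_data + list(mmqa)
    rng.shuffle(mixed)
    return weakest, mixed
```

The key design choice is the mixture itself: fine-tuning on CS data alone would risk degrading the 33 concepts the model already handles, so the general MMQA data acts as a regularizer.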
## 4 Related Work
#### Large Language Models for Mathematics.
Large Language Models (LLMs) such as GPT-3.5 and GPT-4 have exhibited promising capabilities in complex mathematical tasks. However, the proficiency of open-source alternatives like LLaMA (Touvron et al., 2023a) and LLaMA2 (Touvron et al., 2023b) remains notably inferior on these datasets, particularly in handling non-English problems. In contrast, models like Baichuan2 (Baichuan, 2023) and Qwen (Bai et al., 2023b) pretrained on multilingual datasets (i.e., Chinese and English) have achieved remarkable performance. Recently, many domain-specialized math language models have been proposed. For example, MetaMath (Yu et al., 2023) leverages the LLaMA2 models and finetunes on the constructed MetaMathQA dataset. MAmmoTH (Yue et al., 2023) synergizes Chain-of-Thought (CoT) and Program-of-Thought (PoT) rationales.
#### Mathematical Reasoning Benchmarks.
Recently, many mathematical datasets Roy and Roth (2015); Koncel-Kedziorski et al. (2015); Lu et al. (2023); Huang et al. (2016); Miao et al. (2020); Patel et al. (2021) have been proposed. For example, SingleOp (Roy et al., 2015) expands the scope to include more complex operations like multiplication and division. Math23k (Wang et al., 2017) gathers 23,161 problems labeled with structured equations and corresponding answers. GSM8K (Cobbe et al., 2021) is a widely used dataset, which requires a sequence of elementary calculations with basic arithmetic operations.
#### Fine-Grained Benchmarks.
Traditional benchmarks focus on assessing certain abilities of models on one task Guo et al. (2023b); Wang et al. (2023a); Liu et al. (2020); Guo et al. (2022); Chai et al. (2024); Liu et al. (2024); Guo et al. (2024, 2023c); Bai et al. (2023a); Liu et al. (2022); Guo et al. (2023a); Bai et al. (2024); Liu et al. (2021) (e.g., reading comprehension (Rajpurkar et al., 2018), machine translation (Bojar et al., 2014), and summarization (Narayan et al., 2018)). For example, the GLUE benchmark (Wang et al., 2019) combines a collection of tasks and has witnessed superhuman performance from pretrained models (Kenton and Toutanova, 2019; Radford et al., 2019). Hendrycks et al. (2021a) introduced MMLU, a benchmark with multiple-choice questions across 57 subjects including STEM, humanities, and social sciences, for assessing performance and identifying weaknesses. Srivastava et al. (2022) proposed BIG-bench with over 200 tasks. To enhance the mathematical capabilities of LLMs, we introduce ConceptMath, a comprehensive mathematical reasoning dataset designed to assess model performance across over 200 diverse mathematical concepts in both Chinese and English.
## 5 Conclusion
We introduce a new bilingual concept-wise math reasoning dataset called ConceptMath to assess models across a diverse set of concepts. First, ConceptMath covers more than 200 concepts across elementary and middle schools for mainstream English and Chinese systems. Second, we extensively evaluate existing LLMs with three prompting methods, which can guide further improvements of these LLMs' mathematical abilities. Third, we analyze contamination rates and error cases, and provide a simple and efficient fine-tuning strategy to address model weaknesses.
#### Limitations.
Human efforts are required to carefully design the hierarchical systems of mathematical concepts. In the future, we have three plans as follows: (1) Extend the input modality to multi-modalities. (2) Extend the education systems to high school and college levels. (3) Extend the reasoning abilities to more STEM fields.
## References
- Anthropic (2023) Anthropic. 2023. Model card and evaluations for claude models.
- Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv.
- Bai et al. (2023a) Jiaqi Bai, Hongcheng Guo, Jiaheng Liu, Jian Yang, Xinnian Liang, Zhao Yan, and Zhoujun Li. 2023a. Griprank: Bridging the gap between retrieval and generation via the generative knowledge improved passage ranking. CIKM.
- Bai et al. (2023b) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023b. Qwen technical report. arXiv preprint arXiv:2309.16609.
- Baichuan (2023) Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
- Bojar et al. (2014) Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
- Chai et al. (2024) Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, et al. 2024. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. arXiv preprint arXiv:2401.07037.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.
- Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
- Srivastava et al. (2022) Aarohi Srivastava et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
- Fritz et al. (2013) Annemarie Fritz, Antje Ehlert, and Lars Balzer. 2013. Development of mathematical concepts as basis for an elaborated mathematical understanding. South African Journal of Childhood Education, 3(1):38–67.
- Guo et al. (2022) Hongcheng Guo, Jiaheng Liu, Haoyang Huang, Jian Yang, Zhoujun Li, Dongdong Zhang, Zheng Cui, and Furu Wei. 2022. Lvp-m3: language-aware visual prompt for multilingual multimodal machine translation. EMNLP.
- Guo et al. (2023a) Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, and Zhoujun Li. 2023a. M2c: Towards automatic multimodal manga complement. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9876–9882.
- Guo et al. (2024) Hongcheng Guo, Jian Yang, Jiaheng Liu, Jiaqi Bai, Boyang Wang, Zhoujun Li, Tieqiao Zheng, Bo Zhang, Qi Tian, et al. 2024. Logformer: A pre-train and tuning pipeline for log anomaly detection. AAAI.
- Guo et al. (2023b) Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, et al. 2023b. Owl: A large language model for it operations. arXiv preprint arXiv:2309.09298.
- Guo et al. (2023c) Jinyang Guo, Jiaheng Liu, Zining Wang, Yuqing Ma, Ruihao Gong, Ke Xu, and Xianglong Liu. 2023c. Adaptive contrastive knowledge distillation for bert compression. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8941–8953.
- Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Huang et al. (2016) Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896.
- Kenton and Toutanova (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.
- Liu et al. (2024) Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, et al. 2024. E2-llm: Efficient and extreme length extension of large language models. arXiv preprint arXiv:2401.06951.
- Liu et al. (2021) Jiaheng Liu, Yudong Wu, Yichao Wu, Chuming Li, Xiaolin Hu, Ding Liang, and Mengyu Wang. 2021. Dam: discrepancy alignment metric for face recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3814–3823.
- Liu et al. (2022) Jiaheng Liu, Tan Yu, Hanyu Peng, Mingming Sun, and Ping Li. 2022. Cross-lingual cross-modal consolidation for effective multilingual video corpus moment retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1854–1862.
- Liu et al. (2020) Jiaheng Liu, Shunfeng Zhou, Yichao Wu, Ken Chen, Wanli Ouyang, and Dong Xu. 2020. Block proposal neural architecture search. IEEE Transactions on Image Processing, 30:15–25.
- Longpre et al. (2023) Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023. The flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
- Lu et al. (2023) Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In The Eleventh International Conference on Learning Representations.
- Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
- Megill and Wheeler (2019) Norman Megill and David A Wheeler. 2019. Metamath: a computer language for mathematical proofs. Lulu.com.
- Miao et al. (2020) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984.
- Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.
- OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Paster et al. (2023) Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. 2023. Openwebmath: An open dataset of high-quality mathematical web text.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752.
- Roy et al. (2015) Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.
- Shi et al. (2023) Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2023. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789.
- Simon (2011) Martin A Simon. 2011. Studying mathematics conceptual learning: Student learning through their mathematical activity. North American Chapter of the International Group for the Psychology of Mathematics Education.
- Team (2023a) InternLM Team. 2023a. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM-techreport.
- Team (2023b) Yi Team. 2023b. Yi: Building the next generation of open-source and bilingual llms. https://github.com/01-ai/Yi.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 conference on empirical methods in natural language processing, pages 845–854.
- Wang et al. (2023a) Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhu Chen, Jie Fu, and Junran Peng. 2023a. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746.
- Wang et al. (2023b) Zengzhi Wang, Rui Xia, and Pengfei Liu. 2023b. Generative ai for math: Part i – mathpile: A billion-token-scale pretraining corpus for math. arXiv preprint arXiv:2312.17120.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Ying et al. (2024) Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin. 2024. Internlm-math: Open math large language models toward verifiable reasoning.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
- Yue et al. (2023) Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653.
- Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
## Appendix A Details on the ConceptMath
As shown in Table 7, Table 8, Table 17, and Table 9, we provide the details of the three-level hierarchical concept system of ConceptMath for better illustration.
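The concept-wise accuracies reported in the figures below are obtained by grouping per-question correctness by leaf concept and averaging within each group. A minimal sketch of this aggregation (the field names `concept` and `correct` are illustrative assumptions, not the benchmark's actual schema):

```python
from collections import defaultdict

def concept_accuracies(results):
    """Aggregate per-question correctness into mean accuracy per concept.

    `results` is a list of dicts with assumed keys:
      - "concept": the leaf concept the question belongs to
      - "correct": whether the model answered it correctly
    Returns {concept: accuracy in percent}, sorted ascending by accuracy
    (the ordering used in the bar charts below).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["concept"]] += 1
        hits[r["concept"]] += int(r["correct"])
    accs = {c: 100.0 * hits[c] / totals[c] for c in totals}
    return dict(sorted(accs.items(), key=lambda kv: kv[1]))

results = [
    {"concept": "Cylinders", "correct": False},
    {"concept": "Cylinders", "correct": True},
    {"concept": "Add", "correct": True},
    {"concept": "Add", "correct": True},
]
print(concept_accuracies(results))  # {'Cylinders': 50.0, 'Add': 100.0}
```

Reporting per-concept means rather than a single average is what exposes the failure modes on basic concepts discussed in the main text.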
[Figure: bar chart of mean accuracy per concept on Elementary-EN (y-axis 0–80%), sorted ascending from Proportional (~41%) to Light & heavy (~85%). Proportional, Cones, Cylinders, Estimation & rounding, and Probability fall below 50%, while arithmetic and measurement concepts such as Add, Multiple, Variable exprs, and Exchanging money approach or exceed 80%.]
Figure 9: Mean concept accuracies of Elementary-EN.
[Figure: bar chart of mean accuracy per concept on Elementary-ZH (y-axis 0–80%), sorted ascending across 52 concepts from Periodic Problems (~22%) to a cluster near ~79%. Complex word-problem concepts (Periodic, Difference, Meeting, Work, and Chicken-and-Rabbit Problems) score lowest, while Equation, Common Divisor, Percentage, Product Difference, and Fractions-and-Decimals Problems score highest.]
Figure 10: Mean concept accuracies of Elementary-ZH.
[Figure: line chart of per-concept accuracies on Elementary-EN for MetaMath-13B (blue), LLaMA2-70B (orange), and GPT-4 (green), y-axis 0–100%. GPT-4 is highest on nearly all concepts and rarely drops below 80% (lowest ~70% on Cones); MetaMath-13B and LLaMA2-70B are volatile, with MetaMath-13B near 0% on Probability, Proportional, and Exchanging money and LLaMA2-70B dipping to ~12% on Cylinders and Light & heavy; all three models dip together on Patterns, Subtraction, and Probability.]
Figure 11: Concept accuracies on Elementary-EN.
<details>
<summary>x15.png Details</summary>

[Figure: line chart of per-concept accuracies on Elementary-ZH for MetaMath-13B (blue), LLaMA2-70B (orange), and GPT-4 (green), y-axis 0–100%, across 43 concepts from Triangle to Percentage Problem. GPT-4 is consistently highest, often reaching 90–100% (e.g., Circle, Tree Planting, Permutation and Combination, Fractions and Decimals) with lows of ~40–50% (e.g., Solid Figure Synthesis, Surplus and Deficit, Clock, Prime and Composite Numbers); LLaMA2-70B mostly falls between 10% and 50%, spiking to ~90% on Fractions and Decimals; MetaMath-13B is lowest, mostly 0–25% and hitting 0% on several concepts such as Trapezoid, Cuboid, and Cone. The stratification GPT-4 > LLaMA2-70B > MetaMath-13B holds across nearly all concepts.]
3. **High Volatility:** The performance of all models is highly sensitive to the specific type of math problem, as indicated by the jagged, non-smooth lines.
4. **Zero-Performance Categories:** MetaMath-13B scores 0% on several geometry and word-problem topics, suggesting complete failure on those specific test sets.
### Interpretation
This chart demonstrates a significant disparity in mathematical reasoning capabilities among the tested LLMs. GPT-4's dominant performance suggests a more robust and generalizable underlying model for mathematical problem-solving across diverse topics. The high volatility for all models indicates that mathematical reasoning is not a monolithic skill; proficiency is highly dependent on the specific problem structure and required operations.
The dramatic spike for LLaMA2-70B on "Fractions and Decimals" is an interesting anomaly. It could indicate that this model was particularly well-trained or fine-tuned on data related to that topic, or that the specific test questions for that category aligned well with its internal representations. Conversely, MetaMath-13B's frequent 0% scores highlight severe limitations or potential alignment issues with certain mathematical concepts.
For a technical document, this data underscores the importance of evaluating AI models on a granular, category-specific basis rather than relying on aggregate scores. It also suggests that while smaller models (like MetaMath-13B) can show competence in specific areas (e.g., basic geometry, division), they lack the broad mathematical understanding exhibited by larger, more advanced models like GPT-4. The results could guide further research into targeted fine-tuning for weaker models or help users select the appropriate model for specific mathematical tasks.
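The closing point, granular category-wise evaluation instead of a single aggregate score, is mechanically simple to implement. A minimal sketch (the field names `concept` and `correct` are hypothetical, not the released ConceptMath data format):

```python
from collections import defaultdict

def concept_accuracies(results):
    """Aggregate per-question correctness into concept-wise accuracies (%).

    `results` is a list of dicts with hypothetical keys "concept" and
    "correct" -- illustrative only, not the ConceptMath release format.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["concept"]] += 1
        hits[r["concept"]] += int(r["correct"])
    return {c: 100.0 * hits[c] / totals[c] for c in totals}

# Tiny worked example: the 75% average hides the weak concept.
results = [
    {"concept": "Fractions", "correct": True},
    {"concept": "Fractions", "correct": True},
    {"concept": "Probability", "correct": True},
    {"concept": "Probability", "correct": False},
]
per_concept = concept_accuracies(results)
average = 100.0 * sum(r["correct"] for r in results) / len(results)
```

Here `average` is 75% while `per_concept` exposes a 50% concept, exactly the kind of variation an aggregate score masks.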
</details>
Figure 12: Concept accuracies on Elementary-ZH.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Multi-Line Chart: Accuracy of Four AI Models Across Math Topics
### Overview
This image is a multi-line chart comparing the performance (accuracy percentage) of four different large language models (LLMs) across a wide range of mathematical topics. The chart visualizes how each model's accuracy fluctuates significantly depending on the specific math skill being tested.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:** Labeled "Accuracy". Scale runs from 0 to 100 in increments of 20. Horizontal grid lines are present at 20, 40, 60, 80, and 100.
* **X-Axis:** Lists 43 distinct mathematical topics or skills (see the complete list below). The labels are rotated approximately 45 degrees for readability.
* **Legend:** Positioned at the top center of the chart, inside the plot area. It contains four entries:
* **Blue line with circle markers:** Yi-6B
* **Orange line with circle markers:** ChatGLM3-6B
* **Green line with circle markers:** LLaMA2-7B
* **Red line with circle markers:** DeepSeekMath-7B
### Detailed Analysis
**Data Series Trends and Approximate Values:**
1. **Yi-6B (Blue Line):**
* **Trend:** Highly volatile, with the highest peaks and some of the lowest troughs among all models. It frequently trades the top position with ChatGLM3-6B and DeepSeekMath-7B.
* **Notable Points (Approximate):**
* **Peak:** ~100% on "Fractions".
* **High Points (>80%):** "Circles" (~82%), "Mixed operations" (~84%), "Time" (~90%), "Triangles" (~90%), "Add" (~90%), "Count" (~85%), "Length" (~90%), "Statistics" (~88%), "Quadrilaterals" (~88%).
* **Low Points (<40%):** "Cones" (~35%), "Cubes" (~30%), "Subtraction" (~35%), "Probability" (~40%).
2. **ChatGLM3-6B (Orange Line):**
* **Trend:** Also highly volatile, often performing near the top but with sharp drops. It shows a strong performance on arithmetic and geometry topics.
* **Notable Points (Approximate):**
* **Peak:** ~95% on "Fractions" and "Triangles".
* **High Points (~70-95%):** "Angles" (~70%), "Area" (~80%), "Cylinders" (~90%), "Decimals" (~85%), "Exchanging money" (~95%), "Mixed operations" (~80%), "Add" (~84%), "Compare" (~78%).
* **Low Points (<40%):** "Estimation & rounding" (~35%), "Light & heavy" (~19%), "Subtraction" (~29%), "Probability" (~25%).
3. **LLaMA2-7B (Green Line):**
* **Trend:** Consistently the lowest-performing model across nearly all topics. Its accuracy is often below 40%, with several topics near 0%. It shows a slight upward trend in the latter third of the topics (from "Add" onwards).
* **Notable Points (Approximate):**
* **Peak:** ~75% on "Cylinders".
* **High Points (>60%):** "Cylinders" (~75%), "Add" (~68%), "Compare" (~75%).
* **Very Low Points (<10%):** "Cones" (~6%), "Light & heavy" (~5%), "Rational number" (~0%), "Spheres" (~0%).
4. **DeepSeekMath-7B (Red Line):**
* **Trend:** Generally strong and more stable than Yi-6B and ChatGLM3-6B, often occupying the second-highest position. It has fewer extreme drops.
* **Notable Points (Approximate):**
* **Peak:** ~89% on "Fractions".
* **High Points (≈80% and above):** "Area" (~84%), "Mixed operations" (~89%), "Powers" (~80%), "Time" (~89%), "Triangles" (~89%), "Length" (~80%), "Temperature" (~84%), with "Add" just below the bar (~79%).
* **Low Points:** "Probability" (~12%); "Proportional" (~50%) is a relative dip rather than an outright low.
**X-Axis Topics (Complete List):**
Angles, Area, Circles, Classifying & sorting, Coin names & value, Cones, Coordinate plane, Cubes, Cylinders, Decimals, Estimation & rounding, Exchanging money, Fractions, Light & heavy, Mixed operations, Multiple, Numerical exprs, Patterns, Perimeter, Place value, Powers, Rational number, Spheres, Subtraction, Time, Triangles, Variable exprs, Volume of 3d shapes, Add, Compare, Count, Division, Equations, Length, Statistics, Percents, Polygons, Probability, Proportional, Quadrilaterals, Ratio, Temperature, Volume.
### Key Observations
1. **Model Hierarchy:** A rough performance hierarchy is visible: LLaMA2-7B (green) is consistently at the bottom. DeepSeekMath-7B (red) and Yi-6B (blue) frequently compete for the top spot, with ChatGLM3-6B (orange) close behind.
2. **Topic Difficulty:** All models show significant performance drops on specific topics, suggesting these are universally challenging. Examples include "Probability", "Subtraction", and "Proportional".
3. **Model Strengths:**
* **Yi-6B** excels at "Fractions" and several geometry/measurement topics.
* **ChatGLM3-6B** shows particular strength in "Exchanging money" and "Cylinders".
* **DeepSeekMath-7B** is strong on "Time", "Triangles", and "Mixed operations", and maintains relatively high accuracy across many topics.
* **LLaMA2-7B** only performs competitively on a handful of topics like "Cylinders" and "Add"/"Compare".
4. **Anomaly:** LLaMA2-7B's accuracy drops to approximately 0% for "Rational number" and "Spheres", which is a severe outlier compared to its other scores.
### Interpretation
This chart provides a comparative benchmark of mathematical reasoning capabilities across four 6-7B parameter language models. The data suggests that:
* **Specialization Matters:** No single model dominates all topics. Performance is highly task-dependent, indicating that the models' training data or architectures have led to different strengths in mathematical sub-domains.
* **The "Probability" Challenge:** The sharp, collective dip for all models on "Probability" (with DeepSeekMath-7B at ~12%, others even lower) highlights this as a particularly difficult area for current LLMs, likely due to its abstract and conditional nature.
* **Foundation Model Gap:** The consistently poor performance of LLaMA2-7B, a general-purpose model, compared to the others (which may have more specialized tuning or data) underscores the importance of domain-specific training for technical tasks like mathematics.
* **Volatility as a Metric:** The high volatility in scores for Yi-6B and ChatGLM3-6B suggests their mathematical knowledge may be less robust or more "brittle" compared to the somewhat more consistent DeepSeekMath-7B. A user could not rely on these models to perform uniformly well across a math curriculum.
In essence, the chart is a diagnostic tool showing that while modern LLMs can achieve high accuracy on specific math problems, their performance is not generalized, and significant challenges remain in creating a model with robust, consistent mathematical reasoning across all fundamental topics.
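The "volatility as a metric" idea above can be quantified as the spread of a model's concept-wise accuracies. A sketch with illustrative numbers (not values read from the chart):

```python
import statistics

def volatility(concept_acc):
    """Population standard deviation of concept-wise accuracies.

    Higher values mean a more 'brittle' profile: the same average
    skill, but less consistency across concepts.
    """
    return statistics.pstdev(concept_acc.values())

# Two illustrative profiles with the same mean accuracy (70%).
steady  = {"Add": 70, "Fractions": 75, "Probability": 65}
brittle = {"Add": 100, "Fractions": 90, "Probability": 20}
```

Under this score, `brittle` rates far worse (~35.6 vs ~4.1) despite the identical mean, matching the intuition that a user "could not rely on" a volatile model across a curriculum.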
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Multi-Line Chart: Accuracy of Four AI Models Across 43 Mathematics Topics
### Overview
This image is a multi-line chart comparing the performance (accuracy) of four different AI models across a wide range of distinct mathematics topics. The chart visualizes how each model's accuracy fluctuates significantly depending on the specific mathematical concept being tested.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, ranging from 0 to 100 (implied percentage).
* **Major Gridlines:** Horizontal dashed lines at 20, 40, 60, and 80.
* **X-Axis:**
* **Label:** None explicit. Contains 43 categorical labels for mathematics topics.
* **Categories (from left to right):** Angles, Area, Circles, Classifying & sorting, Coin names & value, Cones, Coordinate plane, Cubes, Cylinders, Decimals, Estimation & rounding, Exchanging money, Fractions, Light & heavy, Mixed operations, Multiple, Numerical exprs, Patterns, Perimeter, Place value, Powers, Rational number, Spheres, Subtraction, Time, Triangles, Variable exprs, Volume of 3d shapes, Add, Compare, Count, Division, Equations, Length, Statistics, Percents, Polygons, Probability, Proportional, Quadrilaterals, Ratio, Temperature, Volume.
* **Legend:**
* **Position:** Top center, above the plot area.
* **Series:**
1. **InternLM2-Math-7B:** Blue line with circular markers.
2. **InternLM2-7B:** Orange line with circular markers.
3. **MAmmoTH-13B:** Green line with circular markers.
4. **WizardMath-13B:** Red line with circular markers.
### Detailed Analysis
The chart shows high variability in model performance across topics. Below is an approximate data extraction for each model across all 43 topics. Values are estimated from each point's position relative to the gridlines.
**Trend Verification & Data Points (Approximate Accuracy %):**
| Topic | InternLM2-Math-7B (Blue) | InternLM2-7B (Orange) | MAmmoTH-13B (Green) | WizardMath-13B (Red) |
| :--- | :--- | :--- | :--- | :--- |
| **Angles** | ~82 | ~82 | ~23 | ~18 |
| **Area** | ~74 | ~47 | ~68 | ~53 |
| **Circles** | ~53 | ~53 | ~53 | ~23 |
| **Classifying & sorting** | ~77 | ~82 | ~65 | ~59 |
| **Coin names & value** | ~82 | ~82 | ~82 | ~65 |
| **Cones** | ~65 | ~65 | ~23 | ~29 |
| **Coordinate plane** | ~67 | ~56 | ~78 | ~33 |
| **Cubes** | ~80 | ~94 | ~60 | ~70 |
| **Cylinders** | ~59 | ~47 | ~53 | ~35 |
| **Decimals** | ~45 | ~75 | ~35 | ~55 |
| **Estimation & rounding** | ~55 | ~45 | ~50 | ~20 |
| **Exchanging money** | ~70 | ~65 | ~77 | ~19 |
| **Fractions** | ~84 | ~89 | ~89 | ~79 |
| **Light & heavy** | ~78 | ~89 | ~72 | ~44 |
| **Mixed operations** | ~78 | ~83 | ~83 | ~67 |
| **Multiple** | ~65 | ~55 | ~60 | ~30 |
| **Numerical exprs** | ~53 | ~68 | ~42 | ~21 |
| **Patterns** | ~58 | ~63 | ~47 | ~31 |
| **Perimeter** | ~62 | ~55 | ~44 | ~37 |
| **Place value** | ~55 | ~47 | ~65 | ~50 |
| **Powers** | ~70 | ~59 | ~41 | ~12 |
| **Rational number** | ~94 | ~84 | ~41 | ~35 |
| **Spheres** | ~80 | ~75 | ~63 | ~53 |
| **Subtraction** | ~88 | ~83 | ~60 | ~40 |
| **Time** | ~89 | ~84 | ~61 | ~50 |
| **Triangles** | ~61 | ~56 | ~56 | ~28 |
| **Variable exprs** | ~89 | ~74 | ~79 | ~42 |
| **Volume of 3d shapes** | ~70 | ~84 | ~70 | ~35 |
| **Add** | ~67 | ~89 | ~68 | ~56 |
| **Compare** | ~79 | ~84 | ~78 | ~53 |
| **Count** | ~69 | ~61 | ~39 | ~22 |
| **Division** | ~70 | ~90 | ~70 | ~50 |
| **Equations** | ~79 | ~59 | ~69 | ~48 |
| **Length** | ~94 | ~83 | ~88 | ~53 |
| **Statistics** | ~72 | ~31 | ~44 | ~39 |
| **Percents** | ~56 | ~45 | ~45 | ~19 |
| **Polygons** | ~94 | ~82 | ~77 | ~59 |
| **Probability** | ~84 | ~63 | ~63 | ~37 |
| **Proportional** | ~89 | ~69 | ~42 | ~47 |
| **Quadrilaterals** | ~70 | ~85 | ~65 | ~20 |
| **Ratio** | — | — | — | — |
| **Temperature** | — | — | — | — |
| **Volume** | — | — | — | — |
*Note: The last three topics (Ratio, Temperature, Volume) have data points that are obscured or not clearly plotted in the provided image.*
### Key Observations
1. **High Variability:** All models show dramatic swings in accuracy (often >40 percentage points) between different topics. No model is consistently superior.
2. **Model Strengths:**
* **InternLM2-Math-7B (Blue):** Often achieves the highest peaks (e.g., Rational number, Length, Polygons ~94%). It shows particular strength in more abstract or advanced topics.
* **InternLM2-7B (Orange):** Frequently competes for the top spot, with very high accuracy on foundational arithmetic (e.g., Fractions, Add, Division) and on Cubes (~94%).
* **MAmmoTH-13B (Green):** Generally performs in the middle of the pack, with notable peaks in arithmetic (Fractions, Mixed operations) and in specific topics like "Variable exprs" and "Coordinate plane".
* **WizardMath-13B (Red):** Consistently performs at the lowest accuracy level across nearly all topics, with its lowest point at "Powers" (~12%).
3. **Topic Difficulty:** Some topics appear universally challenging (e.g., "Powers", "Numerical exprs", "Patterns"), with every model scoring 70% or below. Others, like "Fractions" and "Mixed operations", see high scores from multiple models.
4. **Outliers:** The "Rational number" topic shows a massive performance gap, with InternLM2-Math-7B scoring ~94% while WizardMath-13B scores ~35%.
### Interpretation
This chart is a comparative benchmark revealing the specialized nature of these AI models. The data suggests that:
* **Model Architecture & Training Matters:** The "InternLM2-Math-7B" variant, presumably fine-tuned for mathematics, frequently outperforms its base "InternLM2-7B" counterpart, especially on more complex topics. This demonstrates the effectiveness of domain-specific training.
* **No Universal Solver:** The extreme variability indicates that these models have not achieved a generalized mathematical reasoning ability. Their performance is highly dependent on the specific format and concept of the problem, akin to a student who excels in geometry but struggles with algebra.
* **WizardMath-13B's Underperformance:** The consistently lower scores of WizardMath-13B suggest its training or architecture may be less effective for this broad set of topics compared to the other models evaluated, or it may be optimized for a different type of mathematical problem (e.g., competition math) not well-represented here.
* **Benchmark Utility:** For a user or developer, this chart is crucial for model selection. If one needs a model for geometry problems, "InternLM2-7B" or "MAmmoTH-13B" might be preferred. For advanced topics like "Rational number" or "Polygons", "InternLM2-Math-7B" is the clear choice. The chart argues against using a single model for all mathematical tasks without understanding its specific strengths and weaknesses.
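Model selection of the kind described above is easy to mechanize once the table is in code. A sketch over a small excerpt of the approximate values (not the full table):

```python
# Approximate accuracies excerpted from the table above.
SCORES = {
    "Rational number": {"InternLM2-Math-7B": 94, "InternLM2-7B": 84,
                        "MAmmoTH-13B": 41, "WizardMath-13B": 35},
    "Powers":          {"InternLM2-Math-7B": 70, "InternLM2-7B": 59,
                        "MAmmoTH-13B": 41, "WizardMath-13B": 12},
    "Fractions":       {"InternLM2-Math-7B": 84, "InternLM2-7B": 89,
                        "MAmmoTH-13B": 89, "WizardMath-13B": 79},
}

def best_model(topic):
    """Model with the highest (approximate) accuracy on a topic."""
    return max(SCORES[topic], key=SCORES[topic].get)

def universally_hard(threshold=70):
    """Topics where even the best model scores at or below the threshold."""
    return [t for t, s in SCORES.items() if max(s.values()) <= threshold]
```

On this excerpt, `best_model("Rational number")` picks InternLM2-Math-7B, and `universally_hard()` flags "Powers", the same per-topic reasoning the benchmark-utility point advocates.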
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Chart: Accuracy of Four Language Models Across Mathematical Categories
### Overview
This image is a line chart comparing the performance accuracy of four different large language models (LLMs) across a wide range of mathematical problem categories. The chart displays how each model's accuracy varies significantly depending on the specific type of math problem.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Y-Axis:** Labeled "Accuracy". The scale runs from 0 to approximately 95, with major gridlines at intervals of 20 (0, 20, 40, 60, 80).
* **X-Axis:** Lists 43 distinct mathematical categories. The labels are rotated approximately 45 degrees for readability. The categories are, from left to right:
1. Angles
2. Area
3. Circles
4. Classifying & sorting
5. Coin names & value
6. Cones
7. Coordinate plane
8. Cubes
9. Cylinders
10. Decimals
11. Estimation & rounding
12. Exchanging money
13. Fractions
14. Light & heavy
15. Mixed operations
16. Multiple
17. Numerical exprs
18. Patterns
19. Perimeter
20. Place value
21. Powers
22. Rational number
23. Spheres
24. Subtraction
25. Time
26. Triangles
27. Variable exprs
28. Volume of 3d shapes
29. Add
30. Compare
31. Count
32. Division
33. Equations
34. Length
35. Statistics
36. Percents
37. Polygons
38. Probability
39. Proportional
40. Quadrilaterals
41. Ratio
42. Temperature
43. Volume
* **Legend:** Positioned at the top center of the chart. It defines four data series:
* **Baichuan2-13B:** Blue line with circular markers.
* **LLaMA2-13B:** Orange line with circular markers.
* **Qwen-14B:** Green line with circular markers.
* **InternLM2-Math-20B:** Red line with circular markers.
### Detailed Analysis
The chart shows high variability in performance for all models across the 43 categories. Below is a summary of trends and approximate accuracy values for each model.
**1. Baichuan2-13B (Blue Line):**
* **Trend:** Highly volatile, with frequent sharp peaks and troughs. Often performs in the middle-to-high range but has significant dips.
* **Notable Highs:** Fractions (~90), Mixed operations (~83), Numerical exprs (~82), Add (~90), Equations (~88), Percents (~88).
* **Notable Lows:** Coordinate plane (~53), Subtraction (~41), Probability (~44), Quadrilaterals (~44).
**2. LLaMA2-13B (Orange Line):**
* **Trend:** Generally the lowest-performing model across most categories, with the most extreme low values. Shows a few moderate peaks.
* **Notable Highs:** Circles (~70), Fractions (~78), Mixed operations (~65), Add (~75), Compare (~61), Count (~63).
* **Notable Lows:** Coordinate plane (~0), Cubes (~12), Subtraction (~6), Probability (~11), Quadrilaterals (~35).
**3. Qwen-14B (Green Line):**
* **Trend:** Performance is often in the middle range, below Baichuan2 and InternLM2 but above LLaMA2. Shows a distinct pattern of peaks and valleys.
* **Notable Highs:** Fractions (~79), Mixed operations (~72), Numerical exprs (~61), Add (~74), Compare (~69), Volume (~75).
* **Notable Lows:** Coordinate plane (~18), Cubes (~33), Subtraction (~13), Probability (~23), Quadrilaterals (~18).
**4. InternLM2-Math-20B (Red Line):**
* **Trend:** Frequently the top-performing model, especially in arithmetic and algebraic categories. Its line is often at the top of the chart, though it has sharp drops in geometry and measurement topics.
* **Notable Highs:** Fractions (~94), Mixed operations (~94), Numerical exprs (~80), Add (~94), Equations (~84), Percents (~94), Ratio (~94).
* **Notable Lows:** Coordinate plane (~41), Cubes (~59), Subtraction (~53), Probability (~28), Quadrilaterals (~28).
**Cross-Model Comparison by Category Type:**
* **Arithmetic & Algebra (e.g., Fractions, Mixed operations, Add, Equations):** InternLM2-Math-20B and Baichuan2-13B consistently lead, often scoring above 80. LLaMA2-13B lags significantly.
* **Geometry & Measurement (e.g., Coordinate plane, Cubes, Spheres, Volume of 3d shapes):** Performance is more mixed and generally lower for all models. No single model dominates. For "Coordinate plane," all models score below 55, with LLaMA2 at ~0.
* **Basic Concepts (e.g., Count, Compare, Classifying & sorting):** Models show relatively closer performance, though InternLM2 and Baichuan2 still tend to be higher.
### Key Observations
1. **Model Specialization:** InternLM2-Math-20B shows a clear strength in core mathematical operations (fractions, mixed operations, equations, percents, ratio), suggesting specialized training or fine-tuning for these areas.
2. **Universal Difficulty:** Certain categories prove challenging for all models. "Coordinate plane" and "Probability" see low scores across the board, indicating these are harder reasoning tasks for current LLMs.
3. **Extreme Volatility:** The performance of each model is not consistent; it is highly dependent on the specific problem category. A model can be near the top in one category and near the bottom in another.
4. **LLaMA2-13B's Struggles:** The LLaMA2-13B model has the weakest overall performance, with several categories near or at 0% accuracy, suggesting a potential lack of relevant training data or capability for those specific math skills.
### Interpretation
This chart provides a diagnostic breakdown of LLM capabilities in mathematical reasoning. It moves beyond an "average accuracy" score to reveal a nuanced landscape of strengths and weaknesses.
* **What the data suggests:** Mathematical reasoning in LLMs is not a monolithic skill. Proficiency is highly fragmented across different domains. A model's overall benchmark score would mask these critical variations.
* **How elements relate:** The x-axis categories represent a taxonomy of elementary to middle-school math skills. The diverging lines show that model architecture and training data create distinct "profiles" of competency. For instance, InternLM2-Math-20B's profile is spiked in arithmetic/algebra, while its geometry performance is more average.
* **Notable anomalies:** The near-zero score for LLaMA2-13B on "Coordinate plane" is a stark outlier, suggesting a complete failure mode for that model on that specific task type. The consistent high performance of InternLM2-Math-20B on categories involving fractions, percents, and ratios indicates a possible targeted optimization for proportional reasoning.
* **Implication:** For practical applications, one cannot assume a model good at "math" is good at *all* math. Task-specific evaluation is crucial. The chart also highlights areas (like probability and coordinate geometry) where all current models need significant improvement, guiding future research and training efforts.
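The domain grouping used in the cross-model comparison above amounts to averaging concept accuracies within each domain. A sketch using a few of the approximate values from this description (the grouping itself is illustrative):

```python
# Illustrative domain grouping, following the comparison above.
DOMAINS = {
    "arithmetic_algebra": ["Fractions", "Mixed operations", "Add"],
    "geometry":           ["Coordinate plane", "Cubes"],
}

# Approximate per-concept accuracies from the description above.
ACC = {
    "InternLM2-Math-20B": {"Fractions": 94, "Mixed operations": 94, "Add": 94,
                           "Coordinate plane": 41, "Cubes": 59},
    "LLaMA2-13B":         {"Fractions": 78, "Mixed operations": 65, "Add": 75,
                           "Coordinate plane": 0, "Cubes": 12},
}

def domain_profile(model):
    """Mean accuracy per domain -- the model's competency 'profile'."""
    return {d: sum(ACC[model][c] for c in cs) / len(cs)
            for d, cs in DOMAINS.items()}
```

This makes the "spiked" profile explicit: on this excerpt, InternLM2-Math-20B averages 94 in arithmetic/algebra but only 50 in geometry.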
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description
## Line Chart: Accuracy Comparison of Four AI Models on Math Topics
### Overview
This image is a line chart comparing the performance (accuracy percentage) of four different large language models across a wide range of mathematical topics. The chart displays four distinct data series, each represented by a colored line with markers, plotted against a categorical x-axis of math skills and a numerical y-axis of accuracy.
### Components/Axes
* **Chart Title:** Not explicitly stated.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear scale from 20 to 100.
* **Major Ticks:** 20, 30, 40, 50, 60, 70, 80, 90, 100.
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains categorical data points for math topics.
* **Categories (from left to right):** Angles, Area, Circles, Classifying & sorting, Coin names & value, Cones, Coordinate plane, Cubes, Cylinders, Decimals, Estimation & rounding, Exchanging money, Fractions, Light & heavy, Mixed operations, Multiple, Numerical exprs, Patterns, Perimeter, Place value, Powers, Rational number, Spheres, Subtraction, Time, Triangles, Variable exprs, Volume of 3d shapes, Add, Compare, Count, Division, Equations, Length, Statistics, Percents, Polygons, Probability, Proportional, Quadrilaterals, Ratio, Temperature, Volume.
* **Legend:** Positioned at the top center of the chart area.
* **InternLM2-20B:** Blue line with circle markers.
* **Yi-34B:** Orange line with diamond markers.
* **Qwen-72B:** Green line with square markers.
* **GPT-3.5:** Red line with triangle markers.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
* **InternLM2-20B (Blue Line, Circles):**
* **Trend:** Highly variable, generally the lowest-performing series. Shows sharp peaks and deep troughs.
* **Key Points (Approx.):** Angles (~23%), Area (~63%), Circles (~53%), Classifying & sorting (~41%), Cubes (~50%), Decimals (~40%), Fractions (~59%), Mixed operations (~68%), Multiple (~67%), Numerical exprs (~55%), Patterns (~31%), Perimeter (~42%), Place value (~31%), Powers (~65%), Rational number (~70%), Sphere (~47%), Subtraction (~63%), Time (~50%), Triangles (~58%), Variable exprs (~50%), Volume of 3d shapes (~58%), Add (~45%), Compare (~39%), Count (~63%), Division (~45%), Equations (~75%), Length (~42%), Statistics (~35%), Percents (~50%), Polygons (~28%), Probability (~41%), Proportional (~58%), Quadrilaterals (~52%), Ratio (~70%).
* **Yi-34B (Orange Line, Diamonds):**
* **Trend:** Mid-to-high performance, often tracking closely with Qwen-72B but generally slightly below it and GPT-3.5. Shows significant volatility.
* **Key Points (Approx.):** Angles (~59%), Area (~79%), Circles (~59%), Classifying & sorting (~76%), Coin names & value (~59%), Cones (~65%), Coordinate plane (~80%), Cubes (~48%), Cylinders (~45%), Decimals (~70%), Estimation & rounding (~82%), Exchanging money (~69%), Fractions (~90%), Light & heavy (~83%), Mixed operations (~95%), Multiple (~90%), Numerical exprs (~68%), Patterns (~63%), Perimeter (~56%), Place value (~79%), Powers (~59%), Rational number (~53%), Sphere (~75%), Subtraction (~89%), Time (~56%), Triangles (~85%), Variable exprs (~72%), Volume of 3d shapes (~69%), Add (~85%), Compare (~72%), Count (~70%), Division (~94%), Equations (~89%), Length (~78%), Statistics (~53%), Percents (~69%), Polygons (~88%), Probability (~84%), Proportional (~89%), Quadrilaterals (~84%), Ratio (~85%).
* **Qwen-72B (Green Line, Squares):**
* **Trend:** High performance, frequently the second-best series. Often follows a similar pattern to GPT-3.5 but at a slightly lower accuracy level.
* **Key Points (Approx.):** Angles (~70%), Area (~47%), Circles (~82%), Classifying & sorting (~82%), Coin names & value (~70%), Cones (~65%), Coordinate plane (~72%), Cubes (~80%), Cylinders (~41%), Decimals (~45%), Estimation & rounding (~70%), Exchanging money (~50%), Fractions (~90%), Light & heavy (~83%), Mixed operations (~94%), Multiple (~90%), Numerical exprs (~79%), Patterns (~69%), Perimeter (~69%), Place value (~85%), Powers (~77%), Rational number (~77%), Sphere (~80%), Subtraction (~83%), Time (~84%), Triangles (~83%), Variable exprs (~95%), Volume of 3d shapes (~80%), Add (~78%), Compare (~69%), Count (~90%), Division (~85%), Equations (~84%), Length (~53%), Statistics (~69%), Percents (~89%), Polygons (~84%), Probability (~88%), Proportional (~89%), Quadrilaterals (~84%), Ratio (~85%).
* **GPT-3.5 (Red Line, Triangles):**
* **Trend:** Consistently the highest-performing series. Maintains high accuracy with less severe drops compared to other models.
* **Key Points (Approx.):** Angles (~94%), Area (~79%), Circles (~59%), Classifying & sorting (~82%), Coin names & value (~70%), Cones (~83%), Coordinate plane (~95%), Cubes (~82%), Cylinders (~94%), Decimals (~75%), Estimation & rounding (~94%), Exchanging money (~78%), Fractions (~95%), Light & heavy (~100%), Mixed operations (~100%), Multiple (~84%), Numerical exprs (~84%), Patterns (~87%), Perimeter (~95%), Place value (~82%), Powers (~89%), Rational number (~85%), Sphere (~95%), Subtraction (~89%), Time (~95%), Triangles (~90%), Variable exprs (~100%), Volume of 3d shapes (~70%), Add (~94%), Compare (~79%), Count (~90%), Division (~95%), Equations (~82%), Length (~87%), Statistics (~56%), Percents (~88%), Polygons (~84%), Probability (~89%), Proportional (~95%), Quadrilaterals (~85%), Ratio (~95%).
### Key Observations
1. **"Angles" Gap:** "Angles" is a sharp weak point for InternLM2-20B (~23%) and below average for Yi-34B (~59%), while GPT-3.5 stays high (~94%), making it one of the widest model-to-model gaps on the chart.
2. **Shared Strength:** The three stronger models all score roughly 90% or higher on "Fractions" and "Mixed operations", with GPT-3.5 reaching ~100% on "Mixed operations", "Light & heavy", and "Variable exprs".
3. **Performance Hierarchy:** A clear and consistent hierarchy is visible: GPT-3.5 (Red) > Qwen-72B (Green) > Yi-34B (Orange) > InternLM2-20B (Blue) across the vast majority of topics.
4. **Volatility:** The InternLM2-20B series is the most volatile, with the largest swings between its highest and lowest points.
5. **Anomaly:** The "Statistics" topic shows a notable inversion: GPT-3.5's accuracy (~56%) falls below Qwen-72B's (~69%), the clearest break in the overall hierarchy; Yi-34B is lower still (~53%).
### Interpretation
This chart provides a comparative benchmark of mathematical reasoning capabilities across four AI models. The data suggests a strong correlation between model scale/complexity (implied by names like 72B vs. 20B) and performance on these tasks. GPT-3.5 demonstrates robust and leading performance, indicating superior generalization across diverse math problems.
The dips on topics like "Angles" (for the smaller models) and "Statistics" (even for GPT-3.5) may point to inherent challenges in those areas for current language models, possibly due to the need for precise spatial reasoning or complex data interpretation. Conversely, the high scores on "Fractions" and "Mixed operations" suggest these models are particularly adept at procedural arithmetic and multi-step calculation tasks.
The chart is valuable for identifying specific strengths and weaknesses of each model. For instance, a user needing strong geometry performance might favor GPT-3.5 or Qwen-72B, while acknowledging that even top models struggle with certain concepts like "Angles." The performance gap between the largest model (GPT-3.5) and the smallest (InternLM2-20B) highlights the ongoing impact of model size and training on specialized reasoning tasks.
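The "Statistics" inversion noted in the observations can be found mechanically by checking each topic's scores against the overall ranking. A sketch with a few approximate values from this description:

```python
# Overall ranking observed above, strongest first.
RANKING = ["GPT-3.5", "Qwen-72B", "Yi-34B", "InternLM2-20B"]

# Approximate accuracies for two topics from the description above.
ACC = {
    "Proportional": {"GPT-3.5": 95, "Qwen-72B": 89, "Yi-34B": 89, "InternLM2-20B": 58},
    "Statistics":   {"GPT-3.5": 56, "Qwen-72B": 69, "Yi-34B": 53, "InternLM2-20B": 35},
}

def inversions(topic):
    """Pairs (weaker, stronger) where a lower-ranked model beats a higher-ranked one."""
    out = []
    for i, strong in enumerate(RANKING):
        for weak in RANKING[i + 1:]:
            if ACC[topic][weak] > ACC[topic][strong]:
                out.append((weak, strong))
    return out
```

"Proportional" yields no inversions, while "Statistics" flags Qwen-72B beating GPT-3.5, a systematic way to surface the anomalies called out above.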
</details>
Figure 13: Concept accuracies on Elementary-EN of more models.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Multi-Line Chart: Accuracy of Four AI Models Across Mathematical Topics
### Overview
This image is a multi-line chart comparing the performance (accuracy) of four different large language models (LLMs) across a wide range of mathematical problem categories. The chart visualizes how each model's accuracy fluctuates significantly depending on the specific mathematical topic.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, from 0 to 100.
* **Major Ticks:** 0, 20, 40, 60, 80, 100.
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains a dense list of mathematical topics.
* **Content:** A series of 52 distinct mathematical categories, listed from left to right. The text is rotated approximately 45 degrees for readability.
* **Language:** All x-axis labels are in English.
* **Legend:**
* **Position:** Top-center, above the plot area.
* **Content:** Four entries, each associating a color and marker style with a model name.
1. **Blue line with circle markers:** Yi-6B
2. **Orange line with circle markers:** ChatGLM3-6B
3. **Green line with circle markers:** LLaMA2-7B
4. **Red line with circle markers:** DeepSeekMath-7B
### Detailed Analysis
The chart shows high variability in model performance. Below is an analysis grouped by general mathematical domain, listing approximate accuracy values (to the nearest 5%) for key points. Values are approximate due to visual estimation from the chart.
**1. Arithmetic & Basic Operations (Leftmost section):**
* **Trend:** Models show moderate to high accuracy, with significant divergence.
* **Data Points (Approx.):**
* *Add & subtract:* Yi-6B ~70%, ChatGLM3-6B ~65%, LLaMA2-7B ~45%, DeepSeekMath-7B ~70%.
* *Arithmetic sequences:* All models dip, with LLaMA2-7B lowest (~20%).
* *Consumer math:* Yi-6B peaks (~75%), others are lower (30-50%).
**2. Algebra & Equations (Left-center section):**
* **Trend:** Extreme volatility. Some models achieve near-perfect scores on specific topics while failing on others.
* **Data Points (Approx.):**
* *Linear equations:* Yi-6B ~70%, ChatGLM3-6B ~55%, LLaMA2-7B ~45%, DeepSeekMath-7B ~70%.
* *Nonlinear functions:* A notable peak for DeepSeekMath-7B (~90%) and Yi-6B (~85%).
* *Probability of compound events:* A major low point for all models. Yi-6B ~35%, ChatGLM3-6B ~10%, LLaMA2-7B ~15%, DeepSeekMath-7B ~15%.
**3. Geometry & Measurement (Center section):**
* **Trend:** Mixed performance. LLaMA2-7B consistently underperforms in this domain.
* **Data Points (Approx.):**
* *Perimeter & area:* DeepSeekMath-7B ~90%, Yi-6B ~60%, ChatGLM3-6B ~65%, LLaMA2-7B ~40%.
* *Pythagorean theorem:* DeepSeekMath-7B ~85%, others between 40-65%.
* *Surface area & volume:* All models show a dip, with LLaMA2-7B near 0%.
**4. Advanced & Applied Topics (Right-center section):**
* **Trend:** Continued high variance. Some advanced topics see strong performance from specialized models.
* **Data Points (Approx.):**
* *Two-variable statistics:* Yi-6B ~80%, DeepSeekMath-7B ~65%, others lower.
* *Systems of equations:* DeepSeekMath-7B ~80%, Yi-6B ~70%, ChatGLM3-6B ~65%, LLaMA2-7B ~45%.
* *Independent & dependent events:* A catastrophic drop for ChatGLM3-6B to ~0%. Others range from 20-75%.
**5. Calculus & Functions (Rightmost section):**
* **Trend:** DeepSeekMath-7B and Yi-6B show strong, leading performance.
* **Data Points (Approx.):**
* *Transformations:* DeepSeekMath-7B ~90%, Yi-6B ~80%, ChatGLM3-6B ~60%, LLaMA2-7B ~25%.
* *Variable exprs (final point):* DeepSeekMath-7B ~55%, Yi-6B ~60%, ChatGLM3-6B ~55%, LLaMA2-7B ~35%.
### Key Observations
1. **Model Specialization:** DeepSeekMath-7B (red) frequently achieves the highest peaks, especially in geometry, algebra, and calculus topics, suggesting strong mathematical specialization. Yi-6B (blue) is a consistent high performer across many domains.
2. **General Weakness:** LLaMA2-7B (green) is consistently the lowest or among the lowest performers across nearly all categories, rarely exceeding 50% accuracy.
3. **Topic-Specific Failures:** All models exhibit severe performance drops on specific topics. Notable universal low points include "Probability of compound events" and "Surface area & volume." ChatGLM3-6B has an extreme outlier near 0% on "Independent & dependent events."
4. **High Volatility:** No model demonstrates smooth, consistent performance. Accuracy is highly dependent on the specific mathematical concept being tested.
### Interpretation
This chart demonstrates that mathematical reasoning in LLMs is not a monolithic capability but a highly fragmented one. Performance is exquisitely sensitive to the specific sub-domain of mathematics.
* **What the data suggests:** The models likely have uneven coverage in their training data or differing architectural biases for handling symbolic vs. numerical reasoning. The near-perfect scores on some topics (e.g., DeepSeekMath-7B on "Perimeter & area") contrasted with near-zero scores on others indicate brittle knowledge rather than robust, generalized mathematical understanding.
* **How elements relate:** The x-axis represents a spectrum of mathematical complexity and abstraction. The jagged, non-parallel lines show that model capabilities are not simply "better" or "worse" in a linear fashion; they are incomparable across different topics. A model strong in algebra may be weak in geometry.
* **Notable anomalies:** The catastrophic failure of ChatGLM3-6B on "Independent & dependent events" is a critical outlier. It suggests a potential fundamental gap in its training or reasoning approach for probabilistic concepts involving dependency, which is a core statistical idea. The universal struggle with "Probability of compound events" highlights a common challenge area for current LLMs in handling layered probabilistic logic.
**Conclusion for a Technical Document:** The chart provides compelling evidence that evaluating LLMs on "math" as a single category is insufficient. A granular, topic-by-topic analysis is required to understand a model's true capabilities and limitations. For applications requiring mathematical reasoning, model selection must be tightly coupled to the specific mathematical domain of the task.
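The granular, topic-by-topic analysis called for here amounts to grouping per-question grading results by concept rather than averaging over the whole benchmark. A minimal sketch with hypothetical grading results (real evaluations would have many questions per concept):

```python
from collections import defaultdict

# Hypothetical per-question grading results as (concept, correct) pairs.
results = [
    ("Perimeter & area", True),
    ("Perimeter & area", True),
    ("Probability of compound events", False),
    ("Probability of compound events", False),
]

def concept_accuracy(results):
    """Group per-question results by concept and compute accuracy (%)."""
    tally = defaultdict(lambda: [0, 0])  # concept -> [num_correct, num_total]
    for concept, correct in results:
        tally[concept][0] += int(correct)
        tally[concept][1] += 1
    return {c: 100.0 * ok / n for c, (ok, n) in tally.items()}

per_concept = concept_accuracy(results)
overall = 100.0 * sum(ok for _, ok in results) / len(results)
```

The aggregate score (50% in this toy example) hides that one concept is solved perfectly while the other fails completely, which is exactly the failure mode the per-concept view exposes.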
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
## Line Chart: Accuracy of Four AI Models Across Math Topics
### Overview
This image is a line chart comparing the accuracy percentages of four different AI models across a wide range of mathematical topics. The chart displays performance variability, with each model showing distinct strengths and weaknesses depending on the specific math domain.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Y-Axis:** Labeled "Accuracy". Scale ranges from 0 to approximately 90, with major gridlines at intervals of 20 (0, 20, 40, 60, 80).
* **X-Axis:** Lists 49 distinct mathematical topics or problem categories. The labels are rotated for readability.
* **Legend:** Positioned at the top of the chart, centered. It identifies four data series:
* **InternLM2-Math-7B:** Blue line with circle markers.
* **InternLM2-7B:** Orange line with circle markers.
* **MAmmoTH-13B:** Green line with circle markers.
* **WizardMath-13B:** Red line with circle markers.
### Detailed Analysis
**X-Axis Categories (Math Topics):**
The topics listed from left to right are:
Add & subtract, Arithmetic sequences, Congruence & similarity, Counting principles, Decimals, Distance between two points, Divide, Domain & range of funcs, Equivalent expressions, Exponents & scientific notation, Fractions, Geometric sequences, Interpret functions, Linear equations, Linear functions, Logarithms & exps, Make predictions, Matrices, Nonlinear functions, One-variable statistics, Perimeter & area, Permutations, Prime factorization, Probability of compound events, Probability of one event, Proportional relationships, Rational & irrational numbers, Scale drawings, Simple & compound events, Square roots & cube roots, Surface area & volume, Systems of equations, Two-variable statistics, Absolute value, Center & variability, Circle, Factors, Independent & dependent events, Inequalities, Inscribed angles, Mean/median/mode/range, Opposite integers, Outliers, Polygons, Polyhedra, Radical expressions, Transformations, Trapezoids, Variable expressions.
**Data Series Trends & Approximate Values:**
* **InternLM2-Math-7B (Blue):** Generally the highest-performing model. Shows high variability but frequently peaks above 80% accuracy. Notable high points include "Polygons" (~95%), "Transformations" (~95%), and "Geometric sequences" (~85%). Its lowest points are around 40-50% for topics like "Probability of compound events" and "Systems of equations".
* **InternLM2-7B (Orange):** Typically the lowest-performing model. Accuracy mostly fluctuates between 20% and 60%. It has a few peaks near 60% (e.g., "Matrices", "Polygons") and several dips below 20% (e.g., "Divide", "Logarithms & exps", "Probability of compound events").
* **MAmmoTH-13B (Green):** Shows very high volatility. It achieves some of the highest peaks on the chart, such as "Domain & range of funcs" (~95%) and "Geometric sequences" (~85%), but also experiences severe drops, including near 0% for "Probability of compound events" and "Independent & dependent events".
* **WizardMath-13B (Red):** Also highly volatile, with a performance range from near 0% to over 70%. It has several points at or near 0% accuracy (e.g., "Probability of compound events", "Independent & dependent events", "Opposite integers"). Its highest performance is around 70% for topics like "Polygons" and "Transformations".
### Key Observations
1. **Topic-Specific Performance:** No single model dominates across all topics. Performance is highly dependent on the specific math domain.
2. **Common Struggles:** All models show a significant performance drop for "Probability of compound events", with MAmmoTH-13B and WizardMath-13B hitting near 0%.
3. **Specialization:** The "InternLM2-Math-7B" model, which likely has specialized math training, consistently outperforms the base "InternLM2-7B" model, demonstrating the value of domain-specific fine-tuning.
4. **High Variability:** The green (MAmmoTH-13B) and red (WizardMath-13B) lines exhibit extreme swings, suggesting their performance is less stable and more sensitive to the problem type compared to the blue line.
5. **Geometry Strength:** Several models, particularly the blue and green ones, show relatively strong performance on geometry-related topics like "Polygons" and "Transformations", as well as on "Geometric sequences" (a sequences topic, despite the name).
### Interpretation
This chart provides a comparative benchmark of AI model capabilities in mathematics. The data suggests that while larger or specialized models (like InternLM2-Math-7B) generally achieve higher accuracy, their performance is not uniformly superior. The significant volatility in some models indicates potential brittleness—they may excel at certain types of reasoning (e.g., geometry) but fail catastrophically at others (e.g., complex probability).
The near-universal failure on "Probability of compound events" highlights a specific, challenging area for current AI math reasoning. The stark difference between the blue and orange lines underscores the effectiveness of targeted mathematical training. For a user or developer, this chart is a tool for model selection: choosing the best model depends heavily on the specific mathematical domain of interest. It also identifies clear areas for future model improvement, particularly in probability and statistics topics where performance is inconsistent or poor across the board.
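Universal weak spots such as "Probability of compound events" can be located mechanically: keep the concepts on which every model falls below a threshold. A sketch with illustrative numbers; the function name, threshold, and values are ours, not from the benchmark:

```python
# Hypothetical per-concept accuracies (%); illustrative values only.
acc = {
    "InternLM2-Math-7B": {"Polygons": 95, "Probability of compound events": 45},
    "MAmmoTH-13B":       {"Polygons": 80, "Probability of compound events": 2},
    "WizardMath-13B":    {"Polygons": 70, "Probability of compound events": 1},
}

def universal_weaknesses(acc, threshold=50):
    """Concepts, shared by all models, on which every model scores below `threshold`."""
    shared = set.intersection(*(set(a) for a in acc.values()))
    return sorted(c for c in shared
                  if all(a[c] < threshold for a in acc.values()))

weak = universal_weaknesses(acc)
```

Concepts surfaced this way are candidates for targeted fine-tuning data rather than evidence about any single model.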
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Multi-Line Chart: Accuracy of Four AI Models Across Mathematical Topics
### Overview
This image is a multi-line chart comparing the performance (accuracy) of four different large language models (LLMs) across a wide range of mathematical topics. The chart displays the accuracy percentage for each model on each topic, allowing for a direct comparison of their strengths and weaknesses in various mathematical domains.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:** Labeled "Accuracy". The scale runs from 0 to 100 in increments of 20 (0, 20, 40, 60, 80, 100). Horizontal grid lines are present at these intervals.
* **X-Axis:** Lists 54 distinct mathematical topics or problem categories. The labels are rotated approximately 45 degrees for readability. The full list of topics, from left to right, is:
1. Add & subtract
2. Arithmetic sequences
3. Congruence & similarity
4. Counting methods
5. Counting Principle
6. Distance between two points
7. Divisibility
8. Domain & range of functions
9. Equivalent expressions
10. Estimate measurements
11. Exponents & scientific notation
12. Financial & depreciation
13. Fractions & decimals
14. Geometric sequences
15. Inequalities
16. Integral functions
17. Linear equations
18. Linear functions
19. Linear inequalities
20. Make predictions
21. Matrices
22. Nonlinear functions
23. One-variable statistics
24. Permutations
25. Prime factorization
26. Prime or composite
27. Probability of compound events
28. Probability of one event
29. Proportional relationships
30. Rational & irrational numbers
31. Scale drawings
32. Slope
33. Square roots & cube roots
34. Surface area & volume
35. Systems of equations
36. Two-variable relationships
37. Absolute value
38. Axis
39. Center & variation
40. Circle
41. Factors
42. Independent & dependent events
43. Inequalities
44. Inequalities & ranges
45. Mean, median, mode & range
46. Opposite integers
47. Outlier
48. Polygons
49. Polynomials
50. Radical expressions
51. Transformations
52. Trapezoids
53. Triangles
54. Variable exprs
* **Legend:** Positioned at the top center of the chart. It identifies four data series:
* **Baichuan2-13B:** Blue line with circular markers.
* **LLaMA2-13B:** Orange line with circular markers.
* **Qwen-14B:** Green line with circular markers.
* **InternLM2-Math-20B:** Red line with circular markers.
### Detailed Analysis
The chart shows high variability in model performance across topics. Below is an analysis of each model's trend and approximate key data points.
**1. Baichuan2-13B (Blue Line):**
* **Trend:** Shows a generally high but volatile performance. It frequently achieves the highest or second-highest accuracy on many topics but also has significant dips.
* **Key Points (Approximate):**
* Highs: ~90% on "Prime factorization", ~100% on "Outlier", ~95% on "Radical expressions".
* Lows: ~50% on "Linear equations", ~55% on "Systems of equations", ~50% on "Transformations".
* Notable: Peaks sharply at "Outlier" (100%) and "Radical expressions" (~95%).
**2. LLaMA2-13B (Orange Line):**
* **Trend:** Exhibits the most extreme volatility. It has the single highest point on the chart but also some of the lowest.
* **Key Points (Approximate):**
* Highs: **100% on "Prime factorization"** (the chart's maximum), ~85% on "Congruence & similarity", ~80% on "Systems of equations".
* Lows: **~5% on "Probability of compound events"** (the chart's minimum), ~10% on "Matrices", ~15% on "Nonlinear functions".
* Notable: The dramatic drop at "Probability of compound events" is a major outlier for this model.
**3. Qwen-14B (Green Line):**
* **Trend:** Generally the lowest-performing model across most topics, with a few exceptions where it matches or exceeds others. Its line is often at the bottom of the cluster.
* **Key Points (Approximate):**
* Highs: ~75% on "Congruence & similarity", ~85% on "Outlier", ~50% on "Transformations".
* Lows: **0% on "Matrices"**, ~0% on "Radical expressions", ~10% on "Distance between two points", ~10% on "Nonlinear functions".
* Notable: Hits the absolute bottom (0%) on two topics.
**4. InternLM2-Math-20B (Red Line):**
* **Trend:** Shows consistently strong performance, often occupying the top or second-highest position. It appears to be the most stable high-performer, with fewer extreme lows compared to LLaMA2-13B.
* **Key Points (Approximate):**
* Highs: ~95% on "Congruence & similarity", ~95% on "Outlier", ~90% on "Radical expressions".
* Lows: ~15% on "Probability of compound events", ~40% on "Linear equations", ~45% on "Systems of equations".
* Notable: While it also drops on "Probability of compound events", its low (~15%) is not as severe as LLaMA2-13B's (~5%).
### Key Observations
1. **Topic-Dependent Performance:** No single model is superior across all 54 topics. Performance is highly dependent on the specific mathematical domain.
2. **Extreme Outliers:** Two data points stand out: LLaMA2-13B's perfect score (100%) on "Prime factorization" and its near-zero score (~5%) on "Probability of compound events".
3. **Model Clustering:** On many topics (e.g., "Add & subtract", "Arithmetic sequences"), the models cluster within a 20-30% accuracy range. On others (e.g., "Prime factorization", "Probability of compound events"), the spread is enormous (50-95 percentage points).
4. **Consistent Laggard:** Qwen-14B (green) is frequently the lowest-performing model, hitting 0% accuracy on two topics.
5. **Strong Contenders:** Baichuan2-13B (blue) and InternLM2-Math-20B (red) are often the top two performers, trading the lead depending on the topic.
### Interpretation
This chart provides a granular benchmark of mathematical reasoning capabilities across different LLMs. The data suggests that:
* **Specialization vs. Generalization:** LLaMA2-13B demonstrates extreme specialization, achieving perfect accuracy in one area (Prime factorization) while failing almost completely in another (Probability of compound events). This indicates its training or architecture may be highly optimized for certain procedural math but lacks robustness in probabilistic reasoning.
* **Stability of Advanced Models:** InternLM2-Math-20B, despite being a math-specialized model, shows a more stable high-performance profile. Its smaller performance drops on difficult topics (like probability) suggest better generalization within mathematics compared to LLaMA2-13B.
* **The Challenge of Probability:** The topic "Probability of compound events" causes a severe performance drop for three of the four models (LLaMA2, Qwen, InternLM2). This highlights a specific, common weakness in LLM mathematical reasoning, likely due to the combinatorial and conditional nature of such problems.
* **Benchmarking Utility:** For a user or developer, this chart is invaluable for model selection. If one's primary need is solving problems related to "Outlier" detection or "Radical expressions," Baichuan2-13B or InternLM2-Math-20B would be strong choices. If the task involves "Matrices," Qwen-14B should be avoided. The chart moves beyond aggregate scores to reveal the nuanced landscape of AI capabilities in mathematics.
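The model-selection use case described above reduces to an argmax over models for each concept of interest. A hedged sketch; the accuracy table is illustrative, not transcribed from the chart:

```python
# Hypothetical per-concept accuracies (%); illustrative values only.
acc = {
    "Baichuan2-13B":      {"Matrices": 60, "Outlier": 100},
    "Qwen-14B":           {"Matrices": 0,  "Outlier": 85},
    "InternLM2-Math-20B": {"Matrices": 55, "Outlier": 95},
}

def best_model_per_concept(acc):
    """For each concept shared by all models, return the highest-scoring model."""
    shared = set.intersection(*(set(a) for a in acc.values()))
    return {c: max(acc, key=lambda m: acc[m][c]) for c in shared}

best = best_model_per_concept(acc)
```

In practice one would also want the runner-up and the margin, since a narrow lead on estimated values is not a reliable basis for selection.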
</details>
<details>
<summary>x23.png Details</summary>

### Visual Description
## Line Chart: Accuracy of Four Language Models Across Mathematical Topics
### Overview
This image is a line chart comparing the performance of four large language models (LLMs) on a wide array of mathematical topics. The chart plots "Accuracy" (y-axis) against a comprehensive list of mathematical concepts (x-axis). Each model is represented by a distinct colored line with markers, showing its accuracy score for each topic. The overall visual impression is one of high variability, with models performing very differently depending on the specific mathematical domain.
### Components/Axes
* **Chart Type:** Multi-line chart with data point markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear scale from 0 to 100.
* **Major Gridlines:** Horizontal dashed lines at 20, 40, 60, 80, and 100.
* **X-Axis:**
* **Label:** None explicit. The axis consists of categorical labels for mathematical topics.
* **Categories (from left to right):** Add & subtract, Arithmetic sequences, Congruence & similarity, Consumer math, Counting principle, Distance between two points, Divide, Domain & range of functions, Equiv measurements, Estimate metric measurements, Exponents & scientific notation, Financial literacy, Fractions & decimals, Geometric sequences, Interpret functions, Linear equations, Linear functions, Lines & angles, Make predictions, Multiply, Nonlinear functions, One-variable statistics, Percent, Perimeter & area, Prime factorization, Prime or composite events, Probability of compound events, Probability of one event, Probability of opposite events, Proportional relationships, Rational & irrational numbers, Scale drawings, Square roots & cube roots, Surface area & volume, Systems of equations, Triangles, Two-variable statistics, Absolute value, Axis, Center & variability, Circle, Factors, Independent & dependent events, Inequalities, Mean, median, mode & range, Opposite integer, Outlier, Polygons, Polyhedra, Radical exps, Transformations, Trapezoids, Variable exprs.
* **Legend:**
* **Position:** Top center, above the plot area.
* **Series:**
1. **InternLM2-20B:** Blue line with circular markers.
2. **Yi-34B:** Orange line with circular markers.
3. **Qwen-72B:** Green line with circular markers.
4. **GPT-3.5:** Red line with circular markers.
### Detailed Analysis
The chart reveals significant performance disparities among the models across more than 50 mathematical topics.
**1. GPT-3.5 (Red Line):**
* **Trend:** Generally the highest-performing model, frequently occupying the top position. Its line shows high volatility, with many peaks at or near 100% accuracy and several deep troughs.
* **Key Points:** Achieves ~100% accuracy on topics like "Add & subtract," "Congruence & similarity," "Counting principle," "Prime or composite events," "Probability of opposite events," "Systems of equations," "Absolute value," "Circle," "Independent & dependent events," "Polygons," and "Variable exprs." Its lowest points appear to be around "Distance between two points" (~75%), "Scale drawings" (~65%), and "Two-variable statistics" (~70%).
**2. Qwen-72B (Green Line):**
* **Trend:** Often the second-best performer, closely following GPT-3.5. It shows a similar pattern of peaks and valleys but generally sits slightly below the red line.
* **Key Points:** Matches or nearly matches GPT-3.5's high scores on several topics (e.g., "Add & subtract," "Congruence & similarity"). It has notable peaks at "Geometric sequences" (~95%), "Prime factorization" (~95%), and "Polyhedra" (~95%). Its performance dips significantly on "Linear equations" (~40%), "One-variable statistics" (~45%), and "Two-variable statistics" (~35%).
**3. Yi-34B (Orange Line):**
* **Trend:** Typically the third-best performer, with its line often situated between the green (Qwen) and blue (InternLM) lines. It exhibits extreme volatility, with some of the highest peaks and lowest valleys on the chart.
* **Key Points:** Reaches near 100% on "Add & subtract" and "Prime or composite events." It suffers severe drops, notably on "Divide" (~45%), "Linear functions" (~30%), "Nonlinear functions" (~55%), and "Two-variable statistics" (~50%).
**4. InternLM2-20B (Blue Line):**
* **Trend:** Consistently the lowest-performing model across almost all topics. Its line is distinctly separated below the others, often fluctuating between 20% and 60% accuracy.
* **Key Points:** Its highest accuracy appears to be on "Add & subtract" (~65%) and "Variable exprs" (~35%). It has numerous points at or below 20%, including "Divide," "Linear functions," "Nonlinear functions," "Probability of compound events," "Scale drawings," "Two-variable statistics," "Center & variability," and "Radical exps."
### Key Observations
* **Universal Strength:** All four models perform well on foundational arithmetic ("Add & subtract"), with accuracies clustering between ~65% (InternLM) and ~100% (GPT-3.5).
* **Universal Challenge:** "Two-variable statistics" appears to be the most difficult topic overall, with all models scoring below 70%, and three models (InternLM, Yi, Qwen) scoring at or below 50%.
* **Performance Gap:** There is a consistent and significant performance gap between the top tier (GPT-3.5, Qwen-72B) and the bottom tier (InternLM2-20B), often spanning 30-50 percentage points on the same topic.
* **Volatility:** All models show high topic-dependent volatility. No model maintains a flat, high accuracy across the board. Performance is highly sensitive to the specific mathematical concept being tested.
* **Model Ranking Consistency:** The relative ranking of the models (GPT-3.5 > Qwen-72B > Yi-34B > InternLM2-20B) is remarkably consistent across the vast majority of topics.
### Interpretation
This chart provides a detailed benchmark of LLM capabilities in mathematical reasoning, revealing that model performance is not monolithic but highly domain-specific.
* **What the data suggests:** The data demonstrates that larger, more advanced models (GPT-3.5, Qwen-72B) have a substantially stronger grasp of a wide range of mathematical concepts compared to the other models tested. However, even the leading models have clear weaknesses in specific areas like statistics and certain algebraic functions.
* **How elements relate:** The x-axis represents a curriculum of mathematical knowledge. The chart effectively maps each model's "knowledge profile" against this curriculum. The close tracking of the GPT-3.5 and Qwen-72B lines suggests they may have been trained on similar data or have similar architectural strengths for math, while the distinct separation of the InternLM2-20B line indicates a different capability level.
* **Notable anomalies:** The extreme volatility within each model's line is the most striking feature. It indicates that "mathematical ability" in LLMs is not a single skill but a collection of competencies that can be strong in one area (e.g., geometry) and weak in another (e.g., statistics) within the same model. The near-perfect scores on some topics versus sub-50% scores on others for the same model highlight the importance of granular, topic-specific evaluation over aggregate benchmarks.
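The best-versus-worst gap noted in the observations (often 30-50 points on the same topic) can be computed directly from a concept-accuracy table. An illustrative sketch with made-up values:

```python
# Hypothetical per-concept accuracies (%); illustrative values only.
acc = {
    "GPT-3.5":       {"Add & subtract": 100, "Two-variable statistics": 70},
    "InternLM2-20B": {"Add & subtract": 65,  "Two-variable statistics": 20},
}

def per_concept_gap(acc):
    """Best-minus-worst model accuracy on each concept shared by all models."""
    shared = set.intersection(*(set(a) for a in acc.values()))
    return {c: max(a[c] for a in acc.values()) - min(a[c] for a in acc.values())
            for c in shared}

gaps = per_concept_gap(acc)
```

Large gaps mark concepts where model choice matters most; near-zero gaps mark concepts that discriminate poorly between models.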
</details>
Figure 14: Concept accuracies on Middle-EN of more models.
<details>
<summary>x24.png Details</summary>

### Visual Description
## Multi-Line Chart: Accuracy of Four Language Models Across Mathematical Problem Categories
### Overview
This image is a multi-line chart comparing the accuracy (0-100%) of four different large language models (LLMs) across a wide range of mathematical problem categories. The chart is dense, with each model's performance plotted as a distinct colored line connecting data points for each category. The overall visual impression is one of high variability, with significant performance differences between models and across categories.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, from 0 to 100.
* **Major Ticks/Gridlines:** At intervals of 20 (0, 20, 40, 60, 80, 100).
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains a dense series of categorical problem types.
* **Categories (Transcribed from left to right, with English glosses):** 三角形 (Triangle), 圆 (Circle), 平行四边形 (Parallelogram), 梯形 (Trapezoid), 平面图形综合 (Plane Figures, Comprehensive), 长方形 (Rectangle), 圆柱体 (Cylinder), 立体图形综合 (Solid Figures, Comprehensive), 和差问题 (Sum & Difference Problems), 和倍问题 (Sum & Multiple Problems), 差倍问题 (Difference & Multiple Problems), 归一问题 (Unitary Method), 归总问题 (Total Method), 周期问题 (Cycle Problems), 盈亏问题 (Surplus & Deficit Problems), 年龄问题 (Age Problems), 扩缩问题 (Scaling Problems), 植树问题 (Tree-Planting Problems), 鸡兔同笼 (Chickens & Rabbits in the Same Cage), 还原问题 (Working-Backwards Problems), 相遇问题 (Meeting Problems), 行程问题 (Travel Problems), 人民币问题 (Money/RMB Problems), 时间问题 (Time Problems), 浓度问题 (Concentration Problems), 温度问题 (Temperature Problems), 排列组合 (Permutations & Combinations), 统计指标 (Statistical Measures), 分数与分数 (Fractions with Fractions), 分数与小数 (Fractions & Decimals), 公约数与公倍数 (Common Divisors & Common Multiples), 公约数与倍数 (Common Divisors & Multiples), 因数与倍数综合 (Factors & Multiples, Comprehensive), 质数与合数 (Primes & Composites), 乘法问题 (Multiplication Problems), 定义新运算 (Newly Defined Operations), 方程问题 (Equation Problems), 除法问题 (Division Problems), 估算问题 (Estimation Problems), 百分数问题 (Percentage Problems).
* **Language:** The category labels on the chart are in **Chinese**; English glosses are given in parentheses above.
* **Legend:**
* **Position:** Top-center, above the plot area.
* **Content:** Four entries, each with a colored line and marker:
* **Blue line with circle marker:** Yi-6B
* **Orange line with circle marker:** ChatGLM3-6B
* **Green line with circle marker:** LLaMA2-7B
* **Red line with circle marker:** DeepSeekMath-7B
### Detailed Analysis
**Model Performance Trends (Trend Verification):**
1. **DeepSeekMath-7B (Red Line):**
* **Trend:** Exhibits the highest overall performance, frequently occupying the top position. Its line shows sharp peaks and valleys but maintains a generally high baseline.
* **Key Data Points:** Reaches near-perfect accuracy (~100%) on "鸡兔同笼" (Chickens & Rabbits in the Same Cage) and "因数与倍数综合" (Factors & Multiples, Comprehensive). Its lowest points are around 20-30% on categories like "归一问题" (Unitary Method) and "归总问题" (Total Method).
2. **ChatGLM3-6B (Orange Line):**
* **Trend:** Often the second-best performer, closely following DeepSeekMath-7B. It shows high volatility, with some categories where it matches or exceeds the red line, and others where it drops significantly.
* **Key Data Points:** Peaks at ~90% on "百分数问题" (Percentage Problems) and ~85% on "平面图形综合" (Plane Figures Comprehensive). Dips to ~20% on "归一问题" (Unitary Method).
3. **Yi-6B (Blue Line):**
* **Trend:** Generally performs in the middle-to-lower range among the four models. Its line is often below the orange and red lines but above the green line.
* **Key Data Points:** Has a notable peak of ~78% on "因数与倍数综合" (Factors and Multiples Comprehensive). Its performance is frequently in the 20-50% range.
4. **LLaMA2-7B (Green Line):**
* **Trend:** Consistently the lowest-performing model across almost all categories. Its line forms the bottom boundary of the data cluster.
* **Key Data Points:** Rarely exceeds 30% accuracy. Its highest point is approximately 30% on "百分数问题" (Percentage Problems). It frequently scores near 0% on multiple categories (e.g., "归一问题", "归总问题", "盈亏问题").
**Category Performance (Spatial Grounding & Data Extraction):**
Performance varies dramatically by category. Here is a sample of translated categories with approximate accuracy values for each model (values are visual estimates):
| Category (Chinese) | Category (English) | Yi-6B (Blue) | ChatGLM3-6B (Orange) | LLaMA2-7B (Green) | DeepSeekMath-7B (Red) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 三角形 | Triangle | ~40% | ~55% | ~20% | ~70% |
| 圆 | Circle | ~10% | ~25% | ~5% | ~70% |
| 平行四边形 | Parallelogram | ~45% | ~80% | ~25% | ~90% |
| 鸡兔同笼 | Rabbit & Chicken Cages | ~73% | ~87% | ~27% | **~100%** |
| 行程问题 | Travel/Distance Problems | ~30% | ~40% | ~15% | ~50% |
| 百分数问题 | Percentage Problems | ~75% | ~90% | ~30% | ~65% |
| 归一问题 | Unitary Method | ~10% | ~15% | **~0%** | ~20% |
### Key Observations
1. **Clear Performance Hierarchy:** A consistent ranking is visible: DeepSeekMath-7B > ChatGLM3-6B > Yi-6B > LLaMA2-7B across the vast majority of categories.
2. **High Category Sensitivity:** All models show extreme performance variance depending on the problem type. No model is uniformly strong or weak.
3. **Specialized Strengths:** DeepSeekMath-7B shows exceptional strength in specific, likely more complex or structured problem types like "鸡兔同笼" and "因数与倍数综合".
4. **Systematic Weakness in Foundational Methods:** All models, especially LLaMA2-7B, perform poorly on categories like "归一问题" (Unitary Method) and "归总问题" (Total Method), suggesting a common difficulty with these fundamental problem-solving frameworks.
5. **Visual Clustering:** The lines for Yi-6B (Blue) and LLaMA2-7B (Green) are often tightly clustered in the lower half of the chart, while ChatGLM3-6B (Orange) and DeepSeekMath-7B (Red) compete in the upper half.
### Interpretation
This chart provides a comparative benchmark of mathematical reasoning capabilities across four LLMs. The data suggests that **DeepSeekMath-7B is the most capable and robust model for this set of mathematical tasks**, demonstrating both high peak performance and a generally high floor. **ChatGLM3-6B is a strong competitor**, often close in performance.
The **dramatic category-by-category variation** is the most critical insight. It indicates that mathematical problem-solving is not a monolithic skill for these models; their proficiency is highly dependent on the specific type of logic, formula, or reasoning step required. The near-zero scores for LLaMA2-7B on foundational methods like "归一问题" highlight a potential gap in its training data or reasoning architecture for certain algorithmic approaches.
The chart effectively argues that evaluating LLMs on a single "math accuracy" score is insufficient. A granular, category-specific analysis, as presented here, is necessary to understand a model's true strengths and weaknesses, which is crucial for selecting the right model for specific educational, research, or application purposes in quantitative domains. The poor performance of all models on certain categories also points to persistent challenges in AI mathematical reasoning that remain unsolved.
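The concept-wise evaluation argued for here is straightforward to implement once per-problem grading results are available. Below is a minimal sketch; the function names and toy data are illustrative, not taken from the ConceptMath release:

```python
from collections import defaultdict

def concept_accuracies(results):
    """Aggregate per-problem grading results into per-concept accuracies (in %).

    `results` is an iterable of (concept, answered_correctly) pairs.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for concept, ok in results:
        totals[concept] += 1
        correct[concept] += int(ok)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}

def weak_concepts(acc_by_concept, threshold=20.0):
    """Concepts whose accuracy falls below `threshold` percent."""
    return sorted(c for c, a in acc_by_concept.items() if a < threshold)

# Toy grading log: (concept, answered_correctly)
results = [
    ("Unitary Method", False), ("Unitary Method", False),
    ("Percentage Problems", True), ("Percentage Problems", True),
    ("Percentage Problems", False),
]
acc = concept_accuracies(results)
# "Unitary Method" comes out at 0% even though the overall average is 40%,
# which a single aggregate score would hide.
```

The point of the per-concept breakdown is exactly this: a model's average accuracy can look respectable while individual concepts sit at or near zero.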
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Chart: Accuracy Comparison of Four AI Models Across Math Problem Categories
### Overview
This image is a line chart comparing the performance (accuracy) of four different large language models on a wide variety of math problem categories. The chart displays accuracy percentages on the y-axis against a dense list of specific math topics on the x-axis. The data reveals significant variability in model performance across different types of mathematical problems.
### Components/Axes
* **Chart Type:** Multi-series line chart.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, ranging from 0 to approximately 95 (implied by the highest data point).
* **Major Gridlines:** Horizontal dashed lines at 0, 20, 40, 60, and 80.
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains a dense list of math problem categories.
* **Categories (Transcribed from Chinese, left to right):**
1. 三角形 (Triangle)
2. 圆 (Circle)
3. 平行四边形 (Parallelogram)
4. 梯形 (Trapezoid)
5. 长方形 (Rectangle)
6. 平面图形综合 (Plane Figure Synthesis)
7. 角 (Angle)
8. 长方体 (Cuboid)
9. 正方体 (Cube)
10. 立体图形综合 (Solid Figure Synthesis)
11. 和差倍问题 (Sum-Difference-Multiple Problem)
12. 盈亏问题 (Surplus-Deficit Problem)
13. 归一问题 (Unitary Method Problem)
14. 归总问题 (Total Method Problem)
15. 和倍问题 (Sum-Multiple Problem)
16. 差倍问题 (Difference-Multiple Problem)
17. 植树问题 (Tree Planting Problem)
18. 年龄问题 (Age Problem)
19. 鸡兔同笼 (Chicken and Rabbit in the Same Cage)
20. 行程问题 (Travel Problem)
21. 工程问题 (Work Problem)
22. 浓度问题 (Concentration Problem)
23. 经济问题 (Economic Problem)
24. 数列与数表 (Sequence and Number Table)
25. 排列组合 (Permutation and Combination)
26. 统计与概率 (Statistics and Probability)
27. 分数应用题 (Fraction Word Problems)
28. 分数百分数 (Fraction Percentage)
29. 公式计算 (Formula Calculation)
30. 公因数与公倍数 (Common Factor and Common Multiple)
31. 因数与倍数 (Factor and Multiple)
32. 质数合数 (Prime and Composite Numbers)
33. 定义新运算 (New Operation Definition)
34. 方程与方程组 (Equation and System of Equations)
35. 除法与减法 (Division and Subtraction)
36. 模糊问题 (Fuzzy Problem)
37. 百分率问题 (Percentage Rate Problem)
* **Legend:** Located at the top center of the chart.
* **Blue line with circle markers:** InternLM2-Math-7B
* **Orange line with circle markers:** InternLM2-7B
* **Green line with circle markers:** MAmmoTH-13B
* **Red line with circle markers:** WizardMath-13B
### Detailed Analysis
**Trend Verification & Data Points (Approximate):**
* **InternLM2-Math-7B (Blue):** Exhibits a highly volatile, "spiky" trend with frequent, sharp peaks and troughs. It achieves the highest overall accuracy on the chart.
* **Notable Peaks:** ~85% (三角形/Triangle), ~85% (和差倍问题/Sum-Difference-Multiple), ~92% (植树问题/Tree Planting), ~88% (排列组合/Permutation and Combination), ~85% (因数与倍数/Factor and Multiple).
* **Notable Troughs:** ~20% (长方体/Cuboid), ~15% (鸡兔同笼/Chicken and Rabbit), ~30% (行程问题/Travel Problem), ~20% (分数百分数/Fraction Percentage).
* **InternLM2-7B (Orange):** Also shows high volatility, often tracking closely with the blue line but generally at a slightly lower accuracy level, with some exceptions.
* **Notable Peaks:** ~95% (植树问题/Tree Planting - highest point on chart), ~83% (经济问题/Economic Problem), ~90% (排列组合/Permutation and Combination), ~85% (百分率问题/Percentage Rate Problem).
* **Notable Troughs:** ~20% (梯形/Trapezoid), ~25% (归一问题/Unitary Method), ~20% (鸡兔同笼/Chicken and Rabbit), ~25% (方程与方程组/Equation).
* **MAmmoTH-13B (Green):** Shows a less volatile but consistently lower performance trend compared to the blue and orange lines. Its accuracy rarely exceeds 60%.
* **Notable Peaks:** ~65% (排列组合/Permutation and Combination), ~55% (因数与倍数/Factor and Multiple), ~45% (圆/Circle).
* **Notable Troughs:** ~0% (长方形/Rectangle, 鸡兔同笼/Chicken and Rabbit, 统计与概率/Statistics and Probability), ~5% (和差倍问题/Sum-Difference-Multiple).
* **WizardMath-13B (Red):** Demonstrates the lowest overall performance, with accuracy frequently at or near 0% for many categories. Its highest points are modest.
* **Notable Peaks:** ~40% (百分率问题/Percentage Rate Problem), ~30% (排列组合/Permutation and Combination), ~25% (公式计算/Formula Calculation).
* **Notable Troughs:** 0% (observed for at least 10 categories including 长方形/Rectangle, 盈亏问题/Surplus-Deficit, 鸡兔同笼/Chicken and Rabbit, 统计与概率/Statistics and Probability).
### Key Observations
1. **Performance Hierarchy:** A clear performance hierarchy is visible: InternLM2-Math-7B (Blue) and InternLM2-7B (Orange) are the top performers, followed by MAmmoTH-13B (Green), with WizardMath-13B (Red) performing the worst.
2. **Problem-Specific Strengths/Weaknesses:** All models show extreme variability. No single model dominates all categories. For example, the "Tree Planting" (植树问题) category sees peak performance for both top models, while "Chicken and Rabbit in the Same Cage" (鸡兔同笼) is a major weakness for all.
3. **Correlation:** The blue and orange lines (both InternLM2 variants) show a strong positive correlation in their peaks and troughs, suggesting similar underlying strengths and weaknesses despite the "Math" fine-tuning of the blue model.
4. **Zero-Performance Categories:** WizardMath-13B (Red) and MAmmoTH-13B (Green) hit 0% accuracy on multiple categories, indicating a complete failure to solve those problem types in this evaluation.
### Interpretation
This chart provides a granular diagnostic of model capabilities across the spectrum of elementary to middle school mathematics. The data suggests that:
* **Specialization Matters:** The "InternLM2-Math-7B" model, presumably fine-tuned for math, does not consistently outperform its base "InternLM2-7B" counterpart. Their performance is intertwined, indicating that general capability may be as important as specialized tuning for this broad set of tasks.
* **The "Long Tail" of Math Problems:** The extreme volatility highlights that AI math proficiency is not monolithic. Models have "blind spots" (e.g., classic puzzles like "Chicken and Rabbit") and areas of relative strength (e.g., combinatorics, percentage problems). This implies that benchmarking on a narrow set of problems can be misleading.
* **Model Robustness:** The consistent underperformance and frequent 0% scores of WizardMath-13B and MAmmoTH-13B suggest they may lack the robustness or foundational reasoning skills required for this diverse problem set compared to the InternLM2 family.
* **Investigative Insight (Peircean):** The chart acts as a **diagnostic index**. The shared troughs (like "Chicken and Rabbit") point to a class of problems that remain a fundamental challenge for current LLMs, possibly due to their reliance on specific logical framing or algebraic translation that models struggle with. The shared peaks suggest problem types that are more amenable to pattern recognition or have been well-represented in training data. The divergence between the two top models on specific categories could be a fruitful area for investigating the effects of math-specific fine-tuning.
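The "shared trough" analysis described above can be automated: given each model's per-category accuracies, list the categories where every model falls below a threshold. A minimal sketch with hypothetical numbers that loosely echo the chart (names and values are illustrative):

```python
def shared_troughs(acc, threshold=40.0):
    """Categories where every model scores below `threshold` percent.

    `acc` maps model name -> {category: accuracy}; only categories
    reported for all models are considered.
    """
    models = list(acc.values())
    common = set.intersection(*(set(m) for m in models))
    return sorted(c for c in common if all(m[c] < threshold for m in models))

# Hypothetical per-category accuracies, loosely echoing the chart
acc = {
    "InternLM2-Math-7B": {"Tree Planting": 92, "Chicken and Rabbit": 15},
    "InternLM2-7B":      {"Tree Planting": 95, "Chicken and Rabbit": 20},
    "MAmmoTH-13B":       {"Tree Planting": 35, "Chicken and Rabbit": 0},
    "WizardMath-13B":    {"Tree Planting": 10, "Chicken and Rabbit": 0},
}
troughs = shared_troughs(acc)  # ["Chicken and Rabbit"]
```

Categories returned by such a filter are candidates for the "fundamental challenge" class discussed above, as opposed to weaknesses of any single model.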
</details>
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Chart: Accuracy Comparison of Four AI Models Across Chinese Math Problem Categories
### Overview
This image is a line chart comparing the performance (accuracy) of four different large language models (LLMs) on a wide variety of Chinese-language mathematics problem categories. The chart displays the accuracy percentage for each model across more than 40 distinct problem types, revealing significant variability in performance both between models and across different mathematical domains.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Y-Axis:**
* **Label:** "Accuracy" (written vertically on the left side).
* **Scale:** Linear scale from 0 to 100.
* **Major Ticks/Gridlines:** At 0, 20, 40, 60, 80, 100. Horizontal dashed gridlines extend from these ticks across the chart.
* **X-Axis:**
* **Label:** None explicitly stated. The axis represents discrete categories of math problems.
* **Tick Labels:** A long series of Chinese text labels, each representing a specific math problem category. They are rotated approximately 45 degrees for readability.
* **Legend:**
* **Position:** Centered at the top of the chart, inside the plot area.
* **Content:** Four entries, each with a colored line segment and marker:
* **Blue line with circle marker:** `Baichuan2-13B`
* **Orange line with circle marker:** `LLaMA2-13B`
* **Green line with circle marker:** `Qwen-14B`
* **Red line with circle marker:** `InternLM2-Math-20B`
### Detailed Analysis
**X-Axis Categories (Translated from Chinese):**
The categories, from left to right, are:
1. 三角形 (Triangle)
2. 圆 (Circle)
3. 平行四边形 (Parallelogram)
4. 梯形 (Trapezoid)
5. 平面图形综合 (Plane Figure Synthesis)
6. 长方体 (Cuboid)
7. 圆柱 (Cylinder)
8. 圆锥 (Cone)
9. 立体图形综合 (Solid Figure Synthesis)
10. 和差问题 (Sum and Difference Problem)
11. 提问问题 (Question Problem - *likely a specific problem type*)
12. 归一问题 (Unitary Method Problem)
13. 和倍问题 (Sum and Multiple Problem)
14. 差倍问题 (Difference and Multiple Problem)
15. 对称问题 (Symmetry Problem)
16. 工程问题 (Work Problem)
17. 年龄问题 (Age Problem)
18. 扩倍问题 (Expansion and Multiple Problem)
19. 积木问题 (Block Problem - *likely spatial reasoning*)
20. 交通问题 (Traffic Problem)
21. 鸡兔同笼 (Chicken and Rabbit in the Same Cage)
22. 相遇问题 (Meeting Problem)
23. 行程问题 (Travel Problem)
24. 人民币问题 (RMB/Currency Problem)
25. 计数问题 (Counting Problem)
26. 浓度问题 (Concentration Problem)
27. 盈亏问题 (Surplus and Deficit Problem)
28. 面积问题 (Area Problem)
29. 统计图表 (Statistical Charts)
30. 指数律 (Exponent Laws)
31. 分数与小数 (Fractions and Decimals)
32. 分数应用题 (Fraction Word Problems)
33. 公因数与公倍数 (Common Factors and Multiples)
34. 因数与倍数综合 (Factors and Multiples Synthesis)
35. 比和比例综合 (Ratio and Proportion Synthesis)
36. 案例问题 (Case Problem)
37. 定义新运算 (Define New Operations)
38. 方程与方程组 (Equations and Systems of Equations)
39. 除法与减法 (Division and Subtraction)
40. 倍数问题 (Multiple Problem)
41. 移动问题 (Movement Problem)
42. 百分率问题 (Percentage Problem)
**Model Performance Trends (Visual Verification):**
* **InternLM2-Math-20B (Red Line):** This line is frequently the highest on the chart, showing a generally upward trend with high volatility. It peaks at or near 100% accuracy for "Percentage Problem" (far right) and shows very high accuracy (>90%) for categories like "Sum and Difference Problem", "Unitary Method Problem", and "Define New Operations". Its lowest points are around 20-40% for categories like "Statistical Charts" and "Concentration Problem".
* **Baichuan2-13B (Blue Line):** This line is highly volatile, often competing with the red line for the top position but also dropping significantly. It shows strong performance (>80%) in "Triangle", "Circle", "Sum and Difference Problem", and "Define New Operations". It has notable dips below 40% in areas like "立体图形综合 (Solid Figure Synthesis)" and "统计图表 (Statistical Charts)".
* **Qwen-14B (Green Line):** This line generally occupies the middle-to-lower range of accuracy. It has a significant peak above 90% for "鸡兔同笼 (Chicken and Rabbit in the Same Cage)" but otherwise mostly stays between 20% and 60%. It shows a notable dip to 0% for "盈亏问题 (Surplus and Deficit Problem)".
* **LLaMA2-13B (Orange Line):** This line is consistently the lowest-performing model across almost all categories. Its accuracy rarely exceeds 40%, with many points at or near 0%. Its highest points are around 60-65% for "分数与小数 (Fractions and Decimals)" and "百分率问题 (Percentage Problem)".
### Key Observations
1. **Performance Hierarchy:** There is a clear, though not absolute, hierarchy: InternLM2-Math-20B ≥ Baichuan2-13B > Qwen-14B > LLaMA2-13B.
2. **Domain Specificity:** All models show extreme variability. No model is uniformly good or bad. Performance is highly dependent on the specific math domain. For example, Qwen-14B excels at "鸡兔同笼" (Chicken and Rabbit in the Same Cage) but fails at "盈亏问题" (Surplus and Deficit Problem).
3. **Common Struggles:** The category "统计图表 (Statistical Charts)" appears to be challenging for all models, with accuracies clustered between ~20% and ~50%.
4. **Model Strengths:**
* **InternLM2-Math-20B:** Shows particular strength in algebraic and arithmetic word problems (e.g., Sum/Difference, Unitary Method, Define New Operations).
* **Baichuan2-13B:** Shows strength in geometry (Triangle, Circle) and some word problems.
* **Qwen-14B:** Has a standout performance on the classic "鸡兔同笼" (Chicken and Rabbit in the Same Cage) problem.
* **LLaMA2-13B:** Shows relative strength in foundational arithmetic (Fractions/Decimals, Percentages) compared to its own performance on other topics.
5. **Volatility:** The blue (Baichuan) and red (InternLM) lines are the most volatile, indicating their performance is the most sensitive to the problem type.
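Volatility of this kind can be quantified by summarizing each model's accuracy distribution across categories. A minimal sketch (the choice of statistics and the example numbers are illustrative):

```python
import statistics

def robustness_profile(per_category_acc):
    """Return (mean, population std dev, min) of a model's category accuracies.

    A large std dev corresponds to a 'volatile' line in the chart; a low
    min flags a catastrophic failure on at least one category.
    """
    vals = list(per_category_acc.values())
    return statistics.mean(vals), statistics.pstdev(vals), min(vals)

# Hypothetical three-category profile for one model
mean, spread, floor = robustness_profile(
    {"Triangle": 80, "Travel": 40, "Pie Chart": 60}
)
```

Comparing (mean, spread, floor) triples across models separates "high average, high variance" behavior from genuinely robust performance.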
### Interpretation
This chart provides a granular benchmark of LLM capabilities in mathematical reasoning within the Chinese language context. The data suggests that:
1. **Specialization Over Generalization:** The models, especially the top performers, are not general-purpose math solvers. Their capabilities are highly specialized. The InternLM2-Math-20B model, likely fine-tuned for mathematics, demonstrates the benefit of domain-specific training, but even it has clear weaknesses.
2. **The "Chinese Math Problem" Spectrum:** The x-axis represents a comprehensive curriculum of Chinese elementary and middle school math. The chart effectively maps which parts of this curriculum are more or less accessible to current LLMs. Foundational arithmetic and classic puzzle types (鸡兔同笼, Chicken and Rabbit in the Same Cage) are more accessible than applied topics like statistics or complex concentration problems.
3. **Model Architecture and Training Data Implications:** The stark difference between LLaMA2-13B (a general English-centric model) and the others (likely with more Chinese and/or math-specific data) highlights the critical role of pre-training data composition and potential fine-tuning for achieving proficiency in specific domains and languages.
4. **A Diagnostic Tool:** For a researcher, this chart is a diagnostic map. It doesn't just say "Model X is better." It shows *where* and *by how much* it is better, and more importantly, *where it fails*. This is crucial for guiding future model development, indicating which mathematical reasoning skills (e.g., handling statistical data, understanding concentration) require more focused training or architectural innovation.
**In summary, the image is a dense, information-rich performance matrix. It moves beyond aggregate scores to reveal the nuanced, domain-specific landscape of AI mathematical reasoning in Chinese, highlighting both significant progress and persistent challenges.**
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
## Multi-Line Chart: Accuracy of Four AI Models Across Chinese Math Problem Categories
### Overview
This image is a multi-line chart comparing the accuracy (0-100%) of four large language models (LLMs) across a wide range of Chinese mathematics problem categories. The chart is dense, with each model's performance plotted as a distinct colored line connecting data points for each category on the x-axis.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, from 0 to 100.
* **Major Ticks/Gridlines:** At 0, 20, 40, 60, 80, 100.
* **X-Axis:**
* **Label:** None explicit. Contains category labels in Chinese.
* **Categories (Left to Right):** 三角形 (Triangle), 平行四边形 (Parallelogram), 梯形 (Trapezoid), 长方形 (Rectangle), 平面图形综合 (Plane Figure Synthesis), 长方体 (Cuboid), 圆柱 (Cylinder), 立体图形综合 (Solid Figure Synthesis), 和差倍问题 (Sum-Difference-Multiple Problem), 基础应用题 (Basic Word Problems), 发展问题 (Development Problem), 归一问题 (Unitary Method Problem), 和周问题 (Sum-Perimeter Problem), 对称问题 (Symmetry Problem), 工程问题 (Work Problem), 折线统计图 (Line Graph), 扇形统计图 (Pie Chart), 概率问题 (Probability Problem), 鸡兔同笼问题 (Chicken-and-Rabbit Problem), 行程问题 (Travel Problem), 人民币问题 (RMB/Currency Problem), 温度问题 (Temperature Problem), 面积问题 (Area Problem), 统计与概率 (Statistics and Probability), 分数与小数 (Fractions and Decimals), 分数的意义与性质 (Meaning and Properties of Fractions), 公因数与公倍数 (Common Factors and Common Multiples), 公约数问题 (Common Divisor Problem), 因数与倍数综合 (Factors and Multiples Synthesis), 乘法问题 (Multiplication Problem), 定义新运算问题 (New Operation Definition Problem), 方程与解方程 (Equations and Solving Equations), 解比例问题 (Solving Proportions), 比例问题 (Proportion Problem), 百分数问题 (Percentage Problem).
* **Legend:**
* **Position:** Top center, above the plot area.
* **Items:**
1. **InternLM2-20B:** Blue line with circle markers.
2. **Yi-34B:** Orange line with diamond markers.
3. **Qwen-72B:** Green line with pentagon markers.
4. **GPT-3.5:** Red line with square markers.
### Detailed Analysis
**Trend Verification & Data Extraction (Approximate Values):**
The chart shows high volatility, with accuracy swinging dramatically between categories for all models. No single model consistently dominates across all topics.
* **InternLM2-20B (Blue, Circles):**
* **Trend:** Highly variable, often in the middle-to-lower range compared to others. Shows several deep troughs.
* **Notable Points:** Low points around 10-20% in categories like "和差倍问题" (Sum-Difference-Multiple Problem) and "行程问题" (Travel Problem). Peaks near 80% in "平面图形综合" (Plane Figure Synthesis) and "百分数问题" (Percentage Problem).
* **Yi-34B (Orange, Diamonds):**
* **Trend:** Also highly variable, with extreme highs and lows. Notably, it hits 0% in one category.
* **Notable Points:** Hits 0% in "对称问题" (Symmetry Problem). Has very high peaks (~95%) in "分数与小数" (Fractions and Decimals) and "因数与倍数综合" (Factors and Multiples Synthesis). Generally performs poorly on geometry and word problems.
* **Qwen-72B (Green, Pentagons):**
* **Trend:** Appears to be the most consistently high-performing model, frequently occupying the top position. Its line is often the upper envelope of the chart.
* **Notable Points:** Reaches or nears 100% in several categories: "平面图形综合" (Plane Figure Synthesis), "折线统计图" (Line Graph), "扇形统计图" (Pie Chart), "分数与小数" (Fractions and Decimals). Its lowest points are still relatively high, rarely dropping below 40%.
* **GPT-3.5 (Red, Squares):**
* **Trend:** Volatile, often competing with Qwen-72B for the top spot but with more pronounced dips.
* **Notable Points:** Hits 100% in "扇形统计图" (Pie Chart). Has strong performance in statistics and probability topics. Shows a significant dip below 20% in "行程问题" (Travel Problem).
### Key Observations
1. **Category Difficulty:** Certain categories appear universally challenging, causing sharp drops for all models. Examples include "行程问题" (Travel Problems) and "鸡兔同笼问题" (Chicken-and-Rabbit Cage Problems), where most models score below 60%.
2. **Model Strengths:**
* **Qwen-72B** excels in geometry ("平面图形综合"), statistics ("折线/扇形统计图"), and number theory ("分数与小数").
* **GPT-3.5** shows particular strength in statistics and probability.
* **Yi-34B** has isolated high peaks in specific number theory topics but is unreliable.
3. **Model Weaknesses:** All models struggle with complex word problems ("行程问题" Travel Problems, "鸡兔同笼问题" Chicken-and-Rabbit Problems) and certain applied concepts.
4. **Extreme Outlier:** Yi-34B's 0% accuracy in "对称问题" (Symmetry Problems) is a critical failure point.
### Interpretation
This chart provides a granular benchmark of LLM capabilities in Chinese elementary and middle school mathematics. The data suggests:
* **No Universal Solver:** No single model is superior across all mathematical domains. Model selection should be task-specific.
* **Specialization vs. Generalization:** Qwen-72B demonstrates the most robust generalization, maintaining high accuracy across diverse topics. Yi-34B appears more specialized, with extreme variance indicating potential gaps in its training data or reasoning for certain problem types.
* **The Challenge of Applied Math:** The consistent poor performance on word problems ("行程" travel, "鸡兔同笼" chicken-and-rabbit) highlights a persistent weakness in LLMs: translating textual narratives into formal mathematical models and solving them step-by-step. This requires robust reasoning and planning, not just pattern recognition.
* **Data for Improvement:** The specific categories where models fail (e.g., GPT-3.5 on "行程问题" Travel Problems, Yi-34B on "对称问题" Symmetry Problems) provide clear targets for diagnostic analysis and future model training or fine-tuning.
**Language Note:** All category labels on the X-axis are in Chinese (Simplified). English translations have been provided in the Components/Axes section.
</details>
Figure 15: Concept accuracies on Elementary-ZH of more models.
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: Accuracy of Four AI Models Across Mathematical Topics
### Overview
This image is a line chart comparing the performance (accuracy) of four different large language models (LLMs) on a wide range of mathematical topics. The chart displays the accuracy percentage for each model across approximately 40 distinct mathematical concepts, primarily from middle school and high school curricula. The data is presented as four interconnected line series, each with markers at data points.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Legend:** Positioned at the top center of the chart area. It defines four data series:
* **Yi-6B:** Blue line with circular markers.
* **ChatGLM3-6B:** Orange line with circular markers.
* **LLaMA2-7B:** Green line with circular markers.
* **DeepSeekMath-7B:** Red line with circular markers.
* **Y-Axis (Vertical):**
* **Label:** "Accuracy" (written vertically).
* **Scale:** Linear scale from 0 to approximately 90.
* **Major Gridlines:** Horizontal dashed lines at intervals of 20 (0, 20, 40, 60, 80).
* **X-Axis (Horizontal):**
* **Label:** Not explicitly labeled with a title, but contains category labels for mathematical topics.
* **Category Labels:** A dense series of labels written in Chinese, rotated at a 45-degree angle for readability. These represent the specific mathematical topics tested.
* **Language:** The primary language of the axis labels is **Chinese**. An English translation is provided below for each label.
### Detailed Analysis
The chart plots accuracy values for each model on each topic. Below is a reconstruction of the data, reading the approximate values from the chart. Values are estimated to the nearest 5% due to visual interpretation.
**X-Axis Categories (Chinese -> English Translation):**
1. 全等三角形 -> Congruent Triangles
2. 等腰三角形 -> Isosceles Triangles
3. 等边三角形 -> Equilateral Triangles
4. 平行四边形 -> Parallelograms
5. 圆周角 -> Inscribed Angle
6. 圆心角 -> Central Angle
7. 弧长和扇形面积 -> Arc Length and Sector Area
8. 点与圆的位置关系 -> Positional Relationship between a Point and a Circle
9. 函数与一元一次方程 -> Function and Linear Equation in One Variable
10. 函数与一元一次不等式 -> Function and Linear Inequality in One Variable
11. 函数与二元一次方程组 -> Function and System of Linear Equations in Two Variables
12. 求一次函数的解析式 -> Finding the Analytic Expression of a Linear Function
13. 二次函数的定义 -> Definition of Quadratic Function
14. 反比例函数的定义 -> Definition of Inverse Proportional Function
15. 反比例函数的性质 -> Properties of Inverse Proportional Function
16. 有理数的乘方 -> Exponentiation of Rational Numbers
17. 点的坐标与象限 -> Coordinates of a Point and Quadrants
18. 同底数幂的乘法 -> Multiplication of Powers with the Same Base
19. 约分与通分 -> Reducing Fractions and Finding a Common Denominator
20. 十字相乘法 -> Cross Multiplication Method
21. 提公因式法 -> Factoring by Common Factor
22. 流水问题 -> Boat-and-Stream (Current) Problems
23. 鸡兔同笼 -> Chicken and Rabbit in the Same Cage (Classic Problem)
24. 整式的乘法与加减 -> Multiplication and Addition/Subtraction of Polynomials
25. 平方差公式 -> Difference of Squares Formula
26. 完全平方公式 -> Perfect Square Formula
27. 二次根式的乘除 -> Multiplication and Division of Quadratic Radicals
28. 二次根式的加减 -> Addition and Subtraction of Quadratic Radicals
29. 二次根式的化简 -> Simplification of Quadratic Radicals
30. 二次根式的运算 -> Operations with Quadratic Radicals
31. 一元二次方程的根 -> Roots of a Quadratic Equation in One Variable
32. 解一元二次方程 -> Solving Quadratic Equations in One Variable
33. 一元二次方程的应用 -> Application of Quadratic Equations in One Variable
34. 一元二次不等式的解法 -> Solving Quadratic Inequalities in One Variable
35. 解一元二次不等式 -> Solving Quadratic Inequalities in One Variable (Repeated/Variant)
36. 分式方程的应用 -> Application of Fractional Equations
37. 数据的波动中的极差 -> Range in Data Fluctuation
38. 数据的波动中的方差 -> Variance in Data Fluctuation
39. 频率的求解概率 -> Solving Probability using Frequency
40. 随机事件与概率 -> Random Events and Probability
**Data Series Trends & Approximate Values:**
* **Trend Verification:**
* **DeepSeekMath-7B (Red):** Generally the highest-performing series, with frequent peaks above 70% and several above 80%. It shows high volatility, with sharp drops on some topics (e.g., near 0% on "Chicken and Rabbit" problem).
* **ChatGLM3-6B (Orange):** Often the second-highest, closely following the red line. It has the single highest peak on the chart (approx. 90% on "Function and Linear Equation"). It also experiences significant drops.
* **Yi-6B (Blue):** Typically performs in the middle range, between 20% and 60%. It has a notable low point (approx. 5%) on "Solving Quadratic Inequalities".
* **LLaMA2-7B (Green):** Generally the lowest-performing series, frequently below 40%. It has several very low points (below 10%) on topics like "Congruent Triangles" and "Chicken and Rabbit" problem.
* **Sample Data Points (First 10 Topics):**
| Topic (English) | Yi-6B (Blue) | ChatGLM3-6B (Orange) | LLaMA2-7B (Green) | DeepSeekMath-7B (Red) |
| :--- | :--- | :--- | :--- | :--- |
| Congruent Triangles | ~40% | ~50% | ~40% | ~55% |
| Isosceles Triangles | ~35% | ~20% | ~25% | ~70% |
| Equilateral Triangles | ~35% | ~45% | ~10% | ~65% |
| Parallelograms | ~50% | ~60% | ~20% | ~75% |
| Inscribed Angle | ~30% | ~55% | ~15% | ~70% |
| Central Angle | ~5% | ~0% | ~5% | ~30% |
| Arc Length/Sector Area | ~30% | ~30% | ~15% | ~30% |
| Point & Circle Relation | ~25% | ~40% | ~20% | ~55% |
| Function & Linear Eq. | ~50% | ~90% | ~65% | ~85% |
| Function & Linear Ineq. | ~45% | ~55% | ~35% | ~70% |
### Key Observations
1. **High Variability:** Performance for all models is extremely topic-dependent. No model is consistently superior across all mathematical domains.
2. **Model Hierarchy:** A rough performance hierarchy is visible: DeepSeekMath-7B (Red) ≥ ChatGLM3-6B (Orange) > Yi-6B (Blue) > LLaMA2-7B (Green). However, this order flips on specific topics.
3. **Notable Outliers:**
* **High Peaks:** ChatGLM3-6B achieves ~90% on "Function and Linear Equation". DeepSeekMath-7B peaks near 85% on several topics.
* **Severe Drops:** Multiple models score near 0% on "Chicken and Rabbit in the Same Cage" (Topic 23). LLaMA2-7B scores very low (~5%) on "Congruent Triangles". Yi-6B drops to ~5% on "Solving Quadratic Inequalities".
4. **Topic Difficulty:** Some topics appear universally challenging (e.g., "Chicken and Rabbit", "Central Angle"), where all models score below 40%. Others, like "Function and Linear Equation", see high scores from multiple models.
5. **Specialization:** DeepSeekMath-7B shows particular strength in algebraic and function-related topics (e.g., topics 9-15, 31-35). LLaMA2-7B struggles significantly with geometry topics (e.g., topics 1-8).
### Interpretation
This chart provides a granular comparison of mathematical reasoning capabilities across four LLMs. The data suggests that:
* **Model Architecture and Training Data Matter:** The superior performance of DeepSeekMath-7B, likely a model fine-tuned for mathematics, indicates that specialized training can yield significant gains in specific domains like math problem-solving.
* **Mathematical Reasoning is Not Monolithic:** The extreme volatility in scores shows that "math ability" in AI is not a single skill. Proficiency in algebra does not guarantee proficiency in geometry or word problems. Models have distinct strengths and weaknesses.
* **Classic Problems Remain a Challenge:** The near-zero scores on the "Chicken and Rabbit" problem—a classic logic puzzle—highlight a potential weakness in handling certain types of structured, non-computational word problems, even for otherwise strong models.
* **Benchmarking Value:** For developers or researchers, this chart is valuable for identifying which model might be best suited for a specific educational application (e.g., a geometry tutor vs. an algebra solver). It also pinpoints specific areas (like "Central Angle" or "Chicken and Rabbit" problems) where all current models need improvement, guiding future research and fine-tuning efforts.
In essence, the chart moves beyond a single "accuracy" score to reveal the complex, topic-dependent landscape of AI mathematical reasoning, emphasizing that model selection should be guided by the specific task at hand.
</details>
<details>
<summary>x29.png Details</summary>

### Visual Description
## Multi-Line Chart: Model Accuracy Across Mathematical Topics
### Overview
This is a multi-line chart comparing the accuracy (0-100%) of four different large language models across a wide range of mathematical topics. The chart is dense, with approximately 40 distinct topics plotted on the x-axis. The overall visual impression is one of high variability, with models showing significant performance differences depending on the specific mathematical domain.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, from 0 to 100.
* **Major Ticks:** 0, 20, 40, 60, 80, 100.
* **Grid Lines:** Horizontal dashed lines at each major tick.
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains a series of mathematical topic names.
* **Language:** The labels are in **Chinese**.
* **Content:** A dense series of approximately 40 mathematical topic labels, rotated at a 45-degree angle for readability. The topics span geometry, algebra, functions, equations, and probability/statistics.
* **Legend:**
* **Position:** Top center, above the plot area.
* **Content:** Four entries, each with a colored line and marker:
1. **Blue line with circle markers:** `InternLM2-Math-7B`
2. **Orange line with circle markers:** `InternLM2-7B`
3. **Green line with circle markers:** `MAmmoTH-13B`
4. **Red line with circle markers:** `WizardMath-13B`
### Detailed Analysis
**Trend Verification & Data Points (Approximate):**
The chart shows highly variable performance. No single model dominates across all topics. The lines frequently cross, indicating model strengths are topic-specific.
* **InternLM2-Math-7B (Blue):** This model shows the highest peak performance, reaching near 100% accuracy on one topic (likely "有理数的混合运算" - Mixed Operations with Rational Numbers). It generally performs in the upper tier (40-80% range) for many topics but has significant dips, including one near 0% (likely "随机事件与概率" - Random Events and Probability). Its trend is highly volatile.
* **InternLM2-7B (Orange):** This model closely follows the trend of its math-specialized counterpart (Blue) but often at a slightly lower accuracy level. Its peaks are strong (80-85%) but not as high as the Blue model's maximum. It also experiences deep troughs, sometimes below 20%.
* **MAmmoTH-13B (Green):** This model's performance is generally in the middle-to-lower range (10-60%). It has a few notable peaks around 60% but is frequently the third-best performer. Its trend line is less volatile than the top two but still shows significant variation.
* **WizardMath-13B (Red):** This model consistently performs the worst across almost all topics. Its accuracy rarely exceeds 40% and frequently hovers between 0-20%. It has a few small peaks but is the clear bottom performer in this comparison.
**Cluster Analysis (Grouping similar x-axis positions):**
* **High-Performance Cluster (Left side, ~60-70% for top models):** Topics like "全等三角形" (Congruent Triangles), "等腰三角形" (Isosceles Triangles), "平行四边形" (Parallelograms). Blue and Orange models lead here.
* **Peak Performance Cluster (Center):** A sharp peak for Blue (~100%) and Orange (~85%) on a topic related to rational number operations. Green peaks here as well (~50%).
* **Low-Performance Cluster (Right side):** Topics related to probability and statistics ("数据的收集、整理与描述" - Data Collection, Organization, and Description; "随机事件与概率" - Random Events and Probability). All models show a sharp decline, with WizardMath (Red) and often MAmmoTH (Green) near 0-10%.
* **Notable Outlier Point:** There is a data point where the Red line (WizardMath) spikes to ~60%, briefly matching the Green line (MAmmoTH). This occurs on a topic in the middle of the chart (possibly "分式方程" - Fractional Equations).
### Key Observations
1. **Specialization Matters:** The `InternLM2-Math-7B` (Blue) model, presumably fine-tuned for mathematics, achieves the highest overall accuracy and generally outperforms the base `InternLM2-7B` (Orange), though the base model remains competitive.
2. **Consistent Underperformance:** `WizardMath-13B` (Red) is consistently the lowest-performing model across nearly the entire spectrum of topics tested.
3. **Topic-Dependent Difficulty:** All models struggle significantly with topics on the far right of the chart (probability/statistics), suggesting these are more challenging for the evaluated models than core geometry or algebra topics.
4. **High Variability:** Performance is not stable; accuracy can swing by 50-80 percentage points between adjacent topics for the same model, indicating that mathematical domain is a critical factor in model performance.
### Interpretation
This chart provides a comparative benchmark of mathematical reasoning capabilities across four LLMs. The data suggests that:
* **Mathematical fine-tuning is effective:** The `InternLM2-Math-7B` model's superior peak performance and general lead over its base variant demonstrate that targeted training improves mathematical problem-solving accuracy.
* **There is no universal "best" math model:** While Blue leads overall, Orange sometimes matches or exceeds it on specific topics. The choice of model could depend on the specific mathematical domain of interest.
* **Probability and statistics represent a significant challenge:** The uniform poor performance of all models on the rightmost topics indicates a potential weakness in current LLMs for handling uncertainty, data analysis, and probabilistic reasoning compared to deterministic algebraic or geometric reasoning.
* **WizardMath-13B's architecture or training may be ill-suited** for this broad set of mathematical tasks, as it fails to achieve competitive accuracy on any topic.
**Language Note:** The primary language of the chart's x-axis labels is **Chinese**. The English translation for the axis topics is provided in the analysis where relevant (e.g., "全等三角形" = Congruent Triangles). The model names and axis title ("Accuracy") are in English.
</details>
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Chart: Accuracy Comparison of Four Language Models on Mathematical Topics
### Overview
This image is a line chart comparing the performance (accuracy) of four different large language models (LLMs) across a wide range of mathematical topics. The chart displays the accuracy percentage for each model on each topic, allowing for a direct comparison of their strengths and weaknesses in mathematical reasoning. The data is presented as four distinct, jagged lines, each corresponding to a specific model.
### Components/Axes
* **Chart Type:** Multi-line chart.
* **Y-Axis:**
* **Label:** "Accuracy" (written vertically on the left side).
* **Scale:** Linear scale from 0 to approximately 90.
* **Major Gridlines:** Horizontal dashed lines at intervals of 20 (0, 20, 40, 60, 80).
* **X-Axis:**
* **Label:** None explicitly stated. The axis represents discrete mathematical topics.
* **Tick Labels:** A series of mathematical topic names written in Chinese, rotated at a 45-degree angle for readability. The full list of topics (with English translations) is provided in the Detailed Analysis section.
* **Legend:**
* **Position:** Centered at the top of the chart, above the plot area.
* **Content:** Four entries, each with a colored line segment and marker, followed by the model name.
1. **Blue line with circle markers:** `Baichuan2-13B`
2. **Orange line with circle markers:** `LLaMA2-13B`
3. **Green line with circle markers:** `Qwen-14B`
4. **Red line with circle markers:** `InternLM2-Math-20B`
### Detailed Analysis
The chart plots accuracy (0-100%) for each model across 39 distinct mathematical topics. Below is an approximate data extraction for each model, listed in the order the topics appear on the x-axis (left to right). Values are estimated from the chart's gridlines and carry an uncertainty of ±3-5%.
**X-Axis Topics (Chinese -> English Translation):**
1. 全等三角形 -> Congruent Triangles
2. 等腰三角形 -> Isosceles Triangles
3. 等边三角形 -> Equilateral Triangles
4. 平行四边形性质 -> Properties of Parallelograms
5. 圆周角定理 -> Inscribed Angle Theorem
6. 弧长和扇形面积 -> Arc Length and Sector Area
7. 点与圆的位置关系 -> Positional Relationship between a Point and a Circle
8. 函数与二元一次方程 -> Function and Linear Equation in Two Variables
9. 函数与一元一次方程 -> Function and Linear Equation in One Variable
10. 函数与一元二次方程 -> Function and Quadratic Equation in One Variable
11. 求一次函数的解析式 -> Finding the Analytic Expression of a Linear Function
12. 二次函数的性质 -> Properties of Quadratic Functions
13. 反比例函数的性质 -> Properties of Inverse Proportional Functions
14. 反比例函数的应用 -> Application of Inverse Proportional Functions
15. 点的坐标特征 -> Coordinate Characteristics of Points
16. 代数式求值 -> Evaluating Algebraic Expressions
17. 同底数幂 -> Powers with the Same Base
18. 约分与通分 -> Reducing Fractions and Finding Common Denominators
19. 十字相乘法 -> Cross Multiplication Method
20. 提公因式法 -> Factoring by Common Factor
21. 流程图 -> Flowcharts
22. 简单的轴对称图形 -> Simple Axially Symmetric Figures
23. 整式的乘法与因式分解 -> Multiplication of Integral Expressions and Factorization
24. 二次根式的乘除 -> Multiplication and Division of Quadratic Radicals
25. 二次根式的加减 -> Addition and Subtraction of Quadratic Radicals
26. 平方根与算术平方根 -> Square Root and Arithmetic Square Root
27. 一元一次方程的应用 -> Application of Linear Equation in One Variable
28. 一元二次方程的解法 -> Solution of Quadratic Equation in One Variable
29. 一元二次方程的应用 -> Application of Quadratic Equation in One Variable
30. 一元一次不等式 -> Linear Inequality in One Variable
31. 一元一次不等式组 -> System of Linear Inequalities in One Variable
32. 解一元二次方程 -> Solving Quadratic Equation in One Variable
33. 分式方程的应用 -> Application of Fractional Equations
34. 分式的化简求值 -> Simplification and Evaluation of Fractions
35. 数据的集中趋势 -> Central Tendency of Data
36. 数据的波动程度 -> Dispersion of Data
37. 频数分布直方图 -> Frequency Distribution Histogram
38. 概率的求法 -> Calculation of Probability
39. 随机事件与概率 -> Random Events and Probability
**Approximate Accuracy Data by Model:**
* **Baichuan2-13B (Blue Line):**
* **Trend:** Highly volatile, with frequent sharp peaks and troughs. Shows strong performance on several algebraic and geometric topics but also significant dips.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 65), (2, 55), (3, 70), (4, 45), (5, 35), (6, 35), (7, 25), (8, 50), (9, 55), (10, 45), (11, 80), (12, 70), (13, 55), (14, 40), (15, 55), (16, 78), (17, 85), (18, 50), (19, 68), (20, 60), (21, 40), (22, 45), (23, 55), (24, 40), (25, 65), (26, 50), (27, 40), (28, 80), (29, 45), (30, 75), (31, 55), (32, 50), (33, 70), (34, 70), (35, 70), (36, 45), (37, 50).
* **LLaMA2-13B (Orange Line):**
* **Trend:** Generally lower accuracy than the other models, with a few notable peaks. Performance is particularly weak on geometry and data statistics topics.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 35), (2, 20), (3, 15), (4, 20), (5, 15), (6, 5), (7, 25), (8, 45), (9, 50), (10, 20), (11, 55), (12, 45), (13, 20), (14, 20), (15, 60), (16, 50), (17, 55), (18, 20), (19, 50), (20, 45), (21, 5), (22, 10), (23, 15), (24, 35), (25, 25), (26, 15), (27, 25), (28, 40), (29, 25), (30, 40), (31, 30), (32, 25), (33, 60), (34, 30), (35, 40), (36, 15), (37, 30), (38, 50).
* **Qwen-14B (Green Line):**
* **Trend:** Shows the most consistent low-to-mid range performance, with very few high peaks. It frequently has the lowest accuracy, especially on geometry and equation-solving topics.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 5), (2, 10), (3, 5), (4, 25), (5, 15), (6, 5), (7, 20), (8, 0), (9, 5), (10, 5), (11, 15), (12, 10), (13, 15), (14, 5), (15, 45), (16, 30), (17, 5), (18, 15), (19, 30), (20, 30), (21, 15), (22, 5), (23, 40), (24, 15), (25, 20), (26, 10), (27, 20), (28, 20), (29, 5), (30, 20), (31, 20), (32, 40), (33, 25), (34, 0), (35, 30), (36, 15), (37, 10), (38, 5), (39, 10).
* **InternLM2-Math-20B (Red Line):**
* **Trend:** Often the top-performing model, with several high peaks above 80%. It shows particular strength in algebra, functions, and probability, but also has significant variability.
* **Sample Data Points (Topic #, ~Accuracy%):** (1, 45), (2, 65), (3, 65), (4, 70), (5, 55), (6, 20), (7, 40), (8, 35), (9, 50), (10, 25), (11, 90), (12, 65), (13, 65), (14, 25), (15, 45), (16, 75), (17, 70), (18, 70), (19, 85), (20, 83), (21, 25), (22, 30), (23, 50), (24, 55), (25, 60), (26, 35), (27, 35), (28, 75), (29, 50), (30, 75), (31, 35), (32, 80), (33, 50), (34, 25), (35, 55), (36, 75), (37, 85), (38, 70), (39, 70).
### Key Observations
1. **Performance Hierarchy:** `InternLM2-Math-20B` (Red) and `Baichuan2-13B` (Blue) are generally the top performers, frequently trading the lead. `LLaMA2-13B` (Orange) and `Qwen-14B` (Green) consistently perform at a lower tier.
2. **Topic Sensitivity:** All models show extreme sensitivity to the specific mathematical topic. Accuracy can swing by 40-60 percentage points between adjacent topics. This suggests the models' mathematical reasoning is not robust or generalized but highly dependent on the specific problem type.
3. **Model-Specific Strengths:**
* `InternLM2-Math-20B` peaks on topics like "Finding the Analytic Expression of a Linear Function" (#11, ~90%) and "Solution of Quadratic Equation in One Variable" (#28, ~75%).
* `Baichuan2-13B` excels on "Powers with the Same Base" (#17, ~85%) and "Application of Fractional Equations" (#33, ~70%).
* `LLaMA2-13B` has a notable peak on "Coordinate Characteristics of Points" (#15, ~60%).
* `Qwen-14B` performs best on "Properties of Inverse Proportional Functions" (#13, ~45%) and "Solution of Quadratic Equation in One Variable" (#32, ~40%).
4. **Common Difficult Areas:** Geometry topics (e.g., #5-7, Inscribed Angle Theorem, Arc Length, Point-Circle Relationship) and data statistics (#35-37) appear challenging for most models, particularly `LLaMA2-13B` and `Qwen-14B`, which often score below 20% in these areas.
5. **Volatility:** The green line (`Qwen-14B`) is the most consistently low, while the red line (`InternLM2-Math-20B`) exhibits the highest peaks but also deep valleys, indicating specialized rather than broad competence.
### Interpretation
This chart provides a granular diagnostic of LLM capabilities in mathematical reasoning, moving beyond aggregate benchmarks. The data suggests that:
1. **Specialization over Generalization:** The high volatility indicates that these models have not achieved a unified "understanding" of mathematics. Instead, they possess a patchwork of competencies, likely reflecting biases in their training data towards certain problem formats or topics. A model may excel at algebraic manipulation but fail at geometric visualization.
2. **The "Math" in Model Names Matters:** The `InternLM2-Math-20B` model, which likely underwent math-specific fine-tuning or training, demonstrates a clear, though not absolute, advantage, especially on complex algebraic tasks. This validates the approach of domain-specific adaptation for technical fields.
3. **Instruction Following vs. Reasoning:** The poor performance on applied topics (e.g., "Application of...") across several models may highlight a gap between procedural knowledge (solving a given equation type) and the deeper reasoning required to model a word problem into a mathematical formulation.
4. **Implications for Use:** Users cannot assume consistent performance from any single model across a math curriculum. A model strong in algebra may be unreliable for geometry. This underscores the need for topic-aware model selection or ensemble approaches for educational or technical applications.
5. **Data as a Diagnostic Tool:** For developers, the specific topics where a model fails (e.g., `Qwen-14B` on "Congruent Triangles" or `LLaMA2-13B` on "Flowcharts") provide direct targets for improving training data curation or fine-tuning strategies.
In essence, the chart reveals that current LLMs are not monolithic "math solvers" but tools with highly variable and topic-dependent proficiencies. Their performance is a complex function of model architecture, training data composition, and potential specialized tuning, with no single model demonstrating comprehensive mastery.
</details>
<details>
<summary>x31.png Details</summary>

### Visual Description
## Multi-Line Chart: AI Model Accuracy Across Chinese Mathematics Topics
### Overview
This image is a multi-line chart comparing the performance accuracy of four different large language models (LLMs) across a wide range of Chinese middle school mathematics topics. The chart displays accuracy percentages on the y-axis against a series of specific math topics on the x-axis.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** 0 to 100, with major gridlines at intervals of 20 (0, 20, 40, 60, 80, 100).
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains a series of categorical math topics.
* **Categories (Transcribed from Chinese, with English translation):**
1. 全等三角形 (Congruent Triangles)
2. 等腰三角形 (Isosceles Triangles)
3. 等边三角形 (Equilateral Triangles)
4. 平行四边形 (Parallelograms)
5. 圆 (Circles)
6. 圆心角 (Central Angles)
7. 弧长与扇形面积 (Arc Length and Sector Area)
8. 点与圆的位置关系 (Positional Relationship between a Point and a Circle)
9. 直线与圆的位置关系 (Positional Relationship between a Line and a Circle)
10. 函数与二元一次方程 (Functions and Linear Equations in Two Variables)
11. 函数与一元二次方程 (Functions and Quadratic Equations in One Variable)
12. 求一次函数解析式 (Finding the Analytic Expression of a Linear Function)
13. 一次函数的应用 (Application of Linear Functions)
14. 反比例函数的性质 (Properties of Inverse Proportional Functions)
15. 反比例函数的定义 (Definition of Inverse Proportional Functions)
16. 反比例函数的应用 (Application of Inverse Proportional Functions)
17. 对顶角、邻补角 (Vertical Angles, Supplementary Adjacent Angles)
18. 平行线的性质 (Properties of Parallel Lines)
19. 同位角、内错角、同旁内角 (Corresponding Angles, Alternate Interior Angles, Consecutive Interior Angles)
20. 不等式及其解集 (Inequalities and Their Solution Sets)
21. 一元一次不等式 (Linear Inequalities in One Variable)
22. 约分与通分 (Reducing Fractions and Finding Common Denominators)
23. 分式方程 (Fractional Equations)
24. 分式的乘除 (Multiplication and Division of Fractions)
25. 分式的加减 (Addition and Subtraction of Fractions)
26. 提公因式法 (Method of Factoring by Common Factor)
27. 整式的乘法 (Multiplication of Integral Expressions)
28. 整式的除法 (Division of Integral Expressions)
29. 整式的加减 (Addition and Subtraction of Integral Expressions)
30. 平方根与算术平方根 (Square Roots and Arithmetic Square Roots)
31. 二次根式的乘除 (Multiplication and Division of Quadratic Radicals)
32. 二次根式的加减 (Addition and Subtraction of Quadratic Radicals)
33. 一元一次方程的应用 (Application of Linear Equations in One Variable)
34. 解一元一次方程 (Solving Linear Equations in One Variable)
35. 一元二次方程的应用 (Application of Quadratic Equations in One Variable)
36. 解一元二次方程 (Solving Quadratic Equations in One Variable)
37. 二元一次方程组的应用 (Application of Systems of Linear Equations in Two Variables)
38. 解二元一次方程组 (Solving Systems of Linear Equations in Two Variables)
39. 分式方程的应用 (Application of Fractional Equations)
40. 数据的波动趋势 (Trend of Data Fluctuation)
41. 数据的集中趋势 (Central Tendency of Data)
42. 概率的应用 (Application of Probability)
43. 随机事件与概率 (Random Events and Probability)
* **Legend:** Positioned at the top center of the chart. It maps line colors and markers to model names.
* **Blue line with circle markers:** InternLM2-20B
* **Orange line with circle markers:** Yi-34B
* **Green line with circle markers:** Qwen-72B
* **Red line with circle markers:** GPT-3.5
### Detailed Analysis
The chart plots the accuracy of four models across 43 distinct math topics. The data is dense, with significant volatility for all models. Below is a model-by-model trend analysis and approximate data point extraction.
**Trend Verification & Data Points (Approximate):**
* **Qwen-72B (Green Line):**
* **Trend:** This model consistently demonstrates the highest performance, frequently reaching or approaching 100% accuracy. Its line is often the topmost on the chart, showing strong peaks but also notable dips.
* **Key Data Points (Approximate %):** Starts at ~60, peaks at 100 for "函数与二元一次方程" and "反比例函数的性质", dips to ~45 for "分式的加减", and ends at ~75.
* **GPT-3.5 (Red Line):**
* **Trend:** Shows high volatility, with sharp peaks and deep troughs. It often performs competitively with the top model but exhibits more instability.
* **Key Data Points (Approximate %):** Starts at ~40, peaks at ~85 for "函数与二元一次方程" and "解一元二次方程", drops to a low of ~10 for "反比例函数的定义", and ends at ~55.
* **InternLM2-20B (Blue Line):**
* **Trend:** Generally performs in the middle to lower tier among the four models. It has several significant drops, particularly in the middle section of the topics.
* **Key Data Points (Approximate %):** Starts at ~45, peaks at ~85 for "反比例函数的性质", drops to a low of ~5 for "直线与圆的位置关系" and ~10 for "二元一次方程组的应用", and ends at ~30.
* **Yi-34B (Orange Line):**
* **Trend:** Often the lowest-performing model, with a trend line that frequently sits at the bottom of the cluster. It shows less extreme peaks than GPT-3.5 but has consistent low points.
* **Key Data Points (Approximate %):** Starts at ~40, peaks at ~80 for "反比例函数的性质", drops to lows of ~10 for "直线与圆的位置关系" and "分式的加减", and ends at ~45.
### Key Observations
1. **Model Hierarchy:** Qwen-72B (green) is the clear leader, followed by a competitive but volatile GPT-3.5 (red). InternLM2-20B (blue) and Yi-34B (orange) generally trail, with Yi-34B often at the bottom.
2. **Topic Difficulty:** All models show synchronized, sharp declines on specific topics, indicating these are universally challenging. Notable low points occur around:
* "直线与圆的位置关系" (Positional Relationship between a Line and a Circle)
* "分式的加减" (Addition and Subtraction of Fractions)
* "二元一次方程组的应用" (Application of Systems of Linear Equations in Two Variables)
3. **Peak Performance:** The highest accuracy (100%) is achieved by Qwen-72B on two topics: "函数与二元一次方程" and "反比例函数的性质".
4. **Volatility:** GPT-3.5 exhibits the most dramatic swings in performance from one topic to the next.
### Interpretation
This chart provides a comparative benchmark of LLM capabilities in solving structured, rule-based mathematical problems from the Chinese curriculum. The data suggests:
* **Specialization vs. Generalization:** Qwen-72B's consistent high performance may indicate superior training on mathematical or logical reasoning datasets. The volatility of GPT-3.5 suggests its performance is highly sensitive to the specific formulation or type of math problem.
* **Curriculum Insights:** The topics where all models struggle (e.g., geometric relationships, complex fraction operations, applied word problems) highlight areas where current LLMs have inherent weaknesses. These likely require multi-step reasoning, spatial understanding, or translation of real-world scenarios into equations—skills that are less about pattern recognition and more about deep procedural and conceptual understanding.
* **Model Selection Implications:** For applications requiring reliable performance across a broad spectrum of math problems, Qwen-72B appears to be the most robust choice based on this data. However, for specific topics where other models peak, they could still be viable. The poor performance on "applied" topics (e.g., "应用" problems) across the board indicates a significant gap between solving pure equations and applying them to contextual scenarios.
**Language Note:** The primary language of the chart's textual content (x-axis labels) is **Chinese (Simplified)**. All labels have been transcribed above and provided with English translations.
</details>
Figure 16: Concept accuracies on Middle-ZH of more models.
## Appendix B Details on the Efficient Fine-Tuning
In this section, we provide details on the efficient fine-tuning strategy used to strengthen mathematical reasoning on specific concepts, which first trains a concept classifier and then curates a set of samples from a large open-source math dataset. Specifically, we first collect an extra 10 problems per concept and use them to construct a classifier that identifies the concept class of a given question. The backbone of this classifier is a pretrained bilingual LLM (i.e., Baichuan2-13B), with a classification head applied to its last hidden output feature. The concept classification accuracies in English and Chinese are 92.5% and 86.9%, respectively, which indicates that it is reasonable to use such a classifier to curate an extra concept-related dataset from large-scale math-related data. In our work, we crawl OpenWebMath (Paster et al., 2023) to produce the concept-related training dataset.
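The head-on-last-hidden-state design can be sketched as follows. This is a minimal illustration with a toy stand-in for the Baichuan2-13B backbone; the vectors, dimensions, and concept names below are invented for the example, not the paper's implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def classify(last_hidden, head_weights):
    """Linear classification head over the backbone's last hidden state:
    one score per concept class, prediction = argmax."""
    logits = [dot(w, last_hidden) for w in head_weights]
    return max(range(len(logits)), key=logits.__getitem__)

# Toy last hidden state (a real 13B backbone would emit ~5120 dimensions).
last_hidden = [0.2, -1.0, 0.5, 0.3]

# One weight row per concept class; the concept names are illustrative.
head = [
    [1.0, 0.0, 0.0, 0.0],  # e.g. "Fractions"
    [0.0, 0.0, 2.0, 0.0],  # e.g. "Probability"
    [0.0, 1.0, 0.0, 0.0],  # e.g. "Angles"
]

print(classify(last_hidden, head))  # class 1 scores highest (2.0 * 0.5 = 1.0)
```

In the paper's setup, the extra 10 questions collected per concept would supply the training labels for such a head, while the bilingual backbone remains pretrained.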
## Appendix C Details on the Evaluated Models
In this section, we offer a detailed overview of the Large Language Models (LLMs) and present the corresponding model links in Table 6.
- GPT-3.5/GPT-4 OpenAI (2023): The most powerful closed-source models from OpenAI. We use their APIs: gpt-3.5-turbo and gpt-4.
- LLaMA2-7B/13B/70B Touvron et al. (2023b): A family of open-source models developed by Meta.
- Qwen-14B/72B Bai et al. (2023b): Models pre-trained on multilingual data with a focus on Chinese and English. We employ both Qwen-Base-14B and Qwen-Base-72B.
- Baichuan2-13B Baichuan (2023): This model demonstrates impressive performance in both Chinese and English benchmarks.
- MetaMath-13B Yu et al. (2023): A domain-specific language model for mathematical reasoning, fine-tuned from the LLaMA-2 model on the MetaMathQA dataset (https://huggingface.co/datasets/meta-math/MetaMathQA).
- WizardMath-13B Luo et al. (2023): Another domain-specific language model for mathematical reasoning, fine-tuned from the LLaMA-2 model using reinforcement learning.
- MAmmoTH-13B Yue et al. (2023): A model specifically designed for general math problem-solving, fine-tuned from the LLaMA model on the MathInstruct dataset (https://huggingface.co/datasets/TIGER-Lab/MathInstruct), whose training data includes both chain-of-thought (CoT) and program-of-thought (PoT) rationales.
- Yi-6B/34B Team (2023b): Models released by 01.AI that show promising performance in both Chinese and English.
- ChatGLM3-6B Zeng et al. (2022): A lightweight, high-performance pre-trained dialogue model for Chinese and English, released by Zhipu AI.
- InternLM-7B/20B Team (2023a): Multilingual language models with progressively enhanced capabilities, released by the InternLM team.
- InternLM-Math-7B/20B Ying et al. (2024): Strong math reasoning language models.
- DeepSeekMath-7B Shao et al. (2024): A powerful mathematical language model released by DeepSeek.
| Family | Model | HuggingFace Link / OpenAI Model |
| --- | --- | --- |
| ChatGLM3 | ChatGLM3-6B | https://huggingface.co/THUDM/chatglm3-6b |
| DeepSeekMath | DeepSeekMath-7B | https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct |
| Baichuan2 | Baichuan2-13B | https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat |
| MetaMath | MetaMath-13B | https://huggingface.co/meta-math/MetaMath-13B-V1.0 |
| WizardMath | WizardMath-13B | https://huggingface.co/WizardLM/WizardMath-13B-V1.0 |
| MAmmoTH | MAmmoTH-13B | https://huggingface.co/TIGER-Lab/MAmmoTH-13B |
| InternLM | InternLM-7B | https://huggingface.co/internlm/internlm2-chat-7b |
| | InternLM-20B | https://huggingface.co/internlm/internlm2-chat-20b |
| | InternLM-Math-7B | https://huggingface.co/internlm/internlm2-math-7b |
| | InternLM-Math-20B | https://huggingface.co/internlm/internlm2-math-20b |
| Yi | Yi-6B | https://huggingface.co/01-ai/Yi-6B-Chat |
| | Yi-34B | https://huggingface.co/01-ai/Yi-34B-Chat |
| LLaMA2 | LLaMA2-7B | https://huggingface.co/meta-llama/Llama-2-7b-chat-hf |
| | LLaMA2-13B | https://huggingface.co/meta-llama/Llama-2-13b-chat-hf |
| | LLaMA2-70B | https://huggingface.co/meta-llama/Llama-2-70b-chat |
| Qwen | Qwen-14B | https://huggingface.co/Qwen/Qwen-14B-Chat |
| | Qwen-72B | https://huggingface.co/Qwen/Qwen-72B-Chat |
| GPT | GPT-3.5 | gpt-3.5-turbo |
| | GPT-4 | gpt-4 |
Table 6: Model links.
## Appendix D More Results
As shown in Fig. 9 and Fig. 10, we report the mean concept accuracies of the evaluated models across the different concepts of Elementary-EN and Elementary-ZH.
As shown in Fig. 11 and Fig. 12, we illustrate the concept accuracies on Elementary-EN and Elementary-ZH for different selected models. For the results of all models, please refer to Fig. 13, Fig. 14, Fig. 15 and Fig. 16.
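The concept accuracies shown in these figures reduce to a simple grouped computation over per-question results. A minimal sketch follows; the `(concept, is_correct)` record format and the toy numbers are illustrative assumptions, not the benchmark's actual schema.

```python
from collections import defaultdict

def concept_accuracies(results):
    """Compute per-concept accuracy from (concept, is_correct) records."""
    totals, correct = defaultdict(int), defaultdict(int)
    for concept, ok in results:
        totals[concept] += 1
        correct[concept] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

# Toy records: a decent average can hide a near-catastrophic concept.
records = (
    [("Congruent Triangles", True)] * 9
    + [("Congruent Triangles", False)]
    + [("Random Events and Probability", True)]
    + [("Random Events and Probability", False)] * 9
)
per_concept = concept_accuracies(records)
average = sum(ok for _, ok in records) / len(records)

print(average)      # 0.5: the single number traditional benchmarks report
print(per_concept)  # 0.9 vs 0.1: the concept-wise view exposes the failure
```

This is exactly the contrast the figures visualize: the average alone would mask the near-zero accuracy on one concept.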
## Appendix E Details on 5-shot Prompts
We provide the 5-shot prompts for our ConceptMath in Pages 17-20.
Q: For a party, Oscar gets 4 different kinds of soda. He buys a 6-pack of each kind. At home, he divides the sodas evenly among 3 coolers. How many sodas are in each cooler?
A: Oscar buys 4 different kinds of soda, and he gets a 6-pack of each kind. This means that he has 4 * 6 = 24 sodas in total. He divides these 24 sodas evenly among 3 coolers, so the number of sodas per cooler is 24 / 3 = 8. Therefore, the answer is 8.
Q: Amazing Animals Zoo is building a new home for its 8 zebras. Each zebra will get 2 acres of grassy land to roam around in. The zebras will have a pond that covers 1 acre. How big will the zebras' new home be?
A: The total required grassland area for the zebras is: 8 zebras * 2 acres per zebra = 16 acres. Then, we add the extra 1 acre for the pond: 16 acres (for the zebras) + 1 acre (for the pond) = 17 acres. Therefore, the answer is 17.
Q: The state taxes personal income exceeding 3500 yuan (this portion not to exceed 3500 yuan) at a rate of 3%. If Xiaoming's father's monthly salary is 4700 yuan, how much personal income tax should Xiaoming's father pay each month?
A: To calculate the amount of personal income tax Xiaoming's father should pay, first subtract the tax-free threshold from the monthly salary: Taxable income = Monthly salary - Tax-free threshold = 4700 - 3500 = 1200 yuan. Then apply the tax rate to the amount that exceeds the threshold: Tax owed = Taxable income * Tax rate = 1200 * 3% = 36 yuan. Therefore, the answer is 36.
Q: Find the 10th term in the arithmetic sequence where the first term is 3 and the common difference is 4.
A: To find the nth term in an arithmetic sequence, we can use the formula $a_{n}=a_{1}+(n-1)*d$, where $a_{n}$ is the $n$th term, $a_{1}$ is the first term, $d$ is the common difference, and $n$ is the term number. Given that the first term ($a_{1}$) is 3 and the common difference ($d$) is 4, we can find the 10th term ($a_{10}$) by plugging these values into the formula: $a_{10}=a_{1}+(10-1)*d=3+9*4=3+36=39$. Therefore, the answer is 39.
Q: A regular polygon has an interior angle of 156 degrees. How many sides does the polygon have?
A: To find the number of sides of a regular polygon with a given interior angle, we can use the formula Interior Angle = [(n - 2) * 180] / n, where n is the number of sides. Given that the interior angle is 156: 156 = [(n - 2) * 180] / n. Solving for n gives n = 15. Therefore, the answer is 15.
Q: $g(x)=x^{2}$, $f(x)=3(x-2)^{2}$, $h(x)=g(x)-f(x)$, $y=h(23)$; give me the value of $y$.
A: First, calculate $g(23)$: $g(23)=23^{2}=529$. Next, calculate $f(23)$: $f(23)=3*(23-2)^{2}=3*(21)^{2}=3*441=1323$. Now, calculate $h(23)$: $h(23)=g(23)-f(23)=529-1323=-794$. Therefore, the answer is -794.
Q: What is the slope of the line segment between the two points: (3,1), (5,2)?
A: The slope of a line between two points $(x_{1},y_{1})$ and $(x_{2},y_{2})$ can be found using the formula $slope(m)=(y_{2}-y_{1})/(x_{2}-x_{1})$. Given the two points (3, 1) and (5, 2), we substitute these values into the formula: $slope(m)=(2-1)/(5-3)=1/2=0.5$. Therefore, the answer is 1/2.
Q: In a party, there are 5 cupcakes remaining and 10 children. A mother inserts 10 pieces of paper into a hat, 5 with a cupcake image and 5 without. The children who draw the cupcake paper will receive a cupcake. If the first child draws a paper with a cupcake image, what is the probability that the second child will also draw a cupcake paper?
A: Initially, there are 5 cupcake papers and 5 non-cupcake papers in the hat, making a total of 10 papers. After the first child draws a cupcake paper, there are 4 cupcake papers and 5 non-cupcake papers left in the hat, with a total of 9 papers. The probability that the second child will draw a cupcake paper is the number of cupcake papers left divided by the total number of papers left: Probability = 4/9. Therefore, the answer is 4/9.
Q: What is the total area of an irregular polygon that consists of two squares and two rectangles, where the sides of the squares are 12 cm and 16 cm respectively, and the dimensions of the rectangles are 13 cm by 20 cm and 10 cm by 7 cm respectively?
A: To find the total area of an irregular polygon consisting of two squares and two rectangles, sum up the individual areas of each shape. The area of a square is given by $A=s^{2}$, where $s$ is the length of a side. For the first square with a side of 12 cm: $A_{1}=12^{2}=144\ cm^{2}$. For the second square with a side of 16 cm: $A_{2}=16^{2}=256\ cm^{2}$. The area of a rectangle is given by $A=l*w$, where $l$ is the length and $w$ is the width. For the first rectangle with dimensions 13 cm by 20 cm: $A_{3}=13*20=260\ cm^{2}$. For the second rectangle with dimensions 10 cm by 7 cm: $A_{4}=10*7=70\ cm^{2}$. Total Area $=A_{1}+A_{2}+A_{3}+A_{4}=144+256+260+70=730\ cm^{2}$. Therefore, the answer is 730.
Q: Susie is making invitations to her party this coming weekend. It takes her about 5 minutes to make each invitation. It took her 2 hours to finish all the invitations. How many people is Susie inviting?
A: First, convert the 2 hours into minutes to match the unit of time used for the invitation-making process: 2 hours = 120 minutes. Then, divide the total time spent by the time it takes to make one invitation: 120 minutes / 5 minutes per invitation = 24 invitations. Therefore, the answer is 24.
Q: 小文用量角器测量一个角的度数时,误把外圈刻度看成了内圈刻度,量出的角是 $65°$,这个角实际是多少度?
A: 量角器一般有内外两圈刻度,内圈刻度通常沿顺时针方向从0度到180度,外圈刻度通常沿逆时针方向从0度到180度。如果小文误把外圈刻度看成了内圈刻度,那么实际的角度应该是180°减去错误读数:180° - 65° = 115°。 因此,答案是115。
Q: 小明以每小时2千米的速度上山,到了山顶后立即以每小时3千米的速度按原路返回。求小明上下山的平均速度是多少千米?
A: 要计算小明上下山的平均速度,我们需要知道他上山和下山所走的总距离以及总时间。 设单程距离为d千米,总距离 = 上山距离 + 下山距离 = d + d = 2d千米。 然后计算总时间:上山时间 = d/2小时,下山时间 = d/3小时,总时间 = d/2 + d/3 = 5d/6小时。 最后计算平均速度:平均速度 = 总距离 / 总时间 = 2d/(5d/6) = 12/5 = 2.4千米/小时。 因此,答案是2.4千米/小时。
Q: 一个礼盒长20cm、宽15cm、高10cm,用彩带捆扎这样的一个礼盒,至少需要多长的彩带?(接头处12厘米)
A: 捆扎这样一个礼盒,通常沿礼盒的长和宽方向各绕一圈,再留出接头长度。 沿长方向绕一圈的长度 = (礼盒的长 + 礼盒的高) * 2 = (20 + 10) * 2 = 60 cm;沿宽方向绕一圈的长度 = (礼盒的宽 + 礼盒的高) * 2 = (15 + 10) * 2 = 50 cm;总长度 = 60 + 50 + 12 = 122 cm。 因此,答案是122厘米。
Q: 千岛湖大桥的一侧每隔10米安装1盏灯,共需91盏(两端都装),现在改装61盏(两端都装),平均每相邻两盏灯之间的距离是多少米?
A: 原来每隔10米安装1盏灯,共需91盏(两端都装),说明桥的总长覆盖了90个10米的间隔,所以桥的总长度 = 90 * 10 = 900米。 现在改装61盏灯(两端都装),即把900米分成60个间隔(间隔数比灯的数量少一个),每个间隔的长度 = 900米 / 60间隔 = 15米。 因此,答案是15米。
Q: a是b的3/4,则a比b少( ),b比a多( ),b与a的比是( ),如果a、b两数的和是210,则a=( ),b=( ),括号处分别为多少?
A: 已知a是b的3/4,则a = 3/4 * b。因此,a比b少 (1 - 3/4)/1 = 1/4,b比a多 (1 - 3/4)/(3/4) = 1/3,b与a的比为 b : (3/4 * b) = 4:3。 若已知a、b两数的和是210,则a + b = 210,将a = 3/4 * b 代入方程得 3/4 * b + b = 210,解得b = 120,a = 3/4 * 120 = 90。 因此,答案分别是 1/4,1/3,4:3,90,120。
Q: 求不等式 $-x^{2}-2x+8>0$ 的解集?
A: 对 $-x^{2}-2x+8>0$ 分解因式,得到 $-(x-2)(x+4)>0$,即 $(x-2)(x+4)<0$,解得 $-4<x<2$。 因此,答案是 (-4, 2)。
Q: 已知函数 $y=2x+3$ 与 $y=-1/2x+m$ 的图象交y轴于点A、B,两函数图象的交点为P(n,-2),求m和n的值?
A: 两函数图象的交点为P(n,-2),意味着这个点同时满足上述两个函数方程:将P点坐标(n,-2)代入函数 $y=2x+3$,即 $-2=2n+3$,解得 $n=-5/2$;将P点坐标(-5/2,-2)代入函数 $y=-1/2x+m$,即 $-2=-1/2*(-5/2)+m$,解得 $m=-13/4$。 因此,答案是m为-13/4,n为-5/2。
Q: 已知PA是圆O的切线,A为切点,PC与圆O相交于B、C两点,PB=2cm,BC=8cm,则PA的长是多少厘米?
A: 由于PA是圆O的切线,根据切割线定理,从圆外一点引圆的切线和割线,切线长的平方等于这一点到割线与圆两个交点的线段长的积,即 $PA^{2}=PC*PB$。 已知PB=2cm、BC=8cm,则 $PC=PB+BC=2+8=10$ cm,$PA^{2}=PC*PB=10*2=20$,解得 $PA=2\sqrt{5}$ cm。 因此,答案是 $2\sqrt{5}$ 厘米。
Q: 南宋著名数学家杨辉精研数学,著有《详解九章算法》,对数的运算进行了深入研究与总结。类比其中的思想方法,可以解决很多数与式的计算问题。已知a,b为实数,且 $a+b=3$,$ab=1$,计算可得:$a^{2}+b^{2}=7$,$a^{3}+b^{3}=18$,$a^{4}+b^{4}=47$,由此求 $a^{5}+b^{5}$ 得多少?
A: 已知 $a+b=3$ 和 $ab=1$,可得:$a^{2}+b^{2}=(a+b)^{2}-2ab=3^{2}-2*1=7$;$a^{3}+b^{3}=(a+b)(a^{2}-ab+b^{2})=3*(7-1)=18$;$a^{4}+b^{4}=(a^{2}+b^{2})^{2}-2a^{2}b^{2}=7^{2}-2*1^{2}=47$;$a^{5}+b^{5}=(a+b)(a^{4}+b^{4})-ab(a^{3}+b^{3})=3*47-1*18=123$。 因此,答案是123。
Q: 某单位要招聘1名英语翻译,甲、乙两人报名参加了听、说、读、写等项素质测试,甲的成绩分别为:90分、80分、85分、78分;乙的成绩分别为:78分、82分、85分、88分。如果把听、说、读、写的成绩按 $2:2:3:3$ 计算素质测试平均成绩,那么甲和乙的平均成绩分别是多少?
A: 甲的平均成绩为 $(90*2+80*2+85*3+78*3)/(2+2+3+3)=(180+160+255+234)/10=829/10=82.9$ 分;乙的平均成绩为 $(78*2+82*2+85*3+88*3)/(2+2+3+3)=(156+164+255+264)/10=839/10=83.9$ 分。 因此,答案是甲的平均成绩是82.9分,乙的平均成绩是83.9分。
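The power-sum exemplar above follows the standard recurrence $s_{k}=(a+b)\,s_{k-1}-ab\,s_{k-2}$ for $s_{k}=a^{k}+b^{k}$. A quick numeric check of the sequence for $a+b=3$, $ab=1$ (illustrative; not part of the prompts themselves):

```python
def power_sums(s, p, n):
    """Return [a+b, a^2+b^2, ..., a^n+b^n] for roots with sum s = a+b and
    product p = a*b, using the recurrence s_k = s*s_{k-1} - p*s_{k-2}."""
    seq = [s, s * s - 2 * p]  # s_1 and s_2
    while len(seq) < n:
        seq.append(s * seq[-1] - p * seq[-2])
    return seq[:n]

print(power_sums(3, 1, 5))  # [3, 7, 18, 47, 123]
```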
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| Calculation & Properties | Calculation | Add | 19 |
| | | Decimals | 20 |
| | | Division | 19 |
| | | Equations | 18 |
| | | Fractions | 16 |
| | | Mixed Operations | 18 |
| | | Multiple | 18 |
| | | Numerical Expressions | 20 |
| | | Place Value | 16 |
| | | Powers | 20 |
| | | Rational Number | 17 |
| | | Subtraction | 19 |
| | | Variable Expressions | 19 |
| | Properties | Compare | 20 |
| | | Count | 18 |
| | | Estimation & Rounding | 20 |
| | | Patterns | 19 |
| Geometry | Angles | Angles | 17 |
| | Coordinate Plane | Coordinate Plane | 18 |
| | Three-dimensional Shapes | Cones | 17 |
| | | Cubes | 20 |
| | | Cylinders | 17 |
| | | Spheres | 17 |
| | | Volume of 3D shapes | 18 |
| | Two-dimensional Shapes | Circles | 17 |
| | | Perimeter | 19 |
| | | Polygons | 18 |
| | | Quadrilaterals | 17 |
| | | Triangles | 18 |
| Measurement | Basic Knowledge | Temperature | 19 |
| | | Time | 20 |
| | Money | Coin Names & Value | 17 |
| | | Exchanging Money | 17 |
| | Ratio | Percent | 17 |
| | | Proportion | 18 |
| | | Ratio | 19 |
| | Size | Area | 19 |
| | | Length | 20 |
| | | Volume | 20 |
| | Weight | Light & Heavy | 20 |
| Statistics | Classifying & Sorting | Classifying & Sorting | 17 |
| | Data | Mode/Mean/Median/Range | 19 |
| | Probability | Probability | 16 |
Table 7: Details of the hierarchical concepts in Elementary-EN.
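The three-level hierarchy above is what makes concept-wise evaluation possible: per-sample results at LEVEL3 can be rolled up to LEVEL2 and LEVEL1 scores. A minimal sketch of such a rollup, assuming results are keyed by (LEVEL1, LEVEL2, LEVEL3) paths and using a sample-weighted mean (this is an illustration, not the paper's exact evaluation protocol):

```python
# Sample-weighted accuracy rollup over a three-level concept hierarchy.
from collections import defaultdict

def rollup(results):
    """results: iterable of ((level1, level2, level3), correct, total).
    Returns accuracy for every hierarchy prefix, e.g. ("Geometry",)."""
    tally = defaultdict(lambda: [0, 0])  # prefix -> [correct, total]
    for path, correct, total in results:
        for depth in (1, 2, 3):
            tally[path[:depth]][0] += correct
            tally[path[:depth]][1] += total
    return {key: c / t for key, (c, t) in tally.items()}

# Hypothetical per-concept results, reusing concept names from Table 7.
demo = [
    (("Calculation & Properties", "Calculation", "Add"), 15, 19),
    (("Calculation & Properties", "Calculation", "Decimals"), 12, 20),
    (("Calculation & Properties", "Properties", "Compare"), 18, 20),
]
scores = rollup(demo)
print(round(scores[("Calculation & Properties",)], 3))  # 0.763, i.e. 45/59
```

A sample-weighted mean keeps the LEVEL1 score consistent with the overall accuracy; an unweighted mean over concepts would instead treat every concept equally regardless of its sample count.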
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| Calculation | Basic Calculation | Add & Subtract | 20 |
| | | Decimals | 19 |
| | | Divide | 19 |
| | | Exponents & Scientific Notation | 16 |
| | | Fractions & Decimals | 18 |
| | | Multiply | 18 |
| | | Square Roots & Cube Roots | 20 |
| | Consumer Math | Consumer Math | 18 |
| | Financial Literacy | Financial Literacy | 19 |
| | Integers | Absolute Value | 18 |
| | | Opposite Integers | 20 |
| | Measurement | Metric | 19 |
| | Number Theory | Factors | 20 |
| | | Prime Factorization | 19 |
| | | Prime or Composite | 18 |
| | Percents | Percents | 20 |
| | Rational & Irrational Numbers | Rational & Irrational Numbers | 18 |
| | Ratios & Rates | Proportional Relationships | 18 |
| | Sequences | Arithmetic Sequences | 19 |
| | | Geometric Sequences | 18 |
| Expressions, Equations, and Functions | Equations | Linear Equations | 20 |
| | | Systems of Equations | 18 |
| | Expressions | Equivalent Expressions | 20 |
| | | Radical | 17 |
| | | Variable | 18 |
| | Function | Domain & Range of Functions | 18 |
| | | Interpret Functions | 19 |
| | | Linear Functions | 20 |
| | | Nonlinear Functions | 18 |
| | Inequalities | Inequalities | 19 |
| Geometry | Congruence & Similarity | Congruence & Similarity | 19 |
| | Coordinate Plane | Axes | 17 |
| | | Distance Between Two Points | 19 |
| | | Quadrants | 16 |
| | Scale Drawings | Scale Drawings | 16 |
| | Slope | Slope | 20 |
| | Three-dimensional Figures | Polyhedra | 19 |
| | | Surface Area & Volume | 17 |
| | Transformations | Transformations | 18 |
| | Two-dimensional Figures | Circle | 20 |
| | | Lines & Angles | 18 |
| | | Perimeter & Area | 20 |
| | | Polygons | 18 |
| | | Square | 18 |
| | | Trapezoids | 16 |
| | | Triangle | 18 |
| Statistics and Probability | Data | Center & Variability | 18 |
| | | Mean, Median, Mode & Range | 19 |
| | | Outlier | 20 |
| | One-variable Statistics | One-variable Statistics | 19 |
| | Probability | Counting Principle | 16 |
| | | Independent & Dependent Events | 16 |
| | | Make Predictions | 17 |
| | | Probability of Compound Events | 16 |
| | | Probability of One Event | 17 |
| | | Probability of Simple and Opposite Events | 19 |
| | Two-variable Statistics | Two-variable Statistics | 18 |
Table 8: Details of the hierarchical concepts in Middle-EN.
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| 几何 (Geometry) | 平面图形 (Two-dimensional Shapes) | 三角形 (Triangles) | 20 |
| | | 圆 (Circle) | 20 |
| | | 平行四边形 (Parallelogram) | 20 |
| | | 梯形 (Trapezium) | 20 |
| | | 正方形 (Square) | 20 |
| | | 平面图形综合 (Synthesis Problem) | 20 |
| | | 角 (Angle) | 20 |
| | | 长方形 (Rectangle) | 20 |
| | 立体图形 (Three-dimensional Shapes) | 圆柱 (Cylinder) | 20 |
| | | 正方体 (Cube) | 20 |
| | | 立体图形综合问题 (Synthesis Problem) | 20 |
| | | 长方体 (Cuboid) | 20 |
| 应用 (Application) | 基础 (Fundamental Problem) | 和差倍问题 (Add & Differential & Multiple) | 20 |
| | | 基础 (Basics) | 21 |
| | | 差倍问题 (Differential) | 20 |
| | | 归一问题 (Normalization) | 20 |
| | | 归总问题 (Induction) | 20 |
| | 经典问题 (Classical Problem) | 利息问题 (Interest) | 20 |
| | | 周期问题 (Period) | 10 |
| | | 对折问题 (Folding) | 20 |
| | | 工程问题 (Engineering) | 20 |
| | | 年龄问题 (Age) | 20 |
| | | 折扣问题 (Discount) | 20 |
| | | 植树问题 (Planting) | 20 |
| | | 税率问题 (Tax) | 15 |
| | | 还原问题 (Reduction) | 20 |
| | | 页码问题 (Pagination) | 20 |
| | | 鸡兔同笼问题 (Chickens & Rabbits in the Same Cage) | 20 |
| | 路程问题 (Distance Problem) | 相遇问题 (Encounter) | 20 |
| | | 行程问题 (Travel) | 20 |
| | | 追击问题 (Pursuit) | 20 |
| 度量与统计 (Measurement and Statistics) | 度量 (Measurement) | 人民币问题 (RMB) | 9 |
| | | 时间问题 (Time) | 20 |
| | | 浓度问题 (Concentration) | 20 |
| | | 温度问题 (Temperature) | 6 |
| | | 面积问题 (Area) | 17 |
| | 统计 (Statistics) | 排列组合 (Permutation) | 20 |
| | | 统计指标 (Statistical Metrics) | 20 |
| | | 规律 (Law) | 18 |
| 数与代数 (Number and Algebra) | 分数运算 (Fractional Operation) | 分数与小数 (Fraction & Decimal) | 20 |
| | | 分数应用 (Fractional Application) | 20 |
| | | 分数运算 (Fractional Operation) | 20 |
| | | 最简分数 (Simplest Fraction) | 16 |
| | 因数与倍数 (Factors & Multiples) | 公倍数问题 (Common Multiples) | 16 |
| | | 公约数问题 (Common Divisors) | 11 |
| | | 因数问题 (Factor) | 20 |
| | | 因数与倍数综合问题 (Synthesis Problem) | 11 |
| | | 质数问题 (Prime Number) | 9 |
| | 基础运算 (Basic Operation) | 乘法问题 (Multiplication) | 20 |
| | | 倒数问题 (Reciprocal Problem) | 16 |
| | | 四则运算 (Four-rule Operation) | 20 |
| | | 新运算定义 (New Operation Definition) | 20 |
| | | 方程问题 (Equation) | 20 |
| | | 除法问题 (Division) | 20 |
| | 比 (Ratio) | 倍数问题 (Multiple) | 20 |
| | | 概率问题 (Probability) | 20 |
| | | 比例问题 (Proportion) | 20 |
| | | 百分率问题 (Percentage) | 20 |
Figure 17: Details of the hierarchical concepts in Elementary-ZH.
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| 几何 (Geometry) | 三角形 (Triangle) | 全等三角形 (Congruent Triangle) | 20 |
| | | 勾股定理 (Pythagorean Theorem) | 20 |
| | | 等腰三角形 (Isosceles Triangle) | 20 |
| | | 等边三角形 (Equilateral Triangle) | 20 |
| | 四边形 (Quadrilateral) | 平行四边形 (Parallelogram) | 20 |
| | | 梯形 (Trapezium) | 20 |
| | 圆 (Circle) | 圆周角 (Angle of Circumference) | 20 |
| | | 圆心角 (Angle of Center) | 20 |
| | | 垂径定理 (Vertical Path Theorem) | 20 |
| | | 弧长和扇形面积 (Arc Length & Sector Area) | 20 |
| | | 正多边形和圆 (Regular Polygons & Circles) | 20 |
| | | 点线圆位置关系 (Relations of Point, Line & Circle) | 20 |
| | 立体图形 (Three-dimensional Shapes) | 圆锥 (Cone) | 20 |
| 函数 (Function) | 一次函数 (Linear Function) | 函数与一元一次方程 (Univariate Function & Equation) | 20 |
| | | 函数与一元一次不等式 (Linear Functions & Univariate Linear Inequalities) | 20 |
| | | 一次函数与二元一次方程组 (Linear Functions & Systems of Binary Linear Equations) | 20 |
| | | 正比例函数 (Proportional Function) | 20 |
| | | 一次函数解析式 (Analytical Formula of Linear Functions) | 20 |
| | 二次函数 (Quadratic Function) | 二次函数的应用 (Applications of Quadratic Functions) | 20 |
| | | 抛物线的性质 (Properties of Parabolas) | 18 |
| | 反比例函数 (Inverse Proportional Function) | 定义 (Definition) | 20 |
| | | 应用 (Applications) | 20 |
| | | 性质 (Properties) | 19 |
| | 平面直角坐标系 (Rectangular Coordinate System) | 有序数对 (Ordered Pair) | 20 |
| | | 象限中的点 (Points of Quadrant) | 14 |
| 数与式 (Number and Expression) | 代数式 (Algebraic Expression) | 代数式求值 (Algebraic Expression Evaluation) | 20 |
| | | 同类项 (Similar Items) | 20 |
| | 分式 (Fraction) | 指数幂 (Exponential Power) | 20 |
| | | 约分 (Fraction Reduction) | 19 |
| | 因式 (Factor) | 十字相乘法 (Cross Multiplication) | 20 |
| | | 公因式提取 (Common Factor Extraction) | 18 |
| | 应用 (Application) | 流水问题 (Flow Problem) | 20 |
| | | 鸽巢问题 (Pigeon Nest Problem) | 20 |
| | 整式 (Integral Expression) | 乘法公式 (Multiplication) | 20 |
| | | 整式的乘除及混合 (Multiplication, Division & Mixing) | 20 |
| | | 整式的加减 (Addition & Subtraction) | 20 |
| | 无理数 (Irrational Number) | 无理数识别 (Irrational Number Recognition) | 20 |
| | 根式 (Radical Expression) | 二次根式的运算 (Operation of Quadratic Radicals) | 20 |
| | | 同类二次根式 (Similar Quadratic Radicals) | 20 |
| | | 平方根与算术平方根 (Square Root & Arithmetic Square Root) | 20 |
| | | 立方根 (Cube Root) | 20 |
| 方程与不等式 (Equations & Inequalities) | 一元一次方程 (Linear Equation in One Variable) | 一元一次方程的应用 (Applications) | 20 |
| | | 解一元一次方程 (Solutions) | 20 |
| | 一元二次方程 (Quadratic Equation in One Variable) | 一元二次方程的应用 (Applications) | 20 |
| | | 解一元二次方程 (Solutions) | 20 |
| | 不等式与不等式组 (Inequalities & Groups of Inequalities) | 一元一次不等式的应用 (Applications of Unary First Order Inequality) | 20 |
| | | 一元一次不等式组的应用 (Applications of Unary First Order Groups of Inequalities) | 20 |
| | | 解一元一次不等式 (Solve the First Inequality of One Variable) | 20 |
| | | 解一元一次不等式组 (Solve Unary First Order Groups of Inequalities) | 20 |
| | 分式方程 (Fractional Equation) | 分式方程的应用 (Application of Fractional Equation) | 20 |
| | | 解分式方程 (Solve Fractional Equation) | 20 |
| 统计与概率 (Statistics and Probability) | 数据分析 (Data Analysis) | 数据的波动趋势 (Fluctuating Trend of Data) | 20 |
| | | 数据的集中趋势 (Central Tendency of Data) | 20 |
| | 概率 (Probability) | 概率的应用 (Applications of Probability) | 20 |
| | | 求概率 (Find Probability) | 20 |
| | | 随机事件与概率 (Random Events & Probabilities) | 20 |
Table 9: Details of the hierarchical concepts in Middle-ZH.