# ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
## Abstract
This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates the concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with a single average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies. Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across math concepts and may even fail catastrophically on the most basic ones. We also introduce an efficient fine-tuning strategy to address the weaknesses of existing LLMs. Finally, we hope ConceptMath can guide developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models. The data and code are available at https://github.com/conceptmath/conceptmath.
* First three authors contributed equally. ${}^{\dagger}$ Corresponding Author: Jiaheng Liu.
## 1 Introduction
Mathematical reasoning is a crucial capability for Large Language Models (LLMs). Recent LLMs, including Claude (Anthropic, 2023), GPT-4 (OpenAI, 2023), and LLaMA (Touvron et al., 2023a), have demonstrated impressive mathematical reasoning on existing benchmarks, achieving high average accuracies on datasets like GSM8K (Cobbe et al., 2021). Although these benchmarks can measure the overall mathematical reasoning of LLMs on average, they fail to probe the fine-grained failure modes on specific mathematical concepts. For example, Fig. 1 shows that the performance of LLaMA2-13B varies significantly across concepts and that it fails even on simple concepts like Rational number and Cylinders. Knowing these specific failure modes is crucial, especially in practical applications that depend on particular mathematical abilities. For example, for financial analysts, calculation and statistics are the concepts of most interest, while others like geometry matter less.
Moreover, mathematics is by nature fine-grained rather than holistic. It is typically organized into distinct math concepts (see https://en.wikipedia.org/wiki/Lists_of_mathematics_topics), and humans develop comprehensive mathematical capabilities through a concept-by-concept, curriculum-based learning process (Simon, 2011; Fritz et al., 2013). These observations underscore the core motivation of this paper: the need for a fine-grained benchmark that evaluates the concept-wise mathematical reasoning capabilities of LLMs.
<details>
<summary>x1.png Details</summary>

### Visual Description
A line graph comparing the concept-wise accuracy of LLaMA2 (green line) and LLaMA2-FT (blue line). The x-axis lists math concepts (Powers, Numerical exprs, Estimation & rounding, Decimals, Light & heavy, Temperature, Ratio, Patterns, Cylinders, Perimeter, Rational number, Polygons, Probability); the y-axis shows accuracy from 0 to 90. A gray "Weaknesses" region spans Cylinders through Probability, where LLaMA2 stays at roughly 10–20 accuracy; a pink "Enhancing Weaknesses" region with starred points shows LLaMA2-FT rising to roughly 50–85 on those same concepts.
</details>
Figure 1: The concept-wise accuracies of LLaMA2-13B and the fine-tuned version based on our efficient fine-tuning method (i.e., LLaMA2-FT).
Therefore, we first introduce ConceptMath, the first bilingual (English and Chinese), concept-wise benchmark for measuring mathematical reasoning. ConceptMath gathers math concepts from four educational systems, resulting in four distinct mathematical concept systems: English Elementary, English Middle, Chinese Elementary, and Chinese Middle (abbreviated as Elementary-EN, Middle-EN, Elementary-ZH, and Middle-ZH, respectively). Each concept system organizes around 50 atomic math concepts under a three-level hierarchy, and each concept includes approximately 20 mathematical problems. Overall, ConceptMath comprises a total of 4011 math word problems across 214 math concepts, and Fig. 2 shows a diagram overview of ConceptMath.
Second, based on ConceptMath, we perform extensive experiments to assess the mathematical reasoning of existing LLMs, including 2 closed-source LLMs and 17 open-source LLMs. These evaluations are performed in zero-shot, chain-of-thought (CoT), and few-shot settings. To our surprise, even though most of the evaluated LLMs achieve high average accuracies on traditional mathematical benchmarks (e.g., GSM8K), they fail catastrophically across a wide spectrum of mathematical concepts.
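Once each problem's correctness is recorded alongside its concept label, concept-wise accuracy is a simple grouped aggregation. A minimal sketch in Python (the `(concept, is_correct)` record format is an illustrative assumption, not the benchmark's released schema):

```python
from collections import defaultdict

def concept_wise_accuracy(results):
    """Aggregate per-problem correctness into per-concept accuracies.

    `results` is a list of (concept, is_correct) pairs -- a hypothetical
    record format for illustration, not ConceptMath's released schema.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for concept, is_correct in results:
        total[concept] += 1
        correct[concept] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}

results = [
    ("Decimals", True), ("Decimals", True),
    ("Cylinders", True), ("Cylinders", False),
]
acc = concept_wise_accuracy(results)
# acc["Decimals"] == 1.0, acc["Cylinders"] == 0.5
```

Reporting per-concept rather than pooled accuracy is exactly what surfaces the variations described above: a model can score well on the pooled average while one concept's accuracy is near zero.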
Third, to make targeted improvements on underperforming math concepts, we propose an efficient fine-tuning strategy: we first train a concept classifier and then use it to retrieve a set of samples from large open-source math datasets (Paster et al., 2023; Wang et al., 2023b) for further LLM fine-tuning. In Fig. 1, for LLaMA2-FT, we observe that performance on these weak concepts improves substantially after applying this fine-tuning method.
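The selection step of such a strategy can be sketched as follows. As a stand-in for the trained concept classifier, this sketch scores problems against bag-of-words concept centroids; all names, seed texts, and the threshold are illustrative assumptions, not the paper's actual classifier:

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token-count dictionary."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count dictionaries."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def select_for_weak_concepts(corpus, concept_seeds, weak_concepts, threshold=0.2):
    """Label each corpus problem with its highest-scoring concept and keep
    problems whose best concept is a known weakness.

    A real implementation would use the trained classifier; this
    nearest-centroid scorer only illustrates the selection loop.
    """
    centroids = {c: bow(" ".join(seeds)) for c, seeds in concept_seeds.items()}
    selected = []
    for problem in corpus:
        vec = bow(problem)
        best_c, best_s = max(
            ((c, cosine(vec, cen)) for c, cen in centroids.items()),
            key=lambda cs: cs[1],
        )
        if best_c in weak_concepts and best_s >= threshold:
            selected.append((best_c, problem))
    return selected

seeds = {
    "Cylinders": ["volume of a cylinder radius height"],
    "Probability": ["probability of drawing a red ball"],
}
corpus = [
    "find the volume of a cylinder with radius 3",
    "what is the probability of rolling a six",
]
picked = select_for_weak_concepts(corpus, seeds, weak_concepts={"Cylinders"})
```

The retrieved subset is then used as additional fine-tuning data targeted at the concepts where the model underperforms.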
In summary, our contributions are as follows:
- We introduce ConceptMath, the first bilingual, concept-wise benchmark for measuring mathematical reasoning. ConceptMath encompasses 4 concept systems, 214 math concepts, and 4011 math word problems, which can guide further improvements to the mathematical reasoning of existing models.
- Based on ConceptMath, we evaluate many LLMs and perform a comprehensive analysis of their results. For example, we observe that most of these LLMs (whether open-source, closed-source, general-purpose, or math-specialized) show significant performance variations across math concepts.
- We also evaluate the contamination rate of ConceptMath and introduce a simple and efficient fine-tuning method to address the weaknesses of existing LLMs.
<details>
<summary>x2.png Details</summary>

### Visual Description
A circular hierarchy diagram of the Elementary-EN concept system. The outer ring holds three color-coded main categories: Geometry (blue), Measurement (yellow), and Calculate & Properties (green). The middle ring holds their subcategories (Geometry: Two-Dim Shapes, Three-Dim Shapes, Coordinate Plane; Measurement: Basic Knowledge, Statistics, Size; Calculate & Properties: Calculate, Properties), and the inner ring holds atomic concepts (e.g., Triangles, Cylinders, Probability, Add, Estimation & Rounding, Patterns).
</details>
(a) English Elementary (Elementary-EN)
<details>
<summary>x3.png Details</summary>

### Visual Description
A circular hierarchy diagram of the Middle-EN concept system, divided into four color-coded sections: Calculate (teal), Geometry (blue), Statistic & Probability (yellow), and Exprs, Equations & Functions (orange). Each section branches into subcategories (Calculate: Basic Calculate, Number Theory, Irrational Numbers, Rational & Proportional Relationships, Financial Literacy, Measurement; Geometry: Two-Dim Figures, Three-Dim Figures, Coordinate Plane, Lines & Angles; Statistic & Probability: Data, Probability; Exprs, Equations & Functions: Function Concepts, Equations, Inequalities), which branch further into atomic concepts (e.g., Exponents & Scientific Notation, Square Roots & Cube Roots, Perimeter & Area, Probability of Compound Events, Linear Equations).
</details>
(b) English Middle (Middle-EN)
<details>
<summary>x4.png Details</summary>

### Visual Description
A circular hierarchy diagram of the Elementary-ZH concept system, with color-coded main categories: Application (应用, orange), Data & Algebra (数据与代数, green), Geometry (几何, blue), Measurement & Statistics (度量与统计, yellow), and Measurement (度量, light blue). Each category branches into subcategories and atomic concepts, e.g., Geometry covers plane figures (圆 Circle, 三角形 Triangle, 四边形 Quadrilateral, 多边形 Polygon) and solid figures (立体图形), while Data & Algebra covers fraction operations and their applications (分数运算).
</details>
(c) Chinese Elementary (Elementary-ZH)
<details>
<summary>x5.png Details</summary>

### Visual Description
A circular hierarchy diagram of the Middle-ZH concept system, divided into four color-coded categories: Numbers & Expressions (数与式, red), Geometry (几何, yellow), Functions (函数, orange), and Equations & Inequalities (方程与不等式, blue). Atomic concepts include, for example, primes and factorization (素数与分解), square roots and arithmetic square roots (平方根与算术平方根), parallelograms (平行四边形), the inscribed angle theorem (圆周角定理), linear and quadratic functions and their graphs (一次函数, 二次函数), systems of linear equations (一次方程组), and linear inequalities (一次不等式).
</details>
(d) Chinese Middle (Middle-ZH)
Figure 2: Diagram overview of the four concept systems in ConceptMath. Chinese concept names are translated into English in Appendix A.
## 2 ConceptMath
ConceptMath is the first bilingual, concept-wise benchmark for measuring mathematical reasoning. In this section, we describe its design principles, the dataset collection process, dataset statistics, and an efficient fine-tuning strategy to address the weaknesses identified by ConceptMath.
### 2.1 Design Principle
We created ConceptMath based on the following two high-level design principles:
#### Concept-wise Hierarchical System.
The primary goal of ConceptMath is to evaluate the mathematical reasoning capabilities of language models at different granularities. Therefore, ConceptMath organizes math problems within a three-level hierarchy of mathematical concepts (Fig. 2). This approach enables concept-wise evaluation of mathematical reasoning and makes targeted, effective improvements possible.
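Such a three-level organization maps naturally onto a nested data layout. A minimal sketch, where the layout and field names are illustrative assumptions rather than the released file format:

```python
# Hypothetical nested layout: category -> subcategory -> atomic concept -> problems.
system = {
    "Geometry": {
        "Two-Dim Shapes": {
            "Triangles": [{"question": "...", "answer": "..."}],
            "Polygons": [{"question": "...", "answer": "..."}],
        },
    },
    "Measurement": {
        "Statistics": {
            "Probability": [{"question": "...", "answer": "..."}],
        },
    },
}

def count_concepts_and_problems(system):
    """Walk the three-level hierarchy, tallying atomic concepts and problems."""
    n_concepts = n_problems = 0
    for subcategories in system.values():
        for concepts in subcategories.values():
            for problems in concepts.values():
                n_concepts += 1
                n_problems += len(problems)
    return n_concepts, n_problems
```

A walk like this is also a convenient sanity check that every atomic concept actually carries its expected quota of problems.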
#### Bilingualism.
Most current mathematical benchmarks focus solely on English, leaving multilingual mathematical reasoning unexplored. As an early effort in this direction, we evaluate mathematical reasoning in two languages: English and Chinese. Moreover, since cultures and educational systems vary across languages, common math concepts can differ substantially. We therefore carefully collect concepts in both languages instead of merely translating from one language to the other. For example, measurement metrics (e.g., money, size) differ between English and Chinese.
### 2.2 Data Collection
For data collection, we take a two-step approach to operationalize the design principles above: first, we recruit experts to delineate a hierarchy of math concepts based on different education systems; second, we collect problems for each concept from various sources or design problems manually, followed by quality assessment and data cleaning.
#### Math Concept System Construction.
Since the education systems provide a natural hierarchy of math concepts, we recruited four teachers from elementary and middle schools, specializing in either English or Chinese, to organize a hierarchy of math concepts for different education systems. This leads to four concept systems: Elementary-EN, Middle-EN, Elementary-ZH, and Middle-ZH, with each system consisting of a three-level hierarchy of around 50 atomic math concepts (Fig. 2).
#### Math Problem Construction.
We then conducted thorough data acquisition from various sources (including educational websites, textbooks, and search engines queried with specific concepts) to collect math word problems (both questions and answers) for each math concept. To guarantee balance across concepts, approximately 20 problems were gathered for each one. Following this, both GPT-4 (OpenAI, 2023) and human experts were employed to verify and rectify the categorization and solution of each problem. However, we observed that for some concepts the problem count was significantly below 20; to address this, manual efforts were undertaken to augment these categories, ensuring a consistent collection of 20 problems per concept. Furthermore, to broaden the diversity of the dataset and minimize the risk of data contamination, all gathered problems were paraphrased using GPT-4. The collection and annotation were carried out by a team of six members, each holding a university degree in an engineering discipline, to maintain a high level of technical expertise.
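The balancing check described above is simple to automate: count problems per concept and report each concept's shortfall against the 20-problem target. A sketch with an illustrative `(concept, problem)` record format (not the dataset's actual schema):

```python
from collections import Counter

def augmentation_deficits(labeled_problems, target=20):
    """Report concepts with fewer than `target` problems and the shortfall.

    `labeled_problems` is a list of (concept, problem) pairs -- an
    illustrative format, not the dataset's actual schema.
    """
    counts = Counter(concept for concept, _ in labeled_problems)
    return {c: target - n for c, n in counts.items() if n < target}

problems = ([("Decimals", f"q{i}") for i in range(20)]
            + [("Cylinders", f"q{i}") for i in range(17)])
# Cylinders is 3 problems short of the 20-problem target.
deficits = augmentation_deficits(problems)
```

Concepts flagged by such a check are exactly those that required the manual augmentation pass.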
### 2.3 Dataset Statistics
#### Comparison to existing datasets.
As shown in Table 1, ConceptMath differs from related datasets in several aspects: (1) ConceptMath is the first dataset to study fine-grained mathematical concepts, encompassing 4 concept systems, 214 math concepts, and 4011 math word problems. (2) Problems in ConceptMath are carefully annotated based on the mainstream English (EN) and Chinese (ZH) education systems.
#### Details on the hierarchical system.
Beyond Fig. 2, we provide further details on the hierarchical system in Appendix A.
#### Length distribution.
Fig. 3 shows the length distribution of ConceptMath, where the number of tokens is reported (we use the "cl100k_base" tokenizer from https://github.com/openai/tiktoken). The minimum, average, and maximum token counts of the questions are 4, 41, and 309, respectively, which reflects their lexical richness.
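These statistics are straightforward to reproduce with any tokenizer; the computation itself is tokenizer-agnostic. A sketch with a pluggable tokenizer, using whitespace splitting as a stand-in (with tiktoken installed, one could pass `tiktoken.get_encoding("cl100k_base").encode` instead):

```python
def length_stats(questions, tokenize):
    """Return (min, mean, max) token counts over a list of questions.

    `tokenize` is any callable returning a token sequence; whitespace
    splitting below is a stand-in for the "cl100k_base" tokenizer.
    """
    lengths = [len(tokenize(q)) for q in questions]
    return min(lengths), sum(lengths) / len(lengths), max(lengths)

questions = [
    "What is 2 + 2 ?",
    "Compute the area of a triangle with base 3 and height 4 .",
]
lo, mean, hi = length_stats(questions, str.split)
# lo == 6, mean == 9.5, hi == 13
```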
| Benchmark | Language | Fine-grained | Size |
| --- | --- | --- | --- |
| GSM8K | EN | ✗ | 1319 |
| MATH | EN | ✗ | 5000 |
| TabMWP | EN | ✗ | 7686 |
| Dolphin18K | EN | ✗ | 1504 |
| Math23K | ZH | ✗ | 1000 |
| ASDiv | EN | ✗ | 2305 |
| SVAMP | EN | ✗ | 300 |
| SingleOp | EN | ✗ | 159 |
| MMLU-Math | EN | ✗ | 906 |
| ConceptMath | EN&ZH | ✓ | 4011 |
Table 1: A comparison of our ConceptMath with some notable mathematical datasets. Note that the size is the number of samples of the test split.
Figure 3: Length distributions of our ConceptMath.
### 2.4 Efficient Fine-Tuning
Based on our ConceptMath, we are able to identify weaknesses in the mathematical reasoning capability of LLMs through concept-wise evaluation. In this section, we explore a straightforward approach to enhance mathematical abilities on specific concepts by first training a concept classifier and then curating a set of samples from a large open-source math dataset. Specifically, by collecting an extra 10 problems per concept, we first train a classifier capable of identifying the concept class of a given question. The backbone of this classifier is a pretrained bilingual LLM, with a classification head applied to its last hidden output feature. Then, we fine-tune LLMs on this concept-specific dataset combined with an existing general math dataset, which aims to avoid overfitting on a relatively small dataset. More details are provided in Appendix B.
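The curation step can be sketched as follows. Here `classify_concept` is a hypothetical keyword-based stand-in for the LLM-backbone classifier described above, and the concept names and per-concept cap are illustrative:

```python
# Sketch: curate concept-specific training samples from a large open-source
# math dataset using a trained concept classifier.
from collections import defaultdict

def classify_concept(question: str) -> str:
    # Hypothetical keyword-based stand-in; the paper's classifier is a
    # pretrained bilingual LLM with a classification head on its last
    # hidden output feature.
    q = question.lower()
    if "perimeter" in q:
        return "Perimeter"
    if "cylinder" in q:
        return "Cylinders"
    return "Other"

def curate(pool, target_concepts, per_concept=50):
    """Keep up to `per_concept` samples for each targeted weak concept."""
    kept = defaultdict(list)
    for question, answer in pool:
        concept = classify_concept(question)
        if concept in target_concepts and len(kept[concept]) < per_concept:
            kept[concept].append((question, answer))
    return kept
```

The curated samples would then be mixed with a general math dataset before fine-tuning, as described above.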
## 3 Experiments
In this section, we perform extensive experiments to demonstrate the effect of our ConceptMath.
### 3.1 Experimental Setup
#### Evaluated Models.
We assess the mathematical reasoning of existing advanced LLMs on ConceptMath, including 2 closed-source LLMs (i.e., GPT-3.5/GPT-4 (OpenAI, 2023)) and 17 open-source LLMs (i.e., WizardMath-13B Luo et al. (2023), MetaMath-13B Yu et al. (2023), MAmmoTH-13B Yue et al. (2023), Qwen-14B/72B Bai et al. (2023b), Baichuan2-13B Baichuan (2023), ChatGLM3-6B Du et al. (2022), InternLM2-7B/20B Team (2023a), InternLM2-Math-7B/20B Ying et al. (2024), LLaMA2-7B/13B/70B Touvron et al. (2023b), Yi-6B/34B Team (2023b) and DeepSeekMath-7B Shao et al. (2024)). Note that WizardMath-13B, MetaMath-13B, and MAmmoTH-13B are specialized math language models fine-tuned from LLaMA2, while InternLM2-Math and DeepSeekMath-7B are specialized math language models fine-tuned from their corresponding base models. More details of these evaluated models can be found in Appendix C.
| Model | Elementary-EN | | | Middle-EN | | | Elementary-ZH | | | Middle-ZH | | | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | ZS | ZS-COT | FS | ZS | ZS-COT | FS | ZS | ZS-COT | FS | ZS | ZS-COT | FS | |
| Yi-6B | 67.94 | 67.56 | 59.03 | 65.55 | 64.59 | 56.05 | 34.33 | 31.91 | 37.86 | 36.46 | 36.19 | 36.46 | 49.49 |
| ChatGLM3-6B | 60.69 | 63.10 | 53.18 | 51.25 | 60.17 | 51.34 | 46.23 | 43.63 | 40.74 | 44.77 | 43.32 | 40.43 | 49.90 |
| DeepSeekMath-7B | 66.92 | 77.35 | 73.92 | 56.53 | 69.87 | 66.31 | 60.47 | 62.33 | 64.19 | 56.50 | 56.95 | 56.86 | 64.02 |
| InternLM2-Math-7B | 71.12 | 72.01 | 69.59 | 63.44 | 62.96 | 63.05 | 57.30 | 58.23 | 58.60 | 53.79 | 53.16 | 53.88 | 61.43 |
| InternLM2-7B | 68.83 | 69.97 | 66.67 | 37.04 | 65.83 | 55.47 | 47.63 | 49.02 | 53.02 | 45.22 | 45.40 | 44.86 | 54.08 |
| LLaMA2-7B | 36.51 | 42.62 | 38.68 | 34.26 | 39.16 | 33.69 | 15.72 | 17.67 | 17.58 | 30.87 | 32.22 | 27.80 | 30.57 |
| MAmmoTH-13B | 61.32 | 52.42 | 56.49 | 53.93 | 45.20 | 48.08 | 22.33 | 33.30 | 23.81 | 27.98 | 43.05 | 29.15 | 41.42 |
| WizardMath-13B | 41.73 | 44.78 | 34.99 | 36.85 | 37.72 | 45.11 | 10.51 | 11.26 | 18.70 | 12.36 | 15.52 | 22.92 | 27.70 |
| MetaMath-13B | 54.45 | 51.78 | 47.96 | 44.24 | 43.47 | 47.50 | 11.44 | 17.30 | 27.53 | 21.21 | 26.08 | 29.60 | 35.21 |
| Baichuan2-13B | 68.83 | 68.58 | 54.07 | 67.66 | 69.67 | 40.40 | 57.02 | 58.23 | 22.05 | 55.05 | 55.32 | 26.90 | 53.65 |
| LLaMA2-13B | 44.02 | 49.75 | 47.07 | 44.72 | 46.45 | 43.09 | 20.19 | 24.19 | 22.14 | 33.30 | 35.38 | 26.17 | 36.37 |
| Qwen-14B | 46.95 | 65.78 | 72.65 | 38.48 | 59.60 | 67.85 | 28.09 | 65.12 | 64.47 | 22.92 | 58.30 | 62.09 | 54.36 |
| InternLM2-Math-20B | 74.05 | 75.32 | 73.41 | 64.11 | 71.21 | 70.83 | 62.98 | 61.95 | 61.77 | 55.14 | 55.78 | 56.86 | 65.28 |
| InternLM2-20B | 53.31 | 72.52 | 73.28 | 45.11 | 67.47 | 56.72 | 48.19 | 55.53 | 59.81 | 45.13 | 50.63 | 56.68 | 57.03 |
| Yi-34B | 74.68 | 73.66 | 56.36 | 72.26 | 74.66 | 65.83 | 50.05 | 51.16 | 38.79 | 45.40 | 43.95 | 40.97 | 57.31 |
| LLaMA2-70B | 56.11 | 60.31 | 30.53 | 58.06 | 60.94 | 31.67 | 28.65 | 26.70 | 24.37 | 37.64 | 34.30 | 28.43 | 39.81 |
| Qwen-72B | 77.10 | 75.06 | 77.23 | 74.66 | 69.87 | 73.99 | 71.16 | 68.65 | 61.86 | 71.30 | 65.43 | 62.45 | 70.73 |
| GPT-3.5 | 85.75 | 92.37 | 84.35 | 83.88 | 90.12 | 82.73 | 56.47 | 53.21 | 56.93 | 51.90 | 53.52 | 55.69 | 70.58 |
| GPT-4 | 86.77 | 90.20 | 89.57 | 84.26 | 89.83 | 88.68 | 67.91 | 72.28 | 72.00 | 63.81 | 64.26 | 66.61 | 78.02 |
| Avg. | 63.00 | 66.59 | 61.00 | 56.65 | 62.57 | 57.28 | 41.93 | 45.35 | 43.49 | 42.67 | 45.72 | 43.41 | 52.47 |
Table 2: Results of different models on our constructed ConceptMath benchmark dataset. Note that “ZS”, “ZS-COT”, and “FS” represent “zero-shot”, “zero-shot w/ chain-of-thought”, and “few-shot”, respectively. Models are grouped roughly according to their model sizes.
#### Evaluation Settings.
We employ three distinct evaluation settings: zero-shot, zero-shot with chain-of-thought (CoT), and few-shot prompting. Zero-shot prompting assesses the models’ intrinsic problem-solving abilities without any prior examples. Zero-shot CoT prompting evaluates the models’ ability to employ a logical chain of thought. In the few-shot setting, the model is provided with fixed 5-shot prompts for the different systems (see Appendix E), which include five newly created examples with concise ground-truth targets; this setting measures in-context learning abilities. Besides, following MATH (Hendrycks et al., 2021b), all questions and answers in ConceptMath have been carefully curated, and each problem is evaluated based on exact matches. Moreover, greedy decoding with a temperature of 0 is used.
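Exact-match scoring can be sketched as below. The paper does not spell out its answer-normalization rules, so the `normalize` step here (strip whitespace, a trailing period, and thousands separators, then lowercase) is an illustrative assumption rather than the benchmark's exact procedure:

```python
# Sketch: exact-match accuracy over (prediction, reference) answer pairs.
# The normalization rules below are illustrative assumptions, not the
# paper's exact matching procedure.
def normalize(answer: str) -> str:
    return answer.strip().rstrip(".").replace(",", "").lower()

def exact_match_accuracy(predictions, references) -> float:
    hits = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)
```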
### 3.2 Results
#### Overall Accuracy
We present the overall accuracies of different LLMs on our ConceptMath benchmark under various prompting settings in Table 2, and analyze the mathematical abilities of these LLMs in both English and Chinese in Fig. 4. Our analysis leads to the following key findings: (1) GPT-3.5/4 showcase the most advanced mathematical reasoning abilities among LLMs in both the English and Chinese systems, and the leading open-source model, Qwen-72B, achieves performance comparable to GPT-3.5. (2) Most existing LLMs score substantially lower on the Chinese systems than on the English ones. For example, the accuracies of GPT-4 on Middle-ZH and Middle-EN are 63.81 and 84.26, respectively. (3) Several models (e.g., WizardMath-13B and MetaMath-13B) fine-tuned from LLaMA2-13B achieve slight improvements on the English systems, but perform far worse than LLaMA2-13B on the Chinese systems, which indicates that domain-specific fine-tuning may degrade the generalization abilities of LLMs. (4) The mathematical models (i.e., InternLM2-Math-7B/20B and DeepSeekMath-7B) obtained by continued pretraining on large-scale math-related data ($\geq$100B tokens) show clear improvements over models of similar size, which indicates that large-scale pretraining is effective for improving mathematical reasoning abilities.
Figure 4: Mean accuracies for English, Chinese, and overall educational systems.
#### Average Concept-wise Accuracy.
In Fig. 5 and Fig. 6, to better analyze the effectiveness of our ConceptMath, we further provide the concept-wise accuracies averaged over the evaluated models for different mathematical concepts under zero-shot prompting on Middle-EN and Middle-ZH (see Appendix D for more results on Elementary-EN and Elementary-ZH). We observe that the accuracies vary substantially across concepts for existing LLMs. For example, for Middle-ZH in Fig. 6, around 18% of concepts exhibit an accuracy lower than 30%. Thus, to improve the mathematical abilities of LLMs, these concepts with large room for improvement should be given the highest priority, which further shows the advantage of ConceptMath.
Figure 5: Mean concept accuracies on Middle-EN.
Figure 6: Mean concept accuracies on Middle-ZH.
#### Concept-wise Accuracy.
Fig. 7 and Fig. 8 show that most existing LLMs, whether open-source, closed-source, general-purpose, or math-specialized, exhibit notable differences in their concept accuracies under the zero-shot prompting setting. These disparities may stem from variations in training datasets, strategies, and model sizes, which suggests that apart from common weaknesses, each model possesses its own unique areas of deficiency. For brevity, we only show a subset of models on Middle-EN and Middle-ZH. The concept accuracies of the Elementary-EN and Elementary-ZH systems and the full results of all models can be found in Appendix D.
Figure 7: Concept accuracies on Middle-EN.
Figure 8: Concept accuracies on Middle-ZH.
| Model | Elementary-EN | Middle-EN | Elementary-ZH | Middle-ZH | Avg. $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| Yi-6B | 5.30 / 1.73 | 5.21 / 1.37 | 0.04 / 0.20 | 0.36 / 0.35 | 2.73 / 0.91 |
| ChatGLM3-6B | 7.42 / 0.22 | 7.55 / 0.23 | 0.11 / 0.02 | 0.35 / 0.05 | 3.86 / 0.13 |
| InternLM2-Math-7B | 7.42 / 0.22 | 7.55 / 0.23 | 0.11 / 0.02 | 0.35 / 0.05 | 3.86 / 0.13 |
| InternLM2-7B | 5.36 / 1.03 | 5.27 / 0.84 | 0.01 / 0.37 | 0.33 / 0.49 | 2.74 / 0.68 |
| MAmmoTH-13B | 7.67 / 0.47 | 7.97 / 0.46 | 0.00 / 0.03 | 0.35 / 0.03 | 4.00 / 0.25 |
| WizardMath-13B | 8.41 / 0.35 | 8.23 / 0.34 | 0.00 / 0.02 | 0.55 / 0.02 | 4.30 / 0.18 |
| MetaMath-13B | 7.67 / 0.47 | 7.97 / 0.46 | 0.00 / 0.03 | 0.35 / 0.03 | 4.00 / 0.25 |
| Baichuan2-13B | 7.20 / 1.43 | 6.58 / 1.18 | 0.05 / 0.54 | 0.41 / 0.65 | 3.56 / 0.95 |
| LLaMA2-13B | 6.80 / 0.73 | 6.36 / 0.64 | 0.01 / 0.15 | 0.56 / 0.16 | 3.43 / 0.42 |
| Qwen-14B | 11.04 / 1.58 | 9.73 / 1.08 | 1.43 / 1.27 | 0.70 / 0.93 | 5.73 / 1.22 |
| InternLM2-Math-20B | 5.58 / 1.30 | 5.51 / 0.99 | 0.03 / 0.47 | 0.34 / 0.47 | 2.86 / 0.81 |
| InternLM2-20B | 7.20 / 1.43 | 6.58 / 1.18 | 0.05 / 0.54 | 0.41 / 0.65 | 3.56 / 0.95 |
| GPT-3.5 | 9.48 / - | 9.21 / - | 0.00 / - | 0.31 / - | 4.75 / - |
| GPT-4 | 8.68 / - | 8.24 / - | 0.15 / - | 0.68 / - | 4.44 / - |
Table 3: Data contamination rate of LLMs. We provide two different contamination detection methods. The values in the table represent “Rouge / Prob”. Note that the second method based on output probability distributions can only be applied to the open-source models.
### 3.3 Analysis
#### Contamination.
To determine whether a text appears in the pretraining data of an LLM, we apply two contamination detection methods (i.e., Rouge-based and Prob-based) to ConceptMath in Table 3. Specifically, for the Rouge-based method, we feed the first 50% of each question as input and compute the Rouge-L score between the generated continuation and the ground-truth last 50% of the question, where a lower Rouge-L score means a lower contamination rate. For the Prob-based method, we follow Shi et al. (2023) and use the MIN-K% probability metric, which first obtains the probability of each token in the text, then selects the $K\%$ of tokens with the minimum probabilities and computes their average log-likelihood. If this average log-likelihood is high, the text is likely in the pretraining data. We set $K=10$ in our setting. In Table 3, we observe that the contamination rates on our ConceptMath are very low, which means that ConceptMath provides a reasonable evaluation for existing LLMs.
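The MIN-K% computation from Shi et al. (2023) can be sketched as follows. In practice the per-token log-probabilities would come from the open-source model under test; here they are supplied directly for illustration:

```python
# Sketch: MIN-K% probability detection (Shi et al., 2023). Select the K%
# of tokens with the lowest log-probabilities under the model and average
# them; a high average log-likelihood suggests the text appeared in the
# model's pretraining data.
def min_k_prob(token_logprobs, k_percent=10):
    # Keep at least one token even for short texts.
    n = max(1, int(len(token_logprobs) * k_percent / 100))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```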
| Concept | Concept Acc | Error Type | Example |
| --- | --- | --- | --- |
| Perimeter | 26.32 | Calculation Error | Q: Find the perimeter of a 16 by 18 feet rectangle. A: The perimeter is 16 + 18 + 16 + 18 = 72 feet. |
| | | Knowledge Error | Q: The perimeter of a rectangular sticker is 14 centimeters. The sticker is 2 centimeters wide. How tall is it? A: The height equals the perimeter subtracts the width. The height of the rectangle is $14-12=12$ centimeters. |
| Polygons | 11.11 | Knowledge Error | Q: A regular polygon has an interior angle that measures 120 degrees. How many sides does this polygon have? A: Let $n$ be the number of sides of the polygon. We have that $n\times 120=360$ . Hence: $n=3$ . |
| Cylinders | 11.76 | Calculation Error | Q: How much soup can a cylinder-shaped can store, if the radius of the can is 3 inches and the height is 5 inches? A: … The volume of the can is V = $\pi\times 3^{2}\times 5=30\pi$ . |
Table 4: Error reasons of several unmastered concepts of LLaMA2-13B.
#### Unmastered Concepts.
We also highlight several unmastered concepts of LLaMA2-13B in Table 4, which shows that ConceptMath is effective in guiding further refinement of existing LLMs.
| Models | LLaMA2 | LLaMA2 (w/ MMQA) | LLaMA2 (w/ MMQA & CS) |
| --- | --- | --- | --- |
| Cones | 0.00 | 17.65 | 23.53 |
| Spheres | 5.88 | 29.41 | 35.29 |
| Polygons | 11.11 | 61.11 | 66.67 |
| Rational Number | 11.76 | 23.53 | 52.94 |
| Cylinders | 11.76 | 35.29 | 47.06 |
| Angles | 11.76 | 47.06 | 58.82 |
| Probability | 18.75 | 25.00 | 75.00 |
| Perimeter | 26.32 | 42.11 | 63.16 |
| Volume | 27.78 | 38.89 | 66.67 |
| Proportional | 27.78 | 33.33 | 44.44 |
| Avg Acc. (over 10 concepts) | 15.29 | 36.88 | 53.36 |
| Avg Acc. (over 33 concepts) | 51.94 | 58.14 | 60.67 |
| Overall Acc. | 44.02 | 53.94 | 59.29 |
Table 5: Results of fine-tuning models. “MMQA” and “CS” denote MetaMathQA and our constructed Concept-Specific training datasets, respectively. Introducing CS data specifically for the bottom 10 concepts significantly enhances these concepts’ performance, while slightly improving the performance across the remaining 33 concepts.
#### Evaluation Prompting.
Different from few-shot or CoT prompting, which can boost closed-source models, we find in Table 2 that zero-shot prompting is more effective for certain open-source LLMs. This disparity may arise either because the models are not sufficiently powerful to possess mathematical CoT capabilities Yu et al. (2023); Wei et al. (2022) or because these models have already incorporated CoT data during training Longpre et al. (2023). Consequently, to ensure a comprehensive analysis, we employ all three prompting methods for evaluation.
#### Efficient Fine-tuning.
To show the effect of efficient fine-tuning, we take LLaMA2-13B as an example in Table 5. Specifically, we first select the 10 concepts with the lowest accuracies in Elementary-EN. Then, we crawl 495 samples (about 50 samples per concept) using the trained classifier as the Concept-Specific (CS) training data (see Appendix B for more details). Meanwhile, to avoid overfitting, we introduce the MetaMathQA (MMQA; Yu et al. (2023)) data to preserve general mathematical abilities. After that, we fine-tune LLaMA2-13B using only MMQA (i.e., LLaMA2 (w/ MMQA)) or using both MMQA and CS data (i.e., LLaMA2 (w/ MMQA & CS)). In Table 5, we observe that LLaMA2 (w/ MMQA & CS) achieves significant improvements on the 10 lowest concepts while largely preserving performance on the other 33 concepts, which shows the effect of efficient fine-tuning and the advantages of our ConceptMath.
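The first step above, selecting the weakest concepts to target for CS data collection, can be sketched as below; the concept names and accuracies are illustrative values loosely drawn from Table 5:

```python
# Sketch: pick the k concepts with the lowest zero-shot accuracy as
# targets for Concept-Specific (CS) data collection.
def weakest_concepts(concept_acc, k=10):
    ranked = sorted(concept_acc.items(), key=lambda kv: kv[1])
    return [concept for concept, _ in ranked[:k]]

# Illustrative accuracies (percent), loosely based on Table 5.
concept_acc = {"Cones": 0.00, "Spheres": 5.88, "Polygons": 11.11, "Decimals": 70.0}
print(weakest_concepts(concept_acc, k=3))
```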
## 4 Related Work
#### Large Language Models for Mathematics.
Large Language Models (LLMs) such as GPT-3.5 and GPT-4 have exhibited promising capabilities in complex mathematical tasks. However, the proficiency of open-source alternatives like LLaMA (Touvron et al., 2023a) and LLaMA2 (Touvron et al., 2023b) remains notably inferior on these datasets, particularly in handling non-English problems. In contrast, models like Baichuan2 (Baichuan, 2023) and Qwen (Bai et al., 2023b) pretrained on multilingual datasets (i.e., Chinese and English) have achieved remarkable performance. Recently, many domain-specialized math language models have been proposed. For example, MetaMath (Yu et al., 2023) leverages the LLaMA2 models and finetunes on the constructed MetaMathQA dataset. MAmmoTH (Yue et al., 2023) synergizes Chain-of-Thought (CoT) and Program-of-Thought (PoT) rationales.
#### Mathematical Reasoning Benchmarks.
Recently, many mathematical datasets Roy and Roth (2015); Koncel-Kedziorski et al. (2015); Lu et al. (2023); Huang et al. (2016); Miao et al. (2020); Patel et al. (2021) have been proposed. For example, SingleOp (Roy et al., 2015) expands the scope to include more complex operations like multiplication and division. Math23k (Wang et al., 2017) gathers 23,161 problems labeled with structured equations and corresponding answers. GSM8K (Cobbe et al., 2021) is a widely used dataset that requires a sequence of elementary calculations with basic arithmetic operations.
#### Fine-Grained Benchmarks.
Traditional benchmarks focus on assessing certain abilities of models on one task Guo et al. (2023b); Wang et al. (2023a); Liu et al. (2020); Guo et al. (2022); Chai et al. (2024); Liu et al. (2024); Guo et al. (2024, 2023c); Bai et al. (2023a); Liu et al. (2022); Guo et al. (2023a); Bai et al. (2024); Liu et al. (2021) (e.g., reading comprehension (Rajpurkar et al., 2018), machine translation (Bojar et al., 2014), and summarization (Narayan et al., 2018)). For example, the GLUE benchmark (Wang et al., 2019) combines a collection of tasks and has witnessed superhuman model performance for pretrained models (Kenton and Toutanova, 2019; Radford et al., 2019). Hendrycks et al. (2021a) introduced MMLU, a benchmark with multiple-choice questions across 57 subjects including STEM, humanities, and social sciences, for assessing performance and identifying weaknesses. Srivastava et al. (2022) proposed BIG-bench with over 200 tasks. To enhance the mathematical capabilities of LLMs, we introduce ConceptMath, a comprehensive mathematical reasoning dataset designed to assess model performance across more than 200 diverse mathematical concepts in both Chinese and English.
## 5 Conclusion
We introduce ConceptMath, a new bilingual concept-wise math reasoning dataset for assessing models across a diverse set of concepts. First, ConceptMath covers more than 200 concepts across elementary and middle schools for the mainstream English and Chinese education systems. Second, we extensively evaluate existing LLMs with three prompting methods, which can guide further improvements of these LLMs' mathematical abilities. Third, we analyze contamination rates and error cases, and provide a simple and efficient fine-tuning strategy to enhance the identified weaknesses.
#### Limitations.
Human efforts are required to carefully design the hierarchical systems of mathematical concepts. In the future, we have three plans as follows: (1) Extend the input modality to multi-modalities. (2) Extend the education systems to high school and college levels. (3) Extend the reasoning abilities to more STEM fields.
## References
- Anthropic (2023) Anthropic. 2023. Model card and evaluations for claude models.
- Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv.
- Bai et al. (2023a) Jiaqi Bai, Hongcheng Guo, Jiaheng Liu, Jian Yang, Xinnian Liang, Zhao Yan, and Zhoujun Li. 2023a. Griprank: Bridging the gap between retrieval and generation via the generative knowledge improved passage ranking. CIKM.
- Bai et al. (2023b) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023b. Qwen technical report. arXiv preprint arXiv:2309.16609.
- Baichuan (2023) Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
- Bojar et al. (2014) Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
- Chai et al. (2024) Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, et al. 2024. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. arXiv preprint arXiv:2401.07037.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.
- Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
- Srivastava et al. (2022) Aarohi Srivastava et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
- Fritz et al. (2013) Annemarie Fritz, Antje Ehlert, and Lars Balzer. 2013. Development of mathematical concepts as basis for an elaborated mathematical understanding. South African Journal of Childhood Education, 3(1):38–67.
- Guo et al. (2022) Hongcheng Guo, Jiaheng Liu, Haoyang Huang, Jian Yang, Zhoujun Li, Dongdong Zhang, Zheng Cui, and Furu Wei. 2022. Lvp-m3: language-aware visual prompt for multilingual multimodal machine translation. EMNLP.
- Guo et al. (2023a) Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, and Zhoujun Li. 2023a. M2c: Towards automatic multimodal manga complement. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9876–9882.
- Guo et al. (2024) Hongcheng Guo, Jian Yang, Jiaheng Liu, Jiaqi Bai, Boyang Wang, Zhoujun Li, Tieqiao Zheng, Bo Zhang, Qi Tian, et al. 2024. Logformer: A pre-train and tuning pipeline for log anomaly detection. AAAI.
- Guo et al. (2023b) Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, et al. 2023b. Owl: A large language model for it operations. arXiv preprint arXiv:2309.09298.
- Guo et al. (2023c) Jinyang Guo, Jiaheng Liu, Zining Wang, Yuqing Ma, Ruihao Gong, Ke Xu, and Xianglong Liu. 2023c. Adaptive contrastive knowledge distillation for bert compression. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8941–8953.
- Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Huang et al. (2016) Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896.
- Kenton and Toutanova (2019) Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.
- Liu et al. (2024) Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, et al. 2024. E2-llm: Efficient and extreme length extension of large language models. arXiv preprint arXiv:2401.06951.
- Liu et al. (2021) Jiaheng Liu, Yudong Wu, Yichao Wu, Chuming Li, Xiaolin Hu, Ding Liang, and Mengyu Wang. 2021. Dam: discrepancy alignment metric for face recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3814–3823.
- Liu et al. (2022) Jiaheng Liu, Tan Yu, Hanyu Peng, Mingming Sun, and Ping Li. 2022. Cross-lingual cross-modal consolidation for effective multilingual video corpus moment retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1854–1862.
- Liu et al. (2020) Jiaheng Liu, Shunfeng Zhou, Yichao Wu, Ken Chen, Wanli Ouyang, and Dong Xu. 2020. Block proposal neural architecture search. IEEE Transactions on Image Processing, 30:15–25.
- Longpre et al. (2023) Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023. The flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
- Lu et al. (2023) Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In The Eleventh International Conference on Learning Representations.
- Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
- Megill and Wheeler (2019) Norman Megill and David A Wheeler. 2019. Metamath: a computer language for mathematical proofs. Lulu.com.
- Miao et al. (2020) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984.
- Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Paster et al. (2023) Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. 2023. Openwebmath: An open dataset of high-quality mathematical web text.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752.
- Roy et al. (2015) Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.
- Shi et al. (2023) Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2023. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789.
- Simon (2011) Martin A Simon. 2011. Studying mathematics conceptual learning: Student learning through their mathematical activity. North American Chapter of the International Group for the Psychology of Mathematics Education.
- Team (2023a) InternLM Team. 2023a. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM-techreport.
- Team (2023b) Yi Team. 2023b. Yi: Building the next generation of open-source and bilingual llms. https://github.com/01-ai/Yi.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 conference on empirical methods in natural language processing, pages 845–854.
- Wang et al. (2023a) Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhu Chen, Jie Fu, and Junran Peng. 2023a. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv: 2310.00746.
- Wang et al. (2023b) Zengzhi Wang, Rui Xia, and Pengfei Liu. 2023b. Generative ai for math: Part i – mathpile: A billion-token-scale pretraining corpus for math. arXiv preprint arXiv:2312.17120.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Ying et al. (2024) Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin. 2024. Internlm-math: Open math large language models toward verifiable reasoning.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
- Yue et al. (2023) Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv: 2309.05653.
- Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
## Appendix A Details on the ConceptMath
As shown in Table 7, Table 8, Table 17, and Table 9, we provide the details of the three-level hierarchical concept system of our ConceptMath for better illustration.
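For illustration, concept-wise accuracies over such a hierarchy can be aggregated at every level with a sketch like the following. The data layout, concept names, and helper name are hypothetical, not the released data format or evaluation code:

```python
from collections import defaultdict

def concept_accuracies(results):
    """Aggregate per-problem correctness into per-concept accuracies.

    `results` is a list of (concept_path, correct) pairs, where
    concept_path is a tuple such as ("Number", "Fractions", "Add fractions")
    following a three-level hierarchy (hypothetical layout).
    """
    totals = defaultdict(lambda: [0, 0])  # prefix -> [num correct, num total]
    for path, correct in results:
        # Credit every ancestor prefix, so accuracy is available
        # at each granularity of the hierarchy.
        for depth in range(1, len(path) + 1):
            stats = totals[path[:depth]]
            stats[0] += int(correct)
            stats[1] += 1
    return {prefix: c / n for prefix, (c, n) in totals.items()}

accs = concept_accuracies([
    (("Number", "Fractions", "Add fractions"), True),
    (("Number", "Fractions", "Add fractions"), False),
    (("Geometry", "Cylinders", "Volume"), True),
])
print(accs[("Number", "Fractions")])  # 0.5
```

Because every prefix of a concept path is credited, the same pass yields both coarse (level-1) and fine (level-3) accuracies.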
[Figure: vertical bar chart of mean accuracy (y-axis 0-80) per concept, with concepts sorted in ascending order of accuracy, from roughly 42% (Proportional, Cones, Cylinders, Estimation & rounding) up to roughly 85% (Add, Variable expr, Light & heavy).]
Figure 9: Mean concept accuracies of Elementary-EN.
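Weak concepts such as those at the low end of Fig. 9 can also be surfaced programmatically. A minimal sketch, assuming a plain concept-to-accuracy mapping rather than the released evaluation code:

```python
def weakest_concepts(acc_by_concept, k=5):
    """Return the k concepts with the lowest mean accuracy."""
    return sorted(acc_by_concept, key=acc_by_concept.get)[:k]

# Toy values loosely based on Fig. 9 (illustrative only).
accs = {"Proportional": 0.42, "Cones": 0.43, "Add": 0.83, "Light & heavy": 0.85}
print(weakest_concepts(accs, k=2))  # ['Proportional', 'Cones']
```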
[Figure: vertical bar chart of mean accuracy per concept, with Chinese concept labels on the x-axis, sorted in ascending order from roughly 22% for the hardest concepts to roughly 78% for the easiest.]
Figure 10: Mean concept accuracies of Elementary-ZH.
[Figure: line chart of per-concept accuracies of MetaMath-13B (blue), LLaMA2-70B (orange), and GPT-4 (green); GPT-4 is the most stable and rarely drops below 80%, while MetaMath-13B is highly variable and falls to 0% on Probability.]
Figure 11: Concept accuracies on Elementary-EN.
[Figure: line chart of per-concept accuracies of MetaMath-13B (blue), LLaMA2-70B (orange), and GPT-4 (green) on Chinese-labeled geometry concepts (e.g., triangles, parallelograms); GPT-4 is the most consistent, while the other two models drop to 0% on several concepts.]
Figure 12: Concept accuracies on Elementary-ZH.
[Figure: line chart of per-concept accuracies of Yi-6B (blue), ChatGLM3-6B (orange), LLaMA2-7B (green), and DeepSeekMath-7B (red); DeepSeekMath-7B is the most stable, while LLaMA2-7B shows extreme lows such as 0% on Subtraction.]
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Graph: Accuracy of Different Math Models Across Various Topics
### Overview
The image is a line graph comparing the accuracy of four mathematical models (InternLM2-Math-7B, InternLM2-7B, MAmmoTH-13B, and WizardMath-13B) across 30 distinct math topics. Accuracy is measured on a y-axis (0–100%), while the x-axis lists topics like "Angles," "Area," "Classifying & sorting," and "Volume." The graph shows significant variability in performance across models and topics.
---
### Components/Axes
- **Legend**: Located at the top, with four entries:
- **Blue (solid line with circles)**: InternLM2-Math-7B
- **Orange (dashed line with squares)**: InternLM2-7B
- **Green (solid line with triangles)**: MAmmoTH-13B
- **Red (dashed line with diamonds)**: WizardMath-13B
- **X-axis**: Labeled "Accuracy" with topics listed sequentially (e.g., "Angles," "Area," "Classifying & sorting," ..., "Volume").
- **Y-axis**: Labeled "Accuracy" with increments of 20 (0–100%).
---
### Detailed Analysis
1. **InternLM2-Math-7B (Blue)**:
- Starts at ~80% for "Angles," dips to ~60% for "Area," and fluctuates between 50–90%.
- Peaks at ~90% for "Cylinders" and "Estimation & rounding."
- Ends at ~70% for "Volume."
2. **InternLM2-7B (Orange)**:
- Begins at ~80% for "Angles," drops to ~40% for "Area," and oscillates between 40–90%.
- Peaks at ~95% for "Cylinders" and "Estimation & rounding."
- Ends at ~85% for "Volume."
3. **MAmmoTH-13B (Green)**:
- Starts at ~20% for "Angles," rises to ~80% for "Area," and stabilizes between 60–85%.
- Peaks at ~90% for "Light & heavy" and "Mixed operations."
- Ends at ~65% for "Volume."
4. **WizardMath-13B (Red)**:
- Begins at ~20% for "Angles," spikes to ~60% for "Area," and fluctuates wildly between 10–70%.
- Sharp drops to ~10% for "Subtraction" and "Proportionality."
- Ends at ~20% for "Volume."
---
### Key Observations
- **WizardMath-13B (Red)** exhibits the most erratic performance, with extreme lows (e.g., ~10% for "Subtraction") and highs (~70% for "Area").
- **InternLM2-Math-7B (Blue)** and **InternLM2-7B (Orange)** show similar trends but with InternLM2-7B achieving higher peaks (e.g., ~95% for "Cylinders").
- **MAmmoTH-13B (Green)** demonstrates relative stability, with fewer extreme dips compared to other models.
- **Lowest Performance**: WizardMath-13B underperforms in "Subtraction" (~10%) and "Proportionality" (~15%).
- **Highest Performance**: InternLM2-7B excels in "Cylinders" (~95%) and "Estimation & rounding" (~90%).
---
### Interpretation
The data suggests that model performance varies significantly by topic and architecture:
1. **Model Size vs. Performance**: Larger models (e.g., MAmmoTH-13B, WizardMath-13B) do not consistently outperform smaller models (e.g., InternLM2-7B) across all topics.
2. **Topic-Specific Strengths**:
- InternLM2-7B performs best on topics such as "Cylinders" and "Estimation & rounding."
- WizardMath-13B struggles with basic concepts such as "Subtraction" and "Proportionality."
3. **Stability**: MAmmoTH-13B shows the least variability, suggesting robustness in handling diverse topics.
4. **Anomalies**: WizardMath-13B’s extreme lows (e.g., ~10% for "Subtraction") indicate potential weaknesses in specific problem types.
The graph highlights the importance of model specialization and the need for targeted improvements in underperforming areas.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Graph: Accuracy Comparison of Math Models on Various Problems
### Overview
The image is a multi-line graph comparing the accuracy performance of four large language models (LLMs) across 40 math problem categories. The models compared are Baichuan2-13B (blue), LLaMA2-13B (orange), Qwen-14B (green), and InternLM2-Math-20B (red). Accuracy percentages range from 0-100% on the y-axis, with math topics listed sequentially on the x-axis.
### Components/Axes
- **Legend**: Top-left corner, mapping colors to models:
- Blue: Baichuan2-13B
- Orange: LLaMA2-13B
- Green: Qwen-14B
- Red: InternLM2-Math-20B
- **X-axis**: "Math Problems" with 25 labeled categories (e.g., Angles, Area, Circles, Classifying & sorting, Coin names & value, Coordinate plane, Cubes, Decimals, Estimation & rounding, Fractions, Light & heavy, Mixed operations, Multiple operations, Numerical expressions, Patterns, Perimeter, Place value, Powers, Rational number, Spheres, Subtraction, Time, Triangles, Variable expressions, Volume of 3d shapes, Add, Compare, Count, Division, Equations, Length, Statistics, Percent, Polygons, Probability, Proportional, Quadrilaterals, Ratio, Temperature, Volume).
- **Y-axis**: "Accuracy (%)" with ticks at 0, 20, 40, 60, 80, 100.
- **Lines**: Four colored lines representing model performance across topics.
### Detailed Analysis
1. **InternLM2-Math-20B (Red Line)**:
- Consistently the highest performer, peaking at 95%+ in categories such as Coordinate plane, Fractions, and Time.
- Shows minor dips but maintains >70% accuracy in all categories.
2. **Baichuan2-13B (Blue Line)**:
- Second-highest performer overall, with peaks near 90% (e.g., Coordinate plane, Fractions).
- More volatile than InternLM2, with sharper drops (e.g., 60% in Decimals, 50% in Estimation & rounding).
- Strong in geometry topics (Angles, Area, Perimeter).
3. **LLaMA2-13B (Orange Line)**:
- Most variable performance, with extreme lows (e.g., 0% in Coordinate plane, 5% in Subtraction).
- Strong in algebraic topics (Equations, Variable expressions) with peaks near 80%.
- Weaknesses in spatial reasoning (e.g., Coordinate plane).
4. **Qwen-14B (Green Line)**:
- Moderate performance, averaging 60-70%.
- Peaks in algebraic topics (Equations, Variables) at ~75%.
- Notable dip to 15% in Place value, recovery in Statistics (~60%).
### Key Observations
- **Outliers**:
- LLaMA2-13B: 0% accuracy in Coordinate plane (potential data error or model weakness).
- Qwen-14B: 15% in Place value (significant drop).
- **Trends**:
- InternLM2-Math-20B dominates in geometry and arithmetic (Angles, Fractions, Time).
- LLaMA2-13B in particular struggles with spatial reasoning (0% on Coordinate plane), while InternLM2-Math-20B and Baichuan2-13B peak there.
- Algebraic topics (Equations, Variable expressions) show higher performance across models.
### Interpretation
The data suggests **InternLM2-Math-20B** is the most robust model for math problem-solving, likely due to specialized training on mathematical reasoning. Its consistent performance across diverse topics indicates strong generalization. **LLaMA2-13B** exhibits the most variability, with critical failures in spatial reasoning (Coordinate plane) but strengths in algebraic manipulation. **Qwen-14B** and **Baichuan2-13B** show mid-tier performance, with Qwen excelling in algebraic topics and Baichuan2 performing well in geometry. The anomalies (e.g., LLaMA2's 0% in Coordinate plane) highlight potential gaps in model training data or architectural limitations for specific problem types. This comparison underscores the importance of model specialization for domain-specific tasks like mathematics.
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description
## Line Graph: Model Accuracy Across Tasks
### Overview
The image is a line graph comparing the accuracy of four AI models (InternLM2-20B, Yi-34B, Qwen-72B, GPT-3.5) across roughly 40 distinct tasks. The x-axis lists tasks (e.g., "Angles," "Area," "Classifying & sorting"), while the y-axis represents accuracy as a percentage from 20 to 100. Four colored lines (blue, orange, green, red) correspond to the models, with the legend positioned at the top.
### Components/Axes
- **X-axis**: Task categories (e.g., "Angles," "Area," "Classifying & sorting," "Coordinate plane," "Cubes," "Cylinders," "Decimals," "Estimation & rounding," "Fractions," "Light & heavy," "Mixed operations," "Multiple expressions," "Numerical exprs," "Patterns," "Perimeter," "Place value," "Powers," "Rational number," "Spheres," "Subtraction," "Time," "Triangles," "Variable exprs," "Volume of 3d shapes," "Add," "Compare," "Count," "Division," "Equations," "Length," "Statistics," "Percentages," "Polygons," "Probability," "Proportional," "Proportional 3d shapes," "Ratio," "Temperature," "Volume").
- **Y-axis**: Accuracy (20–100, increments of 10).
- **Legend**:
- Blue: InternLM2-20B
- Orange: Yi-34B
- Green: Qwen-72B
- Red: GPT-3.5
### Detailed Analysis
- **GPT-3.5 (Red Line)**:
- Consistently the highest-performing model, with peaks reaching 100% in tasks like "Multiple expressions" and "Compare."
- Notable dips in "Angles" (~60%) and "Proportional 3d shapes" (~55%).
- Average accuracy: ~85–95% across most tasks.
- **Qwen-72B (Green Line)**:
- Strong performance in "Multiple expressions" (~95%) and "Compare" (~90%).
- Significant drops in "Angles" (~45%) and "Proportional 3d shapes" (~60%).
- Average accuracy: ~75–90%.
- **Yi-34B (Orange Line)**:
- Peaks at ~95% in "Multiple expressions" and "Compare."
- Low points in "Angles" (~50%) and "Proportional 3d shapes" (~65%).
- Average accuracy: ~70–85%.
- **InternLM2-20B (Blue Line)**:
- Lowest overall performance, with a sharp drop to ~25% in "Angles."
- Peaks at ~70% in "Multiple expressions" and "Compare."
- Average accuracy: ~40–70%.
### Key Observations
1. **GPT-3.5 Dominance**: The red line (GPT-3.5) consistently outperforms others, with the highest peaks and fewest dips.
2. **Task-Specific Variability**:
- "Angles" is the weakest task for all models, with InternLM2-20B (blue) at ~25% and GPT-3.5 (red) at ~60%.
- "Multiple expressions" and "Compare" are the strongest tasks, with all models achieving 80–100% accuracy.
3. **Model-Specific Trends**:
- **InternLM2-20B (Blue)**: Most erratic performance, with extreme lows (e.g., "Angles") and moderate highs.
- **Yi-34B (Orange)**: Moderate variability, with mid-range accuracy across most tasks.
- **Qwen-72B (Green)**: Strong in complex tasks but struggles with basic geometry ("Angles").
- **GPT-3.5 (Red)**: Most consistent, with minimal dips and high peaks.
### Interpretation
The data suggests that GPT-3.5 (red) is the most robust model, excelling in both complex and basic tasks. Qwen-72B (green) and Yi-34B (orange) show task-specific strengths but lag behind GPT-3.5 in consistency. InternLM2-20B (blue) underperforms significantly, particularly in foundational tasks like "Angles." The graph highlights the importance of model architecture and training data in handling diverse computational challenges. Outliers like the blue line's 25% accuracy in "Angles" indicate potential limitations in specific domains, while the red line's 100% peaks in "Multiple expressions" underscore its advanced capabilities in symbolic reasoning.
</details>
Figure 13: Concept accuracies on Elementary-EN of more models.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy Across Math Topics
### Overview
The chart compares the accuracy of four AI models (Yi-6B, ChatGLM3-6B, LLaMA2-7B, DeepSeekMath-7B) across 30+ math topics. Accuracy is measured on a 0–100% scale, with each model represented by a distinct colored line. The x-axis lists math topics, while the y-axis shows accuracy percentages.
### Components/Axes
- **Legend**: Top-left corner, with four entries:
- **Yi-6B** (blue line)
- **ChatGLM3-6B** (orange line)
- **LLaMA2-7B** (green line)
- **DeepSeekMath-7B** (red line)
- **X-axis**: Labeled "Math Topics," listing 30+ categories (e.g., "Add & subtract," "Probability & statistics," "Geometry & range").
- **Y-axis**: Labeled "Accuracy," with ticks at 0, 20, 40, 60, 80, 100.
### Detailed Analysis
1. **Yi-6B (Blue)**:
- Peaks at ~95% in "Probability & statistics" and "Geometry & range."
- Dips below 40% in "Linear equations" and "Nonlinear functions."
- Average accuracy: ~65–75% across most topics.
2. **ChatGLM3-6B (Orange)**:
- Strong performance in "Exponents & scientific notation" (~85%).
- Struggles in "Linear equations" (~30%) and "Systems of equations" (~40%).
- Average accuracy: ~55–70%.
3. **LLaMA2-7B (Green)**:
- Consistently mid-range (40–60%) across most topics.
- Peaks at ~70% in "Probability & statistics" and "Geometry & range."
- Lowest accuracy: ~10% in "Linear equations."
4. **DeepSeekMath-7B (Red)**:
- Highest overall accuracy (~85–95%) in "Probability & statistics," "Geometry & range," and "Exponents & scientific notation."
- Dips below 40% in "Linear equations" and "Nonlinear functions."
- Average accuracy: ~70–85%.
### Key Observations
- **Outliers**:
- LLaMA2-7B (green) has the lowest accuracy in "Linear equations" (~10%).
- DeepSeekMath-7B (red) achieves the highest accuracy in "Probability & statistics" (~95%).
- **Trends**:
- Yi-6B and DeepSeekMath-7B show the most variability, with sharp peaks and troughs.
- ChatGLM3-6B and LLaMA2-7B exhibit more stable but lower performance.
### Interpretation
The data highlights model-specific strengths and weaknesses:
- **DeepSeekMath-7B** excels in advanced topics like probability and geometry, suggesting robust training in these areas.
- **LLaMA2-7B** underperforms in linear equations, indicating potential gaps in foundational math training.
- **Yi-6B** and **ChatGLM3-6B** show mixed results, with Yi-6B performing better in high-variability topics and ChatGLM3-6B struggling in linear systems.
The chart underscores the importance of model selection based on the target math domain. For example, DeepSeekMath-7B would be preferable for probability tasks, while LLaMA2-7B might be avoided for linear equations.
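Choosing a model per target domain, as suggested above, amounts to an argmax over a concept-accuracy table. A minimal sketch with hypothetical numbers (only loosely inspired by the description, not read from the chart):

```python
# Hypothetical concept-wise accuracies (%); illustrative, not the figure's data.
accuracy = {
    "Probability & statistics": {"DeepSeekMath-7B": 95, "LLaMA2-7B": 70, "Yi-6B": 90},
    "Linear equations":         {"DeepSeekMath-7B": 38, "LLaMA2-7B": 10, "Yi-6B": 35},
}

def best_model(concept):
    """Pick the model with the highest accuracy on a given concept."""
    by_model = accuracy[concept]
    return max(by_model, key=by_model.get)

print(best_model("Probability & statistics"))
print(best_model("Linear equations"))
```

Note that the best model on a concept can still be weak in absolute terms (38% on "Linear equations" here), so the argmax should be read together with the raw accuracy.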
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
## Line Graph: Model Accuracy Comparison Across Math Topics
### Overview
The image is a multi-line graph comparing the accuracy of four AI models (InternLM2-Math-7B, InternLM2-7B, MAmmoTH-13B, WizardMath-13B) across 30+ math-related topics. Accuracy is measured on a 0-100% scale, with notable fluctuations across topics.
### Components/Axes
- **X-axis**: Math topics (e.g., "Add & subtract," "Congruence & similarity," "Probability of simple events")
- **Y-axis**: Accuracy percentage (0-100, increments of 20)
- **Legend**: Top-left corner, color-coded:
- Blue: InternLM2-Math-7B
- Orange: InternLM2-7B
- Green: MAmmoTH-13B
- Red: WizardMath-13B
### Detailed Analysis
1. **InternLM2-Math-7B (Blue)**:
- Consistently highest performer overall
- Peaks at 95% in "Prime factorization" and "Polynomials"
- Lowest point at 35% in "Radical expressions"
- Average accuracy: ~65%
2. **InternLM2-7B (Orange)**:
- Most erratic performance
- Peaks at 70% in "Linear equations"
- Drops to 5% in "Radical expressions"
- Average accuracy: ~35%
3. **MAmmoTH-13B (Green)**:
- High variability with extreme peaks/troughs
- Reaches 90% in "Exponents & logarithms"
- Drops to 20% in "Probability of simple events"
- Average accuracy: ~55%
4. **WizardMath-13B (Red)**:
- Most volatile performance
- Spikes to 85% in "Square roots & cube roots"
- Plummets to 0% in "Radical expressions"
- Average accuracy: ~40%
### Key Observations
- **Consistency**: InternLM2-Math-7B shows the most stable performance (standard deviation ~15%)
- **Specialization**: All models struggle with "Radical expressions" (all ≤35%)
- **Overperformance**: MAmmoTH-13B and WizardMath-13B show disproportionate peaks in "Probability" topics (up to 80%)
- **Baseline**: InternLM2-7B underperforms the math-specialized InternLM2-Math-7B and the larger 13B models across most topics
### Interpretation
The data suggests:
1. **Model Specialization**: InternLM2-Math-7B's architecture is optimized for math tasks, evidenced by its consistent performance across diverse topics.
2. **Specialization over Scale**: The general-purpose InternLM2-7B trails the equally sized but math-tuned InternLM2-Math-7B, particularly in complex topics.
3. **Overfitting Risks**: MAmmoTH-13B and WizardMath-13B show extreme variability, indicating potential overfitting to specific problem types.
4. **Knowledge Gaps**: All models struggle with radical expressions, suggesting a common limitation in current math AI systems.
The graph reveals tradeoffs between model size, specialization, and generalization capabilities in mathematical reasoning tasks.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy Across Mathematical Domains
### Overview
The chart compares the accuracy performance of four large language models (LLMs) across 30+ mathematical domains. Models include Baichuan2-13B (blue), LLaMA2-13B (orange), Qwen-14B (green), and InternLM2-Math-20B (red). Accuracy is measured on a 0-100% scale, with notable volatility in performance across different mathematical topics.
### Components/Axes
- **X-axis**: Mathematical domains (e.g., "Add & subtract," "Probability & statistics," "Polynomials")
- **Y-axis**: Accuracy percentage (0-100, increments of 20)
- **Legend**: Top-left corner with color-coded model identifiers
- **Data series**: Four distinct lines with markers (circles for Baichuan2-13B, squares for LLaMA2-13B, diamonds for Qwen-14B, and triangles for InternLM2-Math-20B)
### Detailed Analysis
1. **Baichuan2-13B (Blue)**:
- Average accuracy: ~65-75%
- Notable peaks: "Exponents & scientific notation" (~85%), "Linear equations" (~80%)
- Lowest performance: "Linear equations" (~50%)
2. **LLaMA2-13B (Orange)**:
- Highest peak: "Prime factorization" (100%)
- Sharp troughs: "Linear equations" (~10%), "Nonlinear functions" (~20%)
- Average accuracy: ~50-60%
3. **Qwen-14B (Green)**:
- Most consistent performance: ~40-50% across most domains
- Lowest point: "Linear equations" (~0%)
- Peaks: "Probability & statistics" (~60%), "Geometry" (~70%)
4. **InternLM2-Math-20B (Red)**:
- Highest overall accuracy: ~80-90% in most domains
- Peaks: "Probability & statistics" (~95%), "Polynomials" (~90%)
- Lowest performance: "Linear equations" (~60%)
### Key Observations
- **InternLM2-Math-20B** consistently outperforms others, particularly in advanced domains like "Probability & statistics" and "Polynomials."
- **Qwen-14B** shows the most significant drop in "Linear equations" (near 0% accuracy).
- **LLaMA2-13B** exhibits extreme volatility, with 100% accuracy in "Prime factorization" but near-zero in "Linear equations."
- **Baichuan2-13B** demonstrates moderate performance with fewer extreme fluctuations.
### Interpretation
The data suggests that InternLM2-Math-20B is optimized for mathematical reasoning, likely due to specialized training on mathematical datasets. Qwen-14B's poor performance in "Linear equations" may indicate a lack of focus on foundational algebraic concepts. LLaMA2-13B's volatility suggests inconsistent generalization across mathematical domains, while Baichuan2-13B shows balanced but suboptimal performance. The stark contrast in "Linear equations" accuracy across models highlights potential gaps in foundational mathematical training for some LLMs.
</details>
<details>
<summary>x23.png Details</summary>

### Visual Description
## Line Graph: Model Accuracy Comparison Across Math Tasks
### Overview
The image is a multi-line graph comparing the accuracy of four AI models (InternLM2-20B, Yi-34B, Qwen-72B, GPT-3.5) across 30+ math-related tasks. Accuracy is measured on a 0-100% scale, with tasks spanning arithmetic, algebra, geometry, and advanced mathematics.
### Components/Axes
- **X-axis**: Math tasks (e.g., "Add & subtract," "Arithmetic sequences," "Exponents & scientific notation," "Variable exprs")
- **Y-axis**: Accuracy percentage (0-100%, increments of 20)
- **Legend**: Top-right corner, color-coded:
- Blue: InternLM2-20B
- Orange: Yi-34B
- Green: Qwen-72B
- Red: GPT-3.5
### Detailed Analysis
1. **GPT-3.5 (Red Line)**:
- Consistently highest accuracy (75-100% range)
- Peaks at 100% for tasks like "Prime factorization" and "Polynomials"
- Slight dips below 90% for "Probability & variability" and "Radical expressions"
2. **Qwen-72B (Green Line)**:
- Second-highest performance (60-95% range)
- Matches GPT-3.5 in "Geometry & range" and "Nonlinear functions"
- Struggles with "Probability of simple events" (60%) and "Surface area & volume" (70%)
3. **Yi-34B (Orange Line)**:
- Third performer (50-85% range)
- Excels in "Exponents & scientific notation" (90%) and "Linear equations" (85%)
- Weaknesses: "Probability & variability" (50%) and "Radical expressions" (65%)
4. **InternLM2-20B (Blue Line)**:
- Lowest performance (15-70% range)
- Strong in "Add & subtract" (60%) and "Arithmetic sequences" (70%)
- Severe drops in "Probability & variability" (15%) and "Variable exprs" (30%)
### Key Observations
- **GPT-3.5 dominance**: Outperforms all other models in 22/30 tasks, with 100% accuracy in 5 tasks
- **Size vs. performance**: Larger models generally outperform smaller ones, yet Yi-34B (34B params) trails the much larger Qwen-72B (72B params) in only 14 tasks
- **Task complexity correlation**: Most models' accuracy drops in advanced topics (e.g., "Probability & variability" vs. "Add & subtract")
- **Consistency**: GPT-3.5 shows least variance (SD ~5%), while InternLM2-20B has highest volatility (SD ~25%)
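The consistency claim above is just the dispersion of per-task accuracies. A small sketch (with made-up accuracy lists, not the figure's data) of the mean and standard-deviation comparison:

```python
import statistics

# Hypothetical per-task accuracies (%); illustrative only, not read from the figure.
gpt35 = [95, 90, 100, 88, 92, 97]
internlm2_20b = [25, 60, 70, 40, 55, 30]

def consistency(accs):
    """Mean accuracy and population standard deviation across tasks."""
    return statistics.mean(accs), statistics.pstdev(accs)

mean_a, sd_a = consistency(gpt35)
mean_b, sd_b = consistency(internlm2_20b)
# A lower SD means more even performance across tasks, which is the sense
# in which one model is called "most consistent" above.
print(f"GPT-3.5: mean={mean_a:.1f}, sd={sd_a:.1f}")
print(f"InternLM2-20B: mean={mean_b:.1f}, sd={sd_b:.1f}")
```

Reporting mean and SD together is exactly what ConceptMath's concept-wise view adds over a single average-accuracy number.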
### Interpretation
The graph reveals GPT-3.5's superior mathematical reasoning capabilities, likely due to its specialized training or architecture. Qwen-72B and Yi-34B demonstrate comparable performance despite size differences, suggesting parameter count isn't the sole determinant of math proficiency. InternLM2-20B's significant underperformance in complex tasks highlights potential limitations in handling abstract mathematical concepts. The data suggests model architecture and training data quality may be more critical than size alone for mathematical reasoning tasks.
</details>
Figure 14: Concept accuracies on Middle-EN of more models.
<details>
<summary>x24.png Details</summary>

### Visual Description
## Line Graph: Model Accuracy Comparison Across Tasks
### Overview
The image displays a line graph comparing the accuracy performance of four AI models (Yi-6B, ChatGLM3-6B, LLaMA2-7B, DeepSeekMath-7B) across 40 distinct tasks. The graph shows significant variability in performance across different tasks, with no clear dominant model across all categories.
### Components/Axes
- **X-axis**: 40 task categories labeled in Chinese (see full list below)
- **Y-axis**: Accuracy percentage (0-100)
- **Legend**: Located at top-right, with four color-coded lines:
- Blue: Yi-6B
- Orange: ChatGLM3-6B
- Green: LLaMA2-7B
- Red: DeepSeekMath-7B
### Task Categories (X-axis)
The x-axis lists 40 Chinese-labeled categories. The legible labels include: 三角形 (triangles), 风, 平行四边形 (parallelograms), 锐角 (acute angles), 平面图形综合 (composite plane figures), 立方体 (cubes), 角 (angles), 长方体 (cuboids), 圆柱 (cylinders), 圆锥 (cones), and 立体图形 (solid figures); the remaining labels are garbled in the source.
### Detailed Analysis
1. **DeepSeekMath-7B (Red line)**:
- Highest peaks (up to ~95% accuracy)
- Most frequent extreme values (both high and low)
- Notable spikes in tasks: 三角形 (triangles), 平行四边形 (parallelograms), 长方体 (cuboids), 圆锥 (cones), 立体图形 (solid figures)
- Sharp drops in several of the later, garbled-label categories
2. **Yi-6B (Blue line)**:
- Moderate performance (30-70% range)
- Consistent mid-range values
- Peaks in 立体图形 (solid figures) and 圆锥 (cones)
- Lowest values in the later, garbled-label categories
3. **ChatGLM3-6B (Orange line)**:
- Similar pattern to Yi-6B but slightly higher peaks
- Strong performance in 立体图形 (solid figures) and 圆锥 (cones)
- Dips in the later, garbled-label categories
4. **LLaMA2-7B (Green line)**:
- Most stable performance (10-40% range)
- Rarely exceeds 40% accuracy
- Minimal fluctuations across all tasks
- Consistently lowest values in the later, garbled-label categories
### Key Observations
- DeepSeekMath-7B demonstrates the highest potential accuracy but with significant task-specific variability
- LLaMA2-7B shows the most consistent but lowest performance across all tasks
- Yi-6B and ChatGLM3-6B exhibit intermediate performance with moderate variability
- The later, garbled-label categories consistently show the lowest performance across all models
### Interpretation
The graph reveals that model performance is highly task-dependent. DeepSeekMath-7B appears to excel at geometric tasks (三角形 (triangles), 长方体 (cuboids), 圆锥 (cones)) but falls off sharply elsewhere. LLaMA2-7B's consistently low performance suggests potential limitations in handling complex spatial reasoning tasks. The stark contrast between model performances indicates that no single model dominates across all task types, highlighting the importance of model selection based on specific use cases. The extreme fluctuations in DeepSeekMath-7B's performance suggest possible overfitting to certain task categories or data quality issues in those domains.
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy Across Question Categories
### Overview
The image is a line chart comparing the accuracy of four AI models across 30+ question categories (x-axis) measured in percentage (y-axis). Four distinct lines represent different models, with significant variability in performance across categories.
### Components/Axes
- **X-axis**: Labeled with Chinese characters representing question categories (e.g., "三角形面积" (triangle area), "平行四边形性质" (properties of parallelograms), "长方形周长" (rectangle perimeter)). Categories are densely packed and not translated in the chart.
- **Y-axis**: Labeled "Accuracy" with a scale from 0 to 100, marked at 20-unit intervals.
- **Legend**: Positioned at the top-right, mapping colors to models:
- **Blue**: InternLM2-Math-7B
- **Orange**: InternLM2-7B
- **Green**: MAmmoTH-13B
- **Red**: WizardMath-13B
### Detailed Analysis
1. **InternLM2-Math-7B (Blue Line)**:
- **Trend**: Dominates with the highest peaks (up to ~90%) and the most consistent performance.
- **Key Data Points**: Peaks at ~90% in categories such as "三角形面积" (triangle area) and "长方形周长" (rectangle perimeter); rarely dips below 60%.
2. **InternLM2-7B (Orange Line)**:
- **Trend**: Second-highest performance, peaking at ~85% but with sharper fluctuations.
- **Key Data Points**: Drops to ~30% on its weakest categories.
3. **MAmmoTH-13B (Green Line)**:
- **Trend**: Moderate performance, peaking at ~80% but with significant dips.
- **Key Data Points**: Drops to ~20% on its weakest categories.
4. **WizardMath-13B (Red Line)**:
- **Trend**: Lowest performance, peaking at ~40% with erratic fluctuations.
- **Key Data Points**: Drops to ~0% in several categories.
### Key Observations
- **Performance Variability**: All models show category-specific strengths/weaknesses. For example:
- InternLM2-Math-7B excels in geometry-related categories (e.g., "三角形面积" (triangle area)).
- WizardMath-13B struggles with most categories, suggesting limited training in these areas.
- **Model Specialization**: InternLM2-Math-7B’s consistent high performance implies optimization for mathematical reasoning, while others may lack specialization.
- **Outliers**: The red line (WizardMath-13B) has the most erratic pattern, with sharp drops to 0% in multiple categories.
### Interpretation
The chart highlights that model performance is highly dependent on question category. InternLM2-Math-7B’s dominance suggests it was specifically trained for mathematical tasks, while other models (e.g., WizardMath-13B) may have been fine-tuned for narrower domains. The variability underscores the importance of domain-specific training in AI systems. The red line’s extreme fluctuations could indicate overfitting or insufficient data for certain categories.
</details>
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Graph: Model Accuracy Across Categories
### Overview
The image is a line graph comparing the accuracy of four AI models (Baichuan2-13B, LLaMA2-13B, Qwen-14B, InternLM2-Math-20B) across multiple categories. The x-axis contains Chinese text labels (likely categories or topics), and the y-axis represents accuracy as a percentage from 0 to 100. The graph shows significant fluctuations in accuracy for all models, with sharp peaks and troughs.
### Components/Axes
- **X-axis**: Chinese text labels representing math categories or topics (e.g., "三角形" (triangles), "四边形" (quadrilaterals), "圆形" (circles)).
- **Y-axis**: Labeled "Accuracy" with a scale from 0 to 100.
- **Legend**: Located in the top-left corner, with four colored lines:
- **Blue**: Baichuan2-13B
- **Orange**: LLaMA2-13B
- **Green**: Qwen-14B
- **Red**: InternLM2-Math-20B
### Detailed Analysis
- **Baichuan2-13B (Blue)**:
- Peaks at ~80-90% accuracy in some categories (e.g., "长方体" (cuboids), "立体图形" (solid figures)).
- Drops to ~20-30% in others.
- Average accuracy ~50-60%.
- **LLaMA2-13B (Orange)**:
- Consistently the lowest performer, with accuracy often near 0%.
- Peaks at ~40-50% in a few categories.
- Average accuracy ~20-30%.
- **Qwen-14B (Green)**:
- Peaks at ~70-80% in some categories.
- Drops to ~10-20% in others.
- Average accuracy ~40-50%.
- **InternLM2-Math-20B (Red)**:
- Highest overall performance, with peaks near ~100% in some categories.
- Drops to ~40-50% in others.
- Average accuracy ~60-70%.
### Key Observations
1. **InternLM2-Math-20B (Red)** consistently outperforms other models, achieving the highest accuracy in most categories.
2. **LLaMA2-13B (Orange)** shows the most erratic performance, with frequent drops to near-zero accuracy.
3. **Baichuan2-13B (Blue)** and **Qwen-14B (Green)** exhibit moderate performance, with significant variability depending on the category.
4. **Accuracy fluctuations** suggest that model performance is highly dependent on the specific category or task being evaluated.
### Interpretation
The data indicates that **InternLM2-Math-20B** is the most robust model across the tested categories, likely due to specialized training on mathematical tasks. **LLaMA2-13B**'s poor performance in many categories suggests limitations in handling certain types of problems. The variability in accuracy across models highlights the importance of model selection based on the specific application or domain. The Chinese category labels (e.g., "三角形" (triangles), "四边形" (quadrilaterals)) represent geometric shapes and related mathematical concepts.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Graph: Model Accuracy Comparison Across Tasks
### Overview
The image is a multi-line graph comparing the accuracy performance of four AI models (InternLM2-20B, Yi-34B, Qwen-72B, GPT-3.5) across 40+ Chinese-named tasks. The y-axis shows accuracy percentages (0-100), while the x-axis lists tasks in Chinese characters. The graph shows significant variability in performance across different tasks and models.
### Components/Axes
- **Legend**: Top-left corner with color-coded labels:
- Blue: InternLM2-20B
- Orange: Yi-34B
- Green: Qwen-72B
- Red: GPT-3.5
- **Y-axis**: "Accuracy" (0-100 scale)
- **X-axis**: Tasks labeled in Chinese (e.g., 三角形 (triangles), 四边形 (quadrilaterals), 立体图形 (solid figures), 机器学习 (machine learning))
- **Data Points**: Discrete markers connected by lines for each model
### Detailed Analysis
Key task-specific accuracy observations (approximate values with uncertainty):
1. **三角形 (Triangle)**:
- InternLM2-20B: ~75
- Yi-34B: ~80
- Qwen-72B: ~85
- GPT-3.5: ~55
2. **四边形 (Quadrilateral)**:
- InternLM2-20B: ~65
- Yi-34B: ~70
- Qwen-72B: ~75
- GPT-3.5: ~60
3. **立体图形 (3D Shapes)**:
- InternLM2-20B: ~80
- Yi-34B: ~75
- Qwen-72B: ~85
- GPT-3.5: ~65
4. **机器学习 (Machine Learning)**:
- InternLM2-20B: ~40
- Yi-34B: ~0 (data point missing)
- Qwen-72B: ~70
- GPT-3.5: ~40
5. **自然语言处理 (NLP)**:
- InternLM2-20B: ~60
- Yi-34B: ~55
- Qwen-72B: ~75
- GPT-3.5: ~65
*(Full task list available in original image)*
### Key Observations
1. **Qwen-72B Dominance**: Consistently the highest accuracy across most tasks (e.g., ~85% in 立体图形 (3D shapes))
2. **Yi-34B Anomaly**: Near-zero accuracy on the 机器学习 (machine learning) task (potential data error or model weakness)
3. **GPT-3.5 Variability**: Significant dips for 机器学习 (~40) and 自然语言处理 (NLP, ~65)
4. **InternLM2-20B**: Moderate performance with notable lows in 机器学习 (~40) and 自然语言处理 (~60)
5. **Task-Specific Performance**:
- Geometry tasks (三角形 (triangles), 四边形 (quadrilaterals)) show the highest overall accuracy
- The 机器学习 task shows the most divergence between models
### Interpretation
The data suggests Qwen-72B demonstrates superior generalization across diverse tasks, particularly in computational and language processing domains. The Yi-34B's near-zero performance in 机器学习 (machine learning) task is particularly anomalous and warrants investigation - this could indicate either a data collection error or fundamental model limitations in this domain. GPT-3.5 shows consistent mid-range performance but lacks the peak capabilities of Qwen-72B. The InternLM2-20B model exhibits moderate performance with notable weaknesses in machine learning applications. These patterns highlight the importance of model selection based on specific task requirements, with Qwen-72B emerging as the most robust performer in this benchmark.
</details>
Figure 15: Concept accuracies on Elementary-ZH of more models.
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy Across Tasks
### Overview
The image is a line chart comparing the accuracy of four AI models (Yi-6B, ChatGLM3-6B, LLaMA2-7B, DeepSeekMath-7B) across 30+ tasks represented by Chinese characters on the x-axis. The y-axis measures accuracy from 0 to 100. Each model is represented by a distinct color: blue (Yi-6B), orange (ChatGLM3-6B), green (LLaMA2-7B), and red (DeepSeekMath-7B). The chart shows significant variability in performance across tasks, with sharp peaks and troughs for all models.
### Components/Axes
- **X-axis**: Labeled with Chinese characters (e.g., "全等三角形", "等腰三角形", "平行四边形", etc.), representing 30+ distinct tasks or categories.
- **Y-axis**: Labeled "Accuracy" with a scale from 0 to 100 in increments of 20.
- **Legend**: Positioned at the top-right, mapping colors to models:
- Blue: Yi-6B
- Orange: ChatGLM3-6B
- Green: LLaMA2-7B
- Red: DeepSeekMath-7B
### Detailed Analysis
1. **Yi-6B (Blue)**:
- Stable but lower performance overall, with peaks around 60-70 (e.g., "等腰三角形" (isosceles triangles)) and troughs near 5-20.
2. **ChatGLM3-6B (Orange)**:
- Highest peak of ~90; sharp declines to ~10-20 on its weakest tasks; moderate performance (~40–60) on most tasks.
3. **LLaMA2-7B (Green)**:
- Peaks around 70 (e.g., "平行四边形" (parallelograms)); troughs near 10; consistent mid-range performance (~30–50) on most tasks.
4. **DeepSeekMath-7B (Red)**:
- Highest peaks of ~80 (e.g., "平行四边形" (parallelograms)); sharp declines to ~20 on its weakest tasks; strong on math-heavy tasks overall.
### Key Observations
- **Task-Specific Performance**: Models excel in specific tasks (e.g., DeepSeekMath-7B in math, ChatGLM3-6B in geometry).
- **Volatility**: All models show extreme fluctuations, with some tasks causing accuracy to drop to near 0.
- **Stability**: Yi-6B is the most consistent, though with lower overall accuracy.
- **Outliers**: ChatGLM3-6B’s ~90 peak on "等腰三角形" and DeepSeekMath-7B’s ~80 on "平行四边形" stand out.
### Interpretation
The data suggests that no single model dominates across all tasks. DeepSeekMath-7B and ChatGLM3-6B show task-specific strengths, likely due to specialized training data. Yi-6B’s stability implies robustness but limited specialization. The extreme variability highlights the importance of model selection based on task requirements. Anomalies like ChatGLM3-6B’s near-zero performance on "等腰三角形" suggest potential overfitting or data mismatch for certain tasks.
</details>
[x29.png: line chart of per-concept accuracy (0–100%) for InternLM2-Math-7B (blue), InternLM2-7B (orange), MAmmoTH-13B (green), and WizardMath-13B (red); the x-axis lists math concepts in Chinese, and performance varies substantially across concepts.]
[x30.png: line chart of per-concept accuracy (0–100%) for Baichuan2-13B (blue), LLaMA2-13B (orange), Qwen-14B (green), and InternLM2-Math-20B (red); the x-axis lists math concepts in Chinese, and accuracy fluctuates strongly across concepts.]
[x31.png: line chart of per-concept accuracy (0–100%) for InternLM2-20B (blue), Yi-34B (orange), Qwen-72B (green), and GPT-3.5 (red); the x-axis lists math concepts in Chinese, and accuracy varies widely across concepts.]
Figure 16: Concept accuracies on Middle-ZH of more models.
## Appendix B Details on the Efficient Fine-Tuning
In this section, we provide details on the efficient fine-tuning strategy used to strengthen mathematical reasoning on specific concepts: we first train a concept classifier and then use it to curate a set of samples from a large open-source math dataset. Specifically, we collect an extra 10 problems per concept and use them to train a classifier that identifies the concept class of a given question. The backbone of this classifier is a pretrained bilingual LLM (i.e., Baichuan2-13B), with the classification head applied to its last hidden output feature. The concept classification accuracy is 92.5% in English and 86.9% in Chinese, which indicates that such a classifier is reliable enough to curate a concept-related dataset from large-scale math data. In this work, we draw samples from OpenWebMath Paster et al. (2023) to produce the concept-related training dataset.
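The two-step recipe above can be sketched as follows. This is a toy illustration rather than the released implementation: the hash-based `last_hidden`, the fixed weight matrix, the concept subset, and the example corpus are all stand-ins for the real Baichuan2-13B backbone, trained head, and OpenWebMath data.

```python
import hashlib

CONCEPTS = ["Fractions", "Cylinders", "Probability"]  # illustrative subset of the hierarchy
HIDDEN_DIM = 16

def last_hidden(question: str) -> list[float]:
    """Stand-in for the backbone LLM's final-token hidden state.

    A real system would run Baichuan2-13B and take the last hidden output
    feature; here a deterministic hash-based vector keeps the sketch runnable.
    """
    digest = hashlib.sha256(question.encode()).digest()
    return [b / 255.0 for b in digest[:HIDDEN_DIM]]

# "Trained" classification head: one weight row per concept (fixed toy values).
WEIGHTS = [
    [((i * HIDDEN_DIM + j) % 7 - 3) / 3.0 for j in range(HIDDEN_DIM)]
    for i in range(len(CONCEPTS))
]

def classify_concept(question: str) -> str:
    """Linear head over the last hidden feature, argmax over concept logits."""
    h = last_hidden(question)
    logits = [sum(w * x for w, x in zip(row, h)) for row in WEIGHTS]
    return CONCEPTS[max(range(len(logits)), key=logits.__getitem__)]

def curate(corpus: list[str], target_concepts: set[str]) -> list[str]:
    """Keep only samples whose predicted concept is one of the target weaknesses."""
    return [q for q in corpus if classify_concept(q) in target_concepts]

corpus = [
    "What is 3/4 + 1/8?",
    "Find the volume of a cylinder with r = 2 and h = 5.",
    "What is the probability of two heads in two coin flips?",
]
weak_subset = curate(corpus, target_concepts={"Cylinders"})  # data for fine-tuning
```

The curated subset then serves as the fine-tuning data targeted at the model's weakest concepts.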
## Appendix C Details on the Evaluated Models
In this section, we give a detailed overview of the evaluated Large Language Models (LLMs) and list the corresponding model links in Table 6.
- GPT-3.5/GPT-4 OpenAI (2023): The most powerful closed-source models from OpenAI. We access them through the API model names gpt-3.5-turbo and gpt-4.
- LLaMA2-7B/13B/70B Touvron et al. (2023b): A set of open-source models developed by Meta.
- Qwen-14B/72B Bai et al. (2023b): These models, pretrained on multilingual data, concentrate on the Chinese and English languages. We employ both Qwen-Base-14B and Qwen-Base-72B.
- Baichuan2-13B Baichuan (2023): This model demonstrates impressive performance on both Chinese and English benchmarks.
- MetaMath-13B Yu et al. (2023): A domain-specific language model for mathematical reasoning, fine-tuned from the LLaMA-2 model on the MetaMathQA dataset (https://huggingface.co/datasets/meta-math/MetaMathQA).
- WizardMath-13B Luo et al. (2023): Another domain-specific language model for mathematical reasoning, fine-tuned from the LLaMA-2 model using reinforcement learning.
- MAmmoTH-13B Yue et al. (2023): This model is specifically designed for general math problem-solving and has been fine-tuned from the LLaMA model on the MathInstruct dataset (https://huggingface.co/datasets/TIGER-Lab/MathInstruct), whose training data includes both chain-of-thought (CoT) and program-of-thought (PoT) rationales.
- Yi-6B/34B Team (2023b): These models, released by 01.AI, show promising performance in both Chinese and English.
- ChatGLM3-6B Zeng et al. (2022): A lightweight, high-performance pretrained dialogue model released by Zhipu AI, supporting both Chinese and English.
- InternLM-7B/20B Team (2023a): Multilingual language models with progressively enhanced capabilities, released by the InternLM team.
- InternLM-Math-7B/20B Ying et al. (2024): Math-specialized language models with strong reasoning performance.
- DeepSeekMath-7B Shao et al. (2024): A powerful mathematical language model released by DeepSeek.
| Model Family | Model | HuggingFace Link / OpenAI Model |
| --- | --- | --- |
| ChatGLM3 | ChatGLM3-6B | https://huggingface.co/THUDM/chatglm3-6b |
| DeepSeekMath | DeepSeekMath-7B | https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct |
| Baichuan2 | Baichuan2-13B | https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat |
| MetaMath | MetaMath-13B | https://huggingface.co/meta-math/MetaMath-13B-V1.0 |
| WizardMath | WizardMath-13B | https://huggingface.co/WizardLM/WizardMath-13B-V1.0 |
| MAmmoTH | MAmmoTH-13B | https://huggingface.co/TIGER-Lab/MAmmoTH-13B |
| InternLM | InternLM-7B | https://huggingface.co/internlm/internlm2-chat-7b |
| | InternLM-20B | https://huggingface.co/internlm/internlm2-chat-20b |
| | InternLM-Math-7B | https://huggingface.co/internlm/internlm2-math-7b |
| | InternLM-Math-20B | https://huggingface.co/internlm/internlm2-math-20b |
| Yi | Yi-6B | https://huggingface.co/01-ai/Yi-6B-Chat |
| | Yi-34B | https://huggingface.co/01-ai/Yi-34B-Chat |
| LLaMA2 | LLaMA2-7B | https://huggingface.co/meta-llama/Llama-2-7b-chat-hf |
| | LLaMA2-13B | https://huggingface.co/meta-llama/Llama-2-13b-chat-hf |
| | LLaMA2-70B | https://huggingface.co/meta-llama/Llama-2-70b-chat |
| Qwen | Qwen-14B | https://huggingface.co/Qwen/Qwen-14B-Chat |
| | Qwen-72B | https://huggingface.co/Qwen/Qwen-72B-Chat |
| GPT | GPT-3.5 | gpt-3.5-turbo |
| | GPT-4 | gpt-4 |
Table 6: Model links.
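The open-source entries of Table 6 can be loaded with the standard Hugging Face transformers API. The helper below is hypothetical (not part of the authors' release); the repo IDs are copied from the table, and only a subset is shown for brevity.

```python
# Assumed helper mapping Table 6 model names to their checkpoints.
MODEL_LINKS = {
    "ChatGLM3-6B": "THUDM/chatglm3-6b",
    "DeepSeekMath-7B": "deepseek-ai/deepseek-math-7b-instruct",
    "Baichuan2-13B": "baichuan-inc/Baichuan2-13B-Chat",
    "Qwen-14B": "Qwen/Qwen-14B-Chat",
    "LLaMA2-7B": "meta-llama/Llama-2-7b-chat-hf",
}

def load_model(name: str):
    """Load tokenizer and model for an open-source entry of Table 6.

    Downloads weights on first use; trust_remote_code is needed because
    several of these repositories ship custom modeling code.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy import, kept local
    repo = MODEL_LINKS[name]
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
    return tokenizer, model
```

The closed-source GPT entries are instead queried through the OpenAI API using the listed model names.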
## Appendix D More Results
As shown in Fig. 9 and Fig. 10, we provide the mean concept accuracies of the evaluated models on Elementary-EN and Elementary-ZH.
As shown in Fig. 11 and Fig. 12, we illustrate the concept accuracies on Elementary-EN and Elementary-ZH for selected models. For the results of all models, please refer to Fig. 13, Fig. 14, Fig. 15 and Fig. 16.
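Concept-wise and mean concept accuracies of the kind reported in these figures can be computed straightforwardly from per-question results; the record format below is an assumption for illustration.

```python
from collections import defaultdict

def concept_accuracies(records):
    """Per-concept accuracy from (concept, is_correct) pairs."""
    totals, correct = defaultdict(int), defaultdict(int)
    for concept, ok in records:
        totals[concept] += 1
        correct[concept] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

def mean_concept_accuracy(records):
    """Unweighted mean over concepts, so small concepts count as much as large ones."""
    accs = concept_accuracies(records)
    return sum(accs.values()) / len(accs)
```

Note that averaging per concept (rather than per question) is what exposes concept-level weaknesses: a model can score well on average while failing an individual concept entirely.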
## Appendix E Details on 5-shot Prompts
We provide the 5-shot prompts used for ConceptMath below.
Q: For a party, Oscar gets 4 different kinds of soda. He buys a 6-pack of each kind. At home, he divides the sodas evenly among 3 coolers. How many sodas are in each cooler? A: Oscar buys 4 different kinds of soda, and he gets a 6-pack of each kind. This means that he has 4 * 6 = 24 sodas in total. He divides these 24 sodas evenly among 3 coolers, so the number of sodas per cooler is 24 / 3 = 8. Therefore, the answer is 8.

Q: Amazing Animals Zoo is building a new home for its 8 zebras. Each zebra will get 2 acres of grassy land to roam around in. The zebras will have a pond that covers 1 acre. How big will the zebras' new home be? A: The total required grassland area for the zebras is: 8 zebras * 2 acres per zebra = 16 acres. Then, we add the extra 1 acre for the pond: 16 acres (for the zebras) + 1 acre (for the pond) = 17 acres. Therefore, the answer is 17.

Q: The state taxes the portion of personal income exceeding 3500 yuan (where this portion does not exceed 1500 yuan) at a rate of 3%. If Xiaoming's father's monthly salary is 4700 yuan, how much personal income tax should Xiaoming's father pay each month? A: To calculate the amount of personal income tax Xiaoming's father should pay, we follow this simple process: subtract the tax-free threshold from the monthly salary. Taxable income = Monthly salary - Tax-free threshold = 4700 - 3500 = 1200 yuan. Then apply the tax rate to the amount that exceeds the threshold. Tax owed = Taxable income * Tax rate = 1200 * 3% = 36 yuan. Therefore, the answer is 36.

Q: Find the 10th term in the arithmetic sequence where the first term is 3 and the common difference is 4. A: To find the nth term in an arithmetic sequence, we can use the formula $a_{n}=a_{1}+(n-1)*d$, where $a_{n}$ is the $n$th term, $a_{1}$ is the first term, $d$ is the common difference, and $n$ is the term number. Given that the first term ($a_{1}$) is 3 and the common difference ($d$) is 4, we find the 10th term ($a_{10}$) by plugging these values into the formula: $a_{10}=a_{1}+(10-1)*d=3+9*4=3+36=39$. Therefore, the answer is 39.

Q: A regular polygon has an interior angle of 156 degrees. How many sides does the polygon have? A: To find the number of sides of a regular polygon with a given interior angle, we use the formula: Interior Angle = [(n - 2) * 180] / n, where n is the number of sides. Given that the interior angle is 156: 156 = [(n - 2) * 180] / n. Solving for n gives n = 15. Therefore, the answer is 15.
Q: $g(x)=x^{2}$, $f(x)=3(x-2)^{2}$, $h(x)=g(x)-f(x)$, $y=h(23)$; give the value of $y$. A: First, calculate $g(23)$: $g(23)=23^{2}=529$. Next, calculate $f(23)$: $f(23)=3*(23-2)^{2}=3*(21)^{2}=3*441=1323$. Now, calculate $h(23)$: $h(23)=g(23)-f(23)=529-1323=-794$. Therefore, the answer is -794.

Q: What is the slope of the line segment between the two points (3,1) and (5,2)? A: The slope of a line between two points $(x_{1},y_{1})$ and $(x_{2},y_{2})$ can be found using the formula $slope(m)=(y_{2}-y_{1})/(x_{2}-x_{1})$. Substituting the points (3, 1) and (5, 2) into the formula: $slope(m)=(2-1)/(5-3)=1/2$. Therefore, the answer is 1/2.

Q: In a party, there are 5 cupcakes remaining and 10 children. A mother inserts 10 pieces of paper into a hat, 5 with a cupcake image and 5 without. The children who draw the cupcake paper will receive a cupcake. If the first child draws a paper with a cupcake image, what is the probability that the second child will also draw a cupcake paper? A: Initially, there are 5 cupcake papers and 5 non-cupcake papers in the hat, making a total of 10 papers. After the first child draws a cupcake paper, there are 4 cupcake papers and 5 non-cupcake papers left, a total of 9 papers. The probability that the second child draws a cupcake paper is the number of cupcake papers left divided by the total number of papers left: 4/9. Therefore, the answer is 4/9.

Q: What is the total area of an irregular polygon that consists of two squares and two rectangles, where the sides of the squares are 12 cm and 16 cm respectively, and the dimensions of the rectangles are 13 cm by 20 cm and 10 cm by 7 cm respectively? A: To find the total area, we sum the individual areas of each shape. The area of a square is given by $A=s^{2}$, where $s$ is the side length. For the first square: $A_{1}=12^{2}=144cm^{2}$. For the second square: $A_{2}=16^{2}=256cm^{2}$. The area of a rectangle is given by $A=l*w$, where $l$ is the length and $w$ is the width. For the first rectangle: $A_{3}=13*20=260cm^{2}$. For the second rectangle: $A_{4}=10*7=70cm^{2}$. Total Area $=A_{1}+A_{2}+A_{3}+A_{4}=144+256+260+70=730cm^{2}$. Therefore, the answer is 730.

Q: Susie is making invitations to her party this coming weekend. It takes her about 5 minutes to make each invitation. It took her 2 hours to finish all the invitations. How many people is Susie inviting? A: First, convert the 2 hours into minutes to match the unit of time used for making the invitations: 2 hours = 120 minutes. Then, divide the total time by the time per invitation: 120 minutes / 5 minutes per invitation = 24 invitations. Therefore, the answer is 24.
Q: 小文用量角器测量一个角的度数时，误把外圈刻度看成了内圈刻度，量出的角是 $65°$，这个角实际是多少度？ A: 量角器一般有两圈刻度：内圈刻度通常按顺时针方向从0度到180度，外圈刻度通常按逆时针方向从0度到180度。如果小文误把外圈刻度看成了内圈刻度，那么实际的角度应该是180°减去错误读数，即 180° - 65° = 115°。 因此，答案是115。

Q: 小明以每小时2千米的速度上山，到了山顶后立即以每小时3千米的速度按原路返回。求小明上下山的平均速度是多少千米？ A: 设单程距离为d千米。要计算平均速度，我们需要总距离和总时间。总距离 = 上山距离 + 下山距离 = d + d = 2d千米。上山时间 = d/2小时，下山时间 = d/3小时，总时间 = d/2 + d/3小时。平均速度 = 总距离 / 总时间 = 2d / (d/2 + d/3) = 2d / (5d/6) = 12/5 = 2.4千米/小时。 因此，答案是2.4千米/小时。

Q: 一个礼盒长20cm、宽15cm、高10cm，用彩带捆扎这样的一个礼盒，至少需要多长的彩带？（接头处12厘米） A: 捆扎这样一个礼盒，通常沿礼盒的长和宽方向各绕一圈，再留一段长度作为接头。沿长方向绕一圈的长度 = （礼盒的长 + 礼盒的高）* 2 = (20 + 10) * 2 = 60 cm；沿宽方向绕一圈的长度 = （礼盒的宽 + 礼盒的高）* 2 = (15 + 10) * 2 = 50 cm；总长度 = 60 + 50 + 12 = 122 cm。 因此，答案是122厘米。

Q: 千岛湖大桥的一侧每隔10米安装1盏灯，共需91盏（两端都装），现在改装61盏（两端都装），平均每相邻两盏灯之间的距离是多少米？ A: 原来每隔10米安装1盏灯，共需91盏，说明桥的总长度覆盖了90个10米的间隔，所以桥的总长度 = 90 * 10 = 900米。现在改装61盏灯（两端都装），即把这900米分成60个间隔（因为两端都装灯时，间隔数比灯的数量少一个），每个间隔的长度 = 900米 / 60 = 15米。 因此，答案是15米。

Q: a是b的3/4，则a比b少（ ），b比a多（ ），b与a的比是（ ），如果a、b两数的和是210，则a=（ ），b=（ ），括号处分别为多少？ A: 已知a是b的3/4，即 a = 3/4 * b。因此，a比b少 (b - a)/b = 1/4，b比a多 (b - a)/a = 1/3，b与a的比为 b : (3/4 * b) = 4 : 3。若a、b两数的和是210，则 a + b = 210，将 a = 3/4 * b 代入方程得 3/4 * b + b = 210，解得 b = 120，a = 3/4 * 120 = 90。 因此，答案分别是 1/4，1/3，4:3，90，120。
Q: 求不等式 $-x^{2}-2x+8>0$ 的解集。 A: 对 $-x^{2}-2x+8>0$ 分解因式，得到 $-(x-2)(x+4)>0$，即 $(x-2)(x+4)<0$，解得 $-4<x<2$。 因此，答案是 (-4, 2)。

Q: 已知函数 $y=2x+3$ 与 $y=-1/2x+m$ 的图象交y轴于点A、B，两函数图象的交点为P(n,-2)，求m和n的值。 A: 两函数图象的交点为P(n,-2)，意味着这个点同时满足上述两个函数方程：将P点坐标(n,-2)代入函数 $y=2x+3$，即 $-2=2n+3$，解得 $n=-5/2$；再将P点坐标(-5/2,-2)代入函数 $y=-1/2x+m$，即 $-2=-1/2*(-5/2)+m$，解得 $m=-13/4$。 因此，答案是m为-13/4，n为-5/2。

Q: 已知PA是圆O的切线，A为切点，PC与圆O相交于B、C两点，PB=2cm，BC=8cm，则PA的长是多少厘米？ A: 由于PA是圆O的切线，根据切割线定理，切线长的平方等于该点到割线与圆两个交点的线段长之积，即 $PA^{2}=PC*PB$。已知PB=2cm、BC=8cm，则 $PC=PB+BC=2+8=10cm$。于是 $PA^{2}=PC*PB=10*2=20$，解得 $PA=2\sqrt{5}$ cm。 因此，答案是2 $\sqrt{5}$ 厘米。

Q: 南宋著名数学家杨辉精研数学，著有《详解九章算法》，对数的运算进行了深入研究与总结。类比其中的思想方法，可以解决很多数与式的计算问题。已知a、b为实数，且 $a+b=3$，$ab=1$，计算可得 $a^{2}+b^{2}=7$，$a^{3}+b^{3}=18$，$a^{4}+b^{4}=47$，由此求 $a^{5}+b^{5}$ 得多少？ A: 已知 $a+b=3$ 和 $ab=1$，可得：$a^{2}+b^{2}=(a+b)^{2}-2ab=9-2=7$；$a^{3}+b^{3}=(a+b)(a^{2}+b^{2})-ab(a+b)=3*7-1*3=18$；$a^{4}+b^{4}=(a+b)(a^{3}+b^{3})-ab(a^{2}+b^{2})=3*18-1*7=47$；同理 $a^{5}+b^{5}=(a+b)(a^{4}+b^{4})-ab(a^{3}+b^{3})=3*47-1*18=123$。 因此，答案是123。

Q: 某单位要招聘1名英语翻译，甲、乙两人报名参加了听、说、读、写等项素质测试，甲的成绩分别为：90分、80分、85分、78分；乙的成绩分别为：78分、82分、85分、88分。如果把听、说、读、写的成绩按 $2:2:3:3$ 计算素质测试平均成绩，那么甲和乙的平均成绩分别是多少？ A: 甲的平均成绩为 $(90*2+80*2+85*3+78*3)/(2+2+3+3)=(180+160+255+234)/10=829/10=82.9$ 分，乙的平均成绩为 $(78*2+82*2+85*3+88*3)/(2+2+3+3)=(156+164+255+264)/10=839/10=83.9$ 分。 因此，答案是甲的平均成绩是82.9分，乙的平均成绩是83.9分。
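Exemplars like the ones above are assembled into a 5-shot prompt by simple concatenation. The sketch below illustrates the "Q: ... A: ..." format only (two abbreviated exemplars rather than five); it is not the authors' exact evaluation harness.

```python
# Abbreviated (question, worked answer) exemplars in the same format as above.
EXEMPLARS = [
    ("Find the 10th term in the arithmetic sequence where the first term is 3 "
     "and the common difference is 4.",
     "a_n = a_1 + (n - 1) * d, so a_10 = 3 + 9 * 4 = 39. Therefore, the answer is 39."),
    ("What is the slope of the line segment between the two points (3,1) and (5,2)?",
     "slope = (2 - 1) / (5 - 3) = 1/2. Therefore, the answer is 1/2."),
]

def build_prompt(exemplars, test_question: str) -> str:
    """Concatenate Q/A exemplars, then append the unanswered test question."""
    shots = "\n".join(f"Q: {q} A: {a}" for q, a in exemplars)
    return f"{shots}\nQ: {test_question} A:"

prompt = build_prompt(
    EXEMPLARS,
    "A regular polygon has an interior angle of 156 degrees. "
    "How many sides does the polygon have?",
)
```

The "Therefore, the answer is X." pattern at the end of each exemplar answer makes the model's final answer easy to parse during evaluation.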
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| Calculation & Properties | Calculation | Add | 19 |
| | | Decimals | 20 |
| | | Division | 19 |
| | | Equations | 18 |
| | | Fractions | 16 |
| | | Mixed Operations | 18 |
| | | Multiple | 18 |
| | | Numerical Expressions | 20 |
| | | Place Value | 16 |
| | | Powers | 20 |
| | | Rational Number | 17 |
| | | Subtraction | 19 |
| | | Variable Expressions | 19 |
| | Properties | Compare | 20 |
| | | Count | 18 |
| | | Estimation & Rounding | 20 |
| | | Patterns | 19 |
| Geometry | Angles | Angles | 17 |
| | Coordinate Plane | Coordinate Plane | 18 |
| | Three-dimensional Shapes | Cones | 17 |
| | | Cubes | 20 |
| | | Cylinders | 17 |
| | | Spheres | 17 |
| | | Volume of 3D shapes | 18 |
| | Two-dimensional Shapes | Circles | 17 |
| | | Perimeter | 19 |
| | | Polygons | 18 |
| | | Quadrilaterals | 17 |
| | | Triangles | 18 |
| Measurement | Basic Knowledge | Temperature | 19 |
| | | Time | 20 |
| | Money | Coin Names & Value | 17 |
| | | Exchanging Money | 17 |
| | Ratio | Percent | 17 |
| | | Proportion | 18 |
| | | Ratio | 19 |
| | Size | Area | 19 |
| | | Length | 20 |
| | | Volume | 20 |
| | Weight | Light & Heavy | 20 |
| Statistics | Classifying & Sorting | Classifying & Sorting | 17 |
| | Data | Mode/Mean/Median/Range | 19 |
| | Probability | Probability | 16 |
Table 7: Details of the hierarchical concepts in Elementary-EN.
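The hierarchy in Table 7 maps naturally onto nested dictionaries (LEVEL1 → LEVEL2 → LEVEL3 → sample count), from which subtotals roll up directly. Only a fragment of the table is encoded below, for illustration.

```python
# Fragment of the Elementary-EN hierarchy from Table 7 (counts taken from the table).
ELEMENTARY_EN = {
    "Measurement": {
        "Basic Knowledge": {"Temperature": 19, "Time": 20},
        "Money": {"Coin Names & Value": 17, "Exchanging Money": 17},
    },
    "Statistics": {
        "Probability": {"Probability": 16},
    },
}

def total_samples(tree) -> int:
    """Sum LEVEL3 sample counts across the whole hierarchy."""
    return sum(
        count
        for level2 in tree.values()
        for level3_counts in level2.values()
        for count in level3_counts.values()
    )
```

This representation also makes it easy to report accuracy at any granularity, by aggregating per-question results up the LEVEL3 → LEVEL2 → LEVEL1 path.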
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| Calculation | Basic Calculation | Add & Subtract | 20 |
| | | Decimals | 19 |
| | | Divide | 19 |
| | | Exponents & Scientific Notation | 16 |
| | | Fractions & Decimals | 18 |
| | | Multiply | 18 |
| | | Square Roots & Cube Roots | 20 |
| | Consumer Math | Consumer Math | 18 |
| | Financial Literacy | Financial Literacy | 19 |
| | Integers | Absolute Value | 18 |
| | | Opposite Integers | 20 |
| | Measurement | Measurement Metric | 19 |
| | Number Theory | Factors | 20 |
| | | Prime Factorization | 19 |
| | | Prime or Composite | 18 |
| | Percents | Percents | 20 |
| | Rational & Irrational Numbers | Rational & Irrational Numbers | 18 |
| | Ratios & Rates | Proportional Relationships | 18 |
| | Sequences | Arithmetic Sequences | 19 |
| | | Geometric Sequences | 18 |
| Expressions, equations, and functions | Equations | Linear Equations | 20 |
| | | Systems of Equations | 18 |
| | Expressions | Equivalent Expressions | 20 |
| | | Radical | 17 |
| | | Variable | 18 |
| | Function | Domain & Range of Functions | 18 |
| | | Interpret Functions | 19 |
| | | Linear Functions | 20 |
| | | Nonlinear Functions | 18 |
| | Inequalities | Inequalities | 19 |
| Geometry | Congruence & Similarity | Congruence & Similarity | 19 |
| | Coordinate Plane | Axes | 17 |
| | | Distance Between Two Points | 19 |
| | | Quadrants | 16 |
| | Scale Drawings | Scale Drawings | 16 |
| | Slope | Slope | 20 |
| | Three-dimensional Figures | Polyhedra | 19 |
| | | Surface Area & Volume | 17 |
| | Transformations | Transformations | 18 |
| | Two-dimensional Figures | Circle | 20 |
| | | Lines & Angles | 18 |
| | | Perimeter & Area | 20 |
| | | Polygons | 18 |
| | | Square | 18 |
| | | Trapezoids | 16 |
| | | Triangle | 18 |
| Statistic and Probability | Data | Center & Variability | 18 |
| | | Mean, Median, Mode & Range | 19 |
| | | Outlier | 20 |
| | One-variable Statistics | One-variable Statistics | 19 |
| | Probability | Counting Principle | 16 |
| | | Independent & Dependent Events | 16 |
| | | Make Predictions | 17 |
| | | Probability of Compound Events | 16 |
| | | Probability of One Event | 17 |
| | | Probability of Simple and Opposite Events | 19 |
| | Two-variable Statistics | Two-variable Statistics | 18 |
Table 8: Details of the hierarchical concepts in Middle-EN.
[x32.png: table of the hierarchical concepts in Elementary-ZH with columns LEVEL1, LEVEL2, LEVEL3, and # Samples. The four LEVEL1 domains and their concepts (mostly 16–20 samples per LEVEL3 concept):
- 几何 (Geometry): two-dimensional shapes (Triangle, Circle, Parallelogram, Trapezoid, Square, Synthesis Problem, Angle, Rectangle) and three-dimensional shapes (Cylinder, Cube, Synthesis Problem, Cuboid), 20 samples each.
- 应用 (Application): fundamental problems (Addition/Differentiation, Basics, Differential, Normalization, Induction), classical problems (Interest, Period, Folding, Engineering, Age, Discount, Planting, Tax, Reduction, Imagination, Chickens & Rabbits), and distance problems (Encounter, Travel, Pursuit).
- 度量与统计 (Measurement & Statistics): measurement (RMB 9, Time 20, Concentration 20, Temperature 6, Area 17) and statistics (Permutation 20, Statistical Metrics 20, Law 18).
- 数与代数 (Number & Algebra): fractional operations (Fraction/Decimal 20, Fractional Application 20, Fractional Operation 20, Simplest Fraction 16), factors & multiples (Common Multiples 16, Common Divisors 11, Factor 20, Synthesis Problem 11, Prime Number 9), basic operations (Multiplication, Reciprocal Problem, Four-rule Operation, New Operation Definition, Equation, Division), and ratios (Multiple, Probability, Proportion, Percentage), 20 samples each unless noted.]
Figure 17: Details of the hierarchical concepts in Elementary-ZH.
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| 几何 (Geometry) | 三角形 (Triangle) | 全等三角形 (Congruent Triangle) | 20 |
| | | 勾股定理 (Pythagorean Theorem) | 20 |
| | | 等腰三角形 (Isosceles Triangle) | 20 |
| | | 等边三角形 (Equilateral Triangle) | 20 |
| | 四边形 (Quadrilateral) | 平行四边形 (Parallelogram) | 20 |
| | | 梯形 (Trapezium) | 20 |
| | 圆 (Circle) | 圆周角 (Angle of Circumference) | 20 |
| | | 圆心角 (Angle of Center) | 20 |
| | | 垂径定理 (Vertical Path Theorem) | 20 |
| | | 弧长和扇形面积 (Arc Length & Sector Area) | 20 |
| | | 正多边形和圆 (Regular Polygons & Circles) | 20 |
| | | 点线圆位置关系 (Relations of Point, Line & Circle) | 20 |
| | 立体图形 (Three-dimensional Shapes) | 圆锥 (Cone) | 20 |
| 函数 (Function) | 一次函数 (Linear Function) | 函数与一元一次方程 (Linear Functions & Linear Equations in One Variable) | 20 |
| | | 函数与一元一次不等式 (Linear Functions & Linear Inequalities in One Variable) | 20 |
| | | 一次函数与二元一次方程组 (Linear Functions & Systems of Binary Linear Equations) | 20 |
| | | 正比例函数 (Proportional Function) | 20 |
| | | 一次函数解析式 (Analytical Formula of Linear Functions) | 20 |
| | 二次函数 (Quadratic Function) | 二次函数的应用 (Applications of Quadratic Functions) | 20 |
| | | 抛物线的性质 (Properties of Parabolas) | 18 |
| | 反比例函数 (Inverse Proportional Function) | 定义 (Definition) | 20 |
| | | 应用 (Applications) | 20 |
| | | 性质 (Properties) | 19 |
| | 平面直角坐标系 (Rectangular Coordinate System) | 有序数对 (Ordered Pair) | 20 |
| | | 象限中的点 (Points of Quadrant) | 14 |
| 数与式 (Number and Expression) | 代数式 (Algebraic Expression) | 代数式求值 (Algebraic Expression Evaluation) | 20 |
| | | 同类项 (Like Terms) | 20 |
| | 分式 (Fraction) | 指数幂 (Exponential Power) | 20 |
| | | 约分 (Fraction Reduction) | 19 |
| | 因式 (Factor) | 十字相乘法 (Cross Multiplication) | 20 |
| | | 公因式提取 (Common Factor Extraction) | 18 |
| | 应用 (Application) | 流水问题 (Flow Problem) | 20 |
| | | 鸽巢问题 (Pigeonhole Problem) | 20 |
| | 整式 (Integral Expression) | 乘法公式 (Multiplication Formulas) | 20 |
| | | 整式的乘除及混合 (Multiplication, Division & Mixed Operations) | 20 |
| | | 整式的加减 (Addition & Subtraction) | 20 |
| | 无理数 (Irrational Number) | 无理数识别 (Irrational Number Recognition) | 20 |
| | 根式 (Radical Expression) | 二次根式的运算 (Operations on Quadratic Radicals) | 20 |
| | | 同类二次根式 (Similar Quadratic Radicals) | 20 |
| | | 平方根与算术平方根 (Square Root & Arithmetic Square Root) | 20 |
| | | 立方根 (Cube Root) | 20 |
| 方程与不等式 (Equations & Inequalities) | 一元一次方程 (Linear Equation in One Variable) | 一元一次方程的应用 (Applications) | 20 |
| | | 解一元一次方程 (Solutions) | 20 |
| | 一元二次方程 (Quadratic Equation in One Variable) | 一元二次方程的应用 (Applications) | 20 |
| | | 解一元二次方程 (Solutions) | 20 |
| | 不等式与不等式组 (Inequalities & Systems of Inequalities) | 一元一次不等式的应用 (Applications of Linear Inequalities in One Variable) | 20 |
| | | 一元一次不等式组的应用 (Applications of Systems of Linear Inequalities in One Variable) | 20 |
| | | 解一元一次不等式 (Solving Linear Inequalities in One Variable) | 20 |
| | | 解一元一次不等式组 (Solving Systems of Linear Inequalities in One Variable) | 20 |
| | 分式方程 (Fractional Equation) | 分式方程的应用 (Applications of Fractional Equations) | 20 |
| | | 解分式方程 (Solving Fractional Equations) | 20 |
| 统计与概率 (Statistics and Probability) | 数据分析 (Data Analysis) | 数据的波动趋势 (Fluctuating Trend of Data) | 20 |
| | | 数据的集中趋势 (Central Tendency of Data) | 20 |
| | 概率 (Probability) | 概率的应用 (Applications of Probability) | 20 |
| | | 求概率 (Finding Probabilities) | 20 |
| | | 随机事件与概率 (Random Events & Probabilities) | 20 |
Table 9: Details of the hierarchical concepts in Middle-ZH.
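To compute the concept-wise accuracies described in the paper, the hierarchy above must be machine-readable. A minimal sketch of one way to do this: flatten each table row into a `(LEVEL1, LEVEL2, LEVEL3, n_samples)` tuple and aggregate sample counts at any depth of the hierarchy. The `ROWS` subset and the `samples_per_level` helper below are illustrative, not part of the released dataset format.

```python
from collections import Counter

# A small illustrative subset of Table 9 (Middle-ZH), using the English glosses:
# (LEVEL1, LEVEL2, LEVEL3, n_samples)
ROWS = [
    ("Geometry", "Triangle", "Congruent Triangle", 20),
    ("Geometry", "Triangle", "Pythagorean Theorem", 20),
    ("Geometry", "Circle", "Arc Length & Sector Area", 20),
    ("Function", "Quadratic Function", "Properties of Parabolas", 18),
    ("Function", "Rectangular Coordinate System", "Points of Quadrant", 14),
    ("Statistics and Probability", "Probability", "Applications of Probability", 20),
]

def samples_per_level(rows, level):
    """Total sample counts at a hierarchy depth: 0 = LEVEL1, 1 = LEVEL2, 2 = LEVEL3."""
    counts = Counter()
    for row in rows:
        counts[row[level]] += row[3]
    return counts

level1_totals = samples_per_level(ROWS, 0)
print(level1_totals["Geometry"])  # 60 for this illustrative subset
```

The same aggregation applied to per-concept correct/total pairs instead of raw counts would yield the concept-wise accuracies at each level of the hierarchy.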