# ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
Abstract
This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates the concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with a single average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different levels of granularity with concept-wise accuracies. Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. In addition, we introduce an efficient fine-tuning strategy to enhance the weaknesses of existing LLMs. Finally, we hope ConceptMath can guide developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models. The data and code are available at https://github.com/conceptmath/conceptmath.
* First three authors contributed equally.
${}^{\dagger}$ Corresponding Author: Jiaheng Liu.
1 Introduction
Mathematical reasoning is a crucial capability for Large Language Models (LLMs). Recent advancements in LLMs, including Anthropic's Claude (Anthropic, 2023), GPT-4 (OpenAI, 2023), and LLaMA (Touvron et al., 2023a), have demonstrated impressive mathematical reasoning on existing benchmarks, with high average accuracies on datasets like GSM8K (Cobbe et al., 2021). Although these benchmarks measure the overall mathematical reasoning capabilities of LLMs on average, they fail to probe the fine-grained failure modes of mathematical reasoning on specific mathematical concepts. For example, Fig. 1 shows that the performance of LLaMA2-13B varies significantly across different concepts, and the model fails on simple concepts like Rational number and Cylinders. Knowing these specific failure modes of a language model is crucial, especially in practical applications that focus on specific mathematical abilities. For example, for financial analysts, calculation and statistics are the concepts of most interest, while others such as geometry are less important.
Moreover, the mathematics system is, by its nature, fine-grained rather than holistic. It is typically organized into distinct math concepts (see https://en.wikipedia.org/wiki/Lists_of_mathematics_topics), and humans develop comprehensive mathematical capabilities through a concept-by-concept, curriculum-based learning process (Simon, 2011; Fritz et al., 2013). These observations underscore the core motivation of this paper: the need for a fine-grained benchmark that evaluates the concept-wise mathematical reasoning capabilities of LLMs.
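The gap between a single average accuracy and concept-wise accuracies can be illustrated with a minimal sketch; the function and the graded records below are our own illustrations, not part of the benchmark's released code:

```python
from collections import defaultdict

def concept_accuracies(records):
    """Aggregate per-concept accuracy from (concept, is_correct) records."""
    totals, correct = defaultdict(int), defaultdict(int)
    for concept, ok in records:
        totals[concept] += 1
        correct[concept] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

# Toy records: two concepts, four graded model answers.
records = [
    ("Decimals", True), ("Decimals", True),
    ("Cylinders", True), ("Cylinders", False),
]
per_concept = concept_accuracies(records)              # {"Decimals": 1.0, "Cylinders": 0.5}
average = sum(ok for _, ok in records) / len(records)  # 0.75
```

The average (0.75) hides the fact that the model is perfect on Decimals but no better than a coin flip on Cylinders, which is exactly the kind of weakness a concept-wise breakdown exposes.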
<details>
<summary>x1.png Details</summary>

### Visual Description
## Line Chart: LLaMA2 vs. LLaMA2-FT Accuracy on Various Tasks
### Overview
The image is a line chart comparing the accuracy of two models, LLaMA2 and LLaMA2-FT, across a range of tasks. The x-axis represents different task categories, while the y-axis represents accuracy, ranging from 0 to 90. The chart highlights a region where LLaMA2 performs poorly ("Weaknesses") and shows how LLaMA2-FT enhances performance in those areas ("Enhancing Weaknesses").
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:** Task categories: Powers, Numerical exprs, Estimation & rounding, Decimals, Light & heavy, Temperature, Ratio, Patterns, Cylinders, Perimeter, Rational number, Polygons, Probability.
* **Y-axis:** Accuracy, ranging from 0 to 90 in increments of 10.
* **Legend:** Located in the bottom-left corner:
* LLaMA2 (light green line)
* LLaMA2-FT (blue line)
* **Regions:**
* A light blue shaded region spans from "Powers" to "Patterns".
* A light red shaded region spans from "Cylinders" to "Probability". This region is labeled "Enhancing Weaknesses" at the top and "Weaknesses" at the bottom.
### Detailed Analysis
* **LLaMA2 (light green line):**
* Powers: ~65
* Numerical exprs: ~65
* Estimation & rounding: ~30
* Decimals: ~65
* Light & heavy: ~80
* Temperature: ~70
* Ratio: ~40
* Patterns: ~35
* Cylinders: ~12
* Perimeter: ~25
* Rational number: ~11
* Polygons: ~11
* Probability: ~20
* **LLaMA2-FT (blue line):**
* Powers: ~70
* Numerical exprs: ~70
* Estimation & rounding: ~35
* Decimals: ~85
* Light & heavy: ~80
* Temperature: ~70
* Ratio: ~42
* Patterns: ~37
* Cylinders: ~45
* Perimeter: ~50
* Rational number: ~62
* Polygons: ~67
* Probability: ~75
* **Trends:**
* LLaMA2: Starts high, drops significantly at "Estimation & rounding", rises sharply to "Light & heavy", then declines gradually with a sharp drop at "Cylinders", then slowly rises again.
* LLaMA2-FT: Similar to LLaMA2 but generally higher, especially in the "Enhancing Weaknesses" region.
* **Enhancing Weaknesses Region:**
* Vertical dashed lines connect the LLaMA2 data points to the corresponding LLaMA2-FT data points in the "Enhancing Weaknesses" region, visually indicating the improvement in accuracy.
* Red star markers are placed on the LLaMA2-FT line at each data point within the "Enhancing Weaknesses" region.
### Key Observations
* LLaMA2-FT consistently outperforms LLaMA2 across all tasks.
* The "Enhancing Weaknesses" region clearly demonstrates the improvement achieved by LLaMA2-FT in tasks where LLaMA2 performs poorly.
* The largest performance gains are observed in "Cylinders", "Perimeter", "Rational number", "Polygons", and "Probability".
### Interpretation
The chart illustrates the effectiveness of fine-tuning (FT) LLaMA2 to improve its accuracy on specific tasks. The "Enhancing Weaknesses" region highlights the targeted improvement achieved by LLaMA2-FT in areas where the base LLaMA2 model struggles. This suggests that fine-tuning is a valuable technique for enhancing the performance of language models on specific domains or tasks. The consistent outperformance of LLaMA2-FT indicates that the fine-tuning process was successful in transferring knowledge and improving the model's ability to handle these tasks. The red stars and dashed lines emphasize the magnitude of the improvement in the "Weaknesses" area.
</details>
Figure 1: The concept-wise accuracies of LLaMA2-13B and the fine-tuned version based on our efficient fine-tuning method (i.e., LLaMA2-FT).
Therefore, we first introduce ConceptMath, the first bilingual (English and Chinese), concept-wise benchmark for measuring mathematical reasoning. ConceptMath gathers math concepts from four educational systems, resulting in four distinct mathematical concept systems: English Elementary, English Middle, Chinese Elementary, and Chinese Middle (abbreviated as Elementary-EN, Middle-EN, Elementary-ZH, and Middle-ZH). Each concept system organizes around 50 atomic math concepts under a three-level hierarchy, and each concept includes approximately 20 mathematical problems. Overall, ConceptMath comprises a total of 4011 math word problems across 214 math concepts, and Fig. 2 shows a diagram overview of ConceptMath.
Second, based on ConceptMath, we perform extensive experiments to assess the mathematical reasoning of existing LLMs, including 2 closed-source LLMs and 17 open-source LLMs. These evaluations were performed in zero-shot, chain-of-thought (CoT), and few-shot settings. To our surprise, even though most of the evaluated LLMs achieve high average accuracies on traditional mathematical benchmarks (e.g., GSM8K), they fail catastrophically across a wide spectrum of mathematical concepts.
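The three evaluation settings differ only in how the prompt is assembled. The template below is a hedged sketch of such a prompt builder; the exact wording is our own assumption, not the paper's released prompts:

```python
def build_prompt(question, setting="zero-shot", exemplars=None):
    """Assemble an evaluation prompt under one of three settings.

    `exemplars` is a list of (question, answer) pairs used for few-shot."""
    if setting == "zero-shot":
        return f"Question: {question}\nAnswer:"
    if setting == "cot":
        # Chain-of-thought: elicit intermediate reasoning before the answer.
        return f"Question: {question}\nLet's think step by step."
    if setting == "few-shot":
        shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in exemplars)
        return f"{shots}\n\nQuestion: {question}\nAnswer:"
    raise ValueError(f"unknown setting: {setting}")
```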
Third, to make targeted improvements on underperforming math concepts, we propose an efficient fine-tuning strategy: we first train a concept classifier and then crawl a set of samples from large open-source math datasets (Paster et al., 2023; Wang et al., 2023b) for further LLM fine-tuning. In Fig. 1, for LLaMA2-FT, we observe that performance on these weak concepts improves substantially after applying the efficient fine-tuning method.
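A minimal sketch of the data-selection step, assuming the classifier returns a (concept, confidence) pair per problem; `toy_classify`, the threshold, and all names here are illustrative stand-ins for the trained concept classifier, not the paper's implementation:

```python
def select_finetuning_data(corpus, classify, weak_concepts, threshold=0.8):
    """Keep corpus problems the classifier confidently assigns to a weak concept."""
    selected = []
    for problem in corpus:
        concept, confidence = classify(problem)
        if concept in weak_concepts and confidence >= threshold:
            selected.append((concept, problem))
    return selected

# Toy keyword classifier standing in for the trained concept classifier.
def toy_classify(problem):
    if "cylinder" in problem.lower():
        return "Cylinders", 0.9
    return "Other", 0.9

corpus = ["Find the volume of a cylinder.", "Add 2 and 3."]
selected = select_finetuning_data(corpus, toy_classify, {"Cylinders"})
```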
In summary, our contributions are as follows:
- We introduce ConceptMath, the first bilingual, concept-wise benchmark for measuring mathematical reasoning. ConceptMath encompasses 4 systems, approximately 214 math concepts, and 4011 math word problems, which can guide further improvements on the mathematical reasoning of existing models.
- Based on ConceptMath, we evaluate many LLMs and perform a comprehensive analysis of their results. For example, we observe that most of these LLMs (whether open-source, closed-source, general-purpose, or math-specialized) show significant variations in performance across math concepts.
- We also evaluate the contamination rate of ConceptMath and introduce a simple and efficient fine-tuning method to mitigate the weaknesses of existing LLMs.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Circular Diagram: Math Concepts
### Overview
The image is a circular diagram illustrating the relationships between different mathematical concepts. It is structured in concentric rings, with broader categories in the inner rings and more specific subcategories in the outer rings. The diagram uses color-coding to visually group related concepts.
### Components/Axes
* **Center:** Geometry (Dark Teal)
* **Ring 2:**
* Statistics (Orange)
* Measurement (Yellow)
* Calculate & Properties (Green)
* Two-Dim Shapes (Light Blue)
* Three-Dim Shapes (Blue)
* Coordinate Plane (Dark Blue)
* **Outer Ring:**
* **Statistics:** Probability, Classifying & Sorting, Data
* **Measurement:** Volume, Length, Area, Size, Time, Temperature, Ratio, Proportional, Percents, Light & Heavy, Exchanging Money, Coin Names & Value
* **Calculate & Properties:** Patterns, Estimation & Rounding, Count, Compare, Variable Expressions, Subtraction, Rational Number, Powers, Place Value, Numerical Expressions, Multiple Operations
* **Two-Dim Shapes:** Angles, Circles, Perimeter, Polygons, Quadrilaterals, Triangles
* **Three-Dim Shapes:** Cones, Cubes, Cylinders, Spheres, Volume of 3D Shapes
* **Calculate:** Add, Decimals, Division, Equations, Fractions, Mixed Operations
* **Color Coding:**
* Dark Teal: Geometry
* Orange: Statistics
* Yellow: Measurement
* Green: Calculate & Properties
* Light Blue: Two-Dim Shapes
* Blue: Three-Dim Shapes
* Dark Blue: Coordinate Plane
### Detailed Analysis
* **Geometry:** The central concept, branching out into shapes, statistics, measurement, and calculation.
* **Statistics:** Includes Probability, Classifying & Sorting, and Data.
* **Measurement:** Covers a wide range of concepts, from Volume and Length to Time, Temperature, Ratio, Proportional, Percents, Weight (Light & Heavy), and Money (Exchanging Money, Coin Names & Value).
* **Calculate & Properties:** Encompasses Patterns, Estimation & Rounding, Count, Compare, Variable Expressions, Subtraction, Rational Number, Powers, Place Value, Numerical Expressions, and Multiple Operations.
* **Two-Dim Shapes:** Includes Angles, Circles, Perimeter, Polygons, Quadrilaterals, and Triangles.
* **Three-Dim Shapes:** Includes Cones, Cubes, Cylinders, Spheres, and Volume of 3D Shapes.
* **Coordinate Plane:** Includes Coordinate Plane.
* **Calculate:** Includes Add, Decimals, Division, Equations, Fractions, and Mixed Operations.
### Key Observations
* The diagram provides a hierarchical view of mathematical concepts, starting from the broad category of Geometry and branching out into more specific subcategories.
* The color-coding helps to visually group related concepts, making it easier to understand the relationships between them.
* The diagram covers a wide range of mathematical topics, from basic measurement and calculation to more advanced concepts like statistics and geometry.
### Interpretation
The circular diagram serves as a visual representation of how different mathematical concepts are interconnected. It highlights the central role of Geometry as a foundation for other areas of mathematics. The diagram is useful for understanding the relationships between different topics and for providing a high-level overview of the field of mathematics. The structure suggests a curriculum or a learning path, starting with basic concepts and progressing to more advanced topics. The diagram could be used as a reference tool for students or teachers to understand the scope and structure of mathematics.
</details>
(a) English Elementary (Elementary-EN)
<details>
<summary>x3.png Details</summary>

### Visual Description
## Circular Diagram: Math Concepts
### Overview
The image is a circular diagram illustrating various mathematical concepts, categorized into broader areas like "Exprs, Equations & Functions," "Statistic & Probability," "Geometry," and "Calculate." The diagram uses color-coding to visually group related concepts.
### Components/Axes
The diagram is structured as a series of concentric rings, with the central area containing the most general categories and the outer rings containing more specific sub-categories.
* **Central Categories:**
* Exprs, Equations & Functions (Orange)
* Statistic & Probability (Yellow)
* Geometry (Blue)
* Calculate (Green)
* **Outer Rings:** Contain specific mathematical concepts related to the central categories.
### Detailed Analysis
**1. Exprs, Equations & Functions (Orange):**
* **Outermost Ring (Darkest Orange to Lightest Orange):**
* Linear Equations
* Variable Exprs
* Radical Exprs
* Equivalent Exprs
* Systems of Equations
* Equations
* Inequalities
* **Middle Ring:**
* Nonlinear Functions
* Linear Functions
* Interpret Functions
* Domain & Range of Functions
* Function Concepts
* **Innermost Ring:**
* Two-Variable Statistics
* One-Variable Statistics
**2. Statistic & Probability (Yellow):**
* **Outermost Ring:**
* Outlier
* Mean, Median, Mode & Range
* Center & Variability
* Probability of Simple and Opposite Events
* Probability of One Event
* Probability of Compound Events
* Make Predictions
* Independent & Dependent Events
* Counting Principle
* **Middle Ring:**
* Data
* Probability
* **Innermost Ring:**
* Two-Variable Statistics
* One-Variable Statistics
**3. Geometry (Blue):**
* **Outermost Ring (Darkest Blue to Lightest Blue):**
* Slope
* Scale Drawings
* Transformations
* Congruence & Similarity
* Three-Dim Figures
* Surface Area & Volume
* Polyhedra
* Quadrants
* Distance Between Two Points
* Axes
* Triangle
* Trapezoids
* Square
* Polygons
* Perimeter & Area
* Lines & Angles
* Circle
* Two-Dim Figures
* **Middle Ring:**
* Coordinate Plane
**4. Calculate (Green):**
* **Outermost Ring (Darkest Green to Lightest Green):**
* Consumer Math
* Measurement
* Estimate Metric Measurements
* Financial Literacy
* Percents
* Ratios & Rates
* Proportional Relationships
* Rational & Irrational Numbers
* Integers
* Absolute Value
* Opposite Integers
* Arithmetic Sequences
* Geometric Sequences
* Factors
* Prime Factorization
* Prime or Composite
* Add & Subtract
* Decimals
* Divide
* Exponents & Scientific Notation
* Fractions & Decimals
* Multiply
* Square Roots & Cube Roots
* **Middle Ring:**
* Irrational Numbers
* Sequences
* Number Theory
* Basic Calculate
### Key Observations
* The diagram provides a hierarchical organization of mathematical concepts.
* The color-coding helps to visually group related concepts.
* The level of detail increases from the center to the outer rings.
### Interpretation
The circular diagram serves as a visual aid for understanding the relationships between different mathematical concepts. It demonstrates how specific topics like "Linear Equations" and "Slope" fit into broader categories like "Exprs, Equations & Functions" and "Geometry," respectively. The diagram is useful for students or anyone seeking a high-level overview of mathematical topics and their interconnections. The arrangement suggests a progression of learning, starting with fundamental concepts and moving towards more specialized areas.
</details>
(b) English Middle (Middle-EN)
<details>
<summary>x4.png Details</summary>

### Visual Description
## Radial Chart: Math Problem Categories
### Overview
The image is a radial chart illustrating the categorization of math problems. The chart is divided into several layers, each representing a broader category that branches into more specific subcategories. The chart is written in Chinese.
### Components/Axes
* **Center:** The innermost layer is labeled "度量与统计" (Dùliàng yǔ tǒngjì), which translates to "Measurement and Statistics."
* **Second Layer:** This layer is divided into four main categories:
* "应用" (Yìngyòng) - Application (Orange)
* "几何" (Jǐhé) - Geometry (Blue)
* "数与代数" (Shù yǔ dàishù) - Numbers and Algebra (Green)
* "度量" (Dùliàng) - Measurement (Yellow)
* "统计" (Tǒngjì) - Statistics (Yellow)
* **Outer Layers:** These layers contain subcategories of math problems, branching out from the main categories.
### Detailed Analysis
**1. 应用 (Yìngyòng) - Application (Orange):**
* "行程问题" (Xíngchéng wèntí) - Travel Problems
* "追击问题" (Zhuījī wèntí) - Pursuit Problems
* "相遇问题" (Xiāngyù wèntí) - Meeting Problems
* "基础问题" (Jīchǔ wèntí) - Basic Problems
* "和差倍问题" (Hé chā bèi wèntí) - Sum, Difference, and Multiple Problems
* "归总问题" (Guīzǒng wèntí) - Total Return Problems
* "归一问题" (Guī yī wèntí) - Unitary Method Problems
* "差倍问题" (Chā bèi wèntí) - Difference and Multiple Problems
* "周期问题" (Zhōuqí wèntí) - Period Problems
* "植树问题" (Zhí shù wèntí) - Tree Planting Problems
* "折扣问题" (Zhékòu wèntí) - Discount Problems
* "页码问题" (Yèmǎ wèntí) - Page Number Problems
* "税率问题" (Shuìlǜ wèntí) - Tax Rate Problems
* "年龄问题" (Niánlíng wèntí) - Age Problems
* "利息问题" (Lìxī wèntí) - Interest Problems
* "鸡兔同笼问题" (Jī tù tóng lóng wèntí) - Chicken and Rabbit in the Same Cage Problems
* "还原问题" (Huányuán wèntí) - Restoration Problems
* "工程问题" (Gōngchéng wèntí) - Work Problems
* "经典问题" (Jīngdiǎn wèntí) - Classic Problems
**2. 几何 (Jǐhé) - Geometry (Blue):**
* "立体图形综合" (Lìtǐ túxíng zònghé) - Comprehensive Solid Geometry
* "正方体" (Zhèngfāngtǐ) - Cube
* "长方体" (Chángfāngtǐ) - Cuboid
* "圆柱" (Yuánzhù) - Cylinder
* "立体图形" (Lìtǐ túxíng) - Solid Geometry
* "统计指标" (Tǒngjì zhǐbiāo) - Statistical Indicators
* "排列组合" (Páiliè zǔhé) - Permutations and Combinations
* "角" (Jiǎo) - Angle
* "平行四边形" (Píngxíng sìbiānxíng) - Parallelogram
* "三角形" (Sānjiǎoxíng) - Triangle
* "梯形" (Tīxíng) - Trapezoid
* "圆" (Yuán) - Circle
* "长方形" (Chángfāngxíng) - Rectangle
* "正方形" (Zhèngfāngxíng) - Square
* "平面图形综合" (Píngmiàn túxíng zònghé) - Comprehensive Plane Geometry
* "平面图形" (Píngmiàn túxíng) - Plane Geometry
**3. 数与代数 (Shù yǔ dàishù) - Numbers and Algebra (Green):**
* "比例问题" (Bǐlì wèntí) - Proportion Problems
* "倍数问题" (Bèishù wèntí) - Multiple Problems
* "百分率问题" (Bǎifēn lǜ wèntí) - Percentage Problems
* "四则运算" (Sìzé yùnsuàn) - Four Arithmetic Operations
* "方程问题" (Fāngchéng wèntí) - Equation Problems
* "定义新运算" (Dìngyì xīn yùsuàn) - Defined New Operations
* "倒数问题" (Dàoshù wèntí) - Reciprocal Problems
* "除法问题" (Chúfǎ wèntí) - Division Problems
* "乘法问题" (Chéngfǎ wèntí) - Multiplication Problems
* "基础运算" (Jīchǔ yùsuàn) - Basic Operations
* "比与比例" (Bǐ yǔ bǐlì) - Ratio and Proportion
* "概率问题" (Gàilǜ wèntí) - Probability Problems
* "分数应用" (Fēnshù yìngyòng) - Fraction Applications
* "分数与小数" (Fēnshù yǔ xiǎoshù) - Fractions and Decimals
* "分数运算" (Fēnshù yùsuàn) - Fraction Operations
* "最简分数" (Zuì jiǎn fēnshù) - Simplest Fraction
* "公倍数问题" (Gōngbèishù wèntí) - Common Multiple Problems
* "公约数问题" (Gōngyuēshù wèntí) - Common Factor Problems
* "因数与倍数综合" (Yīnsù yǔ bèishù zònghé) - Comprehensive Factors and Multiples
* "质数问题" (Zhìshù wèntí) - Prime Number Problems
* "因数问题" (Yīnsù wèntí) - Factor Problems
* "因数与倍数" (Yīnsù yǔ bèishù) - Factors and Multiples
**4. 度量 (Dùliàng) - Measurement (Yellow):**
* "规律" (Guīlǜ) - Pattern
* "温度问题" (Wēndù wèntí) - Temperature Problems
* "时间问题" (Shíjiān wèntí) - Time Problems
* "人民币问题" (Rénmínbì wèntí) - RMB Problems
* "浓度问题" (Nóngdù wèntí) - Concentration Problems
* "面积问题" (Miànjī wèntí) - Area Problems
### Key Observations
* The chart provides a hierarchical breakdown of math problems, starting from broad categories and drilling down to specific types.
* The "Application" category has the most subcategories, indicating a wide range of real-world problem-solving scenarios.
* The "Geometry" category covers both solid and plane geometry concepts.
* The "Numbers and Algebra" category includes various arithmetic operations, fractions, ratios, and probability.
* The "Measurement" category focuses on practical measurement-related problems.
### Interpretation
The radial chart serves as a visual guide for categorizing and understanding different types of math problems. It highlights the relationships between broad mathematical concepts and specific problem-solving techniques. This type of chart is useful for students, teachers, and anyone interested in organizing and navigating the landscape of mathematical problems. The chart suggests that problem-solving in mathematics is highly interconnected, with various concepts and techniques building upon each other.
</details>
(c) Chinese Elementary (Elementary-ZH)
<details>
<summary>x5.png Details</summary>

### Visual Description
## Circular Chart: Mathematics Topics
### Overview
The image is a circular chart, resembling a pie chart, that visually organizes various mathematical topics. The chart is divided into several main categories, each represented by a different color and further subdivided into more specific sub-topics. All text is in Chinese, with English translations provided.
### Components/Axes
The chart is structured in concentric rings, with the main categories in the inner ring and sub-categories in the outer rings. The main categories are:
* **数与式 (Shù yǔ shì)** - Numbers and Expressions (Green)
* **统计与概率 (Tǒngjì yǔ gàilǜ)** - Statistics and Probability (Yellow)
* **方程与不等式 (Fāngchéng yǔ bù děngshì)** - Equations and Inequalities (Pink)
* **函数 (Hánshù)** - Functions (Orange)
* **几何 (Jǐhé)** - Geometry (Blue)
### Detailed Analysis
Here's a breakdown of the sub-categories within each main category:
**1. 数与式 (Shù yǔ shì) - Numbers and Expressions (Green)**
* **代数式 (Dàishùshì)** - Algebraic Expressions
* 代数式求值 (Dàishùshì qiúzhí) - Evaluating Algebraic Expressions
* 同类项 (Tónglèi xiàng) - Like Terms
* **分式 (Fēnshì)** - Fractions
* 约分与通分 (Yuē fēn yǔ tōng fēn) - Reducing and Finding Common Denominators
* **无理数 (Wúlǐshù)** - Irrational Numbers
* 指数幂 (Zhǐshù mì) - Exponential Powers
* 判断无理数 (Pànduàn wúlǐshù) - Judging Irrational Numbers
* **因式 (Yīnshì)** - Factors
* 十字相乘法 (Shízì xiāng chéng fǎ) - Cross Multiplication Method
* **整式 (Zhěngshì)** - Polynomials
* 整式的加减 (Zhěngshì de jiājiǎn) - Addition and Subtraction of Polynomials
* 整式的乘除及混合 (Zhěngshì de chéng chú jí hùnhé) - Multiplication, Division, and Mixture of Polynomials
* **根式 (Gēnshì)** - Radicals
* 乘法公式 (Chéngfǎ gōngshì) - Multiplication Formulas
* 同类二次根式 (Tónglèi èrcì gēnshì) - Like Quadratic Radicals
* 平方根与算术平方根 (Píngfāng gēn yǔ suànshù píngfāng gēn) - Square Roots and Arithmetic Square Roots
* 二次根式的运算 (Èrcì gēnshì de yùsuàn) - Operations with Quadratic Radicals
* **立方根 (Lìfāng gēn)** - Cube Roots
**2. 统计与概率 (Tǒngjì yǔ gàilǜ) - Statistics and Probability (Yellow)**
* **概率 (Gàilǜ)** - Probability
* 求概率 (Qiú gàilǜ) - Finding Probability
* 概率的应用 (Gàilǜ de yìngyòng) - Application of Probability
* 随机事件与概率 (Suíjī shìjiàn yǔ gàilǜ) - Random Events and Probability
* 数据的波动趋势 (Shùjù de bōdòng qūshì) - Trend of Data Fluctuation
* 数据的集中趋势 (Shùjù de jízhōng qūshì) - Trend of Data Concentration
* **应用 (Yìngyòng)** - Application
* 流水问题 (Liúshuǐ wèntí) - Current Problems
* 鸽巢问题 (Gē cháo wèntí) - Pigeonhole Principle
* **数据分析 (Shùjù fēnxī)** - Data Analysis
* 提公因式 (Tí gōng yīnshì) - Factoring out the Greatest Common Factor
**3. 方程与不等式 (Fāngchéng yǔ bù děngshì) - Equations and Inequalities (Pink)**
* **不等式与不等式组 (Bù děngshì yǔ bù děngshì zǔ)** - Inequalities and Systems of Inequalities
* 解一元一次不等式 (Jiě yīyuán yīcì bù děngshì) - Solving Linear Inequalities in One Variable
* 一元一次不等式组 (Yīyuán yīcì bù děngshì zǔ) - Systems of Linear Inequalities in One Variable
* 一元一次不等式的应用 (Yīyuán yīcì bù děngshì de yìngyòng) - Applications of Linear Inequalities in One Variable
* **分式方程 (Fēnshì fāngchéng)** - Fractional Equations
* 分式方程的应用 (Fēnshì fāngchéng de yìngyòng) - Applications of Fractional Equations
* 解分式方程 (Jiě fēnshì fāngchéng) - Solving Fractional Equations
* **一元二次方程 (Yīyuán èrcì fāngchéng)** - Quadratic Equations in One Variable
* 一元二次方程的应用 (Yīyuán èrcì fāngchéng de yìngyòng) - Applications of Quadratic Equations in One Variable
* 解一元二次方程 (Jiě yīyuán èrcì fāngchéng) - Solving Quadratic Equations in One Variable
**4. 函数 (Hánshù) - Functions (Orange)**
* **一次函数 (Yīcì hánshù)** - Linear Functions
* 函数与一元一次不等式 (Hánshù yǔ yīyuán yīcì bù děngshì) - Functions and Linear Inequalities in One Variable
* 函数与一元一次方程 (Hánshù yǔ yīyuán yīcì fāngchéng) - Functions and Linear Equations in One Variable
* 函数与二元一次方程组 (Hánshù yǔ èryuán yīcì fāngchéng zǔ) - Functions and Systems of Linear Equations in Two Variables
* 求一次函数解析式 (Qiú yīcì hánshù jiěxīshì) - Finding the Analytical Expression of a Linear Function
* **反比例函数 (Fǎn bǐlì hánshù)** - Inverse Proportionality Functions
* 反比例函数的应用 (Fǎn bǐlì hánshù de yìngyòng) - Applications of Inverse Proportionality Functions
* 反比例函数的性质 (Fǎn bǐlì hánshù de xìngzhì) - Properties of Inverse Proportionality Functions
* 反比例函数的定义 (Fǎn bǐlì hánshù de dìngyì) - Definition of Inverse Proportionality Functions
* **二次函数 (Èrcì hánshù)** - Quadratic Functions
* 抛物线的性质 (Pāowùxiàn de xìngzhì) - Properties of Parabolas
* 二次函数的应用 (Èrcì hánshù de yìngyòng) - Applications of Quadratic Functions
* **平面直角坐标系 (Píngmiàn zhíjiǎo zuòbiāo xì)** - Cartesian Coordinate System
* 有序数对 (Yǒuxù shù duì) - Ordered Pairs
* 点所在象限 (Diǎn suǒzài xiàngxiàn) - Quadrant Where a Point is Located
**5. 几何 (Jǐhé) - Geometry (Blue)**
* **圆 (Yuán)** - Circle
* 圆心角 (Yuánxīn jiǎo) - Central Angle
* 圆周角 (Yuánzhōu jiǎo) - Inscribed Angle
* 正多边形和圆 (Zhèng duōbiānxíng hé yuán) - Regular Polygons and Circles
* 弧长和扇形面积 (Hú cháng hé shànxíng miànjī) - Arc Length and Sector Area
* 点线圆位置关系 (Diǎn xiàn yuán wèizhì guānxì) - Positional Relationship between Points, Lines, and Circles
* 垂径定理 (Chuí jìng dìnglǐ) - Perpendicular Bisector Theorem
* **三角形 (Sānjiǎoxíng)** - Triangle
* 等边三角形 (Děngbiān sānjiǎoxíng) - Equilateral Triangle
* 等腰三角形 (Děngyāo sānjiǎoxíng) - Isosceles Triangle
* 勾股定理 (Gōugǔ dìnglǐ) - Pythagorean Theorem
* 全等三角形 (Quán děng sānjiǎoxíng) - Congruent Triangles
* **四边形 (Sìbiānxíng)** - Quadrilateral
* 平行四边形 (Píngxíng sìbiānxíng) - Parallelogram
* 梯形 (Tīxíng) - Trapezoid
* **立体图形 (Lìtǐ túxíng)** - Solid Figures
* 圆锥 (Yuánzhuī) - Cone
### Key Observations
* The chart provides a comprehensive overview of mathematical topics, categorized into five main areas.
* Each main category is further divided into sub-topics, providing a hierarchical structure.
* The visual representation allows for easy identification of related concepts.
### Interpretation
The circular chart serves as a visual aid for organizing and understanding the relationships between different mathematical concepts. It is designed to provide a high-level overview of the subject matter, making it easier to navigate and comprehend the connections between various topics. The chart could be used for studying, curriculum planning, or as a reference tool for students and educators. The hierarchical structure allows users to drill down from broad categories to specific sub-topics, facilitating a deeper understanding of the subject matter.
</details>
(d) Chinese Middle (Middle-ZH)
Figure 2: Diagram overview of four concept systems in ConceptMath. We have provided translated Chinese concept names in English (See Appendix A).
2 ConceptMath
ConceptMath is the first bilingual, concept-wise benchmark for measuring mathematical reasoning. In this section, we describe the design principles, the dataset collection process, the dataset statistics, and an efficient fine-tuning strategy to enhance the weaknesses identified by ConceptMath.
2.1 Design Principle
We created ConceptMath based on the following two high-level design principles:
Concept-wise Hierarchical System.
The primary goal of ConceptMath is to evaluate the mathematical reasoning capabilities of language models at different levels of granularity. Therefore, ConceptMath organizes math problems within a three-level hierarchy of mathematical concepts (Fig. 2). This approach provides concept-wise evaluation of the mathematical reasoning of language models and makes targeted, effective improvements possible.
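One natural way to represent such a three-level hierarchy is a nested mapping from domains to sub-domains to atomic concepts. The fragment below is an illustrative toy slice of one concept system, far smaller than the actual ~50-concept trees:

```python
# Toy fragment of one concept system: domain -> sub-domain -> atomic concepts.
hierarchy = {
    "Geometry": {
        "Two-Dim Shapes": ["Angles", "Circles", "Polygons"],
        "Three-Dim Shapes": ["Cones", "Cylinders"],
    },
}

def leaf_concepts(tree):
    """Collect the atomic (third-level) concepts of a three-level hierarchy."""
    return [leaf
            for subtree in tree.values()
            for leaves in subtree.values()
            for leaf in leaves]
```

Evaluation then attaches an accuracy to each leaf, while the two upper levels allow results to be rolled up to coarser granularities.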
Bilingualism.
Most current mathematical benchmarks focus solely on English, leaving multilingual mathematical reasoning underexplored. As an early effort in this direction, we evaluate mathematical reasoning in two languages: English and Chinese. Moreover, since cultures and educational systems vary across languages, common math concepts can differ considerably. Therefore, we carefully collect concepts in both languages instead of merely translating from one language to the other. For example, measurement conventions (e.g., money, size) differ between English and Chinese.
2.2 Data Collection
For data collection, we take a two-step approach to operationalize the aforementioned design principles: first, we recruit experts to delineate a hierarchy of math concepts based on different education systems; second, we collect problems for each concept from various sources or design them manually, followed by quality assessment and data cleaning.
Math Concept System Construction.
Since education systems provide a natural hierarchy of math concepts, we recruited four teachers from elementary and middle schools, specializing in either the English or the Chinese curriculum, to organize a hierarchy of math concepts for each education system. This leads to four concept systems: Elementary-EN, Middle-EN, Elementary-ZH, and Middle-ZH, with each system consisting of a three-level hierarchy of around 50 atomic math concepts (Fig. 2).
Math Problem Construction.
We then collected math word problems (both questions and answers) for each math concept from various sources, including educational websites, textbooks, and concept-specific search-engine queries. To maintain balance across concepts, approximately 20 problems were gathered per concept. Both GPT-4 (OpenAI, 2023) and human experts then verified and rectified the categorization and the solution of each problem. For some concepts, however, the problem count remained significantly below 20; manual efforts were undertaken to augment these categories, ensuring a consistent set of 20 problems per concept. Furthermore, to broaden the diversity of the dataset and reduce the risk of data contamination, all gathered problems were paraphrased using GPT-4. The collection and annotation were carried out by a team of six members, each holding a university degree in an engineering discipline, to maintain a high level of technical expertise.
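The balancing step above can be sketched as a shortfall check over the labeled problems, flagging concepts that still need manual augmentation; the function and toy data are our own illustration:

```python
from collections import Counter

def underfilled_concepts(labeled_problems, target=20):
    """Return concepts whose problem count is below the target, with the shortfall."""
    counts = Counter(concept for concept, _ in labeled_problems)
    return {c: target - n for c, n in counts.items() if n < target}

# Toy data: one concept meets the target, the other falls short by 15.
problems = ([("Cylinders", f"q{i}") for i in range(5)]
            + [("Decimals", f"q{i}") for i in range(20)])
gaps = underfilled_concepts(problems)  # {"Cylinders": 15}
```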
2.3 Dataset Statistics
Comparison to existing datasets. As shown in Table 1, ConceptMath differs from related datasets in several aspects: (1) ConceptMath is the first dataset to study fine-grained mathematical concepts and encompasses 4 systems, 214 math concepts, and 4011 math word problems. (2) Problems in ConceptMath are carefully annotated based on the mainstream education systems for English (EN) and Chinese (ZH).
Details on the hierarchical system. Apart from Fig. 2, we provide further details on the hierarchical system in Appendix A.
Length distribution. Fig. 3 shows the length distribution of our ConceptMath, where the number of tokens is reported (we use the “cl100k_base” tokenizer from https://github.com/openai/tiktoken). The minimum, average, and maximum question lengths are 4, 41, and 309 tokens, respectively, which indicates lexical richness.
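These statistics can be reproduced by tokenizing each question and aggregating the counts; the helper below is a minimal stdlib sketch that assumes the per-question token counts were already computed with tiktoken's `cl100k_base` encoder:

```python
import statistics

def length_stats(token_counts: list[int]) -> tuple[int, float, int]:
    """Return (min, mean, max) question length in tokens.

    `token_counts` would come from tiktoken, e.g.:
        enc = tiktoken.get_encoding("cl100k_base")
        token_counts = [len(enc.encode(q)) for q in questions]
    """
    return (min(token_counts),
            statistics.mean(token_counts),
            max(token_counts))
```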
| Benchmark | Language | Fine-grained | Size |
| --- | --- | --- | --- |
| GSM8K | EN | ✗ | 1319 |
| MATH | EN | ✗ | 5000 |
| TabMWP | EN | ✗ | 7686 |
| Dolphin18K | EN | ✗ | 1504 |
| Math23K | ZH | ✗ | 1000 |
| ASDiv | EN | ✗ | 2305 |
| SVAMP | EN | ✗ | 300 |
| SingleOp | EN | ✗ | 159 |
| MMLU-Math | EN | ✗ | 906 |
| ConceptMath | EN&ZH | ✓ | 4011 |
Table 1: A comparison of our ConceptMath with some notable mathematical datasets. Note that the size is the number of samples of the test split.
Figure 3: Length distributions of our ConceptMath.
2.4 Efficient Fine-Tuning
Based on our ConceptMath, we can identify weaknesses in the mathematical reasoning of LLMs through concept-wise evaluation. In this section, we explore a straightforward approach to enhancing mathematical abilities on specific concepts by first training a concept classifier and then curating a set of samples from a large open-source math dataset. Specifically, we first collect an extra 10 problems per concept and construct a classifier capable of identifying the concept class of a given question. The backbone of this classifier is a pretrained bilingual LLM, with a classification head operating on its last hidden output feature. We then fine-tune LLMs on this concept-specific dataset combined with an existing general math dataset, which helps avoid overfitting to a relatively small dataset. More details are provided in Appendix B.
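A minimal sketch of such a classifier is shown below, assuming a HuggingFace-style backbone that returns `last_hidden_state`; the paper does not specify the backbone, hidden size, or pooling details, so those are our assumptions:

```python
import torch
import torch.nn as nn

class ConceptClassifier(nn.Module):
    """A pretrained LM backbone with a linear head applied to the
    hidden state of the last non-padding token (a sketch; the paper's
    exact head and pooling choices are not specified)."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_concepts: int):
        super().__init__()
        self.backbone = backbone  # e.g. a HuggingFace AutoModel
        self.head = nn.Linear(hidden_size, num_concepts)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask)
        # Index of the last non-padding token in each sequence.
        last = attention_mask.sum(dim=1) - 1
        h = out.last_hidden_state[torch.arange(input_ids.size(0)), last]
        return self.head(h)  # logits over concept classes
```

Questions from the open-source math corpus whose predicted concept matches a weak concept would then be added to the fine-tuning mix.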
3 Experiments
In this section, we conduct extensive experiments to demonstrate the effectiveness of our ConceptMath.
3.1 Experimental Setup
Evaluated Models.
We assess the mathematical reasoning of existing advanced LLMs on ConceptMath, including 2 closed-source LLMs (i.e., GPT-3.5/GPT-4 (OpenAI, 2023)) and 17 open-source LLMs (i.e., WizardMath-13B Luo et al. (2023), MetaMath-13B Yu et al. (2023), MAmmoTH-13B Yue et al. (2023), Qwen-14B/72B Bai et al. (2023b), Baichuan2-13B Baichuan (2023), ChatGLM3-6B Du et al. (2022), InternLM2-7B/20B Team (2023a), InternLM2-Math-7B/20B Ying et al. (2024), LLaMA2-7B/13B/70B Touvron et al. (2023b), Yi-6B/34B Team (2023b) and DeepSeekMath-7B Shao et al. (2024)). Note that WizardMath-13B, MetaMath-13B, and MAmmoTH-13B are specialized math language models fine-tuned from LLaMA2, while InternLM2-Math and DeepSeekMath-7B are specialized math language models fine-tuned from their corresponding base models. More details of these evaluated models can be found in Appendix C.
| Model | Elementary-EN | Middle-EN | Elementary-ZH | Middle-ZH | Avg. | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZS | ZS-COT | FS | ZS | ZS-COT | FS | ZS | ZS-COT | FS | ZS | ZS-COT | FS | | |
| Yi-6B | 67.94 | 67.56 | 59.03 | 65.55 | 64.59 | 56.05 | 34.33 | 31.91 | 37.86 | 36.46 | 36.19 | 36.46 | 49.49 |
| ChatGLM3-6B | 60.69 | 63.10 | 53.18 | 51.25 | 60.17 | 51.34 | 46.23 | 43.63 | 40.74 | 44.77 | 43.32 | 40.43 | 49.90 |
| DeepSeekMath-7B | 66.92 | 77.35 | 73.92 | 56.53 | 69.87 | 66.31 | 60.47 | 62.33 | 64.19 | 56.50 | 56.95 | 56.86 | 64.02 |
| InternLM2-Math-7B | 71.12 | 72.01 | 69.59 | 63.44 | 62.96 | 63.05 | 57.30 | 58.23 | 58.60 | 53.79 | 53.16 | 53.88 | 61.43 |
| InternLM2-7B | 68.83 | 69.97 | 66.67 | 37.04 | 65.83 | 55.47 | 47.63 | 49.02 | 53.02 | 45.22 | 45.40 | 44.86 | 54.08 |
| LLaMA2-7B | 36.51 | 42.62 | 38.68 | 34.26 | 39.16 | 33.69 | 15.72 | 17.67 | 17.58 | 30.87 | 32.22 | 27.80 | 30.57 |
| MAmmoTH-13B | 61.32 | 52.42 | 56.49 | 53.93 | 45.20 | 48.08 | 22.33 | 33.30 | 23.81 | 27.98 | 43.05 | 29.15 | 41.42 |
| WizardMath-13B | 41.73 | 44.78 | 34.99 | 36.85 | 37.72 | 45.11 | 10.51 | 11.26 | 18.70 | 12.36 | 15.52 | 22.92 | 27.70 |
| MetaMath-13B | 54.45 | 51.78 | 47.96 | 44.24 | 43.47 | 47.50 | 11.44 | 17.30 | 27.53 | 21.21 | 26.08 | 29.60 | 35.21 |
| Baichuan2-13B | 68.83 | 68.58 | 54.07 | 67.66 | 69.67 | 40.40 | 57.02 | 58.23 | 22.05 | 55.05 | 55.32 | 26.90 | 53.65 |
| LLaMA2-13B | 44.02 | 49.75 | 47.07 | 44.72 | 46.45 | 43.09 | 20.19 | 24.19 | 22.14 | 33.30 | 35.38 | 26.17 | 36.37 |
| Qwen-14B | 46.95 | 65.78 | 72.65 | 38.48 | 59.60 | 67.85 | 28.09 | 65.12 | 64.47 | 22.92 | 58.30 | 62.09 | 54.36 |
| InternLM2-Math-20B | 74.05 | 75.32 | 73.41 | 64.11 | 71.21 | 70.83 | 62.98 | 61.95 | 61.77 | 55.14 | 55.78 | 56.86 | 65.28 |
| InternLM2-20B | 53.31 | 72.52 | 73.28 | 45.11 | 67.47 | 56.72 | 48.19 | 55.53 | 59.81 | 45.13 | 50.63 | 56.68 | 57.03 |
| Yi-34B | 74.68 | 73.66 | 56.36 | 72.26 | 74.66 | 65.83 | 50.05 | 51.16 | 38.79 | 45.40 | 43.95 | 40.97 | 57.31 |
| LLaMA2-70B | 56.11 | 60.31 | 30.53 | 58.06 | 60.94 | 31.67 | 28.65 | 26.70 | 24.37 | 37.64 | 34.30 | 28.43 | 39.81 |
| Qwen-72B | 77.10 | 75.06 | 77.23 | 74.66 | 69.87 | 73.99 | 71.16 | 68.65 | 61.86 | 71.30 | 65.43 | 62.45 | 70.73 |
| GPT-3.5 | 85.75 | 92.37 | 84.35 | 83.88 | 90.12 | 82.73 | 56.47 | 53.21 | 56.93 | 51.90 | 53.52 | 55.69 | 70.58 |
| GPT-4 | 86.77 | 90.20 | 89.57 | 84.26 | 89.83 | 88.68 | 67.91 | 72.28 | 72.00 | 63.81 | 64.26 | 66.61 | 78.02 |
| Avg. | 63.00 | 66.59 | 61.00 | 56.65 | 62.57 | 57.28 | 41.93 | 45.35 | 43.49 | 42.67 | 45.72 | 43.41 | 52.47 |
Table 2: Results of different models on our constructed ConceptMath benchmark dataset. Note that “ZS”, “ZS-COT”, and “FS” represent “zero-shot”, “zero-shot w/ chain-of-thought”, and “few-shot”, respectively. Models are grouped roughly according to their sizes.
Evaluation Settings.
We employ three distinct evaluation settings: zero-shot, zero-shot with chain-of-thought (CoT), and few-shot prompting. Zero-shot prompting assesses a model's intrinsic problem-solving ability without any prior examples. Zero-shot CoT prompting evaluates its ability to employ a logical chain of thought. In the few-shot setting, the model is provided with fixed 5-shot prompts for the different systems (see Appendix E), which consist of five newly created examples with concise ground-truth targets; this setting measures in-context learning ability. Besides, following MATH (Hendrycks et al., 2021b), all questions and answers in ConceptMath have been carefully curated, and each problem is evaluated by exact match. Greedy decoding with a temperature of 0 is used throughout.
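Exact-match scoring can be sketched as follows; the numeric answer-extraction heuristic below is our own illustrative assumption, not the paper's exact procedure:

```python
import re

def extract_answer(generation: str) -> str:
    """Pull the final number from a generation as the predicted answer
    (a simple heuristic; the paper's exact extraction rules may differ)."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return nums[-1] if nums else ""

def exact_match(generation: str, gold: str) -> bool:
    """Compare the extracted prediction against the gold answer."""
    return extract_answer(generation) == extract_answer(gold)
```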
3.2 Results
Overall Accuracy
We present the overall accuracies of different LLMs on our ConceptMath benchmark under various prompt settings in Table 2, and analyze the mathematical abilities of these LLMs in both English and Chinese in Fig. 4. Our analysis leads to the following key findings: (1) GPT-3.5/4 showcase the most advanced mathematical reasoning abilities among LLMs in both the English and Chinese systems, and the leading open-source Qwen-72B achieves performance comparable to GPT-3.5. (2) Most existing LLMs score substantially lower on the Chinese systems than on the English ones; for example, GPT-4's accuracies on Middle-ZH and Middle-EN are 63.81 and 84.26, respectively. (3) Several models (e.g., WizardMath-13B and MetaMath-13B) fine-tuned from LLaMA2-13B achieve slight improvements on the English systems but fall far below LLaMA2-13B on the Chinese systems, which indicates that domain-specific fine-tuning may degrade the generalization abilities of LLMs. (4) The mathematical models (i.e., InternLM2-Math-7B/20B and DeepSeekMath-7B) obtained by continued pretraining on large-scale math-related data ($\geq$100B tokens) show clear improvements over models of similar size, which indicates that large-scale pretraining is effective for improving mathematical reasoning abilities.
Figure 4: Mean accuracies for English, Chinese, and overall educational systems.
Average Concept-wise Accuracy.
In Fig. 5 and Fig. 6, to better analyze the effectiveness of our ConceptMath, we further provide the concept-wise accuracies, averaged over the evaluated models, for different mathematical concepts under zero-shot prompting on Middle-EN and Middle-ZH (see Appendix D for results on Elementary-EN and Elementary-ZH). We observe that the accuracies vary considerably across concepts for existing LLMs; for example, on Middle-ZH in Fig. 6, around 18% of concepts exhibit an accuracy below 30%. Thus, to improve the mathematical abilities of LLMs, the concepts with the largest room for improvement should be given the highest priority, which further shows the advantage of ConceptMath.
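Concept-wise accuracy is simply exact-match accuracy grouped by concept; a minimal stdlib sketch (the function name and input format are illustrative):

```python
from collections import defaultdict

def concept_accuracies(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (concept, is_correct) pairs into per-concept accuracies."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for concept, correct in results:
        totals[concept] += 1
        hits[concept] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}
```

Averaging these per-concept values over models yields the figures above, while averaging over all problems recovers the usual single-number benchmark accuracy.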
Figure 5: Mean concept accuracies on Middle-EN.
Figure 6: Mean concept accuracies on Middle-ZH.
Concept-wise Accuracy.
Fig. 7 and Fig. 8 show that most existing LLMs, whether open-source or closed-source, general-purpose or math-specialized, exhibit notable differences in their concept accuracies under zero-shot prompting. These disparities may stem from variations in training datasets, strategies, and model sizes, suggesting that, apart from common weaknesses, each model possesses its own unique areas of deficiency. For brevity, we show only a subset of models on Middle-EN and Middle-ZH; the concept accuracies for the Elementary-EN and Elementary-ZH systems and the results of all models can be found in Appendix D.
Figure 7: Concept accuracies on Middle-EN.
Figure 8: Concept accuracies on Middle-ZH.
| Model | Elementary-EN | Middle-EN | Elementary-ZH | Middle-ZH | Avg. $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| Yi-6B | 5.30 / 1.73 | 5.21 / 1.37 | 0.04 / 0.20 | 0.36 / 0.35 | 2.73 / 0.91 |
| ChatGLM3-6B | 7.42 / 0.22 | 7.55 / 0.23 | 0.11 / 0.02 | 0.35 / 0.05 | 3.86 / 0.13 |
| InternLM2-Math-7B | 7.42 / 0.22 | 7.55 / 0.23 | 0.11 / 0.02 | 0.35 / 0.05 | 3.86 / 0.13 |
| InternLM2-7B | 5.36 / 1.03 | 5.27 / 0.84 | 0.01 / 0.37 | 0.33 / 0.49 | 2.74 / 0.68 |
| MAmmoTH-13B | 7.67 / 0.47 | 7.97 / 0.46 | 0.00 / 0.03 | 0.35 / 0.03 | 4.00 / 0.25 |
| WizardMath-13B | 8.41 / 0.35 | 8.23 / 0.34 | 0.00 / 0.02 | 0.55 / 0.02 | 4.30 / 0.18 |
| MetaMath-13B | 7.67 / 0.47 | 7.97 / 0.46 | 0.00 / 0.03 | 0.35 / 0.03 | 4.00 / 0.25 |
| Baichuan2-13B | 7.20 / 1.43 | 6.58 / 1.18 | 0.05 / 0.54 | 0.41 / 0.65 | 3.56 / 0.95 |
| LLaMA2-13B | 6.80 / 0.73 | 6.36 / 0.64 | 0.01 / 0.15 | 0.56 / 0.16 | 3.43 / 0.42 |
| Qwen-14B | 11.04 / 1.58 | 9.73 / 1.08 | 1.43 / 1.27 | 0.70 / 0.93 | 5.73 / 1.22 |
| InternLM2-Math-20B | 5.58 / 1.30 | 5.51 / 0.99 | 0.03 / 0.47 | 0.34 / 0.47 | 2.86 / 0.81 |
| InternLM2-20B | 7.20 / 1.43 | 6.58 / 1.18 | 0.05 / 0.54 | 0.41 / 0.65 | 3.56 / 0.95 |
| GPT-3.5 | 9.48 / - | 9.21 / - | 0.00 / - | 0.31 / - | 4.75 / - |
| GPT-4 | 8.68 / - | 8.24 / - | 0.15 / - | 0.68 / - | 4.44 / - |
Table 3: Data contamination rate of LLMs. We provide two different contamination detection methods. The values in the table represent “Rouge / Prob”. Note that the second method based on output probability distributions can only be applied to the open-source models.
3.3 Analysis
Contamination.
To determine whether a text appears in the pretraining data of an LLM, we apply two different contamination detection methods (i.e., Rouge-based and Prob-based) to our ConceptMath in Table 3. Specifically, for the Rouge-based method, we feed the first 50% of the question as input and compute the Rouge-L score between the generated continuation and the ground-truth last 50% of the text, where a lower Rouge-L score means a lower contamination rate. For the Prob-based method, we follow Shi et al. (2023) and use the MIN-K% probability metric, which first obtains the probability of each token in the text, then selects the K% tokens with the lowest probabilities and computes their average log-likelihood. If this average log-likelihood is high, the text is likely in the pretraining data. We set $K$ to 10 in our setting. In Table 3, we observe that the contamination rates on our ConceptMath are very low, which means that ConceptMath can provide a reasonable evaluation for existing LLMs.
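The MIN-K% computation described above can be sketched as follows, given per-token log-probabilities from an open-source model (the helper name and interface are our own):

```python
def min_k_prob(token_logprobs: list[float], k_percent: float = 10.0) -> float:
    """MIN-K% metric (Shi et al., 2023): average log-likelihood of the
    k% lowest-probability tokens. A higher value suggests the text was
    likely seen during pretraining."""
    n = max(1, int(len(token_logprobs) * k_percent / 100))
    lowest = sorted(token_logprobs)[:n]  # the n least likely tokens
    return sum(lowest) / n
```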
| Concept | Concept Acc | Error Type | Example |
| --- | --- | --- | --- |
| Perimeter | 26.32 | Calculation Error | Q: Find the perimeter of a 16 by 18 feet rectangle. A: The perimeter is 16 + 18 + 16 + 18 = 72 feet. |
| | | Knowledge Error | Q: The perimeter of a rectangular sticker is 14 centimeters. The sticker is 2 centimeters wide. How tall is it? A: The height equals the perimeter subtracts the width. The height of the rectangle is $14-12=12$ centimeters. |
| Polygons | 11.11 | Knowledge Error | Q: A regular polygon has an interior angle that measures 120 degrees. How many sides does this polygon have? A: Let $n$ be the number of sides of the polygon. We have that $n \times 120 = 360$. Hence: $n=3$. |
| Cylinders | 11.76 | Calculation Error | Q: How much soup can a cylinder-shaped can store, if the radius of the can is 3 inches and the height is 5 inches? A: … The volume of the can is $V = \pi \times 3^{2} \times 5 = 30\pi$. |
Table 4: Error reasons of several unmastered concepts of LLaMA2-13B.
Unmastered Concepts.
We also highlight several unmastered concepts of LLaMA2-13B in Table 4, which shows that ConceptMath is effective in guiding further refinement of existing LLMs.
| Concepts | LLaMA2 | LLaMA2 (w/ MMQA) | LLaMA2 (w/ MMQA & CS) |
| --- | --- | --- | --- |
| Cones | 0.00 | 17.65 | 23.53 |
| Spheres | 5.88 | 29.41 | 35.29 |
| Polygons | 11.11 | 61.11 | 66.67 |
| Rational Number | 11.76 | 23.53 | 52.94 |
| Cylinders | 11.76 | 35.29 | 47.06 |
| Angles | 11.76 | 47.06 | 58.82 |
| Probability | 18.75 | 25.00 | 75.00 |
| Perimeter | 26.32 | 42.11 | 63.16 |
| Volume | 27.78 | 38.89 | 66.67 |
| Proportional | 27.78 | 33.33 | 44.44 |
| Avg Acc. (over 10 concepts) | 15.29 | 36.88 | 53.36 |
| Avg Acc. (over 33 concepts) | 51.94 | 58.14 | 60.67 |
| Overall Acc. | 44.02 | 53.94 | 59.29 |
Table 5: Results of fine-tuning models. “MMQA” and “CS” denote MetaMathQA and our constructed Concept-Specific training datasets, respectively. Introducing CS data specifically for the bottom 10 concepts significantly enhances these concepts’ performance, while slightly improving the performance across the remaining 33 concepts.
Evaluation Prompting.
Different from few-shot or CoT prompting, which boosts closed-source models, we find in Table 2 that zero-shot prompting is more effective for certain open-source LLMs. This disparity may arise either because these models are not powerful enough to possess mathematical CoT capabilities (Yu et al., 2023; Wei et al., 2022) or because they have already incorporated CoT data during training (Longpre et al., 2023). Consequently, to ensure a comprehensive analysis, we employ all three prompting methods for evaluation.
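For illustration, the three prompting styles can be sketched as simple template builders (a hypothetical sketch; the exact templates used in our evaluation may differ):

```python
def zero_shot(question: str) -> str:
    """Plain question-answer prompt with no demonstrations."""
    return f"Question: {question}\nAnswer:"


def few_shot(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Prepend solved (question, answer) demonstrations before the test question."""
    demos = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in exemplars)
    return demos + f"Question: {question}\nAnswer:"


def cot(question: str) -> str:
    """Zero-shot chain-of-thought: append a step-by-step reasoning trigger."""
    return f"Question: {question}\nAnswer: Let's think step by step."
```

Few-shot exemplars are fixed across test questions, so per-concept accuracy differences reflect the model rather than the prompt.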
Efficient Fine-tuning.
To show the effect of efficient fine-tuning, we take LLaMA2-13B as an example in Table 5. Specifically, we first select the 10 concepts with the lowest accuracies in Elementary-EN. Then, we crawl 495 samples (about 50 per concept) using the trained classifier as the Concept-Specific (CS) training data (see Appendix B for more details). Meanwhile, to avoid overfitting, we introduce the MetaMathQA (MMQA; Yu et al., 2023) data to preserve general mathematical abilities. After that, we fine-tune LLaMA2-13B using only MMQA (i.e., LLaMA2 (w/ MMQA)) or using both MMQA and CS data (i.e., LLaMA2 (w/ MMQA & CS)). In Table 5, we observe that LLaMA2 (w/ MMQA & CS) achieves significant improvements on the 10 lowest-accuracy concepts while preserving performance on the other 33 concepts, which shows the effectiveness of efficient fine-tuning and the advantages of our ConceptMath.
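The data-selection step above can be sketched as follows (hypothetical helper names; the actual pipeline crawls and labels samples with a trained concept classifier):

```python
def bottom_k_concepts(concept_acc: dict[str, float], k: int = 10) -> list[str]:
    """Return the k concepts with the lowest accuracies."""
    return [c for c, _ in sorted(concept_acc.items(), key=lambda kv: kv[1])[:k]]


def build_finetune_mix(mmqa: list, cs_by_concept: dict[str, list],
                       weak_concepts: list[str]) -> list:
    """Mix general MMQA data with concept-specific (CS) samples for weak concepts,
    so the model improves on its weaknesses without overfitting to them."""
    mix = list(mmqa)  # general math data preserves overall ability
    for concept in weak_concepts:
        mix.extend(cs_by_concept.get(concept, []))
    return mix
```

Fine-tuning on this mixture corresponds to the LLaMA2 (w/ MMQA & CS) setting in Table 5; dropping the CS samples gives the LLaMA2 (w/ MMQA) baseline.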
4 Related Work
Large Language Models for Mathematics.
Large Language Models (LLMs) such as GPT-3.5 and GPT-4 have exhibited promising capabilities on complex mathematical tasks. However, the proficiency of open-source alternatives like LLaMA (Touvron et al., 2023a) and LLaMA2 (Touvron et al., 2023b) remains notably inferior on mathematical benchmarks, particularly for non-English problems. In contrast, models like Baichuan2 (Baichuan, 2023) and Qwen (Bai et al., 2023b), pretrained on multilingual (i.e., Chinese and English) corpora, have achieved remarkable performance. Recently, many domain-specialized math language models have been proposed. For example, MetaMath (Yu et al., 2023) fine-tunes LLaMA2 models on the constructed MetaMathQA dataset, and MAmmoTH (Yue et al., 2023) synergizes Chain-of-Thought (CoT) and Program-of-Thought (PoT) rationales.
Mathematical Reasoning Benchmarks.
Recently, many mathematical datasets Roy and Roth (2015); Koncel-Kedziorski et al. (2015); Lu et al. (2023); Huang et al. (2016); Miao et al. (2020); Patel et al. (2021) have been proposed. For example, SingleOp (Roy et al., 2015) expands the scope to include more complex operations such as multiplication and division. Math23k (Wang et al., 2017) gathers 23,161 problems labeled with structured equations and corresponding answers. GSM8K (Cobbe et al., 2021) is a widely used dataset that requires a sequence of elementary calculations with basic arithmetic operations.
Fine-Grained Benchmarks.
Traditional benchmarks focus on assessing certain abilities of models on one task Guo et al. (2023b); Wang et al. (2023a); Liu et al. (2020); Guo et al. (2022); Chai et al. (2024); Liu et al. (2024); Guo et al. (2024, 2023c); Bai et al. (2023a); Liu et al. (2022); Guo et al. (2023a); Bai et al. (2024); Liu et al. (2021) (e.g., reading comprehension (Rajpurkar et al., 2018), machine translation (Bojar et al., 2014), and summarization (Narayan et al., 2018)). For example, the GLUE benchmark (Wang et al., 2019) combines a collection of tasks and has witnessed superhuman performance from pretrained models (Kenton and Toutanova, 2019; Radford et al., 2019). Hendrycks et al. (2021a) introduced MMLU, a benchmark with multiple-choice questions across 57 subjects, including STEM, the humanities, and the social sciences, for assessing performance and identifying weaknesses. Srivastava et al. (2022) proposed BIG-bench with over 200 tasks. To enhance the mathematical capabilities of LLMs, we introduce ConceptMath, a comprehensive mathematical reasoning dataset designed to assess model performance across more than 200 diverse mathematical concepts in both Chinese and English.
5 Conclusion
We introduce ConceptMath, a new bilingual concept-wise math reasoning benchmark that assesses models across a diverse set of concepts. First, ConceptMath covers more than 200 concepts across the elementary and middle school levels of mainstream English and Chinese curricula. Second, we extensively evaluate existing LLMs with three prompting methods, which can guide further improvements in their mathematical abilities. Third, we analyze contamination rates and error cases, and provide a simple and efficient fine-tuning strategy to address model weaknesses.
Limitations.
Human effort is required to carefully design the hierarchical systems of mathematical concepts. In the future, we have three plans: (1) extend the input modality to multiple modalities; (2) extend the education systems to high school and college levels; (3) extend the reasoning abilities to more STEM fields.
References
- Anthropic (2023) Anthropic. 2023. Model card and evaluations for claude models.
- Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv.
- Bai et al. (2023a) Jiaqi Bai, Hongcheng Guo, Jiaheng Liu, Jian Yang, Xinnian Liang, Zhao Yan, and Zhoujun Li. 2023a. Griprank: Bridging the gap between retrieval and generation via the generative knowledge improved passage ranking. CIKM.
- Bai et al. (2023b) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023b. Qwen technical report. arXiv preprint arXiv:2309.16609.
- Baichuan (2023) Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
- Bojar et al. (2014) Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
- Chai et al. (2024) Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, et al. 2024. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. arXiv preprint arXiv:2401.07037.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.
- Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
- Srivastava et al. (2022) Aarohi Srivastava et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
- Fritz et al. (2013) Annemarie Fritz, Antje Ehlert, and Lars Balzer. 2013. Development of mathematical concepts as basis for an elaborated mathematical understanding. South African Journal of Childhood Education, 3(1):38–67.
- Guo et al. (2022) Hongcheng Guo, Jiaheng Liu, Haoyang Huang, Jian Yang, Zhoujun Li, Dongdong Zhang, Zheng Cui, and Furu Wei. 2022. Lvp-m3: language-aware visual prompt for multilingual multimodal machine translation. EMNLP.
- Guo et al. (2023a) Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, and Zhoujun Li. 2023a. M2c: Towards automatic multimodal manga complement. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9876–9882.
- Guo et al. (2024) Hongcheng Guo, Jian Yang, Jiaheng Liu, Jiaqi Bai, Boyang Wang, Zhoujun Li, Tieqiao Zheng, Bo Zhang, Qi Tian, et al. 2024. Logformer: A pre-train and tuning pipeline for log anomaly detection. AAAI.
- Guo et al. (2023b) Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, et al. 2023b. Owl: A large language model for it operations. arXiv preprint arXiv:2309.09298.
- Guo et al. (2023c) Jinyang Guo, Jiaheng Liu, Zining Wang, Yuqing Ma, Ruihao Gong, Ke Xu, and Xianglong Liu. 2023c. Adaptive contrastive knowledge distillation for bert compression. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8941–8953.
- Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Huang et al. (2016) Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896.
- Kenton and Toutanova (2019) Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.
- Liu et al. (2024) Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, et al. 2024. E2-llm: Efficient and extreme length extension of large language models. arXiv preprint arXiv:2401.06951.
- Liu et al. (2021) Jiaheng Liu, Yudong Wu, Yichao Wu, Chuming Li, Xiaolin Hu, Ding Liang, and Mengyu Wang. 2021. Dam: discrepancy alignment metric for face recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3814–3823.
- Liu et al. (2022) Jiaheng Liu, Tan Yu, Hanyu Peng, Mingming Sun, and Ping Li. 2022. Cross-lingual cross-modal consolidation for effective multilingual video corpus moment retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1854–1862.
- Liu et al. (2020) Jiaheng Liu, Shunfeng Zhou, Yichao Wu, Ken Chen, Wanli Ouyang, and Dong Xu. 2020. Block proposal neural architecture search. IEEE Transactions on Image Processing, 30:15–25.
- Longpre et al. (2023) Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023. The flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
- Lu et al. (2023) Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In The Eleventh International Conference on Learning Representations.
- Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
- Megill and Wheeler (2019) Norman Megill and David A Wheeler. 2019. Metamath: a computer language for mathematical proofs. Lulu.com.
- Miao et al. (2020) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984.
- Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.
- OpenAI (2023) OpenAI. 2023. GPT-4 technical report. Preprint.
- Paster et al. (2023) Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. 2023. Openwebmath: An open dataset of high-quality mathematical web text.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752.
- Roy et al. (2015) Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.
- Shi et al. (2023) Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2023. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789.
- Simon (2011) Martin A Simon. 2011. Studying mathematics conceptual learning: Student learning through their mathematical activity. North American Chapter of the International Group for the Psychology of Mathematics Education.
- Team (2023a) InternLM Team. 2023a. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM-techreport.
- Team (2023b) Yi Team. 2023b. Yi: Building the next generation of open-source and bilingual llms. https://github.com/01-ai/Yi.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 conference on empirical methods in natural language processing, pages 845–854.
- Wang et al. (2023a) Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhu Chen, Jie Fu, and Junran Peng. 2023a. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv: 2310.00746.
- Wang et al. (2023b) Zengzhi Wang, Rui Xia, and Pengfei Liu. 2023b. Generative ai for math: Part i – mathpile: A billion-token-scale pretraining corpus for math. arXiv preprint arXiv:2312.17120.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Ying et al. (2024) Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin. 2024. Internlm-math: Open math large language models toward verifiable reasoning.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
- Yue et al. (2023) Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv: 2309.05653.
- Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
Appendix A Details on the ConceptMath
As shown in Table 7, Table 8, Table 17 and Table 9, we have provided the details on the three-level hierarchical system of our ConceptMath for better illustration.
<details>
<summary>x12.png Details</summary>

Bar chart of mean accuracy per concept for Elementary-EN, with concepts on the x-axis sorted in ascending order and mean accuracy (0–80) on the y-axis. Accuracies range from roughly 41% (Proportional, Cones, Cylinders) to roughly 83% (Light & heavy), showing that mean accuracy varies widely across concepts.
</details>
Figure 9: Mean concept accuracies of Elementary-EN.
<details>
<summary>x13.png Details</summary>

Bar chart of mean accuracy per concept for Elementary-ZH, with concepts (in Chinese) on the x-axis sorted in ascending order and mean accuracy (0–80) on the y-axis. Accuracies range from roughly 22% (periodic, concentration, and meeting problems) to above 60% (e.g., percentage problems, fractions and decimals), again showing large variation across concepts.
</details>
Figure 10: Mean concept accuracies of Elementary-ZH.
<details>
<summary>x14.png Details</summary>

Line chart comparing per-concept accuracies (0–100) of MetaMath-13B (blue), LLaMA2-70B (orange), and GPT-4 (green) across the Elementary-EN concepts. GPT-4 is consistently high (mostly above 80%), while the two open-source models fluctuate widely across concepts and drop to near 0% on some (e.g., MetaMath-13B on Probability and Quadrilaterals).
* Place value: ~93%
* Powers: ~93%
* Rational number: ~93%
* Spheres: ~93%
* Subtraction: ~93%
* Time: ~93%
* Triangles: ~93%
* Variable exprs: ~93%
* Volume of 3d shapes: ~93%
* Add: ~93%
* Compare: ~93%
* Count: ~93%
* Division: ~93%
* Equations: ~93%
* Length: ~93%
* Percents: ~93%
* Polygons: ~93%
* Probability: ~93%
* Proportional: ~93%
* Quadrilaterals: ~93%
* Ratio: ~93%
* Temperature: ~93%
* Volume: ~93%
### Key Observations
* GPT-4 consistently outperforms MetaMath-13B and LLaMA2-70B across almost all math problem types.
* MetaMath-13B shows significant weaknesses in "Quadrilaterals" problems.
* LLaMA2-70B has a low accuracy on "Decimals" problems.
* All models show variability in accuracy depending on the problem type.
### Interpretation
The chart demonstrates the relative strengths and weaknesses of different language models in solving various math problems. GPT-4's consistently high accuracy suggests it has a more robust understanding of mathematical concepts compared to MetaMath-13B and LLaMA2-70B. The specific areas where each model struggles (e.g., MetaMath-13B with "Quadrilaterals") could indicate areas for further model training and improvement. The variability in accuracy across problem types highlights the complexity of mathematical reasoning and the challenges in developing AI models that can generalize across different mathematical domains.
</details>
Figure 11: Concept accuracies on Elementary-EN.
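The concept-wise accuracies plotted in these figures are per-concept aggregates of problem-level correctness. A minimal sketch of that aggregation (a hypothetical helper with toy data, not the authors' released code):

```python
from collections import defaultdict

def concept_accuracies(records):
    """Aggregate per-problem results into concept-wise accuracies.

    `records` is an iterable of (concept, is_correct) pairs, one per
    evaluated problem; returns {concept: accuracy in [0, 100]}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for concept, ok in records:
        totals[concept] += 1
        correct[concept] += int(ok)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}

# Toy example: two concepts, three problems each.
records = [
    ("Fractions", True), ("Fractions", True), ("Fractions", False),
    ("Quadrilaterals", False), ("Quadrilaterals", False), ("Quadrilaterals", True),
]
acc = concept_accuracies(records)
weakest = min(acc, key=acc.get)  # concept with the lowest accuracy
```

Reporting a minimum (or the full per-concept vector) instead of a single mean is what exposes the failure modes these charts visualize.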
<details>
<summary>x15.png Details</summary>

### Visual Description
[Line chart: concept-wise accuracies (0-100) of MetaMath-13B, LLaMA2-70B, and GPT-4 across 29 Elementary-ZH concepts, from Triangle and Parallelogram to Probability problem and Percentage problem. GPT-4 leads by a wide margin; MetaMath-13B is often at or near 0%.]
</details>
Figure 12: Concept accuracies on Elementary-ZH.
<details>
<summary>x16.png Details</summary>

### Visual Description
[Line chart: concept-wise accuracies (0-100) of Yi-6B, ChatGLM3-6B, LLaMA2-7B, and DeepSeekMath-7B across the Elementary-EN concepts. LLaMA2-7B is consistently the weakest; the other three are competitive, with strengths varying by concept.]
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
[Line chart: concept-wise accuracies of InternLM2-Math-7B, InternLM2-7B, MAmmoTH-13B, and WizardMath-13B across the Elementary-EN concepts. InternLM2-Math-7B is the most consistently accurate; WizardMath-13B fluctuates the most, dropping to roughly 10% on Light & heavy and Quadrilaterals.]
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
[Line chart: concept-wise accuracies of Baichuan2-13B, LLaMA2-13B, Qwen-14B, and InternLM2-Math-20B across the Elementary-EN concepts. InternLM2-Math-20B is generally highest; LLaMA2-13B is generally lowest.]
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description
[Line chart: concept-wise accuracies of InternLM2-20B, Yi-34B, Qwen-72B, and GPT-3.5 across the Elementary-EN concepts. GPT-3.5 is generally highest and InternLM2-20B generally lowest; all models vary substantially across concepts.]
</details>
Figure 13: Concept accuracies on Elementary-EN of more models.
<details>
<summary>x20.png Details</summary>

### Visual Description
[Line chart: concept-wise accuracies (0-100) of Yi-6B, ChatGLM3-6B, LLaMA2-7B, and DeepSeekMath-7B across concepts from Add & subtract to Polyhedra. DeepSeekMath-7B is generally highest; LLaMA2-7B is consistently lowest, often below 40%.]
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
[Line chart: concept-wise accuracies (0-80) of InternLM2-Math-7B, InternLM2-7B, MAmmoTH-13B, and WizardMath-13B across concepts from Add & subtract to Variable exprs. InternLM2-Math-7B generally achieves the highest accuracy; all four models vary widely across concepts.]
* **MAmmoTH-13B (Green):** This model shows relatively high accuracy, often comparable to InternLM2-Math-7B. The trend is variable.
* **WizardMath-13B (Red):** This model generally shows lower accuracy compared to InternLM2-Math-7B and MAmmoTH-13B, but sometimes outperforms InternLM2-7B. The trend is highly variable.
**Specific Data Points (Approximate):**
It's difficult to provide precise data points due to the density of the chart, but here are some approximate values for a few problem types:
* **Add & subtract:**
* InternLM2-Math-7B (Blue): ~65
* InternLM2-7B (Orange): ~20
* MAmmoTH-13B (Green): ~65
* WizardMath-13B (Red): ~45
* **Counting Principles:**
* InternLM2-Math-7B (Blue): ~60
* InternLM2-7B (Orange): ~40
* MAmmoTH-13B (Green): ~80
* WizardMath-13B (Red): ~55
* **Prime Factorization:**
* InternLM2-Math-7B (Blue): ~80
* InternLM2-7B (Orange): ~55
* MAmmoTH-13B (Green): ~50
* WizardMath-13B (Red): ~55
* **Axes:**
* InternLM2-Math-7B (Blue): ~85
* InternLM2-7B (Orange): ~30
* MAmmoTH-13B (Green): ~30
* WizardMath-13B (Red): ~50
### Key Observations
* InternLM2-Math-7B (Blue) generally performs the best, achieving the highest accuracy on many problem types.
* InternLM2-7B (Orange) generally performs the worst.
* The performance of all models varies significantly depending on the problem type.
* There are some problem types where all models perform poorly (e.g., near "Rational & irrational numbers").
* There are some problem types where all models perform well (e.g., near "Prime Factorization").
### Interpretation
The chart provides a comparative analysis of the accuracy of four language models on a diverse set of math problems. The data suggests that the InternLM2-Math-7B model is generally the most accurate, while InternLM2-7B is the least accurate. However, the performance of each model is highly dependent on the specific type of math problem. This indicates that different models may have different strengths and weaknesses in their mathematical reasoning abilities. The variability in performance across problem types highlights the complexity of mathematical reasoning and the challenges in developing models that can consistently solve a wide range of math problems. The chart could be used to identify areas where each model needs improvement and to guide the development of more robust and versatile mathematical reasoning systems.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy on Math Problems
### Overview
The image is a line chart comparing the accuracy of four different language models (Baichuan2-13B, LLaMA2-13B, Qwen-14B, and InternLM2-Math-20B) on a variety of mathematical problem types. The x-axis represents the problem types, and the y-axis represents the accuracy score.
### Components/Axes
* **Title:** (None visible)
* **X-axis:** Mathematical problem types (listed below in "Content Details")
* **Y-axis:** Accuracy, ranging from 0 to 100, with gridlines at intervals of 20.
* **Legend:** Located at the top of the chart.
* Blue: Baichuan2-13B
* Orange: LLaMA2-13B
* Green: Qwen-14B
* Red: InternLM2-Math-20B
### Content Details
**X-Axis Categories (Problem Types):**
1. Add & subtract
2. Arithmetic sequences
3. Congruence & similarity
4. Consumer math
5. Counting principle
6. Distance between two points
7. Domain & range of functions
8. Equivalent expressions
9. Estimate metric measurements
10. Exponents & scientific notation
11. Financial literacy
12. Fractions & decimals
13. Geometric sequences
14. Interpret functions
15. Linear equations
16. Linear functions
17. Make predictions
18. Multiply
19. Nonlinear functions
20. One-variable statistics
21. Percents
22. Perimeter & area
23. Prime factorization
24. Prime or composite
25. Probability of compound events
26. Probability of simple & opposite events
27. Proportional relationships
28. Quadrants
29. Rational & irrational numbers
30. Scale drawings
31. Square roots & cube roots
32. Surface area & volume
33. Systems of equations
34. Triangle
35. Two-variable statistics
36. Absolute value
37. Axes
38. Center & variability
39. Factors
40. Independent & dependent events
41. Mean, median, mode & range
42. Opposite integers
43. Outlier
44. Polygons
45. Polyhedra
46. Radical exprs
47. Square
48. Transformations
49. Trapezoids
50. Variable exprs
**Data Series Trends and Approximate Values:**
* **Baichuan2-13B (Blue):** The line fluctuates, generally staying between 60 and 80 accuracy, with a peak near 100 around problem 47 (Square).
* Problem 1 (Add & subtract): ~70
* Problem 10 (Exponents & scientific notation): ~72
* Problem 20 (One-variable statistics): ~75
* Problem 30 (Scale drawings): ~80
* Problem 40 (Independent & dependent events): ~60
* Problem 47 (Square): ~98
* Problem 50 (Variable exprs): ~78
* **LLaMA2-13B (Orange):** The line fluctuates significantly, with lows around 20 and highs near 90.
* Problem 1 (Add & subtract): ~60
* Problem 10 (Exponents & scientific notation): ~65
* Problem 20 (One-variable statistics): ~70
* Problem 30 (Scale drawings): ~50
* Problem 40 (Independent & dependent events): ~40
* Problem 47 (Square): ~50
* Problem 50 (Variable exprs): ~68
* **Qwen-14B (Green):** The line generally stays between 20 and 60 accuracy, with some peaks and valleys.
* Problem 1 (Add & subtract): ~58
* Problem 10 (Exponents & scientific notation): ~42
* Problem 20 (One-variable statistics): ~48
* Problem 30 (Scale drawings): ~40
* Problem 40 (Independent & dependent events): ~35
* Problem 47 (Square): ~70
* Problem 50 (Variable exprs): ~40
* **InternLM2-Math-20B (Red):** The line fluctuates significantly, with lows around 20 and highs near 100.
* Problem 1 (Add & subtract): ~70
* Problem 10 (Exponents & scientific notation): ~75
* Problem 20 (One-variable statistics): ~70
* Problem 30 (Scale drawings): ~80
* Problem 40 (Independent & dependent events): ~50
* Problem 47 (Square): ~80
* Problem 50 (Variable exprs): ~68
### Key Observations
* InternLM2-Math-20B and Baichuan2-13B generally perform better than LLaMA2-13B and Qwen-14B across most problem types.
* All models show significant variation in accuracy depending on the problem type.
* There are specific problem types where certain models excel or struggle. For example, Qwen-14B has particularly low accuracy on some problem types.
* Baichuan2-13B has a peak accuracy on "Square" problems.
### Interpretation
The chart illustrates the varying strengths and weaknesses of different language models when applied to mathematical problem-solving. The performance differences highlight the impact of model architecture, training data, and fine-tuning strategies on mathematical reasoning capabilities. The significant fluctuations in accuracy across different problem types suggest that each model has specific areas of expertise and difficulty. The data suggests that no single model consistently outperforms the others across all mathematical domains, indicating the need for specialized models or ensemble approaches for comprehensive mathematical problem-solving. The "Square" problem type being a high point for Baichuan2-13B could indicate a specific emphasis or strength in that area during training.
</details>
<details>
<summary>x23.png Details</summary>

### Visual Description
## Chart: Model Accuracy Across Mathematical Tasks
### Overview
The image is a line chart comparing the accuracy of four different language models (InternLM2-20B, Yi-34B, Qwen-72B, and GPT-3.5) across a range of mathematical tasks. The x-axis represents different mathematical concepts, and the y-axis represents the accuracy score, ranging from 0 to 100.
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:** Represents different mathematical tasks/concepts. The labels are rotated for readability. The labels are:
* Add & subtract
* Arithmetic sequences
* Congruence & similarity
* Counting principle
* Decimals
* Distance between two points
* Divide
* Domain & range of functions
* Estimate metric measurements
* Equivalent exprs
* Exponents & scientific notation
* Financial literacy
* Fractions & decimals
* Geometric sequences
* Interpret functions
* Linear equations
* Linear functions
* Lines & angles
* Make predictions
* Multiply
* Nonlinear functions
* One-variable statistics
* Percents
* Perimeter & area
* Prime factorization
* Prime or composite
* Probability of compound events
* Probability of one event
* Probability of simple & opposite events
* Proportional relationships
* Quadrants
* Radical exprs
* Rational & irrational numbers
* Scale drawings
* Square roots & cube roots
* Square
* Surface area & volume
* Systems of equations
* Transformations
* Trapezoids
* Triangle
* Two-variable statistics
* Variable exprs
* Absolute value
* Axes
* Center & variability
* Circle
* Factors
* Independent & dependent events
* Inequalities
* Mean, median, mode, & range
* Opposite integers
* Outlier
* Polygons
* Polyhedra
* **Y-axis:** Represents "Accuracy" with a scale from 0 to 100, incrementing by 20. Horizontal gridlines are present at each increment.
* **Legend:** Located at the top of the chart.
* Blue: InternLM2-20B
* Orange: Yi-34B
* Green: Qwen-72B
* Red: GPT-3.5
### Detailed Analysis
* **InternLM2-20B (Blue):** Generally shows the lowest accuracy across most tasks. The accuracy fluctuates significantly, with several points below 40 and some spikes around 60-70. It performs particularly poorly on "Outlier", "Radical exprs", "Trapezoids", and "Variable exprs".
* **Yi-34B (Orange):** Shows a more consistent performance than InternLM2-20B, generally staying between 60 and 90. It has some dips but fewer extreme lows.
* **Qwen-72B (Green):** Performs comparably to Yi-34B, often overlapping. It shows strong performance on "Transformations" and "Variable exprs", reaching near 100 accuracy.
* **GPT-3.5 (Red):** Generally exhibits the highest accuracy across most tasks, frequently scoring above 80 and often reaching 100. It shows consistently strong performance across the board.
**Specific Data Points (Approximate):**
* **Add & subtract:**
* InternLM2-20B: ~65
* Yi-34B: ~75
* Qwen-72B: ~85
* GPT-3.5: ~90
* **Outlier:**
* InternLM2-20B: ~20
* Yi-34B: ~70
* Qwen-72B: ~70
* GPT-3.5: ~90
* **Transformations:**
* InternLM2-20B: ~30
* Yi-34B: ~75
* Qwen-72B: ~95
* GPT-3.5: ~100
* **Variable exprs:**
* InternLM2-20B: ~35
* Yi-34B: ~80
* Qwen-72B: ~95
* GPT-3.5: ~100
### Key Observations
* GPT-3.5 consistently outperforms the other models across nearly all mathematical tasks.
* InternLM2-20B generally has the lowest accuracy, with significant variability.
* Yi-34B and Qwen-72B show similar performance, often overlapping.
* There is significant variation in accuracy across different mathematical tasks for all models, indicating varying levels of difficulty or model proficiency in specific areas.
* The models show a wide range of performance on "Transformations", "Outlier", "Radical exprs", "Trapezoids", and "Variable exprs", suggesting these tasks are particularly challenging or revealing of model capabilities.
### Interpretation
The chart provides a comparative analysis of the mathematical reasoning abilities of four different language models. The data suggests that GPT-3.5 is the most proficient in mathematical tasks among the models tested. InternLM2-20B appears to struggle relative to the other models. The performance variations across different mathematical concepts highlight the strengths and weaknesses of each model in specific areas of mathematical reasoning. The significant differences in accuracy for tasks like "Transformations" and "Outlier" suggest that these tasks could be used as benchmarks for evaluating the mathematical capabilities of language models. The data could be used to inform the development and refinement of these models, focusing on improving performance in areas where they currently struggle.
</details>
Figure 14: Concept accuracies on Middle-EN of more models.
<details>
<summary>x24.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy Comparison
### Overview
The image is a line chart comparing the accuracy of four different language models (Yi-6B, ChatGLM3-6B, LLaMA2-7B, and DeepSeekMath-7B) across a range of math problem types. The x-axis represents different problem types (in Chinese), and the y-axis represents accuracy, ranging from 0 to 100.
### Components/Axes
* **Title:** The chart has no explicit title.
* **X-axis:** Represents different math problem types, labeled in Chinese. The labels are very dense and difficult to read individually.
* **Y-axis:** Represents "Accuracy", ranging from 0 to 100 in increments of 20. Horizontal gridlines are present at each increment.
* **Legend:** Located at the top of the chart.
* Blue line: Yi-6B
* Orange line: ChatGLM3-6B
* Green line: LLaMA2-7B
* Red line: DeepSeekMath-7B
### Detailed Analysis
The chart displays the accuracy of each model across various problem types. Due to the density of the x-axis labels and the overlapping lines, precise data extraction is challenging. However, general trends and relative performance can be observed.
Here's a breakdown of the trends for each model:
* **Yi-6B (Blue):** Generally fluctuates between 20 and 60 accuracy, with some dips below 20 and peaks above 60.
* **ChatGLM3-6B (Orange):** Shows more variability, ranging from below 20 to above 80. Appears to perform better than Yi-6B on some problem types but worse on others.
* **LLaMA2-7B (Green):** Consistently has the lowest accuracy, often below 20, and rarely exceeding 40.
* **DeepSeekMath-7B (Red):** Generally exhibits the highest accuracy, frequently above 60, and reaching 100 on at least one problem type. It shows significant fluctuations, indicating varying performance across different problem types.
The x-axis labels are in Chinese; a transcription with approximate English translations follows.
X-Axis Labels (Left to Right):
1. 三角形 (Triangle)
2. 平行四边形 (Parallelogram)
3. 圆形 (Circle)
4. 平面图形综合 (Plane figure synthesis)
5. 长方形 (Rectangle)
6. 立方体 (Cube)
7. 立体图形 (Solid figures)
8. 和差问题 (Sum and difference problem)
9. 差倍问题 (Difference multiple problem)
10. 几何问题 (Geometric problem)
11. 年龄问题 (Age problem)
12. 归一问题 (Reduction to one problem)
13. 归总问题 (Reduction-to-total problem)
14. 工程问题 (Engineering problem)
15. 植树问题 (Tree planting problem)
16. 盈亏问题 (Profit and loss problem)
17. 鸡兔同笼 (Chicken and rabbit in the same cage)
18. 浓度问题 (Concentration problem)
19. 行程问题 (Travel problem)
20. 相遇问题 (Meeting problem)
21. 时间问题 (Time problem)
22. 人民币问题 (RMB problem)
23. 浓度问题 (Concentration problem)
24. 面积问题 (Area problem)
25. 统计指标 (Statistical indicators)
26. 分数与小数 (Fractions and decimals)
27. 分数应用 (Fraction application)
28. 公倍数问题 (Common multiple problem)
29. 因数与倍数 (Factors and multiples)
30. 乘法问题 (Multiplication problem)
31. 乘除混合 (Mixed multiplication and division)
32. 定义新运算 (Define new operation)
33. 简便计算 (Simple calculation)
34. 除法问题 (Division problem)
35. 概率问题 (Probability problem)
36. 百分率问题 (Percentage problem)
37. 因数分解 (Factorization)
### Key Observations
* DeepSeekMath-7B generally outperforms the other models.
* LLaMA2-7B generally underperforms the other models.
* The performance of all models varies significantly depending on the problem type.
* The x-axis labels are in Chinese, representing different math problem categories.
### Interpretation
The chart provides a comparative analysis of the accuracy of four language models on a diverse set of math problems. The data suggests that DeepSeekMath-7B is the most accurate model overall, while LLaMA2-7B is the least accurate. The significant performance variation across problem types highlights the strengths and weaknesses of each model in specific mathematical domains. This information is valuable for understanding the capabilities of each model and selecting the most appropriate model for a given task. The performance of ChatGLM3-6B and Yi-6B is similar, with ChatGLM3-6B showing slightly higher variance.
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy on Math Problems
### Overview
The image is a line chart comparing the accuracy of four different language models (InternLM2-Math-7B, InternLM2-7B, MAmmoTH-13B, and WizardMath-13B) on a variety of math problem types. The x-axis represents different types of math problems (in Chinese), and the y-axis represents the accuracy score, ranging from 0 to 80.
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:** Represents different types of math problems, labeled in Chinese. The labels are rotated for readability.
* The labels are: 三角形 (Triangle), 平行四边形 (Parallelogram), 平面图形综合 (Plane figure synthesis), 立体 (Solid), 长方形 (Rectangle), 正方形 (Square), 圆柱 (Cylinder), 圆锥 (Cone), 立体图形综合 (Solid figure synthesis), 和差倍问题 (Sum difference multiple problem), 基础问题 (Basic problem), 平均数问题 (Average problem), 工作问题 (Work problem), 年龄问题 (Age problem), 植树问题 (Tree planting problem), 行程问题 (Travel problem), 鸡兔同笼问题 (Chicken and rabbit in the same cage problem), 浓度问题 (Concentration problem), 面积问题 (Area problem), 统计指标 (Statistical indicators), 分数与小数 (Fractions and decimals), 分数应用题 (Fraction application problems), 公差与公倍数 (Common difference and common multiple), 因数与倍数 (Factors and multiples), 定义新运算 (Define new operation), 比例问题 (Proportion problem), 整数问题 (Integer problem), 除法问题 (Division problem), 概率问题 (Probability problem), 百分数问题 (Percentage problem).
* **Y-axis:** Represents Accuracy, ranging from 0 to 80, with increments of 20.
* **Legend:** Located at the top of the chart.
* Blue: InternLM2-Math-7B
* Orange: InternLM2-7B
* Green: MAmmoTH-13B
* Red: WizardMath-13B
### Detailed Analysis
Here's a breakdown of each model's performance across the different problem types:
* **InternLM2-Math-7B (Blue):** This model generally performs the best, with accuracy scores frequently above 60. It shows peaks in accuracy for problems like "Triangle" (~64), "Plane figure synthesis" (~84), "Rectangle" (~76), "Solid figure synthesis" (~64), "Average problem" (~76), "Travel problem" (~76), "Area problem" (~76), "Fractions and decimals" (~76), "Percentage problem" (~84). It dips to ~40 for "Cylinder" and "Common difference and common multiple".
* **InternLM2-7B (Orange):** This model's performance is generally lower than InternLM2-Math-7B, but still competitive. It shows peaks in accuracy for problems like "Triangle" (~40), "Plane figure synthesis" (~72), "Rectangle" (~60), "Solid figure synthesis" (~40), "Average problem" (~60), "Travel problem" (~92), "Area problem" (~60), "Fractions and decimals" (~60), "Percentage problem" (~80). It dips to ~32 for "Cylinder" and "Common difference and common multiple".
* **MAmmoTH-13B (Green):** This model's performance is generally lower than the other two, with accuracy scores mostly below 40. It shows peaks in accuracy for problems like "Triangle" (~24), "Plane figure synthesis" (~20), "Rectangle" (~24), "Solid figure synthesis" (~24), "Average problem" (~24), "Travel problem" (~44), "Area problem" (~32), "Fractions and decimals" (~32), "Percentage problem" (~28). It dips to ~0 for "Cylinder" and "Common difference and common multiple".
* **WizardMath-13B (Red):** This model consistently has the lowest accuracy scores, often near or at 0. It shows peaks in accuracy for problems like "Triangle" (~12), "Plane figure synthesis" (~16), "Rectangle" (~12), "Solid figure synthesis" (~8), "Average problem" (~12), "Travel problem" (~20), "Area problem" (~16), "Fractions and decimals" (~24), "Percentage problem" (~32). It dips to ~0 for "Cylinder" and "Common difference and common multiple".
### Key Observations
* InternLM2-Math-7B generally outperforms the other models.
* WizardMath-13B consistently shows the lowest accuracy.
* All models struggle with certain problem types, particularly those related to "Cylinder" and "Common difference and common multiple".
* There is significant variance in performance across different problem types for all models.
### Interpretation
The chart suggests that the InternLM2-Math-7B model is the most effective at solving a variety of math problems compared to the other models tested. The significant performance differences across problem types indicate that the models may have varying strengths and weaknesses depending on the specific mathematical concepts involved. The consistently low performance of WizardMath-13B suggests it may require further training or optimization for these types of problems. The common difficulty across all models with "Cylinder" and "Common difference and common multiple" problems could indicate a need for improved training data or model architecture to handle these specific mathematical concepts.
</details>
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy Comparison
### Overview
The image is a line chart comparing the accuracy of four different language models (Baichuan2-13B, LLaMA2-13B, Qwen-14B, and InternLM2-Math-20B) across a series of math-related problem types. The y-axis represents accuracy, ranging from 0 to 100. The x-axis represents different problem types, labeled in Chinese.
### Components/Axes
* **Title:** (None visible)
* **X-axis:** Problem Types (labeled in Chinese)
* **Y-axis:** Accuracy (ranging from 0 to 100, with gridlines at intervals of 20)
* **Legend:** Located at the top of the chart.
* Blue: Baichuan2-13B
* Orange: LLaMA2-13B
* Green: Qwen-14B
* Red: InternLM2-Math-20B
### Detailed Analysis
The x-axis labels are in Chinese. Here are the labels and their approximate English translations:
1. 三角形 (sān jiǎo xíng): Triangle
2. 平行四边形 (píng xíng sì biān xíng): Parallelogram
3. 平面图形综合 (píng miàn tú xíng zōng hé): Plane figure synthesis
4. 立体图形 (lì tǐ tú xíng): Solid figure
5. 长方形 (cháng fāng xíng): Rectangle
6. 圆形 (yuán xíng): Circle
7. 和差问题 (hé chā wèn tí): Sum and difference problem
8. 基础问题 (jī chǔ wèn tí): Basic problem
9. 平均问题 (píng jūn wèn tí): Average problem
10. 年龄问题 (nián líng wèn tí): Age problem
11. 归一问题 (guī yī wèn tí): Reduction-to-one problem
12. 盈亏问题 (yíng kuī wèn tí): Profit and loss problem
13. 鸡兔同笼 (jī tù tóng lóng): Chicken and rabbit in the same cage (a classic math problem)
14. 对称问题 (duì chèn wèn tí): Symmetry problem
15. 植树问题 (zhí shù wèn tí): Tree planting problem
16. 折扣问题 (zhé kòu wèn tí): Discount problem
17. 税收问题 (shuì shōu wèn tí): Tax problem
18. 工程问题 (gōng chéng wèn tí): Engineering problem
19. 浓度问题 (nóng dù wèn tí): Concentration problem
20. 比例问题 (bǐ lì wèn tí): Proportion problem
21. 利率问题 (lì lǜ wèn tí): Interest rate problem
22. 储蓄问题 (chǔ xù wèn tí): Savings problem
23. 面积问题 (miàn jī wèn tí): Area problem
24. 体积问题 (tǐ jī wèn tí): Volume problem
25. 统计指标 (tǒng jì zhǐ biāo): Statistical indicators
26. 分数/百分数应用 (fēn shù/bǎi fēn shù yìng yòng): Fraction/Percentage application
27. 公倍数/公约数 (gōng bèi shù/gōng yuē shù): Common multiple/Common divisor
28. 因数与倍数 (yīn shù yǔ bèi shù): Factor and multiple
29. 差倍问题 (chā bèi wèn tí): Difference multiple problem
30. 和倍问题 (hé bèi wèn tí): Sum multiple problem
31. 还原问题 (huán yuán wèn tí): Reduction problem
32. 定义新运算 (dìng yì xīn yùn suàn): Define new operation
33. 逻辑推理 (luó jí tuī lǐ): Logical reasoning
34. 包含与排除 (bāo hán yǔ pái chú): Inclusion and exclusion
35. 抽屉原理 (chōu tì yuán lǐ): Pigeonhole principle
36. 日历问题 (rì lì wèn tí): Calendar problem
37. 简单方程 (jiǎn dān fāng chéng): Simple equation
38. 百分率问题 (bǎi fēn lǜ wèn tí): Percentage problem
**Data Series Analysis:**
* **Baichuan2-13B (Blue):** The accuracy fluctuates across problem types, generally ranging between 40 and 80. There are noticeable dips and peaks, indicating varying performance depending on the problem type.
* Triangle: ~75
* Plane figure synthesis: ~45
* Rectangle: ~70
* Statistical indicators: ~55
* Simple equation: ~70
* Percentage problem: ~80
* **LLaMA2-13B (Orange):** This model generally shows lower accuracy compared to the others, often below 40. Its performance is particularly poor on several problem types, with accuracy close to 0.
* Triangle: ~30
* Plane figure synthesis: ~10
* Rectangle: ~0
* Statistical indicators: ~20
* Simple equation: ~25
* Percentage problem: ~45
* **Qwen-14B (Green):** The accuracy of this model varies significantly, with some problem types showing high accuracy (close to 100) and others showing very low accuracy (close to 0).
* Triangle: ~10
* Plane figure synthesis: ~25
* Rectangle: ~15
* Statistical indicators: ~40
* Simple equation: ~45
* Percentage problem: ~70
* **InternLM2-Math-20B (Red):** This model generally exhibits the highest accuracy among the four, often exceeding 60 and reaching close to 100 on some problem types.
* Triangle: ~80
* Plane figure synthesis: ~65
* Rectangle: ~50
* Statistical indicators: ~60
* Simple equation: ~90
* Percentage problem: ~100
### Key Observations
* InternLM2-Math-20B (Red) consistently outperforms the other models across most problem types.
* LLaMA2-13B (Orange) generally has the lowest accuracy.
* Qwen-14B (Green) shows high variance in performance, indicating sensitivity to specific problem types.
* All models exhibit fluctuations in accuracy depending on the problem type, suggesting that certain types of math problems are more challenging for these language models.
### Interpretation
The chart provides a comparative analysis of the accuracy of four language models on a range of math problems. The data suggests that InternLM2-Math-20B is the most proficient at solving these types of problems, while LLaMA2-13B struggles. The varying performance of Qwen-14B highlights the importance of model architecture and training data in determining a model's ability to generalize across different problem types. The fluctuations in accuracy for all models indicate that certain mathematical concepts or problem-solving strategies are more difficult for these models to learn and apply. Further investigation into the specific characteristics of these challenging problem types could inform future model development and training strategies.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Chart: Model Accuracy Comparison
### Overview
The image is a line chart comparing the accuracy of four different language models (InternLM2-20B, Yi-34B, Qwen-72B, and GPT-3.5) across a range of tasks or questions. The x-axis represents different tasks (labeled in Chinese), and the y-axis represents accuracy, ranging from 0 to 100.
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:** Represents different tasks or questions. The labels are in Chinese.
* **Y-axis:** Represents "Accuracy," ranging from 0 to 100 in increments of 20. Horizontal gridlines are present at each increment.
* **Legend:** Located at the top of the chart.
* Blue: InternLM2-20B
* Orange: Yi-34B
* Green: Qwen-72B
* Red: GPT-3.5
### Detailed Analysis
The x-axis labels are in Chinese. Here are the labels, along with their approximate English translations:
1. 三角形 (sān jiǎo xíng): Triangle
2. 平面图形 (píng miàn tú xíng): Plane figure
3. 平行四边形 (píng xíng sì biān xíng): Parallelogram
4. 梯形 (tī xíng): Trapezoid
5. 平面图形综合 (píng miàn tú xíng zōng hé): Plane figure synthesis
6. 长方形 (cháng fāng xíng): Rectangle
7. 正方形 (zhèng fāng xíng): Square
8. 立方体 (lì fāng tǐ): Cube
9. 圆柱 (yuán zhù): Cylinder
10. 立体图形综合 (lì tǐ tú xíng zōng hé): Solid figure synthesis
11. 和差倍问题 (hé chā bèi wèn tí): Sum difference multiple problem
12. 基础应用题 (jī chǔ yìng yòng tí): Basic application problem
13. 差倍问题 (chā bèi wèn tí): Difference multiple problem
14. 归一问题 (guī yī wèn tí): Reduction-to-one problem
15. 高和矮问题 (gāo hé ǎi wèn tí): Tall and short problem
16. 对比问题 (duì bǐ wèn tí): Comparison problem
17. 工程问题 (gōng chéng wèn tí): Engineering problem
18. 折扣问题 (zhé kòu wèn tí): Discount problem
19. 植树问题 (zhí shù wèn tí): Tree planting problem
20. 税收问题 (shuì shōu wèn tí): Tax problem
21. 鸡兔同笼 (jī tù tóng lóng): Chicken and rabbit in the same cage
22. 相遇问题 (xiāng yù wèn tí): Meeting problem
23. 追及问题 (zhuī jí wèn tí): Catch-up problem
24. 人民币问题 (rén mín bì wèn tí): RMB problem
25. 浓度问题 (nóng dù wèn tí): Concentration problem
26. 盈亏问题 (yíng kuī wèn tí): Profit and loss problem
27. 面积和组合 (miàn jī hé zǔ hé): Area and combination
28. 排序问题 (pái xù wèn tí): Sorting problem
29. 统筹规划 (tǒng chóu guī huà): Overall planning
30. 分数应用 (fēn shù yìng yòng): Fraction application
31. 公因数问题 (gōng yīn shù wèn tí): Common factor problem
32. 因数分解 (yīn shù fēn jiě): Factorization
33. 因数与倍数 (yīn shù yǔ bèi shù): Factors and multiples
34. 定义新运算 (dìng yì xīn yùn suàn): Define new operation
35. 定义新运算 (dìng yì xīn yùn suàn): Define new operation
36. 几何问题 (jǐ hé wèn tí): Geometric problem
37. 除法问题 (chú fǎ wèn tí): Division problem
38. 百分数问题 (bǎi fēn shù wèn tí): Percentage problem
**InternLM2-20B (Blue):** This model generally shows lower accuracy compared to the other models, with significant fluctuations across different tasks. It has some very low points, dipping near 0 accuracy on certain tasks.
**Yi-34B (Orange):** This model's accuracy fluctuates considerably, sometimes performing better than InternLM2-20B but generally lower than Qwen-72B and GPT-3.5. It also has some very low accuracy scores on certain tasks.
**Qwen-72B (Green):** This model generally exhibits higher accuracy than InternLM2-20B and Yi-34B, often reaching accuracy levels above 80%. It shows less extreme dips in performance compared to the other two.
**GPT-3.5 (Red):** This model's performance is generally competitive with Qwen-72B, showing high accuracy across many tasks. It also experiences fluctuations, but its lows are generally higher than those of InternLM2-20B and Yi-34B.
### Key Observations
* The accuracy of all models varies significantly depending on the task.
* Qwen-72B and GPT-3.5 generally outperform InternLM2-20B and Yi-34B.
* InternLM2-20B and Yi-34B have instances of very low accuracy, indicating potential weaknesses in specific areas.
* There are specific tasks where all models struggle, suggesting inherent difficulty in those tasks.
### Interpretation
The chart provides a comparative analysis of the accuracy of four language models across a diverse set of tasks. The fluctuations in accuracy highlight the varying strengths and weaknesses of each model. Qwen-72B and GPT-3.5 appear to be more robust overall, while InternLM2-20B and Yi-34B may require further refinement to improve their consistency across different problem types. The specific tasks where all models perform poorly could indicate areas where more advanced techniques or specialized training data are needed. The data suggests that model selection should be task-dependent, as no single model consistently outperforms the others across all categories.
</details>
Figure 15: Concept accuracies on Elementary-ZH of more models.
<details>
<summary>x28.png Details</summary>

### Visual Description
Line chart of concept-wise accuracy (y-axis, 0 to 80) for Yi-6B (blue), ChatGLM3-6B (orange), LLaMA2-7B (green), and DeepSeekMath-7B (red) across 29 Middle-ZH concepts, from 全等三角形 (congruent triangles) and 勾股定理 (Pythagorean theorem) through 分式方程 (fractional equations) to 随机事件与概率 (random events and probability). DeepSeekMath-7B is generally strongest (frequently above 50, peaking near 90), ChatGLM3-6B is the most volatile (ranging from near 0 to about 90), LLaMA2-7B is typically lowest (often below 40), and all four models fluctuate sharply from concept to concept.
</details>
<details>
<summary>x29.png Details</summary>

### Visual Description
Line chart of concept-wise accuracy (y-axis, 0 to 100) for InternLM2-Math-7B (blue), InternLM2-7B (orange), MAmmoTH-13B (green), and WizardMath-13B (red) on the same Middle-ZH concepts (e.g., 全等三角形 congruent triangles, 平方根 square roots, 求一次函数 finding a linear function, 随机事件与概率 random events and probability). The two InternLM2 models clearly outperform MAmmoTH-13B and WizardMath-13B; WizardMath-13B is consistently lowest (mostly below 40 and often near 0), and every model's accuracy swings widely by concept, with InternLM2-Math-7B reaching near 100 on some concepts while all models fail on others.
</details>
<details>
<summary>x30.png Details</summary>

### Visual Description
Line chart of concept-wise accuracy (y-axis, 0 to 80) for Baichuan2-13B (blue), LLaMA2-13B (orange), Qwen-14B (green), and InternLM2-Math-20B (red) on 28 Middle-ZH concepts (e.g., 全等三角形的性质与判定 properties and determination of congruent triangles, 勾股定理 Pythagorean theorem, 一元二次方程的解法 solving quadratic equations in one variable, 随机事件与概率 random events and probability). InternLM2-Math-20B generally peaks highest, Baichuan2-13B is also comparatively strong, Qwen-14B is generally lowest, and all four models show sharp dips on certain concepts that appear hard across the board.
</details>
<details>
<summary>x31.png Details</summary>

### Visual Description
Line chart of concept-wise accuracy (y-axis, 0 to 100) for InternLM2-20B (blue), Yi-34B (orange), Qwen-72B (green), and GPT-3.5 (red) on 20 Middle-ZH concepts, from 全等三角形 (congruent triangles) and 平行四边形 (parallelograms) through 二元一次方程组 (systems of linear equations in two variables) to 随机事件与概率 (random events and probability). Qwen-72B is generally highest (roughly 40-100), InternLM2-20B and Yi-34B perform similarly (roughly 10-75), and GPT-3.5 fluctuates between about 5 and 85; no model leads on every concept.
</details>
Figure 16: Concept accuracies on Middle-ZH of more models.
Appendix B Details on the Efficient Fine-Tuning
In this section, we provide details of the efficient fine-tuning strategy, which enhances mathematical reasoning on specific concepts by first training a concept classifier and then curating samples from a large open-source math dataset. Specifically, we first collect an additional 10 problems per concept and use them to train a classifier that identifies the concept class of a given question. The backbone of this classifier is a pretrained bilingual LLM (i.e., Baichuan2-13B), with a classification head applied to its last hidden state. The concept classification accuracies are 92.5% in English and 86.9% in Chinese, which indicates that such a classifier is a reasonable tool for curating concept-related data from large-scale math corpora. In this work, we draw from OpenWebMath Paster et al. (2023) to produce the concept-related training dataset.
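The classifier described above can be sketched as follows. This is an illustrative stand-in, not the paper's code: `backbone_last_hidden` is a hypothetical placeholder for the frozen Baichuan2-13B backbone (here, a deterministic pseudo-embedding), and the hidden size and number of concept classes are toy values.

```python
import hashlib
import numpy as np

HIDDEN_DIM = 64    # toy value; the real backbone's hidden size is far larger
NUM_CONCEPTS = 5   # toy value; ConceptMath defines many more concept classes

def backbone_last_hidden(question: str) -> np.ndarray:
    """Placeholder for the LLM backbone: returns a deterministic
    pseudo-embedding instead of Baichuan2-13B's last hidden state."""
    seed = int.from_bytes(hashlib.sha256(question.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).standard_normal(HIDDEN_DIM)

class ConceptClassifier:
    """Linear classification head on top of the backbone's last hidden feature."""

    def __init__(self, hidden_dim: int, num_concepts: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((hidden_dim, num_concepts))
        self.b = np.zeros(num_concepts)

    def predict_proba(self, question: str) -> np.ndarray:
        h = backbone_last_hidden(question)
        logits = h @ self.W + self.b
        exp = np.exp(logits - logits.max())  # numerically stable softmax
        return exp / exp.sum()

    def predict(self, question: str) -> int:
        """Index of the predicted concept class for a question."""
        return int(np.argmax(self.predict_proba(question)))

clf = ConceptClassifier(HIDDEN_DIM, NUM_CONCEPTS)
probs = clf.predict_proba("Find the slope between (3, 1) and (5, 2).")
```

In the actual pipeline, questions drawn from OpenWebMath whose predicted concept matches a targeted weak concept would be retained for the fine-tuning set.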
Appendix C Details on the Evaluated Models
In this section, we offer a detailed overview of the Large Language Models (LLMs) and present the corresponding model links in Table 6.
- GPT-3.5/GPT-4 OpenAI (2023): The most powerful closed-source models from OpenAI. We use the APIs gpt-3.5-turbo and gpt-4.
- LLaMA2-7B/13B/70B Touvron et al. (2023b): A family of open-source models developed by Meta.
- Qwen-14B/72B Bai et al. (2023b): Models pre-trained on multilingual data with a focus on Chinese and English. We evaluate both Qwen-Base-14B and Qwen-Base-72B.
- Baichuan2-13B Baichuan (2023): A model with impressive performance on both Chinese and English benchmarks.
- MetaMath-13B Yu et al. (2023): A domain-specific language model for mathematical reasoning, fine-tuned from LLaMA-2 on the MetaMathQA dataset (https://huggingface.co/datasets/meta-math/MetaMathQA).
- WizardMath-13B Luo et al. (2023): Another domain-specific language model for mathematical reasoning, fine-tuned from LLaMA-2 using reinforcement learning.
- MAmmoTH-13B Yue et al. (2023): A model designed for general math problem-solving, fine-tuned from LLaMA on the MathInstruct dataset (https://huggingface.co/datasets/TIGER-Lab/MathInstruct), whose training data includes both chain-of-thought (CoT) and program-of-thought (PoT) rationales.
- Yi-6B/34B Team (2023b): Models released by 01.AI that show promising results in both Chinese and English.
- ChatGLM3-6B Zeng et al. (2022): A lightweight, high-performance pre-trained dialogue model released by Zhipu AI, supporting both Chinese and English.
- InternLM-7B/20B Team (2023a): A multilingual language model with progressively enhanced capabilities, released by the InternLM team.
- InternLM-Math-7B/20B Ying et al. (2024): Strong math-reasoning language models.
- DeepSeekMath-7B Shao et al. (2024): A powerful mathematical language model released by DeepSeek.
| Family | Model | HuggingFace Link / OpenAI Model |
| --- | --- | --- |
| ChatGLM3 | ChatGLM3-6B | https://huggingface.co/THUDM/chatglm3-6b |
| DeepSeekMath | DeepSeekMath-7B | https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct |
| Baichuan2 | Baichuan2-13B | https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat |
| MetaMath | MetaMath-13B | https://huggingface.co/meta-math/MetaMath-13B-V1.0 |
| WizardMath | WizardMath-13B | https://huggingface.co/WizardLM/WizardMath-13B-V1.0 |
| MAmmoTH | MAmmoTH-13B | https://huggingface.co/TIGER-Lab/MAmmoTH-13B |
| InternLM | InternLM-7B | https://huggingface.co/internlm/internlm2-chat-7b |
| InternLM | InternLM-20B | https://huggingface.co/internlm/internlm2-chat-20b |
| InternLM | InternLM-Math-7B | https://huggingface.co/internlm/internlm2-math-7b |
| InternLM | InternLM-Math-20B | https://huggingface.co/internlm/internlm2-math-20b |
| Yi | Yi-6B | https://huggingface.co/01-ai/Yi-6B-Chat |
| Yi | Yi-34B | https://huggingface.co/01-ai/Yi-34B-Chat |
| LLaMA2 | LLaMA2-7B | https://huggingface.co/meta-llama/Llama-2-7b-chat-hf |
| LLaMA2 | LLaMA2-13B | https://huggingface.co/meta-llama/Llama-2-13b-chat-hf |
| LLaMA2 | LLaMA2-70B | https://huggingface.co/meta-llama/Llama-2-70b-chat |
| Qwen | Qwen-14B | https://huggingface.co/Qwen/Qwen-14B-Chat |
| Qwen | Qwen-72B | https://huggingface.co/Qwen/Qwen-72B-Chat |
| GPT | GPT-3.5 | gpt-3.5-turbo |
| GPT | GPT-4 | gpt-4 |
Table 6: Model links.
Appendix D More Results
As shown in Fig. 9 and Fig. 10, we report the mean concept accuracies of the evaluated models on Elementary-EN and Elementary-ZH.
As shown in Fig. 11 and Fig. 12, we illustrate the concept accuracies on Elementary-EN and Elementary-ZH for selected models; for the results of all models, please refer to Fig. 13, Fig. 14, Fig. 15, and Fig. 16.
Appendix E Details on 5-shot Prompts
We provide the 5-shot prompts for ConceptMath on Pages 17-20.
Q: For a party, Oscar gets 4 different kinds of soda. He buys a 6-pack of each kind. At home, he divides the sodas evenly among 3 coolers. How many sodas are in each cooler?
A: Oscar buys 4 different kinds of soda, and he gets a 6-pack of each kind. This means that he has 4 * 6 = 24 sodas in total. He divides these 24 sodas evenly among 3 coolers, so the number of sodas per cooler is 24 / 3 = 8. Therefore, the answer is 8.
Q: Amazing Animals Zoo is building a new home for its 8 zebras. Each zebra will get 2 acres of grassy land to roam around in. The zebras will have a pond that covers 1 acre. How big will the zebras' new home be?
A: The total required grassland area for the zebras is: 8 zebras * 2 acres per zebra = 16 acres. Then, we add the extra 1 acre for the pond: 16 acres (for the zebras) + 1 acre (for the pond) = 17 acres. Therefore, the answer is 17.
Q: The state taxes the portion of personal income exceeding 3500 yuan at a rate of 3%. If Xiaoming's father's monthly salary is 4700 yuan, how much personal income tax should Xiaoming's father pay each month?
A: To calculate the amount of personal income tax Xiaoming's father should pay, we follow this simple process: Subtract the tax-free threshold from the monthly salary. Taxable income = Monthly salary - Tax-free threshold = 4700 - 3500 = 1200 yuan. Apply the tax rate to the amount that exceeds the threshold. Tax owed = Taxable income * Tax rate = 1200 * 3% = 36 yuan. Therefore, the answer is 36.
Q: Find the 10th term in the arithmetic sequence where the first term is 3 and the common difference is 4.
A: To find the nth term of an arithmetic sequence, we use the formula $a_{n}=a_{1}+(n-1)*d$, where $a_{n}$ is the $n$th term, $a_{1}$ is the first term, $d$ is the common difference, and $n$ is the term number. Given that the first term ($a_{1}$) is 3 and the common difference ($d$) is 4, we find the 10th term ($a_{10}$) by plugging these values into the formula: $a_{10}=a_{1}+(10-1)*d=3+9*4=3+36=39$. Therefore, the answer is 39.
Q: A regular polygon has an interior angle of 156 degrees. How many sides does the polygon have?
A: To find the number of sides of a regular polygon with a given interior angle, we use the formula: Interior Angle = [(n - 2) * 180] / n, where n is the number of sides. Given that the interior angle is 156: 156 = [(n - 2) * 180] / n. Solving for n, we get n = 15. Therefore, the answer is 15.
Q: $g(x)=x^{2}$, $f(x)=3(x-2)^{2}$, $h(x)=g(x)-f(x)$, $y=h(23)$; give the value of $y$.
A: First, calculate $g(23)$: $g(23)=23^{2}=529$. Next, calculate $f(23)$: $f(23)=3*(23-2)^{2}=3*(21)^{2}=3*441=1323$. Now, calculate $h(23)$: $h(23)=g(23)-f(23)=529-1323=-794$. Therefore, the answer is -794.
Q: What is the slope of the line segment between the two points (3,1) and (5,2)?
A: The slope of a line between two points $(x_{1},y_{1})$ and $(x_{2},y_{2})$ can be found using the formula $slope(m)=(y_{2}-y_{1})/(x_{2}-x_{1})$. Given the two points (3, 1) and (5, 2), we substitute these values into the formula: $slope(m)=(2-1)/(5-3)=1/2=0.5$. Therefore, the answer is 1/2.
Q: In a party, there are 5 cupcakes remaining and 10 children. A mother inserts 10 pieces of paper into a hat, 5 with a cupcake image and 5 without. The children who draw the cupcake paper will receive a cupcake. If the first child draws a paper with a cupcake image, what is the probability that the second child will also draw a cupcake paper?
A: Initially, there are 5 cupcake papers and 5 non-cupcake papers in the hat, making a total of 10 papers. After the first child draws a cupcake paper, there are 4 cupcake papers and 5 non-cupcake papers left in the hat, a total of 9 papers. The probability that the second child draws a cupcake paper is then the number of cupcake papers left divided by the total number of papers left: Probability = 4/9. Therefore, the answer is 4/9.
Q: What is the total area of an irregular polygon that consists of two squares and two rectangles, where the sides of the squares are 12 cm and 16 cm respectively, and the dimensions of the rectangles are 13 cm by 20 cm and 10 cm by 7 cm respectively?
A: To find the total area of an irregular polygon consisting of two squares and two rectangles, we sum the individual areas of each shape. The area of a square is given by $A=s^{2}$, where $s$ is the side length. For the first square with a side of 12 cm: $A_{1}=12^{2}=144cm^{2}$. For the second square with a side of 16 cm: $A_{2}=16^{2}=256cm^{2}$. The area of a rectangle is given by $A=l*w$, where $l$ is the length and $w$ is the width. For the first rectangle (13 cm by 20 cm): $A_{3}=13*20=260cm^{2}$. For the second rectangle (10 cm by 7 cm): $A_{4}=10*7=70cm^{2}$. Total Area $=A_{1}+A_{2}+A_{3}+A_{4}=144+256+260+70=730cm^{2}$. Therefore, the answer is 730.
Q: Susie is making invitations to her party this coming weekend. It takes her about 5 minutes to make each invitation. It took her 2 hours to finish all the invitations. How many people is Susie inviting?
A: First, convert the 2 hours into minutes to match the unit of time used for making invitations: 2 hours = 120 minutes. Then, divide the total time by the time per invitation: 120 minutes / 5 minutes per invitation = 24 invitations. Therefore, the answer is 24.
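The arithmetic in the exemplar answers above is easy to verify mechanically; a small sanity script (illustrative, not part of the prompts themselves):

```python
from fractions import Fraction

# Elementary exemplars: sodas per cooler, zoo acreage, income tax,
# 10th arithmetic-sequence term, and regular-polygon side count.
assert (4 * 6) // 3 == 8
assert 8 * 2 + 1 == 17
assert (4700 - 3500) * 3 // 100 == 36
assert 3 + (10 - 1) * 4 == 39
assert 360 // (180 - 156) == 15  # from 156 = (n - 2) * 180 / n

# Middle-school exemplars: h(23), slope, conditional probability, total area,
# and the invitation count.
g = lambda x: x ** 2
f = lambda x: 3 * (x - 2) ** 2
assert g(23) - f(23) == -794
assert Fraction(2 - 1, 5 - 3) == Fraction(1, 2)
assert Fraction(5 - 1, 10 - 1) == Fraction(4, 9)
assert 12**2 + 16**2 + 13 * 20 + 10 * 7 == 730
assert 120 // 5 == 24  # invitations in 2 hours at 5 minutes each
```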
Q: 小文用量角器测量一个角的度数时,误把外圈刻度看成了内圈刻度,量出的角是 $65°$,这个角实际是多少度?
A: 量角器一般有两个圈的刻度,内圈刻度通常是顺时针方向从0度到180度,外圈刻度通常是逆时针方向从0度到180度。如果小文误把外圈刻度看成了内圈刻度,那么实际的角度应该是从180°减去错误读数:180° - 65° = 115°。 因此,答案是115。
Q: 小明以每小时2千米的速度上山,到了山顶后立即以每小时3千米的速度按原路返回。求小明上下山的平均速度是多少千米?
A: 要计算小明上下山的平均速度,我们需要知道他上山和下山所走的总距离以及总时间。 设单程距离为d千米,则总距离 = 上山距离 + 下山距离 = d + d = 2d千米。 然后计算总时间:上山时间 = d/2小时,下山时间 = d/3小时,总时间 = d/2 + d/3 = 5d/6小时。 最后计算平均速度:平均速度 = 总距离 / 总时间 = 2d/(5d/6) = 12/5 = 2.4千米/小时。 因此,答案是2.4千米/小时。
Q: 一个礼盒长20cm、宽15cm、高10cm,用彩带捆扎这样的一个礼盒,至少需要多长的彩带?(接头处12厘米)
A: 捆扎这样一个礼盒,通常沿礼盒的长和宽方向各绕一圈,再留一段长度作为接头。 沿长方向绕一圈的长度 = (长 + 高) * 2 = (20 + 10) * 2 = 60 cm;沿宽方向绕一圈的长度 = (宽 + 高) * 2 = (15 + 10) * 2 = 50 cm。 总长度 = 60 + 50 + 12 = 122 cm。 因此,答案是122厘米。
Q: 千岛湖大桥的一侧每隔10米安装1盏灯,共需91盏(两端都装),现在改装61盏(两端都装),平均每相邻两盏灯之间的距离是多少米?
A: 原来每隔10米安装1盏灯,共需91盏(两端都装),说明桥的总长度覆盖了90个10米的间隔,桥长 = 90 * 10 = 900米。 现在改装61盏灯(两端都装),即把900米分成60个间隔(灯的数量比间隔数多一个),每个间隔的长度 = 900 / 60 = 15米。 因此,答案是15米。
Q: a是b的3/4,则a比b少( ),b比a多( ),b与a的比是( ),如果a、b两数的和是210,则a=( ),b=( ),括号处分别为多少?
A: 已知a是b的3/4,则a = 3/4 * b。因此,a比b少 (1 - 3/4)/1 = 1/4,b比a多 (1 - 3/4)/(3/4) = 1/3,b与a的比为 b : (3/4 * b) = 4:3。 若a、b两数的和是210,则a + b = 210,将a = 3/4 * b 代入得:3/4 * b + b = 210,解得b = 120,a = 3/4 * 120 = 90。 因此,答案分别是 1/4、1/3、4:3、90、120。
Q: 求不等式 $-x^{2}-2x+8>0$ 的解集?
A: 对 $-x^{2}-2x+8>0$ 分解因式,得到 $-(x-2)(x+4)>0$。 因此,答案是 (-4, 2)。
Q: 已知函数 $y=2x+3$ 与 $y=-1/2x+m$ 的图象交y轴于点A、B,两函数图象的交点为P(n,-2),求m和n的值?
A: 两函数图象的交点为P(n,-2),意味着这个点同时满足上述两个函数方程: 将P点的坐标(n,-2)代入函数 $y=2x+3$,即 $-2=2n+3$,解得 $n=-5/2$。 将P点坐标(-5/2,-2)代入函数 $y=-1/2x+m$,即 $-2=-1/2*(-5/2)+m$,解得 $m=-13/4$。 因此,答案是m为-13/4,n为-5/2。
Q: 已知PA是圆O的切线,A为切点,PC与圆O相交于B、C两点,PB=2cm,BC=8cm,则PA的长是多少厘米?
A: 由于PA是圆O的切线,根据切割线定理,切线长的平方等于从圆外这一点引出的割线上,该点到两个交点的两条线段长之积,即 $PA^{2}=PC*PB$。 已知PB=2cm、BC=8cm,则 $PC=PB+BC=2+8=10cm$。 $PA^{2}=PC*PB=10*2=20$,解得PA = $2\sqrt{5}$ cm。 因此,答案是 $2\sqrt{5}$ 厘米。
Q: 南宋著名数学家杨辉精研数学,著有《详解九章算法》,对数的运算进行了深入研究与总结。类比其中的思想方法,可以解决很多数与式的计算问题。已知a,b为实数,且 $a+b=3$,$ab=1$,计算可得:$a^{2}+b^{2}=7$,$a^{3}+b^{3}=18$,$a^{4}+b^{4}=47$,由此求 $a^{5}+b^{5}$ 得多少?
A: 已知 $a+b=3$ 和 $ab=1$,可得:$a^{2}+b^{2}=(a+b)^{2}-2ab=3^{2}-2*1=7$,$a^{3}+b^{3}=(a+b)(a^{2}-ab+b^{2})=3*(7-1)=18$,$a^{4}+b^{4}=(a^{2}+b^{2})^{2}-2a^{2}b^{2}=7^{2}-2*1^{2}=47$,$a^{5}+b^{5}=(a+b)(a^{4}+b^{4})-ab(a^{3}+b^{3})=3*47-1*18=123$。 因此,答案是123。
Q: 某单位要招聘1名英语翻译,甲、乙两人报名参加了听、说、读、写等项素质测试,甲的成绩分别为:90分、80分、85分、78分;乙的成绩分别为:78分、82分、85分、88分。如果把听、说、读、写的成绩按 $2:2:3:3$ 计算素质测试平均成绩,那么甲和乙的平均成绩分别是多少?
A: 甲的平均成绩为 $(90*2+80*2+85*3+78*3)/(2+2+3+3)=(180+160+255+234)/10=829/10=82.9$ 分,乙的平均成绩为 $(78*2+82*2+85*3+88*3)/(2+2+3+3)=(156+164+255+264)/10=839/10=83.9$ 分。 因此,答案是甲的平均成绩是82.9分,乙的平均成绩是83.9分。
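The power sums in the Yang Hui (杨辉) exemplar obey the recurrence $p_{n}=(a+b)\,p_{n-1}-ab\,p_{n-2}$ for $p_{n}=a^{n}+b^{n}$; a quick numeric check (illustrative, not part of the prompt):

```python
# Power sums p_n = a^n + b^n for a + b = 3 and ab = 1 obey
# p_n = (a + b) * p_{n-1} - (ab) * p_{n-2}.
s, q = 3, 1
p = [2, s]  # p_0 = a^0 + b^0 = 2, p_1 = a + b = 3
for _ in range(4):
    p.append(s * p[-1] - q * p[-2])
assert p[2:] == [7, 18, 47, 123]  # a^2+b^2, a^3+b^3, a^4+b^4, a^5+b^5
```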
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| Calculation & Properties | Calculation | Add | 19 |
| | | Decimals | 20 |
| | | Division | 19 |
| | | Equations | 18 |
| | | Fractions | 16 |
| | | Mixed Operations | 18 |
| | | Multiple | 18 |
| | | Numerical Expressions | 20 |
| | | Place Value | 16 |
| | | Powers | 20 |
| | | Rational Number | 17 |
| | | Subtraction | 19 |
| | | Variable Expressions | 19 |
| | Properties | Compare | 20 |
| | | Count | 18 |
| | | Estimation & Rounding | 20 |
| | | Patterns | 19 |
| Geometry | Angles | Angles | 17 |
| | Coordinate Plane | Coordinate Plane | 18 |
| | Three-dimensional Shapes | Cones | 17 |
| | | Cubes | 20 |
| | | Cylinders | 17 |
| | | Spheres | 17 |
| | | Volume of 3D shapes | 18 |
| | Two-dimensional Shapes | Circles | 17 |
| | | Perimeter | 19 |
| | | Polygons | 18 |
| | | Quadrilaterals | 17 |
| | | Triangles | 18 |
| Measurement | Basic Knowledge | Temperature | 19 |
| | | Time | 20 |
| | Money | Coin Names & Value | 17 |
| | | Exchanging Money | 17 |
| | Ratio | Percent | 17 |
| | | Proportion | 18 |
| | | Ratio | 19 |
| | Size | Area | 19 |
| | | Length | 20 |
| | | Volume | 20 |
| | Weight | Light & Heavy | 20 |
| Statistics | Classifying & Sorting | Classifying & Sorting | 17 |
| | Data | Mode/Mean/Median/Range | 19 |
| | Probability | Probability | 16 |
Table 7: Details of the hierarchical concepts in Elementary-EN.
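Hierarchies like Table 7 are what make concept-wise evaluation possible: a LEVEL3 accuracy can be rolled up into a LEVEL2 or LEVEL1 accuracy by weighting each leaf concept by its sample count. A minimal sketch of this aggregation, where the dictionary layout and function name are illustrative (not the benchmark's actual data format) and the per-concept accuracies are made-up placeholders:

```python
# Roll LEVEL3 accuracies up to a LEVEL2 accuracy, weighting each
# leaf concept by its number of samples. Sample counts are taken from
# Table 7 (the Measurement / Ratio group); "correct" counts are invented.
leaves = {
    "Percent":    {"samples": 17, "correct": 12},
    "Proportion": {"samples": 18, "correct": 9},
    "Ratio":      {"samples": 19, "correct": 15},
}

def rollup_accuracy(concepts):
    """Sample-weighted accuracy over a group of leaf concepts."""
    total = sum(c["samples"] for c in concepts.values())
    correct = sum(c["correct"] for c in concepts.values())
    return correct / total

for name, c in leaves.items():
    print(f"{name}: {c['correct'] / c['samples']:.3f}")
print(f"Ratio (LEVEL2): {rollup_accuracy(leaves):.3f}")  # 36/54 = 0.667
```

The same roll-up applied one level higher (over all LEVEL2 groups) yields the LEVEL1 and whole-benchmark averages.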
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| Calculation | Basic Calculation | Add & Subtract | 20 |
| | | Decimals | 19 |
| | | Divide | 19 |
| | | Exponents & Scientific Notation | 16 |
| | | Fractions & Decimals | 18 |
| | | Multiply | 18 |
| | | Square Roots & Cube Roots | 20 |
| | Consumer Math | Consumer Math | 18 |
| | Financial Literacy | Financial Literacy | 19 |
| | Integers | Absolute Value | 18 |
| | | Opposite Integers | 20 |
| | Measurement | Measurement Metric | 19 |
| | Number Theory | Factors | 20 |
| | | Prime Factorization | 19 |
| | | Prime or Composite | 18 |
| | Percents | Percents | 20 |
| | Rational & Irrational Numbers | Rational & Irrational Numbers | 18 |
| | Ratios & Rates | Proportional Relationships | 18 |
| | Sequences | Arithmetic Sequences | 19 |
| | | Geometric Sequences | 18 |
| Expressions, Equations, and Functions | Equations | Linear Equations | 20 |
| | | Systems of Equations | 18 |
| | Expressions | Equivalent Expressions | 20 |
| | | Radical | 17 |
| | | Variable | 18 |
| | Function | Domain & Range of Functions | 18 |
| | | Interpret Functions | 19 |
| | | Linear Functions | 20 |
| | | Nonlinear Functions | 18 |
| | Inequalities | Inequalities | 19 |
| Geometry | Congruence & Similarity | Congruence & Similarity | 19 |
| | Coordinate Plane | Axes | 17 |
| | | Distance Between Two Points | 19 |
| | | Quadrants | 16 |
| | Scale Drawings | Scale Drawings | 16 |
| | Slope | Slope | 20 |
| | Three-dimensional Figures | Polyhedra | 19 |
| | | Surface Area & Volume | 17 |
| | Transformations | Transformations | 18 |
| | Two-dimensional Figures | Circle | 20 |
| | | Lines & Angles | 18 |
| | | Perimeter & Area | 20 |
| | | Polygons | 18 |
| | | Square | 18 |
| | | Trapezoids | 16 |
| | | Triangle | 18 |
| Statistics and Probability | Data | Center & Variability | 18 |
| | | Mean, Median, Mode & Range | 19 |
| | | Outlier | 20 |
| | One-variable Statistics | One-variable Statistics | 19 |
| | Probability | Counting Principle | 16 |
| | | Independent & Dependent Events | 16 |
| | | Make Predictions | 17 |
| | | Probability of Compound Events | 16 |
| | | Probability of One Event | 17 |
| | | Probability of Simple and Opposite Events | 19 |
| | Two-variable Statistics | Two-variable Statistics | 18 |
Table 8: Details of the hierarchical concepts in Middle-EN.
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| Geometry | Two-dimensional Shapes | Triangles | 20 |
| | | Circle | 20 |
| | | Parallelogram | 20 |
| | | Trapezium | 20 |
| | | Square | 20 |
| | | Synthesis Problem | 20 |
| | | Angle | 20 |
| | | Rectangle | 20 |
| | Three-dimensional Shapes | Cylinder | 20 |
| | | Cube | 20 |
| | | Synthesis Problem | 20 |
| | | Cuboid | 20 |
| Application | Fundamental Problem | Add & Differential & Multiple | 21 |
| | | Basics | 20 |
| | | Differential | 20 |
| | | Normalization | 20 |
| | | Induction | 20 |
| | Classical Problem | Interest | 20 |
| | | Period | 10 |
| | | Folding | 20 |
| | | Engineering | 20 |
| | | Age | 20 |
| | | Discount | 20 |
| | | Planting | 20 |
| | | Tax | 15 |
| | | Reduction | 20 |
| | | Pagination | 20 |
| | | Chickens & Rabbits in the Same Cage | 20 |
| | Distance Problem | Encounter | 20 |
| | | Travel | 20 |
| | | Pursuit | 20 |
| Measurement and Statistics | Measurement | RMB | 9 |
| | | Time | 20 |
| | | Concentration | 20 |
| | | Temperature | 6 |
| | | Area | 17 |
| | Statistics | Permutation | 20 |
| | | Statistical Metrics | 20 |
| | | Law | 18 |
| Number and Algebra | Fractional Operation | Fraction & Decimal | 20 |
| | | Fractional Application | 20 |
| | | Fractional Operation | 20 |
| | | Simplest Fraction | 16 |
| | Factors & Multiples | Common Multiples | 16 |
| | | Common Divisors | 11 |
| | | Factor | 20 |
| | | Synthesis Problem | 11 |
| | | Prime Number | 9 |
| | Basic Operation | Multiplication | 20 |
| | | Reciprocal Problem | 16 |
| | | Four-rule Operation | 20 |
| | | New Operation Definition | 20 |
| | | Equation | 20 |
| | | Division | 20 |
| | Ratio | Multiple | 20 |
| | | Probability | 20 |
| | | Proportion | 20 |
| | | Percentage | 20 |
Figure 17: Details of the hierarchical concepts in Elementary-ZH.
| LEVEL1 | LEVEL2 | LEVEL3 | # Samples |
| --- | --- | --- | --- |
| 几何 (Geometry) | 三角形 (Triangle) | 全等三角形 (Congruent Triangle) | 20 |
| | | 勾股定理 (Pythagorean Theorem) | 20 |
| | | 等腰三角形 (Isosceles Triangle) | 20 |
| | | 等边三角形 (Equilateral Triangle) | 20 |
| | 四边形 (Quadrilateral) | 平行四边形 (Parallelogram) | 20 |
| | | 梯形 (Trapezium) | 20 |
| | 圆 (Circle) | 圆周角 (Angle of Circumference) | 20 |
| | | 圆心角 (Angle of Center) | 20 |
| | | 垂径定理 (Vertical Path Theorem) | 20 |
| | | 弧长和扇形面积 (Arc Length & Sector Area) | 20 |
| | | 正多边形和圆 (Regular Polygons & Circles) | 20 |
| | | 点线圆位置关系 (Relations of Point, Line & Circle) | 20 |
| | 立体图形 (Three-dimensional Shapes) | 圆锥 (Cone) | 20 |
| 函数 (Function) | 一次函数 (Linear Function) | 函数与一元一次方程 (Univariate Function & Equation) | 20 |
| | | 函数与一元一次不等式 (Linear Functions & Univariate Linear Inequalities) | 20 |
| | | 一次函数与二元一次方程组 (Linear Functions & System of Binary Linear Equations) | 20 |
| | | 正比例函数 (Proportional Function) | 20 |
| | | 一次函数解析式 (Analytical Formula of Linear Functions) | 20 |
| | 二次函数 (Quadratic Function) | 二次函数的应用 (Applications of Quadratic Functions) | 20 |
| | | 抛物线的性质 (Properties of Parabolas) | 18 |
| | 反比例函数 (Inverse Proportional Function) | 定义 (Definition) | 20 |
| | | 应用 (Applications) | 20 |
| | | 性质 (Properties) | 19 |
| | 平面直角坐标系 (Rectangular Coordinate System) | 有序数对 (Ordered Pair) | 20 |
| | | 象限中的点 (Points of Quadrant) | 14 |
| 数与式 (Number and Expression) | 代数式 (Algebraic Expression) | 代数式求值 (Algebraic Expression Evaluation) | 20 |
| | | 同类项 (Similar Items) | 20 |
| | 分式 (Fraction) | 指数幂 (Exponential Power) | 20 |
| | | 约分 (Fraction Reduction) | 19 |
| | 因式 (Factor) | 十字相乘法 (Cross Multiplication) | 20 |
| | | 公因式提取 (Common Factor Extraction) | 18 |
| | 应用 (Application) | 流水问题 (Flow Problem) | 20 |
| | | 鸽巢问题 (Pigeon Nest Problem) | 20 |
| | 整式 (Integral Expression) | 乘法公式 (Multiplication Formulas) | 20 |
| | | 整式的乘除及混合 (Multiplication, Division & Mixing) | 20 |
| | | 整式的加减 (Addition & Subtraction) | 20 |
| | 无理数 (Irrational Number) | 无理数识别 (Irrational Number Recognition) | 20 |
| | 根式 (Radical Expression) | 二次根式的运算 (Operation of Quadratic Radicals) | 20 |
| | | 同类二次根式 (Similar Quadratic Radicals) | 20 |
| | | 平方根与算术平方根 (Square Root & Arithmetic Square Root) | 20 |
| | | 立方根 (Cube Root) | 20 |
| 方程与不等式 (Equations & Inequalities) | 一元一次方程 (Linear Equation in One Variable) | 一元一次方程的应用 (Applications) | 20 |
| | | 解一元一次方程 (Solutions) | 20 |
| | 一元二次方程 (Quadratic Equation in One Variable) | 一元二次方程的应用 (Applications) | 20 |
| | | 解一元二次方程 (Solutions) | 20 |
| | 不等式与不等式组 (Inequalities & Groups of Inequalities) | 一元一次不等式的应用 (Applications of Unary First Order Inequality) | 20 |
| | | 一元一次不等式组的应用 (Applications of Unary First Order Groups of Inequalities) | 20 |
| | | 解一元一次不等式 (Solve the First Inequality of One Variable) | 20 |
| | | 解一元一次不等式组 (Solve Unary First Order Groups of Inequalities) | 20 |
| | 分式方程 (Fractional Equation) | 分式方程的应用 (Application of Fractional Equation) | 20 |
| | | 解分式方程 (Solve Fractional Equation) | 20 |
| 统计与概率 (Statistics and Probability) | 数据分析 (Data Analysis) | 数据的波动趋势 (Fluctuating Trend of Data) | 20 |
| | | 数据的集中趋势 (Central Tendency of Data) | 20 |
| | 概率 (Probability) | 概率的应用 (Applications of Probability) | 20 |
| | | 求概率 (Find Probability) | 20 |
| | | 随机事件与概率 (Random Events & Probabilities) | 20 |
Table 9: Details of the hierarchical concepts in Middle-ZH.