# SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
**Authors**: Xiao Liang∗, Zhong-Zhi Li∗, Yeyun Gong, Yang Wang, Hengyuan Zhang
∗ Equal contribution. Work done during Xiao’s and Zhongzhi’s internships at Microsoft. † Corresponding authors: Yeyun Gong and Weizhu Chen. 🖂: yegong@microsoft.com; wzchen@microsoft.com
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and the limited verification of answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model’s capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis (SwS) framework that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model’s weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.
| Code | https://github.com/MasterVito/SwS |
| --- | --- |
| Project | https://MasterVito.SwS.github.io |
<details>
<summary>x1.png Details</summary>

### Visual Description
## Radar Charts: Performance Comparison of Language Models
### Overview
The image presents two radar charts comparing the performance of several language models across different benchmarks and domains. Chart (a) shows performance across benchmarks (GSM8K, AIME@32, AMC23, GaoKao 2023, Olympiad Bench, MATH 500, Minerva Math, Precalculus), while chart (b) displays performance across mathematical domains (Prealgebra, Number Theory, Counting & Probability, Geometry, Algebra, Intermediate Algebra). The models being compared are Qwen2.5-32B, Qwen2.5-32B-IT, ORZ-32B, SimpleRL-32B, Baseline-32B, and SwS-32B.
### Components/Axes
Both charts share the following components:
* **Radial Axes:** Representing different benchmarks/domains. The axes are labeled with the benchmark/domain names.
* **Concentric Circles:** Representing performance levels, ranging from 0% to 100%. The circles are marked at 20% intervals (20%, 40%, 60%, 80%, 100%).
* **Lines:** Each line represents the performance of a specific language model.
* **Legend:** Located in the top-left corner of each chart, identifying each line with a specific color and model name.
**Chart (a) - Performance across Benchmarks:**
* Benchmarks: GSM8K, AIME@32, AMC23, GaoKao 2023, Olympiad Bench, MATH 500, Minerva Math, Precalculus.
**Chart (b) - Performance across Domains:**
* Domains: Prealgebra, Number Theory, Counting & Probability, Geometry, Algebra, Intermediate Algebra.
### Detailed Analysis or Content Details
**Chart (a) - Performance across Benchmarks:**
* **Qwen2.5-32B (Purple):** Shows generally high performance, peaking at approximately 96.3% on GSM8K, and maintaining relatively high scores across all benchmarks. The line is mostly above 80% except for Minerva Math (around 47.1%) and Precalculus (around 72.3%).
* **Qwen2.5-32B-IT (Orange):** Similar to Qwen2.5-32B, with a peak of around 96.3% on GSM8K. It shows slightly lower performance on AIME@32 (around 31.2%) and AMC23 (around 90.0%).
* **ORZ-32B (Green):** Demonstrates strong performance, peaking at approximately 96.3% on GSM8K. It shows a dip on Minerva Math (around 47.1%).
* **SimpleRL-32B (Blue):** Generally lower performance than the other models, with a peak of around 80.3% on GaoKao 2023. It shows the lowest performance on Minerva Math (around 31.2%).
* **Baseline-32B (Red):** Moderate performance, peaking at around 89.4% on MATH 500. It shows lower performance on AIME@32 (around 31.2%).
* **SwS-32B (Teal):** Performance is variable, peaking at around 80.3% on GaoKao 2023. It shows a low score on Minerva Math (around 47.1%).
**Chart (b) - Performance across Domains:**
* **Qwen2.5-32B (Purple):** Highest performance, peaking at 96.3% on Prealgebra. Maintains high scores across most domains, with a slight dip to around 76.6% on Algebra.
* **Qwen2.5-32B-IT (Orange):** Similar to Qwen2.5-32B, peaking at 96.3% on Prealgebra.
* **ORZ-32B (Green):** Strong performance, peaking at 96.3% on Prealgebra.
* **SimpleRL-32B (Blue):** Lower performance, peaking at 84.1% on Intermediate Algebra.
* **Baseline-32B (Red):** Moderate performance, peaking at 84.1% on Intermediate Algebra.
* **SwS-32B (Teal):** Variable performance, peaking at 84.1% on Intermediate Algebra.
### Key Observations
* **Qwen2.5-32B and Qwen2.5-32B-IT consistently outperform other models** across most benchmarks and domains.
* **SimpleRL-32B generally exhibits the lowest performance.**
* **Minerva Math consistently presents a challenge** for most models, resulting in lower scores in Chart (a).
* **Intermediate Algebra is a strong suit** for several models in Chart (b).
* The shapes of the radar charts are similar for Qwen2.5-32B, Qwen2.5-32B-IT, and ORZ-32B, indicating similar performance profiles.
### Interpretation
The data suggests that Qwen2.5-32B and its IT variant are the most capable language models among those tested, demonstrating superior performance across a wide range of benchmarks and mathematical domains. The consistent underperformance of SimpleRL-32B indicates potential limitations in its architecture or training data. The lower scores on Minerva Math suggest that this benchmark presents a unique challenge, potentially requiring specialized knowledge or reasoning abilities. The high scores on Intermediate Algebra for several models suggest a strong foundation in fundamental algebraic concepts.
The radar chart format effectively visualizes the relative strengths and weaknesses of each model across different tasks. The area enclosed by each line represents the overall performance of the model, with larger areas indicating better overall performance. The deviations from the circular shape highlight specific areas where a model excels or struggles. The charts allow for a quick and intuitive comparison of model capabilities, facilitating informed decision-making in selecting the most appropriate model for a given task.
</details>
Figure 1: 32B model performance across mainstream reasoning benchmarks and different domains.
1 Introduction
"Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
—Abraham Lincoln
Large-scale Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the reasoning capabilities of large language models (LLMs) [16, 10, 46], where simple rule-based rewards can effectively induce complex reasoning skills. The success of RLVR for eliciting models’ reasoning capabilities heavily depends on a well-curated problem set with proper difficulty levels [63, 28, 55], where each problem is paired with a precise and verifiable reference answer [14, 31, 63, 10]. However, existing reasoning-focused datasets for RLVR suffer from three main issues: (1) High-quality, human-labeled mathematical problems are scarce, and collecting large-scale, well-annotated datasets with precise reference answers is cost-intensive. (2) Most reasoning-focused synthetic datasets are created for SFT distillation, where reference answers are rarely rigorously verified, making them suboptimal for RLVR, which relies heavily on the correctness of the final answer as the training signal. (3) Existing problem augmentation strategies typically involve rephrasing or generating variants of human-written questions [62, 30, 38, 27], or sampling concepts from existing datasets [15, 45, 20, 73], without explicitly considering the model’s reasoning capabilities. Consequently, the synthetic problems may be either too trivial or overly challenging, limiting their utility for model improvement in RL.
More specifically, in RL, it is essential to align the difficulty of training tasks with the model’s current capabilities. When using group-level RL algorithms such as GRPO [40], the advantage of each response is calculated based on its comparison with other responses in the same group. If all responses are either entirely correct or entirely incorrect, the token-level advantages within each rollout collapse to 0, leading to gradient vanishing and degraded training efficiency [28, 63], and potentially harming model performance [55]. Therefore, training on problems that the model has fully mastered or consistently fails to solve does not provide useful learning signals for improvement. However, a key advantage of the failure cases is that, unlike overly simple questions with little opportunity for improvement, persistently failed problems reveal specific areas of weakness in the model and indicate directions for further enhancement. This raises the following research question: How can we effectively utilize these consistently failed cases to address the model’s reasoning deficiencies? Could they be systematically leveraged for data synthesis that targets the enhancement of the model’s weakest capabilities?
To answer these questions, we propose a Self-aware Weakness-driven Problem Synthesis (SwS) framework, which leverages the model’s self-identified weaknesses in RL to generate synthetic problems for training augmentation. Specifically, we record problems that the model consistently struggles to solve or learns inefficiently through iterative sampling during a preliminary RL training phase. These failed problems, which reflect the model’s weakest areas, are grouped by category, used to extract common concepts, and then to synthesize new problems with difficulty levels tailored to the model’s capabilities. To further improve weakness mitigation efficiency during training, the augmentation budget for each category is allocated based on the model’s relative performance across them. Compared with existing problem synthesis strategies for LLM reasoning [73, 45], our framework explicitly targets the model’s capabilities and self-identified weaknesses, enabling more focused and efficient improvement in RL training.
To validate the effectiveness of SwS, we conducted experiments across model sizes ranging from 3B to 32B and comprehensively evaluated performance on eight popular mathematical reasoning benchmarks, showing that its weakness-driven augmentation strategy benefits models across all levels of reasoning capability. Notably, our models trained on the augmented problem set consistently surpass both the base models and those trained on the original dataset across all benchmarks, achieving a substantial average absolute improvement of 10.0% for the 7B model and 7.7% for the 32B model, even surpassing their counterparts trained on carefully curated human-labeled problem sets [14, 6]. We also analyze the model’s performance on previously failed problems and find that, after training on the augmented problem set, it is able to solve up to 20.0% more problems it had consistently failed in its weak domain when trained only on the original dataset. To further demonstrate the robustness and adaptability of the proposed SwS pipeline, we extend it to explore the potential of Weak-to-Strong Generalization, Self-evolving, and Weakness-driven Selection settings, with detailed experimental results and analysis presented in Section 4.
Contributions. (i) We propose a Self-aware Weakness-driven Problem Synthesis (SwS) framework that utilizes the model’s self-identified weaknesses to generate synthetic problems for enhanced RLVR training, paving the way for utilizing high-quality and targeted synthetic data for RL training. (ii) We comprehensively evaluate the SwS framework across diverse model sizes on eight mainstream reasoning benchmarks, demonstrating its effectiveness and generalizability. (iii) We explore the potential of extending our SwS framework to Weak-to-Strong Generalization, Self-evolving, and Weakness-driven Selection settings, highlighting its adaptability through detailed analysis.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Reinforcement Learning Training and Verification Process
### Overview
This diagram illustrates a reinforcement learning (RL) training process with a verification step. It depicts the flow of information from a Policy to a set of Answers, then to a Verifier, and finally to a categorization of success (Failed set) or continuation. The diagram also includes two bar charts representing the accuracy (Acc) of the verifier over time (Epochs).
### Components/Axes
The diagram consists of the following components:
* **Policy θ:** A green box representing the RL policy.
* **Answer 1,1 ... Answer k,1:** A gray box containing multiple answers generated by the policy.
* **Verifier:** A light blue box representing the verification process.
* **Failed set:** A red oval representing the set of failed answers.
* **Bar Charts (2):** Two bar charts showing accuracy (Acc) over time (τ1, τ2, τ3, τr, Epoch).
* **Text:** "RL Training for T1 Epochs" below the Answer box.
* **Question-1:** A text block at the top left describing a geometry problem.
The bar charts have a vertical axis labeled "Acc" (Accuracy) with a scale ranging from approximately 0.0 to 1.0. The horizontal axis represents time steps: τ1, τ2, τ3, τr, and Epoch.
### Detailed Analysis or Content Details
The diagram shows a flow of information:
1. The **Policy θ** generates multiple **Answers** (Answer 1,1 to Answer k,1). The arrows indicate a many-to-many relationship.
2. These **Answers** are fed into the **Verifier**.
3. The **Verifier** evaluates the answers and categorizes them.
4. If the answers are deemed incorrect, they are added to the **Failed set**.
5. The **Verifier's** accuracy is visualized using two bar charts.
**Bar Chart 1 (Top):**
* τ1: Accuracy ≈ 0.3
* τ2: Accuracy ≈ 0.3
* τ3: Accuracy ≈ 0.3
* τr: Accuracy ≈ 0.2
* Epoch: Accuracy ≈ 0.5
The trend in this chart is relatively flat, with accuracy fluctuating around 0.3, and a slight increase to 0.5 at the Epoch.
**Bar Chart 2 (Bottom):**
* τ1: Accuracy ≈ 0.3
* τ2: Accuracy ≈ 0.9
* τ3: Accuracy ≈ 1.0
* τr: Accuracy ≈ 0.8
* Epoch: Accuracy ≈ 1.0
This chart shows a clear upward trend in accuracy, starting at 0.3, rising to 0.9, 1.0, 0.8, and peaking at 1.0 at the Epoch.
The text "RL Training for T1 Epochs" indicates that the process is repeated for T1 epochs.
The question at the top left is: "Tiffany is constructing a fence around a rectangular tennis court. She must use exactly 300 feet of fencing. The fence must enclose all four sides of the court. Regulation states that the length of the fence enclosure must be at least 80 feet and the width must be at least 40 feet. Tiffany wants the area enclosed by the fence to be as large as possible in order to accommodate benches and storage space. What is the optimal area, in square feet?"
### Key Observations
* The two bar charts demonstrate different accuracy trends. The first chart shows a relatively stable, lower accuracy, while the second chart shows a significant improvement in accuracy over time.
* The "Failed set" indicates that the RL policy is not always generating correct answers.
* The diagram highlights the importance of a verification step in RL training.
### Interpretation
The diagram illustrates a typical reinforcement learning workflow where a policy generates actions (answers), and a verifier assesses their quality. The two bar charts likely represent the accuracy of different verification methods or different stages of the verification process. The first chart might represent an initial, less refined verification method, while the second chart represents a more accurate and improved method. The increasing accuracy in the second chart suggests that the verification process is learning and improving over time. The failed set indicates that the policy is still making errors, and further training is needed. The inclusion of a geometry problem suggests that the RL agent is being trained to solve mathematical problems. The diagram emphasizes the iterative nature of RL training, where the policy is continuously refined based on feedback from the verifier.
</details>
Figure 2: Illustration of the self-aware weakness identification during a preliminary RL training.
2 Method
2.1 Preliminary
Group Relative Policy Optimization (GRPO). GRPO [40] is an efficient optimization algorithm tailored for RL in LLMs, where the advantages for each token are computed in a group-relative manner without requiring an additional critic model to estimate token values. Specifically, given an input prompt $x$ , the policy model $\pi_{\theta_{\text{old}}}$ generates a group of $G$ responses $\mathbf{Y}=\{y_{i}\}_{i=1}^{G}$ , with acquired rewards $\mathbf{R}=\{r_{i}\}_{i=1}^{G}$ . The advantage $A_{i,t}$ for each token in response $y_{i}$ is computed as the normalized rewards:
$$
A_{i,t}=\frac{r_{i}-\text{mean}(\{r_{i}\}_{i=1}^{G})}{\text{std}(\{r_{i}\}_{i=1}^{G})}. \tag{1}
$$
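Equation (1) can be sketched in a few lines; this is a minimal illustration, not the paper's implementation (the small `eps` guard against a zero standard deviation is our own addition):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages (Eq. 1): normalize each reward
    against the mean/std of its own rollout group.
    The eps guard is an implementation detail, not part of Eq. 1."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group yields informative advantages of roughly +/-1 ...
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
# ... while an all-correct (or all-wrong) group collapses to ~0,
# which is exactly the gradient-vanishing case discussed in Section 1.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))
```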
To improve the stability of policy optimization, GRPO clips the probability ratio $k_{i,t}(\theta)=\frac{\pi_{\theta}(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid x,y_{i,<t})}$ within a trust region [39], and constrains the policy distribution from deviating too much from the reference model using a KL term. The optimization objective is defined as follows:
$$
\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\mathbf{Y}\sim\pi_{\theta_{\text{old}}}(\cdot\mid x)}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_{i}|}\sum_{t=1}^{|y_{i}|}\Big(\min\big(k_{i,t}(\theta)A_{i,t},\ \text{clip}\big(k_{i,t}(\theta),1-\varepsilon,1+\varepsilon\big)A_{i,t}\big)-\beta D_{\text{KL}}(\pi_{\theta}\,\|\,\pi_{\text{ref}})\Big)\Bigg]. \tag{2}
$$
Inspired by DAPO [63], in all experiments of this work, we omit the KL term during optimization, while incorporating the clip-higher, token-level loss and dynamic sampling strategies to enhance the training efficiency of RLVR. Our RLVR training objective is defined as follows:
$$
\mathcal{J}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\mathbf{Y}\sim\pi_{\theta_{\text{old}}}(\cdot\mid x)}\Bigg[\frac{1}{\sum_{i=1}^{G}|y_{i}|}\sum_{i=1}^{G}\sum_{t=1}^{|y_{i}|}\min\big(k_{i,t}(\theta)A_{i,t},\ \text{clip}(k_{i,t}(\theta),1-\varepsilon,1+\varepsilon^{h})A_{i,t}\big)\Bigg]\quad\text{s.t.}\ \ \text{acc}_{\text{lower}}<\left|\left\{y_{i}\in\mathbf{Y}\;\middle|\;\texttt{is\_accurate}(x,y_{i})\right\}\right|<\text{acc}_{\text{upper}}. \tag{3}
$$
where $\varepsilon^{h}$ denotes the upper clipping threshold for the importance sampling ratio $k_{i,t}(\theta)$, and $\text{acc}_{\text{lower}}$ and $\text{acc}_{\text{upper}}$ are thresholds used to filter target prompts for subsequent policy optimization.
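The clipped, token-level objective of Eq. (3) can be illustrated with a small sketch. This is an assumption-laden simplification: per-token ratios and advantages are pooled into flat lists so the $1/\sum_i |y_i|$ normalization becomes a plain mean, and the concrete `eps_low` / `eps_high` values are illustrative, not the paper's settings:

```python
def token_level_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate (Eq. 3, KL term omitted as in DAPO).
    `ratios` holds k_{i,t}(theta) and `advantages` holds A_{i,t},
    flattened over all tokens of all responses in the group."""
    terms = [
        # clip(k, 1 - eps_low, 1 + eps_high) is a clamp of the ratio
        min(k * a, max(min(k, 1 + eps_high), 1 - eps_low) * a)
        for k, a in zip(ratios, advantages)
    ]
    # negate: gradient descent on the loss maximizes the objective
    return -sum(terms) / len(terms)
```

Note how a ratio above `1 + eps_high` is clamped before multiplying the advantage, so overly confident updates contribute a bounded term.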
2.2 Overview
Figure 3 presents an overview of our SwS framework, which generates targeted training samples to enhance the model’s reasoning capabilities in RLVR. The framework initiates with a Self-aware Weakness Identification stage, where the model undergoes preliminary RL training on an initial problem set covering diverse categories. During this stage, the model’s weaknesses are identified as problems it consistently fails to solve or learns ineffectively. Based on failure cases that reflect the model’s weakest capabilities, in the subsequent Targeted Problem Synthesis stage, we group them by category, extract their underlying concepts, and recombine these concepts to synthesize new problems that target the model’s learning and mitigation of its weaknesses. In the final Augmented Training with Synthetic Problems stage, the model receives continuous training with the augmented high-quality synthetic problems, thereby enhancing its general reasoning abilities through more targeted training.
2.3 Self-aware Weakness Identification
Utilizing the policy model itself to identify its weakest capabilities, we begin by training it in a preliminary RL phase using an initial problem set $\mathbf{X}_{S}$, which consists of mathematical problems from $n$ diverse categories $\{\mathbf{D}_{i}\}_{i=1}^{n}$, each paired with a ground-truth answer $a$. As illustrated in Figure 2, we record the average accuracy $a_{i,t}$ of the model’s responses to each prompt $x_{i}$ at each epoch $t\in\{1,\dots,T_{1}\}$, where $T_{1}$ is the number of training epochs in this phase. We track the Failure Rate $F$ for each problem in the training set to identify those that the model consistently struggles to learn, which are considered its weaknesses. Specifically, such problems meet two criteria: (1) the model never reaches a response accuracy of 50% at any training epoch, and (2) the accuracy trend decreases over time, indicated by a negative slope:
$$
F(x_{i})=\mathbb{I}\left[\max_{t\in[1,T_{1}]}a_{i,t}<0.5\;\land\;\text{slope}\left(\{a_{i,t}\}_{t=1}^{T_{1}}\right)<0\right]. \tag{4}
$$
This metric captures both problems the model consistently fails to solve and those showing no improvement during sampling-based RL training, making them appropriate targets for training augmentation. After the weakness identification phase via the preliminary training on the initial training set $\mathbf{X}_{S}$, we employ the collected problems $\mathbf{X}_{F}=\left\{x_{i}\in\mathbf{X}_{S}\;\middle|\;F(x_{i})=1\right\}$ as seed problems for subsequent weakness-driven problem synthesis.
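The failure indicator of Eq. (4) can be sketched as follows. The paper does not specify how the slope is fit, so the least-squares fit over epochs is our assumption:

```python
def slope(values):
    """Least-squares slope of per-epoch accuracy over t = 1..T1
    (the fitting method is an assumption; Eq. 4 only requires a slope)."""
    t = list(range(1, len(values) + 1))
    n = len(values)
    t_bar, v_bar = sum(t) / n, sum(values) / n
    num = sum((ti - t_bar) * (vi - v_bar) for ti, vi in zip(t, values))
    den = sum((ti - t_bar) ** 2 for ti in t)
    return num / den

def is_weakness(acc_per_epoch):
    """Failure indicator F(x_i) of Eq. 4: accuracy never reaches 50%
    at any epoch AND the accuracy trend is decreasing."""
    return max(acc_per_epoch) < 0.5 and slope(acc_per_epoch) < 0
```

A problem stuck at low, declining accuracy (e.g. 0.4 → 0.3 → 0.2) is flagged; one that is low but improving is not, since the model is still learning it.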
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Reinforcement Learning Training Pipeline with Failure Analysis
### Overview
This diagram illustrates a reinforcement learning (RL) training pipeline that incorporates failure analysis and recombination of concepts to improve the synthetic set used for training. The process is divided into three main steps: Weakness Identification, Extracting and Recombining concepts from failures, and Augmenting synthetic set into RL training. It shows how failed cases are analyzed, broken down into concepts, recombined, and used to generate new training data.
### Components/Axes
The diagram is structured into three main columns representing the three steps. Key components include:
* **Initial Set:** Represents the starting point of the training data.
* **Failed Set:** Represents the subset of training cases that resulted in failure.
* **Synthetic Set:** Represents the artificially generated training data.
* **Filtered Set:** Represents the synthetic set after difficulty filtering.
* **Solutions:** Represents the correct answers or solutions to the training problems.
* **Concepts Extraction & Recombination:** A central block representing the process of analyzing failed cases and recombining concepts.
* **Problem Generation & Verification:** A block representing the generation of new training problems and their verification.
* **Training & Acc Recording:** A block representing the training process and the recording of accuracy.
* **Difficulty Filtering:** A process to filter the synthetic set based on difficulty.
* **Consistency Filtering:** A process to filter the synthetic set based on consistency.
* **Quality Verification:** A process to verify the quality of generated problems.
* **Domain:** Represents the knowledge domain for problem generation.
* **Problem Generation Model:** A model used to generate new problems.
* **Answer Generation Model:** A model used to generate answers to the problems.
### Detailed Analysis or Content Details
**Step 1: Weakness Identification in initial training steps.**
* An image of a sad robot face represents the initial state.
* The "Initial Set" contains mathematical expressions:
* ∫f(x)dx
* C(n) = n! / (n-r)!
* √a² + b²
* y = f(x)
* "Solutions" are represented by a stack of papers with checkmarks and crosses, indicating correct and incorrect solutions.
**Step 2: Extracting and Recombining the concepts from the failure cases to synthetic new questions.**
* "Failed Set" is indicated by a red 'X' symbol.
* "Split by Categories" leads to four distinct concept groups, each represented by a mathematical expression:
* ∫f(x)dx, C(n) = n! / (n-r)!
* d/dx, x ∈ S ∧ B
* lim, ∇
* y = f(x), log a
* "Concepts Extraction & Recombination" is represented by a block containing mathematical symbols (∫, d/dx, lim, ∇, x ∈ S ∧ B, n!, y = f(x), log a, {A}).
* The recombination process is represented by four probability distributions: P<sub>D1</sub>, P<sub>D2</sub>, P<sub>D3</sub>, P<sub>D4</sub>.
* "Problem Generation & Verification" includes:
* "Sampled Concepts" linked to "Domain" (represented by a gear icon).
* "(Planning) To create a challenging question within the precalculus..."
* "(Generated Problem) Consider the function f(x) which satisfies..."
* "Quality Verification" and "Answer Generation Model" (robot face).
**Step 3: Augmenting synthetic set into RL training.**
* A robot face represents the synthetic set generation.
* "Difficulty Filtering" filters the "Synthetic Set".
* "Filtered Set" is combined with the "Initial Set".
* A robot face represents the training process.
* "Training & Acc Recording" is represented by a robot face with a checkmark.
### Key Observations
* The diagram emphasizes a cyclical process of training, failure analysis, and improvement.
* Mathematical expressions are central to the training data and failure analysis.
* The recombination of concepts is a key step in generating new training data.
* Filtering mechanisms (difficulty and consistency) are used to refine the synthetic set.
* The use of robot faces throughout the diagram suggests an automated training process.
### Interpretation
The diagram depicts a sophisticated reinforcement learning training methodology that actively addresses weaknesses in the initial training data. By analyzing failed cases, extracting underlying concepts, and recombining them to generate new training examples, the system aims to improve its performance and robustness. The inclusion of filtering steps suggests a focus on creating high-quality synthetic data that is both challenging and consistent. The cyclical nature of the process indicates a continuous learning loop, where failures are used as opportunities for improvement. The diagram highlights the importance of not only generating synthetic data but also of intelligently analyzing and refining it based on the system's performance. The use of mathematical expressions suggests that the system is being trained on a task involving mathematical reasoning or problem-solving. The overall design suggests a system that is designed to learn from its mistakes and adapt to new challenges.
</details>
Figure 3: An overview of our proposed weakness-driven problem synthesis framework, which aims to mitigate the model’s reasoning limitations within the RLVR paradigm.
2.4 Targeted Problem Synthesis
Concept Extraction and Recombination. We synthesize new problems by extracting the underlying concepts $\mathbf{C}_{F}$ from the collected seed questions $\mathbf{X}_{F}$ and strategically recombining them to generate questions that target similar capabilities. Specifically, the extracted concepts are first assigned to their respective categories $\mathbf{D}_{i}$ (e.g., mathematical topics such as Algebra or Geometry) based on the corresponding seed problem $x_{i}$, and are subsequently sampled and recombined to generate problems within the same category. Inspired by [15, 73], we enhance the coherence and semantic fluency of synthetic problems by computing co-occurrence probabilities and embedding similarities among concepts within each category, enabling more appropriate sampling and recombination of relevant concepts. This targeted sampling ensures that the synthesized problems remain semantically coherent and avoids combining concepts from unrelated sub-topics or irrelevant knowledge points, which could otherwise result in invalid or confusing questions. Further details on the co-occurrence calculation and sampling algorithm are provided in Appendix E.
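As a rough sketch of the co-occurrence half of this step (the embedding-similarity half and the exact sampling algorithm of Appendix E are omitted, so this is an illustrative simplification rather than the paper's method):

```python
import random
from collections import Counter
from itertools import combinations

def cooccurrence_weights(seed_concept_sets):
    """Count how often two concepts appear in the same seed problem
    within one category. Higher counts mean the pair is more natural
    to recombine into a single coherent question."""
    counts = Counter()
    for concepts in seed_concept_sets:
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts

def sample_combination(counts, rng=random):
    """Sample a concept pair with probability proportional to its
    co-occurrence count, keeping recombinations semantically close."""
    pairs, weights = zip(*counts.items())
    return rng.choices(pairs, weights=weights, k=1)[0]
```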
Intuitively, categories exhibiting more pronounced weaknesses demand additional learning support. To optimize the efficiency of targeted problem synthesis and weakness mitigation in subsequent RL training, we allocate the augmentation budget, i.e., the concept combinations used as inputs for problem synthesis, across categories based on the model’s category-specific failure rates $F_{\mathbf{D}}$ from the preliminary training phase. Specifically, we normalize these failure rates $F_{\mathbf{D}}$ across categories to determine the allocation weights for problem synthesis. Given a total augmentation budget $|\mathbf{X}_{T}|$ , the number of concept combinations allocated to domain $\mathbf{D}_{i}$ is computed as:
$$
|\mathbf{X}_{T,\mathbf{D}_{i}}|=|\mathbf{X}_{T}|\cdot P_{\mathbf{D}_{i}}=|\mathbf{X}_{T}|\cdot\frac{F_{\mathbf{D}_{i}}}{\sum_{j=1}^{n}F_{\mathbf{D}_{j}}}, \tag{5}
$$
where $F_{\mathbf{D}_{i}}$ is the failure rate of problems in category $\mathbf{D}_{i}$ within the initial training set. The sampled and recombined concepts then serve as inputs for subsequent problem generation.
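The budget split of Eq. (5) amounts to a weighted allocation; in this sketch the rounding policy (floor each share, then hand the remainder to the weakest category) is our assumption, since the paper only specifies the proportional weights:

```python
def allocate_budget(total_budget, failure_rates):
    """Eq. 5: split the synthesis budget |X_T| across categories in
    proportion to their normalized failure rates F_D. Categories where
    the model fails more often receive more synthetic problems."""
    z = sum(failure_rates.values())
    alloc = {d: int(total_budget * f / z) for d, f in failure_rates.items()}
    # hand the leftover from flooring to the weakest (highest-failure) category
    leftover = total_budget - sum(alloc.values())
    alloc[max(failure_rates, key=failure_rates.get)] += leftover
    return alloc
```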
Problem Generation and Quality Verification. After extracting and recombining the concepts associated with the model’s weakest capabilities, we employ a strong instruction model, which does not perform deep reasoning, to generate new problems based on the category label and the recombined concepts. We instruct the model to first generate rationales that explore how the concept combinations can be integrated to produce a well-formed problem. To ensure the synthetic problems align with the RLVR setting, the model is also instructed to avoid generating multiple-choice, multi-part, or proof-based questions [1]. For the detailed prompt used for concept-based problem generation, please refer to Appendix J. For quality verification of the synthetic problems, we prompt general instruction LLMs multiple times to evaluate each problem and its rationale across multiple dimensions, including concept coverage, factual accuracy, and solvability, assigning an overall rating of bad, acceptable, or perfect. Only problems receiving ‘perfect’ ratings above a predefined threshold and no ‘bad’ ratings are retained for subsequent utilization.
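The retention rule above reduces to a small aggregation over judge ratings. A minimal sketch, where the 0.6 threshold is illustrative (the paper only says "a predefined threshold"):

```python
def passes_quality_check(ratings, perfect_threshold=0.6):
    """Aggregate multiple LLM judgments ('bad' / 'acceptable' / 'perfect'):
    a synthetic problem is kept only if no judge rated it 'bad' and the
    share of 'perfect' ratings clears the threshold."""
    if "bad" in ratings:
        return False
    return ratings.count("perfect") / len(ratings) >= perfect_threshold
```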
Reference Answer Generation. Since alignment between the model’s final answer and the reference answer is the primary training signal in RLVR, a rigorous verification of the reference answers for synthetic problems is essential to ensure training stability and effectiveness. To this end, we employ a strong reasoning model (e.g., QwQ-32B [47]) to label reference answers for synthetic problems through a self-consistency paradigm. Specifically, we prompt it to generate multiple responses for each problem and use Math-Verify to assess answer equivalence, which ensures that consistent answers of different forms (e.g., fractions and decimals) are correctly recognized as equal. Only problems with at least 50% consistent answers are retained, as highly inconsistent answers are unreliable as ground truth and may indicate that the problems are excessively complex or unsolvable.
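The self-consistency labeling step can be sketched as a majority vote. This simplification compares answers by exact string match; the actual pipeline uses Math-Verify for semantic equivalence (e.g., recognizing 1/2 and 0.5 as equal):

```python
from collections import Counter

def label_reference_answer(sampled_answers, min_consistency=0.5):
    """Self-consistency labeling: take the majority answer across multiple
    reasoning-model samples, and keep the problem only if that answer's
    share is at least `min_consistency` (50% in the paper). Exact string
    match here stands in for Math-Verify equivalence checking."""
    majority, count = Counter(sampled_answers).most_common(1)[0]
    if count / len(sampled_answers) >= min_consistency:
        return majority
    return None  # discard: answers too inconsistent to trust as ground truth
```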
Difficulty Filtering. The most widely used RLVR algorithms, such as GRPO, compute the advantage of each token in a response by comparing its reward to those of other responses to the same prompt. When all responses are equally accurate (either all correct or all incorrect), the advantages uniformly degrade to zero, causing vanishing gradients for policy updates and inefficient training [40, 63]. A recent study [53] further shows that RLVR training is more efficient on problems of appropriate difficulty. We therefore select synthetic problems of appropriate difficulty based on the initially trained model’s accuracy on them. Specifically, we sample multiple responses per synthetic problem with the initially trained model and retain only those whose accuracy falls within a target range $[\text{acc}_{\text{low}},\text{acc}_{\text{high}}]$ (e.g., $[25\%,75\%]$). This strategy ensures that the model engages with learnable problems, improving both the stability and efficiency of RLVR training.
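The accuracy-band filter amounts to a one-line selection once per-problem rollout accuracies have been measured; a minimal sketch, with the paper's example range as defaults:

```python
def filter_by_difficulty(problems, rollout_accuracy,
                         acc_low=0.25, acc_high=0.75):
    """Keep problems whose sampled accuracy under the initially
    trained policy lies in [acc_low, acc_high], so every retained
    prompt yields a mix of correct and incorrect rollouts and hence
    a non-zero GRPO advantage signal.
    `rollout_accuracy` maps problem -> fraction of correct rollouts."""
    return [p for p in problems
            if acc_low <= rollout_accuracy[p] <= acc_high]
```

Problems at the band edges (accuracy 0 or 1) are exactly the ones where all-identical rewards would zero out the group-relative advantages.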
2.5 Augmented Training with Synthetic Problems
After this rigorous pipeline of problem generation, answer generation, and verification, the budget of synthetic problems allocated to each category is further adjusted using the weights in Eq. 5 to ensure their comprehensive and efficient utilization, yielding $\mathbf{X}^{\prime}_{T}$ . We merge the retained synthetic problems $\mathbf{X}^{\prime}_{T}$ with the initial training set $\mathbf{X}_{S}$ to form the augmented training set $\mathbf{X}_{A}=[\mathbf{X}_{S};\mathbf{X}^{\prime}_{T}]$ . We then continue training the initially trained model on $\mathbf{X}_{A}$ in a second stage of augmented RLVR, aiming to mitigate the model’s weaknesses through exploration of the synthetic problems.
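The per-category budget adjustment can be sketched as proportional allocation under normalized weights. The exact weighting follows the paper's Eq. 5 (not reproduced here); this sketch only shows the proportional split with leftover slots assigned to the highest-weight categories:

```python
def allocate_category_budget(weights, total_budget):
    """Distribute a synthetic-problem budget across weakness
    categories in proportion to their weights (an illustrative
    sketch; the paper's actual weights come from its Eq. 5).
    Rounds down, then gives leftover slots to the heaviest
    categories so the totals sum to `total_budget`."""
    total_w = sum(weights.values())
    shares = {c: int(total_budget * w / total_w)
              for c, w in weights.items()}
    leftover = total_budget - sum(shares.values())
    for c in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[c] += 1
    return shares
```

A heavily weighted weakness category (e.g. Geometry for a given model) thus receives a proportionally larger share of the synthetic problems.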
3 Experiments
| Model | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | GaoKao 2023 | AMC23 | AIME24 (Avg@1/32) | AIME25 (Avg@1/32) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 2.5 3B Base | | | | | | | | | |
| Qwen2.5-3B | 69.9 | 46.0 | 18.8 | 19.9 | 34.8 | 27.5 | 0.0 / 2.2 | 0.0 / 1.5 | 27.1 |
| Qwen2.5-3B-IT | 84.2 | 62.2 | 26.5 | 27.9 | 53.5 | 32.5 | 6.7 / 5.0 | 0.0 / 2.3 | 36.7 |
| BaseRL-3B | 86.3 | 66.0 | 25.4 | 31.3 | 57.9 | 40.0 | 10.0 / 9.9 | 6.7 / 3.5 | 40.4 |
| SwS-3B | 87.0 | 69.6 | 27.9 | 34.8 | 59.7 | 47.5 | 10.0 / 8.4 | 6.7 / 7.1 | 42.9 |
| $\Delta$ | +0.7 | +3.6 | +2.5 | +3.5 | +1.8 | +7.5 | +0.0 / -1.5 | +0.0 / +3.6 | +2.5 |
| Qwen 2.5 7B Base | | | | | | | | | |
| Qwen2.5-7B | 88.1 | 63.0 | 27.6 | 30.5 | 55.8 | 35.0 | 6.7 / 5.4 | 0.0 / 1.2 | 38.3 |
| Qwen2.5-7B-IT | 91.7 | 75.6 | 38.2 | 40.6 | 63.9 | 50.0 | 16.7 / 10.5 | 13.3 / 6.7 | 48.8 |
| Open-Reasoner-7B | 93.6 | 80.4 | 39.0 | 45.6 | 72.0 | 72.5 | 10.0 / 16.8 | 13.3 / 17.9 | 53.3 |
| SimpleRL-Base-7B | 90.8 | 77.2 | 35.7 | 41.0 | 66.2 | 62.5 | 13.3 / 14.8 | 6.7 / 6.7 | 49.2 |
| BaseRL-7B | 92.0 | 78.4 | 36.4 | 41.6 | 63.4 | 45.0 | 10.0 / 14.5 | 6.7 / 6.5 | 46.7 |
| SwS-7B | 93.9 | 82.6 | 41.9 | 49.6 | 71.7 | 67.5 | 26.7 / 18.3 | 20.0 / 18.5 | 56.7 |
| $\Delta$ | +1.9 | +4.2 | +5.5 | +8.0 | +8.3 | +22.5 | +16.7 / +3.8 | +13.3 / +12.0 | +10.0 |
| Qwen 2.5 7B Math | | | | | | | | | |
| Qwen2.5-Math-7B | 43.2 | 72.0 | 35.7 | 17.6 | 31.4 | 47.5 | 10.0 / 9.4 | 0.0 / 2.9 | 32.2 |
| Qwen2.5-Math-7B-IT | 93.3 | 80.6 | 36.8 | 36.6 | 64.9 | 45.0 | 6.7 / 7.2 | 13.3 / 6.2 | 47.2 |
| PRIME-RL-7B | 93.2 | 82.0 | 41.2 | 46.1 | 67.0 | 60.0 | 23.3 / 16.1 | 13.3 / 16.2 | 53.3 |
| SimpleRL-Math-7B | 89.8 | 78.0 | 27.9 | 43.4 | 64.2 | 62.5 | 23.3 / 24.5 | 20.0 / 15.6 | 51.1 |
| Oat-Zero-7B | 90.1 | 79.4 | 38.2 | 42.4 | 67.8 | 70.0 | 43.3 / 29.3 | 23.3 / 11.8 | 56.8 |
| BaseRL-Math-7B | 90.2 | 78.8 | 37.9 | 43.6 | 64.4 | 57.5 | 26.7 / 23.0 | 20.0 / 14.0 | 51.9 |
| SwS-Math-7B | 91.9 | 83.8 | 41.5 | 47.7 | 71.4 | 70.0 | 33.3 / 25.9 | 26.7 / 18.2 | 58.3 |
| $\Delta$ | +1.7 | +5.0 | +3.6 | +4.1 | +7.0 | +12.5 | +6.7 / +2.9 | +6.7 / +4.2 | +6.4 |
| Qwen 2.5 32B Base | | | | | | | | | |
| Qwen2.5-32B | 90.1 | 66.8 | 34.9 | 29.8 | 55.3 | 50.0 | 10.0 / 4.2 | 6.7 / 2.5 | 42.9 |
| Qwen2.5-32B-IT | 95.6 | 83.2 | 42.3 | 49.5 | 72.5 | 62.5 | 23.3 / 15.0 | 20.0 / 13.1 | 56.1 |
| Open-Reasoner-32B | 95.5 | 82.2 | 46.3 | 54.4 | 75.6 | 57.5 | 23.3 / 23.5 | 33.3 / 31.7 | 58.5 |
| SimpleRL-Base-32B | 95.2 | 81.0 | 46.0 | 47.4 | 69.9 | 82.5 | 33.3 / 26.2 | 20.0 / 15.0 | 59.4 |
| BaseRL-32B | 96.1 | 85.6 | 43.4 | 54.7 | 73.8 | 85.0 | 40.0 / 30.7 | 6.7 / 24.6 | 60.7 |
| SwS-32B | 96.3 | 89.4 | 47.1 | 60.5 | 80.3 | 90.0 | 43.3 / 33.0 | 40.0 / 31.8 | 68.4 |
| $\Delta$ | +0.2 | +3.8 | +3.7 | +5.8 | +6.5 | +5.0 | +3.3 / +2.3 | +33.3 / +7.2 | +7.7 |
Table 1: We report the detailed performance of our SwS implementation across various base models and multiple benchmarks. AIME is evaluated using two metrics: Avg@1 (single-run performance) and Avg@32 (average over 32 runs).
3.1 Experimental Setup
Models and Datasets. We employ the Qwen2.5-base series [57, 58] with model sizes ranging from 3B to 32B in our experiments. For concept extraction and problem generation, we employ LLaMA-3.3-70B-Instruct [8], and for concept embedding, we use LLaMA-3.1-8B-base. To verify the quality of the synthetic questions, we use both LLaMA-3.3-70B-Instruct and Qwen2.5-72B-Instruct [57] to evaluate them and filter out low-quality samples. For answer generation, we use Skywork-OR1-Math-7B [12] when training models with sizes up to 7B, and QwQ-32B [47] for the 32B experiments. We employ the SwS pipeline to generate 40k synthetic problems for each base model. All prompts for each procedure in SwS can be found in Appendix J. We adopt GRPO [40] as the RL algorithm; full implementation details are in Appendix B.
For the initial training set used in the preliminary RL training for weakness identification, we use MATH-12k [13] for models with sizes up to 7B. As the 14B and 32B models saturate early on MATH-12k, we instead use a combined dataset of 17.5k samples from the DAPO [63] English set and the LightR1 [53] Stage-2 set.
Evaluation. We evaluate the models on a wide range of mathematical reasoning benchmarks, including GSM8K [4], MATH-500 [26], Minerva Math [19], Olympiad-Bench [11], Gaokao-2023 [71], AMC [33], and AIME [34]. We report Pass@1 (Avg@1) accuracy across all benchmarks and additionally include the Avg@32 metric for the competition-level AIME benchmarks to enhance evaluation robustness. For detailed descriptions of the evaluation benchmarks, see Appendix I.
Baseline Setting. Our baselines include the base model, its post-trained Instruct version (e.g., Qwen2.5-7B-Instruct), and the initially trained model further trained on the initial dataset for the same number of steps as our augmented RL training. To further highlight the effectiveness of the SwS framework, we also compare the model trained on the augmented problem set against recent advanced RL-based models, including SimpleRL [67], Open Reasoner [14], PRIME [6], and Oat-Zero [28].
3.2 Main Results
The overall experimental results are presented in Table 1. Our SwS framework delivers consistent improvements across benchmarks of varying difficulty and across model scales, with the most significant gains observed in models with 7B or more parameters. Specifically, the SwS-enhanced 7B and 32B models show absolute improvements of +10.0% and +7.7%, respectively, underscoring the effectiveness and scalability of the framework. When initialized with MATH-12k, SwS yields strong gains on competition-level benchmarks, achieving +16.7% and +13.3% on AIME24 and AIME25 with Qwen2.5-7B. These results indicate that the synthesized samples match well-crafted human-written problems in quality and difficulty, and demonstrate the effectiveness of generating synthetic data tailored to model capabilities.
3.3 Weakness Mitigation from Augmented Training
The motivation behind SwS is to mitigate model weaknesses by explicitly targeting failure cases during training. To demonstrate its effectiveness, we use Qwen2.5-7B to analyze the ratios of consistently failed problems in the initial training set (MATH-12k) across three models: the initially trained model, the model continually trained on the initial training set, and the model trained on the set augmented with synthetic problems from the SwS pipeline. As shown in Figure 4, continued training on the augmented set enables the model to solve a larger proportion of previously failed problems in most domains than training on the initial set alone, with the greatest gains in its weakest areas: Intermediate Algebra (20%), Geometry (5%), and Precalculus (5%). Notably, these improvements are achieved even though each original problem is sampled four times less frequently in the augmented set than when training on the original dataset alone, highlighting the efficiency of SwS-generated synthetic problems in RL training.
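The "consistently failed" (zero-ratio) statistic used in this analysis can be sketched as the per-category fraction of problems on which every rollout earns zero reward:

```python
from collections import defaultdict

def zero_ratios_by_category(results):
    """Per category, compute the fraction of problems the model
    failed on every rollout, i.e. the 'zero ratio' of consistently
    failed problems. `results` is an iterable of
    (category, [0/1 rollout rewards]) pairs."""
    total = defaultdict(int)
    failed = defaultdict(int)
    for category, rewards in results:
        total[category] += 1
        if sum(rewards) == 0:
            failed[category] += 1
    return {c: failed[c] / total[c] for c in total}
```

Categories with the highest zero ratios under the initially trained model (here, Precalculus, Geometry, and Intermediate Algebra) are exactly the weaknesses the synthesis stage targets.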
4 Extensions and Analysis
| Model | GSM8K | AIME24 (Pass@32) | Prealgebra | Intermediate Algebra | Algebra | Precalculus | Number Theory | Counting & Probability | Geometry |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strong Student | 92.0 | 13.8 | 87.7 | 58.7 | 93.8 | 63.2 | 86.4 | 71.2 | 66.8 |
| Weak Teacher | 93.3 | 7.2 | 88.2 | 64.3 | 95.5 | 71.2 | 93.0 | 81.4 | 63.0 |
| Trained Student | 93.6 | 17.5 | 90.5 | 64.4 | 97.7 | 74.6 | 95.1 | 80.4 | 67.5 |
Table 2: Performance of the weak teacher and strong student models on two representative benchmarks, along with category-specific results on MATH-500.
| Model | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | GaoKao 2023 | AMC23 | AIME24 (Avg@1/32) | AIME25 (Avg@1/32) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-14B-IT | 94.7 | 79.6 | 41.9 | 45.6 | 68.6 | 57.5 | 16.7 / 11.6 | 6.7 / 10.9 | 51.4 |
| + BaseRL | 94.5 | 85.4 | 44.1 | 52.1 | 71.7 | 65.0 | 20.0 / 21.6 | 20.0 / 22.3 | 56.6 |
| + SwS-SE | 95.6 | 85.0 | 46.0 | 53.5 | 74.8 | 67.5 | 20.0 / 19.8 | 20.0 / 17.8 | 57.8 |
| $\Delta$ | +1.1 | -0.4 | +1.9 | +1.4 | +3.1 | +2.5 | +0.0 / -1.8 | +0.0 / -4.5 | +1.2 |
Table 3: Experimental results of extending the SwS framework to the Self-evolving paradigm on the Qwen2.5-14B-Instruct model.
4.1 Weak-to-Strong Generalization for SwS
Employing a powerful frontier model like QwQ [47] helps ensure answer quality. However, when training a top-performing reasoning model, no stronger model exists to produce reference answers for problems identified as its weaknesses. To explore the potential of applying our SwS pipeline to enhance state-of-the-art models, we extend it to the Weak-to-Strong Generalization [2] setting, using a generally weaker teacher (which may still outperform the stronger model in specific domains) to label reference answers for the synthetic problems.
Intuitively, a weaker teacher may produce mislabeled answers, which could significantly impair subsequent RL training. During the difficulty filtering stage, however, this risk is mitigated: the initially trained policy assesses the difficulty of synthetic problems and rarely reproduces the same incorrect answers as the weaker teacher. As a byproduct, mislabeled cases are naturally filtered out alongside overly complex samples through accuracy-based screening. An experimental analysis of the validity of difficulty-based filtering in ensuring label correctness is presented in Table 5.
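This "difficulty screening doubles as label verification" argument can be made concrete with a small sketch: a problem is kept only if the student's measured accuracy against the teacher's label falls in the learnable band, so a wrong teacher label (which the student almost never matches) lands below the band and is discarded. The `equal` callback stands in for an answer-equivalence checker such as Math-Verify:

```python
def screen_teacher_label(student_answers, teacher_label, equal,
                         acc_low=0.25, acc_high=0.75):
    """Illustrative sketch: keep a weak-teacher-labeled problem only
    if the student policy's agreement with the teacher's label lies
    in [acc_low, acc_high]. A mislabeled problem yields near-zero
    agreement and is dropped together with overly hard ones."""
    matches = sum(1 for a in student_answers if equal(a, teacher_label))
    acc = matches / len(student_answers)
    return acc_low <= acc <= acc_high
```

Only labels the student can partially (but not trivially) reproduce survive, which is precisely the regime where the teacher is reliable but the student still has something to learn.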
[Figure 4 image: grouped bar chart of the zero ratio (%) of consistently failed problems per MATH-12k category (Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus) for the Init RL, Base RL, and Synt RL models.]
Figure 4: The ratios of consistently failed problems from different categories in the MATH-12k training set under different training configurations. (Base model: Qwen2.5-7B).
We use the initially trained Qwen2.5-7B-Base as the student and Qwen2.5-Math-7B-Instruct as the teacher. Table 2 presents their performance on popular benchmarks and MATH-12k categories, where the student model generally outperforms the teacher. Nevertheless, the student policy further improves after training on problems labeled by the weak teacher. This improvement stems from the difficulty filtering process, which removes problems with consistent student-teacher disagreement and retains those where the teacher is reliable but the student struggles, enabling targeted training on weaknesses. A detailed analysis can be found in Appendix 11.
4.2 Self-evolving Targeted Problem Synthesis
In this section, we explore the potential of a self-evolving paradigm for addressing model weaknesses, in which the policy itself executes the full SwS pipeline. This paradigm leverages self-consistency to guide the model toward effective trajectories with accurate answers [75], while drawing on its general instruction-following capabilities for question generation and quality filtering to enhance reasoning.
We use Qwen2.5-14B-Instruct as the base policy for its balance between computational efficiency and instruction-following performance. As shown in Table 3, the self-evolving SwS pipeline improves the baseline by 1.2% on average across all benchmarks, especially on medium-difficulty benchmarks such as Gaokao and AMC. Although performance declines on AIME, we attribute this to the initial training data from DAPO and LightR1 already being specifically tailored to that benchmark. For further discussion of the self-evolving SwS framework, refer to Appendix G.
[Figure 5 image: line charts over training steps comparing targeted (weakness-driven) and random problem synthesis: (a) overall accuracy (Pass@1), (b) competition-level accuracy (Avg@32), and (c) training batch accuracy.]
Figure 5: Comparison of accuracy improvements using (a) Pass@1 on full benchmarks evaluated in Table 1 and (b) Avg@32 on the competition-level benchmarks. (c) illustrates the proportion of prompts within a batch that achieved 100% correctness across multiple rollouts during training.
[Image: line chart, panel (a) Overall Accuracy (%), showing average accuracy versus training steps for the Difficulty, Simple, and Medium problem-selection settings.]
<details>
<summary>x9.png Details</summary>

### Visual Description
Line chart "(b) Competition Level Accuracy (%)": average accuracy (%) on the y-axis (≈8.4–15.7) versus training steps (0–200) on the x-axis, with one curve per augmented set: Difficulty (red), Simple (green), and Medium (blue). All three curves trend upward with diminishing returns. The Difficulty curve starts lowest (≈9.1% at step 0) but rises fastest and ends highest (≈14.7% at step 200); Simple climbs from ≈9.7% to ≈12.7% and Medium from ≈10.2% to ≈13.5%, both at a steadier pace.
</details>
<details>
<summary>x10.png Details</summary>

### Visual Description
Line chart "(c) Training Batch Accuracy (%)": average accuracy (%) on the y-axis (0–60) versus training steps (0–200) on the x-axis, again with Difficulty (red), Simple (teal), and Medium (blue) curves. All three rise and then flatten: Medium climbs most steeply (≈1% at step 0 to ≈49% at step 200), Simple reaches ≈36%, and Difficulty improves slowest (≈2% to ≈22%).
</details>
Figure 6: Comparison of incorporating synthetic problems of varying difficulty levels during the augmented RL training. For a detailed description of accuracy trends on evaluation benchmarks and the training set, refer to the caption in Figure 5.
4.3 Weakness-driven Selection
In this section, we explore an alternative extension that augments the initial training set using identified weaknesses and a larger mathematical reasoning dataset. Specifically, we take the Qwen2.5-7B model, identify its weaknesses on the MATH-12k training set, and retrieve problems from Big-Math [1] that align with its failure cases, incorporating them into the initial training set for augmentation. We employ a category-specific selection strategy similar to the budget allocation in Eq. 5, using KNN [5] to identify the most relevant problems within each category. The total augmentation budget is again set to 40k. We compare this approach against a baseline in which the model is trained on the initial set augmented with randomly selected problems from Big-Math. Details of the selection procedure are provided in Appendix H.
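The category-wise KNN retrieval described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` stands in for any sentence-embedding function, and all names are hypothetical. The per-category budgets follow the proportional-allocation idea of Eq. 5, and each category keeps the pool problems closest to its failure cases.

```python
import numpy as np

def select_by_weakness(failed_by_cat, pool_by_cat, embed, total_budget=40_000):
    """Retrieve pool problems nearest to failure cases, with the budget
    split across categories in proportion to failure counts (illustrative)."""
    # Allocate the budget proportionally to how many failures each category has.
    n_failed = {c: len(q) for c, q in failed_by_cat.items()}
    total = sum(n_failed.values())
    budgets = {c: round(total_budget * n / total) for c, n in n_failed.items()}

    selected = []
    for cat, failures in failed_by_cat.items():
        pool = pool_by_cat.get(cat, [])
        k = min(budgets[cat], len(pool))
        if k == 0 or not failures:
            continue
        pool_emb = np.stack([embed(q) for q in pool])
        fail_emb = np.stack([embed(q) for q in failures])
        # Distance from every pool problem to its nearest failure case.
        dists = np.linalg.norm(pool_emb[:, None, :] - fail_emb[None, :, :], axis=-1)
        nearest = dists.min(axis=1)
        # Keep the k pool problems closest to this category's failure set.
        for i in np.argsort(nearest)[:k]:
            selected.append(pool[i])
    return selected
```

With a toy embedding (problems represented directly as vectors), each category returns the pool item closest to its failures, up to its budget share.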
As shown in Figure 5, the model trained with weakness-driven augmentation outperforms the random augmentation strategy in accuracy both across all evaluated benchmarks (Figure 5.a) and on the competition-level subset (Figure 5.b), demonstrating the effectiveness of the weakness-driven selection strategy. In Figure 5.c, it is worth noting that the model quickly fits the randomly selected problems during training, after which they cease to provide meaningful training signals under the GRPO algorithm. In contrast, because the failure cases highlight specific weaknesses in the model's capabilities, the problems selected from them remain more challenging and better aligned with its deficiencies, providing richer learning signals and promoting continued development of reasoning skills.
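Why fitted problems stop contributing can be made concrete. In GRPO (Shao et al. [39]), advantages are reward deviations normalized within each rollout group, so a problem the model always solves (all rewards equal) yields zero advantage for every rollout and hence no policy gradient. A minimal sketch, with an illustrative helper name:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: per-group reward z-scores (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    # All-equal rewards give zero deviation, hence zero advantage / gradient.
    return (r - r.mean()) / (r.std() + eps)

mixed  = grpo_advantages([1, 0, 1, 0])  # informative problem: nonzero advantages
solved = grpo_advantages([1, 1, 1, 1])  # fully fitted problem: all-zero advantages
```

This is why the randomly selected problems, once fitted, contribute nothing further, whereas weakness-aligned problems keep producing reward variance within groups.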
4.4 Impact of Question Difficulty
We ablate the impact of the difficulty levels of synthetic problems used in the augmented RL training. In this section, we define the difficulty of a synthetic problem based on the accuracy of multiple rollouts generated by the initially trained model, base from Qwen2.5-7B. We incorporate synthetic problems of three predefined difficulty levels—simple, medium, and hard—into the augmented RL training. These levels correspond to accuracy ranges of $[5,7]$ , $[3,5]$ , and $[1,4]$ out of 8 sampled responses, respectively. For each level, we sample 40k examples and combine them with the initial training set for a second training stage lasting 200 steps.
The experimental results are shown in Figure 6. Similar to the findings in Section 4.3, the model fits more quickly on the simple augmented set and initially achieves the best performance across all evaluation benchmarks, including competition-level tasks, but then saturates with no further improvement. In contrast, the medium and hard augmented sets lead to slower convergence on the training set but result in more sustained performance gains on the evaluation set, with the hardest problems providing the longest-lasting training benefits.
<details>
<summary>x11.png Details</summary>

### Visual Description
An original geometry problem from the failed set, four conceptually linked synthetic problems labeled by difficulty (with answers and model accuracy), and the concepts extracted from the original problem.
**Original Problem:**
Equilateral triangle *ABC* has side length 600. Points *P* and *Q* lie outside the plane of *ΔABC*, on opposite sides of it, with *PA* = *PB* = *PC* and *QA* = *QB* = *QC*. The planes of *ΔPAB* and *ΔQAB* form a 120° dihedral angle. There is a point *O* whose distance from each of *A*, *B*, *C*, *P*, and *Q* is *d*. Find *d*.
**Synthetic Problems:**
* **Simple:** Two cones, A and B, are similar, with cone A tangent to a sphere of radius *r*; cone A has height *h*. If the ratio of cone B's height to cone A's height is *k*, find the ratio of their surface areas. (Answer: *k²*; model accuracy: 100%)
* **Medium:** From a point *P*, two tangents to a circle of radius *r* are drawn so that the angle between them is 60°. If each tangent has length *√3*, find the distance from *P* to the center. (Answer: *2r*; model accuracy: 50%)
* **Hard:** In triangle *ABC*, let *I* be the incenter and *E* the excenter opposite *A*. If *AE* = 5, *AI* = 3, and *EI* is tangent to the incircle at *D*, find the radius. (Answer: 2; model accuracy: 6.25%)
* **Unsolvable:** In triangle *ABC* with *AB* = 7, *AC* = 9, and *∠A* = 60°, let *D* be the midpoint of *BC*. Given that *BD* is 3 more than *DC*, find *AD*. (Answer: 15/2; model accuracy: 0%; the stated conditions are contradictory, since a midpoint requires *BD* = *DC*.)
**Extracted Concepts:**
* Geometric shapes and their properties
* Properties of equilateral triangles
* Understanding of points and planes in 3D space
* Distance and midpoint formulas in 3D space
* Properties of perpendicular lines and planes
</details>
Figure 7: Illustration of a geometry problem from the MATH-12k failed set, with extracted concepts and conceptually linked synthetic problems across different difficulty levels.
4.5 Case Study
Figure 7 presents an illustration of a geometry failure case from the MATH-12k training set, accompanied by extracted concepts and our weakness-driven synthetic questions of varying difficulty levels, all closely aligned with the original question. The question focuses on three-dimensional distance and triangle understanding, with key concepts such as “Properties of equilateral triangles” and “Distance and midpoint formulas in 3D space” representing essential knowledge required to solve the problem. Notably, the corresponding synthetic questions exhibit similar semantics—such as “finding distance” in Medium and “understanding triangles” in Hard. Practicing on such targeted problems helps mitigate weaknesses and enhances reasoning capabilities within the relevant domain.
5 Conclusion
In this work, we introduce the Self-aware Weakness-driven problem Synthesis (SwS) framework for reinforcement learning in LLM reasoning, which synthesizes problems based on weaknesses identified from the model's failure cases during a preliminary training phase and incorporates them into subsequent augmented training. We conduct a detailed analysis of incorporating such synthetic problems into training and find that focusing on the model's failures can enhance its reasoning generalization and mitigate its weaknesses, resulting in overall performance improvements. Furthermore, we extend the framework to the paradigms of Weak-to-Strong Generalization, Self-evolving, and Weakness-driven Selection, demonstrating its comprehensiveness and robustness.
6 Discussions, Limitations and Future Work
This paper presents a comprehensive Self-aware Weakness-driven problem Synthesis (SwS) framework to address the model's reasoning deficiencies through reinforcement learning (RL) training. Although the SwS framework is effective across a wide range of model sizes, it still has several limitations: (1) Employing both a strong instruction model and an answer-labeling reasoning model incurs additional computation and time costs. (2) Our framework focuses mainly on the RL setting, as our primary goal is to mitigate the model's weaknesses by fully activating its inherent reasoning abilities without distilling external knowledge. Exploring how to leverage a similar pipeline for enhancing model capabilities through fine-tuning or distillation remains an open direction for future research. (3) The synthetic problems generated by open-source instruction models in the SwS framework may still lack sufficient complexity to elicit the deeper reasoning capabilities of the model, especially on more challenging problems. This limitation is most pronounced in the Self-evolving setting in Section 4.2, which relies solely on a 14B model for problem generation, with performance improvements limited to moderate or simple benchmarks. It also raises questions about the actual utility of problems generated by LLaMA-3.3-70B-Instruct in the main experiments on highly challenging benchmarks such as AIME. One potential strategy is to use Evol-Instruct [56, 30] to further refine the generated problems toward the desired level of difficulty. However, how to effectively raise the upper bound of difficulty in synthetic problems generated by instruction models remains an open problem and warrants further exploration.
In the future, we aim to identify model weaknesses from multiple perspectives beyond simple answer accuracy, with the goal of synthesizing more targeted problems to improve sample efficiency. Additionally, we plan to extend the SwS framework to more general tasks beyond reasoning, incorporating an off-the-shelf reward model to provide feedback instead of verifiable answers. Lastly, we also seek to implement the SwS pipeline in more advanced reasoning models equipped with Long-CoT capabilities, further pushing the boundaries of open-source large reasoning models.
References
- Albalak et al. [2025] Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, et al. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387, 2025.
- Burns et al. [2023] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
- Chu et al. [2025] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Cover and Hart [1967] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967.
- Cui et al. [2025] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- Face [2025] Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1.
- Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Guan et al. [2025] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- He et al. [2024] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024.
- He et al. [2025] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Hu et al. [2025] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.
- Huang et al. [2024] Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, and Weizhu Chen. Key-point-driven data synthesis with its enhancement on mathematical reasoning. arXiv preprint arXiv:2403.02333, 2024.
- Jaech et al. [2024] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- Kang et al. [2023] Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, and Sung Ju Hwang. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. Advances in Neural Information Processing Systems, 36:48573–48602, 2023.
- Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- Lewkowycz et al. [2022] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- Li et al. [2024a] Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, and Houwen Peng. Common 7b language models already possess strong math capabilities. arXiv preprint arXiv:2403.04706, 2024a.
- Li et al. [2024b] Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. arXiv preprint arXiv:2411.16594, 2024b.
- Li et al. [2025a] Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling. arXiv preprint arXiv:2502.11886, 2025a.
- Li et al. [2025b] Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Haizhen Huang, Weiwei Deng, Ying Nian Wu, Yeyun Gong, et al. TL;DR: Too long, do re-weighting for efficient llm reasoning compression. arXiv preprint arXiv:2506.02678, 2025b.
- Li et al. [2025c] Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025c.
- Liang et al. [2024] Xiao Liang, Xinyu Hu, Simiao Zuo, Yeyun Gong, Qiang Lou, Yi Liu, Shao-Lun Huang, and Jian Jiao. Task oriented in-domain data augmentation. arXiv preprint arXiv:2406.16694, 2024.
- Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- Liu et al. [2025a] Haoxiong Liu, Yifan Zhang, Yifan Luo, and Andrew C Yao. Augmenting math word problems via iterative question composing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24605–24613, 2025a.
- Liu et al. [2025b] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025b.
- Lu et al. [2025] Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, and Yuan Qi. Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain, 2025. URL https://arxiv.org/abs/2501.15587.
- Luo et al. [2023] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
- Luo et al. [2025] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. DeepScaleR Notion Page, 2025. Notion Blog.
- Luong et al. [2024] Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024.
- MAA [2023] MAA. American mathematics competitions (AMC 10/12). Mathematics Competition Series, 2023. URL https://maa.org/math-competitions/amc.
- MAA [2024] MAA. American invitational mathematics examination (AIME). Mathematics Competition Series, 2024. URL https://maa.org/math-competitions/aime.
- Muennighoff et al. [2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- Nguyen et al. [2025] Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, and Kristina Lerman. Smoothing out hallucinations: Mitigating llm hallucination with smoothed knowledge distillation. arXiv preprint arXiv:2502.11306, 2025.
- Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
- Pei et al. [2025] Qizhi Pei, Lijun Wu, Zhuoshi Pan, Yu Li, Honglin Lin, Chenlin Ming, Xin Gao, Conghui He, and Rui Yan. Mathfusion: Enhancing mathematic problem-solving of llm through instruction fusion. arXiv preprint arXiv:2503.16212, 2025.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Shen et al. [2025] Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, and Lin Yan. Exploring data scaling trends and effects in reinforcement learning from human feedback. arXiv preprint arXiv:2503.22230, 2025.
- Sheng et al. [2024] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.
- Shi et al. [2025] Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025.
- Tan et al. [2024] Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. Large language models for data annotation and synthesis: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 930–957, 2024.
- Tang et al. [2024] Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. Mathscale: Scaling instruction tuning for mathematical reasoning. In International Conference on Machine Learning, pages 47885–47900. PMLR, 2024.
- Team et al. [2025] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
- Team [2025] Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/.
- Tong et al. [2024] Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. Advances in Neural Information Processing Systems, 37:7821–7846, 2024.
- Toshniwal et al. [2024] Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman. Openmathinstruct-1: A 1.8 million math instruction tuning dataset. Advances in Neural Information Processing Systems, 37:34737–34774, 2024.
- Wang et al. [2023] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023.
- Wang et al. [2024] Shu Wang, Lei Ji, Renxi Wang, Wenxiao Zhao, Haokun Liu, Yifan Hou, and Ying Nian Wu. Explore the reasoning capability of llms in the chess testbed. arXiv preprint arXiv:2411.06655, 2024.
- Wang et al. [2025] Yu Wang, Nan Yang, Liang Wang, and Furu Wei. Examining false positives under inference scaling for mathematical reasoning. arXiv preprint arXiv:2502.06217, 2025.
- Wen et al. [2025] Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond. arXiv preprint arXiv:2503.10460, 2025.
- Wu et al. [2023] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36:59008–59033, 2023.
- Xiong et al. [2025] Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343, 2025.
- Xu et al. [2023] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Yang et al. [2024a] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024a.
- Yang et al. [2024b] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024b.
- Ye et al. [2025] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025.
- Yeo et al. [2025] Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025.
- Yu et al. [2025a] Bin Yu, Hang Yuan, Yuliang Wei, Bailing Wang, Weizhen Qi, and Kai Chen. Long-short chain-of-thought mixture supervised fine-tuning eliciting efficient reasoning in large language models. arXiv preprint arXiv:2505.03469, 2025a.
- Yu et al. [2023] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Yu et al. [2025b] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025b.
- Yu et al. [2025c] Yiyao Yu, Yuxiang Zhang, Dongdong Zhang, Xiao Liang, Hengyuan Zhang, Xingxing Zhang, Ziyi Yang, Mahmoud Khademi, Hany Awadalla, Junjie Wang, et al. Chain-of-reasoning: Towards unified mathematical reasoning in large language models via a multi-paradigm perspective. arXiv preprint arXiv:2501.11110, 2025c.
- Yuan et al. [2025] Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025.
- Yue et al. [2025] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
- Zeng et al. [2025] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025.
- Zhang et al. [2024a] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37:64735–64772, 2024a.
- Zhang et al. [2024b] Hengyuan Zhang, Yanru Wu, Dawei Li, Sak Yang, Rui Zhao, Yong Jiang, and Fei Tan. Balancing speciality and versatility: a coarse to fine framework for supervised fine-tuning large language model. In Findings of the Association for Computational Linguistics ACL 2024, pages 7467–7509, 2024b.
- Zhang et al. [2025] Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, and Yeyun Gong. Process-based self-rewarding language models. arXiv preprint arXiv:2503.03746, 2025.
- Zhang et al. [2023] Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474, 2023.
- Zhao et al. [2025a] Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633, 2025a.
- Zhao et al. [2025b] Xueliang Zhao, Wei Wu, Jian Guan, and Lingpeng Kong. Promptcot: Synthesizing olympiad-level problems for mathematical reasoning in large language models. arXiv preprint arXiv:2503.02324, 2025b.
- Ziegler et al. [2019] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
- Zuo et al. [2025] Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning, 2025. URL https://arxiv.org/abs/2504.16084.

Appendix Contents for SwS
- 1 Introduction
- 2 Method
  - 2.1 Preliminary
  - 2.2 Overview
  - 2.3 Self-aware Weakness Identification
  - 2.4 Targeted Problem Synthesis
  - 2.5 Augmented Training with Synthetic Problems
- 3 Experiments
  - 3.1 Experimental Setup
  - 3.2 Main Results
  - 3.3 Weakness Mitigation from Augmented Training
- 4 Extensions and Analysis
  - 4.1 Weak-to-Strong Generalization for SwS
  - 4.2 Self-evolving Targeted Problem Synthesis
  - 4.3 Weakness-driven Selection
  - 4.4 Impact of Question Difficulty
  - 4.5 Case Study
- 5 Conclusion
- 6 Discussions, Limitations and Future Work
- A Related Work
- B Implementation Details
  - B.1 Training
  - B.2 Evaluation
- C Motivation for Using RL in Weakness Identification
- D Data Analysis of the SwS Framework
  - D.1 Detailed Data Workflow
  - D.2 Difficulty Distribution of Synthetic Problems
- E Co-occurrence Based Concept Sampling
- F Details for Weak-to-Strong Generalization in SwS
- G Details for Self-Evolving in SwS
- H Details for Weakness-driven Selection
- I Evaluation Benchmark Demonstrations
- J Prompts
  - J.1 Prompt for Category Labeling
  - J.2 Prompt for Concepts Extraction
  - J.3 Prompt for Problem Synthesis
  - J.4 Prompt for Quality Evaluation
Appendix A Related Work
Recent advancements have significantly enhanced the integration of reinforcement learning (RL) with large language models (LLMs) [74, 37], particularly in the domains of complex reasoning and code generation [10]. Algorithms such as Proximal Policy Optimization (PPO) [39] and Group Relative Policy Optimization (GRPO) [40] have demonstrated strong generalization and effectiveness in these applications. In contrast to supervised fine-tuning (SFT) via knowledge distillation [17, 69, 61], RL optimizes a model's reasoning capabilities on its own generated outputs through reward-driven feedback, thereby promoting stronger generalization; SFT models, by comparison, often depend on rote memorization of reasoning patterns and solutions [3] and may produce correct answers with flawed rationales [52]. In LLM reasoning, RL strengthens policy exploration and improves reasoning performance by using the verified correctness of final answers as reward signals for training [32], a setting commonly referred to as reinforcement learning with verifiable rewards (RLVR) [66].
Robust RLVR for LLM Reasoning. Scaling up reinforcement learning for LLMs poses significant challenges in terms of training stability and efficiency. Designing stable and efficient supervision algorithms and frameworks for LLMs has attracted widespread attention from the research community.
To address the challenge of reward sparsity in reinforcement learning, recent studies have explored not only answer-based rewards but also process-level reward modeling [4, 26, 50, 70], which provides more fine-grained reward signals throughout the entire solution process [54]. Wang et al. [50] successfully incorporated a process reward model (PRM), trained on process-level labels generated via Monte Carlo sampling at each step, into RL training and demonstrated its effectiveness. Beyond RL training, PRMs can also guide inference [4] and provide value estimates when integrated with search algorithms [68, 9]. However, Guo et al. [10] found that the scalability of process-level RL is limited by the ambiguous definition of a "step" and the high cost of process-level labeling. How to effectively scale process-level RL remains an open question.
Recent efforts in scaling up RLVR optimization have focused on enhancing exploration [63, 65, 28, 60] and adapting RL to long-CoT settings [16, 10, 24]. Yu et al. [63] found that the KL constraint may limit exploration under RLVR, while Liu et al. [28] proposed removing variance normalization in GRPO to prevent length bias. Building on PPO, Yuan et al. [65] found that pre-training the value function prior to RL training and employing a length-adaptive GAE improve training stability and efficiency in RLVR, preventing the value estimate from degrading to a constant baseline.
Data Construction in RLVR. Although RL training on simpler mathematical questions can partially elicit a model’s reasoning ability [67], the composition of RL training data is critical for enhancing the model’s reasoning capabilities [31, 63, 22, 14, 12, 41]. Carefully designing a problem set with difficulty levels matched to the model’s abilities and sufficient diversity can significantly improve performance. In addition, the use of curriculum learning has been shown to improve the efficiency of reinforcement learning [43]. In this work, we propose generating synthetic problems based on the model’s weaknesses for RL training, where the synthetic problems are tailored to align with the model’s capabilities and target its areas of weakness, fostering its exploration and improving performance.
Data Synthesis for LLM Reasoning. Existing data synthesis strategies for enhancing LLM reasoning primarily concentrate on generating problem-response pairs [15, 45, 62, 73, 25, 30, 27, 51, 21, 44, 38] or augmenting responses to existing questions [49, 48, 12, 7, 53, 64, 23], typically by leveraging advanced LLMs to produce these synthetic examples. A prominent line of work focuses on extracting and recombining key concepts from seed problems. KP-Math [15] and MathScale [45] decompose seed problems into underlying concepts and recombine them to create new problems, leveraging advanced models to generate corresponding solutions. PromptCoT [73] also leverages underlying concepts, but focuses on generating competition-level problems. DART-Math [48] introduces a difficulty-aware framework that prioritizes the diversity and richness of synthetic responses to challenging problems.
Recently, several studies have aimed to construct distilled datasets to better elicit the reasoning capabilities of LLMs [10]. Several works [7, 59, 35, 29, 72] employ advanced Long-CoT models to generate responses for distilling knowledge into smaller models. However, a significant disparity in capabilities between the teacher and student models can lead to hallucinations in the student's outputs [36] and hinder generalization to out-of-distribution scenarios [3]. In contrast, our framework under the RL setting enables the model to identify and mitigate its own weaknesses by generating targeted synthetic problems from failure cases, thereby encouraging more effective self-improvement based on its specific weaknesses.
Appendix B Implementation Details
[Figure 8 description: a bar chart tracing the SwS data-filtering pipeline in the 32B experiments. Remaining problem counts per stage: initial training data (17,545) → failed problems from weakness identification (1,905) → all synthetic problems (813,639) → RL-style problems, after removing undesirable problem types (176,140) → high-quality problems, after quality filtering → answer-verified problems, after removing inconsistent labeled answers (137,447) → difficulty-filtered problems (41,726).]
Figure 8: Demonstration of the SwS data workflow by tracing the process from initial training data to the final selection of synthetic problems in the 32B model experiments. For better visualization, the bar heights are scaled using the cube root of the raw data.
B.1 Training
We conduct our experiments using the verl [42] framework and adopt GRPO [40] as the optimization algorithm. For all RL training experiments, we sample 8 rollouts per problem and use a batch size of 1024, with the policy update batch size set to 256. We employ a constant learning rate of $5\times10^{-7}$ with a 20-step warm-up, and set the maximum prompt and response lengths to 1,024 and 8,192 tokens, respectively. We do not apply a KL penalty, as recent studies have shown it may hinder exploration and potentially cause training collapse [65, 28, 63]. In the initial training stage, we train the model for 200 steps. During augmented RL training, we continually train the initially trained model for 600 steps on the augmented dataset incorporating synthetic problems, updating only on prompts whose accuracy, as determined by the online policy model, lies between $\text{acc}_{\text{lower}}=10\%$ and $\text{acc}_{\text{upper}}=90\%$. The probability-ratio clipping range in Eq. 3 is set to $\varepsilon=0.20$ and $\varepsilon^{h}=0.28$.
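The accuracy-based prompt selection described above can be sketched as a small filter. This is an illustrative reconstruction, not the verl implementation: the data layout and function names are ours, and inclusive bounds are an assumption.

```python
# Illustrative sketch of the accuracy-based prompt filter used during
# augmented RL training: a prompt contributes to policy updates only if its
# rollout accuracy under the online policy lies within
# [acc_lower, acc_upper] = [10%, 90%]. Inclusive bounds are an assumption.

ACC_LOWER = 0.10
ACC_UPPER = 0.90

def keep_prompt(rollout_rewards, lower=ACC_LOWER, upper=ACC_UPPER):
    """rollout_rewards: 0/1 correctness scores for the 8 sampled rollouts."""
    acc = sum(rollout_rewards) / len(rollout_rewards)
    return lower <= acc <= upper

def filter_batch(batch):
    """batch: list of (prompt, rollout_rewards) pairs; keep informative prompts."""
    return [prompt for prompt, rewards in batch if keep_prompt(rewards)]
```

Prompts the policy always solves or always fails yield zero-advantage rollout groups under GRPO, so excluding them wastes no update capacity.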
Since the training data for the 32B and 14B models (a combination of DAPO [63] and Light-R1 [53] subsets) lack human-annotated category information, we leverage the LLaMA-3.3-70B-Instruct model to label their categories. This ensures consistency with our SwS pipeline, which combines concepts within the same category. The prompt is presented in Appendix J.1 (Prompt for Category Labeling).
B.2 Evaluation
For evaluation, we utilize the vLLM framework [18] and allow responses of up to 8,192 tokens. For all benchmarks, Pass@1 is computed using greedy decoding for baseline models and sampling (temperature 1.0, top-p 0.95) for RL-trained models. For Avg@32 on competition-level benchmarks, we sample 32 responses per model with the same sampling configuration as used in RL training. We adopt a hybrid rule-based verifier that integrates Math-Verify and the PRIME-RL verifier [6], as their complementary strengths lead to higher recall. For all inference, we use the default chat template and enable CoT prompting by appending the instruction "Let's think step by step and output the final answer within $\backslash\text{boxed}\{\}$." after each question.
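The hybrid verification step can be illustrated with a minimal sketch: two verifiers with complementary strengths are combined with a logical OR, so an answer counts as correct if either accepts it. The two toy verifiers below are stand-ins for Math-Verify and the PRIME-RL verifier; their real APIs are not reproduced here.

```python
# Hedged sketch of hybrid rule-based verification: a prediction is accepted
# if ANY of the constituent verifiers judges it equivalent to the gold
# answer, which raises recall when the verifiers fail on different cases.

def hybrid_verify(prediction, gold, verifiers):
    """Return True if any verifier judges `prediction` equivalent to `gold`."""
    return any(verify(prediction, gold) for verify in verifiers)

# Toy stand-in verifiers (NOT the real Math-Verify / PRIME-RL interfaces):
def exact_match(pred, gold):
    return pred.strip() == gold.strip()

def numeric_match(pred, gold):
    try:
        return abs(float(pred) - float(gold)) < 1e-6
    except ValueError:
        return False
```

For example, `hybrid_verify("0.50", "0.5", [exact_match, numeric_match])` is accepted by the numeric check even though the strings differ.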
Appendix C Motivation for Using RL in Weakness Identification
[Figure 9 description: a grouped bar chart of failed-problem ratios (%) on MATH-12k for the Base, SFT, and Initial RL models across seven subjects. Approximate ratios (Base / SFT / Initial RL): Algebra 0.9 / 16.5 / 0.5; Counting & Probability 9.9 / 41.3 / 3.8; Geometry 17.1 / 45.1 / 8.8; Intermediate Algebra 14.8 / 52.9 / 6.7; Number Theory 6.2 / 37.9 / 1.8; Prealgebra 2.8 / 15.6 / 0.9; Precalculus 13.3 / 48.4 / 10.3. The SFT model fails most often in every subject, while the Initial RL model fails least.]
Figure 9: A visualization of weakness identification on the original training set (MATH-12k) using the base model (Qwen2.5-7B), the SFT model, and the initial RL model.
In our SwS framework, we propose utilizing an initial RL training phase for weakness identification. However, one might argue that there are simpler alternatives, such as directly sampling training problems from the base model or applying supervised fine-tuning before prompting the model to answer questions. In this section, we discuss in depth the validity of treating problems with low training efficiency during the initial RL phase as the model's weaknesses.
We first compare the performance of the Base model, SFT model, and Initial RL model by sampling on the training set, where the SFT model is obtained by fine-tuning the Base model for one epoch on human-written solutions. For each question, we prompt the model to generate 8 responses and report in Figure 9 the proportion of problems for which none of the responses is correct. For the Base model, failures may be attributed to its insufficient alignment with reasoning-specific tasks. Results from the Initial RL model show that the Base model can quickly master such questions through RL, indicating that they do not represent challenging weaknesses. Furthermore, the Base model's heavy reliance on the prompt template [28] reduces the robustness of weakness identification. For the SFT model, there are three main drawbacks regarding weakness identification: (1) the dilemma of training epochs: too many epochs lead to memorizing labeled solutions, while too few fail to align the model with the target problem distribution; (2) SFT is prone to hallucination [3, 52]; and (3) ensuring the quality of labeled solutions is difficult, as human-written solutions may not always be the best for models [10]. For these reasons, the SFT model performs poorly on the initial training set, even yielding worse results than the Base model, let alone serving to identify model weaknesses from its failed problems.
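The failure statistic plotted in Figure 9 is straightforward to compute. The sketch below assumes a hypothetical record layout of (subject, per-sample correctness); it is illustrative, not the released evaluation code.

```python
# Sketch of the per-subject failure ratio: the fraction of problems for
# which NONE of the 8 sampled responses is correct, reported as a percentage.

from collections import defaultdict

def failure_ratios(records):
    """records: iterable of (subject, [0/1 correctness of sampled responses])."""
    failed = defaultdict(int)
    total = defaultdict(int)
    for subject, scores in records:
        total[subject] += 1
        if sum(scores) == 0:  # every sampled response is wrong
            failed[subject] += 1
    return {s: 100.0 * failed[s] / total[s] for s in total}
```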
In contrast to the Base and SFT models, the Initial RL model exhibits the most robust performance on the initial training set, indicating that the failed problems expose the model’s most critical weaknesses. Additionally, the training efficiency on all problems during initial RL can also be recorded for further analysis of model weaknesses. Meanwhile, the initially trained model can also serve as the starting point for augmented RL training. Therefore, in our SwS framework, we ultimately choose to employ an initial RL phase for robust weakness identification.
Appendix D Data Analysis of the SwS Framework
| Positive Case # 1: Let $z_{1}$ , $z_{2}$ , and $z_{3}$ be complex numbers such that $|z_{1}|=|z_{2}|=|z_{3}|=1$ and $z_{1}+z_{2}+z_{3}=0$ . Using the symmetric polynomial $s_{2}=z_{1}z_{2}+z_{1}z_{3}+z_{2}z_{3}$ , find the value of $|s_{2}|^{2}$ . |
| --- |
| Negative Case # 1: In a village, there are 10 houses, each of which can be painted one of three colors: red, blue, or green. Two houses cannot have the same color if they are directly adjacent to each other. Using combinatorial analysis and considering the constraints, find the total number of distinct ways to paint the houses, taking into account the possibility of having a sequence where the same color repeats after two different colors (e.g., red, blue, red), and assuming that the color of one of the end houses is already determined to be red, and the colors of the houses are considered different based on their positions (i.e., the configuration red, blue, green is considered different from green, blue, red). |
| Negative Case # 2: A metal’s surface requires a minimum energy of 2.5 eV to remove an electron via the photoelectric effect. If light with a wavelength of 480 nm is shone on the metal, and 1 mole of electrons is ejected, what is the total energy, in kilojoules, transferred to the electrons, given that the energy of a photon is related to its wavelength by the formula E = $hc/\lambda$ , where $h=6.626x10^{-34}$ J s and $c=3.00x10^{8}m/s$ , and Avogadro’s number is $6.02x10^{23}$ particles per mole? |
| Negative Case # 3: In triangle $ABC$ , with $\angle A=60^{\circ}$ , $\angle B=90^{\circ}$ , $AB=4$ , and $BC=7$ , use the Law of Sines to find $\angle C$ and calculate the triangle’s area. |
Table 4: Case study of quality filtering results in SwS, featuring one high-quality positive case and three low-quality negative cases. The low-quality segments are marked in pink.
D.1 Detailed Data Workflow
Taking the 32B model experiments as an example, Figure 8 shows the comprehensive data workflow of the SwS framework, from identifying model weaknesses in the initial training data to the processing of synthetic problems. The initial training set, consisting of the DAPO and Light-R1 subsets for the Qwen2.5-32B model, contains 17,545 problem-answer pairs. During the weakness identification stage, 1,905 problems are identified as failure cases according to Eq. 4. These failure cases are subsequently used for concept extraction and targeted problem synthesis.
For problem synthesis, we set an initial budget of 1 million synthetic problems in all experiments, with allocations for each category determined as in Eq. 5. These problems then undergo several filtering stages: (1) removing multiple-choice, multi-part, or proof-required problems; (2) discarding problems evaluated as low quality; (3) filtering out problems for which the answer-generation model yields inconsistent answers, specifically when the most frequent answer appears in fewer than 50% of all generations; and (4) removing problems whose difficulty levels are unsuitable for the current model in RL training. Among these, the quality-based filtering is the strictest, with a filtering rate of 78.35%, indicating that the SwS pipeline maintains rigorous quality control over the generated problems. This ensures both the stability and effectiveness of utilizing synthetic problems in subsequent training.
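Filtering stage (3), the answer-consistency check, reduces to a majority-vote threshold over sampled answers. A minimal sketch, with the data layout assumed:

```python
# Sketch of answer-consistency filtering: a synthetic problem is kept only
# if the modal (most frequent) answer covers at least 50% of the
# answer-generation model's samples; otherwise its label is deemed unreliable.

from collections import Counter

def is_answer_consistent(sampled_answers, threshold=0.5):
    """sampled_answers: list of answer strings from repeated generations."""
    counts = Counter(sampled_answers)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(sampled_answers) >= threshold
```

The surviving majority answer can then be taken as the problem's verifiable label for RL training.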
We present a case study of the quality-based filtering results in Table 4. As illustrated, the positive case that passed the model-based quality evaluation features a concise and precise problem description. In contrast, most synthetic problems identified as low quality exhibit redundant and overly elaborate descriptions, sometimes including lengthy hints for solving the problem, as seen in the first negative case. Additionally, some low-quality problems incorporate excessive non-mathematical knowledge, such as Physics, as illustrated in the second negative case; its informal LaTeX formatting also contributes to its lower quality. Furthermore, problems with multiple question components, such as the third negative case, are also considered low quality for RL training.
D.2 Difficulty Distribution of Synthetic Problems
In this section, we study the difficulty distribution of the synthetic problems generated for base models ranging from 3B to 32B, as shown in Figure 10. The red outlines in the pie plots highlight the subset of synthetic problems selected for subsequent augmented RL training, with accuracy falling within the [25%, 75%] range. These samples account for nearly 35% of all generated problems across the four models. The two largest wedges in the pie chart represent problems that the models answered either completely correctly or completely incorrectly. These cases do not provide effective training signals in GRPO [40, 63], and are thus excluded from the later augmented RL training stage. To further enhance stability and efficiency, we also exclude problems where the model produces only one correct or one incorrect response.
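The selection described above can be sketched as follows; the tuple layout is illustrative, and both criteria from the text (accuracy in [25%, 75%], plus dropping problems with only one correct or one incorrect response) are applied.

```python
# Hedged sketch of the difficulty-based selection for augmented RL training:
# keep synthetic problems whose rollout accuracy lies in [25%, 75%], and
# additionally drop near-degenerate cases with only one correct or only one
# incorrect response out of n rollouts.

def select_by_difficulty(problems, lower=0.25, upper=0.75):
    """problems: list of (problem, num_correct, num_rollouts) tuples."""
    kept = []
    for prob, num_correct, num_rollouts in problems:
        acc = num_correct / num_rollouts
        if not (lower <= acc <= upper):
            continue  # too easy or too hard: weak GRPO training signal
        if num_correct <= 1 or num_rollouts - num_correct <= 1:
            continue  # only one correct or one incorrect response
        kept.append(prob)
    return kept
```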
Since all synthetic problems are generated using the same instruction model (LLaMA-3.3-70B-Instruct) with similar competition-level difficulty targets (as illustrated in Appendix J.3, Prompt for Problem Synthesis), and are based on concepts derived from each model's respective weaknesses, the resulting difficulty distributions exhibit only minor differences across models. Consistent with intuition, the initially trained 3B model achieves the lowest performance on the synthetic questions, with the highest ratio of all-incorrect and the lowest ratio of all-correct responses, while the 32B model shows the opposite trend, achieving the best performance.
<details>
<summary>x14.png Details</summary>

### Visual Description
\n
## Chart: Synthetic Problems Difficulty for SwS Models
### Overview
The image presents four pie charts, each representing the distribution of difficulty levels for synthetic problems solved by different SwS (likely Solver-based Systems) models: SwS-3B, SwS-7B, SwS-7B-Math, and SwS-32B. Each pie chart shows the percentage of problems falling into difficulty levels ranging from 0 to 8.
### Components/Axes
Each chart has a title indicating the SwS model being analyzed. The pie slices are labeled with the difficulty level (0-8) and the corresponding percentage of problems at that level. There are no explicit axes, as it's a pie chart representation.
### Detailed Analysis or Content Details
**1. SwS-3B:**
* Difficulty 0: 32.2% (Dark Blue)
* Difficulty 1: 11.0% (Light Blue)
* Difficulty 2: 6.9% (Yellow)
* Difficulty 3: 5.6% (Orange)
* Difficulty 4: 5.0% (Red)
* Difficulty 5: 4.9% (Pink)
* Difficulty 6: 5.3% (Purple)
* Difficulty 7: 7.1% (Brown)
* Difficulty 8: 22.0% (Green)
**2. SwS-7B:**
* Difficulty 0: 23.3% (Dark Blue)
* Difficulty 1: 9.9% (Light Blue)
* Difficulty 2: 7.1% (Yellow)
* Difficulty 3: 6.0% (Orange)
* Difficulty 4: 5.6% (Red)
* Difficulty 5: 6.1% (Pink)
* Difficulty 6: 6.1% (Purple)
* Difficulty 7: 7.9% (Brown)
* Difficulty 8: 28.6% (Green)
**3. SwS-7B-Math:**
* Difficulty 0: 30.2% (Dark Blue)
* Difficulty 1: 9.7% (Light Blue)
* Difficulty 2: 6.4% (Yellow)
* Difficulty 3: 5.3% (Orange)
* Difficulty 4: 4.9% (Red)
* Difficulty 5: 4.8% (Pink)
* Difficulty 6: 5.3% (Purple)
* Difficulty 7: 7.7% (Brown)
* Difficulty 8: 25.7% (Green)
**4. SwS-32B:**
* Difficulty 0: 18.8% (Dark Blue)
* Difficulty 1: 9.2% (Light Blue)
* Difficulty 2: 6.9% (Yellow)
* Difficulty 3: 5.8% (Orange)
* Difficulty 4: 5.5% (Red)
* Difficulty 5: 5.4% (Pink)
* Difficulty 6: 6.2% (Purple)
* Difficulty 7: 8.2% (Brown)
* Difficulty 8: 33.8% (Green)
### Key Observations
* Difficulty level 0 is the largest slice for SwS-3B (32.2%), SwS-7B (23.3%), and SwS-7B-Math (30.2%); for SwS-32B, difficulty level 8 is the largest (33.8%).
* Difficulty level 8 is the second most frequent level for SwS-3B, SwS-7B, and SwS-7B-Math, and the most frequent for SwS-32B.
* Across all four models the distributions are bimodal, concentrating at the two extremes (levels 0 and 8), while each intermediate level accounts for roughly 5-8% of problems.
* SwS-32B has the highest percentage of problems at difficulty level 8 (33.8%), suggesting it can solve more complex problems than the other models.
### Interpretation
The distributions indicate that the solved synthetic problems cluster at the two extremes of the difficulty scale: a large share of very easy problems (level 0) alongside a substantial share of very hard ones (level 8). The variation across models tracks model capacity, with SwS-32B solving the largest fraction of level-8 problems. The otherwise similar shapes of the distributions suggest that the problem generation process is comparable across models. Further investigation would be needed to determine which problem characteristics place problems in the higher difficulty levels and whether the models struggle with specific problem types.
</details>
Figure 10: Difficulty distributions of synthetic problems for models from 3B to 32B in our work.
Appendix E Co-occurrence Based Concept Sampling
Following Huang et al. [15] and Zhao et al. [73], we enhance the coherence and semantic fluency of synthetic problems by sampling concepts within the same category based on their co-occurrence probabilities and embedding similarities. Specifically, for each candidate concept $c \in \mathbf{C}$ from category $\mathbf{D}$, we define its score based on both co-occurrence statistics and embedding similarity as:
$$
\mathrm{Score}(c)=\begin{cases}\mathrm{Co}(c)+\mathrm{Sim}(c),&\text{if }c\notin\{c_{1},c_{2},\dots,c_{k}\},\\ -\infty,&\text{otherwise}.\end{cases}
$$
The co-occurrence term $\mathrm{Co}(c)$ is computed by summing the co-occurrence counts from a sparse matrix built over the entire corpus, generated by iterating through all available concept lists in the pool. For each list, we increment $\mathrm{CooccurMatrix}[c,c^{\prime}]$ by one for every unordered pair where $c \neq c^{\prime}$, yielding a sparse, symmetric matrix in which each entry $\mathrm{CooccurMatrix}[c,c^{\prime}]$ records the total number of times concepts $c$ and $c^{\prime}$ co-occur across all sampled lists:
$$
\mathrm{Co}(c)=\sum_{i=1}^{k}\mathrm{CooccurMatrix}[c,c_{i}], \tag{6}
$$
while the semantic similarity is given by the cosine similarity between the candidate’s embedding and the mean embedding of the currently selected concepts:
$$
\mathrm{Sim}(c)=\cos\left(\vec{e}_{c},\frac{1}{k}\sum_{i=1}^{k}\vec{e}_{c_{i}}\right). \tag{7}
$$
To efficiently support large-scale and high-dimensional concept spaces, we construct a sparse co-occurrence matrix over all unique concepts, where each entry represents the frequency with which a pair of concepts co-occurs within sampled concept lists. Simultaneously, concept embeddings are normalized and indexed via FAISS to facilitate fast similarity computation. During sampling, an initial seed concept is drawn in proportion to its empirical frequency. For each subsequent concept, scores are computed by efficiently summing its co-occurrence with the current set and its embedding similarity to the group mean, while previously selected concepts are masked out. The probability of sampling each candidate is determined via softmax over these scores with temperature $\tau$ :
$$
P(c)=\frac{\exp\left(\mathrm{Score}(c)/\tau\right)}{\sum_{c^{\prime}\notin\{c_{1},\dots,c_{k}\}}\exp\left(\mathrm{Score}(c^{\prime})/\tau\right)}. \tag{8}
$$
This process iteratively constructs coherent, semantically related concept sets to serve as the inputs for synthetic problem generation, ensuring both diversity and fluency.
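The scoring and sampling loop above (Eqs. 6-8) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data layout, function names, and the uniform seed draw are assumptions (the paper draws the seed concept in proportion to its empirical frequency and accelerates the similarity step with FAISS).

```python
import numpy as np
from collections import defaultdict

def build_cooccur(concept_lists, index):
    """Sparse symmetric co-occurrence counts over all concept lists."""
    co = defaultdict(int)
    for lst in concept_lists:
        ids = [index[c] for c in lst]
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):       # every unordered pair
                co[(ids[a], ids[b])] += 1
                co[(ids[b], ids[a])] += 1
    return co

def sample_concepts(concepts, emb, co, k=4, tau=1.0, rng=None):
    """Iteratively draw k coherent concepts via a softmax over Co(c) + Sim(c)."""
    rng = rng or np.random.default_rng()
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-norm rows
    selected = [int(rng.integers(len(concepts)))]            # seed concept (uniform here)
    while len(selected) < k:
        mean_e = emb[selected].mean(axis=0)
        mean_e /= np.linalg.norm(mean_e)                     # group-mean embedding
        scores = np.array([
            sum(co.get((c, s), 0) for s in selected) + float(emb[c] @ mean_e)
            for c in range(len(concepts))
        ])
        scores[selected] = -np.inf                           # mask already-chosen concepts
        p = np.exp((scores - scores[np.isfinite(scores)].max()) / tau)
        selected.append(int(rng.choice(len(concepts), p=p / p.sum())))
    return [concepts[i] for i in selected]
```

The masked candidates receive probability zero through the $-\infty$ score, matching the case split in the Score definition.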
Appendix F Details for Weak-to-Strong Generalization in SwS
Case Study for Weak-to-strong Generalization
Question: Compute $99^{2}+99+1$ in your head. Please reason step by step, and put your final answer within \boxed{}.
Golden Answer: $\boxed{9901}$
$>>>$ Teacher Response: To compute $99^{2}+99+1$, we can compute each part directly: $99^{2}=9801$; $99^{2}+99=9801+99=9900$; $99^{2}+99+1=9900+1=9901$. Final answer: $\boxed{9901}$
$>>>$ Student Response: Rewrite the expression using a binomial identity: $99^{2}+99+1=(99+1)^{2}=100^{2}=10000$. Final answer: $\boxed{10000}$
Figure 11: Case study in which a weak teacher provides the correct solution, while a strong student incorrectly applies a binomial identity and derives an incorrect answer.
To understand the capabilities of the weak teacher and the strong student model, we evaluated both on the MATH-500 test set, sampling eight responses per question from each model. Although the teacher model is generally weaker, we found that on 16.4% of problems the weaker teacher outperforms the otherwise stronger student. This highlights the potential of leveraging a weak teacher to distill its strengths into the student model. A case where the weaker teacher outperforms the stronger student is shown in Figure 11.
From the analysis of the SwS framework and its Weak-to-Strong Generalization extension, we take the upper bound for answer labeling to be a revised form of the teacher model's self-consistency score, where (1) the teacher's most consistent answer must account for more than 50% of its responses, and (2) the student model must produce the same answer as the teacher's consistent answer in at least 25% of its responses. These filtering steps help ensure the correctness of the synthetic-problem answers labeled by the teacher model.
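These two acceptance conditions can be sketched as a small filter. The function and argument names are hypothetical; the thresholds follow the text above.

```python
from collections import Counter

def label_answer(teacher_answers, student_answers,
                 teacher_thresh=0.5, student_thresh=0.25):
    """Hypothetical sketch of the SwS answer-labeling filter.

    Keep a synthetic problem only if (1) the teacher's majority answer
    covers strictly more than `teacher_thresh` of its samples, and
    (2) the student reproduces that answer in at least `student_thresh`
    of its samples. Returns the accepted label, or None to discard.
    """
    answer, count = Counter(teacher_answers).most_common(1)[0]
    if count / len(teacher_answers) <= teacher_thresh:
        return None  # teacher is not self-consistent enough
    if student_answers.count(answer) / len(student_answers) < student_thresh:
        return None  # student never converges to the teacher's answer
    return answer
```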
In Table 5, we demonstrate the robustness of using a weaker teacher for answer labeling, treating the MATH-500 test set as a stand-in for our synthetic problems. As shown in the second row, the self-consistency setting alone improves the teacher model by only 4.8 points. However, when we exclude problems for which self-consistency does not provide sufficient confidence, namely those where the most consistent answer accounts for less than 50% of all responses, self-consistency yields an additional 9.0-point improvement on the remaining questions. Furthermore, in our SwS pipeline we retain only problems where the student model achieves over 25% accuracy, to ensure an appropriate level of difficulty. After filtering out problems where the student falls below this threshold, some mislabeled problems are also automatically removed, and the weak teacher reaches 97.5% labeling accuracy on the final remaining questions. The increase in labeling accuracy from 80.6% to 97.5% demonstrates both the viability of using the weaker teacher model for answer labeling and the robustness of the SwS framework itself.
| Setting | Size | Prealgebra | Intermediate Algebra | Algebra | Precalculus | Number Theory | Counting & Probability | Geometry | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pass@1 | 500 | 88.2 | 64.3 | 95.5 | 71.2 | 93.0 | 81.4 | 63.0 | 80.6 |
| + SC | 500 | 96.9 | 96.0 | 84.4 | 84.1 | 96.2 | 87.5 | 67.8 | 85.4 |
| + SC>50% | 444 | 96.9 | 97.3 | 93.2 | 94.7 | 98.0 | 94.4 | 89.6 | 94.4 |
| + SC>50% & Stu-Con | 407 | 96.8 | 97.2 | 97.7 | 100.0 | 100.0 | 96.8 | 94.9 | 97.5 |
Table 5: The performance of the weak teacher model used for answer generation on the MATH-500 test set under different strategies and their corresponding revisions. "Stu-Con" refers to filtering out problems where the student model’s accuracy falls below the defined threshold of 25%.
Appendix G Details for Self-Evolving in SwS
As mentioned in Section 4.2, the Self-evolving SwS extension enables the policy to achieve better performance on simple to medium-level mathematical reasoning benchmarks but remains suboptimal on AIME-level competition benchmarks. In this section, we further analyze the reasons behind this phenomenon. Figure 12 visualizes the model’s self-quality assessment and difficulty evaluation within the SwS framework. Notably, the model assigns a much higher proportion of “perfect” and “acceptable” labels, and fewer “bad” labels, to its self-generated problems compared to the standard framework shown in Figure 8. This observation is consistent with findings from LLM-as-a-Judge [21], which indicate that models tend to favor and assign higher scores to their own generations. Such behavior may cause the model to overlook low-quality problems or to misclassify problems that exceed its reasoning abilities as unsolvable or of poor quality. Beyond the risk of filtering out overly complex problems, the model may also struggle to accurately label answers via self-consistency for such problems, limiting the potential of incorporating complex problems through the Self-evolving SwS framework.
Additionally, in Figure 12, it is noteworthy that the initial RL-trained model achieves nearly 50% all-correct responses on its generated problems, whereas only 31% of problems with appropriate difficulty remain for augmentation after SwS difficulty filtering. This suggests that the self-generated problems may be significantly simpler than those produced by a stronger instruction model [8], which could lead to data inefficiency and limit the model’s performance on more complex problems during RL training.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Pie Charts: Self-Judgement and Self-Difficulty Evaluation for Qwen2.5-14B-Instruct
### Overview
The image contains two pie charts. The left chart displays the results of a "Self-Judgement" evaluation for the Qwen2.5-14B-Instruct model. The right chart shows the "Self-Difficulty Evaluation" for the same model. Both charts represent the distribution of responses across different categories.
### Components/Axes
Both charts lack explicit axes, as they are pie charts. They are labeled with titles indicating the type of evaluation. Each slice of the pie charts is labeled with a category and its corresponding percentage.
**Left Chart (Self-Judgement):**
* Categories: "perfect", "bad"
* Percentages: 35.3%, 64.1%
**Right Chart (Self-Difficulty Evaluation):**
* Categories: 0, 1, 2, 3, 4, 5, 6, 7, 8
* Percentages: 6.9%, 5.2%, 5.1%, 5.4%, 5.8%, 5.6%, 8.2%, 12.0%, 44.8%
### Detailed Analysis or Content Details
**Left Chart (Self-Judgement):**
The pie chart is dominated by the "bad" category, which accounts for 64.1% of the responses; the "perfect" category accounts for 35.3%.
**Right Chart (Self-Difficulty Evaluation):**
The largest segment of this pie chart is the "8" category, representing 44.8% of the responses. The remaining categories are distributed as follows:
* 0: 6.9%
* 1: 5.2%
* 2: 5.1%
* 3: 5.4%
* 4: 5.8%
* 5: 5.6%
* 6: 8.2%
* 7: 12.0%
### Key Observations
* The model frequently self-judges its responses as "bad" (64.1%).
* The most common self-assessed difficulty level is 8 (44.8%).
* The distribution of difficulty levels is relatively even across the lower levels (0-7), with a slight increase in responses for difficulty level 7 (12.0%).
### Interpretation
The data suggests that the Qwen2.5-14B-Instruct model tends to be critical of its own responses, frequently rating them as "bad". This could indicate a conservative self-assessment strategy or a genuine awareness of limitations in its performance. The concentration of responses at difficulty level 8 suggests that the model often perceives tasks as highly challenging. The relatively even distribution of lower difficulty levels indicates that the model encounters a range of task complexities. The fact that the model rates itself as "bad" more often than "perfect" could be a sign of a well-calibrated model that doesn't overestimate its capabilities. Further investigation would be needed to understand the specific types of responses that are categorized as "bad" and the characteristics of tasks assigned a difficulty level of 8.
</details>
Figure 12: Illustration of the quality assessment and difficulty evaluation for Qwen2.5-14B-Instruct under the Self-evolving SwS framework.
Appendix H Details for Weakness-driven Selection
Algorithm 1 Weakness-Driven Selection Pipeline
1: Input: Failed Problems $\mathbf{X}_{S}$ ; Total Budget $|T|$ ; Target Set $\mathbf{T}_{X}$ ; Domains $\{\mathbf{D}_{i}\}_{i=0}^{n}$
2: Output: Selected problems $\mathbf{T}_{S}$
3: Embed all failed problems in $\mathbf{X}_{S}$ and all questions in $\mathbf{T}_{X}$
4: for each domain $\mathbf{D}_{i}$ in $\{\mathbf{D}_{i}\}_{i=0}^{n}$ do
5: Compute selection budget $|T_{i}|$ for $\mathbf{D}_{i}$ according to Eq. 2
6: Extract failed problems $\mathbf{X}_{S,i}$ belonging to $\mathbf{D}_{i}$
7: for each $q∈\mathbf{T}_{X}$ do $\triangleright$ Domain-level KNN
8: Compute $d_{i}(q)=\min_{f∈\mathbf{X}_{S,i}}\text{distance}(\vec{e}_{q},\vec{e}_{f})$
9: end for
10: Select top $|T_{i}|$ questions from $\mathbf{T}_{X}$ with the smallest $d_{i}(q)$ as $\mathcal{S}_{i}$
11: end for
12: return Selected problems $\mathbf{T}_{S}=\bigcup_{i=0}^{n}\mathcal{S}_{i}$ $\triangleright$ Final Selected Set
As described in Section 4.3, we utilize the failed problems identified by Qwen2.5-7B [57] on the MATH-12k [13] training set, comprising 915 problems, to select additional data from Big-Math [1] that mitigates the model’s weaknesses through augmented RL training. The complete Weakness-driven Selection extension of SwS is presented in Algorithm 1. For embedding the problems, we utilize LLaMA-3.1-8B-base [8] to encode both the collected failure cases and the problems from the target dataset. The failure cases are then grouped by category, following the concept sampling strategy in standard SwS. We employ a K-Nearest Neighbors [5] algorithm to select weakness-driven problems from the target set, where the augmented problems are chosen by their embedding distances to the failure cases within each category. The selection budget for each category is determined according to Eq. 5. We then aggregate the retrieved problems from all categories, forming a selected set of 40k problems, which is combined with the initial set for the subsequent RL training.
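Algorithm 1 can be sketched as below. The embedding arrays, the per-domain budget dictionary, and the function name are illustrative assumptions; in the paper the budgets come from Eq. 5 and the embeddings from LLaMA-3.1-8B-base.

```python
import numpy as np

def weakness_driven_select(target_emb, fail_emb_by_domain, budgets):
    """Hypothetical sketch of the weakness-driven selection pipeline.

    For each domain, score every target question by the distance to its
    nearest failure case in that domain (domain-level nearest neighbor),
    then keep the `budgets[domain]` closest questions; return the union.
    """
    selected = set()
    for domain, fail_emb in fail_emb_by_domain.items():
        # (num_targets, num_failures) pairwise distances, reduced to the minimum
        diffs = target_emb[:, None, :] - fail_emb[None, :, :]
        d = np.linalg.norm(diffs, axis=-1).min(axis=1)
        top = np.argsort(d)[: budgets[domain]]   # smallest distances first
        selected.update(int(i) for i in top)
    return sorted(selected)
```

The union over domains mirrors the final aggregation step $\mathbf{T}_{S}=\bigcup_{i}\mathcal{S}_{i}$; at the scale of Big-Math, the brute-force distance computation would be replaced by an approximate index.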
Appendix I Evaluation Benchmark Demonstrations
| Dataset | Size | Category | Example Problem | Answer |
| --- | --- | --- | --- | --- |
| GSM8k | 1319 | Prealgebra | The ice cream parlor was offering a deal, buy 2 scoops of ice cream, get 1 scoop free. Each scoop cost $1.50. If Erin had $6.00, how many scoops of ice cream should she buy? | $6$ |
| MATH-500 | 500 | Geometry | For a constant $c,$ in cylindrical coordinates $(r,\theta,z),$ find the shape described by the equation $z=c.$ (A) Line (B) Circle (C) Plane (D) Sphere (E) Cylinder (F) Cone. Enter the letter of the correct option. | (C) Plane |
| Minerva Math | 272 | Precalculus | If the Bohr energy levels scale as $Z^{2}$ , where $Z$ is the atomic number of the atom (i.e., the charge on the nucleus), estimate the wavelength of a photon that results from a transition from $n=3$ to $n=2$ in Fe, which has $Z=26$ . Assume that the Fe atom is completely stripped of all its electrons except for one. Give your answer in Angstroms, to two significant figures. | $9.6$ |
| Olympiad-Bench | 675 | Geometry | Given a positive integer $n$ , determine the largest real number $\mu$ satisfying the following condition: for every $4n$ -point configuration $C$ in an open unit square $U$ , there exists an open rectangle in $U$ , whose sides are parallel to those of $U$ , which contains exactly one point of $C$ , and has an area greater than or equal to $\mu$ . | $\frac{1}{2n+2}$ |
| Gaokao-2023 | 385 | Geometry | There are three points $A,B,C$ in space such that $AB=BC=CA=1$ . If 2 distinct points are chosen in space such that they, together with $A,B,C$ , form the five vertices of a regular square pyramid, how many different ways are there to choose these 2 points? | $9$ |
| AMC23 | 40 | Algebra | How many complex numbers satisfy the equation $z^{5}=\overline{z}$ , where $\overline{z}$ is the conjugate of the complex number $z$ ? | $7$ |
| AIME24 | 30 | Number Theory | Let $N$ be the greatest four-digit positive integer with the property that whenever one of its digits is changed to $1$ , the resulting number is divisible by $7$ . Let $Q$ and $R$ be the quotient and remainder, respectively, when $N$ is divided by $1000$ . Find $Q+R$ . | $699$ |
| AIME25 | 30 | Geometry | On $\triangle ABC$ points $A,D,E$, and $B$ lie in that order on side $\overline{AB}$ with $AD=4,DE=16$, and $EB=8$. Points $A,F,G$, and $C$ lie in that order on side $\overline{AC}$ with $AF=13,FG=52$, and $GC=26$. Let $M$ be the reflection of $D$ through $F$, and let $N$ be the reflection of $G$ through $E$. Quadrilateral $DEGF$ has area 288. Find the area of heptagon $AFNBCEM$. | $588$ |
Table 6: Statistics and examples of the eight evaluation benchmarks utilized in the paper.
We present the statistics and examples of the eight evaluation benchmarks used in our work in Table 6. Among these, GSM8K [4] is the simplest, comprising grade school math word problems. The MATH-500 [13], Gaokao-2023 [71], Olympiad-Bench [11], and AMC23 [33] benchmarks consist of high school mathematics problems spanning a wide range of topics and difficulty levels, while Minerva Math [19] may also include problems from other subjects. The AIME [34] benchmark is a prestigious high school mathematics competition that requires deep mathematical insight and precise problem-solving skills. An overview of all benchmarks is provided as follows.
- GSM8K: A high-quality benchmark comprising 8,500 human-written grade school math word problems that require multi-step reasoning and basic arithmetic, each labeled with a natural language solution and verified answer. The 1,319-question test set emphasizes sequential reasoning and is primarily solvable by upper-grade elementary school students.
- MATH-500: A challenging benchmark of 500 high school competition-level problems spanning seven subjects, including Algebra, Geometry, Number Theory, and Precalculus. Each problem is presented in natural language with LaTeX-formatted notation, offering a strong measure of mathematical reasoning and generalization across diverse topics.
- Minerva Math: A high-difficulty math benchmark consisting of 272 challenging problems. Some problems also touch on scientific topics from other subjects, such as physics.
- Olympiad-Bench: An Olympiad-level English and Chinese multimodal scientific benchmark featuring 8,476 problems from mathematics and physics competitions. In this work, we use only the text-only problems written in English, totaling 675 problems.
- Gaokao-2023: A dataset consisting of 385 mathematics problems from the 2023 Chinese higher education entrance examination, professionally translated into English.
- AMC23: The AMC dataset consists of all 83 problems from AMC12 2022 and AMC12 2023, extracted from the AoPS wiki page. We use a subset of this data containing 40 problems.
- AIME24 & 25: Each set comprises 30 problems from the 2024 and 2025 American Invitational Mathematics Examination (AIME), a prestigious high school mathematics competition for top-performing students, which are the most challenging benchmarks used in our study. Each problem is designed to require deep mathematical insight, multi-step reasoning, and precise problem-solving skills.
Appendix J Prompts
J.1 Prompt for Category Labeling
Listing 1: The prompt for labeling the categories for mathematical problems, utilizing a few-shot strategy in which each category is represented by a labeled demonstration.
# CONTEXT #
I am a teacher, and I have some high-level mathematical problems.
I want to categorize the domain of these math problems.
# OBJECTIVE #
A. Provide a concise summary of the math problem, clearly identifying the key concepts or techniques involved.
B. Assign the problem to one and only one specific mathematical domain.
The following is the list of domains to choose from:
< math domains >
["Intermediate Algebra", "Geometry", "Precalculus", "Number Theory", "Counting & Probability", "Algebra", "Prealgebra"]
</ math domains >
# STYLE #
Data report.
# TONE #
Professional, scientific.
# AUDIENCE #
Students. Enable them to better understand the domain of the problems.
# RESPONSE: MARKDOWN REPORT #
## Summarization
[Summarize the math problem in a brief paragraph.]
## Math domains
[Select one domain from the list above that best fits the problem.]
# ATTENTION #
- You must assign each problem to exactly one of the domains listed above.
- If you are genuinely uncertain and none of the listed categories applies, you may use "Other", but this should be a last resort.
- Be thoughtful and accurate in your classification. Default to the listed categories whenever possible.
- Add "=== report over ===" at the end of the report.
< example math problem >
** Question **:
Let $n (\ge 2)$ be a positive integer. Find the minimum $m$, so that there exists $x_{ij} (1 \le i, j \le n)$ satisfying:
(1) For every $1 \le i, j \le n$, $x_{ij} = \max\{x_{i1}, x_{i2}, ..., x_{ij}\}$ or $x_{ij} = \max\{x_{1j}, x_{2j}, ..., x_{ij}\}$.
(2) For every $1 \le i \le n$, there are at most $m$ indices $k$ with $x_{ik} = \max\{x_{i1}, x_{i2}, ..., x_{ik}\}$.
(3) For every $1 \le j \le n$, there are at most $m$ indices $k$ with $x_{kj} = \max\{x_{1j}, x_{2j}, ..., x_{kj}\}$.
</ example math problem >
## Summarization
The problem involves an \( n \times n \) matrix where each element \( x_{ij} \) is constrained by the maximum values in its respective row or column. The goal is to determine the minimum possible value of \( m \) such that, for each row and column, the number of indices attaining the maximum value is limited to at most \( m \). This problem requires understanding matrix properties, maximum functions, and combinatorial constraints on structured numerical arrangements.
## Math domains
Algebra
=== report over ===
</ example math problem >
** Question **:
In an acute scalene triangle $ABC$, points $D, E, F$ lie on sides $BC, CA, AB$, respectively, such that $AD \perp BC, BE \perp CA, CF \perp AB$. Altitudes $AD, BE, CF$ meet at orthocenter $H$. Points $P$ and $Q$ lie on segment $EF$ such that $AP \perp EF$ and $HQ \perp EF$. Lines $DP$ and $QH$ intersect at point $R$. Compute $HQ/HR$.
</ example math problem >
## Summarization
The problem involves an acute scalene triangle with three perpendicular cevians intersecting at the orthocenter. Additional perpendicular constructions are made from specific points on segment \( EF \), leading to an intersection at point \( R \). The goal is to determine the ratio \( HQ / HR \), requiring knowledge of triangle geometry, perpendicularity, segment ratios, and properties of the orthocenter.
## Math domains
Geometry
=== report over ===
</ example math problem >
** Question **:
Three cards are dealt at random from a standard deck of 52 cards. What is the probability that the first card is a 4, the second card is a $\clubsuit$, and the third card is a 2?
</ example math problem >
## Summarization
This problem involves calculating the probability of a specific sequence of events when drawing three cards from a standard 52-card deck without replacement. It requires understanding conditional probability, the basic rules of counting, and how probabilities change as cards are removed from the deck.
## Math domains
Counting & Probability
=== report over ===
</ example math problem >
** Question **:
Let $x$ and $y$ be real numbers such that $3x + 2y \le 7$ and $2x + 4y \le 8.$ Find the largest possible value of $x + y.$
</ example math problem >
## Summarization
This problem involves optimizing a linear expression \( x + y \) subject to a system of linear inequalities. It requires understanding of linear programming concepts, such as identifying feasible regions, analyzing boundary points, and determining the maximum value of an objective function within that region.
## Math domains
Intermediate Algebra
=== report over ===
</ example math problem >
** Question **:
Solve
\[\arccos 2x - \arccos x = \frac{\pi}{3}.\] Enter all the solutions, separated by commas.
</ example math problem >
## Summarization
This problem requires solving a trigonometric equation involving inverse cosine functions. The equation relates two expressions with \( \arccos(2x) \) and \( \arccos(x) \), and asks for all real solutions satisfying the given identity. It involves knowledge of inverse trigonometric functions, their domains, and properties, as well as algebraic manipulation.
## Math domains
Precalculus
=== report over ===
</ example math problem >
** Question **:
What perfect-square integer is closest to 273?
</ example math problem >
## Summarization
The problem asks for the perfect square integer closest to 273. This involves understanding the distribution and properties of perfect squares, and comparing them with a given integer. It relies on number - theoretic reasoning related to squares of integers and their proximity to a target number.
## Math domains
Number Theory
=== report over ===
</ example math problem >
Voldemort bought $6.\overline{6}$ ounces of ice cream at an ice cream shop. Each ounce cost $\$0.60.$ How much money, in dollars, did he have to pay?
</ example math problem >
## Summarization
The problem involves multiplying a repeating decimal, \( 6.\overline{6} \), by a fixed unit price, \$0.60, to find the total cost in dollars. This requires converting a repeating decimal into a fraction or using decimal multiplication, both of which are foundational arithmetic skills.
## Math domains
Prealgebra
=== report over ===
< math problem >
{problem}
</ math problem >
J.2 Prompt for Concepts Extraction
Listing 2: Prompt template for extracting internal concepts from a mathematical question.
As an expert in educational assessment, analyze this problem:
< problem >
{problem}
</ problem >
Break down and identify {num_concepts} foundational concepts being tested. List these knowledge points that:
- Are core curriculum concepts typically taught in standard courses,
- Are precise and measurable (not vague like " understanding math "),
- Are essential building blocks needed to solve this problem,
- Represent fundamental principles rather than problem - specific techniques.
Think through your analysis step by step, then format your response as a Python code snippet containing a list of {num_concepts} strings, where each string clearly describes one fundamental knowledge point.
J.3 Prompt for Problem Synthesis
Listing 3: Prompt template for synthesizing math problems from specified concepts, difficulty levels, and pre-defined mathematical categories. Following [73], the difficulty levels are consistently set to the competition level to prevent the generation of overly simple questions.
### Given a set of foundational mathematical concepts, a mathematical domain, and a specified difficulty level, generate a well-constructed question that meaningfully integrates multiple listed concepts and reflects the stated level of complexity.
### Foundational Concepts:
{concepts}
### Target Difficulty Level:
{level}
### Mathematical Domain:
{domain}
### Instructions:
1. Begin by outlining which concepts you will combine and how you plan to structure the question.
2. Ensure that the question is coherent, relevant, and appropriately challenging for the specified level.
3. The question must be a single standalone problem, not split into multiple sub-questions.
4. Do not generate proof-based, multiple-choice, or true/false questions.
5. The answer to the question should be expressible using numbers and mathematical symbols.
6. Provide a final version of the question that is polished and ready for use.
### Output Format:
- First, provide your brief outline and planning for the question design.
- Then, present only the final version of the question in the following format:
‘‘‘
[Your developed question here]
’’’
Do not include any placeholder, explanatory text, hints, or solutions to the question in the output block.
J.4 Prompt for Quality Evaluation
Listing 4: The quality evaluation prompt utilized to filter out low-quality math problems. Following prior work [73], we assess synthetic problems based on five criteria: format, factual accuracy, difficulty alignment, concept coverage, and solvability. Each problem is then assigned one of three quality levels: ‘bad’, ‘acceptable’, or ‘perfect’.
As a critical expert in educational problem design, evaluate the following problem components:
=== GIVEN MATERIALS ===
1. Problem & Design Rationale:
{rationale_and_problem}
(The rationale describes the author's thinking process and justification in designing this problem)
2. Foundational Concepts:
{concepts}
3. Target Difficulty Level:
{level}
=== EVALUATION CRITERIA ===
Rate each criterion as: [Perfect | Acceptable | Bad]
1. FORMAT
- Verify correct implementation of markup tags:
<!-- BEGIN RATIONALE --> [design thinking process] <!-- END RATIONALE -->
<!-- BEGIN PROBLEM --> [problem] <!-- END PROBLEM -->
2. FACTUAL ACCURACY
- Check for any incorrect or misleading information in both problem and rationale
- Verify mathematical, scientific, or logical consistency
3. DIFFICULTY ALIGNMENT
- Assess if problem complexity matches the specified difficulty level
- Evaluate if cognitive demands align with target level
4. CONCEPT COVERAGE
- Evaluate how well the problem incorporates the given foundational concepts
- Check for missing concept applications
5. SOLVABILITY
- Verify if the problem has at least one valid solution
- Check if all necessary information for solving is provided
=== RESPONSE FORMAT ===
For each criterion, provide:
1. Rating: [Perfect | Acceptable | Bad]
2. Justification: Clear explanation for the rating
=== FINAL VERDICT ===
After providing all criterion evaluations, conclude your response with:
‘Final Judgement: [verdict]’
where verdict must be one of:
- ‘perfect’ (if both FACTUAL ACCURACY and SOLVABILITY are Perfect, at least two other criteria are Perfect, and no Bad ratings)
- ‘acceptable’ (if no Bad ratings and does not qualify for perfect)
- ‘bad’ (if ANY Bad ratings)
Note: The ‘Final Judgement: [verdict]’ line must be the final line of your response.