2502.12853v1


# S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Ruotian Ma 1†, Peisong Wang 2†, Cheng Liu 1, Xingyan Liu 1, Jiaqi Chen 3, Bang Zhang 1, Xin Zhou 4, Nan Du 1‡, Jia Li 5

1 Tencent, 2 Tsinghua University, 3 The University of Hong Kong, 4 Fudan University, 5 The Hong Kong University of Science and Technology (Guangzhou)

ruotianma@tencent.com, wps22@mails.tsinghua.edu.cn

> † Equal contribution. This work was done while Peisong, Cheng, Jiaqi and Bang were interning at Tencent. ‡ Corresponding authors.

> To ensure a fair comparison, we report the Pass@1 (greedy) accuracy obtained without the process preference model of rStar, rather than the result obtained with increased test-time computation using 64 trajectories.

## Abstract

Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs’ deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S²R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-Math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S²R. Our code and data are available at https://github.com/NineAbyss/S2R.

## 1 Introduction

Recent advancements in Large Language Models (LLMs) have demonstrated a paradigm shift from scaling up training-time efforts to scaling test-time compute Snell et al. (2024a); Kumar et al. (2024); Qi et al. (2024); Yang et al. (2024). The effectiveness of scaling test-time compute is illustrated by OpenAI o1 OpenAI (2024), which shows strong reasoning abilities by performing deep and thorough thinking, incorporating essential skills like self-checking, self-verifying, self-correcting and self-exploring during the model’s reasoning process. This paradigm not only enhances reasoning in domains like mathematics and science but also offers new insights into improving the generalizability, helpfulness and safety of LLMs across various general tasks OpenAI (2024); Guo et al. (2025).

<details> <summary>x1.png Details</summary> ![f39c820f](/v1/image/f39c820f7a954c7b24888aa04d480f757f0dd9d3d50ea46f5fb0192f4a5adbc0) ### Visual Description ## Scatter Plot: Model Performance vs. Data Size on MATH500 ### Overview This image is a scatter plot that visualizes the performance (Accuracy in %) of different language models against their data size (log10) on the MATH500 dataset. Each point represents a specific model, with its position indicating its accuracy and data size. ### Components/Axes * **Title:** MATH500 * **X-axis:** * **Title:** Data Size (log10) * **Scale:** Logarithmic, ranging from 3 to 8. * **Markers:** 3, 4, 5, 6, 7, 8. * **Y-axis:** * **Title:** Accuracy (%) * **Scale:** Linear, ranging from 76 to 86. * **Markers:** 76, 78, 80, 82, 84, 86. 
* **Data Points:** Five distinct colored circles, each labeled with the name of a language model. * **Green Circle:** Labeled "Qwen2.5-Math-7B-S²R-ORL (ours)" * **Pink Circle:** Labeled "Qwen2.5-Math-7B-Instruct" * **Orange Circle:** Labeled "Eurus-2-7B-PRIME" * **Blue Circle:** Labeled "rStar-Math-7B" * **Dark Purple Circle:** Labeled "Qwen2.5-7B-SimpleRL-Zero" ### Detailed Analysis The plot displays the following data points: 1. **Qwen2.5-Math-7B-S²R-ORL (ours)** (Green Circle): * **Trend:** This point is positioned at the top-left of the cluster, indicating high accuracy with a relatively smaller data size compared to some other models. * **Approximate Coordinates:** Data Size (log10) ≈ 3.9, Accuracy (%) ≈ 84.5 2. **Qwen2.5-Math-7B-Instruct** (Pink Circle): * **Trend:** This point is located in the upper-right quadrant of the plot, showing a good balance of high accuracy and a larger data size. * **Approximate Coordinates:** Data Size (log10) ≈ 6.5, Accuracy (%) ≈ 83.5 3. **Eurus-2-7B-PRIME** (Orange Circle): * **Trend:** This point is situated in the middle-lower section of the plot, suggesting moderate accuracy with a medium data size. * **Approximate Coordinates:** Data Size (log10) ≈ 5.5, Accuracy (%) ≈ 79.5 4. **rStar-Math-7B** (Blue Circle): * **Trend:** This point is in the lower-right section, indicating lower accuracy with a larger data size. * **Approximate Coordinates:** Data Size (log10) ≈ 7.0, Accuracy (%) ≈ 78.2 5. **Qwen2.5-7B-SimpleRL-Zero** (Dark Purple Circle): * **Trend:** This point is at the bottom-left, showing the lowest accuracy among the plotted models, with a relatively small data size. * **Approximate Coordinates:** Data Size (log10) ≈ 4.0, Accuracy (%) ≈ 77.0 ### Key Observations * The model "Qwen2.5-Math-7B-S²R-ORL (ours)" achieves the highest accuracy (approximately 84.5%) among the plotted models, despite having one of the smallest data sizes (approximately 3.9 log10). 
* "Qwen2.5-Math-7B-Instruct" also demonstrates high accuracy (approximately 83.5%) but with a significantly larger data size (approximately 6.5 log10). * "Qwen2.5-7B-SimpleRL-Zero" has the lowest accuracy (approximately 77.0%) and a relatively small data size (approximately 4.0 log10). * "rStar-Math-7B" has a larger data size (approximately 7.0 log10) but a lower accuracy (approximately 78.2%) compared to "Eurus-2-7B-PRIME". ### Interpretation This scatter plot suggests a general trend where increased data size might not always directly correlate with improved accuracy, or that model architecture and training methods play a crucial role. The "Qwen2.5-Math-7B-S²R-ORL (ours)" model stands out as being highly efficient, achieving top-tier accuracy with a comparatively smaller data footprint. This could imply a more effective learning process or better generalization capabilities. Conversely, models like "rStar-Math-7B" show that simply increasing data size doesn't guarantee superior performance, as it has a larger data size but lower accuracy than "Eurus-2-7B-PRIME". The "Qwen2.5" family of models shows varying performance based on their specific training (e.g., Instruct vs. SimpleRL-Zero vs. S²R-ORL), highlighting the impact of fine-tuning and reinforcement learning techniques. The plot effectively allows for a quick comparison of model trade-offs between performance and data requirements on the MATH500 benchmark.
</details>

Figure 1: The data efficiency of S²R compared to competitive methods, with all models initialized from Qwen2.5-Math-7B.

Recent studies have made various attempts to replicate the success of o1. These efforts include using large-scale Monte Carlo Tree Search (MCTS) to construct long-chain-of-thought (long-CoT) training data, or to scale test-time reasoning to improve the performance of current models Guan et al. (2025); Zhao et al. (2024); Snell et al. 
(2024b); constructing high-quality long-CoT data for effective behavior cloning with substantial human effort Qin et al. (2024); and exploring reinforcement learning to enhance LLM thinking abilities on large-scale training data and models Guo et al. (2025); Team et al. (2025); Cui et al. (2025); Yuan et al. (2024). Recently, DeepSeek R1 Guo et al. (2025) demonstrated that large-scale reinforcement learning can incentivize LLMs’ deep thinking abilities, with the R1 series showcasing the promising potential of long-thought reasoning. However, these approaches generally require significant resources to enhance LLMs’ thinking abilities, including large datasets, substantial training-time compute, and considerable human effort and time costs. Meanwhile, it remains unclear how to incentivize valid thinking in smaller or less powerful LLMs beyond distilling knowledge from more powerful models.

In this work, we propose S²R, an efficient alternative to enhance the thinking abilities of LLMs, particularly for smaller or less powerful LLMs. Instead of having LLMs imitate the thinking process of larger, more powerful models, S²R focuses on teaching LLMs to think deeply by iteratively adopting two critical thinking skills: self-verifying and self-correcting. By acquiring these two capabilities, LLMs can continuously reassess their solutions, identify mistakes during solution exploration, and refine previous solutions after self-checking. Such a paradigm also enables flexible allocation of test-time compute to problems of different difficulty levels. Our results show that, with only 3.1k training samples, Qwen2.5-Math-7B significantly benefits from learning self-verifying and self-correcting behaviors, improving its accuracy on the MATH500 test set from 51.0% to 81.6%. This result outperforms the same base model distilled from an equivalent amount of long-CoT data (accuracy 80.2%) from QwQ-32B-Preview Team (2024a). 
More importantly, S²R employs both outcome-level and process-level reinforcement learning (RL) to further enhance the LLMs’ self-verifying and self-correcting capabilities. Using only rule-based reward models, RL improves the validity of both the self-verification and self-correction processes, allowing the models to perform more flexible and effective test-time scaling through a self-directed trial-and-error process. By comparing outcome-level and process-level RL for our task, we found that process-level supervision is particularly effective in boosting the accuracy of the thinking skills at intermediate steps, which might benefit base models with limited reasoning abilities. In contrast, outcome-level supervision enables models to explore more flexible trial-and-error paths towards the correct final answer, leading to consistent improvement in the reasoning abilities of more capable base models. We further show the potential of offline reinforcement learning as a more efficient alternative to online RL training.

We conducted extensive experiments across 3 LLMs on 7 math reasoning benchmarks. Experimental results demonstrate that S²R outperforms competitive baselines in math reasoning, including the recently released advanced o1-like models Eurus-2-7B-PRIME Cui et al. (2025), rStar-Math-7B Guan et al. (2025) and Qwen2.5-7B-SimpleRL Zeng et al. (2025). We also found that S²R generalizes to out-of-domain general tasks, such as MMLU-PRO, highlighting the validity of the learned self-verifying and self-correcting abilities. Additionally, we conducted a series of analytical experiments to better demonstrate the reasoning mechanisms of the obtained models, and provide insights into performing online and offline RL training for enhancing LLM reasoning. 
## 2 Methodology

The main idea behind teaching LLMs self-verification and self-correction abilities is to streamline deep thinking into a critical paradigm: self-directed trial-and-error with self-verification and self-correction. Specifically: (1) LLMs are allowed to explore any potential (though possibly incorrect) solutions, especially when tackling difficult problems; (2) during the process, self-verification is essential for detecting mistakes on-the-fly; (3) self-correction enables the model to fix detected mistakes. This paradigm forms an effective test-time scaling approach that is more accessible for less powerful base models and is generalizable across various tasks.

In this section, we first formally define the problem (§ 2.1). Next, we present the two-stage training framework of S²R, as described in Figure 2:

- Stage 1: Behavior Initialization. We first construct dynamic self-verifying and self-correcting trial-and-error trajectories to initialize the desired behavior. Then, we apply supervised fine-tuning (SFT) to the initial policy models using these trajectories, resulting in behavior-initialized policy models (§ 2.2).
- Stage 2: Reinforcement Learning. Following behavior initialization, we employ reinforcement learning to further enhance the self-verifying and self-correcting capabilities of the policy models. We explore both outcome-level and process-level RL methods, as well as their offline versions (§ 2.3).

<details> <summary>x2.png Details</summary> ![e70dbff6](/v1/image/e70dbff6dd369970884bed97054837de73384afe8e51e889d64461f534cb5068) ### Visual Description ## Diagram: Stages of Reinforcement Learning for Model Training ### Overview This diagram illustrates a three-stage process for training a model, likely a language model or a similar AI system, using reinforcement learning. The stages are: Stage 0: Data Construction, Stage 1: Behavior Initialization, and Stage 2: Reinforcement Learning. 
The diagram details how data is generated, how an initial policy is established, and how reinforcement learning is applied to refine the model's performance. ### Components/Axes **Stage 0: Data Construction** * **Input:** "Input Questions" (represented by a document icon). * **Process:** An "Initial Policy" box receives input questions. * **Output:** The "Initial Policy" generates "Sample K responses for each question." These responses are visually represented as a grid of colored circles, indicating "Correct Response" (green) and "Incorrect Response" (red). * **Legend:** * Green circle: Correct Response * Red circle: Incorrect Response * **Further Processing:** These samples are used to "Construct trajectories based on difficulty distribution." This is shown by arrows leading to different difficulty levels: * Difficulty Level 5 * Difficulty Level 3 * Difficulty Level 1 * **Trajectory Representation:** Each difficulty level is associated with a set of tokens represented as `r = {s1, v1, s2, v2, s3, v3, s4, v4}` for Level 5, `r = {s1, v1, s2, v2}` for Level 3, and `r = {s1, v1}` for Level 1. The colors of the tokens in the example trajectories are: * Level 5: `s1` (red), `v1` (green), `s2` (red), `v2` (green), `s3` (red), `v3` (green), `s4` (red), `v4` (green) * Level 3: `s1` (red), `v1` (green), `s2` (red), `v2` (green) * Level 1: `s1` (red), `v1` (green) **Verification Construction (Within Stage 0)** * This section provides two methods of verification for a given problem and model answer. * **"Problem-Solving" Verification:** * **Problem:** "27 increased by twice a number is 39. What is the number?" * **Model's answer:** 6 * **Verification:** A step-by-step explanation of how to solve the equation `27 + 2x = 39` to arrive at `x = 6`. * **"Confirmative" Verification:** * **Problem:** "27 increased by twice a number is 39. What is the number?" 
* **Model's answer:** 6 * **Verification:** Substituting the answer `6` back into the original statement to confirm its validity. It shows that `27 increased by twice 6` (which is `27 + 12`) equals `39`. **Stage 1: Behavior Initialization** * **Process:** "Supervised Fine-tuning" is applied to an "Initial Policy Model π₀". * **Input:** "Input question x". * **Output:** The model outputs a "Target output r = {s1, v1, s2, s3, v3}" and an "SFT Mask m = {0, 1, 0, 1, 1, 1, 1}". Arrows indicate a "Backward" pass. **Stage 2: Reinforcement Learning** * **Process:** A "SFT Model π_SFT" (presumably the model after Stage 1) is trained using reinforcement learning. * **Input:** "Input Questions". * **Outputs:** * **Outcome-level reward:** Indicated by an upward arrow with checkmarks (✓) and crosses (✗), representing success or failure. The possible outcomes are listed as `{s1, v1, ..., sj, vj}`. * **Process-level reward:** Indicated by downward arrows with checkmarks (✓) and crosses (✗), representing success or failure at intermediate steps. * **Feedback Loop:** Both outcome-level and process-level rewards are fed back to the SFT Model via a "Backward" pass. * **Relationship to Stage 1:** The SFT Model in Stage 2 appears to be an evolution of the Initial Policy Model from Stage 1. The "Target output r" and "SFT Mask m" from Stage 1 are shown feeding into the SFT Model in Stage 2, along with "Input question x". ### Detailed Analysis or Content Details **Stage 0: Data Construction** The diagram shows that input questions are processed by an initial policy to generate multiple responses. These responses are categorized as correct (green) or incorrect (red). The distribution of these responses is then used to construct trajectories based on difficulty levels (Level 5, Level 3, Level 1). The token sets associated with each difficulty level suggest a hierarchical structure or varying complexity of generated sequences. 
For example, Level 5 has the longest sequence of tokens (`s1` through `v4`), while Level 1 has the shortest (`s1`, `v1`). The coloring of these tokens (red for `s` tokens, green for `v` tokens) might indicate specific types of actions or states within the trajectory. The verification sections provide a concrete example of a mathematical problem and demonstrate two distinct methods for verifying the model's answer: 1. **Problem-Solving:** This method involves solving the problem from scratch to derive the correct answer and comparing it to the model's answer. 2. **Confirmative:** This method involves plugging the model's answer back into the problem statement to check for consistency. Both methods confirm that the model's answer of `6` for the problem "27 increased by twice a number is 39" is correct. **Stage 1: Behavior Initialization** This stage focuses on supervised fine-tuning (SFT) of an initial policy model (π₀). The model takes an input question `x` and is trained to produce a target output `r = {s1, v1, s2, s3, v3}`. A corresponding "SFT Mask" `m = {0, 1, 0, 1, 1, 1, 1}` is also generated. The mask likely indicates which parts of the target output are relevant or should be attended to during fine-tuning. The "Backward" arrow suggests that gradients are propagated during this supervised learning process. **Stage 2: Reinforcement Learning** This stage applies reinforcement learning to a "SFT Model π_SFT". The model receives input questions and generates outputs that are evaluated by both "Outcome-level reward" and "Process-level reward". * **Outcome-level reward:** This reward is a global evaluation of the final output. The symbols `✓` and `✗` indicate successful or unsuccessful outcomes, respectively. The set `{s1, v1, ..., sj, vj}` likely represents the possible final states or actions. * **Process-level reward:** This reward provides feedback on intermediate steps of the model's generation process. 
The visual representation of multiple downward arrows with `✓` and `✗` suggests that each step in the generated sequence is evaluated. Both reward signals are fed back to the SFT Model via a "Backward" pass, allowing the model to learn and improve its policy to maximize cumulative rewards. The diagram shows that the "Target output r" and "SFT Mask m" from Stage 1 are inputs to the SFT Model in Stage 2, implying that the supervised fine-tuning in Stage 1 initializes the model for the subsequent reinforcement learning phase. ### Key Observations * The diagram outlines a structured approach to training a model, moving from data generation and initial policy learning to sophisticated reinforcement learning. * The use of both "Problem-Solving" and "Confirmative" verification methods in Stage 0 highlights a robust approach to ensuring the correctness of generated data or model outputs. * The concept of "difficulty distribution" in Stage 0 suggests that the training data is curated to cover a range of problem complexities. * Stage 1 (Supervised Fine-tuning) serves as a crucial initialization step, preparing the model with a baseline behavior before it undergoes reinforcement learning. * Stage 2 (Reinforcement Learning) utilizes both global (outcome-level) and granular (process-level) rewards, indicating a comprehensive feedback mechanism for model improvement. * The "Backward" arrows consistently denote the flow of gradients or error signals in supervised fine-tuning and reinforcement learning. ### Interpretation This diagram depicts a common pipeline in modern AI development, particularly for tasks involving sequential generation or decision-making, such as natural language processing or game playing. **Stage 0: Data Construction** lays the groundwork by generating a diverse and verified dataset. The inclusion of verification methods suggests a focus on data quality and correctness, which is paramount for effective model training. 
The segmentation by difficulty level implies a strategy to gradually expose the model to increasingly complex tasks, a common technique in curriculum learning. The token sets `r = {s1, v1, ...}` likely represent sequences of actions or states, where `s` and `v` might denote different types of tokens or decisions. The coloring of these tokens (red/green) could represent specific attributes or classes within the sequence, potentially related to the correctness or type of action taken. **Stage 1: Behavior Initialization** is critical for providing the reinforcement learning agent with a reasonable starting point. Supervised fine-tuning on a dataset of input-output pairs (questions and target outputs) allows the model to learn basic patterns and generate plausible responses. The "SFT Mask" might be used to guide the fine-tuning process, focusing the model's attention on specific parts of the input or output that are deemed most important. This stage prevents the reinforcement learning agent from starting with a completely random policy, which would be highly inefficient. **Stage 2: Reinforcement Learning** is where the model truly learns to optimize its behavior. By receiving rewards based on both the final outcome and the intermediate steps, the model can learn to make better decisions throughout its generation process. The distinction between outcome-level and process-level rewards is significant: * **Outcome-level reward** is essential for achieving the overall goal. * **Process-level reward** is crucial for learning efficient and correct intermediate steps, which can lead to better outcomes and prevent common errors. This is particularly important in complex tasks where a single error early on can derail the entire process. The overall flow suggests a progression from supervised learning to reinforcement learning, a paradigm often referred to as "pre-training and fine-tuning" or "behavior cloning followed by reinforcement learning." 
The diagram effectively visualizes how an initial, supervised policy is refined through trial and error guided by rewards, leading to a more robust and optimized model. The "Backward" arrows are a consistent visual cue for the backpropagation of errors or gradients, a fundamental mechanism in training deep learning models. The entire process aims to build a model that not only generates correct answers but does so through a sound and efficient reasoning process.
</details>

Figure 2: Overview of S²R.

### 2.1 Problem Setup

We formulate the desired LLM reasoning paradigm as a sequential decision-making process under a reinforcement learning framework. Given a problem $x$, the language model policy $\pi$ is expected to generate a sequence of interleaved reasoning actions $y=(a_{1},a_{2},\cdots,a_{T})$ until reaching the termination action <end>. We represent the series of actions before an action $a_{t}\in y$ as $y_{:a_{t}}$, i.e., $y_{:a_{t}}=(a_{1},a_{2},\cdots,a_{t-1})$, where $a_{t}$ is excluded. The number of tokens in $y$ is denoted as $|y|$, and the total number of actions in $y$ is denoted as $|y|_{a}$.

We restrict the action space to three types: “solve”, “verify”, and “<end>”, where “solve” actions represent direct attempts to solve the problem, “verify” actions correspond to self-assessments of the preceding solution, and “<end>” actions signal the completion of the reasoning process. We denote the type of action $a_{i}$ as $Type(a_{i})$, where $Type(a_{i})\in\{\texttt{verify},\texttt{solve},\texttt{<end>}\}$. We expect the policy to learn to explore new solutions by generating “solve” actions, to self-verify the correctness of preceding solutions with “verify” actions, and to correct the detected mistakes with new “solve” actions if necessary. 
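The solve / verify / end loop described above can be sketched as a small state machine. This is a minimal illustration, not the authors' implementation: the names `Action`, `judged_correct`, `prefix`, `expected_next_type` and `follows_paradigm` are hypothetical, and the verification verdict is assumed to be already available as a boolean rather than extracted from free-form text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ActionType(Enum):
    SOLVE = "solve"
    VERIFY = "verify"
    END = "<end>"


@dataclass
class Action:
    type: ActionType
    text: str = ""
    # Hypothetical field, only meaningful for verify actions: the verdict
    # the verification reached about the preceding solution.
    judged_correct: Optional[bool] = None


def prefix(y: List[Action], t: int) -> List[Action]:
    """y_{:a_t}: the actions before a_t (1-indexed), with a_t itself excluded."""
    return y[: t - 1]


def expected_next_type(a: Action) -> Optional[ActionType]:
    """Type of the action that should follow `a` under the paradigm above."""
    if a.type is ActionType.SOLVE:
        return ActionType.VERIFY  # every solution is followed by a self-check
    if a.type is ActionType.VERIFY:
        # Stop once a solution is judged correct, otherwise try again.
        return ActionType.END if a.judged_correct else ActionType.SOLVE
    return None  # nothing follows <end>


def follows_paradigm(y: List[Action]) -> bool:
    """Check that y starts with a solve action, ends with <end>, and alternates correctly."""
    if not y or y[0].type is not ActionType.SOLVE or y[-1].type is not ActionType.END:
        return False
    return all(expected_next_type(a) is b.type for a, b in zip(y, y[1:]))


# Example trajectory: one failed attempt, one corrected attempt, then <end>.
y = [
    Action(ActionType.SOLVE, "x = 5"),
    Action(ActionType.VERIFY, "27 + 2*5 = 37, not 39", judged_correct=False),
    Action(ActionType.SOLVE, "x = 6"),
    Action(ActionType.VERIFY, "27 + 2*6 = 39", judged_correct=True),
    Action(ActionType.END),
]
```

Here `len(y)` plays the role of $|y|_{a}$, and `prefix(y, 3)` returns the first failed attempt together with its verification.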
Therefore, for each action $a_{i}$, the type of the next action $a_{i+1}$ is determined by the following rules:

$$ Type(a_{i+1})=\begin{cases}\texttt{verify},&Type(a_{i})=\texttt{solve}\\ \texttt{solve},&Type(a_{i})=\texttt{verify}\text{ and }\text{Parser}(a_{i})=\textsc{incorrect}\\ \texttt{<end>},&Type(a_{i})=\texttt{verify}\text{ and }\text{Parser}(a_{i})=\textsc{correct}\end{cases} $$

Here, $\text{Parser}(a)\in\{\textsc{correct},\textsc{incorrect}\}$ (for any action $a$ where $Type(a)=\texttt{verify}$) is a function (e.g., a regex) that converts the model’s free-form verification text into a binary judgment. For simplicity, we denote the $j$-th solve action as $s_{j}$ and the $j$-th verify action as $v_{j}$. Then we have $y=(s_{1},v_{1},s_{2},v_{2},\cdots,s_{k},v_{k},\texttt{<end>})$.

### 2.2 Initializing Self-verification and Self-correction Behaviors

#### 2.2.1 Learning Valid Self-verification

Learning to perform valid self-verification is the most crucial part of S²R, as models can make mistakes during trial-and-error, and recognizing intermediate mistakes is critical for reaching the correct answer. In this work, we explore two methods for constructing self-verification behavior.

**“Problem-Solving” Verification** The most intuitive method for verification construction is to directly query existing models to generate verifications on the policy models’ responses, and then filter for valid verifications. By querying existing models using different prompts, we found that existing models tend to perform verification in a “Problem-Solving” manner, i.e., by re-solving the problem and checking whether the answer matches the given one. We refer to this kind of verification as “Problem-Solving” Verification.

**“Confirmative” Verification** “Problem-Solving” verification is intuitively not the ideal verification behavior we seek. 
Ideally, we expect the model to think outside the box and re-examine the solution from a new perspective, rather than thinking from the same problem-solving view for verification. We refer to this type of verification behavior as “Confirmative” Verification. Specifically, we construct “Confirmative” Verification by prompting existing LLMs to “verify the correctness of the answer without re-solving the problem”, and filtering out invalid verifications using LLM-as-a-judge. The detailed implementation can be found in Appendix § A.1.

#### 2.2.2 Learning Self-correction

Another critical part of S²R is enabling the model to learn self-correction. Inspired by Kumar et al. (2024) and Snell et al. (2024b), we initialize the self-correcting behavior by concatenating a series of incorrect solutions (each followed by a verification recognizing the mistakes) with a final correct solution. As demonstrated by Kumar et al. (2024), LLMs typically fail to learn valid self-correction behavior through SFT, but the validity of self-correction can be enhanced through reinforcement learning. Therefore, we only initialize the self-correcting behavior at this stage, leaving further enhancement of the self-correcting capabilities to the RL stage.

#### 2.2.3 Constructing Dynamic Trial-and-Error Trajectory

We next construct the complete trial-and-error trajectories for behavior initialization SFT, following three principles:

- To ensure the diversity of the trajectories, we construct trajectories of various lengths. Specifically, we cover $k\in\{1,2,3,4\}$ for $y=(a_{1},\cdots,a_{2k})=(s_{1},v_{1},\cdots,s_{k},v_{k})$ in the trajectories.
- To ensure that the LLMs learn to verify and correct their own errors, we construct the failed trials in each trajectory by sampling and filtering from the LLMs’ own responses. 
- Since a sound test-time scaling method should allocate effort in proportion to problem difficulty, it is important to ensure that the trial-and-error trajectories align with the difficulty level of the problems. Specifically, more difficult problems will require more trial-and-error iterations before reaching the correct answer. Thus, we determine the length of each trajectory based on the accuracy of the sampled responses for each base model.

#### 2.2.4 Supervised Fine-tuning for Thinking Behavior Initialization

Once the dynamic self-verifying and self-correcting training data $\mathcal{D}_{SFT}$ is ready, we optimize the policy $\pi$ for thinking behavior initialization by minimizing the following objective:

$$ \mathcal{L}=-\mathbb{E}_{(x,y)\sim\mathcal{D}_{SFT}}\sum_{a_{t}\in y}\delta_{mask}(a_{t})\log\pi(a_{t}\mid x,y_{:a_{t}}) \tag{1} $$

where the mask function $\delta_{mask}(a_{t})$ for action $a_{t}$ in $y=(a_{1},\cdots,a_{T})$ is defined as:

$$ \delta_{mask}(a_{t})=\begin{cases}1,&\text{if }Type(a_{t})=\texttt{verify}\\ 1,&\text{if }Type(a_{t})=\texttt{solve}\text{ and }t=T-1\\ 1,&\text{if }Type(a_{t})=\texttt{<end>}\text{ and }t=T\\ 0,&\text{otherwise}\end{cases} $$

That is, by using masks during training, we optimize the probability of all verifications but only of the last, correct solution $s_{k}$.

### 2.3 Boosting Thinking Capabilities via Reinforcement Learning

Stage 1 initializes the policy model $\pi$ with self-verification and self-correction behavior, yielding $\pi_{SFT}$. We then explore further enhancing these thinking capabilities of $\pi_{SFT}$ via reinforcement learning. Specifically, we explore two simple RL algorithms: the outcome-level REINFORCE Leave-One-Out (RLOO) algorithm and a process-level group-based RL algorithm.

#### 2.3.1 Outcome-level RLOO

We first introduce the outcome-level REINFORCE Leave-One-Out (RLOO) algorithm Ahmadian et al. (2024); Kool et al. 
(2019) to further enhance the self-verification and self-correction capabilities of $\pi_{SFT}$. Given a problem $x$ and the response $y=(s_{1},v_{1},...,s_{T},v_{T})$, we define the reward function $R_{o}(x,y)$ based on the correctness of the last solution $s_{T}$:

$$ R_{o}(x,y)=\begin{cases}1,&V_{golden}(s_{T})=\texttt{correct}\\ -1,&\text{otherwise}\end{cases} $$

Here $V_{golden}(\cdot)\in\{\texttt{correct},\texttt{incorrect}\}$ represents ground-truth validation by matching the golden answer with the given solution. We calculate the advantage of each response $y$ using an estimated baseline and KL reward shaping as follows:

$$ A(x,y)=R_{o}(x,y)-\hat{b}-\beta\log\frac{\pi_{\theta_{\text{old}}}(y|x)}{\pi_{\text{ref}}(y|x)} \tag{2} $$

where $\beta$ is the KL divergence regularization coefficient, and $\pi_{\text{ref}}$ is the reference policy (in our case, $\pi_{SFT}$). The RLOO baseline $\hat{b}(x,y^{(m)})=\frac{1}{M-1}\sum_{\begin{subarray}{c}j=1,...,M\\ j\neq m\end{subarray}}R_{o}(x,y^{(j)})$ is the leave-one-out mean reward over the $M$ outputs $\{y^{(1)},...,y^{(M)}\}$ sampled for each input $x$, i.e., the mean reward of all sampled outputs other than $y^{(m)}$. Then, we optimize the policy $\pi_{\theta}$ by minimizing the following objective after each sampling episode based on $\pi_{\theta_{\text{old}}}$:

$$ \begin{split}\mathcal{L}(\theta)\ &=\ -\mathbb{E}_{\begin{subarray}{c}x\sim\mathcal{D}\\ y\sim\pi_{\theta_{\text{old}}}(\cdot|x)\end{subarray}}\bigg[\min\big(r(\theta)A(x,y),\\ &\text{clip}\big(r(\theta),1-\epsilon,1+\epsilon\big)A(x,y)\big)\bigg]\end{split} \tag{3} $$

where $r(\theta)=\frac{\pi_{\theta}(y|x)}{\pi_{\theta_{\text{old}}}(y|x)}$ is the probability ratio. When implementing the above loss function, we treat $y$ as a complete trajectory sampled with an input problem $x$, meaning we optimize the entire trajectory with outcome-level supervision. 
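As a rough numerical illustration of Eqs. (2) and (3), the sketch below computes the leave-one-out baseline, the KL-shaped advantage, and the clipped surrogate loss for a single response. It uses plain Python floats and assumes trajectory-level log-probabilities are already given; the function names and the example values of `beta` and `eps` are hypothetical, not taken from the paper.

```python
from typing import List


def rloo_baselines(rewards: List[float]) -> List[float]:
    """Leave-one-out mean: for each sample m, the mean reward of the other M-1 samples."""
    m = len(rewards)
    if m < 2:
        raise ValueError("RLOO needs at least two sampled responses per problem")
    total = sum(rewards)
    return [(total - r) / (m - 1) for r in rewards]


def rloo_advantages(
    rewards: List[float],
    logp_old: List[float],  # log pi_theta_old(y|x) for each sampled response
    logp_ref: List[float],  # log pi_ref(y|x) for each sampled response
    beta: float,
) -> List[float]:
    """Eq. (2): outcome reward minus the baseline minus the KL shaping term."""
    baselines = rloo_baselines(rewards)
    return [
        r - b - beta * (lo - lr)
        for r, b, lo, lr in zip(rewards, baselines, logp_old, logp_ref)
    ]


def clipped_surrogate_loss(ratio: float, advantage: float, eps: float) -> float:
    """Per-sample term of Eq. (3), with ratio = pi_theta(y|x) / pi_theta_old(y|x)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped_ratio * advantage)


# Four sampled responses for one problem; the second one ends with a wrong answer.
rewards = [1.0, -1.0, 1.0, 1.0]
zeros = [0.0] * 4  # identical old/reference log-probs, so the KL term vanishes
advantages = rloo_advantages(rewards, zeros, zeros, beta=0.01)
```

With these rewards, the three correct responses share the baseline 1/3 and the incorrect one has baseline 1, so the incorrect response receives a strongly negative advantage.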
With this approach, we aim to incentivize the policy model to explore more dynamic self-verification and self-correcting trajectories on its own, which has been demonstrated as an effective practice in recent work Guo et al. (2025); Team et al. (2025). #### 2.3.2 Process-level Group-based RL Process-level supervision has demonstrated effectiveness in math reasoning Lightman et al. (2023a); Wang et al. (2024c). Since the trajectory of S 2 r thinking is naturally divided into self-verification and self-correction processes, it is intuitive to adopt process-level supervision for RL training. Inspired by RLOO and process-level GRPO Shao et al. (2024), we designed a group-based process-level optimization method. Specifically, we regard each action $a$ in the output trajectory $y$ as a sub-process and define the action level reward function $R_{a}(a\mid x,y_{:a})$ based on the action type. For each “ solve ” action $s_{j}$ , we expect the policy to generate the correct solution; for each “ verify ” action $v_{j}$ , we expect the verification to align with the actual solution validity. The corresponding rewards are defined as follows: $$ R_{a}(s_{j}\mid x,y_{:s_{j}})=\begin{cases}1,&V_{golden}(s_{j})=\texttt{ correct}\\ -1,&otherwise\\ \end{cases} $$ $$ R_{a}(v_{j}\mid x,y_{:v_{j}})=\begin{cases}1,&Parser(v_{j})=V_{golden}(s_{j}) \\ -1,&otherwise\\ \end{cases} $$ To calculate the advantage of each action $a_{t}$ , we estimate the baseline as the average reward of the group of actions sharing the same reward context: $$ \mathbf{R}(a_{t}\mid x,y)=\left(R_{a}(a_{i}\mid x,y_{:a_{i}})\right)_{i=1}^{t-1} $$ which is defined as the reward sequence of the previous actions $y_{:a_{t}}$ of each action $a_{t}$ . We denote the set of actions sharing the same reward context $\mathbf{R}(a_{t}\mid x,y)$ as $\mathcal{G}(\mathbf{R}(a_{t}\mid x,y))$ . 
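
The action-level rewards and the grouping by reward context described above can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, trajectories are reduced to lists of booleans, and grouping across every trajectory in the batch is an assumption about how the group $\mathcal{G}$ is formed.

```python
from collections import defaultdict


def action_rewards(solution_correct, verdict_correct):
    """Per-action rewards for a trajectory (s_1, v_1, ..., s_T, v_T).

    solution_correct[j] : True if V_golden(s_j) is `correct`
    verdict_correct[j]  : True if the parsed verification v_j says `correct`
    """
    rewards = []
    for is_correct, verdict in zip(solution_correct, verdict_correct):
        rewards.append(1 if is_correct else -1)             # solve action
        rewards.append(1 if verdict == is_correct else -1)  # verify action
    return rewards


def group_baselines(batch_rewards):
    """Baseline per action: mean reward of all actions in the batch whose
    preceding reward sequence (the reward context) is identical."""
    groups = defaultdict(list)
    for rewards in batch_rewards:
        for t, r in enumerate(rewards):
            groups[tuple(rewards[:t])].append(r)
    baselines = []
    for rewards in batch_rewards:
        row = []
        for t in range(len(rewards)):
            members = groups[tuple(rewards[:t])]
            row.append(sum(members) / len(members))
        baselines.append(row)
    return baselines


# Toy batch of three trajectories (hypothetical data).
traj_a = action_rewards([False, True], [False, True])   # fail, caught, then solved
traj_b = action_rewards([True], [True])                 # solved at first attempt
traj_c = action_rewards([False, False], [False, True])  # second failure missed
baselines = group_baselines([traj_a, traj_b, traj_c])
```

Because a reward context of even length always precedes a solve action and one of odd length always precedes a verify action, keying the groups on the context alone already keeps the two action types apart.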
Then the baseline can be estimated as follows:

$$ \hat{b}(a_{t}\mid x,y)=\frac{1}{|\mathcal{G}(\mathbf{R}(a_{t}|x,y))|}\sum_{a\in\mathcal{G}(\mathbf{R}(a_{t}|x,y))}R_{a}(a|x^{(a)},y^{(a)}_{:a}) \tag{4} $$

And the advantage of each action $a_{t}$ is:

$$ \begin{split}A(a_{t}\mid x,y)=&R_{a}(a_{t}\mid x,y_{:a_{t}})-\hat{b}(a_{t}\mid x,y)\\ &-\beta\log\frac{\pi_{\theta_{old}}(a_{t}\mid x,y)}{\pi_{ref}(a_{t}\mid x,y)}\end{split} \tag{5} $$

The main idea of the group-based baseline estimation is that the actions sharing the same reward context are provided with similar amounts of information before the action is taken. For instance, all actions sharing a reward context consisting of one failed attempt and one successful verification (i.e., $\mathbf{R}(a_{t}|x,y)=(-1,1)$) are provided with the information about the problem, a failed attempt, and the reassessment on the failure. Given the same amount of information, it is reasonable to estimate a baseline using the average reward of these actions.

Putting it all together, we minimize the following surrogate loss function to update the policy parameters $\theta$, using trajectories collected from $\pi_{\theta_{old}}$:

$$ \begin{split}\mathcal{L}(\theta)\ &=\ -\mathbb{E}_{\begin{subarray}{c}x\sim\mathcal{D}\\ y\sim\pi_{\theta_{old}}(\cdot|x)\end{subarray}}\bigg[\frac{1}{|y|_{a}}\sum_{a\in y}\min\big(r_{a}(\theta)A(a|x,y_{:a}),\\ &\text{clip}\big(r_{a}(\theta),1-\epsilon,1+\epsilon\big)A(a|x,y_{:a})\big)\bigg]\end{split} \tag{6} $$

where $r_{a}(\theta)=\frac{\pi_{\theta}(a|x,y_{:a})}{\pi_{\theta_{old}}(a|x,y_{:a})}$ is the importance ratio.

### 2.4 More Efficient Training with Offline RL

While online RL is known for its high resource requirements, offline RL, which does not require real-time sampling during training, offers a more efficient alternative for RL training.
Additionally, offline sampling allows for more accurate baseline calculations with better trajectory grouping for each policy. As part of our exploration into more efficient RL training in the S 2 r framework, we also experimented with offline RL to assess its potential in further enhancing the models’ thinking abilities. In Appendix § D.2, we include more details and a formal definition of offline RL training.

## 3 Experiment

To verify the effectiveness of the proposed method, we conducted extensive experiments across 3 different base policy models on various benchmarks.

| Base Model | Source | # Training Data |
| --- | --- | --- |
| **Stage 1: Behavior Initialization** | | |
| Llama-3.1-8B-Instruct | MATH | 4614 |
| Qwen2-7B-Instruct | MATH | 4366 |
| Qwen2.5-Math-7B | MATH | 3111 |
| **Stage 2: Reinforcement Learning** | | |
| Llama-3.1-8B-Instruct | MATH+GSM8K | 9601 |
| Qwen2-7B-Instruct | MATH+GSM8K | 9601 |
| Qwen2.5-Math-7B | MATH+OpenMath2.0 | 10000 |

Table 1: Training data statistics.

Table 2: The performance of S 2 r and other strong baselines on the most challenging math benchmarks. BI refers to the behavior-initialized models through supervised fine-tuning, ORL denotes models trained with outcome-level RL, and PRL refers to models trained with process-level RL. The highest results are highlighted in bold and the second-best results are marked with underline. For some baselines, we use the results from their original reports or from Guan et al. (2025), denoted by ∗.

### 3.1 Experiment Setup

**Base Models.** To evaluate the general applicability of our method across different LLMs, we conducted experiments using three distinct base models: Llama-3.1-8B-Instruct Dubey et al. (2024), Qwen2-7B-Instruct qwe (2024), and Qwen2.5-Math-7B Qwen (2024). Llama-3.1-8B-Instruct and Qwen2-7B-Instruct are versatile general-purpose models trained on diverse domains without a specialized focus on mathematical reasoning.
In contrast, Qwen2.5-Math-7B is a state-of-the-art model specifically tailored for mathematical problem-solving and has been widely adopted in recent research on math reasoning Guan et al. (2025); Cui et al. (2025); Zeng et al. (2025).

**Training Data Setup.** For Stage 1: Behavior Initialization, we used the widely adopted MATH Hendrycks et al. (2021a) training set for dynamic trial-and-error data collection (we use the MATH split from Lightman et al. (2023a), i.e., 12000 problems for training and 500 problems for testing). For each base model, we sampled 5 responses per problem in the training data. After data filtering and sampling, we constructed a dynamic trial-and-error training set consisting of 3k-4k instances for each base model. Detailed statistics of the training set are shown in Table 1. For Stage 2: Reinforcement Learning, we used the MATH+GSM8K Cobbe et al. (2021a) training data for RL training on the policy $\pi_{SFT}$ initialized from Llama-3.1-8B-Instruct and Qwen2-7B-Instruct. Since Qwen2.5-Math-7B already achieves high accuracy on the GSM8K training data after Stage 1, we additionally include training data randomly sampled from the OpenMath2 dataset Toshniwal et al. (2024). Following Cui et al. (2025), we filter out excessively easy or difficult problems based on each $\pi_{SFT}$ from Stage 1 to enhance the efficiency and stability of RL training, resulting in RL training sets consisting of approximately 10000 instances. Detailed statistics of the final training data can be found in Table 1. Additional details on training data construction can be found in Appendix § A.1.

**Baselines.** We benchmark our proposed method against four categories of strong baselines:

- Frontier LLMs include cutting-edge proprietary models such as GPT-4o, the latest Claude, and OpenAI’s o1-preview and o1-mini.
- Top-tier open-source reasoning models cover state-of-the-art open-source models known for their strong reasoning capabilities, including Mathstral-7B-v0.1 Team (2024b), NuminaMath-72B LI et al. (2024), LLaMA3.1-70B-Instruct Dubey et al. (2024), and Qwen2.5-Math-72B-Instruct Yang et al. (2024).
- Enhanced models built on Qwen2.5-Math-7B: Given the recent popularity of Qwen2.5-Math-7B as a base policy model, we evaluate S 2 r against three competitive baselines that have demonstrated superior performance based on Qwen2.5-Math-7B: Eurus-2-7B-PRIME Cui et al. (2025), rStar-Math-7B Guan et al. (2025), and Qwen2.5-7B-SimpleRL Zeng et al. (2025). These models serve as direct and strong baselines for our Qwen2.5-Math-7B-based variants.
- SFT with different CoT constructions: We also compare with training on competitive types of CoT reasoning, including the original CoT solution in the training datasets, and Long-CoT solutions distilled from QwQ-32B-Preview Team (2024a), a widely adopted open-source o1-like model Chen et al. (2024c); Guan et al. (2025); Zheng et al. (2024). Specifically, to ensure a fair comparison between behavior initialization with long-CoT and S 2 r, we use long-CoT data of the same size as our behavior initialization data. We provide more details on the baseline data construction in Appendix § A.2.3.

More details on the baselines are included in Appendix § A.2.

**Evaluation Datasets.** We evaluate the proposed method on 7 diverse mathematical benchmarks. To ensure a comprehensive evaluation, in addition to the in-distribution GSM8K Cobbe et al. (2021b) and MATH500 Lightman et al. (2023a) test sets, we include challenging out-of-distribution benchmarks covering various difficulty levels and mathematical domains, including the AIME 2024 competition problems AI-MO (2024a), the AMC 2023 exam AI-MO (2024b), the advanced reasoning tasks from Olympiad Bench He et al. (2024), and college-level problem sets from College Math Tang et al. (2024a).
Additionally, we assess performance on real-world standardized tests, the GaoKao (Chinese College Entrance Exam) En 2023 Liao et al. (2024). A detailed description of these datasets is provided in Appendix § B.1. Evaluation Metrics We report Pass@1 accuracy for all baselines. For inference, we employ vLLM Kwon et al. (2023) and develop evaluation scripts based on Qwen Math’s codebase. All evaluations are performed using greedy decoding. Details of the prompts used during inference are provided in Appendix § A.3. All implementation details, including hyperparameter settings, can be found in Appendix § B.2. ### 3.2 Main Results Table 2 shows the main results of S 2 r compared with baseline methods. We can observe that: (1) S 2 r consistently improves the reasoning abilities of models across all base models. Notably, on Qwen2.5-Math-7B, the proposed method improves the base model by 32.2% on MATH500 and by 34.3% on GSM8K. (2) Generally, S 2 r outperforms the baseline methods derived from the same base models across most benchmarks. Specifically, on Qwen2.5-Math-7B, S 2 r surpasses several recently proposed competitive baselines, such as Eurus-2-7B-PRIME, rStar-Math-7B and Qwen2.5-7B-SimpleRL. While Eurus-2-7B-PRIME and rStar-Math-7B rely on larger training datasets (Figure 1) and require more data construction and reward modeling efforts, S 2 r only needs linear sampling efforts for data construction, 10k RL training data and rule-based reward modeling. These results highlight the efficiency of S 2 r. (3) With the same scale of SFT data, S 2 r also outperforms the long-CoT models distilled from QwQ-32B-Preview, demonstrating that learning to self-verify and self-correct is an effective alternative to long-CoT for test-time scaling in smaller LLMs. Comparing process-level and outcome-level RL, we find that outcome-level RL generally outperforms process-level RL across the three models. 
This is likely because outcome-level RL allows models to explore trajectories without emphasizing intermediate accuracy, which may benefit long-thought reasoning in stronger base models like Qwen2.5-Math-7B. In contrast, process-level RL, which provides guidance for each intermediate verification and correction step, may be effective for models with lower initial capabilities, such as Qwen2-7B-Instruct. As shown in Figure 3, process-level RL can notably enhance the verification and correction abilities of Qwen2-7B- S 2 r -BI.

| Model | FOLIO | CRUXEval | StrategyQA | MMLUPro-STEM |
| --- | --- | --- | --- | --- |
| Qwen2.5-Math-72B-Instruct | 69.5 | 68.6 | 94.3 | 66.0 |
| Llama-3.1-70B-Instruct ∗ | 65.0 | 59.6 | 88.8 | 61.7 |
| OpenMath2-Llama3.1-70B ∗ | 68.5 | 35.1 | 95.6 | 55.0 |
| QwQ-32B-Preview ∗ | 84.2 | 65.2 | 88.2 | 71.9 |
| Eurus-2-7B-PRIME | 56.7 | 50.0 | 79.0 | 53.7 |
| Qwen2.5-Math-7B-Instruct | 61.6 | 28.0 | 81.2 | 44.7 |
| Qwen2.5-Math-7B | 37.9 | 40.8 | 61.1 | 46.0 |
| Qwen2.5-Math-7B- S 2 r -BI (ours) | 58.1 | 48.0 | 88.7 | 49.8 |
| Qwen2.5-Math-7B- S 2 r -ORL (ours) | 61.6 | 50.9 | 90.8 | 50.0 |

Table 3: Performance of the proposed method and the baseline methods on 4 cross-domain tasks. The results with ∗ are reported by Shen et al. (2025).

### 3.3 Generalizing to Cross-domain Tasks

Despite training on math reasoning tasks, we found that the learned self-verifying and self-correcting capability can also generalize to out-of-distribution general domains. In Table 3, we evaluate the SFT model and the outcome-level RL model based on Qwen2.5-Math-7B on four cross-domain tasks: FOLIO Han et al. (2022) on logical reasoning, CRUXEval Gu et al. (2024) on code reasoning, StrategyQA Geva et al. (2021) on multi-hop reasoning and MMLUPro-STEM on multi-task complex understanding Wang et al. (2024d); Shen et al. (2025), with details of these datasets provided in Appendix § B.1.
The results show that after learning to self-verify and self-correct, the proposed method effectively boosts the base model’s performance across all tasks and achieves results comparable to the baseline models. These findings indicate that the learned self-verifying and self-correcting capabilities are general thinking skills, which can also benefit reasoning in general domains. Additionally, we expect that the performance in specific domains can be further improved by applying S 2 r training on domain data with minimal reward model requirements (e.g., rule-based or LLM-as-a-judge). For better illustration, we show cases on how the trained models perform self-verifying and self-correcting on general tasks in Appendix § E.

### 3.4 Analyzing Self-verification and Self-correction Abilities

In this section, we conduct analytical experiments on the models’ self-verification and self-correction capabilities from various perspectives.

#### 3.4.1 Problem-solving vs. Confirmative Verification

We first compare the Problem-solving and Confirmative Verification methods described in § 2.2.1. In Table 4, we present the verification results of different methods on the MATH500 test set. We report the overall verification accuracy, as well as the initial verification accuracy when the initial answer is correct ( $V_{golden}(s_{0})=\texttt{correct}$ ) and incorrect ( $V_{golden}(s_{0})=\texttt{incorrect}$ ), respectively.

| Base Model | Method | Overall Verification Acc. | Initial Verification Acc. ( $V_{golden}(s_{0})=\texttt{correct}$ ) | Initial Verification Acc. ( $V_{golden}(s_{0})=\texttt{incorrect}$ ) |
| --- | --- | --- | --- | --- |
| Llama3.1-8B-Instruct | Problem-solving | 80.10 | 87.28 | 66.96 |
| Llama3.1-8B-Instruct | Confirmative | 65.67 | 77.27 | 78.22 |
| Qwen2-7B-Instruct | Problem-solving | 73.28 | 90.24 | 67.37 |
| Qwen2-7B-Instruct | Confirmative | 58.31 | 76.16 | 70.05 |
| Qwen2.5-Math-7B | Problem-solving | 77.25 | 91.21 | 56.67 |
| Qwen2.5-Math-7B | Confirmative | 61.58 | 82.80 | 68.04 |

Table 4: Comparison of problem-solving and confirmative verification.

We observe from the table that: (1) Generally, problem-solving verification achieves superior overall accuracy compared to confirmative verification. This result is intuitive, as existing models are trained for problem-solving, and recent studies have highlighted the difficulty of existing LLMs in performing reverse thinking Berglund et al. (2023); Chen et al. (2024b). During data collection, we also found that existing models tend to verify through problem-solving, even when prompted to verify without re-solving (see Table 6 in Appendix § A.1). (2) In practice, accuracy alone does not fully reflect the validity of a method. For example, when answer accuracy is sufficiently high, predicting all answers as correct will naturally lead to high verification accuracy, but this is not a desired behavior. By further examining the initial verification accuracy for both correct and incorrect answers, we found that problem-solving verification exhibits a notable bias toward predicting answers as correct, while the predictions from confirmative verification are more balanced. We conjecture that this bias arises because problem-solving verification is more heavily influenced by the preceding solution, aligning with previous studies showing that LLMs struggle to identify their own errors Huang et al. (2023); Tyen et al. (2023).
In contrast, confirmative verification performs verification from different perspectives, making it less influenced by the LLMs’ preceding solution. In all experiments, we used confirmative verification for behavior initialization. #### 3.4.2 Boosting Self-verifying and Self-correcting with RL In this experiment, we investigate the effect of RL training on the models’ self-verifying and self-correcting capabilities. We assess self-verification using the following metrics: (1) Verification Accuracy: The overall accuracy of verification predictions, as described in § 3.4.1. (2) Error Recall: The recall of verification when the preceding answers are incorrect. (3) Correct Precision: The precision of verification when it predicts the answers as correct. Both Error Recall and Correct Precision directly affect the final answer accuracy: if verification fails to detect an incorrect answer, or if it incorrectly predicts an answer as correct, the final answer will be wrong. For self-correction, we use the following metrics: (1) Incorrect to Correct Rate: the rate at which the model successfully corrects an incorrect initial answer to a correct final answer. (2) Correct to Incorrect Rate: the rate at which the model incorrectly changes a correct initial answer to an incorrect final answer. We provide the formal definitions of the metrics used in Appendix § C. <details> <summary>x3.png Details</summary> ![8240fd98](/v1/image/8240fd98a09275b0e587aa378855c090685c685b3a6df1f6273c3f380518bff1) ### Visual Description ## Bar Chart: Evaluation on Verification and Correction (Base Model: Qwen2-7B-Instruct) ### Overview This image contains two bar charts side-by-side, presenting evaluation metrics for a base model named "Qwen2-7B-Instruct". The left chart displays "Self-verification Metrics", and the right chart displays "Self-correction Metrics". Both charts compare three different configurations: "SFT", "SFT + Process-level RL", and "SFT + Outcome-level RL". 
The y-axis for both charts represents "Value (%)". ### Components/Axes **Overall Title:** Evaluation on Verification and Correction (Base Model: Qwen2-7B-Instruct) **Left Chart: Self-verification Metrics** * **Title:** Self-verification Metrics * **Y-axis Title:** Value (%) * **Y-axis Scale:** 50 to 100, with major ticks at 50, 60, 70, 80, 90, 100. * **X-axis Categories:** Verification Accuracy, Error Recall, Correct Precision. * **Legend:** Located in the top-left quadrant of the left chart. * **SFT:** Represented by a light grey rectangle. * **SFT + Process-level RL:** Represented by a teal/mint green rectangle. * **SFT + Outcome-level RL:** Represented by a coral/light orange rectangle. **Right Chart: Self-correction Metrics** * **Title:** Self-correction Metrics * **Y-axis Title:** Value (%) * **Y-axis Scale:** 0 to 25, with major ticks at 0, 5, 10, 15, 20, 25. * **X-axis Categories:** Incorrect to Correct, Correct to Incorrect. * **Legend:** The legend from the left chart is applicable to both charts. ### Detailed Analysis **Left Chart: Self-verification Metrics** * **Verification Accuracy:** * SFT (Grey): 58.31% * SFT + Process-level RL (Teal): 67.86% * SFT + Outcome-level RL (Coral): 63.93% * **Trend:** SFT + Process-level RL shows the highest Verification Accuracy, followed by SFT + Outcome-level RL, and then SFT. * **Error Recall:** * SFT (Grey): 81.91% * SFT + Process-level RL (Teal): 86.67% * SFT + Outcome-level RL (Coral): 87.34% * **Trend:** SFT + Outcome-level RL shows the highest Error Recall, closely followed by SFT + Process-level RL, and then SFT. * **Correct Precision:** * SFT (Grey): 65.58% * SFT + Process-level RL (Teal): 73.59% * SFT + Outcome-level RL (Coral): 69.80% * **Trend:** SFT + Process-level RL shows the highest Correct Precision, followed by SFT + Outcome-level RL, and then SFT. 
**Right Chart: Self-correction Metrics** * **Incorrect to Correct:** * SFT (Grey): 20.00% * SFT + Process-level RL (Teal): 22.17% * SFT + Outcome-level RL (Coral): 19.55% * **Trend:** SFT + Process-level RL shows the highest rate of correcting incorrect predictions, followed by SFT, and then SFT + Outcome-level RL. * **Correct to Incorrect:** * SFT (Grey): 8.42% * SFT + Process-level RL (Teal): 5.39% * SFT + Outcome-level RL (Coral): 3.93% * **Trend:** SFT shows the highest rate of incorrectly correcting correct predictions, while SFT + Outcome-level RL shows the lowest rate. The SFT + Process-level RL is in between. ### Key Observations * **Self-verification:** The "SFT + Process-level RL" configuration generally performs best across "Verification Accuracy" and "Correct Precision". "SFT + Outcome-level RL" performs best for "Error Recall". All RL-enhanced configurations ("SFT + Process-level RL" and "SFT + Outcome-level RL") outperform the base "SFT" model in all self-verification metrics. * **Self-correction:** For "Incorrect to Correct", "SFT + Process-level RL" is the best. For "Correct to Incorrect", "SFT + Outcome-level RL" is the best, indicating it is least likely to make a correct prediction incorrect. * **Trade-offs:** There appears to be a trade-off between "Incorrect to Correct" and "Correct to Incorrect" rates. While "SFT + Process-level RL" excels at correcting errors, it also has a higher rate of making correct predictions incorrect compared to "SFT + Outcome-level RL". Conversely, "SFT + Outcome-level RL" is better at preserving correct predictions but is slightly less effective at correcting incorrect ones compared to "SFT + Process-level RL". ### Interpretation The data suggests that applying Reinforcement Learning (RL) techniques, specifically "Process-level RL" and "Outcome-level RL", to the base "SFT" model significantly improves its self-verification and self-correction capabilities when evaluated on the "Qwen2-7B-Instruct" base model. 
The "Self-verification Metrics" indicate that RL enhancements lead to better accuracy in verifying information, recalling errors, and precisely correcting errors. The "SFT + Process-level RL" configuration appears to be a strong contender for overall self-verification performance, particularly in accuracy and precision. The "Self-correction Metrics" reveal nuanced performance. "SFT + Process-level RL" is most effective at turning incorrect predictions into correct ones. However, "SFT + Outcome-level RL" demonstrates a superior ability to avoid degrading correct predictions into incorrect ones. This suggests that "Outcome-level RL" might be more conservative or robust in maintaining correctness, while "Process-level RL" might be more aggressive in error correction, potentially at the cost of introducing new errors. In essence, the choice between "SFT + Process-level RL" and "SFT + Outcome-level RL" might depend on the specific priorities of the application. If the primary goal is to maximize the correction of errors, "SFT + Process-level RL" is favored. If the priority is to minimize the degradation of correct predictions, "SFT + Outcome-level RL" is the better choice. Both RL approaches offer substantial improvements over the baseline "SFT" model. </details> (a) <details> <summary>x4.png Details</summary> ![ebc40214](/v1/image/ebc402146d1240392ebb4fffa0770f29f3a1dcf29a907754ebc8af6e6df3507a) ### Visual Description ## Bar Chart: Evaluation on Verification and Correction (Base Model: Qwen2.5-Math-7B) ### Overview This image contains two bar charts side-by-side, presenting the results of self-verification and self-correction metrics for a base model named "Qwen2.5-Math-7B". The metrics are evaluated across three different configurations: "SFT", "SFT + Process-level RL", and "SFT + Outcome-level RL". The left chart focuses on "Self-verification Metrics" and the right chart on "Self-correction Metrics". 
### Components/Axes **Overall Title:** Evaluation on Verification and Correction (Base Model: Qwen2.5-Math-7B) **Left Chart:** * **Title:** Self-verification Metrics * **X-axis Title:** (Implicitly, the metric categories) * **X-axis Categories:** Verification Accuracy, Error Recall, Correct Precision * **Y-axis Title:** Value (%) * **Y-axis Scale:** 50 to 90, with major ticks at 50, 60, 70, 80, 90. * **Legend:** Located in the top-left quadrant of the left chart. * **SFT:** Represented by a grey color. * **SFT + Process-level RL:** Represented by a teal/green color. * **SFT + Outcome-level RL:** Represented by a coral/orange color. **Right Chart:** * **Title:** Self-correction Metrics * **X-axis Title:** (Implicitly, the metric categories) * **X-axis Categories:** Incorrect to Correct, Correct to Incorrect * **Y-axis Title:** Value (%) * **Y-axis Scale:** 0 to 14, with major ticks at 0, 2, 4, 6, 8, 10, 12, 14. * **Legend:** The legend from the left chart applies to both charts. ### Detailed Analysis **Left Chart: Self-verification Metrics** * **Verification Accuracy:** * SFT (Grey): 61.58% * SFT + Process-level RL (Teal): 74.61% * SFT + Outcome-level RL (Coral): 66.49% * **Error Recall:** * SFT (Grey): 66.83% * SFT + Process-level RL (Teal): 64.75% * SFT + Outcome-level RL (Coral): 70.11% * **Correct Precision:** * SFT (Grey): 84.94% * SFT + Process-level RL (Teal): 90.28% * SFT + Outcome-level RL (Coral): 87.85% **Right Chart: Self-correction Metrics** * **Incorrect to Correct:** * SFT (Grey): 6.52% * SFT + Process-level RL (Teal): 12.22% * SFT + Outcome-level RL (Coral): 13.64% * **Correct to Incorrect:** * SFT (Grey): 1.96% * SFT + Process-level RL (Teal): 1.46% * SFT + Outcome-level RL (Coral): 0.97% ### Key Observations * **Verification Accuracy:** "SFT + Process-level RL" shows the highest Verification Accuracy (74.61%), significantly outperforming both "SFT" (61.58%) and "SFT + Outcome-level RL" (66.49%). 
* **Error Recall:** "SFT + Outcome-level RL" has the highest Error Recall (70.11%), followed by "SFT" (66.83%). "SFT + Process-level RL" has the lowest Error Recall (64.75%). * **Correct Precision:** "SFT + Process-level RL" achieves the highest Correct Precision (90.28%), with "SFT + Outcome-level RL" (87.85%) and "SFT" (84.94%) following. * **Incorrect to Correct:** The addition of RL significantly improves the ability to correct incorrect predictions. "SFT + Outcome-level RL" shows the highest value (13.64%), followed closely by "SFT + Process-level RL" (12.22%), both substantially higher than "SFT" (6.52%). * **Correct to Incorrect:** The addition of RL appears to reduce the rate of correcting correct predictions. "SFT" has the highest rate (1.96%), while "SFT + Outcome-level RL" has the lowest (0.97%), with "SFT + Process-level RL" in between (1.46%). ### Interpretation The data suggests that incorporating Reinforcement Learning (RL) strategies, particularly "Process-level RL" and "Outcome-level RL", generally enhances the self-verification and self-correction capabilities of the "Qwen2.5-Math-7B" model compared to the base "SFT" model. Specifically, "SFT + Process-level RL" demonstrates superior performance in "Verification Accuracy" and "Correct Precision", indicating a better ability to accurately verify its own outputs and to precisely identify correct predictions. This configuration also shows a substantial improvement in correcting "Incorrect to Correct" scenarios, suggesting it is more adept at fixing its own mistakes. "SFT + Outcome-level RL" also shows significant gains in correcting "Incorrect to Correct" predictions, even surpassing "Process-level RL" in this specific metric. It also leads in "Error Recall", implying it is better at identifying errors that need attention. 
However, it shows a slight decrease in "Correct Precision" compared to "Process-level RL" and a notable reduction in "Correct to Incorrect" rates, which could imply a more conservative approach to corrections, potentially avoiding unnecessary changes to correct outputs. The base "SFT" model performs the lowest across most metrics, highlighting the benefit of the RL fine-tuning. No trade-off between "Incorrect to Correct" and "Correct to Incorrect" rates is visible for this base model: both RL methods improve the correction of errors and also lower the rate of incorrectly modifying already correct outputs, with "Outcome-level RL" lowering it the most (0.97%, versus 1.46% for "Process-level RL" and 1.96% for "SFT"). Overall, the results indicate that RL fine-tuning is a promising direction for improving the self-evaluation and self-correction abilities of large language models, with "SFT + Outcome-level RL" showing a strong balance of correcting errors and preserving correct outputs. </details> (b) Figure 3: Evaluation on verification and correction.

In Figure 3, we present the results of the behavior-initialized model (SFT) and different RL models obtained from Qwen2.5-Math-7B. We observe that: (1) Both RL methods effectively enhance self-verification accuracy. The process-level RL shows larger improvement on accuracy, while the outcome-level RL consistently improves Error Recall and Correct Precision. This might be because process-level supervision indiscriminately promotes verification accuracy in intermediate steps, while outcome-level supervision allows the policy model to explore freely in intermediate steps and only boosts the final answer accuracy, thus mainly enhancing Error Recall and Correct Precision (which directly relate to final answer accuracy). (2) Both RL methods can successfully enhance the models’ self-correction capability. Notably, the model’s ability to correct incorrect answers is significantly improved after RL training.
The rate at which the model mistakenly alters correct answers is also notably reduced. This comparison demonstrates that S 2 r can substantially enhance the validity of models’ self-correction ability. <details> <summary>x5.png Details</summary> ![cbd7f573](/v1/image/cbd7f573c03ac3f4adbff2570021725a267d78405e8b75cf7d5be5d014242b15) ### Visual Description ## Bar Chart: Accuracy and Trial Numbers across Difficulty Level (Base Model: Llama3.1-8B-Instruct) ### Overview This bar chart displays the "Accuracy" and "Trial Numbers" for two different training configurations, "SFT" and "SFT+RL", across five distinct "Difficulty Levels" (Level 1 to Level 5). The chart uses a dual y-axis system: the left y-axis represents "Accuracy" ranging from 0.2 to 1.0, and the right y-axis represents "Trial Numbers" ranging from 0 to 6. The base model used for this analysis is Llama3.1-8B-Instruct. ### Components/Axes * **Title:** "Accuracy and Trial Numbers across Difficulty Level (Base Model: Llama3.1-8B-Instruct)" * **X-axis:** "Difficulty Level" with categories: "Level 1", "Level 2", "Level 3", "Level 4", "Level 5". * **Left Y-axis:** "Accuracy" with a scale from 0.2 to 1.0, marked at intervals of 0.1. * **Right Y-axis:** "Trial Numbers" with a scale from 0 to 6, marked at intervals of 1. * **Legend:** Located in the top-center of the chart. * Light Green bar: "SFT Accuracy" * Dark Green bar: "SFT+RL Accuracy" * Light Red bar: "SFT Trials" * Dark Red bar: "SFT+RL Trials" ### Detailed Analysis The chart presents grouped bars for each difficulty level, with two bars representing accuracy and two bars representing trial numbers.
**Level 1:** * **SFT Accuracy:** 0.814 (light green bar, left axis) * **SFT+RL Accuracy:** 0.930 (dark green bar, left axis) * **SFT Trials:** 3.279 (light red bar, right axis) * **SFT+RL Trials:** 2.209 (dark red bar, right axis) **Level 2:** * **SFT Accuracy:** 0.733 (light green bar, left axis) * **SFT+RL Accuracy:** 0.722 (dark green bar, left axis) * **SFT Trials:** 3.367 (light red bar, right axis) * **SFT+RL Trials:** 2.844 (dark red bar, right axis) **Level 3:** * **SFT Accuracy:** 0.610 (light green bar, left axis) * **SFT+RL Accuracy:** 0.638 (dark green bar, left axis) * **SFT Trials:** 3.924 (light red bar, right axis) * **SFT+RL Trials:** 4.219 (dark red bar, right axis) **Level 4:** * **SFT Accuracy:** 0.367 (light green bar, left axis) * **SFT+RL Accuracy:** 0.445 (dark green bar, left axis) * **SFT Trials:** 5.117 (light red bar, right axis) * **SFT+RL Trials:** 4.234 (dark red bar, right axis) **Level 5:** * **SFT Accuracy:** 0.239 (light green bar, left axis) * **SFT+RL Accuracy:** 0.276 (dark green bar, left axis) * **SFT Trials:** 4.104 (light red bar, right axis) * **SFT+RL Trials:** 5.254 (dark red bar, right axis) ### Key Observations * **Accuracy Trend:** * "SFT Accuracy" generally decreases as difficulty level increases, starting at 0.814 for Level 1 and dropping to 0.239 for Level 5. * "SFT+RL Accuracy" also generally decreases with increasing difficulty, but it consistently outperforms "SFT Accuracy" for Level 1, Level 4, and Level 5. For Level 2 and Level 3, "SFT Accuracy" is slightly higher or comparable to "SFT+RL Accuracy". * **Trial Numbers Trend:** * "SFT Trials" show a general increase from Level 1 (3.279) to Level 4 (5.117), with a slight dip at Level 5 (4.104). * "SFT+RL Trials" show a decrease from Level 1 (2.209) to Level 2 (2.844), then an increase to Level 3 (4.219) and Level 4 (4.234), and finally a significant increase to Level 5 (5.254). * **Comparison of SFT vs. 
SFT+RL:**
  * "SFT+RL Accuracy" is higher than "SFT Accuracy" at Level 1 (0.930 vs 0.814), Level 3 (0.638 vs 0.610), Level 4 (0.445 vs 0.367), and Level 5 (0.276 vs 0.239).
  * "SFT Accuracy" is slightly higher than "SFT+RL Accuracy" only at Level 2 (0.733 vs 0.722).
  * "SFT Trials" are higher than "SFT+RL Trials" at Level 1, Level 2, and Level 4, and lower at Level 3 and Level 5. Notably, "SFT+RL Trials" are highest at Level 5 (5.254), while "SFT Trials" are highest at Level 4 (5.117).

### Interpretation

This chart suggests that the "SFT+RL" training method generally leads to higher accuracy than "SFT" alone. The gain is largest at Level 1 (0.930 vs 0.814) and smaller at Level 3, Level 4, and Level 5. Only at Level 2 does "SFT" show slightly better accuracy.

The trial numbers reflect how many attempts the model makes per problem at test time, not training iterations. After RL, the model uses fewer trials on the easier levels (Level 1 and Level 2) and the most trials at Level 5 (5.254), where "SFT+RL Accuracy" is also higher than "SFT Accuracy". Conversely, at Level 4, "SFT Trials" are higher than "SFT+RL Trials", yet "SFT+RL Accuracy" is still superior. More trials therefore do not by themselves imply higher accuracy; the RL-trained model appears to spend its trials more in line with problem difficulty.

The overall trend of decreasing accuracy with increasing difficulty level is expected for both training methods. "SFT+RL" retains a higher accuracy at the two hardest levels (0.445 vs 0.367 at Level 4 and 0.276 vs 0.239 at Level 5), which suggests that reinforcement learning may be beneficial for more challenging tasks.
</details>

(a)

<details>
<summary>x6.png Details</summary>

![db311804](/v1/image/db3118041750769b0d2426bfb6ecbf9828d3be07bb1d5c9bdabe17c7738eba47)

### Visual Description

## Bar Chart: Accuracy and Trial Numbers across Difficulty Level

### Overview

This bar chart displays the accuracy and trial numbers for two different training methodologies, "SFT" and "SFT+RL", across five distinct difficulty levels. The chart uses a dual y-axis system: the left y-axis represents "Accuracy" (ranging from 0.60 to 1.00), and the right y-axis represents "Trial Numbers" (ranging from 0.0 to 2.5). The x-axis denotes the "Difficulty Level", with categories from Level 1 to Level 5.

### Components/Axes

* **Title:** "Accuracy and Trial Numbers across Difficulty Level (Base Model: Qwen2.5-Math-7B)"
* **X-axis Label:** "Difficulty Level"
  * **Categories:** Level 1, Level 2, Level 3, Level 4, Level 5
* **Left Y-axis Label:** "Accuracy"
  * **Scale:** 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00
* **Right Y-axis Label:** "Trial Numbers"
  * **Scale:** 0.0, 0.5, 1.0, 1.5, 2.0, 2.5
* **Legend:** Located in the top-center of the chart.
  * **SFT Accuracy:** Light green bars.
  * **SFT+RL Accuracy:** Dark green bars.
  * **SFT Trials:** Light red bars.
  * **SFT+RL Trials:** Dark red bars.

### Detailed Analysis

The chart presents grouped bars for each difficulty level, with two bars representing accuracy and two bars representing trial numbers.
**Level 1:**
* **SFT Accuracy:** 0.930 (light green bar, left y-axis)
* **SFT+RL Accuracy:** 0.930 (dark green bar, left y-axis)
* **SFT Trials:** 1.116 (light red bar, right y-axis)
* **SFT+RL Trials:** 1.047 (dark red bar, right y-axis)

**Level 2:**
* **SFT Accuracy:** 0.944 (light green bar, left y-axis)
* **SFT+RL Accuracy:** 0.944 (dark green bar, left y-axis)
* **SFT Trials:** 1.311 (light red bar, right y-axis)
* **SFT+RL Trials:** 1.244 (dark red bar, right y-axis)

**Level 3:**
* **SFT Accuracy:** 0.943 (light green bar, left y-axis)
* **SFT+RL Accuracy:** 0.962 (dark green bar, left y-axis)
* **SFT Trials:** 1.771 (light red bar, right y-axis)
* **SFT+RL Trials:** 1.790 (dark red bar, right y-axis)

**Level 4:**
* **SFT Accuracy:** 0.773 (light green bar, left y-axis)
* **SFT+RL Accuracy:** 0.836 (dark green bar, left y-axis)
* **SFT Trials:** 1.828 (light red bar, right y-axis)
* **SFT+RL Trials:** 1.883 (dark red bar, right y-axis)

**Level 5:**
* **SFT Accuracy:** 0.619 (light green bar, left y-axis)
* **SFT+RL Accuracy:** 0.649 (dark green bar, left y-axis)
* **SFT Trials:** 2.254 (light red bar, right y-axis)
* **SFT+RL Trials:** 2.149 (dark red bar, right y-axis)

### Key Observations

* **Accuracy Trend:** Accuracy for both SFT and SFT+RL stays above 0.93 through Level 3, then drops notably at Level 4 and declines further at Level 5.
* **SFT+RL vs. SFT Accuracy:** In Level 1 and Level 2, SFT and SFT+RL accuracies are identical (0.930 and 0.944 respectively). From Level 3 onwards, SFT+RL shows higher accuracy than SFT. The largest gap is at Level 4 (0.836 vs 0.773), followed by Level 5 (0.649 vs 0.619).
* **Trial Numbers Trend:** Trial numbers for both SFT and SFT+RL increase with increasing difficulty level, peaking at Level 5.
* **SFT+RL vs. SFT Trials:** SFT+RL trials are lower than SFT trials for Level 1, Level 2, and Level 5.
However, at Level 3 and Level 4, SFT+RL trials are slightly higher than SFT trials.
* **Correlation between Accuracy and Trials:** There appears to be an inverse relationship between accuracy and trial numbers. As difficulty increases, accuracy tends to decrease, while trial numbers tend to increase.

### Interpretation

This chart demonstrates the performance of a base model (Qwen2.5-Math-7B) under different training conditions (SFT vs. SFT+RL) across varying difficulty levels. The data suggests that for lower difficulty levels (1 and 2), the SFT and SFT+RL training methods yield identical accuracy, and SFT+RL requires slightly fewer trials. As the difficulty increases (Level 3 onwards), the SFT+RL method begins to outperform SFT in terms of accuracy, albeit with a slight increase in trial numbers at Level 3 and 4, and a decrease at Level 5 compared to SFT.

The significant drop in accuracy from Level 3 to Level 5 for both methods, coupled with the rise in trial numbers, indicates that the model struggles with higher difficulty problems, requiring more computational effort to achieve lower performance. The SFT+RL method appears to be more robust at higher difficulties, maintaining better accuracy despite the increased challenge. The exact reason for the identical accuracy at Levels 1 and 2, and the fluctuating trial number comparison between SFT and SFT+RL at higher levels, warrants further investigation into the specific training dynamics.

</details>

(b)

Figure 4: The accuracy and average trial number of different models across difficulty levels, evaluated on the MATH500 test set.

#### 3.4.3 Improvement across Difficulty Levels

To further illustrate the effect of S²R training, Figure 4 shows the answer accuracy and the average number of trials (i.e., the average value of $K$ across all $y=(s_{1},v_{1},\cdots,s_{K},v_{K})$ under each difficulty level) for the SFT and SFT+RL models.
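The per-level statistics plotted in Figure 4 can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the record fields (`level`, `trials`, `correct`) and the toy values are hypothetical, with `trials` standing for $K$, the number of $(s_{k},v_{k})$ pairs in a response $y$.

```python
from collections import defaultdict


def stats_by_level(records):
    """Return {level: (accuracy, average K)} for a list of evaluated responses.

    Each record is a dict with hypothetical keys:
      level   -- difficulty level of the problem (1-5 on MATH500)
      trials  -- K, the number of (s_k, v_k) pairs in the response y
      correct -- whether the final answer matches the reference
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["level"]].append(record)

    stats = {}
    for level, group in sorted(grouped.items()):
        accuracy = sum(r["correct"] for r in group) / len(group)
        avg_trials = sum(r["trials"] for r in group) / len(group)
        stats[level] = (accuracy, avg_trials)
    return stats


# Toy data, made up purely for illustration.
records = [
    {"level": 1, "trials": 1, "correct": True},
    {"level": 1, "trials": 2, "correct": True},
    {"level": 5, "trials": 3, "correct": False},
    {"level": 5, "trials": 2, "correct": True},
]

for level, (acc, k) in stats_by_level(records).items():
    print(f"Level {level}: accuracy={acc:.3f}, average trials={k:.3f}")
```

Grouping by difficulty before averaging is what makes the allocation of test-time effort visible: a single global average of $K$ would hide whether the extra trials are spent on the harder problems.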
We observe that: (1) By learning to self-verify and self-correct during reasoning, the models learn to dynamically allocate test-time effort. For easier problems, the models can reach a confident answer with fewer trials, while for more difficult problems, they require more trials to achieve a confident answer. (2) RL further improves test-time effort allocation, particularly for the less capable model (e.g., Llama3.1-8B-Instruct). (3) After RL training, the answer accuracy for more difficult problems is notably improved, demonstrating the effectiveness of the self-verifying and self-correcting paradigm in enhancing the models’ reasoning abilities.

| Model | MATH 500 | AIME 2024 | AMC 2023 | College Math | Olympiad Bench | GSM8K | GaokaoEn 2023 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **General Model: Qwen2-7B-Instruct** | | | | | | | | |
| Qwen2-7B-Instruct | 51.2 | 3.3 | 30.0 | 18.2 | 19.1 | 86.4 | 39.0 | 35.3 |
| Qwen2-7B-S²R-BI (ours) | 61.2 | 3.3 | 27.5 | 41.1 | 27.1 | 87.4 | 49.1 | 42.4 |
| Qwen2-7B-S²R-PRL (ours) | 65.4 | 6.7 | 35.0 | 36.7 | 27.0 | 89.0 | 49.9 | 44.2 |
| Qwen2-7B-S²R-ORL (ours) | 64.8 | 3.3 | 42.5 | 34.7 | 26.2 | 86.4 | 50.9 | 44.1 |
| Qwen2-7B-Instruct-S²R-PRL-offline (ours) | 61.6 | 10.0 | 32.5 | 40.2 | 26.5 | 87.6 | 50.4 | 44.1 |
| Qwen2-7B-Instruct-S²R-ORL-offline (ours) | 61.0 | 6.7 | 37.5 | 40.5 | 27.3 | 87.4 | 49.6 | 44.3 |
| **Math-Specialized Model: Qwen2.5-Math-7B** | | | | | | | | |
| Qwen2.5-Math-7B | 51.0 | 16.7 | 45.0 | 21.5 | 16.7 | 58.3 | 39.7 | 35.6 |
| Qwen2.5-Math-7B-S²R-BI (ours) | 81.6 | 23.3 | 60.0 | 43.9 | 44.4 | 91.9 | 70.1 | 59.3 |
| Qwen2.5-Math-7B-S²R-PRL (ours) | 83.4 | 26.7 | 70.0 | 43.8 | 46.4 | 93.2 | 70.4 | 62.0 |
| Qwen2.5-Math-7B-S²R-ORL (ours) | 84.4 | 23.3 | 77.5 | 43.8 | 44.9 | 92.9 | 70.1 | 62.4 |
| Qwen2.5-Math-7B-S²R-PRL-offline (ours) | 83.4 | 23.3 | 62.5 | 50.0 | 46.7 | 92.9 | 72.2 | 61.6 |
| Qwen2.5-Math-7B-S²R-ORL-offline (ours) | 82.0 | 20.0 | 67.5 | 49.8 | 45.8 | 92.6 | 70.4 | 61.2 |

Table 5: Comparison of S²R using online and offline RL training.

### 3.5 Exploring Offline RL

As described in § 2.4, we explore offline RL as a more efficient alternative to online RL training, given that the effectiveness of offline RL has been demonstrated in recent studies Baheti et al. (2023); Cheng et al. (2025); Wang et al. (2024b). Table 5 presents the results of offline RL with process-level and outcome-level supervision, compared to online RL. We observe that: (1) Unlike in online RL, process-level supervision outperforms outcome-level supervision in offline RL training. This interesting phenomenon may be due to the following: a) Outcome-level RL, which excels at allowing models to freely explore dynamic trajectories, is more suitable for on-the-fly sampling during online parameter updating. b) In contrast, process-level RL, which requires accurate baseline estimation for intermediate steps, benefits from offline trajectory sampling, which can provide more accurate baseline estimates through larger-scale data sampling. (2) Offline RL consistently improves performance over the behavior-initialized models across most benchmarks and achieves results comparable to online RL. These results highlight the potential of offline RL as a more efficient alternative for enhancing LLMs’ deep reasoning.

## 4 Related Work

### 4.1 Scaling Test-time Compute

Scaling test-time compute has recently garnered wide attention in LLM reasoning Snell et al. (2024b); Wu et al. (2024); Brown et al. (2024). Existing studies have explored various methods for scaling up test-time compute, including: (1) Aggregation-based methods that sample multiple responses for each question and obtain the final answer with self-consistency Wang et al. (2023) or by selecting the best-of-N answer using a verifier or reward model Wang et al. (2024c); Zhang et al. (2024b); Lightman et al. (2023b); Havrilla et al.
(2024b); (2) Search-based methods that apply search algorithms such as Monte Carlo Tree Search Tian et al. (2024); Wang et al. (2024a); Zhang et al. (2024a); Qi et al. (2024), beam search Snell et al. (2024b), or other effective algorithms Feng et al. (2023); Yao et al. (2023) to search for correct trajectories; (3) Iterative-refinement-based methods that iteratively improve test performance through self-refinement Madaan et al. (2024a); Shinn et al. (2024); Chen et al. (2024a, 2025). Recently, there has been a growing focus on training LLMs to perform test-time search on their own, typically by conducting longer and deeper thinking OpenAI (2024); Guo et al. (2025). These test-time scaling efforts not only directly benefit LLM reasoning, but can also be integrated back into training, enabling iterative improvement of LLM reasoning Qin et al. (2024); Feng et al. (2023); Snell et al. (2024b); Luong et al. (2024). In this work, we also present an efficient framework for training LLMs to perform effective test-time scaling through self-verification and self-correction iterations. This is achieved without extensive training effort, and the performance of S²R can also be consistently improved via iterative training.

### 4.2 Self-verification and Self-correction

Enabling LLMs to perform effective self-verification and self-correction is a promising route to robust LLM reasoning Madaan et al. (2024b); Shinn et al. (2023); Paul et al. (2023); Lightman et al. (2023a), and these abilities are also critical for performing deep reasoning. Previous studies have shown that direct prompting of LLMs for self-verification or self-correction is suboptimal in most scenarios Huang et al. (2023); Tyen et al. (2023); Ma et al. (2024); Zhang et al. (2024c). As a result, recent studies have explored various approaches to enhance these capabilities during post-training Saunders et al. (2022); Rosset et al. (2024); Kumar et al. (2024).
These methods highlight the potential of using human-annotated or LLM-generated data to equip LLMs with self-verification or self-correction capabilities Zhang et al. (2024d); Jiang et al. (2024), while also indicating that behavior imitation via supervised fine-tuning alone is insufficient for achieving valid self-verification or self-correction Kumar et al. (2024); Qu et al. (2025); Kamoi et al. (2024). In this work, we propose effective methods to enhance LLMs’ self-verification and self-correction abilities through principled imitation data construction and RL training, and demonstrate the effectiveness of our approach with in-depth analysis.

### 4.3 RL for LLM Reasoning

Reinforcement learning has proven effective in enhancing LLM performance across various tasks Ziegler et al. (2019); Stiennon et al. (2020); Bai et al. (2022); Ouyang et al. (2022); Setlur et al. (2025). In LLM reasoning, previous studies typically employ RL in an actor-critic framework Lightman et al. (2024); Tajwar et al. (2024); Havrilla et al. (2024a), and research on developing accurate reward models for RL training has been a long-standing focus, particularly in reward modeling for process-level RL Lightman et al. (2024); Setlur et al. (2024, 2025); Luo et al. (2024). Recently, several studies have demonstrated that simplified reward modeling and advantage estimation Ahmadian et al. (2024); Shao et al. (2024); Team et al. (2025); Guo et al. (2025) in RL training can also effectively enhance LLM reasoning. Recent advances in improving LLMs’ deep thinking Guo et al. (2025); Team et al. (2025) further highlight the effectiveness of utilizing unhackable rewards Gao et al. (2023); Everitt et al. (2021) to consistently enhance LLM reasoning. In this work, we also show that a simplified advantage estimation and RL framework enables effective improvements in LLM reasoning.
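To make "simplified advantage estimation" concrete, the sketch below implements a leave-one-out baseline in the style of Kool et al. (2019) and Ahmadian et al. (2024): each sampled response to a prompt is scored against the mean reward of the other samples, so no learned critic is needed. It illustrates that family of estimators under assumed binary outcome rewards and is not necessarily the exact objective used in this work; the function name and reward values are hypothetical.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each sample = its reward minus the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Four sampled responses to one prompt; reward 1.0 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = leave_one_out_advantages(rewards)
print(advantages)
```

With this baseline, correct responses receive a positive advantage and incorrect ones a negative advantage whenever the samples for a prompt disagree; if all samples receive the same reward, every advantage is zero and the prompt contributes no gradient signal.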
Additionally, we analyze process-level RL, outcome-level RL, and offline RL, providing insights for future work on RL for LLM reasoning.

## 5 Conclusion

In this work, we propose S²R, an efficient framework for enhancing LLM reasoning by teaching LLMs to iteratively self-verify and self-correct during reasoning. We introduce a principled approach for behavior initialization, and explore both outcome-level and process-level RL to further strengthen the models’ thinking abilities. Experimental results across three different base models on seven math reasoning benchmarks demonstrate that S²R significantly enhances LLM reasoning with minimal resource requirements. Since self-verification and self-correction are two crucial abilities for LLMs’ deep reasoning, S²R offers an interpretable framework for understanding how SFT and RL enhance LLMs’ deep reasoning. It also offers insights into the selection of RL strategies for enhancing LLMs’ long-CoT reasoning.

## References

- qwe (2024) 2024. Qwen2 technical report.
- Ahmadian et al. (2024) Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740.
- AI-MO (2024a) AI-MO. 2024a. Aime 2024.
- AI-MO (2024b) AI-MO. 2024b. Amc 2023.
- Baheti et al. (2023) Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, and Mark Riedl. 2023. Leftover lunch: Advantage-based offline reinforcement learning for language models. arXiv preprint arXiv:2305.14718.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Berglund et al.
(2023) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2023. The reversal curse: Llms trained on "a is b" fail to learn "b is a". arXiv preprint arXiv:2309.12288.
- Brown et al. (2024) Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.
- Chen et al. (2025) Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, and Sercan Ö Arık. 2025. Sets: Leveraging self-verification and self-correction for improved test-time scaling. arXiv preprint arXiv:2501.19306.
- Chen et al. (2024a) Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, and Mohit Bansal. 2024a. Magicore: Multi-agent, iterative, coarse-to-fine refinement for reasoning. arXiv preprint arXiv:2409.12147.
- Chen et al. (2024b) Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, et al. 2024b. Reverse thinking makes llms stronger reasoners. arXiv preprint arXiv:2411.19865.
- Chen et al. (2024c) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. 2024c. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187.
- Cheng et al. (2025) Pengyu Cheng, Tianhao Hu, Han Xu, Zhisong Zhang, Yong Dai, Lei Han, Xiaolong Li, et al. 2025. Self-playing adversarial language game enhances llm reasoning. Advances in Neural Information Processing Systems, 37:126515–126543.
- Cobbe et al. (2021a) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021a. Training verifiers to solve math word problems.
arXiv preprint arXiv:2110.14168.
- Cobbe et al. (2021b) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021b. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Cui et al. (2025) Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. 2025. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Everitt et al. (2021) Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. 2021. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. Synthese, 198(Suppl 27):6435–6467.
- Feng et al. (2023) Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.
- Gao et al. (2023) Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR.
- Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL).
- Gu et al. (2024) Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. 2024. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065.
- Guan et al.
(2025) Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. 2025. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Han et al. (2022) Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, et al. 2022. Folio: Natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840.
- Havrilla et al. (2024a) Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. 2024a. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642.
- Havrilla et al. (2024b) Alex Havrilla, Sharath Raparthy, Christoforus Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, and Roberta Raileanu. 2024b. Glore: When, where, and how to improve llm reasoning via global and local refinements. arXiv preprint arXiv:2402.10963.
- He et al. (2024) Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. 2024. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.
- Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021a. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Hendrycks et al.
(2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
- Huang et al. (2023) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2023. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
- Jiang et al. (2024) Huchen Jiang, Yangyang Ma, Chaofan Ding, Kexin Luan, and Xinhan Di. 2024. Towards intrinsic self-correction enhancement in monte carlo tree search boosted reasoning via iterative preference learning. arXiv preprint arXiv:2412.17397.
- Kamoi et al. (2024) Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. 2024. When can llms actually correct their own mistakes? a critical survey of self-correction of llms. Transactions of the Association for Computational Linguistics, 12:1417–1440.
- Kool et al. (2019) Wouter Kool, Herke van Hoof, and Max Welling. 2019. Buy 4 reinforce samples, get a baseline for free!
- Kumar et al. (2024) Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. 2024. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917.
- Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- LI et al. (2024) Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. 2024. Numinamath.
[https://github.com/project-numina/aimo-progress-prize](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf).
- Liao et al. (2024) Minpeng Liao, Wei Luo, Chengxi Li, Jing Wu, and Kai Fan. 2024. Mario: Math reasoning with code interpreter output–a reproducible pipeline. arXiv preprint arXiv:2401.08190.
- Lightman et al. (2023a) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023a. Let’s verify step by step. arXiv preprint arXiv:2305.20050.
- Lightman et al. (2023b) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023b. Let’s verify step by step. arXiv preprint arXiv:2305.20050.
- Lightman et al. (2024) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
- Luo et al. (2024) Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, et al. 2024. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592.
- Luong et al. (2024) Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. 2024. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967.
- Ma et al. (2024) Ruotian Ma, Xiaolei Wang, Xin Zhou, Jian Li, Nan Du, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Are large language models good prompt optimizers? arXiv preprint arXiv:2402.02101.
- Madaan et al. (2024a) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024a. Self-refine: Iterative refinement with self-feedback.
Advances in Neural Information Processing Systems, 36.
- Madaan et al. (2024b) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024b. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.
- OpenAI (2024) OpenAI. 2024. Openai o1 system card. preprint.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
- Paul et al. (2023) Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2023. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.
- Qi et al. (2024) Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. 2024. Mutual reasoning makes smaller llms stronger problem-solvers. arXiv preprint arXiv:2408.06195.
- Qin et al. (2024) Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. 2024. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982.
- Qu et al. (2025) Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. 2025. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37:55249–55285.
- Qwen (2024) Qwen. 2024. Qwen2.5-math-7b.
- Rosset et al. (2024) Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. 2024. Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715.
- Saunders et al.
(2022) William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.
- Setlur et al. (2025) Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. 2025. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. Advances in Neural Information Processing Systems, 37:43000–43031.
- Setlur et al. (2024) Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. 2024. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- Shen et al. (2025) Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, and Chuang Gan. 2025. Satori: Reinforcement learning with chain-of-action-thought enhances llm reasoning via autoregressive search. arXiv preprint arXiv:2502.02508.
- Shinn et al. (2024) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
- Shinn et al. (2023) Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366.
- Snell et al. (2024a) Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024a. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
- Snell et al.
(2024b) Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024b. Scaling llm test-time compute optimally can be more effective than scaling model parameters. Preprint, arXiv:2408.03314.
- Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
- Tajwar et al. (2024) Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. 2024. Preference fine-tuning of llms should leverage suboptimal, on-policy data. arXiv preprint arXiv:2404.14367.
- Tang et al. (2024a) Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. 2024a. Mathscale: Scaling instruction tuning for mathematical reasoning. arXiv preprint arXiv:2403.02884.
- Tang et al. (2024b) Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. 2024b. Mathscale: Scaling instruction tuning for mathematical reasoning. arXiv preprint arXiv:2403.02884.
- Team et al. (2025) Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. 2025. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599.
- Team (2024a) Qwen Team. 2024a. Qwq: Reflect deeply on the boundaries of the unknown.
- Team (2024b) The Mistral AI Team. 2024b. Mathstral-7b-v0.1.
- Tian et al. (2024) Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. 2024. Toward self-improvement of llms via imagination, searching, and criticizing. arXiv preprint arXiv:2404.12253.
- Toshniwal et al. (2024) Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. 2024. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. arXiv preprint arXiv:2410.01560.
- Tyen et al.
(2023) Gladys Tyen, Hassan Mansoor, Peter Chen, Tony Mak, and Victor Cărbune. 2023. Llms cannot find reasoning errors, but can correct them! arXiv preprint arXiv:2311.08516. - Wang et al. (2024a) Chaojie Wang, Yanchen Deng, Zhiyi Lv, Shuicheng Yan, and An Bo. 2024a. Q*: Improving multi-step reasoning for llms with deliberative planning. Preprint, arXiv:2406.14283. - Wang et al. (2024b) Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, and Yi Wu. 2024b. Offline reinforcement learning for llm multi-step reasoning. arXiv preprint arXiv:2412.16145. - Wang et al. (2024c) Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. 2024c. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. Preprint, arXiv:2312.08935. - Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations. - Wang et al. (2024d) Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024d. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574. - Wu et al. (2024) Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. 2024. An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724. - Yang et al. (2024) An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. 2024. Qwen2. 5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122. - Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 
2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822. - Yuan et al. (2024) Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. 2024. Free process rewards without process labels. arXiv preprint arXiv:2412.01981. - Zeng et al. (2025) Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 2025. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason. Notion Blog. - Zhang et al. (2024a) Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. 2024a. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816. - Zhang et al. (2024b) Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. 2024b. Generative verifiers: Reward modeling as next-token prediction. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24. - Zhang et al. (2024c) Qingjie Zhang, Han Qiu, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, and Minlie Huang. 2024c. Understanding the dark side of llms’ intrinsic self-correction. arXiv preprint arXiv:2412.14959. - Zhang et al. (2024d) Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, and Lu Wang. 2024d. Small language models need strong verifiers to self-correct reasoning. arXiv preprint arXiv:2404.17140. - Zhao et al. (2024) Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. 2024. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405. - Zheng et al. (2024) Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2024. 
Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559.
- Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

## Appendix A Implementation Details

### A.1 Verification Processing and SFT Data Construction

Given the responses sampled from the original LLM policy, we prompt frontier LLMs for initial verifications. To obtain more valid verifications, we instruct the LLMs to “verify without re-solving the problem” and filter out invalid verifications during data processing. We found that, despite this instruction, most existing LLMs are still biased toward solving the problem again, as shown in Table 6. We therefore collected the verification data by querying gpt-4-preview-1106 (https://openai.com/api/), which follows the instruction to "verify without re-solving the problem" reliably and can perform plausible verification by adopting reverse thinking, inductive reasoning, and other methods. We then refine the remaining verifications using gpt-4o to improve fluency and clarity. During this refinement, we instruct gpt-4o to append a conclusion at the end of each verification based on its stance, for example: “Therefore, the answer is correct/incorrect/cannot verify.” Finally, we discard any verifications where the judgment does not align with the actual correctness of the answer. The prompts used throughout this process are provided in Appendix § A.3.

With the refined and filtered verifications, we construct the SFT data as follows. For each problem, we determine the number of answer attempts required to eventually obtain a correct answer based on the accuracy from the initial sampling. The lower the accuracy, the more rounds of responses are generated.
In our implementation, we categorize all problems into four difficulty levels and construct answer sequences with 1, 2, 3, or 4 rounds, according to descending accuracy. Then, after an incorrect answer, we append “Wait, let me recheck my solution” along with the corresponding verification. If that answer is not the final attempt, we further append “Let me try again.” We ensure that the last answer in the sequence is correct. Additionally, we ensure that the answers in each round for a given problem are distinct. Figure 5 is an example of SFT data constructed with 4 rounds of responses. ### A.2 Baseline Details #### A.2.1 Baseline Implementations In Table 2, the reported results for Frontier LLMs and Top-tier Open-source Reasoning LLMs are sourced from the original reports and Guan et al. (2025). We evaluate Llama-3.1-8B-Instruct Dubey et al. (2024), Qwen2-7B-Instruct qwe (2024), Qwen2.5-Math-7B, Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct Yang et al. (2024) using the same process described in Section § 3.1. For Eurus-7B-PRIME Cui et al. (2025), rStar-Math-7B Guan et al. (2025), and Qwen2.5-7B-SimpleRL Zeng et al. (2025), we report results directly from the original papers. In Table 3, the results for Llama-3.1-70B-Instruct and QwQ-32B-Preview are taken from Shen et al. (2025). For the remaining baselines, we follow the official evaluation protocol of the dataset project https://github.com/Yale-LILY/FOLIO https://github.com/facebookresearch/cruxeval https://github.com/eladsegal/strategyqa https://github.com/TIGER-AI-Lab/MMLU-Pro . #### A.2.2 Baseline License In this work, we utilize the Llama-3.1-8B-Instruct model, whose license can be reviewed at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/LICENSE. In addition, the models Qwen2-7B-Instruct, Qwen2.5-Math-7B, Eurus-2-7B-PRIME, and project vLLM are distributed under the Apache License 2.0. 
We gratefully acknowledge the contributions of the open-source community and strictly adhere to the terms of the respective licenses.

#### A.2.3 Baseline SFT Data Construction

**Original Solution SFT Data.** In this setting, we use the solutions from the original dataset as SFT data. To ensure a fair comparison, we maintain the same training data volume as our behavior initialization approaches.

**Long CoT SFT Data.** We also introduce a baseline by fine-tuning on Long CoT responses generated by QwQ-32B-Preview Team (2024a). Specifically, we instruct QwQ to generate responses to given problems and filter out those with incorrect answers. The remaining high-quality responses are then used for supervised fine-tuning. Importantly, we ensure that the total training data volume remains consistent with that used in our behavior initialization approach. The prompt we use for QwQ is provided in Appendix § A.3.

### A.3 Prompts

The prompts we use in all experiments are as follows:

**Sampling Responses During Training/Inference**

Please reason step by step, and put your final answer within \boxed{}.
Problem: {problem}

**Verification Collection**

You are a math teacher. I will give you a math problem and an answer.
Verify the answer’s correctness without step-by-step solving. Use alternative verification methods.
Question: {problem}
Answer: {answer}
Verification:

**Verification Refinement**

Refine this verification text to read as a natural self-check within a solution. Maintain logical flow and professionalism.
Key Requirements:
1. Avoid phrases like "without solving step-by-step" or "as a math teacher".
2. Treat the answer as your own prior solution.
3. Conclude with EXACTLY one of:
Therefore, the answer is correct.
Therefore, the answer is incorrect.
Therefore, the answer cannot be verified.
Original text: {verification} ## Appendix B Detailed Experiment Settings | Without Asking for Confirmative Verification | | | --- | --- | | Model | Confirmative out of 100 | | GPT-4o | 26 | | GPT-4-Preview-1106 | 32 | | QwQ-32B-preview | 37 | | Llama-3.1-70B-Instruct | 28 | | Asking for Confirmative Verification | | | Model | Confirmative out of 100 | | GPT-4o | 44 | | GPT-4-Preview-1106 | 61 | | QwQ-32B-preview | 58 | | Llama-3.1-70B-Instruct | 50 | Table 6: ### B.1 Datasets Details of each test dataset we used as benchmark are as follows: #### B.1.1 In-domain Datasets MATH500 Lightman et al. (2023b) offers a streamlined slice of the broader MATH Hendrycks et al. (2021b) dataset, comprising 500 test problems selected through uniform sampling. Despite its smaller scope, it maintains a distribution of topics and difficulty levels that mirrors the larger MATH corpus. GSM8K Cobbe et al. (2021a) features around 8,500 grade-school math word problems. The dataset focuses on simple arithmetic through early algebra and includes 1,319 distinct tasks in its test set. OlympiadBench He et al. (2024) collects 8,476 advanced math and physics questions drawn from Olympiad contexts, with some originating from the Chinese college entrance exam. We use the subset of 674 text-only competition questions, providing open-ended math challenges. AMC2023 AI-MO (2024b) and AIME AI-MO (2024a) each supply a set of challenging exam-style problems: 40 questions from AMC 2023 and 30 from AIME 2024, all in text-only format. CollegeMath Tang et al. (2024b) is a dataset targeting advanced college-level mathematics, drawn from nine textbooks spanning seven major fields—algebra, pre-calculus, calculus, vector calculus, probability, linear algebra, and differential equations. The final collection comprises 1,281 training examples and 2,818 test examples. Gaokao2023en Liao et al. 
(2024) is a dataset consisting of 385 mathematics problems sourced from the 2023 Chinese higher education entrance examination, which have been professionally translated into English. #### B.1.2 Cross-domain Datasets FOLIO Han et al. (2022) is meticulously annotated to assess intricate logical reasoning in natural language. It pairs 1,430 conclusions with 487 sets of premises—each verified using first-order logic (FOL)—and contains 203 unique problems in its test portion. CRUXEval Gu et al. (2024) tests code comprehension and reasoning through 800 concise Python functions (spanning 3–13 lines). Each function is accompanied by one or more input-output examples. The goal is to predict the correct outputs given the function body and a specific input. The test partition encompasses all 800 problems. StrategyQA Geva et al. (2021) targets multi-hop reasoning questions where the necessary intermediate steps are not explicit. Each of its 2,780 items includes a strategic query, a breakdown of the reasoning steps, and supporting evidence drawn from Wikipedia. MMLUProSTEM is extracted from MMLU-Pro Wang et al. (2024d). Following Satori Shen et al. (2025), we conduct evaluations on six STEM subsets—physics, chemistry, computer science, engineering, biology, and economics. 
### B.2 Hyperparameters Setting

| Model | Learning Rate | Batch Size | KL Coefficient | Max Length | Training Epochs |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | 5e-6 | 32 | 0.1 | 8000 | 3 |
| Qwen2-7B-Instruct | 5e-6 | 32 | 0.1 | 6000 | 3 |
| Qwen2.5-Math-7B | 5e-6 | 32 | 0.01 | 8000 | 3 |

Table 7: Model Training Hyperparameter Settings (SFT)

| Model | Learning Rate | Training Batch Size | Forward Batch Size | KL Coefficient | Max Length | Sampling Temperature | Clip Range | Training Steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1 | 5e-7 | 64 | 256 | 0.05 | 8000 | 0.7 | 0.2 | 500 |
| Qwen2-7B-Instruct | 5e-7 | 64 | 256 | 0.05 | 6000 | 0.7 | 0.2 | 500 |
| Qwen2.5-Math-7B | 5e-7 | 64 | 256 | 0.01 | 8000 | 0.7 | 0.2 | 500 |

Table 8: Model Training Hyperparameter Settings (RL)

During behavior initialization with SFT, we use a batch size of 32 and a learning rate of 5e-6. We set the maximum sequence length to 8000 (6000 for Qwen2-7B-Instruct, see Table 7) to accommodate long responses and verifications. To balance stability and convergence during training, we add a KL penalty to the training loss, with the KL coefficient set to 0.1 (0.01 for Qwen2.5-Math-7B, see Table 7).

During reinforcement learning, we use a training batch size of 64 and sample $n$ responses for each question in a batch, resulting in a forward batch size of $64n$. For each forward batch, we update the model for $n$ steps with the training batch size of 64. Specifically, for both process-level and outcome-level RL, we adopt $n=4$ (i.e., for RLOO, the sample number is also $4$). Further RL hyperparameters are presented in Table 8. We use BF16 model precision in all experiments. The main hyperparameters used in the experiments are listed in Tables 7 and 8.

### B.3 Experiment Environment

All experiments are implemented using the PyTorch framework on 32 NVIDIA H20 (96GB) GPUs or 32 NVIDIA A100Pro (40GB) GPUs.
Our training code is built upon Hugging Face TRL (https://github.com/huggingface/trl). For inference, we use a single NVIDIA A100 (40GB) GPU with vLLM-0.5.4 (https://github.com/vllm-project/vllm). We utilize transformers version 4.39.3 for fine-tuning Qwen2-7B-Instruct and Qwen2.5-Math-7B, version 4.44.0 for fine-tuning Llama-3.1-8B, and version 4.46.3 for reinforcement learning. We use PyTorch 2.1.1 across our training pipeline. Our evaluation code is built upon Qwen Math’s evaluation codebase (https://github.com/QwenLM/Qwen2.5-Math).

## Appendix C Metrics Definition

We include the formal definitions of the metrics we use to analyze the self-verification and self-correction behaviors of the post-trained models.

### C.1 Notations

We first present the main notations used in our formulation in Table 9.

| Variable | Description |
| --- | --- |
| $\pi$ | The policy |
| $x$ | Problem instance |
| $y$ | Series of predefined actions: $y=\{a_{1},a_{2},\ldots,a_{n}\}$ |
| $a_{i}$ | The $i$-th action in the response $y$, with $Type(a_{i})\in\{\texttt{verify},\texttt{solve},\texttt{<end>}\}$ |
| $s_{j}$ | $j^{th}$ attempt to solve the problem |
| $v_{j}$ | $j^{th}$ self-verification for the $j^{th}$ attempt |
| $Parser(\cdot)$ | The text parser to get the self-verification result, $Parser(v_{j})\in\{\texttt{correct},\texttt{incorrect}\}$, indicating the correctness of action $s_{j}$ |
| $V_{golden}(\cdot)$ | $V_{golden}(a_{i})\in\{\texttt{correct},\texttt{incorrect}\}$ |
| $R(\cdot)$ | The rule-based reward function, $R(\cdot)\in\{-1,1\}$: $R(s_{j})=\begin{cases}1,&V_{golden}(s_{j})=\texttt{correct}\\ -1,&\text{otherwise}\end{cases}$ and $R(v_{j})=\begin{cases}1,&Parser(v_{j})=V_{golden}(s_{j})\\ -1,&\text{otherwise}\end{cases}$ |
| $\texttt{<end>}$ | End of action series |
| $\mathbb{I}(\cdot)$ | The indicator function, $\mathbb{I}(\cdot)\in\{0,1\}$. $\mathbb{I}(\cdot)=1$ if the condition inside holds true, and $\mathbb{I}(\cdot)=0$ otherwise. |

Table 9: Variable Lookup Table

### C.2 Self-Verification Metrics

#### C.2.1 Verification Accuracy (VA)

Verification Accuracy measures how often the verification prediction matches the ground-truth correctness ($N$ is the total number of verifications in the responses to the test set):

$$ \text{VA}=\frac{1}{N}\sum_{y}\sum_{j=1}^{|y|_{a}/2}\mathbb{I}\bigl(Parser(v_{j})=V_{golden}(s_{j})\bigr). $$

#### C.2.2 Error Recall (ER)

Error Recall measures the recall of detecting incorrect answers (i.e., the fraction of actually incorrect answers that are successfully identified as incorrect):

$$ \text{ER}=\frac{\sum_{y}\sum_{j=1}^{|y|_{a}/2}\mathbb{I}\bigl(Parser(v_{j})=\texttt{incorrect}\bigr)\,\mathbb{I}\bigl(V_{golden}(s_{j})=\texttt{incorrect}\bigr)}{\sum_{y}\sum_{j=1}^{|y|_{a}/2}\mathbb{I}\bigl(V_{golden}(s_{j})=\texttt{incorrect}\bigr)}, $$

where $|y|_{a}$ is the total number of actions in $y$ and $\frac{|y|_{a}}{2}$ is the total number of attempts to solve the problem ($y=\{a_{1},a_{2},\cdots,a_{|y|_{a}}\}=\{s_{1},v_{1},\cdots,s_{\frac{|y|_{a}}{2}},v_{\frac{|y|_{a}}{2}}\}$).

#### C.2.3 Correct Precision (CP)

Correct Precision measures the precision when the verification model predicts an answer to be correct (i.e., among all “correct” predictions, how many are truly correct):

$$ \text{CP}=\frac{\sum_{y}\sum_{j=1}^{|y|_{a}/2}\mathbb{I}\bigl(Parser(v_{j})=\texttt{correct}\bigr)\,\mathbb{I}\bigl(V_{golden}(s_{j})=\texttt{correct}\bigr)}{\sum_{y}\sum_{j=1}^{|y|_{a}/2}\mathbb{I}\bigl(Parser(v_{j})=\texttt{correct}\bigr)}. $$

### C.3 Self-Correction Metrics

#### C.3.1 Incorrect to Correct Rate (ICR)

The rate at which the model successfully corrects an initially incorrect answer ($R(s_{1})=-1$) into a correct final answer ($R(s_{T_{y}})=1$), where $T_{y}=|y|_{a}/2$ is the total number of attempts to solve the problem in each $y$. Formally:

$$ \text{ICR}=\frac{\sum_{y}\mathbb{I}\bigl(R(s_{1})=-1\bigr)\,\mathbb{I}\bigl(R(s_{T_{y}})=1\bigr)}{\sum_{y}\mathbb{I}\bigl(R(s_{1})=-1\bigr)}. \tag{10} $$

#### C.3.2 Correct to Incorrect Rate (CIR)

The rate at which the model incorrectly alters an initially correct answer ($R(s_{1})=1$) into an incorrect final answer ($R(s_{T_{y}})=-1$), where $T_{y}=|y|_{a}/2$ is the total number of attempts to solve the problem in each $y$. Formally:

$$ \text{CIR}=\frac{\sum_{y}\mathbb{I}\bigl(R(s_{1})=1\bigr)\,\mathbb{I}\bigl(R(s_{T_{y}})=-1\bigr)}{\sum_{y}\mathbb{I}\bigl(R(s_{1})=1\bigr)}.
\tag{11} $$

## Appendix D Offline RL Training Details

In this section, we provide additional details on the offline reinforcement learning training process, including formal definitions, ablation studies, and implementation details.

### D.1 Accuracy-Grouped Baseline Definition

To fully leverage the advantages of offline RL, which does not require real-time sampling, we explore more appropriate baseline selection by further grouping trajectories based on problem difficulty. Intuitively, for two trajectories $y^{(1)}$ and $y^{(2)}$ sampled under questions of different difficulty levels, and their corresponding actions $a^{(1)}_{t}$ and $a^{(2)}_{t}$ at the same position, their expected returns (baselines) should differ even if they share identical reward contexts: the expected return is typically lower for more challenging problems.

We measure a problem’s difficulty by estimating how often it is solved correctly under the current sampling policy. Concretely, we sample multiple trajectories in parallel for each problem. The fraction of these trajectories that yield a correct final answer serves as the problem’s accuracy. We then discretize this accuracy into separate bins, effectively grouping the problems according to their estimated difficulty. All trajectories belonging to problems within the same accuracy bin form a common subset.

Compared to using direct reward contexts alone, this accuracy-based grouping offers a more robust estimate of expected returns, because problems in the same bin share similar success rates. Moreover, unlike a pre-defined difficulty grouping, these bins adjust dynamically as the model’s capabilities evolve. Building on this approach, we propose two accuracy-based baseline estimation methods for offline RL as follows.

#### D.1.1 Accuracy-Grouped Baseline With Position Group

Within each accuracy bin, we further split actions based on their position in the trajectory.
Concretely, we consider all actions occurring at the same step index across trajectories in the same bin to be comparable, and we compute their average return to serve as the baseline. Thus, when we look up the baseline for a particular action at a given step in a trajectory, we use the average return of all actions taken at that same step index in all trajectories belonging to the same accuracy bin.

#### D.1.2 Accuracy-Grouped Baseline With Reward Context

We also propose combining accuracy-based grouping with reward-context grouping. The underlying assumption is that even if two actions share the same immediate reward context, their expected returns can differ if they originate from different difficulty bins. Generally, problems that are harder to solve exhibit lower expected returns. Consequently, we first bin the trajectories by accuracy, then further group them by common reward context. Within each sub-group, we average the returns of all relevant actions to obtain the baseline.

### D.2 Offline RL Implementation Details

In each iteration of offline RL training, we generate multiple trajectories (e.g., eight) per prompt in parallel. We then apply prompt filtering, rejection sampling, accuracy-based baseline estimation, advantage computation, and policy updates. Implementation details follow.

#### D.2.1 Prompt Filtering

| Accuracy Range | Retained Questions | MATH500 | AIME2024 | AMC2023 | College Math | Olympiad Bench | GSM8K | GaokaoEn2023 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $[0.1-0.7]$ | 1805 | 83.4 | 23.3 | 62.5 | 50.0 | 46.7 | 92.9 | 72.2 | 61.6 |
| $[0.2-0.8]$ | 2516 | 82.6 | 23.3 | 70.0 | 49.8 | 45.3 | 92.4 | 70.1 | 61.9 |
| $[0.3-0.9]$ | 4448 | 81.6 | 23.3 | 70.0 | 49.4 | 44.7 | 92.0 | 68.1 | 61.3 |
| $[0-1]$ | Full | 80.6 | 26.7 | 67.5 | 50.0 | 43.0 | 91.4 | 67.0 | 60.9 |

Table 10: Comparison of question filtering accuracy selection.
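The accuracy-grouped baseline with position grouping (Section D.1.1) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of a trajectory, and the choice of four equal-width accuracy bins are all hypothetical, and the per-action returns are assumed to be computed beforehand.

```python
from collections import defaultdict


def accuracy_bin(accuracy, num_bins=4):
    """Map an accuracy in [0, 1] to a bin index in [0, num_bins - 1].

    Equal-width bins are an assumption; the paper does not give bin edges.
    """
    return min(int(accuracy * num_bins), num_bins - 1)


def position_baselines(trajectories, num_bins=4):
    """Compute baselines keyed by (accuracy bin, step index).

    Each trajectory is a dict (hypothetical layout) with keys:
      'problem'       -- hashable problem id
      'final_correct' -- whether the final answer is correct
      'returns'       -- list of per-action returns, one per step
    Returns (baselines, bins), where bins maps problem id -> bin index.
    """
    # Per-problem accuracy: fraction of sampled trajectories that end correct.
    counts = defaultdict(lambda: [0, 0])
    for traj in trajectories:
        counts[traj["problem"]][0] += int(traj["final_correct"])
        counts[traj["problem"]][1] += 1
    bins = {p: accuracy_bin(c / n, num_bins) for p, (c, n) in counts.items()}

    # Average the returns of all actions sharing a bin and a step index.
    sums = defaultdict(lambda: [0.0, 0])
    for traj in trajectories:
        b = bins[traj["problem"]]
        for step, ret in enumerate(traj["returns"]):
            sums[(b, step)][0] += ret
            sums[(b, step)][1] += 1
    baselines = {key: total / n for key, (total, n) in sums.items()}
    return baselines, bins
```

The advantage of an action at step `t` in a trajectory of problem `p` would then subtract `baselines[(bins[p], t)]` from that action's return. Because the bins are recomputed from freshly sampled trajectories, they shift as the policy improves, which matches the dynamic grouping described in Section D.1.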
As we sample multiple trajectories for each prompt, we compute the accuracy of each prompt. We retain prompts whose accuracy falls within a predefined range. Our ablation study on Qwen2.5-Math-7B (Table 10) shows that every filtered range yields a higher average score than training on the full set. The most stable results are obtained with an accuracy range of $[0.1,0.7]$, suggesting that including moderately difficult samples enhances the model’s reasoning capabilities.

#### D.2.2 Rejection Sampling

We discard any trajectory that does not follow the alternation pattern of solution and verification: $y=(s_{1},v_{1},\dots,s_{k},v_{k})$. Additionally, we remove malformed trajectories such as $y=(s_{1},s_{2},v_{1})$. To mitigate reward hacking due to excessively long outputs, we eliminate trajectories where $R(s_{t})=1$ and $R(v_{t})=1$ at timestep $t$, but further actions are taken at $t+1$. Moreover, we discard trajectories containing more than 20 actions, as excessive action sequences can introduce instability and deviate from expected solution structures.

#### D.2.3 Loss Function

| Baseline Method | MATH500 | AIME2024 | AMC2023 | College Math | Olympiad Bench | GSM8K | GaokaoEn2023 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Based on reward context | 82.4 | 26.7 | 65.0 | 50.1 | 46.1 | 92.9 | 71.2 | 62.1 |
| Based on accuracy group with position | 83.4 | 23.3 | 62.5 | 50.0 | 46.7 | 92.9 | 72.2 | 61.6 |
| Based on accuracy group with reward context | 82.4 | 23.3 | 67.5 | 49.3 | 45.8 | 93.3 | 71.2 | 61.8 |

Table 11: The performance of different baselines

To determine the best offline baseline method, we conducted ablation studies on Qwen2.5-Math-7B (Table 11). We found that using the accuracy-grouped baseline with an additional division by position provides the most stable results. When computing advantages, we subtract both the baseline and a scaled relative policy term, as in Equation 5.
Notably, we keep $\pi_{\text{ref}}$ fixed as the reference policy rather than updating it at each iteration.

#### D.2.4 Training Hyperparameter Settings

We use a batch size of 64, a maximum learning rate of $5\times 10^{-7}$, and a KL penalty coefficient of 0.1. The maximum training sequence length is set to 8192. We apply a warm-up phase of 5 steps and a clipping range parameter of 0.2. We use BF16 model precision in all experiments.

## Appendix E Demo Cases

<details> <summary>x7.png Details</summary> ![acb954fd](/v1/image/acb954fda0c82c71183883a357a51c998b19312a28bad83512b005b0269d06d3)

### Visual Description

## Textual Analysis: Candy Distribution Problem

### Overview

This document presents a mathematical word problem involving the distribution of candy among four individuals: Lizzy, Megan, Oscar, and Patrick. The problem asks to determine the truthfulness of several statements about the number of candies Patrick has at the end, denoted by $P$. The document also includes a "Response" section that attempts to analyze the problem and its statements, including self-corrections and re-evaluations.

### Components/Axes

This document does not contain charts or diagrams. It is composed of textual blocks.

### Content Details

**Problem Statement:** "Lizzy, Megan, Oscar, and Patrick each have $x$ pieces of candy, where $x$ is a positive integer. Unfortunately, Patrick is the only one of the four who likes candy. So Lizzy gives all her candy to Megan. Then Megan gives all the candy she now has (which includes the candy Lizzy gave her) to Oscar. Then Oscar gives all the candy he now has to Patrick. Let $P$ be the number of pieces of candy Patrick has in the end. How many of the following statements are true? (Assume that we do not know exactly what $x$ is.)"

The following statements are presented:
* (a) 2 can be a divisor of $P$.
* (b) 2 must be a divisor of $P$.
* (c) 3 can be a divisor of $P$.
* (d) 3 must be a divisor of $P$.
* (e) 4 can be a divisor of $P$.
* (f) 4 must be a divisor of $P$. **Response Section:** The response section begins with an attempt to analyze the problem, indicated by phrases like "Let's analyze each statement step-by-step..." and "Wait, let me recheck my solution." It notes that the initial answer is incorrect and that the answer is re-evaluated. The response then proceeds to define the initial state: "Lizzy, Megan, Oscar, and Patrick each start with $x$ pieces of candy. The total amount of candy remains constant as it is transferred among them. Ultimately, Patrick ends up with all the candy, totaling $4x$ pieces." The response then examines each statement: * **(a) 2 can be a divisor of $P$:** This is stated as true because if $x$ is even, then $4x$ is even and divisible by 2. * **(b) 2 must be a divisor of $P$:** This is stated as true because $4x$ is always divisible by 2, regardless of $x$. * **(c) 3 can be a divisor of $P$:** This is stated as true if $x$ is a multiple of 3, then $4x$ will also be a multiple of 3. * **(d) 3 must be a divisor of $P$:** This is stated as false because if $x$ is not a multiple of 3, then $4x$ will not be a multiple of 3. * **(e) 4 can be a divisor of $P$:** This is stated as true as $4x$ is divisible by 4. * **(f) 4 must be a divisor of $P$:** This is stated as true because $4x$ is always divisible by 4, regardless of $x$. The response concludes: "Five statements are true: (a), (b), (c), (e), and (f). The only false statement is (d). Therefore, the answer is correct." ### Key Observations * The problem involves a variable $x$ representing an initial number of candies, which is a positive integer. * The total number of candies is $4x$. * The response indicates that the final number of candies for Patrick, $P$, is equal to $4x$. * The analysis of each statement is based on the properties of divisibility of $4x$. * The response explicitly identifies statement (d) as false and statements (a), (b), (c), (e), and (f) as true. 
### Interpretation The problem describes a scenario where candy is transferred, and the total amount of candy remains constant. The key insight is that Patrick ultimately receives all the candy, meaning $P = 4x$. The analysis then proceeds to evaluate the divisibility of $P$ by 2, 3, and 4 based on this relationship. * **Statements (a), (b), (e), and (f)** are generally true because $4x$ is always divisible by 2 and 4 for any positive integer $x$. * Statement (b) "2 must be a divisor of $P$" is true because $P = 4x$, and $4x$ is always an even number, hence divisible by 2. * Statement (f) "4 must be a divisor of $P$" is true because $P = 4x$, and $4x$ is by definition divisible by 4. * Statement (a) "2 can be a divisor of $P$" is a weaker claim than (b) and is also true. * Statement (e) "4 can be a divisor of $P$" is a weaker claim than (f) and is also true. * **Statement (c) "3 can be a divisor of $P$"** is true if $x$ is a multiple of 3. For example, if $x=3$, then $P=12$, which is divisible by 3. If $x=1$, $P=4$, not divisible by 3. So, it *can* be a divisor. * **Statement (d) "3 must be a divisor of $P$"** is false because $P=4x$. For $P$ to *always* be divisible by 3, $4x$ must always be divisible by 3. This only happens if $x$ is a multiple of 3. If $x$ is not a multiple of 3 (e.g., $x=1$, $P=4$), then $P$ is not divisible by 3. The response's conclusion that five statements are true and one is false aligns with this analysis. The self-correction phrases ("Wait, let me recheck my solution," "Let me try again") suggest an iterative problem-solving process, which is common in mathematical reasoning. The problem demonstrates the importance of understanding the properties of algebraic expressions and divisibility rules. </details> Figure 5: SFT data example. 
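The construction that produces sequences like the one in Figure 5 (Appendix A.1) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name and the tuple layout are hypothetical, and it assumes a verification is attached after every attempt, including the final correct one, as the example in Figure 5 suggests.

```python
RECHECK = "Wait, let me recheck my solution."
RETRY = "Let me try again."


def build_sft_response(attempts):
    """Assemble one multi-round SFT response.

    `attempts` is a list of (answer_text, verification_text, is_correct)
    tuples in the order they should appear (hypothetical layout).
    """
    # Appendix A.1: the last answer in the sequence must be correct.
    if not attempts or not attempts[-1][2]:
        raise ValueError("the final attempt must be correct")
    # Appendix A.1: answers in different rounds must be distinct.
    answers = [answer for answer, _, _ in attempts]
    if len(set(answers)) != len(answers):
        raise ValueError("answers must be distinct across rounds")

    parts = []
    for i, (answer, verification, _) in enumerate(attempts):
        # Assumption: every attempt is followed by a self-verification.
        parts += [answer, RECHECK, verification]
        # Only non-final attempts are followed by a retry marker.
        if i < len(attempts) - 1:
            parts.append(RETRY)
    return "\n".join(parts)
```

A problem placed in the hardest of the four difficulty levels would pass four attempts to this function, three incorrect followed by one correct.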
<details> <summary>x8.png Details</summary> ![8d7de97c](/v1/image/8d7de97cbde36efa2fcc0c4738c7479a36c0b3a540c185e1d1bbc8a9f408840d) ### Visual Description ## Mathematical Document: Vector Cross Product and Magnitude Minimization ### Overview This document presents a mathematical problem involving vector operations, specifically the cross product and the minimization of a vector's magnitude. It details a step-by-step solution process, including an initial attempt, a self-correction, and a final verified solution. The problem asks to find a vector $\mathbf{v}$ that satisfies a given cross product equation and has the smallest possible magnitude. ### Components/Axes This document does not contain charts or diagrams with axes or legends. It consists of textual explanations, mathematical equations, and vector representations. ### Detailed Analysis The problem statement is: "There are an infinite number of vectors $\mathbf{v}$ such that $\begin{pmatrix} 1 \\ 2 \\ -5 \end{pmatrix} \times \mathbf{v} = \begin{pmatrix} 90 \\ 30 \\ 30 \end{pmatrix}$." The objective is to "Find the vector $\mathbf{v}$ that has the smallest magnitude." **Initial Response and Attempt:** 1. **Setting up the equation:** The response begins by defining $\mathbf{v} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}$ and calculating the cross product: $\begin{pmatrix} 1 \\ 2 \\ -5 \end{pmatrix} \times \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 2z + 5y \\ -5x + z \\ 2x - y \end{pmatrix}$. This is then equated to the given vector: $\begin{pmatrix} 2z + 5y \\ -5x + z \\ 2x - y \end{pmatrix} = \begin{pmatrix} 90 \\ 30 \\ 30 \end{pmatrix}$. 2. **System of Equations:** This equality yields a system of linear equations: * $2z + 5y = 90$ * $-5x + z = 30$ * $2x - y = 30$ 3. **Solving the System (First Attempt):** * From the third equation, $y = 2x - 30$. 
* Substituting $y$ into the first equation: $2z + 5(2x - 30) = 90 \implies 2z + 10x - 150 = 90 \implies 2z + 10x = 240 \implies z + 5x = 120 \implies z = 120 - 5x$. * Substituting $z$ into the second equation: $-5x + (120 - 5x) = 30 \implies -10x + 120 = 30 \implies -10x = -90 \implies x = 9$. * Using $x=9$ to find $y$ and $z$: * $y = 2(9) - 30 = 18 - 30 = -12$. * $z = 120 - 5(9) = 120 - 45 = 75$. 4. **Result of First Attempt:** The vector found is $\mathbf{v} = \begin{pmatrix} 9 \\ -12 \\ 75 \end{pmatrix}$. 5. **General Solution Form:** The document notes that the general solution to the system of equations is of the form $\mathbf{v} = \begin{pmatrix} 9 \\ -12 \\ 75 \end{pmatrix} + t \begin{pmatrix} 1 \\ 2 \\ -5 \end{pmatrix}$ for some scalar $t$. This is derived from the fact that the cross product equation defines a plane, and any vector in that plane can be represented as a particular solution plus a multiple of a vector orthogonal to the normal of the plane (which is the vector $\begin{pmatrix} 1 \\ 2 \\ -5 \end{pmatrix}$). 6. **Magnitude Minimization (First Attempt):** The magnitude of $\mathbf{v}$ is minimized when $t=0$, which corresponds to the vector $\begin{pmatrix} 9 \\ -12 \\ 75 \end{pmatrix}$. The magnitude is calculated as $\sqrt{9^2 + (-12)^2 + 75^2} = \sqrt{81 + 144 + 5625} = \sqrt{5850}$. **Self-Correction and Re-evaluation:** The document then includes a section titled "Wait, let me recheck my solution." 1. **Orthogonality Check:** The cross product of two vectors $\mathbf{a}$ and $\mathbf{b}$ is orthogonal to both $\mathbf{a}$ and $\mathbf{b}$. The problem states that $\mathbf{a} = \begin{pmatrix} 1 \\ 2 \\ -5 \end{pmatrix}$ and the cross product is $\mathbf{c} = \begin{pmatrix} 90 \\ 30 \\ 30 \end{pmatrix}$. The proposed solution vector is $\mathbf{v} = \begin{pmatrix} 9 \\ -12 \\ 75 \end{pmatrix}$. 
* The first check is to see if $\mathbf{v}$ is orthogonal to $\mathbf{c}$ using the dot product: $\mathbf{v} \cdot \mathbf{c} = (9)(90) + (-12)(30) + (75)(30) = 810 - 360 + 2250 = 2700$. * Since $\mathbf{v} \cdot \mathbf{c} \neq 0$, the vector $\mathbf{v}$ is not orthogonal to $\mathbf{c}$. This implies that the initial answer is incorrect because the cross product $\mathbf{a} \times \mathbf{v}$ must be orthogonal to $\mathbf{v}$. **Second Attempt and Verified Solution:** The document then proceeds with a "Let me try again." section. 1. **Cross Product Calculation (Determinant Form):** The cross product $\begin{pmatrix} 1 \\ 2 \\ -5 \end{pmatrix} \times \begin{pmatrix} x \\ y \\ z \end{pmatrix}$ is calculated using the determinant of a matrix: $$ \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ 1 & 2 & -5 \\ x & y & z \end{vmatrix} = \mathbf{i}(2z - (-5y)) - \mathbf{j}(1z - (-5x)) + \mathbf{k}(1y - 2x) = (2z + 5y)\mathbf{i} - (z + 5x)\mathbf{j} + (y - 2x)\mathbf{k} $$ This results in the vector $\begin{pmatrix} 2z + 5y \\ -(z + 5x) \\ y - 2x \end{pmatrix}$. 2. **Equating and System of Equations (Second Attempt):** $\begin{pmatrix} 2z + 5y \\ -(z + 5x) \\ y - 2x \end{pmatrix} = \begin{pmatrix} 90 \\ 30 \\ 30 \end{pmatrix}$. This gives the system of equations: * $2z + 5y = 90$ * $-(z + 5x) = 30 \implies -z - 5x = 30 \implies z = -5x - 30$ * $y - 2x = 30 \implies y = 2x + 30$ 3. **Solving the System (Second Attempt):** * Substitute $y = 2x + 30$ and $z = -5x - 30$ into the first equation: $2(-5x - 30) + 5(2x + 30) = 90$ $-10x - 60 + 10x + 150 = 90$ $90 = 90$. * This identity ($90=90$) indicates that the system is consistent and has infinitely many solutions. The general solution is expressed in terms of $x$: $\mathbf{v} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x \\ 2x + 30 \\ -5x - 30 \end{pmatrix}$. 4. 
**Magnitude Minimization (Second Attempt):** To find the vector with the smallest magnitude, the expression for the magnitude squared, $||\mathbf{v}||^2$, is minimized: $||\mathbf{v}||^2 = x^2 + (2x + 30)^2 + (-5x - 30)^2$ $||\mathbf{v}||^2 = x^2 + (4x^2 + 120x + 900) + (25x^2 + 300x + 900)$ $||\mathbf{v}||^2 = 30x^2 + 420x + 1800$. 5. **Minimizing the Quadratic:** This is a quadratic in $x$. To find the minimum, we can complete the square or find the vertex. $30x^2 + 420x + 1800 = 30(x^2 + 14x) + 1800$ $= 30(x^2 + 14x + 49 - 49) + 1800$ $= 30((x + 7)^2 - 49) + 1800$ $= 30(x + 7)^2 - 1470 + 1800$ $= 30(x + 7)^2 + 330$. The minimum value of the expression is $330$, which occurs when $x + 7 = 0$, i.e., $x = -7$. 6. **Finding the Vector with Minimum Magnitude:** Substitute $x = -7$ back into the general solution for $\mathbf{v}$: * $x = -7$ * $y = 2(-7) + 30 = -14 + 30 = 16$ * $z = -5(-7) - 30 = 35 - 30 = 5$ Thus, the vector $\mathbf{v}$ that has the smallest magnitude is $\begin{pmatrix} -7 \\ 16 \\ 5 \end{pmatrix}$. 7. **Verification of the Final Solution:** * The document checks if this vector $\mathbf{v} = \begin{pmatrix} -7 \\ 16 \\ 5 \end{pmatrix}$ satisfies the cross product equation. * It also checks if $\mathbf{v}$ is orthogonal to $\mathbf{c} = \begin{pmatrix} 90 \\ 30 \\ 30 \end{pmatrix}$ using the dot product: $\mathbf{v} \cdot \mathbf{c} = (-7)(90) + (16)(30) + (5)(30) = -630 + 480 + 150 = 0$. * Since $\mathbf{v} \cdot \mathbf{c} = 0$, the vector $\mathbf{v}$ is orthogonal to $\mathbf{c}$. This is a necessary condition for $\mathbf{v}$ to be a solution to $\mathbf{a} \times \mathbf{v} = \mathbf{c}$, as $\mathbf{c}$ must be orthogonal to $\mathbf{v}$. The document states, "Since both dot products are zero, the given answer $\mathbf{v}$ is orthogonal to $\mathbf{c}$, which means it could be a correct vector $\mathbf{v}$ that satisfies the cross product equation." 
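The arithmetic in the two attempts above can be checked mechanically. The following is a minimal sketch, not part of the original figure: the helper names `cross`, `dot`, and `v_of` are hypothetical, and the closed form $\mathbf{v} = (\mathbf{c} \times \mathbf{a}) / \|\mathbf{a}\|^2$ is a standard identity (valid when $\mathbf{a} \cdot \mathbf{c} = 0$) added here only as a cross-check; it does not appear in the model's response.

```python
from fractions import Fraction


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = (1, 2, -5)
c = (90, 30, 30)

# First attempt from the figure: it does not reproduce c.
v_first = (9, -12, 75)
first_attempt_ok = cross(a, v_first) == c


def v_of(x):
    """General solution from the second attempt, parameterized by x."""
    return (x, 2 * x + 30, -5 * x - 30)


# Every sampled member of the family satisfies a x v = c.
family_ok = all(cross(a, v_of(x)) == c for x in range(-10, 11))

# Vertex of ||v||^2 = 30x^2 + 420x + 1800 lies at x = -420 / (2 * 30).
x_min = Fraction(-420, 2 * 30)
v_min = tuple(int(t) for t in v_of(x_min))

# Cross-check (assumption, not in the figure): since a . c = 0,
# the minimum-norm solution is also (c x a) / ||a||^2.
v_closed = tuple(Fraction(t, dot(a, a)) for t in cross(c, a))

print(first_attempt_ok, family_ok, v_min, dot(v_min, v_min))
```

Exact rational arithmetic via `Fraction` avoids floating-point noise, so the comparison between the vertex-based result and the closed form is an exact equality rather than a tolerance check.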
### Key Observations

* The initial attempt used incorrect signs in the second and third components of the cross product, which led to the vector $\begin{pmatrix} 9 \\ -12 \\ 75 \end{pmatrix}$. The self-check rejected this vector because it is not orthogonal to the resulting vector $\mathbf{c}$.
* The second attempt correctly identified that the system of equations derived from the cross product has infinitely many solutions, which form a line parallel to $\mathbf{a} = \begin{pmatrix} 1 \\ 2 \\ -5 \end{pmatrix}$.
* The general solution was correctly parameterized, and its magnitude squared was minimized by treating it as a quadratic function of the parameter.
* The minimum magnitude occurs at $x = -7$, yielding the vector $\begin{pmatrix} -7 \\ 16 \\ 5 \end{pmatrix}$.
* The final solution was verified by checking the orthogonality with the resulting vector $\mathbf{c}$, which is a crucial property of cross products.

### Interpretation

The problem demonstrates the relationship between vector cross products and systems of linear equations. The equation $\mathbf{a} \times \mathbf{v} = \mathbf{c}$ implies that $\mathbf{c}$ must be orthogonal to $\mathbf{v}$. This constraint, along with the linear relationships derived from the cross product components, defines a set of possible vectors $\mathbf{v}$. The system can have solutions only if $\mathbf{c}$ is orthogonal to $\mathbf{a}$, which holds here ($\mathbf{a} \cdot \mathbf{c} = 90 + 60 - 150 = 0$) and is reflected in the consistency of the system. The set of all solutions $\mathbf{v}$ forms a line parallel to $\mathbf{a}$, since adding any multiple of $\mathbf{a}$ to a solution leaves $\mathbf{a} \times \mathbf{v}$ unchanged. The task of finding the vector with the smallest magnitude is therefore equivalent to finding the point on this line closest to the origin. This is achieved by minimizing the squared magnitude of the general solution vector, which results in a quadratic optimization problem. The minimum occurs at a specific value of the parameter, yielding a unique vector. The verification step confirms that the found vector satisfies the orthogonality condition, which is a necessary condition for correctness.
The initial error highlights the importance of checking all properties of vector operations, not just the algebraic manipulations. </details> Figure 6: MATH500 Case. <details> <summary>x9.png Details</summary> ![01e1ebc8](/v1/image/01e1ebc84dc2fefbb5a9b455ac5f64a56311ac402e3b31bbeb6cd9ac24154cce) ### Visual Description ## Textual Document: Logical Reasoning Exercise ### Overview This document presents a series of logical reasoning problems. Each problem consists of a set of facts and a question to be answered based on those facts. The document then provides a step-by-step analysis of the facts and the question, leading to a conclusion, which is often presented as a "True" or "False" answer. The document also includes prompts for the user to re-evaluate their answer. ### Components/Axes This document does not contain charts or diagrams. It is composed entirely of text, structured into sections for: * **Problem Statement:** Presents the initial facts and the question. * **Response/Analysis:** Details the logical steps taken to arrive at an answer. * **User Interaction Prompts:** Phrases like "Wait, let me recheck my solution." and "Let me try again." ### Detailed Analysis or Content Details The document presents three distinct reasoning scenarios, each addressing the question: "Can you buy chlorine at a dollar store?" **Scenario 1:** * **Facts:** 1. Chlorine, when added to water, creates household bleach. 2. Household bleach is available at most dollar stores. * **Initial Analysis:** The analysis states that Fact 1 provides information about creating bleach, and Fact 2 states bleach is available. However, it concludes that Fact 1 does not provide direct information about the sale of chlorine itself. Therefore, the answer is "False". * **Re-evaluation:** The re-evaluation explains that while household bleach is available, the facts do not explicitly confirm the availability of chlorine itself at dollar stores. 
It reiterates that the answer "False" is not necessarily incorrect, as chlorine might not be sold directly, but used to create bleach that is sold. * **Final Answer (Implied):** False **Scenario 2:** * **Facts:** 1. Chlorine, when added to water, creates household bleach. 2. Household bleach is available at most dollar stores. * **Analysis:** This scenario reiterates the same facts as Scenario 1. The analysis states that Fact 2 indicates household bleach is a product found at dollar stores. It also notes that Fact 1 states chlorine creates bleach. The analysis then states that since the availability of chlorine itself at dollar stores is not mentioned, a definitive conclusion cannot be reached. * **Conclusion:** The answer is "False". * **Re-evaluation:** The re-evaluation explains that the first statement (chlorine + water = bleach) and the second statement (bleach available at dollar stores) do not explicitly confirm chlorine availability. It states that the answer "False" implies it's not possible to buy chlorine, but this is not necessarily correct. It's possible chlorine isn't sold directly but used to create bleach. * **Final Answer (Implied):** False **Scenario 3:** * **Facts:** 1. Chlorine, when added to water, creates household bleach. 2. Household bleach is available at most dollar stores. * **Analysis:** This scenario again uses the same facts. The analysis states that Fact 2 confirms household bleach is available. It then infers that since household bleach is made by adding chlorine to water, and bleach is available, it implies the process of making bleach involves chlorine. It further infers that since bleach is available and made from chlorine, it is reasonable to infer that chlorine is also available at dollar stores as a necessary component. * **Conclusion:** Based on the given facts, it is reasonable to conclude that you can buy chlorine at a dollar store. The answer is "True". * **Re-evaluation:** The re-evaluation confirms the logical structure. 
The first statement (chlorine + water = bleach) and the second statement (bleach available at dollar stores) are examined. The inference is that if bleach is available at dollar stores and it's made from chlorine, then chlorine is a necessary component for producing household bleach. This supports the conclusion that chlorine can be bought at a dollar store. * **Final Answer (Implied):** True ### Key Observations * The document explores the nuances of logical inference, particularly when information is implied rather than explicitly stated. * There is a progression in the analysis, with later scenarios drawing stronger inferences from the same initial facts. * The document highlights the difference between direct evidence and inferential evidence. * The "Wait, let me recheck my solution." prompts suggest an interactive learning or testing environment. ### Interpretation This document demonstrates a logical reasoning exercise focused on deductive and inductive inference. The core problem revolves around determining the availability of a precursor ingredient (chlorine) based on the availability of a finished product (household bleach) and the process of its creation. The initial scenarios lean towards a strict interpretation, concluding that without explicit mention of chlorine's sale, its availability cannot be confirmed. This represents a more conservative, direct-evidence-based approach. The third scenario, however, adopts a more inferential approach. It recognizes that the availability of a product (bleach) and knowledge of its composition (made from chlorine) allows for a reasonable inference about the availability of its essential components. This suggests a Peircean investigative approach where abduction (inference to the best explanation) is employed. The availability of bleach at dollar stores is the observed phenomenon, and the existence of chlorine at dollar stores is the best explanation for how that phenomenon can occur, given the stated facts. 
The document effectively illustrates how different levels of logical rigor can lead to different conclusions from the same set of premises. It also implicitly teaches the user to consider indirect evidence and the implications of cause-and-effect relationships in logical problem-solving. The repeated prompts for rechecking suggest that the document is designed to guide the user through the process of refining their logical reasoning skills, encouraging them to move from a literal interpretation to a more nuanced, inferential one. </details>

Figure 7: StrategyQA Case.

To intuitively demonstrate the effectiveness of our proposed method, we present the model’s inference examples after RL on the MATH500 and StrategyQA datasets in Figure 6 and Figure 7.

## Appendix F Other Discussion

### F.1 Discussion on Potential Risk

We have carefully considered potential risks associated with our work and found no significant concerns. Our approach, focused on enhancing LLM reasoning through self-verification and self-correction, does not introduce malicious or harmful effects, privacy issues, or security threats. Additionally, it does not contribute to biases, fairness concerns, or environmental impact. We believe our work is safe for responsible use in research.

### F.2 Use of AI Assistant

In this work, we used an AI assistant solely to refine and polish the language of the manuscript. The AI assistant was employed to improve clarity, flow, and overall readability, ensuring the text adhered to academic writing standards. It was not involved in any data analysis, experimentation, or formulation of ideas. All research design, methodology, results, and conclusions were developed independently by the authors. The use of the AI assistant was limited to language enhancement and did not influence the content or scientific integrity of the work.
