2602.07594

Model: healer-alpha-free

# Learning to Self-Verify Makes Language Models Better Reasoners **Authors**: Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai, Yaorui Shi, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua ## Abstract Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities. Our code is publicly available at https://github.com/chenyuxin1999/Learning-to-Self-Verify. Machine Learning, ICML ## 1 Introduction Large language models (LLMs) have demonstrated strong capabilities in complex reasoning (DeepSeek-AI, 2025; Yang et al., 2025a; Team, 2025; OpenAI, 2025). With the advancement of Reinforcement Learning with Verifiable Rewards (RLVR), current models have made substantial progress on verifiable tasks such as mathematics and programming (Shao et al., 2025; Anthropic, 2025; Z.AI, 2025), while also showing consistent improvements on open-domain tasks including writing, dialogue, and general problem solving (DeepSeek-AI et al., 2025; MiniMax, 2025; Bhaskar et al., 2025; Zeng et al., 2025b). Despite these advances, a fundamental asymmetry remains: even the most powerful models often lack the ability to reliably verify the correctness of their own outputs. <details> <summary>x1.png Details</summary> ![e45e9d4e](/v1/image/e45e9d4eb810d8efed501105fe9e1bcbf39e07d1fcebb68b22892d708c7e6a57) ### Visual Description ## Line Charts: Learning Performance Comparison ### Overview The image displays a 2x2 grid of line charts comparing two learning methods: "Learn to Generate" (top row, red lines) and "Learn to Self-Verify" (bottom row, blue lines). Each method is evaluated on two metrics: "Generation" performance (left column) and "Self-Verification" performance (right column) over 1000 training steps. ### Components/Axes * **Layout**: A 2x2 grid. Top row charts are labeled with a red header "Learn to Generate". Bottom row charts are labeled with a blue header "Learn to Self-Verify". * **Chart Titles**: * Top-Left: "Generation" * Top-Right: "Self-Verification" * Bottom-Left: "Generation" * Bottom-Right: "Self-Verification" * **X-Axis (All Charts)**: Labeled "Step". Linear scale from 0 to 1000, with major tick marks at 0, 200, 400, 600, 800, 1000. * **Y-Axis (Left Column - Generation Charts)**: Labeled "Reward". Linear scale from 0.06 to 0.22 (top-left) and 0.08 to 0.16 (bottom-left). * **Y-Axis (Right Column - Self-Verification Charts)**: Labeled "Accuracy". Linear scale from 0.45 to 0.70 (top-right) and 0.40 to 0.70 (bottom-right). * **Data Series**: Each chart contains a single line. The color corresponds to the learning method: red for "Learn to Generate", blue for "Learn to Self-Verify". ### Detailed Analysis **1. Top-Left Chart: "Learn to Generate" - Generation (Reward vs. Step)** * **Trend**: The red line shows a clear, noisy upward trend. It starts low, increases rapidly until around step 400, then continues to rise with significant volatility. * **Key Data Points (Approximate)**: * Step 0: ~0.06 * Step 200: ~0.14 * Step 400: ~0.18 * Step 600: ~0.19 * Step 800: ~0.20 * Step 1000: ~0.22 * **Characteristics**: High variance/noise throughout the training process. **2. Top-Right Chart: "Learn to Generate" - Self-Verification (Accuracy vs. Step)** * **Trend**: The red line fluctuates heavily with no strong, consistent upward or downward trend. It oscillates within a band. * **Key Data Points (Approximate)**: * The line mostly stays between 0.50 and 0.60 accuracy. * It shows sharp peaks and troughs (e.g., a peak near 0.60 at ~step 100, a trough near 0.48 at ~step 300). * Ends near 0.55 at step 1000. * **Characteristics**: Highly unstable performance, suggesting the self-verification capability does not reliably improve with generation training under this method. **3. Bottom-Left Chart: "Learn to Self-Verify" - Generation (Reward vs. Step)** * **Trend**: The blue line shows a steady, less noisy upward trend compared to its red counterpart. The growth is more consistent. * **Key Data Points (Approximate)**: * Step 0: ~0.08 * Step 200: ~0.11 * Step 400: ~0.13 * Step 600: ~0.14 * Step 800: ~0.15 * Step 1000: ~0.16 * **Characteristics**: Smoother learning curve with lower variance. **4. Bottom-Right Chart: "Learn to Self-Verify" - Self-Verification (Accuracy vs. Step)** * **Trend**: The blue line shows a strong, consistent upward trend with moderate noise. It demonstrates clear improvement over time. * **Key Data Points (Approximate)**: * Step 0: ~0.45 * Step 200: ~0.55 * Step 400: ~0.58 * Step 600: ~0.62 * Step 800: ~0.64 * Step 1000: ~0.65 * **Characteristics**: Stable and improving self-verification accuracy. ### Key Observations 1. **Method Contrast**: The "Learn to Self-Verify" method (blue) produces smoother, more stable learning curves for both generation reward and verification accuracy compared to the "Learn to Generate" method (red). 2. **Verification Stability**: The most striking difference is in the Self-Verification charts. The red line (Learn to Generate) is chaotic and stagnant, while the blue line (Learn to Self-Verify) shows reliable improvement. 3. **Reward vs. Accuracy Correlation**: For the "Learn to Self-Verify" method, the upward trends in both Generation Reward and Self-Verification Accuracy are correlated, suggesting the two capabilities improve together. 4. **Final Performance**: At step 1000, "Learn to Generate" achieves a higher final reward (~0.22 vs ~0.16) but a lower and more unstable verification accuracy (~0.55 vs ~0.65). ### Interpretation The data suggests a fundamental trade-off or difference in learning dynamics between the two approaches. * The **"Learn to Generate"** method appears to optimize aggressively for the primary reward signal (generation), achieving higher final reward values. However, this comes at the cost of a unstable and non-improving secondary capability (self-verification). This could indicate overfitting to the reward signal or a failure to learn a robust internal verification model. The high noise suggests the learning process is volatile. * The **"Learn to Self-Verify"** method demonstrates more holistic and stable learning. While its final generation reward is lower, both its generation and verification capabilities improve steadily and in tandem. This implies the method fosters a more integrated learning process where the ability to verify outputs supports and stabilizes the ability to generate them. The lower noise indicates a more reliable training progression. **In essence, the charts illustrate that directly optimizing for generation reward ("Learn to Generate") may yield higher peak performance on that metric but can destabilize other aspects of the system. In contrast, a method that explicitly incorporates self-verification ("Learn to Self-Verify") promotes balanced, stable, and correlated improvement across multiple competencies, which may be preferable for building reliable and robust AI systems.** The choice between methods would depend on whether the priority is maximizing a single metric or achieving stable, multi-faceted competence. </details> Figure 1: Training dynamics of Qwen2.5-1.5B-Instruct. (Top) It reveals a persistent asymmetry between generation and self-verification: learning to generate does not lead to improved self-verification ability, even on the same task. (Down) In the reverse direction, learning to self-verify not only improves self-verification ability but also leads to improved generation performance. LLMs have long been considered incapable of verifying the correctness of their own answers (Stechly et al., 2025; Hong et al., 2024; Zhang et al., 2024). With the advent of RLVR, some works observe that models can exhibit emergent self-verification behaviors, sometimes also referred to as an “aha moment” (DeepSeek-AI, 2025; Zeng et al., 2025a; Hu et al., 2025). However, subsequent analyses suggest that most of these behaviors are in fact fake verification: although the model appears to be checking its previous reasoning, this step has little impact on the final answer, fundamentally due to the model’s limited ability to reliably verify its own generations (Zhao et al., 2025; Yee et al., 2024). More importantly, self-verification capability does not naturally improve with increased model scale or stronger generation ability (Lu et al., 2025), revealing a persistent asymmetry between generation and self-verification. Motivated by this, several approaches attempt to jointly optimize generation and verification within the same training step, treating verification as an auxiliary component (Liu et al., 2025b; Zhang et al., 2025a; Wang et al., 2025b). In practice, however, the training dynamics of these methods remain dominated by the generation objective, leaving the fundamental asymmetry largely unexplored. In this work, we conduct an in-depth investigation of the asymmetry between generation and self-verification. Specifically, we explicitly train the LLM to generate better answers in a specific domain (e.g., mathematics) and track how it behaves when verifying its own answers on the same set of tasks throughout training process. We find that this asymmetry still persists: improving a model’s generation performance does not lead to corresponding improvements in its ability to verify its own solutions, as illustrated in Figure 1 (top). This naturally raises a key research question: does this asymmetry also manifest in the reverse direction? In other words, can improving a model’s self-verification ability lead to better generation performance? To answer this question, we adopt an alternative training paradigm: instead of training the model to generate better answers, we train it solely to judge the correctness of its own solutions. With a carefully designed self-verification training pipeline, we surprisingly find that although training the model for generation does not improve its self-verification ability, training the model to self-verify does improve its generation performance, even achieving comparable performance to standard generation training on several benchmarks, as illustrated in Figure 1 (down). Beyond comparable performance, the resulting models acquire strong verification capability. Benefiting from this improved self-verification ability, we observe a significant reduction in the number of tokens required to solve the same problems, indicating more efficient reasoning. Moreover, stronger self-verification unlocks effective test-time scaling: incorporating self-verification results into majority voting leads to performance gains. Building on these observations, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Specifically, we introduce two orthogonal training strategies: (i) learning to self-verify as a stronger initial policy before learning to generate, and (ii) alternating training between generation and verification, where a verification phase is triggered after several generation steps. Extensive experiments show that these integrated training strategies consistently outperform those trained with generation alone. Our main contributions are as follows: - We conduct an in-depth investigation of the asymmetry between generation and self-verification throughout training, and show that improving generation ability does not lead to corresponding gains in self-verification. - We identify the reverse direction of this asymmetry: learning to self-verify can effectively improve generation performance. Based on this insight, we propose to integrate self-verification into generation training by formulating a multi-task reinforcement learning framework. - We provide extensive experiments demonstrating that learning to self-verify consistently improves problem-solving performance, together with detailed analyses. ## 2 Preliminary <details> <summary>x2.png Details</summary> ![45f09127](/v1/image/45f091276ade2e7d137c3dc5b57493768de3ac51967ed3cae9c0d23a05845e4a) ### Visual Description \n ## Diagram: AI Training Pipeline with Verification and Feedback Loops ### Overview The image is a technical flowchart illustrating a multi-stage process for training or refining an AI policy model. The system incorporates answer generation, verification against reference answers, post-processing, and a self-verification loop with reward mechanisms. The diagram is divided into three primary, interconnected stages labeled a), b), and c). ### Components/Axes The diagram is structured into three main shaded regions, each representing a distinct phase: **a) On policy sampling collection (Top-left, light blue background):** * **Components:** Database, Query, Policy, Answer, Ref. Answer, Verifier, Correctness label. * **Flow:** A `Query` is drawn from a `Database` and fed into a `Policy` model. The `Policy` generates an `Answer`. This `Answer`, along with a `Ref. Answer` (Reference Answer) also sourced from the `Database`, is sent to a `Verifier`. The `Verifier` outputs a `Correctness label`. * **Auxiliary Element:** A graph labeled `Generation improvements` shows an upward trend. It points to a `GRPO` module, indicating optional `generation reward` input. The `GRPO` module sends an `Update` signal back to the `Policy`. **b) Post-processing (Top-right, light yellow background):** * **Components:** Diversity, Balancing, Filtering. * **Flow:** The output from stage a) (presumably the collected data with correctness labels) flows into this stage. It is processed through three sequential modules: `Diversity`, `Balancing`, and `Filtering`. **c) Self-verification (Bottom, light purple background):** * **Components:** Predicted judgement, Policy, verification reward. * **Flow:** The processed data from stage b) enters this loop. A `Predicted judgement` is made (likely by a model). This is fed into the `Policy` along with a `Query` and `Answer` pair. The output of this `Policy` generates a `verification reward`, which is sent to the `GRPO` module in stage a). **Central Connecting Module:** * **GRPO:** This module sits between stages a) and c). It receives a `generation reward (optional)` from the `Generation improvements` graph and a `verification reward` from stage c). It then sends an `Update` to the `Policy` in stage a), closing the main feedback loop. ### Detailed Analysis The diagram details a sophisticated, iterative training pipeline: 1. **Data Collection & Initial Verification (Stage a):** The core process begins with generating answers to queries and verifying them against ground-truth references. This creates a labeled dataset (`Correctness label`). 2. **Data Refinement (Stage b):** The collected data undergoes post-processing to ensure quality and balance. The `Diversity` module likely ensures varied examples, `Balancing` adjusts class distributions, and `Filtering` removes low-quality or noisy data. 3. **Self-Verification Loop (Stage c):** This is a key innovation. The refined data is used to train or run a `Predicted judgement` model. This model's output, in conjunction with the original `Policy`, generates a `verification reward`. This reward signal is an alternative or supplement to the direct `generation reward`. 4. **Policy Update via GRPO:** The `GRPO` (likely an acronym for a specific reinforcement learning or optimization algorithm like "Generative Reward Policy Optimization") module aggregates rewards from two sources: * **Generation Reward:** Optional, based on the trend of `Generation improvements`. * **Verification Reward:** Derived from the self-verification loop. The aggregated reward signal is used to `Update` the main `Policy` model, aiming to improve its performance iteratively. ### Key Observations * **Dual Reward Mechanism:** The system uses both a direct performance metric (`generation reward`) and an indirect, model-based metric (`verification reward`). * **Closed-Loop System:** The pipeline is cyclical. The updated `Policy` generates new answers, which go through verification and post-processing, leading to new rewards and further updates. * **Role of Reference Answers:** The `Ref. Answer` is crucial for the initial `Verifier` in stage a), providing a ground truth for creating the `Correctness label`. * **Data-Centric Post-Processing:** The dedicated `Post-processing` stage emphasizes the importance of data quality (diversity, balance, cleanliness) before it's used in the advanced self-verification loop. ### Interpretation This diagram represents a **Reinforcement Learning from Human and AI Feedback (RLHAF)** or a similar advanced training paradigm for generative AI models. It moves beyond simple supervised learning. * **What it demonstrates:** The system aims to create a more robust and reliable AI `Policy` by not just learning from static correct answers, but by incorporating a dynamic verification process. The `Self-verification` loop (stage c) suggests the model is learning to judge the quality of its own or other models' outputs, creating a form of **recursive self-improvement**. * **How elements relate:** The `Database` is the source of truth and queries. The `Policy` is the core model being improved. The `Verifier` and `Predicted judgement` act as critics or reward models. `GRPO` is the optimizer that translates feedback into model updates. The post-processing ensures the feedback is based on high-quality data. * **Notable implications:** The inclusion of `Diversity` and `Balancing` suggests an awareness of and mitigation for dataset bias. The optional `generation reward` indicates flexibility in the training signal. The entire architecture is designed to be **scalable and automated**, reducing reliance on constant human annotation by using AI verifiers and self-verification loops. The ultimate goal is likely to produce a policy that generates answers that are not only correct but also diverse, balanced, and verifiable. </details> Figure 2: Overview of our self-verification training framework. We collect on-policy problem-solving trajectories from the model and obtain correctness labels from a verifier. These trajectories are then processed through a post-processing pipeline, including data balancing, filtering, and diversity-aware sampling, to construct self-verification training data, which is used to train the model to judge the correctness of its own answers. We find that training the model solely for self-verification already leads to improved generation performance. Integrating this self-verification objective into generation training further strengthens the model’s generation ability. In this section, we introduce the preliminary concepts and notations used throughout the paper. We first review the RLVR formulation in Section 2.1. We then show how the same RLVR framework can be instantiated in two different settings: generation training (Section 2.2), where the model is optimized to solve a given task, and verification training (Section 2.3), where the model is optimized to judge the correctness of a given solution. ### 2.1 RLVR Reinforcement Learning with Verifiable Rewards (RLVR) is a reinforcement learning framework for training language models using automatically computable reward signals. Instead of relying on human preference models, RLVR employs a rule-based verifier that evaluates each model output against a reference and returns a scalar reward. Concretely, given an input query $x_{i}$ , where $i$ denotes the query index, the model parameterized by $\pi_{\theta}$ generates multiple outputs $o_{i,j}=(z_{i,j},y_{i,j})$ , where $j$ indexes different samples generated for the same query. Here, $z_{i,j}$ denotes the intermediate reasoning trace and $y_{i,j}$ denotes the final prediction. A verifier then assigns a reward score $r_{i,j}$ by comparing the model output with a reference solution $y_{i}^{*}$ . The training objective is to optimize the model parameters so as to maximize the expected verifier reward over model-generated samples: $$ \max_{\theta}\;\mathbb{E}_{o\sim\pi_{\theta}(\cdot\mid x)}\big[r\big]. \tag{1} $$ In this work, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as the underlying optimization algorithm. GRPO can be viewed as a simplified variant of PPO (Schulman et al., 2017) that directly optimizes the policy without introducing a separate value network. For each input $x_{i}$ , the policy samples a set of $G$ candidate outputs $\{(z_{i,j},y_{i,j})\}_{j=1}^{G}$ , each receiving a reward $r_{i,j}$ . The policy is then updated by comparing each candidate against the group statistics, using the following clipped surrogate objective: $$ \displaystyle\mathcal{L}_{\text{GRPO}}(\theta)=\mathbb{E}_{x_{i}\sim\mathcal{D}}\Bigg[\frac{1}{G}\sum_{j=1}^{G}\frac{1}{|z_{i,j}\circ y_{i,j}|} \displaystyle\times\sum\limits_{t=1}^{|z_{i,j}\circ y_{i,j}|}\min\Big(\rho^{t}_{i,j}A_{i,j},\;\mathrm{clip}(\rho^{t}_{i,j},1-\epsilon,1+\epsilon)A_{i,j}\Big)\Bigg] \tag{2} $$ where: $$ \rho_{i,j}^{t}=\frac{\pi_{\theta}(z_{i,j}^{t}\circ y_{i,j}^{t}\mid x_{i},z_{i,j}^{<t}\circ y_{i,j}^{<t})}{\pi_{\theta_{\text{old}}}(z_{i,j}^{t}\circ y_{i,j}^{t}\mid x_{i},z_{i,j}^{<t}\circ y_{i,j}^{<t})}, $$ where the superscript $t$ denotes the token index in the concatenated sequence, and $\circ$ denotes sequence concatenation. Here, the advantage $A_{i,j}$ is computed by normalizing the rewards within the sampled group: $$ A_{i,j}=\frac{r_{i,j}-\mu_{i}}{\sigma_{i}+\epsilon_{\text{norm}}}, \tag{3} $$ with: $$ \mu_{i}=\frac{1}{G}\sum_{j=1}^{G}r_{i,j},\qquad\sigma_{i}=\sqrt{\frac{1}{G}\sum_{j=1}^{G}(r_{i,j}-\mu_{i})^{2}}. $$ Intuitively, GRPO encourages generations that perform better than the group average while suppressing those with lower relative rewards. Inspired by (Yu et al., 2025a), we adopt the clip-higher strategy and token-level mean advantage normalization. ### 2.2 Generation Training Under the RLVR formulation, generation training corresponds to the standard task-solving setting. Each training example consists of a task query $x_{i}$ and a reference solution $y_{i}^{*}$ . Given $x_{i}$ , the model samples multiple candidate solutions $\{(z_{i,j},y_{i,j})\}_{j=1}^{G}$ , and the verifier assigns a reward by checking whether each generated answer $y_{i,j}$ matches the reference solution $y_{i}^{*}$ . In this case, the reward signal directly reflects task-solving correctness, and RLVR reduces to optimizing the policy to produce correct solutions for the given tasks. ### 2.3 Verification Training Under the same RLVR formulation, verification training corresponds to a different instantiation of the input and reference. Each training sample is constructed from a triplet $(x_{i},y_{i,j},c_{i,j})$ , where $x_{i}$ is the task query, $y_{i,j}$ is a candidate solution generated by the model, and $c_{i,j}\in\{0,1\}$ is a binary correctness label indicating whether $y_{i,j}$ matches the reference solution $y_{i}^{*}$ . Given such an input $(x_{i},y_{i,j})$ , the model is prompted to output a judgment $\hat{c}_{i,j}$ indicating whether the provided solution is correct. A rule-based verifier then assigns a reward by comparing the model’s judgment $\hat{c}_{i,j}$ with the reference label $c_{i,j}$ . In this setting, the model is not optimized to solve the task itself, but rather to assess the correctness of given solutions. ## 3 Learning to Self-Verify Table 1: Evaluation of learning to self-verify across six mathematical reasoning benchmarks. We report both task accuracy (Acc@16 $\uparrow$ ) and average reasoning length in tokens (Tokens $\downarrow$ ) for each model trained under two different objectives: train LLM to generate better solutions (Generate), and train LLM to verify its own solutions (Self-Verify). Results show that models trained with self-verification yield efficient reasoning traces, while achieving comparable or sometimes even better performance than models trained for generation. | Method | AMC23 | Minerva | Olympiad | Math500 | AIME24 | AIME25 | Avg | | | | | | | | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | Tokens $\downarrow$ | Acc $\uparrow$ | Tokens $\downarrow$ | Acc $\uparrow$ | Tokens $\downarrow$ | Acc $\uparrow$ | Tokens $\downarrow$ | Acc $\uparrow$ | Tokens $\downarrow$ | Acc $\uparrow$ | Tokens $\downarrow$ | Acc $\uparrow$ | Tokens $\downarrow$ | Acc $\uparrow$ | | | Qwen2.5-1.5B-Instruct | | | | | | | | | | | | | | | | Generate | 1402 | 30.5 | 963 | 12.4 | 1639 | 20.6 | 936 | 53.7 | 2580 | 2.9 | 2103 | 0.8 | 1604 | 20.2 | | Self-Verify | 1309 | 33.0 | 870 | 14.0 | 1351 | 22.2 | 817 | 54.6 | 1467 | 4.8 | 1545 | 1.3 | 1227 | 21.7 | | Qwen2.5-3B-Instruct | | | | | | | | | | | | | | | | Generate | 2754 | 50.9 | 2006 | 17.0 | 3299 | 27.4 | 2021 | 59.6 | 4811 | 8.1 | 4744 | 8.1 | 3273 | 28.5 | | Self-Verify | 1825 | 46.7 | 1658 | 17.1 | 1891 | 32.1 | 1237 | 65.6 | 2755 | 9.2 | 2252 | 6.3 | 1936 | 29.5 | | Qwen2.5-7B-Instruct | | | | | | | | | | | | | | | | Generate | 3967 | 65.3 | 2353 | 25.3 | 4543 | 37.8 | 2437 | 70.9 | 7053 | 16.3 | 6397 | 18.1 | 4458 | 38.9 | | Self-Verify | 1194 | 59.7 | 823 | 25.2 | 1168 | 39.8 | 783 | 74.7 | 1575 | 19.4 | 1369 | 11.7 | 1152 | 38.4 | In this section, we investigate the reverse direction of this long-standing asymmetry: whether a model can improve its generation performance solely by learning to verify its own solutions. We first introduce our self-verification training pipeline in Section 3.1, then describe the experimental setup in Section 3.2, present the main results in Section 3.3, and provide further analysis in Section 3.4. ### 3.1 Self-Verification Framework Following the notation in Section 2, we now introduce our self-verification framework, as illustared in Figure 2. #### On-Policy Sample Collection At each training iteration, we sample a mini-batch of $B$ queries $\{x_{i}\}_{i=1}^{B}$ . For each query $x_{i}$ , we use the current policy $\pi_{\theta}$ to generate $G$ candidate answers, resulting in $B\times G$ generated samples. For each generated sample, the model produces a solution $y_{i,j}$ together with its corresponding reasoning trace $z_{i,j}$ , where $j=1,\ldots,G$ . A rule-based verifier then compares each $y_{i,j}$ with the reference answer $y_{i}^{*}$ and assigns a binary correctness label $c_{i,j}$ . Each sample is thus represented as a triplet $(x_{i},y_{i,j},c_{i,j})$ . All such triplets are stored in a temporary buffer and serve as the raw candidates for constructing the self-verification training data. #### Post-Processing At each iteration, the on-policy sampling procedure produces $B\times G$ samples. Directly using all of them for verification training is computationally expensive and can also introduce instability due to imbalance or low-quality samples. For a fair comparison, we downsample these candidates and construct a verification training batch of size $B$ by selecting the most informative samples. Specifically, we apply the following steps: - Filtering: We first discard invalid samples, including those with malformed outputs, excessively long generations, or missing a unique final answer. We further discard queries for which all generated answers are incorrect, as such cases typically exceed the current capability of the model and provide little useful supervision signal for self-verification. - Diversity Control: To avoid overfitting to a small subset of queries when conducting self-verification training, we perform sampling at the query level and ensure that the selected verification samples are drawn from diverse input queries. - Data Balancing: Since generation often produces highly imbalanced labels (e.g., mostly incorrect at early stages and mostly correct at later stages), while self-verification is essentially a binary classification task, we explicitly enforce each mini-batch of verification data to contain an equal number of correct and incorrect samples. #### Training In self-verification training, the model is prompted with a query–answer pair $(x_{i},y_{i,j})$ and is required to predict whether the provided answer is correct. Let $\hat{c}_{i,j}$ denote the model’s predicted judgment and $c_{i,j}$ the reference correctness label obtained from the rule-based verifier. A verification reward is then computed as: $$ r_{i,j}^{v}=\mathrm{Verifier}(\hat{c}_{i,j},c_{i,j}). \tag{4} $$ We then optimize the model using the same GRPO objective as in Section 2.1. This training stage treats the model purely as a verifier and encourages it to improve its ability to distinguish correct from incorrect answers. We emphasize that at this stage, the training objective contains no generation reward. The policy is optimized solely to maximize the expected verification reward. <details> <summary>figures/aime24_qwen2.5_1.5B_instruct.png Details</summary> ![39b25d5f](/v1/image/39b25d5fe5a163ca461bc327edf0942f34234a18133d675a33666c7273a6b9f2) ### Visual Description ## Dual Line Charts: Accuracy and Token Usage vs. Training Progress ### Overview The image displays two side-by-side line charts comparing the performance of two methods, "Generate" and "Self-Verify," across different stages of training progress. The left chart tracks accuracy ("Acc"), while the right chart tracks the number of tokens used. Both charts share the same x-axis representing "Training Progress" with discrete points at 0, 1/8, 1/4, 1/2, and 1. ### Components/Axes **Common Elements:** * **X-Axis (Both Charts):** Labeled "Training Progress." The axis markers are at the following positions: `0`, `1/8`, `1/4`, `1/2`, and `1`. * **Legend (Both Charts):** Located in the top-left corner of each chart's plot area. * A red square symbol (□) corresponds to the label **"Generate"**. * A blue circle symbol (○) corresponds to the label **"Self-Verify"**. **Left Chart - Accuracy:** * **Chart Title/Type:** Line chart showing Accuracy vs. Training Progress. * **Y-Axis:** Labeled **"Acc"**. The scale ranges from 2 to 6, with major gridlines at intervals of 1 (2, 3, 4, 5, 6). * **Data Series:** * **Generate (Red line with square markers):** The line is a light red/pink color. * **Self-Verify (Blue line with circle markers):** The line is a grayish-blue color. **Right Chart - Token Usage:** * **Chart Title/Type:** Line chart showing Tokens vs. Training Progress. * **Y-Axis:** Labeled **"Tokens"**. The scale ranges from 1000 to 6000, with major gridlines at intervals of 1000 (1000, 2000, 3000, 4000, 5000, 6000). * **Data Series:** * **Generate (Red line with square markers):** The line is a light red/pink color. * **Self-Verify (Blue line with circle markers):** The line is a grayish-blue color. ### Detailed Analysis **Left Chart: Accuracy (Acc)** * **Trend Verification:** * **Generate (Red):** The line slopes upward from 0 to 1/2, reaching a peak, then slopes sharply downward from 1/2 to 1. * **Self-Verify (Blue):** The line slopes upward from 0 to 1/2, reaching a higher peak than Generate, then slopes downward from 1/2 to 1, but remains above the Generate line at the final point. * **Data Points (Value at each Training Progress marker):** * **At 0:** Generate = 2.2, Self-Verify = 2.2 * **At 1/8:** Generate = 2.9, Self-Verify = 2.9 * **At 1/4:** Generate = 3.8, Self-Verify = 3.5 * **At 1/2:** Generate = 4.2, Self-Verify = 5.6 * **At 1:** Generate = 2.5, Self-Verify = 4.0 **Right Chart: Token Usage (Tokens)** * **Trend Verification:** * **Generate (Red):** The line shows a consistent, strong upward slope across all training progress points. * **Self-Verify (Blue):** The line shows a slight initial decrease, then a gradual, modest upward slope from 1/8 to 1/2, followed by a steeper upward slope from 1/2 to 1. * **Data Points (Value at each Training Progress marker):** * **At 0:** Generate = 1935, Self-Verify = 1935 * **At 1/8:** Generate = 2580, Self-Verify = 1635 * **At 1/4:** Generate = 2910, Self-Verify = 1726 * **At 1/2:** Generate = 3782, Self-Verify = 1754 * **At 1:** Generate = 5422, Self-Verify = 3228 ### Key Observations 1. **Accuracy Peak:** Both methods peak in accuracy at the halfway point (Training Progress = 1/2). The "Self-Verify" method achieves a significantly higher peak accuracy (5.6) compared to "Generate" (4.2). 2. **Accuracy Degradation:** After the peak at 1/2, both methods experience a drop in accuracy by the end of training (Progress = 1). However, "Self-Verify" maintains a higher final accuracy (4.0) than "Generate" (2.5). 3. **Token Cost Divergence:** The token usage for the "Generate" method increases linearly and substantially throughout training. In contrast, "Self-Verify" uses fewer tokens initially (after 0), grows slowly until the halfway point, and then increases more rapidly, but its final token count (3228) is still significantly lower than that of "Generate" (5422). 4. **Initial Parity:** At the very start of training (Progress = 0), both methods have identical accuracy (2.2) and token usage (1935). ### Interpretation The data suggests a clear trade-off and performance narrative between the two methods during the training process: * **"Self-Verify" is a more efficient and ultimately more accurate strategy.** It achieves a much higher peak accuracy and maintains better final accuracy, all while consuming considerably fewer tokens by the end of training. Its token usage pattern suggests it may be more selective or efficient in its operations, especially in the first half of training. * **"Generate" shows signs of overfitting or degradation.** While it improves accuracy initially, its performance peaks lower and then collapses dramatically in the second half of training. This decline coincides with a relentless increase in token consumption, indicating it may be generating more content without a corresponding gain in quality, and potentially even harming its performance. * **The halfway point (1/2) is critical.** It represents the optimal training stage for accuracy for both methods. The divergence in their trajectories after this point is the most telling: "Self-Verify" manages its resources better to retain more accuracy, while "Generate" expends more resources for diminishing and ultimately negative returns. In summary, the charts demonstrate that the "Self-Verify" approach is superior in this context, offering a better balance of high accuracy and controlled computational cost (tokens) over the full course of training. </details> Figure 3: Comparison of accuracy and token usage between generation training and self-verification training on AIME24 with Qwen2.5-1.5B-Instruct. ### 3.2 Experimental Setup We conduct extensive experiments to compare the effects of training LLMs to generate solutions and training them to self-verify. We first describe our experimental setup. #### Dataset and Benchmarks For training, we use DAPO-Math-17K (Yu et al., 2025a), a dataset widely adopted for mathematical reasoning. We evaluate our models on six challenging mathematical reasoning benchmarks: AIME24 (Zhang and Math-AI, 2024), AIME25 (Zhang and Math-AI, 2025), AMC23, Minerva, MATH500 (Lightman et al., 2024), and OlympiadBench (He et al., 2024). #### Implementation We choose Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, and Qwen2.5-7B-Instruct as backbone models (Yang et al., 2024). We use veRL (Sheng et al., 2025) as the training framework to implement our RL-based methods with a rule-based verifier. For both generation training and self-verification training, we train the models for 1000 steps. For fair comparison, both generation and verification training use a batch size of 128 with a group size of 8. We set the maximum generation length to 10,240 tokens for all models, with the temperature set to 0.6 and top- $p$ to 0.95. #### Evaluation We compare models trained exclusively for generation with those trained exclusively for self-verification on the benchmarks. We report two main metrics: (1) Acc, measured by Avg@16 accuracy, (2) Token, calculated as the average number of tokens (including both intermediate reasoning and the final answer) across all outputs on each test set. This metric reflects the reasoning efficiency of the model. ### 3.3 Main Results Table 1 summarizes the performance and reasoning length across six benchmarks and three models, comparing models trained solely to generate answers with models trained solely to judge the correctness of their own solutions. In addition, Figure 3 illustrates the evolution of accuracy and token usage throughout training on AIME24 with Qwen2.5-1.5B-Instruct. From the results, we can draw two key conclusions: #### Learning to self-verify achieves comparable performance to learning to generate. Across all models and datasets, training the model solely for self-verification yields performance that is comparable to, and in some cases better than, that achieved by generation-only training. For example, for Qwen2.5-1.5B-Instruct, the self-verification-trained model outperforms the generation-trained model in accuracy across all benchmarks. For Qwen2.5-3B-Instruct, self-verification achieves 32.1% accuracy on OlympiadBench and 65.6% on Math500, surpassing the generation baseline by 4.7% and 6.0%, respectively, demonstrating strong potential even without explicit generation training. This points to an interesting asymmetry in the reverse direction: while improving a model’s generation performance does not lead to a corresponding improvement in its ability to self-verify, even on the same task (cf. Figure 1), improving self-verification alone can in turn enhance generation performance. Table 2: Evaluation of verification capability. We report the Acc@8 of different models in judging the correctness of solutions generated by DeepSeek-R1-Distill-Qwen-7B. | Model | Base | Generate | Self-Verify | | --- | --- | --- | --- | | Qwen2.5-1.5B-Instruct | 45.58 | 45.95 +0.37 | 62.31 +16.73 | | Qwen2.5-3B-Instruct | 59.82 | 55.19 -4.63 | 65.69 +5.87 | | Qwen2.5-7B-Instruct | 64.46 | 68.84 +4.38 | 69.50 +5.04 | #### Learning to self-verify requires significantly fewer tokens to solve the same problems. Across all models and datasets, training the model solely for self-verification consistently produces much shorter reasoning traces than generation-only training, while maintaining comparable performance. Notably, for Qwen2.5-7B-Instruct, the self-verification-trained model achieves performance comparable to generation training using only about 25% of the tokens. For Qwen2.5-3B-Instruct, it uses roughly 60% of the tokens while even slightly outperforming the generation baseline. These results indicate that, although the final performance is comparable, self-verification leads to substantially more efficient reasoning traces. We attribute this to the strengthened self-verification ability induced by our training, which enables the model to better recognize when its current solution is likely incorrect and when verification should be triggered. As a result, the model avoids redundant or “fake” verification behaviors and follows more direct solution trajectories. This markedly different reasoning behavior further motivates us to regard self-verification and generation as complementary training signals. <details> <summary>x3.png Details</summary> ![f48b2b29](/v1/image/f48b2b29b59a47fbbf3369e41e1a10542d0b29769d3bce8dbbf13fa78840f369) ### Visual Description ## Grouped Bar Chart: Accuracy Comparison of Qwen2.5 Models with Different Methods ### Overview The image is a grouped bar chart comparing the performance (accuracy) of three different Qwen2.5 language model variants on the ArguBench (ArguB) dataset. For each model variant, three methods are evaluated: Base, Generate, and Self-Verify. The chart demonstrates the impact of these methods on model accuracy. ### Components/Axes * **Chart Type:** Grouped Bar Chart. * **Y-Axis:** * **Label:** `Accuracy (ArguB) %` * **Scale:** Linear scale from 0 to 50, with major tick marks at intervals of 10 (0, 10, 20, 30, 40, 50). * **X-Axis:** * **Label:** Model variants. * **Categories (Groups):** Three distinct model variants are listed: 1. `Qwen2.5-1.5B-Instruct` 2. `Qwen2.5-3B-Instruct` 3. `Qwen2.5-7B-Instruct` * **Legend:** * **Position:** Top-left corner of the chart area. * **Content:** Defines the three bar patterns/colors corresponding to the methods: * `Base`: Solid light blue bar. * `Generate`: Light blue bar with diagonal hatching (lines sloping down from left to right). * `Self-Verify`: Reddish-brown bar with a dense cross-hatch pattern. * **Data Labels:** Each bar has its exact accuracy value printed directly above it. ### Detailed Analysis The chart presents accuracy percentages for each method across the three model sizes. The data is as follows: **1. Qwen2.5-1.5B-Instruct (Leftmost Group):** * **Trend:** Accuracy increases progressively from Base to Generate to Self-Verify. * **Data Points:** * Base: 29.52% * Generate: 29.54% (a negligible increase of 0.02% over Base) * Self-Verify: 31.10% (an increase of ~1.56% over Generate) **2. Qwen2.5-3B-Instruct (Center Group):** * **Trend:** A clear, steady upward trend from Base to Generate to Self-Verify. * **Data Points:** * Base: 34.66% * Generate: 36.73% (an increase of ~2.07% over Base) * Self-Verify: 39.44% (an increase of ~2.71% over Generate) **3. Qwen2.5-7B-Instruct (Rightmost Group):** * **Trend:** A very strong upward trend, with the largest gains observed in this model group. * **Data Points:** * Base: 30.44% * Generate: 38.33% (a substantial increase of ~7.89% over Base) * Self-Verify: 44.64% (a further large increase of ~6.31% over Generate) ### Key Observations 1. **Consistent Improvement with Advanced Methods:** For all three model sizes, the `Self-Verify` method yields the highest accuracy, followed by `Generate`, with `Base` performing the worst. This pattern is consistent across the board. 2. **Scaling Effect:** The performance gain from applying the `Generate` and `Self-Verify` methods is not uniform across model sizes. The largest model (7B) shows the most dramatic improvement, especially between its `Base` (30.44%) and `Self-Verify` (44.64%) scores—a total gain of over 14 percentage points. 3. **Anomaly in Base Performance:** The `Base` accuracy does not scale linearly with model size. The 3B model (34.66%) outperforms both the smaller 1.5B (29.52%) and the larger 7B (30.44%) models when using only the `Base` method. This suggests factors beyond pure parameter count influence baseline performance on this task. 4. **Method Efficacy:** The `Self-Verify` method provides a significant boost over `Generate` for the two larger models (+2.71% for 3B, +6.31% for 7B), but a much smaller one for the smallest model (+1.56%). This could indicate that the self-verification capability benefits more from the model's increased capacity. ### Interpretation The data suggests that for the ArguBench benchmark, simply scaling up the base model size (from 1.5B to 7B parameters) does not guarantee better performance, as seen in the dip in `Base` accuracy for the 7B model. However, when combined with more sophisticated inference methods like `Generate` and particularly `Self-Verify`, larger models can leverage their increased capacity to achieve substantially higher accuracy. The `Self-Verify` method appears to be a highly effective technique for improving reasoning or argumentative performance on this benchmark. Its advantage grows with model scale, implying it may be a key method for unlocking the potential of larger language models. The near-identical `Base` and `Generate` scores for the 1.5B model suggest a possible performance ceiling for that model size on this task, where the `Generate` method alone provides little benefit, but `Self-Verify` can still extract some improvement. In summary, the chart demonstrates that methodological improvements (like `Self-Verify`) can be as important, or even more important, than raw model scale for achieving high performance on specific tasks. The most effective strategy appears to be combining larger models with advanced verification techniques. </details> Figure 4: Performance comparison under partially corrupted reasoning prefix setting. ### 3.4 Analysis In this section, we conduct a detailed analysis to investigate what capabilities are acquired by learning to self-verify and how these capabilities can be exploited in practice. Based on our experiments, we make the following observations: #### Explicit self-verification training turns the model into a strong verifier. Figure 1 demonstrates that even models with limited parameter sizes can verify their own solutions much more accurately after self-verification training. To further evaluate the model’s verification capability as a general verifier in specific domains, we construct a verification evaluation set consisting of benchmark solutions generated by DeepSeek-R1-Distill-Qwen-7B. The model is then asked to judge whether each solution is correct or incorrect, and the results are reported in Table 2. The results show that, since self-verification is essentially a verification task, our model naturally acquires the ability to assess solutions produced by other models as well, demonstrating strong general-purpose verification capability. In contrast, models trained only for generation overall achieve marginal performance gain and in some cases even exhibit noticeable degradation. #### Learning to self-verify enables the model to identify and correct errors in its reasoning process. We observe that after self-verification training, the number of tokens required to solve a problem is significantly reduced. This suggests that the model starts to precisely trigger verification when it detects potential errors in its reasoning and to correct them in time, thereby avoiding redundant or “fake” verification behaviors. To validate this hypothesis, we construct a dedicated evaluation set to assess the effectiveness of self-verification behaviors during the reasoning process. Specifically, we mix data from different benchmarks and build a set of 1,545 problems. We first collect the original reasoning trajectories generated by Qwen2.5-7B-Instruct on these problems. We then use GPT-4.1 (OpenAI, 2023) to randomly rewrite these trajectories into reasoning step prefixes with varying numbers of steps and injected some mistakes. The model is prompted with the original query and the corrupted prefix, and is asked to continue the reasoning process. Under this setting, a higher success rate indicates that the model is more capable of detecting errors in the ongoing reasoning and correcting them through effective self-verification. As shown in Figure 4, the self-verification-trained model significantly outperforms both the base model and the generation-trained model, demonstrating substantially stronger error detection and correction capability during reasoning. In contrast, generation training yields only marginal improvements over the base model in this setting. Table 3: Test-time scaling with self-verification with Qwen2.5-1.5B-Instruct. We report Acc@32 and compare standard majority voting and majority voting augmented with self-verification across three training regimes: Base, Generate, and Self-Verify. | Method | AIME25 | MATH500 | Olympaid | Minerva | | --- | --- | --- | --- | --- | | Base | | | | | | Major voting | 3.30 | 52.20 | 22.40 | 14.30 | | + Self-verify | 3.30 +0.00 | 52.40 +0.20 | 22.30 -0.10 | 10.70 -3.60 | | Generate | | | | | | Major voting | 0.00 | 54.20 | 23.40 | 15.80 | | + Self-verify | 0.00 +0.00 | 53.00 -1.20 | 23.60 +0.20 | 13.60 -2.20 | | Self-Verify | | | | | | Major voting | 3.30 | 55.20 | 25.80 | 16.20 | | + Self-verify | 6.70 +3.40 | 56.40 +1.20 | 27.20 +1.40 | 16.20 +0.00 | #### Effective self-verification enables test-time scaling. With a substantially improved self-verification capability, the model can reliably assess the correctness of its own candidate solutions, which unlocks a new form of test-time scaling based on self-verification. Specifically, at inference time, we sample multiple candidate solutions, let the model verify each of them, and aggregate the verification results to obtain a verification score for each candidate. We then jointly consider the majority vote and the verification scores to determine the final answer. Experimental results in Table 3 show that introducing this additional self-verification signal at test time consistently improves performance, demonstrating that self-verification provides an effective and principled way to scale inference beyond naive sampling or self-consistency. Table 4: Evaluation of integrating self-verification into generation training across six mathematical reasoning benchmarks. We report task accuracy (Acc@16 $\uparrow$ ) for each model under four training strategies. Results show that our strategies improve overall performance over standard and mixed training. | Method | AMC23 | Minerva | Olympiad | Math500 | AIME24 | AIME25 | Avg | | --- | --- | --- | --- | --- | --- | --- | --- | | Qwen2.5-1.5B-Instruct | | | | | | | | | Generate | 30.5 | 12.4 | 20.6 | 53.7 | 2.9 | 0.8 | 20.2 | | Mixed-Train | 33.3 | 12.7 | 21.2 | 53.8 | 3.8 | 1.5 | 21.1 | | Verify-Init | 33.0 | 13.0 | 21.9 | 54.7 | 5.4 | 1.3 | 21.6 | | Verify-Alter | 36.4 | 13.9 | 22.4 | 54.2 | 5.0 | 4.2 | 22.7 | | Qwen2.5-3B-Instruct | | | | | | | | | Generate | 50.9 | 17.0 | 27.4 | 59.6 | 8.1 | 8.1 | 28.5 | | Mixed-Train | 49.4 | 17.6 | 27.5 | 59.2 | 10.8 | 6.3 | 28.5 | | Verify-Init | 47.8 | 18.3 | 29.5 | 63.1 | 9.6 | 6.5 | 29.1 | | Verify-Alter | 47.7 | 18.7 | 30.2 | 64.5 | 9.4 | 5.6 | 29.4 | | Qwen2.5-7B-Instruct | | | | | | | | | Generate | 65.3 | 25.3 | 37.8 | 70.9 | 16.3 | 18.1 | 38.9 | | Mixed-Train | 59.2 | 24.6 | 40.5 | 73.5 | 15.8 | 9.8 | 37.2 | | Verify-Init | 63.6 | 26.0 | 39.0 | 74.0 | 17.3 | 12.5 | 38.7 | | Verify-Alter | 68.0 | 26.0 | 39.0 | 72.9 | 18.3 | 17.7 | 40.3 | ## 4 Integrating Self-Verification into Training We observe that training a model solely to verify its own answers already improves its generation performance to a level comparable with models trained purely for generation, while exhibiting a markedly different inference behavior: verification-only models produce significantly shorter outputs, indicating a more efficient reasoning trace. This markedly different reasoning behavior further motivates us to view self-verification and generation as complementary training signals. Building on this observation, we further propose to integrate self-verification into generation training. ### 4.1 Multi-Task RL Pipeline In this work, we formulate the integration of generation and self-verification as a multi-task reinforcement learning problem, where the two objectives are optimized in a decoupled manner. Under this framework, we consider two simple yet effective strategies that are orthogonal: a stage-wise initialization strategy and an alternating training strategy. #### Stage-wise Initialization We first train the model with a self-verification objective by optimizing the policy to maximize the verification reward $r_{v}$ , as described in Section 3.1. The resulting model, which already possesses a stronger ability to judge the correctness of its own outputs, is then used as a better initial policy for standard generation training, where the policy is further optimized to maximize the generation reward $r_{g}$ . #### Alternating Training We alternate between generation training and self-verification training. Specifically, we run generation training for $n$ steps to optimize the policy with respect to the generation reward $r_{g}$ . Every $n$ generation steps, we trigger a self-verification phase, during which the same policy is optimized with respect to the verification reward $r_{v}$ , using the answers generated in the preceding generation phase to construct verification training data. This process is repeated throughout training, allowing the policy to be continuously shaped by both objectives. In both strategies, generation and self-verification are optimized under the same RLVR framework using GRPO, and the only difference lies in which reward signal ( $r_{g}$ or $r_{v}$ ) is used at each stage of training. ### 4.2 Experimental Setup #### Baseline To benchmark the effectiveness of our method, we compare it against two primary baselines. Generate follows the standard RL-based training paradigm for reasoning models and optimizes the policy solely with respect to the generation reward. Mixed-Train (Zhang et al., 2025a) jointly optimizes generation and self-verification objectives within each training step. For fair comparison, all baselines and our methods are trained using the same implementation and the same set of hyperparameters. #### Implementation and Evaluation We use the same datasets, model architectures, implementation details, and evaluation protocols as in Section 3.3. In this section, we evaluate two strategies for integrating self-verification into training: Verify-Init, which corresponds to the stage-wise initialization strategy, and Verify-Alter, which corresponds to the alternating training strategy. For Verify-Init, we initialize the model from a checkpoint obtained after 400 steps of self-verification-only training, and then further train it for 600 steps with the generation objective. For Generate, Mixed-Train, and Verify-Alter, we train the models for 1000 steps in total. ### 4.3 Results and Analysis Table 4 summarizes the performance across six benchmarks and three models. Beyond training the model solely for self-verification, we find that integrating self-verification into generation training consistently leads to improved generation performance across most models and benchmarks. Compared to Mixed-Train, which directly mixes the two objectives within a single optimization step, our framework decouples the optimization of generation and verification and optimizes them in a coordinated but separate manner, demonstrating additional performance gains. For instance, for Qwen2.5-1.5B-Instruct, Verify-Alter improves the average accuracy from 20.2% to 22.7%, outperforming both standard generation training and mixed-objective training. Notably, on AMC23, it improves accuracy by 5.9 points, and on the more challenging AIME benchmarks, it raises accuracy from 0.8% to 4.2%. This suggests that self-verification provides a complementary and beneficial training signal. ## 5 Related Works ### 5.1 LLM as Generator Improving the generation capability of LLMs has long been a central focus of the community (Brown et al., 2020; OpenAI, 2023). Early works typically collect high-quality trajectories with complex reasoning patterns and train LLMs via imitation learning (Ouyang et al., 2022; Touvron et al., 2023; Yang et al., 2024). With the success of models such as DeepSeek-R1 (DeepSeek-AI, 2025), which is trained with the GRPO algorithm (Shao et al., 2024), a surge of follow-up research has been inspired. Meanwhile, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful and scalable training paradigm for further boosting LLM generation performance by leveraging verifiable reward signals (Jin et al., 2025; Wang et al., 2025c). Building on these foundations, more recent studies start to investigate how to extend LLMs’ advanced generation ability to broader domains (Su et al., 2025; Yu et al., 2025c; Gunjal et al., 2025), make generation more efficient at inference time (Sui et al., 2025; Feng et al., 2025; Wang et al., 2025a), and improve stability and effectiveness of generation training (Yang et al., 2025b; Wu et al., 2025; Chen et al., 2025c). These advances have led to a series of increasingly capable large models (OpenAI, 2025; Anthropic, 2025; Google, 2025). However, despite these advancements, even the most powerful LLMs still cannot reliably self-verify their own outputs (Lu et al., 2025; Stechly et al., 2025). ### 5.2 LLMs as Verifier Verifiers play a crucial role in guiding LLMs toward better generations and enabling effective test-time scaling (Zhong et al., 2025; Yu et al., 2025b; Snell et al., 2024). Existing verifiers are typically either (1) discriminative (Liu et al., 2025a, 2024), producing scalar scores to rank candidate responses, or (2) generative (Zhang et al., 2025b; Mahan et al., 2024; Liu et al., 2025c), producing textual judgments or reward signals. With the success of RLVR in training stronger generators, increasing attention has been paid to generative verifiers due to their better generalization ability. LLM verifiers typically produce natural language rationales or textual judgments, which improve transparency and evaluation reliability. Correspondingly, training methods for LLM verifiers have evolved from supervised fine-tuning (SFT) to direct preference optimization (DPO) (Chen et al., 2025a; Zhang et al., 2025b; Liu et al., 2025c), and more recently to RLVR (Chen et al., 2025b; Yu et al., 2025d), inspired by advances in reasoning-oriented models. Despite these advances, how training as verifiers influences the model itself as a generator remains largely underexplored. ### 5.3 Joint Training of Generator and Verifier Recently, several works have begun to explore incorporating verification signals into generator training. Among them, (Chen et al., 2025d) shows that collecting correctness signals on external models’ outputs and training LLMs via imitation learning with fixed templates can shorten generated responses, albeit sometimes at the cost of slightly degraded generation performance. Other works (Liu et al., 2025b; Zhang et al., 2025a; Wang et al., 2025b) propose to jointly train generation and verification within the same training step, where verification is scaled as an auxiliary signal while the overall training dynamics remain dominated by the generation objective. Different from these approaches, we notice that rewarding self-verification alone is sufficient to obtain a generator with performance comparable to standard generation training, while producing better reasoning traces. We further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two decoupled but complementary objectives. ## 6 Conclusion In this work, we investigate the asymmetry between generation and self-verification in large language models and show that improving generation does not naturally lead to better self-verification, even on the same task. More interestingly, we identify the reverse direction of this asymmetry: learning to self-verify alone can significantly improve generation performance. This finding challenges the common view of verification as merely an auxiliary component and highlights its role as a powerful training signal. Building on this insight, we further explore integrating self-verification into generation training by formulating a multi-objective reinforcement framework. Extensive experiments demonstrate that explicit self-verification training consistently improves problem-solving performance, produces more efficient and effective reasoning traces, and enables effective test-time scaling. Looking ahead, we believe that verification has untapped potential to improve generation, through designed verification tasks, more principled integration of verification and generation objectives, and more efficient training strategies. Exploring these directions is beyond the scope of the current work, and we leave them for future research. ## Impact Statement Our findings suggest that strengthening self-verification in large language models can fundamentally change how these systems reason and generate responses. By showing that learning to self-verify not only improves reliability but also enhances generation efficiency, this work contributes to a better understanding of the interaction between reasoning, verification, and generation in modern language models. These insights have broader implications for the development and deployment of AI systems, especially in scenarios where correctness, robustness, and controllability are critical. Improving a model’s ability to assess its own outputs may help reduce spurious reasoning steps, increase transparency, and mitigate certain classes of errors in real-world applications. At the same time, more powerful self-verification capabilities also raise new questions about how such systems should be evaluated, monitored, and governed, particularly when they are used in high-stakes or decision-critical settings. Understanding and carefully managing these dynamics is therefore important for the responsible use of large language models in practice. ## Limitation Although this work provides an encouraging analysis of the asymmetry between generation and self-verification and demonstrates the effectiveness of learning to self-verify, it still has several limitations. First, introducing an additional self-verification objective into generation training inevitably incurs extra computation, including additional inference and optimization costs. Second, although our experiments cover models of different parameter sizes, they are still limited in scale. Due to computational constraints, we do not explore whether the same phenomena and benefits continue to hold for larger models. Also, despite its effectiveness, we only explore a limited set of ways to combine generation and self-verification, as well as a single form of verification task. More diverse verification formulations and tighter coupling paradigms between generation and verification may further push the performance ceiling. Moreover, our study focuses primarily on mathematical reasoning benchmarks. While the proposed framework is conceptually general, it remains an open question whether the same asymmetry and the benefits of learning to self-verify will hold in other domains, such as planning and multimodal reasoning. Finally, the current multi-task training schedule (e.g., stage-wise or alternating) is manually designed and heuristic. A more principled or adaptive strategy for balancing generation and self-verification objectives remains an interesting direction for future work. ## References - Anthropic (2025) Introducing claude opus 4.5. External Links: Link Cited by: §1, §5.1. - A. Bhaskar, X. Ye, and D. Chen (2025) Language models that think, chat better. CoRR abs/2509.20357. Cited by: §1. - T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In NeurIPS, Cited by: §5.1. - C. Chen, Z. Liu, C. Du, T. Pang, Q. Liu, A. Sinha, P. Varakantham, and M. Lin (2025a) Bootstrapping language models with DPO implicit rewards. In ICLR, Cited by: §5.2. - X. Chen, G. Li, Z. Wang, B. Jin, C. Qian, Y. Wang, H. Wang, Y. Zhang, D. Zhang, T. Zhang, H. Tong, and H. Ji (2025b) RM-R1: reward modeling as reasoning. CoRR abs/2505.02387. Cited by: §5.2. - Z. Chen, X. Qin, Y. Wu, Y. Ling, Q. Ye, W. X. Zhao, and G. Shi (2025c) Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. CoRR abs/2508.10751. Cited by: §5.1. - Z. Chen, X. Ma, G. Fang, R. Yu, and X. Wang (2025d) VeriThinker: learning to verify makes reasoning model efficient. CoRR abs/2505.17941. Cited by: §5.3. - DeepSeek-AI, A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, C. Lu, C. Zhao, C. Deng, C. Xu, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, E. Li, F. Zhou, F. Lin, F. Dai, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Li, H. Liang, H. Wei, H. Zhang, H. Luo, H. Ji, H. Ding, H. Tang, H. Cao, H. Gao, H. Qu, H. Zeng, J. Huang, J. Li, J. Xu, J. Hu, J. Chen, J. Xiang, J. Yuan, J. Cheng, J. Zhu, J. Ran, J. Jiang, J. Qiu, J. Li, J. Song, K. Dong, K. Gao, K. Guan, K. Huang, K. Zhou, K. Huang, K. Yu, L. Wang, L. Zhang, L. Wang, L. Zhao, L. Yin, L. Guo, L. Luo, L. Ma, L. Wang, L. Zhang, M. S. Di, M. Y. Xu, M. Zhang, M. Zhang, M. Tang, M. Zhou, P. Huang, P. Cong, P. Wang, Q. Wang, Q. Zhu, Q. Li, Q. Chen, Q. Du, R. Xu, R. Ge, R. Zhang, R. Pan, R. Wang, R. Yin, R. Xu, R. Shen, R. Zhang, S. H. Liu, S. Lu, S. Zhou, S. Chen, S. Cai, S. Chen, S. Hu, S. Liu, S. Hu, S. Ma, S. Wang, S. Yu, S. Zhou, S. Pan, S. Zhou, T. Ni, T. Yun, T. Pei, T. Ye, T. Yue, W. Zeng, W. Liu, W. Liang, W. Pang, W. Luo, W. Gao, W. Zhang, X. Gao, X. Wang, X. Bi, X. Liu, X. Wang, X. Chen, X. Zhang, X. Nie, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yu, X. Li, X. Yang, X. Li, X. Chen, X. Su, X. Pan, X. Lin, X. Fu, Y. Q. Wang, Y. Zhang, Y. Xu, Y. Ma, Y. Li, Y. Li, Y. Zhao, Y. Sun, Y. Wang, Y. Qian, Y. Yu, Y. Zhang, Y. Ding, Y. Shi, Y. Xiong, Y. He, Y. Zhou, Y. Zhong, Y. Piao, Y. Wang, Y. Chen, Y. Tan, Y. Wei, Y. Ma, Y. Liu, Y. Yang, Y. Guo, Y. Wu, Y. Wu, Y. Cheng, Y. Ou, Y. Xu, Y. Wang, Y. Gong, Y. Wu, Y. Zou, Y. Li, Y. Xiong, Y. Luo, Y. You, Y. Liu, Y. Zhou, Z. F. Wu, Z. Z. Ren, Z. Zhao, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Xie, Z. Zhang, Z. Hao, Z. Gou, Z. Ma, Z. Yan, Z. Shao, Z. Huang, Z. Wu, Z. Li, Z. Zhang, Z. Xu, Z. Wang, Z. Gu, Z. Zhu, Z. Li, Z. Zhang, Z. Xie, Z. Gao, Z. Pan, Z. Yao, B. Feng, H. Li, J. L. Cai, J. Ni, L. Xu, M. Li, N. Tian, R. J. Chen, R. L. Jin, S. S. Li, S. Zhou, T. Sun, X. Q. Li, X. Jin, X. Shen, X. Chen, X. Song, X. Zhou, Y. X. Zhu, Y. Huang, Y. Li, Y. Zheng, Y. Zhu, Y. Ma, Z. Huang, Z. Xu, Z. Zhang, D. Ji, J. Liang, J. Guo, J. Chen, L. Xia, M. Wang, M. Li, P. Zhang, R. Chen, S. Sun, S. Wu, S. Ye, T. Wang, W. L. Xiao, W. An, X. Wang, X. Sun, X. Wang, Y. Tang, Y. Zha, Z. Zhang, Z. Ju, Z. Zhang, and Z. Qu (2025) DeepSeek-v3.2: pushing the frontier of open large language models. Vol. abs/2512.02556. Cited by: §1. - DeepSeek-AI (2025) DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. CoRR abs/2501.12948. Cited by: §1, §1, §5.1. - S. Feng, G. Fang, X. Ma, and X. Wang (2025) Efficient reasoning models: A survey. Trans. Mach. Learn. Res. 2025. Cited by: §5.1. - Google (2025) A new era of intelligence with gemini 3. External Links: Link Cited by: §5.1. - A. Gunjal, A. Wang, E. Lau, V. Nath, B. Liu, and S. Hendryx (2025) Rubrics as rewards: reinforcement learning beyond verifiable domains. CoRR abs/2507.17746. Cited by: §5.1. - C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun (2024) OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In ACL (1), pp. 3828–3850. Cited by: §3.2. - R. Hong, H. Zhang, X. Pang, D. Yu, and C. Zhang (2024) A closer look at the self-verification abilities of large language models in logical reasoning. In NAACL-HLT, pp. 900–925. Cited by: §1. - J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H. Shum (2025) Open-reasoner-zero: an open source approach to scaling up reinforcement learning on the base model. CoRR abs/2503.24290. Cited by: §1. - B. Jin, H. Zeng, Z. Yue, D. Wang, H. Zamani, and J. Han (2025) Search-r1: training llms to reason and leverage search engines with reinforcement learning. CoRR abs/2503.09516. Cited by: §5.1. - H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let’s verify step by step. In ICLR, Cited by: §3.2. - C. Y. Liu, L. Zeng, J. Liu, R. Yan, J. He, C. Wang, S. Yan, Y. Liu, and Y. Zhou (2024) Skywork-reward: bag of tricks for reward modeling in llms. CoRR abs/2410.18451. Cited by: §5.2. - C. Y. Liu, L. Zeng, Y. Xiao, J. He, J. Liu, C. Wang, R. Yan, W. Shen, F. Zhang, J. Xu, Y. Liu, and Y. Zhou (2025a) Skywork-reward-v2: scaling preference data curation via human-ai synergy. CoRR abs/2507.01352. Cited by: §5.2. - X. Liu, T. Liang, Z. He, J. Xu, W. Wang, P. He, Z. Tu, H. Mi, and D. Yu (2025b) Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. CoRR abs/2505.13445. Cited by: §1, §5.3. - Z. Liu, P. Wang, R. Xu, S. Ma, C. Ruan, P. Li, Y. Liu, and Y. Wu (2025c) Inference-time scaling for generalist reward modeling. CoRR abs/2504.02495. Cited by: §5.2. - J. Lu, R. Teehan, J. Jin, and M. Ren (2025) When does verification pay off? A closer look at llms as solution verifiers. CoRR abs/2512.02304. Cited by: §1, §5.1. - D. Mahan, D. Phung, R. Rafailov, C. Blagden, N. Lile, L. Castricato, J. Fränken, C. Finn, and A. Albalak (2024) Generative reward models. CoRR abs/2410.12832. Cited by: §5.2. - MiniMax (2025) MiniMax m2 and agent: ingenious in simplicity. External Links: Link Cited by: §1. - OpenAI (2023) GPT-4 technical report. CoRR abs/2303.08774. Cited by: §3.4, §5.1. - OpenAI (2025) Introducing gpt-5.2. External Links: Link Cited by: §1, §5.1. - L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022) Training language models to follow instructions with human feedback. In NeurIPS, Cited by: §5.1. - J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. Cited by: §2.1. - Z. Shao, Y. Luo, C. Lu, Z. Z. Ren, J. Hu, T. Ye, Z. Gou, S. Ma, and X. Zhang (2025) DeepSeekMath-v2: towards self-verifiable mathematical reasoning. CoRR abs/2511.22570. Cited by: §1. - Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. CoRR abs/2402.03300. Cited by: §2.1, §5.1. - G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025) HybridFlow: A flexible and efficient RLHF framework. In EuroSys, pp. 1279–1297. Cited by: §3.2. - C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR abs/2408.03314. Cited by: §5.2. - K. Stechly, K. Valmeekam, and S. Kambhampati (2025) On the self-verification limitations of large language models on reasoning and planning tasks. In ICLR, Cited by: §1, §5.1. - Y. Su, D. Yu, L. Song, J. Li, H. Mi, Z. Tu, M. Zhang, and D. Yu (2025) Crossing the reward bridge: expanding RL with verifiable rewards across diverse domains. CoRR abs/2503.23829. Cited by: §5.1. - Y. Sui, Y. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, H. Chen, and X. Hu (2025) Stop overthinking: A survey on efficient reasoning for large language models. Trans. Mach. Learn. Res. 2025. Cited by: §5.1. - G. Team (2025) Gemma 3 technical report. CoRR abs/2503.19786. Cited by: §1. - H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample (2023) LLaMA: open and efficient foundation language models. CoRR abs/2302.13971. Cited by: §5.1. - R. Wang, H. Wang, B. Xue, J. Pang, S. Liu, Y. Chen, J. Qiu, D. F. Wong, H. Ji, and K. Wong (2025a) Harnessing the reasoning economy: A survey of efficient reasoning for large language models. CoRR abs/2503.24377. Cited by: §5.1. - X. Wang, B. Liu, S. Jiang, J. Liu, J. Qi, X. Chen, and B. He (2025b) From solving to verifying: A unified objective for robust reasoning in llms. CoRR abs/2511.15137. Cited by: §1, §5.3. - Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, X. Jin, K. Yu, M. N. Nguyen, L. Liu, E. Gottlieb, Y. Lu, K. Cho, J. Wu, L. Fei-Fei, L. Wang, Y. Choi, and M. Li (2025c) RAGEN: understanding self-evolution in LLM agents via multi-turn reinforcement learning. CoRR abs/2504.20073. Cited by: §5.1. - F. Wu, W. Xuan, X. Lu, Z. Harchaoui, and Y. Choi (2025) The invisible leash: why RLVR may not escape its origin. CoRR abs/2507.14843. Cited by: §5.1. - A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025a) Qwen3 technical report. CoRR abs/2505.09388. Cited by: §1. - A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2024) Qwen2.5 technical report. CoRR abs/2412.15115. Cited by: §3.2, §5.1. - Z. Yang, Z. Guo, Y. Huang, Y. Wang, D. Xie, Y. Wang, X. Liang, and J. Tang (2025b) Depth-breadth synergy in RLVR: unlocking LLM reasoning gains with adaptive exploration. CoRR abs/2508.13755. Cited by: §5.1. - E. Yee, A. Li, C. Tang, Y. H. Jung, R. Paturi, and L. Bergen (2024) Dissociation of faithful and unfaithful reasoning in llms. CoRR abs/2405.15092. Cited by: §1. - Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, W. Dai, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang (2025a) DAPO: an open-source LLM reinforcement learning system at scale. CoRR abs/2503.14476. Cited by: §2.1, §3.2. - R. Yu, S. Wan, Y. Wang, C. Gao, L. Gan, Z. Zhang, and D. Zhan (2025b) Reward models in deep reinforcement learning: A survey. In IJCAI, pp. 10807–10816. Cited by: §5.2. - T. Yu, B. Ji, S. Wang, S. Yao, Z. Wang, G. Cui, L. Yuan, N. Ding, Y. Yao, Z. Liu, M. Sun, and T. Chua (2025c) RLPR: extrapolating RLVR to general domains without verifiers. CoRR abs/2506.18254. Cited by: §5.1. - Z. Yu, J. Zeng, W. Gu, Y. Wang, J. Wang, F. Meng, J. Zhou, Y. Zhang, S. Zhang, and W. Ye (2025d) RewardAnything: generalizable principle-following reward models. CoRR abs/2506.03637. Cited by: §5.2. - Z.AI (2025) GLM-4.7: advancing the coding capability. External Links: Link Cited by: §1. - W. Zeng, Y. Huang, Q. Liu, W. Liu, K. He, Z. Ma, and J. He (2025a) SimpleRL-zoo: investigating and taming zero reinforcement learning for open base models in the wild. CoRR abs/2503.18892. Cited by: §1. - Y. Zeng, Y. Huang, C. Xu, Q. Sun, J. Yan, G. Xu, T. Yang, and F. Lian (2025b) Zero reinforcement learning towards general domains. CoRR abs/2510.25528. Cited by: §1. - F. Zhang, J. Xu, C. Wang, C. Cui, Y. Liu, and B. An (2025a) Incentivizing llms to self-verify their answers. CoRR abs/2506.01369. Cited by: §1, §4.2, §5.3. - L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal (2025b) Generative verifiers: reward modeling as next-token prediction. In ICLR, Cited by: §5.2. - Y. Zhang and T. Math-AI (2024) American invitational mathematics examination (aime) 2024. Cited by: §3.2. - Y. Zhang and T. Math-AI (2025) American invitational mathematics examination (aime) 2025. Cited by: §3.2. - Y. Zhang, M. Khalifa, L. Logeswaran, J. Kim, M. Lee, H. Lee, and L. Wang (2024) Small language models need strong verifiers to self-correct reasoning. In ACL (Findings), pp. 15637–15653. Cited by: §1. - J. Zhao, Y. Sun, W. Shi, and D. Song (2025) Can aha moments be fake? identifying true and decorative thinking steps in chain-of-thought. CoRR abs/2510.24941. Cited by: §1. - J. Zhong, W. Shen, Y. Li, S. Gao, H. Lu, Y. Chen, Y. Zhang, W. Zhou, J. Gu, and L. Zou (2025) A comprehensive survey of reward models: taxonomy, applications, challenges, and future. CoRR abs/2504.12328. Cited by: §5.2.

Rendering Paper...