## Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Amrith Setlur<sup>1,3,*</sup>, Chirag Nagpal<sup>1,*</sup>, Adam Fisch<sup>2</sup>, Xinyang Geng<sup>2</sup>, Jacob Eisenstein<sup>2</sup>, Rishabh Agarwal<sup>2</sup>, Alekh Agarwal<sup>1</sup>, Jonathan Berant<sup>β,2</sup> and Aviral Kumar<sup>β,2,3</sup>

<sup>1</sup>Google Research, <sup>2</sup>Google DeepMind, <sup>3</sup>Carnegie Mellon University, <sup>*</sup>Equal contribution, <sup>β</sup>Equal advising
A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: 'How should we design process rewards?'. Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers, and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that, compared to ORMs, test-time search against PAVs is >8% more accurate and 1.5-5× more compute-efficient. Online RL with dense rewards from PAVs yields one of the first results showing a 5-6× gain in sample efficiency, and a >6% gain in accuracy, over ORMs.
## 1. Introduction
Trained reward models or verifiers are often used to improve math reasoning in large language models, either by re-ranking solutions at test time (Collins, 2000) or via reinforcement learning (RL) (Uesato et al., 2022). Typically, verifiers are trained to predict the outcome of an entire reasoning trace, and are often referred to as outcome reward models (ORMs) (Cobbe et al., 2021b; Hosseini et al., 2024). However, ORMs only provide a sparse signal of correctness, which can be hard to learn from and inefficient to search against. In theory, this challenge is alleviated by fine-grained supervision. For reasoning, prior works train process reward models (PRMs) that assign intermediate rewards after each step, during search (Snell et al., 2024) or RL. While Lightman et al. (2023) obtain PRM annotations from human raters, this approach is not scalable. More recent works (Luo et al., 2024; Wang et al., 2024) train PRMs to predict automatically-generated annotations that estimate the future success of solving the problem, akin to value functions in RL. So far, automated PRMs, especially as dense rewards in RL, only improve by 1-2% over ORMs (Shao et al., 2024), raising serious doubts over their utility.
To resolve these uncertainties, in this paper we train PRMs with automated annotations, such that optimizing the dense rewards from trained PRMs can improve a base policy compute- and sample-efficiently, during test-time search and online RL. For this, we first ask: (i) what should the per-step process rewards measure, and (ii) what kind of automated data-collection strategy should we use to train PRMs that predict this measure? For (i), conventional belief (Lightman et al., 2023; Uesato et al., 2022)
Figure 1 | Process advantage verifiers (PAVs): The process reward for a step is defined as progress (advantage) under the prover policy, i.e., the change in the prover policy's success rate before and after the step. (a): The base policy samples both a correct step (1) and an incorrect step (2) but struggles to succeed from either. A strong prover policy completes the solution from both steps, and is thus unable to adequately reflect the progress made by (1) and (2) (both scored 0.0). Conversely, a complementary prover policy distinguishes (1) from (2) more prominently (it only succeeds from (1)). (b, c): Compared to ORMs, PAVs are 5× more compute-efficient and 10% more accurate in test-time search, and 6× more sample-efficient and 7% more accurate for online reinforcement learning (RL).
<details>
<summary>Image 2 Details</summary>

### Visual Description
Figure 1 diagram. (a) Flow diagram for the question "Let 4x+3y=25, 7x+6y=49. Solve for x, y," contrasting rollouts from the base policy, a very capable prover policy, and a good prover policy complementary to the base; green checkmarks and red crosses mark correct and incorrect steps. (b) "Search with PAVs": accuracy (0.10-0.25) vs. # samples from the base policy (2<sup>1</sup> to 2<sup>7</sup>) for ORM, PRM Q-value, and PAV, annotated "5x Compute Efficient" and "10% Accuracy". (c) "RL with PAVs": accuracy vs. training iterations (x10<sup>3</sup>) for ORM-RL and PAV-RL, annotated "6x Sample Efficient" and "7% Accuracy".
</details>
has been to measure the mathematical correctness or relevance of steps. But it is unclear if this supervision yields the most improvement in the base policy (e.g., a policy may need to generate simpler, repetitive, and even incorrect steps to explore and discover the final answer during test-time search and RL). Our key insight is that per-step process rewards that measure a notion of progress: the change in the likelihood of arriving at a correct final answer before and after taking the step, are effective for both test-time beam search and online RL. Reinforcing steps that make progress, regardless of whether they appear in a correct or incorrect trace, diversifies the exploration of possible answers at initial steps, which is crucial when the approach to solve a problem is not clear. Formally, such rewards correspond to the per-step advantages of steps from the RL literature (Sutton and Barto, 2018). We empirically show that using advantages in addition to ORM rewards outperforms the typical use of future probabilities of success, or Q-values (Wang et al., 2024), for both search and RL. This is because, given a combinatorial space of responses under bounded computational and sampling constraints, Q-values mainly 'exploit' states, whereas advantages also 'explore' steps that make the most progress towards the final answer (Fig. 2).
To answer (ii), we first note that advantages under a poor base policy are ≈ 0 on most steps, and thus will not be informative for search or RL. In addition, regardless of the strength of the base policy, using its own per-step advantages as process rewards in RL results in base policy updates equivalent to only using outcome rewards for RL (since a standard policy gradient algorithm already computes advantages). Hence, we propose to use advantages estimated via rollouts under a different prover policy as process rewards (Fig. 1(a)). How should we choose this prover policy? A natural guess would be to use a very capable prover. However, we show that advantages under an overly capable prover policy, which can succeed from any step, fail to distinguish good and bad steps. A similar argument holds for very weak provers.
In theory, we formalize this intuition to define good provers as policies that are complementary to the base policy (i.e., policies whose advantages sufficiently contrast the steps produced by the base policy), while still producing step-level advantages correlated with those of the base policy. For example, for Best-of-K policies (Nakano et al., 2021) corresponding to a base policy, we empirically find that provers corresponding to K > 1 (but not too large) are more capable of improving the base policy. Contrary to intuition, the set of complementary provers also contains policies that are worse than the base policy. To predict the advantages of such provers, we train dense verifiers, called process advantage verifiers (PAVs), that accelerate the sample and compute efficiency of RL and search.
With the conceptual design of PAVs in place, we prescribe practical workflows for training PAVs and demonstrate their efficacy on a series of 2B, 9B, and 27B Gemma2 models (Gemma Team et al., 2024). PAV training data is gathered by sampling 'seed' solution traces from the prover, and partial rollouts from the same prover to estimate the Q-value at each prefix of the seed trace. Our workflow prescribes favorable ratios of seed traces to partial rollouts. Our first set of empirical results shows that for an equal budget on test-time compute, beam search against trained PAVs is >8% better in accuracy and 1.5-5× more compute-efficient compared to re-ranking complete traces against an ORM (Fig. 1(b)). Dense rewards from PAVs improve the efficiency of step-level exploration during search by aggressively pruning the combinatorial space of solutions and homing in on a diverse set of possible sequences. We also demonstrate, for the first time, that using PAVs as dense rewards in RL scales up data efficiency by 6× compared to only using outcome rewards (Fig. 1(c)). Moreover, base policies trained with PAVs achieve 8× better Pass@k performance (the probability of sampling a correct solution in k attempts), and consequently afford a higher ceiling on the performance of any test-time re-ranker. Finally, running RL with PAVs discovers solutions to hard problems that sampling from the SFT policy, even with a very large budget, cannot solve.
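The data-collection recipe above (seed traces from the prover, plus partial rollouts to estimate the prover's success rate at each prefix) can be sketched in a few lines. The names `estimate_q`, `pav_step_labels`, and `toy_prover` below are illustrative stand-ins under simplified assumptions, not the paper's actual pipeline:

```python
import random

def estimate_q(prover_success, prefix, n_rollouts, rng):
    """Monte-Carlo estimate of the prover's success rate from a prefix:
    average the 0/1 outcomes of n_rollouts completions. `prover_success`
    is a hypothetical stand-in for sampling a completion from the prover
    and checking its final answer with Rex."""
    return sum(prover_success(prefix, rng) for _ in range(n_rollouts)) / n_rollouts

def pav_step_labels(prover_success, seed_trace, n_rollouts=8, seed=0):
    """Advantage labels A(s_h, a_h) = Q(s_h, a_h) - Q(s_{h-1}, a_{h-1})
    for each step of one seed trace; the empty prefix is the bare question."""
    rng = random.Random(seed)
    q = [estimate_q(prover_success, seed_trace[:h], n_rollouts, rng)
         for h in range(len(seed_trace) + 1)]  # q[0]: success rate from the question alone
    return [q[h + 1] - q[h] for h in range(len(seed_trace))]

# Toy prover: succeeds w.p. 0.9 once the trace contains the key step, else w.p. 0.1.
def toy_prover(prefix, rng):
    p = 0.9 if "eliminate y" in prefix else 0.1
    return 1 if rng.random() < p else 0

labels = pav_step_labels(toy_prover,
                         ["restate question", "eliminate y", "solve for x"],
                         n_rollouts=200)
# The advantage mass concentrates on the step that actually makes progress
# ("eliminate y"); restating the question earns roughly zero.
```

The seed trace supplies the states; the rollouts supply the Q-value estimates whose differences become the per-step advantage labels.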
## 2. Preliminaries, Definitions, and Notation
Following protocols from Lightman et al. (2023); Uesato et al. (2022), a reasoning trace from an LLM consists of multiple logical steps separated by a demarcation token. An outcome reward model (ORM) is a trained verifier that assigns a numerical score after the last step of the trace, and a process reward model (PRM) is a trained verifier that scores each step of the trace individually.
Problem setup and notation. Given a math problem x ∈ X, our goal is to improve a base policy π that samples a response y ∼ π(· | x) in the set Y. A response y consists of multiple reasoning steps (at most H), separated by a delimiter ('next line' in our case), i.e., y = (a_1, a_2, ..., a_H). Since sampling is auto-regressive, we can view each step as an action taken by the agent π in a Markov decision process (MDP) with deterministic dynamics. Specifically, we treat the prefix (x, a_1, ..., a_{h−1}) as the current state s_h, and the next step a_h ∼ π(· | s_h) as the action taken by π at s_h, resulting in the next state s_{h+1}. For problem x with ground-truth response y*_x, we can evaluate the accuracy of y by running a regular-expression match on the final answer (Hendrycks et al., 2021): Rex(y, y*_x) ↦ {0, 1}, i.e., accuracy is given by E_{y∼π(·|x)}[Rex(y, y*_x)]. Now, given a dataset D = {(x_i, y*_{x_i})}_i of problem-solution pairs, the main goal is to learn a good base policy by optimizing this outcome reward on D. Next, we see how we can leverage the final-answer verifier Rex available on D to train ORMs and PRMs.
Outcome reward model (ORM). Given a response y, an ORM estimates the ground-truth correctness Rex(y, y*_x). To train such a model, we first take problems in D and collect training data of the form {(x, y ∼ π(· | x), Rex(y, y*_x))}. Then we train an ORM that takes as input a problem-response pair (x, y) and predicts Rex(y, y*_x). At test time, when y*_x is unknown, the ORM is used to score candidate solutions revealed by test-time search. Given a base policy π, a Best-of-K policy, BoK(π), is a policy that samples K responses from π, scores them against an ORM, and returns the one with the highest score. Whenever the ORM matches Rex, the performance of BoK(π) is referred to as Pass@K. Furthermore, when the likelihood of π solving problem x is p_x, the likelihood for BoK(π) is given by the expression 1 − (1 − p_x)^K. In general, this is larger than p_x, making BoK(π) stronger than π for K > 1.
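The Best-of-K likelihood above is easy to sanity-check numerically. A minimal sketch (the function name is ours, not from the paper):

```python
def best_of_k_success(p: float, k: int) -> float:
    """Success likelihood of BoK(pi) when the ORM perfectly matches Rex:
    the problem is solved unless all k independent samples fail."""
    return 1.0 - (1.0 - p) ** k

# For a base policy that solves a problem 20% of the time:
best_of_k_success(0.2, 1)   # = 0.2: BoK with K = 1 is just pi
best_of_k_success(0.2, 4)   # ~ 0.59
best_of_k_success(0.2, 16)  # ~ 0.97
```

The gap over p_x grows quickly with K, which is why Best-of-K re-ranking is a strong test-time baseline whenever the verifier is accurate.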
Standard process reward models (PRMs). A PRM scores every step a_h in a multi-step response y ∼ π (e.g., in Lightman et al. (2023) PRMs are trained to score correct steps over incorrect and irrelevant ones). But unlike ORMs, which only require Rex for data collection, PRM training data requires expensive step-level human annotations. Prior works (Luo et al., 2024; Wang et al., 2024) attempted to scale process rewards automatically by sampling from the model to provide a heuristic understanding of when a step is actually correct. In particular, they evaluate a prefix by computing the expected future accuracy of multiple completions sampled from π after conditioning on the prefix, i.e., the value function Q^π (Eq. 1) from RL. Similarly, we define V^π(s_h) ≔ E_{a_h ∼ π(·|s_h)}[Q^π(s_h, a_h)] as the value of state s_h. These works use Q^π as the PRM, assigning a score of Q^π(s_h, a_h) to the action a_h at state s_h.
$$Q^{\pi}(\underbrace{(x, a_1, \dots, a_{h-1})}_{\text{state } s_h}, \underbrace{a_h}_{\text{action } a_h}) = \underbrace{\mathbb{E}_{a_{h+1}, \dots, a_H \sim \pi(\cdot \mid s_h, a_h)}\left[\mathrm{Rex}\left((a_1, \dots, a_H), y_x^{*}\right)\right]}_{\text{likelihood of future success}} \quad (1)$$
Using PRMs for beam search at test time. Given a PRM, a natural way to spend test-time compute is to use it as a step-level re-ranker within a beam search procedure (Snell et al., 2024). For each problem, at step 0, a beam of maximum width B is initialized with a single state consisting of just the problem. At step h, the beam contains partial responses unrolled to a set of states or prefixes {s_i}, i = 1, ..., B. From each state s_i in this set, C independent actions or steps {a_{i,j}}, j = 1, ..., C, are sampled from π(· | s_i), each of which leads to a new state. Process rewards from the PRM assign a score to every new state (s_i, a_{i,j}), and only the states corresponding to the top B scores are retained in the beam for the next step.
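The loop above can be sketched as follows. The step sampler and PRM scorer are left as caller-supplied callables, both hypothetical stand-ins for the policy and the trained PRM:

```python
import heapq

def beam_search(question, sample_steps, prm_score, width_b=2, samples_c=3, max_h=4):
    """Step-level beam search against a PRM (sketch). `sample_steps(state, c)`
    returns c candidate next steps from pi(.|state); `prm_score(state)`
    scores a prefix state."""
    beam = [(question,)]                      # step 0: beam holds just the problem
    for _ in range(max_h):
        # expand every state in the beam with C sampled steps
        candidates = [state + (step,)
                      for state in beam
                      for step in sample_steps(state, samples_c)]
        if not candidates:
            break
        # retain only the top-B new states under the PRM's scores
        beam = heapq.nlargest(width_b, candidates, key=prm_score)
    return max(beam, key=prm_score)

# Toy instance: steps are digits and the "PRM" just sums them, so the best
# trace greedily accumulates the largest sampled steps.
best = beam_search("q",
                   sample_steps=lambda state, c: list(range(c)),
                   prm_score=lambda s: sum(x for x in s if isinstance(x, int)),
                   width_b=2, samples_c=3, max_h=3)
# best == ("q", 2, 2, 2)
```

Per problem, the procedure costs roughly B·C policy samples per step instead of scoring full traces, which is where the compute savings over Best-of-K re-ranking come from.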
## 3. How Should we Define Process Rewards and Why?
Ultimately, we are interested in test-time search and RL methods that can most efficiently and reliably discover solution traces with the correct final answer, thus maximizing Rex. To this end, process rewards should serve as step-level supervision to indirectly maximize outcome-level Rex. Our position contrasts with conventional belief that process rewards should mainly evaluate mathematical correctness or relevance of individual steps (Lightman et al., 2023; Uesato et al., 2022), since LLMs might need to generate trivial or repetitive intermediate steps in order to discover a trace with the correct final answer. With this insight, in this section we approach the design of dense automated step rewards as a form of supervision to be used in conjunction with sparse outcome rewards to improve the base policy.
In an MDP, a starting point to design step-level dense feedback that is eventually meant to optimize a sparse outcome reward Rex is to consider the notion of a potential function (Ng et al., 1999): in our case, this is a function that summarizes the difference between some statistic of the policy at the future state and the same statistic computed at the current state. By appealing to this framework, in Sec. 3.1, we
Figure 2 | Issues with using Q-values as process rewards: (a): Unlike A^π, Q^π mixes the evaluation of an action with the Q-value of the previous state. Beam search with Q^π exploits high-likelihood states, while adding A^μ (e.g., Q^π + α·A^μ in Eq. 5) aids in exploring states reached by taking actions that induce progress, i.e., increase the likelihood of success. (b): Q^μ from a strong prover μ can assign unmerited bonuses to trivial actions.
<details>
<summary>Image 3 Details</summary>

### Visual Description
Figure 2 diagram. (a) A two-state beam example with three sampled actions per state, annotated with Q^π(s, a) values (0.25-0.70), V^π(s1) = 0.60, V^π(s2) = 0.30, and the resulting advantages A^π(s, a); a "Beam" label marks states s1 and s2. (b) A flow chart in which a question is rephrased or repeated (sampled from π), alongside a correct solution from the prover μ; Q^μ = 1 on each state, so trivial actions are rewarded.
</details>
show that advantages - not value functions (Luo et al., 2024; Wang et al., 2024) - that measure a notion of 'progress' at each new step are more appropriate for use as dense rewards in search and RL (primarily for exploration). Then, in Secs. 3.3 and 3.4, we show that this progress, or advantage, is best measured under a policy μ different from the base policy π. We call this policy μ the prover policy.
## 3.1. Process Rewards Should be Advantages, Not Value Functions
To understand the relationship to potential functions, we first study test-time beam search and present some challenges with the reward design of Snell et al. (2024), which uses the value function Q^π(s, a) of the base policy π to reward action a at state s. Consider the example in Fig. 2(a), where from the 2 states in the beam, we sample 3 actions each. If we pick next states purely based on the highest values of Q^π, we would be comparing steps sampled from different states (e.g., a_{1,1} vs. a_{2,1}) against each other. Clearly, a reduction in expected final outcome, i.e., Q^π(s_1, a_{1,1}) − V^π(s_1), means that a_{1,1} by itself has a negative effect of −0.05 on the probability of success from s_1, whereas a_{2,1} has a positive effect of +0.20 from s_2. However, expanding the beam based on absolute values of Q^π retains the action that makes negative progress, and removes state s_2 from the beam (as the beam size is 2). In other words, Q^π fails to decouple the 'evaluation' of an action (step) from the 'promise' shown by the previous state. This will not be an issue for every problem, and particularly not when the beam capacity is unbounded, but under finite computational and sampling constraints, using Q^π might retain states with potentially unfavorable steps that hurt the overall likelihood of success. If we could also utilize the progress made by the previous step, along with the likelihood of success Q^π, when deciding what to retain in the beam, then we could address this tradeoff.
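Plugging the Fig. 2(a) numbers into this selection rule makes the failure concrete. The sketch below ranks the six candidate next states by Q^π alone versus by Q^π plus the advantage (the weight of 1 on the advantage term is an illustrative choice, not from the paper):

```python
# (state, action, Q^pi(s, a), V^pi(s)) for the six candidates in Fig. 2(a)
candidates = [
    ("s1", "a1,1", 0.55, 0.60),
    ("s1", "a1,2", 0.50, 0.60),
    ("s1", "a1,3", 0.70, 0.60),
    ("s2", "a2,1", 0.50, 0.30),
    ("s2", "a2,2", 0.35, 0.30),
    ("s2", "a2,3", 0.25, 0.30),
]
B = 2  # beam width

def top_b(score):
    return sorted(candidates, key=score, reverse=True)[:B]

# Ranking by Q^pi alone: both retained states descend from s1, so s2 is
# dropped even though a2,1 makes the most progress (+0.20).
by_q = top_b(lambda c: c[2])

# Adding the advantage A^pi(s, a) = Q^pi(s, a) - V^pi(s) (weight 1,
# illustrative) keeps a2,1 in the beam alongside a1,3.
by_eff = top_b(lambda c: c[2] + (c[2] - c[3]))
```

With the advantage term, the beam retains one child of each state (a1,3 and a2,1), preserving the action that actually improves the likelihood of success from s_2.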
How can we measure the 'progress' made by a step? One approach is to consider the relative increase or decrease in the likelihood of success, before and after the step. This notion is formalized by the advantage (Eq. 2) of a step under a policy π. Furthermore, since advantages can attach either positive or negative values to a step, training the base policy against advantages supervises the base policy both when it generates a step that makes progress (where A^π > 0) and when it fails to produce one, employing a 'negative gradient' that speeds up RL training (Tajwar et al., 2024).
$$A^{\pi}(s_h, a_h) \coloneqq Q^{\pi}(s_h, a_h) - V^{\pi}(s_h) = Q^{\pi}(s_h, a_h) - Q^{\pi}(s_{h-1}, a_{h-1}) \quad (2)$$
Recall that since we view process rewards as potential functions in the MDP, they can be computed under any policy, including the base policy itself. However, in the above example, the reasons for which Q^π is a seemingly unfit choice for process rewards also apply to V^π. Nevertheless, we can instead use the advantage under a prover policy μ, A^μ, which measures the progress a step makes towards improving the likelihood of success under μ. In that case, how should we choose this policy μ, which we call the prover policy, and should it necessarily be different from the base policy π? Before diving into the choice of μ, we discuss a more pertinent question: how should we use A^μ in conjunction with outcome rewards to improve the base policy π? We will then formally reason about the choice of μ in Secs. 3.3 and 3.4.
## 3.2. Our Approach: Process Advantage Verifiers (PAV)
To build an approach that uses process rewards A^μ together with the outcome reward Rex to improve the base policy π, we situate ourselves in the context of improving π with online RL. If all we had was access to Rex on D, the standard RL objective would be:
$$\ell_{\text{ORM-RL}}(\pi) \coloneqq \mathbb{E}_{x \sim \mathcal{D},\, (a_1, \dots, a_H) \sim \pi(\cdot \mid x)} \left[\mathrm{Rex}\left((x, a_1, \dots, a_H), y_x^{*}\right)\right] \quad (3)$$
Inspired by how reward bonuses (and potential functions) are additive (Bellemare et al., 2016; Ng et al., 1999), one way to use the process rewards A^μ is to combine them with the standard RL objective as:
$$\ell_{\text{PAV-RL}}^{\pi'}(\pi) \coloneqq \ell_{\text{ORM-RL}}(\pi) + \alpha \cdot \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d_h^{\pi'}}\, \mathbb{E}_{a_h \sim \pi(\cdot \mid s_h)} \left[A^{\mu}(s_h, a_h)\right] \quad (4)$$
The second term is the difference in the likelihoods of success of the prover μ, summed over consecutive steps (a notion of progress). Here, d_h^{π'} denotes the distribution over states at step h visited by the old policy π' (the policy at the previous iterate). Following policy gradient derivations (Williams, 1992):
$$\boxed{\nabla_{\pi}\, \ell_{\text{PAV-RL}}^{\pi'}(\pi)\Big|_{\pi'=\pi} = \sum_{h=1}^{H} \nabla_{\pi} \log \pi(a_h \mid s_h) \cdot \underbrace{\left(Q^{\pi}(s_h, a_h) + \alpha \cdot A^{\mu}(s_h, a_h)\right)}_{\text{effective reward}}} \quad (5)$$
At a glance, we can view Q^π(s_h, a_h) + α·A^μ(s_h, a_h) as the effective reward for step a_h when scored against a combination of the outcome evaluation Rex, i.e., Q^π, and the process reward A^μ. Thus, we can optimize Eq. 4 indirectly via (a) running beam search against the effective reward; or (b) online RL, where the policy gradients are given by Eq. 5. For either of these, we need access to verifiers that are trained to predict the advantage A^μ(s_h, a_h) under the prover. We refer to these verifiers as process advantage verifiers (PAVs). In Sec. 4.2 we describe how to train PAVs, but for now we use the above formulation to reason about how to choose the prover μ that is most effective at improving the base policy π.
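A small sketch of this effective reward, computed per step (α = 0.5 is an arbitrary illustrative weight; the paper treats it as a hyperparameter):

```python
def effective_rewards(q_pi, a_mu, alpha=0.5):
    """Per-step scores multiplying grad log pi(a_h | s_h) in Eq. 5:
    Q^pi(s_h, a_h) + alpha * A^mu(s_h, a_h). alpha = 0.5 is an
    illustrative choice."""
    return [q + alpha * a for q, a in zip(q_pi, a_mu)]

# Even on a failed trace (all Monte-Carlo Q^pi estimates are 0), steps with
# positive prover advantage still receive positive learning signal:
q_pi = [0.0, 0.0, 0.0]
a_mu = [0.05, 0.40, -0.10]
rewards = effective_rewards(q_pi, a_mu)  # -> [0.025, 0.2, -0.05]
```

This illustrates the argument above: the prover's advantage can reward a step that made progress even when the outcome reward alone would supply no signal for the whole trace.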
We also remark that the A^μ term in Eq. 4 resembles prior work on imitation learning via policy optimization (Ross and Bagnell, 2014; Sun et al., 2017), where the main aim is to learn a policy π that imitates the prover μ, or improves upon it to some extent. Of course, this alone is limiting, since our goal is not just to take actions that perform at a similar level as μ, but to improve the base policy even further; using a combination of Q^π and A^μ is critical towards this goal.
How should we choose the prover μ? Perhaps a natural starting point is to set the prover to be identical to the base policy, i.e., μ = π, which produces the process rewards that prior works have considered (Shao et al., 2024). However, setting A^μ = A^π in Eq. 5 results in exactly the same policy gradient update as only optimizing the outcome evaluation Rex. Moreover, for a poor base policy π, where Q^π ≈ 0 on most states, the term A^π would also be ≈ 0, and hence running beam search with the effective rewards would not be informative at all. Hence, a better approach is to use a different prover policy, but a very weak prover μ will likely run into similar issues as a poor base policy. We could instead use a very capable prover μ, but unfortunately even this may not be any better than optimizing only the outcome reward. To see why, consider a scenario where π's response contains an intermediate step that does not help make progress towards the solution (e.g., π simply restates the question; see Fig. 2(b)). Here, Q^μ for a capable prover before and after this irrelevant step will be identical, since μ can succeed from either state. This means that μ fails to distinguish these steps, resulting in A^μ ≈ 0 in most cases. Training with this process reward during RL will then lead to gradients equivalent to those observed when purely optimizing ℓ_ORM-RL. In fact, empirically, we observe that online RL with A^μ from strong provers leads to policies that only produce re-phrasings of the question (App. G) and do not succeed at solving it. Clearly, not just any policy different from the base policy can serve as a prover. So, how do we identify a set of good provers? Can they indeed be weaker than the base policy? We answer these questions next.
## Takeaway: What should process rewards measure during test-time search and online RL?
- Process rewards should correspond to progress, or advantage , as opposed to absolute π -values, for a better explore-exploit tradeoff during beam search and online RL.
- Advantages should be computed using a prover policy, different from the base policy.
## 3.3. Analysis in a Didactic Setting: Learning a Planted Sub-sequence
In this section, we aim to characterize prover policies that are effective in improving the base policy. To do so, we first introduce a didactic example, representative of real reasoning scenarios to illustrate the main intuition. Then, we will formalize these intuitions in the form of theoretical results.
Didactic example setup. Given an unknown sub-sequence π β consisting of tokens from the vocabulary V = { 1 , 2 , . . . , 15 } , we train a policy π to produce a response which contains this sub-sequence. The task-completion reward is terminal and sparse, i.e. , π ( π , π β ) = 1 for a response π if and only if π β appears in π . By design, the reward π ( π , π β ) resembles the outcome reward Rex in Sec. 2. The prover policy π is a procedural policy, parameterized by a scalar πΎ > 0 (details in App. B). As πΎ increases, the performance of π improves and β 1 as πΎ ββ . For simplicity, we assume oracle access to the ground-truth π΄ π and π π , alleviating errors from learned verifiers approximating these values.
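The sparse task-completion reward of this didactic setup can be written directly; interpreting "appears in" as a contiguous match is our assumption here (the exact construction is in App. B):

```python
def planted_subseq_reward(response, g_star):
    """Sparse terminal reward r(x, g*): 1 iff the planted sub-sequence
    g* appears (here: contiguously -- our assumption) in the response x,
    a list of tokens from the vocabulary {1, ..., 15}."""
    n, m = len(response), len(g_star)
    return int(any(response[i:i + m] == g_star for i in range(n - m + 1)))
```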
(1) RL with effective reward π π + πΌπ΄ π is 10 Γ more sample-efficient than only outcome reward. In Fig. 3(a), we first note that training π with this effective reward under a prover π of strength πΎ = 10 reaches optimal performance (100% accuracy) in 350 iterations, despite starting from a mediocre initialization for π ( πΎ = 5 . 0 ) . Training with only the outcome reward is ineffective. More importantly, in Fig. 3(b), we note that effective rewards only help for a band of provers, πΎ β [ 8 . 0 , 15 . 0 ] . Outside this range, we observed that advantages π΄ π were close to 0 on most states, either because π was too weak (small πΎ ) and was unable to generate π β even when π got the sequence partially correct, or because π was so strong (large πΎ ) that it generated π β with almost equal likelihood from all prefixes.
(2) Effective reward improves Pass @N by 5 Γ over only outcome reward. We report the 'Pass @N' performance in Fig. 3(c), which measures the maximum reward π across π traces sampled i.i.d. from π and hence, represents the ceiling on the performance of any test-time search method that picks a single response from multiple draws ( e.g. , as in Best-of-N). For a policy trained with the effective reward for 100 iterations, the Pass @N performance grows 5 Γ faster with π , compared to the policy trained with only
Figure 3 | Results for our didactic analysis: (a): We train base policy via RL with either effective reward π π + πΌπ΄ π , or the typical π π (computed via Monte-Carlo sampling). (b): We vary the strength πΎ of the prover π used to compute advantages π΄ π in the effective reward, and plot the base policy accuracy averaged over the RL run. (c): We plot the max score out of π responses (Pass @N) sampled i.i.d. from an undertrained base policy (iter 100) .
the outcome reward. Owing to the sparse feedback, the latter policy does not learn to sample partially correct π β , whereas a policy trained with the effective reward produces partially correct π β , and is able to sample the complete π β with higher likelihood during Pass @N.
## Takeaway: Online RL with process rewards from different prover policies.
Effective rewards π π + πΌπ΄ π from a prover π : (i) improve the sample efficiency of online RL, and (ii) yield policies with better Pass @N performance, compared to using only outcome rewards. But advantages from very capable or very poor π do not improve the base policy beyond outcome rewards.
## 3.4. Theory: Provers Complementary to the Base Policy Boost Improvement
From our didactic analysis, it is clear that process rewards π΄ π under different provers π disparately affect the base policy that optimizes π π + πΌπ΄ π via online RL. We now present a formal analysis of why this happens and characterize a class of provers that can guarantee non-trivial improvements to the base policy. For simplicity, we assume oracle access to π π , π΄ π at every state-action pair ( π β , π β ) and prove our result in the tabular RL setting, where the policy class is parameterized using the softmax parameterization in Agarwal et al. (2021). Proofs for this section are in App. F.
Main intuitions. We expect a prover π to improve a base policy π only when π is able to distinguish between the different actions taken by π , i.e., it attains sufficiently varying advantage values π΄ π ( π β , π ) across actions π at a state π β . This can be formalized as a sufficiently large variance across actions, π π βΌ π [ π΄ π ( π β , π )] . If so, can we simply use any policy with large advantage variance? No: when the prover π ranks actions at a given state very differently from the base policy π (e.g., if π΄ π and π΄ π have opposite signs), then the effective reward π π + πΌπ΄ π becomes less reliable due to conflicting learning signals. Thus, we want πΌ π [β¨ π΄ π , π΄ π β©] to not be too negative, so that π and π are reasonably aligned in their assessment of steps from π .
In Theorem 3.1, we present our result on policy improvement, where the base policy is updated with natural policy gradient (Kakade, 2001a): ππ‘ + 1 ( π | π β ) β exp ( πΎ Β· ( π π ( π β , π ) + π΄ π ( π β , π ))) . We note that in this idealized update rule, swapping π -values (of π or π ) with advantages does not affect the update, since we assume access to all possible actions when running the update. Nonetheless, despite this simplifying assumption, the analysis uncovers good choices for the prover policy π used to compute the process reward π΄ π , and is orthogonal to the design choice between advantages and π -values as process rewards that we have discussed so far. Theorem 3.1 formalizes our intuition by showing that policy improvement at iteration π‘ grows as the variance in π΄ π values increases (higher distinguishability), and shrinks when π΄ π and π΄ π become extremely misaligned. This will then allow us to discuss the special case of Best-of-K policies as provers as an immediate corollary.
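As a toy illustration (not the paper's implementation), the idealized update rule above can be sketched for the actions at a single state:

```python
import numpy as np

def npg_update(q_pi, a_mu, gamma=0.1):
    """One idealized natural-policy-gradient step over the actions at a
    single state: pi_{t+1}(a|s) proportional to exp(gamma*(Q^pi + A^mu)).
    Returns the updated action distribution."""
    logits = gamma * (np.asarray(q_pi, dtype=float) +
                      np.asarray(a_mu, dtype=float))
    w = np.exp(logits - logits.max())  # numerically stable softmax
    return w / w.sum()
```

Actions whose effective reward π π + π΄ π is higher receive exponentially more probability mass after the update.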
Theorem 3.1 (Lower bound on policy improvement; informal) . For base policy iterate ππ‘ , after one step of policy update, with learning rate πΎ βͺ 1 , the improvement over a distribution of states π :
$$\mathbb { E } _ { s \sim \rho } \left [ V ^ { \pi _ { t + 1 } } ( s ) - V ^ { \pi _ { t } } ( s ) \right ] \gtrsim \gamma \cdot \underbrace { \mathbb { E } _ { s \sim \rho } \, \mathbb { V } _ { a \sim \pi _ { t } } [ A ^ { \mu } ( s , a ) ] } _ { \text { distinguishability from } \mu } + \gamma \cdot \underbrace { \mathbb { E } _ { s \sim \rho } \, \mathbb { E } _ { a \sim \pi _ { t } } \left [ A ^ { \mu } ( s , a ) \, A ^ { \pi _ { t } } ( s , a ) \right ] } _ { \text { alignment between } \mu \text { and } \pi _ { t } }$$
It may seem that the base policy π can only learn from an improved prover π , but our result shows that a weak prover can also amplify a stronger base policy , since a weak prover π may have a lower average π π under its own measure, but still have higher variance across π π (compared to π π ) when evaluated under π (see Proposition F.1 in App. F.5 for a formal discussion). This tells us that rewarding progress under a prover is different from typical knowledge distillation or imitation learning algorithms (Hinton, 2015; Rusu et al., 2015), which in most cases remain upper bounded by the performance of the stronger teacher. Since provers cannot be characterized purely by their strength, what is a class of provers that serves as a reasonable starting point for improving any base policy π ?
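The two quantities driving the bound, distinguishability and alignment, can be computed numerically at a single state for hypothetical advantage values:

```python
import numpy as np

def improvement_terms(pi, a_mu, a_pi):
    """Distinguishability V_{a~pi}[A^mu] and alignment E_{a~pi}[A^mu * A^pi]
    at a single state, for a base action distribution pi and hypothetical
    advantage values a_mu (prover) and a_pi (base)."""
    pi, a_mu, a_pi = (np.asarray(v, dtype=float) for v in (pi, a_mu, a_pi))
    mean_mu = (pi * a_mu).sum()
    distinguishability = (pi * (a_mu - mean_mu) ** 2).sum()
    alignment = (pi * a_mu * a_pi).sum()
    return distinguishability, alignment
```

When the prover's advantages agree with the base policy's, both terms are positive and the bound predicts improvement; flipping the sign of the base advantages makes the alignment term negative.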
The class of 'Best-of-K' policies (computed over the base policy) contains complementary provers. A good starting point for identifying good provers for a base policy π is the class of Best-of-K policies, BoK ( π ) . Recall from Sec. 2 that the performance of BoK ( π ) increases monotonically with πΎ . Applying Theorem 3.1 to this class, we arrive at Remark 3.1, which recommends using BoK ( π ) with πΎ > 1 as a prover policy for a poor base policy π . However, πΎ cannot be too large: when π π ( π , π ) β 1 , increasing πΎ further hurts the distinguishability of different steps at that state. In the next section, we empirically note that policies in the class BoK ( π ) indeed induce different performance gains when used as prover policies, and we find Bo4 to be a good choice for test-time search over most base policies.
Remark 3.1. When π π ( π , π ) = π ( 1 / πΎ ) , β π , π , using BoK ( π ) as a prover for base π improves distinguishability (and hence improvement) by Ξ© ( πΎ 2 ) , and worsens alignment by at most π ( πΎ ) .
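The intuition behind Remark 3.1 can be checked numerically: Best-of-K succeeds from a state with probability 1 - (1 - q)^K, where q is the base policy's success probability, which amplifies small gaps between actions when q = O(1/K) but saturates toward 1 when q is already high:

```python
def bok_q(q_pi, k):
    """Success probability of Best-of-K over a base policy whose rollout
    from the current state succeeds with probability q_pi: at least one
    of the K independent draws succeeds, i.e. 1 - (1 - q_pi)^K."""
    return 1.0 - (1.0 - q_pi) ** k
```

For small `q_pi`, the gap between two actions grows roughly K-fold (so the variance of the prover's advantages grows on the order of K squared), while for states where `q_pi` is already large, `bok_q` saturates near 1 and the prover's advantages vanish.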
## Takeaway: Formal characterization of good prover policies that improve the base policy.
Provers whose advantages can distinguish actions taken by the base policy (more strongly than the base policy itself), while not being too misaligned with the base, boost improvements on each update of the base policy. We call such policies complementary provers . For any base policy π , BoK ( π ) with πΎ > 1 provides a good starting choice of prover policy.
## 4. Results: Scaling Test-Time Compute with PAVs
Now, we study how process verifiers can scale up test-time compute. While our derivations in Sec. 3.2 were for RL, we can also use the effective reward π π ( π β , π β ) + πΌ Β· π΄ π ( π β , π β ) for running beam search over intermediate steps sampled from the base policy π . To do so, we train a process advantage verifier to predict π΄ π , along with a process reward model for π π . PAV training follows the procedures discussed in Sec. 4.2. While the candidates in the beam are selected using a combination of both the PAV and the PRM π π ,
Figure 4 | For test-time search, PAVs are 8 -10% more accurate and 1 . 5 -5 Γ more compute efficient over ORMs: On samples from (a) Gemma-2B , (b) 9B , and (c) 27B SFT policies, we run test-time beam search with the estimate of effective reward π π + πΌπ΄ π (PAV), where π is the Bo4 ( π ) policy. We compare beam search performance with best-of-N, re-ranking with a trained outcome verifier (ORM), or the oracle Rex (Pass @N).
the final candidate is selected using the outcome reward prediction from π π itself (i.e., we repurpose the PRM representing π π as an ORM). For clarity, we abuse notation and refer to the estimated effective reward (ORM + πΌ PAV) as PAV directly.
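A minimal sketch of this search procedure, with hypothetical `policy.next_steps`, `prm`, and `pav` interfaces standing in for the trained verifiers:

```python
def beam_search(policy, prm, pav, x, beam_size=4, c=3, alpha=0.5, max_steps=10):
    """Step-level beam search scored by the estimated effective reward.
    `policy.next_steps(prefix, n)`, `prm(prefix)` (estimating Q^pi), and
    `pav(prefix)` (estimating A^mu of the last step) are hypothetical
    interfaces standing in for the trained verifiers."""
    beam = [x]
    for _ in range(max_steps):
        # expand: each beam element proposes c candidate next steps
        candidates = [p + [s] for p in beam for s in policy.next_steps(p, n=c)]
        if not candidates:
            break
        # prune: keep the top prefixes under Q^pi + alpha * A^mu
        candidates.sort(key=lambda p: prm(p) + alpha * pav(p), reverse=True)
        beam = candidates[:beam_size]
    # final candidate is re-ranked with the outcome-reward estimate alone
    return max(beam, key=prm)
```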
Setup . We finetune Gemma 2B, 9B, and 27B (Gemma Team et al., 2024) on MATH (Hendrycks et al., 2021) via supervised fine-tuning (SFT) to get three base policies. The set of provers consists of the three base SFT policies themselves, as well as their best-of-K policies for πΎ β { 2^0 , . . . , 2^5 } . Additional details for the experiments in this section are in App. C.
## 4.1. PAVs Scale Test-Time Compute by 5 -10 Γ Over ORMs
Result 1: PAVs are more compute-efficient than ORMs. In Fig. 4, we plot the performance of beam search with PAVs for different beam sizes π , and compare it with best-of- π using ORMs, i.e. , sampling π complete solutions from the base policy and returning the one with the highest ORM score. To compare PAVs and ORMs, we evaluate the compute efficiency of PAVs over ORMs: the factor by which beam search with a PAV reduces the total compute needed to match the performance of best-of-128 with an ORM. Even when accounting for the fact that beam search with PAVs requires additional compute per solution trace (each element in the beam samples πΆ = 3 next steps before scoring and pruning the beam), PAVs improve compute efficiency by 10 Γ over ORMs for the Gemma-2B and 9B base models, and by 5 Γ for the Gemma-27B model. We use BoK ( π ) with πΎ = 4 as the prover policy for all base policies π .
We also compare with beam search using process verifiers that only predict π π , and with best-of-N where the ORM is replaced with the PAV (PAV-as-ORM). At π = 128, similar to Luo et al. (2024), we note a gain of 4% for 'PAV-as-ORM' over ORMs alone (Fig. 5(a)), for the base Gemma-9B π . Compared to beam search with π π (Snell et al., 2024), we find that PAVs improve compute efficiency by 8 Γ . Evidently, the advantages from the prover in the effective reward positively impact beam search. Why does π΄ π help, and for what choice of the prover π ?
Result 2: Beam search with too-weak/too-strong provers is sub-optimal. In Fig. 5(b), for the setting where the base policy π is a Gemma-2B SFT model, we compare beam search with PAVs where the provers are given by BoK ( π ) , for different values of πΎ . Recall that as πΎ increases, BoK ( π ) becomes stronger. Corroborating our analysis in Sec. 3.4, our results show that neither too-weak (Bo2) nor too-strong (Bo32) provers perform best. Instead, across all values of π , we find Bo4 to be dominant. The advantage values
Figure 5 | Comparing PAVs with search baselines and ablating over the prover policy: (a): We compare beam search over Gemma 9B SFT, using either effective reward (PAV), or π π (Snell et al., 2024), and report best-of-N performance where the re-ranker is either the ORM or PAV-as-ORM. (b): For the base Gemma 2B SFT policy, we run beam search with the effective reward where the prover is BoK ( π ) for different values of πΎ . In both ( π ) , ( π ) the x-axis scales the size of the beam or π for best-of-N. (c): For each base policy in the set: Gemma 2B, 9B, 27B policies, we run beam search with PAVs (beam size of 16) where the prover is another policy from the same set.
π΄ π β 0 on all steps for very large πΎ , since π π ( π β , π β ) = 1 - ( 1 - π π ( π β , π β ))^ πΎ β 1 on all steps as we increase πΎ . Hence, in order to succeed we need a prover policy of intermediate strength .
We make similar observations in Fig. 5(c), where we use the three base policies (Gemma 2B/9B/27B) as provers for training PAVs. In this scenario, we evaluate beam search with PAVs at π = 16 on top of each base policy. We find that for the 2B and 9B base models, the 9B and 27B provers are most effective, respectively; whereas for the 27B model, surprisingly, the weaker 9B policy is more effective than the stronger 27B model. The weaker model presumably offers a complementary signal that distinguishes between the different actions taken by the 27B policy, aligning with our theoretical observations in Sec. 3.4.
Result 3: Advantages from the prover policy enable exploration. As discussed in Sec. 3.1, the advantage π΄ π measures the progress made by an action, agnostic of the value of the previous state, whereas π π measures the promise of a particular state. Given a finite-capacity beam, our effective reward (Eq. 5), which linearly combines π π and π΄ π , induces a better tradeoff between exploring new prefixes (states) from which progress can be made and exploiting currently known prefixes with high π -values. Exploration at
Figure 6 | (a): Beam search with PAVs improves exploration efficiency (higher Pass@N), over typical PRMs. (b): Performance of beam search over Gemma 9B SFT for PAVs trained on datasets with different π mc / π cov .
initial steps is critical to ensure that the beam at later steps covers diverse partial rollouts, each with a high likelihood of producing the correct answer. Thus, over-committing the beam to actions from the same state, regardless of the progress made by each, can be sub-optimal compared to a selection strategy that balances rewarding previous actions ( π΄ π ) and current states ( π π ). Indeed, we observe in Fig. 6(a) that beam search with PAVs enhances Pass@N performance over beam search with π π and i.i.d. sampling.
## Takeaways: Scaling test-time compute with process advantage verifiers.
- Beam search with PAVs boosts accuracy by >8% & compute efficiency by 1.5-5x over ORMs.
- Utilizing Best-of-K policies (computed over the base policy) as provers induces better exploration to maximize the outcome reward. Optimal provers for a base policy appear at πΎ > 1.
## 4.2. How to Collect Data to Train PAVs?: PAV Training Data Scaling Laws
We now describe the procedure for training outcome verifiers and PAVs. We can learn to predict π π for a policy π (and similarly π π for the prover) by finetuning LLMs with a cross-entropy loss on data consisting of triplets ( π , π, π π mc ( π , π )) . To collect this data, we first sample π cov 'seed' rollouts from the base or prover policy (for the ORM and PAV respectively), to promote coverage over prefixes and steps. Then we sample π mc additional rollouts conditioned on each prefix in the seed rollouts, to compute a Monte-Carlo estimate of π π at each prefix. In Fig. 6(b) we plot the beam search performance of PAVs trained with different ratios π mc / π cov , as we scale the total dataset size. Here, the beam size is fixed to 128, the base policy is the Gemma 9B SFT policy, and the prover is the Bo4 policy. We find that under low sampling budgets, optimizing for coverage ( π cov > π mc ) is better for performance, whereas at higher budgets, reducing label noise in π π mc by setting π mc > π cov yields further improvements. In addition, some initial sampling budget is spent to identify 'high value' states where π π is larger than a threshold, and to identify the first step with low π π on an incorrect partial rollout from such a state. We found this strategy to scale better with dataset size, as we discuss in App. D.
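A sketch of this two-stage sampling scheme, with hypothetical `policy.rollout` and `env.is_correct` interfaces (the budget-allocation refinements of App. D are omitted):

```python
def collect_pav_data(policy, env, x, n_cov=4, n_mc=8):
    """Collect (prefix, step, Q-hat) triplets for verifier training:
    n_cov 'seed' rollouts give coverage over prefixes, and n_mc extra
    rollouts per prefix give a Monte-Carlo estimate of the Q-value.
    `policy.rollout` and `env.is_correct` are hypothetical interfaces."""
    data = []
    for _ in range(n_cov):
        steps = policy.rollout(x)  # one seed rollout (a list of steps)
        for h, step in enumerate(steps):
            prefix = steps[:h]
            q_hat = sum(
                env.is_correct(policy.rollout(x, prefix=prefix + [step]))
                for _ in range(n_mc)
            ) / n_mc
            data.append((prefix, step, q_hat))
    return data
```

The total sampling budget scales roughly as n_cov times n_mc per question, which is the ratio traded off in Fig. 6(b).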
## 5. Results: Scaling Dense-Reward RL with PAVs
We can also use PAVs to train policies via online reinforcement learning (RL), by using the effective reward Q_π + α·A_μ as dense, per-step rewards. We compare the sample efficiency of PAV-RL (i.e., the PAV-RL objective in Eq. 4) with standard ORM-RL (i.e., the ORM-RL objective in Eq. 3) on Gemma 2B and Gemma 9B SFT models, which are further optimized via rejection finetuning (RFT) (Yuan et al., 2023) before being used to initialize RL. To our knowledge, no prior work has successfully demonstrated the use of dense per-step feedback from a process reward model for RL, and we present the first significant set of results establishing the efficacy of this approach. We show that PAV-RL is much more sample-efficient, and enjoys a higher ceiling on the performance of any test-time re-ranker. Additional details for these experiments are in App. E.
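To make the reward shaping concrete, here is a hypothetical sketch that converts per-prefix value estimates into dense per-step rewards. It assumes the PAV provides prover values V_μ at every prefix, so that the step advantage is the change in value across the step (i.e., the progress made by the step); the function name and list-based interface are our own:

```python
def dense_rewards(v_mu, q_pi, alpha=0.5):
    """Per-step effective rewards for RL, sketched under the assumption that
    we have prover values V_mu(prefix_t) and base-policy values Q_pi for every
    step t of a rollout. Since appending a step deterministically yields the
    next state, the prover advantage of step t is the change in prover value,
    A_mu = V_mu(prefix_t) - V_mu(prefix_{t-1}); the effective reward combines
    it with the base policy's value."""
    rewards = []
    for t in range(1, len(v_mu)):
        a_mu = v_mu[t] - v_mu[t - 1]
        rewards.append(q_pi[t] + alpha * a_mu)
    return rewards
```

Note that even a step inside an ultimately incorrect rollout can receive positive credit here, as long as it increases the prover's value.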
Result 1: PAV-RL is > 7% better than ORM-RL in test accuracy, and 6× more sample-efficient. In Fig. 7(a), we report the test accuracies of Gemma 2B and 9B models trained with SFT, RFT, ORM-RL, and PAV-RL. PAV-RL improves the RFT policy by 11% for 2B and 15% for 9B, with a > 7% gain over ORM-RL in both cases. Not only do the effective rewards from the PAV improve raw accuracy after RL, this higher accuracy is attained 6× faster for the 2B run (see Fig. 7(b)), and similarly for the 9B run (Fig. 7(c)). For both the 2B and 9B RL runs, we experiment with two options for the prover policy: (i) the 2B SFT policy; and (ii) the 9B SFT policy. While both of these provers rapidly become weaker than the base policy within a few gradient steps of RL, a fixed PAV trained with either prover is still able to sustain performance gains in RL. More interestingly, we find that the 2B SFT policy serves as the best choice of prover for both the 2B and
Figure 7 | PAVs as dense rewards in RL improve sample efficiency compared to ORMs, along with gains on raw accuracy: (a) We report the performance of a base policy trained using RL with effective rewards (PAV-RL), or only outcome rewards (ORM-RL), and baselines SFT, RFT. (b,c): Across training iterations, we report the test performance of policies trained with PAV-RL and ORM-RL, on Gemma 2B and 9B SFT base policies.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Charts: PAVs vs. ORMs & Sample Efficiency
### Overview
The image presents three charts comparing the performance of different reinforcement learning (RL) policies. Chart (a) compares the final accuracy of several policies (SFT, RFT, ORM-RL, and PAV-RL) across two base policy sizes (2B and 9B). Charts (b) and (c) show the sample efficiency, i.e., accuracy as a function of training iterations, for ORM-RL and PAV-RL with 2B and 9B base policies, respectively. The charts aim to demonstrate the benefits of PAV-RL in terms of both final accuracy and sample efficiency.
### Components/Axes
* **Chart (a): PAVs vs. ORMs**
* X-axis: Base Policy (2B, 9B)
* Y-axis: Accuracy (0.0 to 0.45)
* Legend:
* SFT (Supervised Fine-Tuning) - Light Blue circles
* RFT (Rejection Finetuning) - Light Blue diagonal stripes
* ORM-RL (RL with sparse outcome rewards) - Orange
* PAV-RL (RL with dense PAV rewards) - Red stars
* **Chart (b): Sample Efficiency (2B)**
* X-axis: Training Iterations (x10^3) (0 to 9)
* Y-axis: Accuracy (0.15 to 0.3)
* Legend:
* ORM-RL - Dashed Orange line
* PAV-RL - Solid Orange line
* **Chart (c): Sample Efficiency (9B)**
* X-axis: Training Iterations (x10^2) (0 to 50)
* Y-axis: Accuracy (0.4 to 0.55)
* Legend:
* ORM-RL - Dashed Orange line
* PAV-RL - Solid Orange line
### Detailed Analysis or Content Details
* **Chart (a): PAVs vs. ORMs**
* SFT (2B): Accuracy β 0.08
* SFT (9B): Accuracy β 0.15
* RFT (2B): Accuracy β 0.15
* RFT (9B): Accuracy β 0.25
* ORM-RL (2B): Accuracy β 0.25
* ORM-RL (9B): Accuracy β 0.40
* PAV-RL (2B): Accuracy β 0.30
* PAV-RL (9B): Accuracy β 0.45
* **Chart (b): Sample Efficiency (2B)**
* ORM-RL: Starts at β 0.17, increases to β 0.21 by 4x10^3 iterations, then fluctuates around β 0.20.
* PAV-RL: Starts at β 0.16, increases rapidly to β 0.26 by 8x10^3 iterations, then decreases slightly to β 0.24 by 9x10^3 iterations.
* The chart indicates that PAV-RL achieves a higher accuracy than ORM-RL with the 2B base policy. The annotations "6 x" and "7%" on the chart indicate that PAV-RL matches ORM-RL's accuracy with 6x fewer training iterations and ends roughly 7% higher.
* **Chart (c): Sample Efficiency (9B)**
* ORM-RL: Starts at β 0.42, increases to β 0.48 by 20x10^2 iterations, then fluctuates around β 0.50.
* PAV-RL: Starts at β 0.40, increases rapidly to β 0.52 by 30x10^2 iterations, then decreases slightly to β 0.48 by 50x10^2 iterations.
* The chart indicates that PAV-RL achieves a higher accuracy than ORM-RL with the 9B base policy. The annotations "5 x" and "6%" on the chart indicate that PAV-RL matches ORM-RL's accuracy with 5x fewer training iterations and ends roughly 6% higher.
### Key Observations
* PAV-RL consistently outperforms ORM-RL in both final accuracy (Chart a) and sample efficiency (Charts b and c).
* Increasing the base policy size from 2B to 9B significantly improves the performance of all policies, but the relative advantage of PAV-RL remains.
* PAV-RL exhibits faster initial learning compared to ORM-RL, as evidenced by the steeper slope of the PAV-RL curves in Charts b and c.
* The accuracy of both policies fluctuates after reaching a peak, suggesting potential instability or overfitting.
### Interpretation
The data strongly suggests that PAV-RL is a more effective reinforcement learning recipe than ORM-RL, particularly with larger base policies. The higher final accuracy and better sample efficiency of PAV-RL indicate that it learns more effectively from limited data. The annotations "6x" and "7%" (for 2B) and "5x" and "6%" (for 9B) quantify the sample-efficiency and accuracy improvements of PAV-RL over ORM-RL, highlighting its practical significance. The fluctuations in accuracy after the initial learning phase suggest some residual training instability. The comparison of 2B and 9B base policies shows that larger models achieve better absolute performance, while the relative benefit of dense rewards from process advantage verifiers (PAVs) persists.
</details>
9B policies. This observation, that a weak prover can still improve the base policy, corroborates our results in the didactic setup and our analysis in Sec. 3.4. While we were not able to run experiments where the prover policy is dynamically updated on the fly, we believe that updating the prover over the course of RL training should only amplify these benefits.
Result 2: PAV-RL achieves a higher performance ceiling on test-time re-ranking. In Fig. 8(a), for Gemma 2B, we plot the pass@N performance of each method and find that (i) pass@N is higher (> 7%) for PAV-RL compared to ORM-RL, for every N ≤ 128; and (ii) the rate at which pass@N improves is higher for PAV-RL than for ORM-RL. Both trends are consistent with our observations on the didactic example in Sec. 3.3. Notably, for N ≥ 64, ORM-RL is worse than the SFT policy, perhaps because lower entropy over the next-step distribution results in non-diverse candidates. Why does PAV-RL produce diverse candidates and not suffer from the low-diversity problem of ORM-RL? We answer this with a key insight: the primary benefit of PAVs is to promote efficient exploration.
Figure 8 | (a): For the policies trained in Fig. 7(a), we report the best-of-N performance where the oracle reward Rex is used to rank N candidates sampled from the base policy (pass@N). (b): Amongst hard problems that remain unsolved by best-of-256 over the base SFT policy, we check how many are solved by best-of-N over PAV-RL or ORM-RL. PAV-RL solves substantially more problems than ORM-RL.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Charts: Model Performance Comparison
### Overview
The image presents two line charts comparing the performance of different models (SFT, RFT, ORM-RL, PAV-RL, ORM, PAV) across varying values of 'N'. The left chart (a) displays "Pass @N (2B)" which appears to be an accuracy metric, while the right chart (b) shows "Success Rate on Problems Unsolved by SFT @256". Both charts use a logarithmic scale for the x-axis (N).
### Components/Axes
**Chart (a): Pass @N (2B)**
* **X-axis:** N, ranging from 2<sup>1</sup> to 2<sup>7</sup> (approximately 2 to 128).
* **Y-axis:** Accuracy, ranging from 0.15 to 0.5.
* **Data Series:**
* SFT (Light Blue)
* RFT (Pale Blue)
* ORM-RL (Light Orange)
* PAV-RL (Dark Orange)
* **Legend:** Located in the top-left corner.
**Chart (b): Solving Hard Questions**
* **X-axis:** N, ranging from 2<sup>1</sup> to 2<sup>8</sup> (approximately 2 to 256).
* **Y-axis:** Success Rate on Problems Unsolved by SFT @256, ranging from 0 to 0.15.
* **Data Series:**
* ORM (Light Green)
* PAV (Dark Orange)
* **Legend:** Located in the top-left corner.
### Detailed Analysis or Content Details
**Chart (a): Pass @N (2B)**
* **SFT (Light Blue):** The line slopes upward, starting at approximately 0.17 at N=2<sup>1</sup> and reaching approximately 0.35 at N=2<sup>7</sup>.
* **RFT (Pale Blue):** The line also slopes upward, starting at approximately 0.18 at N=2<sup>1</sup> and reaching approximately 0.42 at N=2<sup>7</sup>.
* **ORM-RL (Light Orange):** The line slopes upward more steeply than SFT and RFT, starting at approximately 0.22 at N=2<sup>1</sup> and reaching approximately 0.45 at N=2<sup>7</sup>.
* **PAV-RL (Dark Orange):** The line has the steepest upward slope, starting at approximately 0.25 at N=2<sup>1</sup> and reaching approximately 0.5 at N=2<sup>7</sup>.
**Chart (b): Solving Hard Questions**
* **ORM (Light Green):** The line is relatively flat, starting at approximately 0.02 at N=2<sup>1</sup> and reaching approximately 0.03 at N=2<sup>8</sup>.
* **PAV (Dark Orange):** The line slopes upward, starting at approximately 0.02 at N=2<sup>1</sup> and reaching approximately 0.15 at N=2<sup>8</sup>. The increase is more pronounced from N=2<sup>5</sup> onwards.
### Key Observations
* In Chart (a), PAV-RL consistently outperforms all other models across all values of N. ORM-RL performs better than RFT and SFT.
* In Chart (b), PAV demonstrates a significant increase in success rate as N increases, while ORM remains relatively stable.
* The shaded areas around the lines in both charts represent confidence intervals or standard deviations, indicating the variability in the results.
### Interpretation
The data suggests that models trained with reinforcement learning (RL), specifically ORM-RL and PAV-RL, demonstrate superior performance on the "Pass @N (2B)" task (Chart a) compared to models trained only with supervised fine-tuning (SFT) or rejection finetuning (RFT). PAV-RL consistently achieves the highest accuracy.
Chart (b) highlights the ability of the PAV model to solve problems that are initially unsolved by the SFT model, and this ability increases with larger values of N. This suggests that PAV benefits from increased computational resources or a larger search space. The relatively flat performance of ORM indicates it does not scale as effectively as PAV in solving these "hard" problems.
The logarithmic scale of the x-axis (N) is crucial. It indicates that the performance gains are more significant at lower values of N, and the rate of improvement diminishes as N increases. This could be due to diminishing returns or the inherent difficulty of the task. The confidence intervals suggest that the observed differences in performance are statistically significant, but there is still some degree of uncertainty. The two charts together suggest a trade-off between overall accuracy (Chart a) and the ability to solve particularly challenging problems (Chart b).
</details>
Result 3: PAVs improve exploration and discover correct solutions to novel problems. An outcome reward model downweights all steps in an incorrect rollout equally during RL, whereas the effective reward Q_π + α·A_μ from PAVs up-weights steps that make progress under the prover, even when the complete rollout is incorrect. This increases coverage over individual steps that improve the base policy's likelihood of success (since the prover policy is not too misaligned with the base policy): such steps can now be proposed by the base policy at a given prefix. This mechanism for exploration is analogous to the test-time search we discussed in Sec. 4.1. Hence, the directed supervision from PAVs improves sample efficiency throughout the course of training (Fig. 7(c)). In fact, we also find that combining the PAV-RL policy with test-time beam search solves a substantially larger number of new problems within small compute budgets (N = 16, 32) that the SFT policy cannot solve with the much larger budget N = 256 (Fig. 8(b)).
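The contrast in credit assignment can be sketched with a deliberately simplified example, in which the outcome label stands in for the sparse outcome-level signal and the per-step prover advantages are hypothetical numbers (both function names are our own):

```python
def orm_step_rewards(num_steps, outcome):
    """ORM-style credit: every step of the rollout inherits the single
    outcome label, so all steps of an incorrect rollout are downweighted
    equally."""
    return [float(outcome)] * num_steps

def pav_step_rewards(prover_advantages, outcome, alpha=1.0):
    """PAV-style credit: even in an incorrect rollout (outcome = 0), a step
    with positive prover advantage still receives positive reward."""
    return [float(outcome) + alpha * a for a in prover_advantages]
```

Under the outcome-only scheme, a productive first step in a failed rollout is penalized just like the step that ruined the trace; under the advantage-shaped scheme, it keeps positive credit, which is what drives the exploration benefit described above.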
## Takeaway: RL with process advantage verifiers (PAVs) as dense rewards
- Using trained PAVs as dense rewards in RL boosts sample efficiency by 5-6×, compared to only using sparse ORM rewards, and results in policies with higher pass@N performance.
- Advantages from a complementary prover policy improve the sample efficiency of exploration in RL, and produce policies that can discover solutions to hard novel questions.
## 6. Related Work
We briefly discuss some key related works here, and leave a detailed discussion to App. A. To address the sparse feedback of ORMs (Cobbe et al., 2021b), recent works (Lightman et al., 2023; Uesato et al., 2022) trained process reward models (PRMs) to densely predict incorrect steps in a multi-step reasoning trace. Since human collection of process labels is not scalable, recent work (Luo et al., 2024; Wang et al., 2024) used automated supervision to annotate steps with Q-values under the base policy, i.e., the PRMs score a step with the likelihood of future success when continuing to sample from that step. While the Q-value PRMs in Lightman et al. (2023); Luo et al. (2024) were mainly used as verifiers for re-ranking, Snell et al. (2024) used them for test-time beam search. Shao et al. (2024) use PRMs for RL but found a gain of only 1-2%. In our work, we question solely relying on Q-values or advantages of the base policy, and find that measuring progress (i.e., advantages) under a different prover policy can amplify exploration, thus boosting test-time search and RL. To our knowledge, we are the first to show substantial gains in compute and sample efficiency with PRMs. Our methodology for data collection is similar to that of Hwang et al. (2024); Setlur et al. (2024) (i.e., identifying 'first pits' in reasoning traces), but these works only use it to collect preference pairs. Beyond all of this, we also characterize which policy to use for computing advantages.
Concurrently with us, and akin to the methodology in Hwang et al. (2024); Setlur et al. (2024), Kazemnejad et al. (2024) optimize the base policy π with online RL, where the dense step-level rewards correspond to advantages A_π under the base policy π itself. This is a special case of our setting with prover policy μ = π, but as we note in Sec. 3.2, setting μ = π in our effective reward (Eq. 5) results in exactly the same policy gradient updates as optimizing the outcome reward alone. Since Kazemnejad et al. (2024) use 'on-the-fly' Monte-Carlo rollouts to estimate advantages, they avoid the estimation errors of a learned process reward model. Nonetheless, our theoretical result and the didactic example (both of which assume access to perfect advantage estimates) show that the gains from this approach are significantly smaller than those from using an appropriate prover policy distinct from the base policy.
## 7. Discussion and Conclusion
We began our exposition with the following question: how should we define process rewards such that optimizing the base policy against them ultimately improves the outcome-level correctness of the final answer? Our key finding is that process rewards defined as advantages of a prover policy, distinct from the base policy, improve the efficiency of exploration over steps sampled from the base policy during test-time search and online RL. This improved exploration in turn leads to the discovery of better solutions, resulting in higher accuracy on the math reasoning task. We also formally characterized the set of good prover policies: those whose step-level advantages meaningfully contrast the steps generated by the base policy, while still remaining aligned with the base policy. Having trained process advantage verifiers (PAVs) to predict advantages under such prover policies, we empirically observed that test-time search against the trained PAVs improves the compute efficiency of search by 1.5-5×, and its accuracy by over 8%, compared to running best-of-N against an ORM. We then presented one of the first significant results validating the use of dense supervision when optimizing the base policy with online RL: dense online RL with rewards from our trained PAVs improves the sample efficiency of online RL by 5-6×, and yields an accuracy gain of over 6%.
Limitations. Despite the promise of our results, there are several limitations to our work that present important avenues for future research. First, while we can easily compute the right-hand side of the result in Theorem 3.1 to understand whether a given prover policy will improve a fixed base policy, it is unclear how to automatically design a flexible class of optimal (or very good) prover policies for a sequence of base policy iterates. Perhaps simultaneously optimizing the prover and the base policy (in a two-player game) could yield the best prover during RL, but this is largely an open question. Second, learning a process advantage verifier (PAV) will inevitably incur fitting errors, which upper-bound the performance of our method. Fitting errors could be circumvented if we could simply run rollouts from prover policies during online RL or search to estimate advantages without training verifiers; extending our approach to this setup is a promising avenue for future work.
## Acknowledgements
The authors would like to thank Charlie Snell, Yi Su, Katherine Heller, and Virginia Smith for feedback on an earlier version of this paper. We also thank Ahmad Beirami, Sergey Levine, Victor Veitch, Idan Shenfeld, Arian Hosseini, Stephen Pfohl, Xiangyu Qi, Tianhe Yu, and Christina Baek for technical discussions. AS and CN also thank Preston Robinette, Sho Kannan, Tianze Shi, Diana Mincu, Hritik Bansal, and Liangchen Luo for code, infrastructure and data analytics support.
## References
- A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research , 22(98):1-76, 2021.
- T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems , 30, 2017.
- M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems , pages 1471-1479, 2016.
- X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 , 2024.
- J. D. Chang, K. Brantley, R. Ramamurthy, D. Misra, and W. Sun. Learning to generate better than your llm. arXiv preprint arXiv:2306.11816 , 2023.
- K.-W. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In International Conference on Machine Learning , pages 2058-2066. PMLR, 2015.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021a.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021b.
- M. Collins. Discriminative reranking for natural language parsing. In Proceedings of the International Conference on Machine Learning , 2000.
- Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 , 2024.
- M. Germain, K. Gregor, I. Murray, and H. Larochelle. Made: Masked autoencoder for distribution estimation. In International conference on machine learning , pages 881-889. PMLR, 2015.
- A. Havrilla, Y. Du, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, S. Sukhbaatar, and R. Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642 , 2024.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS , 2021.
- G. Hinton. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 , 2015.
- A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal. V-star: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457 , 2024.
- H. Hwang, D. Kim, S. Kim, S. Ye, and M. Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards. arXiv preprint arXiv:2404.10346 , 2024.
- S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning , pages 267-274, 2002.
- S. M. Kakade. A natural policy gradient. In Advances in neural information processing systems , volume 14. Advances in neural information processing systems, 2001a.
- S. M. Kakade. A natural policy gradient. Advances in neural information processing systems , 14, 2001b.
- A. Kazemnejad, M. Aghajohari, E. Portelance, A. Sordoni, S. Reddy, A. Courville, and N. L. Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679 , 2024.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050 , 2023.
- L. Luo, Y. Liu, R. Liu, S. Phatale, H. Lara, Y. Li, L. Shu, Y. Zhu, L. Meng, J. Sun, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592 , 2024.
- Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and H. Yang. Let's reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080 , 2023.
- R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 , 2021.
- A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML , volume 99, pages 278-287, 1999.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730-27744, 2022.
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290 , 2023.
- S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979 , 2014.
- A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295 , 2015.
- A. Setlur, S. Garg, X. Geng, N. Garg, V. Smith, and A. Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. arXiv preprint arXiv:2406.14532 , 2024.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024.
- A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585 , 2023a.
- I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages 11523-11530. IEEE, 2023b.
- C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 , 2024.
- W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International conference on machine learning , pages 3309-3318. PMLR, 2017.
- R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction . The MIT Press, second edition, 2018.
- R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems , 12, 1999.
- F. Tajwar, A. Singh, A. Sharma, R. Rafailov, J. Schneider, T. Xie, S. Ermon, C. Finn, and A. Kumar. Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data, 2024.
- J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275 , 2022.
- P. Wang, L. Li, Z. Shao, R. X. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning , 8(3-4):229-256, 1992.
- Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang. An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724 , 2024.
- F. Yu, A. Gao, and B. Wang. Outcome-supervised verifiers for planning in mathematical reasoning. arXiv preprint arXiv:2311.09724 , 2023.
- Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825 , 2023.
- E. Zelikman, Y. Wu, J. Mu, and N. Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems , 35:15476-15488, 2022.
- L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240 , 2024.
## Appendices
## A. Additional Related Work
In this section, we highlight works from four relevant streams, expanding on the discussion in Section 6. First, we look at works that train verifiers to provide outcome-level feedback (Cobbe et al., 2021b; Hosseini et al., 2024; Singh et al., 2023b; Zelikman et al., 2022) on the correctness of the full response (ORMs). Here, the trained ORMs are mainly used for test-time search (best-of-N). Next, we look at works that alleviate issues with sparse feedback in ORMs and instead train process reward models (PRMs), which can perform credit assignment. PRMs are trained either through human annotations (Lightman et al., 2023; Uesato et al., 2022) or automated forms of supervision (Luo et al., 2024; Snell et al., 2024; Wang et al., 2024). While some works use PRMs and ORMs to collect data for supervised fine-tuning (Hosseini et al., 2024) or offline RL (Setlur et al., 2024), other works directly use them as rewards in online RL (Shao et al., 2024; Uesato et al., 2022; Wang et al., 2024). Finally, we contrast our work against papers on imitating stronger teacher policies via RL objectives that optimize potential functions of teacher policies.
Outcome reward models. ORMs are verifiers (Cobbe et al., 2021b; Uesato et al., 2022) commonly used to improve test-time performance via best-of-N, where we generate multiple candidate solutions from the base policy (LLM), rank them using the ORM, and pick the best one. ORMs are trained to assess the correctness of a solution via binary classification (Cobbe et al., 2021a; Yu et al., 2023), preference optimization with DPO (Hosseini et al., 2024), or next-token prediction (Zhang et al., 2024). Furthermore, prior works train LLMs on self-generated data using ground-truth outcome rewards (Rex), either via supervised fine-tuning (Singh et al., 2023a; Yuan et al., 2023; Zelikman et al., 2022) or online RL (Bi et al., 2024). In contrast to these approaches, our work focuses on process reward models (PRMs) for improving performance with beam search at test time, as well as online RL where we maximize the effective reward in Eq. 5, which linearly combines Rex (outcome supervision) and process supervision in the form of advantages A_μ under a prover policy μ.
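In its simplest form, best-of-N re-ranking against an ORM reduces to the following sketch; here `orm_score` stands in for a learned verifier's scalar score, and the function name is our own:

```python
def best_of_n(candidates, orm_score):
    """Best-of-N re-ranking: given N candidate solutions sampled from the
    base policy, return the one the (learned) outcome verifier scores
    highest. Ties resolve to the earliest candidate, per Python's max()."""
    return max(candidates, key=orm_score)
```

The verifier never influences generation here, only selection; this is exactly the limitation that motivates step-level process rewards and beam search in the rest of the paper.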
PRMs and credit assignment. Several works train step-level PRMs on math reasoning tasks, either using human labels (Lightman et al., 2023) or automated LLM-generated data to estimate value functions Q_π (Luo et al., 2024; Wang et al., 2024). Our work also focuses on automated data collection for PRMs, but empirically argues for using the advantage function A_μ as step-level rewards along with Q_π, with a conceptual explanation in Section 3.1. Several prior works have explored step-level search algorithms with PRMs, such as beam search (Snell et al., 2024), heuristic greedy search (Ma et al., 2023), and reward-balanced tree search (Wu et al., 2024). Hwang et al. (2024); Setlur et al. (2024) use advantages to identify the 'first pit' in an incorrect reasoning trace. Specifically, they collect data by computing advantages at each step using Monte-Carlo rollouts; then, in an incorrect trace, they identify the step with the least advantage, and use the prefix of that step to construct preference pairs for offline direct preference optimization (Rafailov et al., 2023). In contrast, our work computes advantages under a prover policy, which we formally characterize, and uses the computed advantages to improve test-time search and the efficiency of online reinforcement learning.
Online RL for math reasoning. Once we have trained outcome or process verifiers, it is natural to update a policy by optimizing it against the learned signal, similar to how learned reward models are optimized in RLHF (Ouyang et al., 2022). In the context of math reasoning, Havrilla et al. (2024); Shao et al. (2024); Uesato et al. (2022) trained policies with RL, experimenting with both dense and sparse
Planted sub-sequence y*: [3 6 0 3 5] (figure content; the samples shown are drawn from the policies described in the caption below).
Figure 9 | Pictorial description of our planted sub-sequence didactic setup: an example showing five samples drawn i.i.d. from a very strong policy (γ = 100), and a relatively weaker (γ = 15) policy in our didactic setup.
rewards. In all three works, the gains observed from using PRMs that predict step-level correctness (similar to Lightman et al. (2023)) are quite small compared to simply using trained ORMs or the ground-truth outcome supervision Rex. In fact, Havrilla et al. (2024) state that the only algorithm that does well is a form of expert iteration (Anthony et al., 2017), which does not inhibit exploration as severely as some other approaches they compare with. Our work presents one of the first results where trained PRMs, used in conjunction with outcome rewards during online RL, yield policies with substantially higher (+6%) performance than those trained only with outcome supervision. Our results also indicate a 5-6× sample-efficiency boost for online RL with our trained PAVs.
Connections to imitation learning through RL. The idea of mixing potential functions from different policies $\pi$ and $\mu$ to improve upon a sub-optimal expert appears in Chang et al. (2015), but that work considers the structured prediction problem, which is vastly different from our setting. Related to this is the work of Chang et al. (2023), which uses a 'guide' policy to roll out from prefixes generated by a base policy. The base policy can then imitate the guide by cloning those rollouts, and eventually surpass it. Our work also uses a prover policy that can complete rollouts from states where the base policy fails. But we also show that weak provers are, in many cases, able to improve the base policy, or search over its responses, better than a stronger prover policy. We tie this observation to the insight that the main goal of the prover policy is to distinguish steps taken by the base policy, as measured by advantages under the prover. Thus, we do not require the prover policy to be better than the base policy, which is a key distinction from Chang et al. (2023).
## B. Didactic Analysis
We consider sequences of length 10 from a 15-token vocabulary $\mathcal{V} \coloneqq \{0, 1, \ldots, 14\}$, where the end-of-sequence token is 14, and all tokens following the end-of-sequence token (including it) are masked. Given an unknown planted sequence $y^*$ (in Fig. 9), we train a policy $\pi$ with policy gradient, where the outcome reward we wish to optimize is terminal and sparse, i.e., for $y \sim \pi$ we have $R(y, y^*) = 1$ if and only if $y^*$ appears in $y$, and 0 otherwise (Fig. 9). The policy $\pi$ in our experiments is represented by a multi-layer neural network, similar to the MADE architecture (Germain et al., 2015). The prover policy $\mu$ is parameterized by a scalar $\gamma > 0$. In particular, at any state $s$ where the last $k$ tokens leading up to $s$ match the first $k$ tokens of $y^*$, we set:

$$\mu ( y ^ { * } _ { k + 1 } \mid s ) \propto \gamma ,$$

and $\mu$ is uniform on all other tokens. Thus, as $\gamma$ increases, the performance of $\mu$ improves, approaching 1 as $\gamma \to \infty$. For our experiments, we assume (almost) oracle access to ground-truth advantage and $Q$-values, thus mitigating any confounding issues due to the use of a learned verifier. We are able to approximate exact $Q$-values very accurately by using Monte Carlo estimates with a large number ($> 100$) of rollouts. With the goal of optimizing the terminal reward $R(y, y^*)$, we optimize $\pi$ with two types of rewards: (i) only the outcome reward $R(y, y^*)$, which is equivalent to using only $Q^\pi$ as the step-level reward; and (ii) the effective reward $Q^\pi + \alpha A^\mu$ as the step-level reward.
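The reward and prover of this didactic setup can be sketched in a few lines of Python (a minimal sketch we provide for exposition; names such as `outcome_reward` and `prover_step` are ours, not from the paper's code):

```python
import random

VOCAB = list(range(15))          # tokens 0..14; 14 is the end-of-sequence token
H = 10                           # sequence length
Y_STAR = [3, 6, 0, 3, 5]         # planted sub-sequence (Fig. 9)

def outcome_reward(y, y_star=Y_STAR):
    """Terminal, sparse reward: 1 iff y_star appears contiguously in y."""
    n = len(y_star)
    return int(any(y[i:i + n] == y_star for i in range(len(y) - n + 1)))

def prover_step(prefix, gamma, y_star=Y_STAR):
    """One step of the gamma-parameterized prover mu: if the last k tokens of
    the prefix match the first k tokens of y_star, the probability of the next
    planted token is proportional to gamma; all other tokens stay uniform."""
    k = 0
    for j in range(min(len(prefix), len(y_star)), 0, -1):
        if prefix[-j:] == y_star[:j]:
            k = j
            break
    weights = [1.0] * len(VOCAB)
    if k < len(y_star):
        weights[y_star[k]] = gamma   # up-weight the next planted token
    # sample a token proportionally to the weights
    r = random.random() * sum(weights)
    for tok, w in enumerate(weights):
        r -= w
        if r <= 0:
            return tok
    return len(VOCAB) - 1
```

As $\gamma$ grows, `prover_step` concentrates almost all probability on the next planted token, matching the limiting behavior described above.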
Training details. We use effective rewards with $\alpha = 1$, and use the gradient in Eq. 5 to update the policy via policy gradient iterations. For the ORM runs, where we only use the outcome reward $R(y, y^*)$, the policy gradient is equivalent to the case where $\alpha = 0$ in Eq. 5. We train for 10,000 iterations in both cases, with a batch size of 64 and a constant learning rate of 1e-3 for the Adam optimizer. The RL runs are initialized with a supervised finetuned policy. For this, we take a randomly initialized network based on the MADE architecture (Germain et al., 2015), with 3 layers and 128 hidden units in each. We then train it with a supervised next-token prediction loss for 50 iterations on a dataset of 3200 samples from a weak policy ($\gamma = 5.0$). The batch size for SFT training is also set to 64. For evaluating Pass@$k$ performance, we sample $k$ independent trajectories (temperature 1.0) from the base policy trained using either the effective reward or only $Q^\pi$. We also evaluate Pass@$k$ for the SFT policy for comparison.
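The Monte Carlo estimation of $Q$-values and prover advantages used here can be sketched as follows (our own illustrative helpers, assuming sequences are lists of tokens and a policy is a function mapping a prefix to its next token):

```python
def mc_q(prefix, step_fn, reward_fn, horizon=10, n=128):
    """Monte Carlo estimate of a Q-value: complete `prefix` under the given
    policy `n` times and average the terminal outcome reward."""
    total = 0.0
    for _ in range(n):
        y = list(prefix)
        while len(y) < horizon:
            y.append(step_fn(y))
        total += reward_fn(y)
    return total / n

def advantage_under_prover(prefix, action, prover_step, reward_fn, horizon=10, n=128):
    """A^mu(s, a) = Q^mu(s, a) - V^mu(s), both estimated with rollouts of the
    prover mu. The effective step-level reward is then Q^pi + alpha * A^mu."""
    q_sa = mc_q(list(prefix) + [action], prover_step, reward_fn, horizon, n)
    v_s = mc_q(prefix, prover_step, reward_fn, horizon, n)
    return q_sa - v_s
```

With $> 100$ rollouts per prefix, these estimates are accurate enough in the didactic setup to stand in for oracle values.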
## C. Additional Experiments on Test-time Search with PAVs
Implementation details. For our experiments in Sec. 4, we use three pretrained models: Gemma 2B, 9B, and 27B. We finetune each of these on the MATH (Hendrycks et al., 2021) dataset. The finetuning is done for 5000 iterations, with a batch size of 32 and a maximum learning rate of 5e-6 for the 2B and 9B models, and 5e-7 for the 27B model. We trained the policies using the Adam optimizer with a linear warmup and cosine decay learning rate schedule; the linear warmup is done for the first 500 iterations. For the base policies, we choose the SFT checkpoints with the best accuracy on a holdout validation set of the MATH dataset. Given the SFT checkpoints, we next train PAVs using the procedure in Sec. 4.2. We do this for a class of provers, which includes the base policies themselves. As we discuss in Sec. 4, the prover class also includes the best-of-$K$ policy for $K \in \{2, 4, 8, 16, 32\}$.
We use the holdout validation set to determine the value of $\alpha$ in the effective reward. For each base policy, we run beam search with a beam size of 16 on this holdout validation set and, using the base policy itself as a prover, evaluate the value of $\alpha$ that works best in the effective reward. We find that $\alpha = 0.5$ worked best for the Gemma 2B and 9B base policies, while a lower value of $\alpha = 0.2$ was optimal for Gemma 27B. To tune $\alpha$, we ran a grid search over the range $[0.0, 1.0]$, evaluating at intervals of 0.1. We observe that the choice of $\alpha$ is relatively robust, since for all three base policies we saw improvements (over only $Q^\pi$ as the reward) for values in the range $[0.2, 0.6]$. Having tuned a separate value of $\alpha$ for each base policy, we use the same value in the effective reward for any choice of the prover policy used with that base policy. Next, we present an experiment that compares the predictive power of the effective reward vs. just $Q^\pi$ at initial states of a rollout under the base policy $\pi$, when either is used to predict the final outcome given by $R_{ex}$.
Experiment: Does the effective reward predict the final outcome better than $Q^\pi$? In Fig. 10, we describe an experiment where, for both the effective reward $Q^\pi + \alpha A^\mu$ (PAV) and just $Q^\pi$ (PQV), we compute the error of the classifier that predicts the final outcome by thresholding on the reward value at each step of the rollout. This threshold is computed using a validation set, separately for each step and reward combination. The figure shows that the outcome prediction error drops for both rewards as the base policy is rolled out further, but the effective reward clearly dominates $Q^\pi$ (PQV) across all steps. Thus, the effective reward is a more informative signal (lower classification error) for predicting the success of a partial rollout, especially in the earlier steps of the rollout. This helps explain the better performance of beam search with a finite-capacity beam that re-ranks partial rollouts with the effective reward. For this experiment, we use the Gemma 9B SFT policy as the base policy, and the best-of-4 policy corresponding to the same SFT policy as the prover.
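The step-level beam search used in these experiments can be sketched generically (a simplified sketch under our own interface assumptions; `expand` and `score` are hypothetical callables standing in for the policy's step proposals and the learned effective reward):

```python
import heapq

def beam_search(init_state, expand, score, beam_size=16, depth=10):
    """Step-level beam search against a PAV: at each depth, expand every beam
    prefix into candidate next steps, score each candidate partial rollout
    with the (learned) effective reward Q^pi + alpha * A^mu, and keep the top
    `beam_size` prefixes. `expand(state)` returns candidate successor states;
    `score(state)` is the scalar effective reward of the partial rollout."""
    beam = [init_state]
    for _ in range(depth):
        candidates = [s for state in beam for s in expand(state)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_size, candidates, key=score)
    return beam[0] if beam else init_state
```

Because the beam has finite capacity, a score that better predicts the final outcome at early steps (as the experiment above measures) translates directly into keeping better prefixes.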
Figure 10 | Effective rewards at any step predict the outcome reward at the final step better than $Q^\pi$: For both the effective reward $Q^\pi + \alpha A^\mu$ and just $Q^\pi$, we compute the error of the classifier that predicts the final outcome by thresholding on the reward value at each step of the rollout. This threshold is computed using a validation set, separately for each step and reward combination. The figure shows that the outcome prediction error drops for both rewards as the base policy is rolled out further, but compared to $Q^\pi$ at an intermediate step, the effective reward $Q^\pi + \alpha A^\mu$ more reliably predicts the final outcome-level correctness of the future rollout under the base policy $\pi$.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Line Chart: Predicting Final Outcome
### Overview
The image presents a line chart comparing the error rates of two methods, PAV and PQV, as the number of steps taken increases. The chart aims to demonstrate how the accuracy of predicting a final outcome improves with more steps.
### Components/Axes
* **Title:** "Predicting Final Outcome" (centered at the top)
* **X-axis:** "Number of steps taken" (ranging from 1 to 10, with tick marks at each integer value)
* **Y-axis:** "Error" (ranging from 0 to 1, with tick marks at 0.2 intervals)
* **Data Series 1:** PAV (represented by a teal line with star-shaped markers)
* **Data Series 2:** PQV (represented by a pink line with circular markers)
* **Legend:** Located in the top-right corner, labeling the two data series with their corresponding colors and symbols.
### Detailed Analysis
**PAV (Teal Line with Star Markers):**
The PAV line exhibits a generally downward trend, indicating a decrease in error as the number of steps increases.
* At 2 steps: Error β 0.74
* At 3 steps: Error β 0.65
* At 4 steps: Error β 0.54
* At 5 steps: Error β 0.42
* At 6 steps: Error β 0.34
* At 7 steps: Error β 0.28
* At 8 steps: Error β 0.24
* At 9 steps: Error β 0.21
* At 10 steps: Error β 0.18
**PQV (Pink Line with Circular Markers):**
The PQV line also shows a decreasing trend, but it is less consistent than the PAV line.
* At 2 steps: Error β 0.88
* At 3 steps: Error β 0.78
* At 4 steps: Error β 0.67
* At 5 steps: Error β 0.58
* At 6 steps: Error β 0.51
* At 7 steps: Error β 0.45
* At 8 steps: Error β 0.38
* At 9 steps: Error β 0.27
* At 10 steps: Error β 0.23
### Key Observations
* Both PAV and PQV methods demonstrate a reduction in error as the number of steps taken increases.
* PAV consistently exhibits lower error rates than PQV across all step values.
* The rate of error reduction appears to slow down as the number of steps increases for both methods.
* The PQV line shows a more pronounced decrease in error between steps 7 and 10 compared to PAV.
### Interpretation
The chart suggests that both PAV and PQV are effective methods for improving the accuracy of predicting a final outcome, with increasing accuracy as more steps are taken. However, PAV consistently outperforms PQV, indicating it is a more reliable method. The diminishing returns in error reduction with increasing steps suggest there may be a point where additional steps provide minimal improvement. The steeper decline in PQV's error rate at higher step counts could indicate that PQV benefits more from iterative refinement than PAV, or that it requires more steps to converge to a stable solution. The data implies that the choice of method (PAV vs. PQV) and the number of steps taken should be carefully considered based on the desired level of accuracy and computational cost.
</details>
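The per-step thresholding in the experiment above can be sketched as follows (an illustrative helper of ours; the paper fits a separate threshold for each step-and-reward combination on a validation set, which corresponds to calling this once per combination):

```python
def best_threshold(scores, labels):
    """Fit, on a validation set, the threshold on a step-level reward that
    minimizes 0/1 error when predicting the final (binary) outcome.
    Returns the best threshold and its validation error."""
    best_t, best_err = 0.0, float("inf")
    for t in sorted(set(scores)):
        # predict success iff the step-level reward clears the threshold
        err = sum((s >= t) != bool(l) for s, l in zip(scores, labels)) / len(scores)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```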
## D. Details on Collecting Data and Training PAVs
In Fig. 6(b) in Sec. 4.2, our seed rollouts are sampled i.i.d. from $\pi$, but prior works (Hwang et al., 2024; Luo et al., 2024; Setlur et al., 2024) employed a 'first pit' strategy for coverage. Here, some initial sampling budget is spent to identify 'high-value' states where $Q^\pi$ is larger than a threshold. Then, for any incorrect sample from these high-value states, a greater budget is spent to estimate the first step (the 'first pit') with $Q^\pi = 0$. All prefixes (and their estimated $Q$-values) up to the first pit are then added to the training data. In Fig. 11, we compare beam search using PAVs trained on data collected with the first-pit strategy and with the random sampling strategy. Both use the best value of $n_{mc}/n_{cov}$ from Fig. 6(b) for every dataset size. We find the first-pit strategy to be better than random sampling, especially when the number of seed rollouts is limited. Once we have coverage over such pits, we sample a large number of partial rollouts conditioned on each prefix up to the first pit. This is used to compute more accurate Monte Carlo estimates of the $Q$-values on the path to the first pit. Each (prefix, estimated $Q$-value) pair is then added to the dataset used to train PAVs.
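Identifying the 'first pit' in an incorrect rollout, given Monte Carlo $Q$-value estimates for each of its prefixes, can be sketched as (our own minimal helper; the `tol` parameter is an assumption we add to absorb Monte Carlo noise around zero):

```python
def find_first_pit(prefix_q_values, tol=0.0):
    """Given Monte Carlo Q-value estimates for successive prefixes of an
    incorrect rollout that starts from a high-value state, return the index of
    the first step whose estimated Q-value collapses to (near) zero: the
    'first pit'. Prefixes up to this step are kept as PAV training data."""
    for i, q in enumerate(prefix_q_values):
        if q <= tol:
            return i
    return None  # no pit found in this rollout
```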
Training details. All PAVs used in this work are trained by taking the Gemma 9B pretrained checkpoint and finetuning it on the data collected with the above strategy. The data collection uses the first-pit strategy for better coverage over pits in the seed rollouts. Based on the findings in Fig. 6(b), we use a high value of $n_{mc} = 20$ to estimate the $Q$-values accurately for each step in the seed rollout. For each base policy, we collect a dataset of over 300,000 (prefix, $\hat{Q}^\pi$-value) pairs in total, where $\hat{Q}^\pi$ is the Monte Carlo estimate of the $Q$-value at the prefix under the policy $\pi$, on which we finetune the Gemma 9B model with a cross-entropy loss. Since the distribution of $\hat{Q}^\pi$-values can be skewed, we split the range of $\hat{Q}^\pi$-values into two buckets, based on which we also partition the training data. The first bucket is the set of all prefixes with $\hat{Q}^\pi < 0.5$ and the second is the set of all prefixes with $\hat{Q}^\pi \geq 0.5$. We then use class-balanced sampling over these buckets to finetune the pretrained model for 20,000 training iterations with a batch size of 32. We use the Adam optimizer with a maximum learning rate of 5e-7, with a linear warmup (for the first 2000 steps) followed by a cosine decay learning rate schedule. Since a pretrained LLM outputs a matrix of logits (vocabulary size $\times$ sequence length), we fix a 'scoring token' at the end of the sequence/prefix that needs to be scored. The logits of this scoring token are then used to determine the prediction of the LLM being trained.
Once we have models that predict $Q^\pi$ for a base policy $\pi$, we compute the $Q$-value under the best-of-$K$ (BoK) policy corresponding to $\pi$ by setting $Q^{\text{BoK}(\pi)}(s, a) = 1 - (1 - Q^\pi(s, a))^K$. Then, given the $Q$-values for $\pi$ and its corresponding best-of-$K$ policies, we can compute any effective reward in Eq. 5, using the definition of the advantage of a step in Eq. 2.
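This best-of-$K$ conversion, and the resulting effective reward, can be sketched as (a sketch of ours, assuming the best-of-$K$ value at state $s$ is likewise $1 - (1 - V^\pi(s))^K$):

```python
def q_bok(q_pi, k):
    """Best-of-K Q-value: K i.i.d. completions from pi succeed unless all K
    fail, so Q^{BoK(pi)}(s, a) = 1 - (1 - Q^pi(s, a))^K."""
    return 1.0 - (1.0 - q_pi) ** k

def effective_reward(q_pi_sa, v_pi_s, k=4, alpha=0.5):
    """Effective step-level reward Q^pi + alpha * A^mu with a best-of-K
    prover, where A^mu(s, a) = Q^{BoK}(s, a) - V^{BoK}(s)."""
    a_mu = q_bok(q_pi_sa, k) - q_bok(v_pi_s, k)
    return q_pi_sa + alpha * a_mu
```

Note that both quantities are derived from a single learned predictor of $Q^\pi$, so no separate verifier needs to be trained per prover.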
Figure 11 | First-pit strategy from Luo et al. (2024); Setlur et al. (2024): We compare beam search performance (with beam size 128) for PAVs trained on data collected using two types of seed rollouts. For the seed rollouts, we either sample randomly from the distribution of state-action pairs induced by the base policy, or improve coverage by using the 'first pit' strategy: identifying the first state with a low $Q$-value on a partial rollout that starts from a high $Q$-value state and ends with an outcome reward of 0, i.e., a $Q$-value of 0.
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Line Chart: Scaling Laws: First Pit
### Overview
This image presents a line chart illustrating the relationship between Training Data Size and Accuracy for two different methods: "First pit" and "Random". The chart demonstrates how accuracy improves with increasing training data size for both methods, with "First pit" consistently achieving higher accuracy.
### Components/Axes
* **Title:** Scaling Laws: First Pit (positioned at the top-center)
* **X-axis:** Training Data Size (labeled at the bottom-center). Markers are: 64k, 128k, 256k, 512k, 1024k.
* **Y-axis:** Accuracy (labeled on the left-side). Scale ranges from approximately 0.25 to 0.55.
* **Legend:** Located in the top-left corner.
* "First pit" - represented by a solid orange line.
* "Random" - represented by a dashed teal line.
### Detailed Analysis
**First Pit (Orange Line):**
The orange line slopes upward, indicating a positive correlation between Training Data Size and Accuracy.
* At 64k Training Data Size, Accuracy is approximately 0.27.
* At 128k Training Data Size, Accuracy is approximately 0.34.
* At 256k Training Data Size, Accuracy is approximately 0.42.
* At 512k Training Data Size, Accuracy is approximately 0.47.
* At 1024k Training Data Size, Accuracy is approximately 0.54.
**Random (Teal Dashed Line):**
The teal dashed line also slopes upward, but less steeply than the orange line.
* At 64k Training Data Size, Accuracy is approximately 0.23.
* At 128k Training Data Size, Accuracy is approximately 0.31.
* At 256k Training Data Size, Accuracy is approximately 0.39.
* At 512k Training Data Size, Accuracy is approximately 0.45.
* At 1024k Training Data Size, Accuracy is approximately 0.52.
### Key Observations
* The "First pit" method consistently outperforms the "Random" method across all Training Data Sizes.
* The rate of accuracy improvement diminishes as the Training Data Size increases for both methods. The curve appears to be flattening out at larger data sizes.
* The difference in accuracy between the two methods is most pronounced at smaller Training Data Sizes (64k and 128k).
### Interpretation
The chart demonstrates the impact of training data size on model accuracy for two different approaches. The "First pit" method exhibits superior scaling behavior, meaning it benefits more significantly from increased training data compared to the "Random" method. This suggests that the "First pit" method is more data-efficient or better suited to leverage larger datasets. The diminishing returns observed at larger data sizes indicate a potential point of saturation, where further increases in training data yield progressively smaller improvements in accuracy. This information is valuable for resource allocation and model development, as it helps determine the optimal amount of training data needed to achieve desired performance levels. The chart implies that the "First pit" method is a more effective strategy for improving accuracy, especially when data is limited.
</details>
## E. Additional Experiments on RL Training with PAVs
Training details. As discussed in Section 5, the initialization for RL training is the RFT (rejection finetuning) checkpoint for the corresponding base policy. More specifically, we consider two base policies, Gemma 2B SFT and Gemma 9B SFT, where RL training is initialized with the policy obtained by further optimizing the base policy via rejection finetuning (RFT) (Yuan et al., 2023). This is done to improve coverage over states and actions during the initial iterations of RL training. For rejection finetuning, we train the SFT model (base policy) on all correct trajectories sampled when collecting seed rollouts for training PAVs (which predict the advantage of the same base policy). The training details for RFT remain the same as for SFT and are detailed in Appendix C. We use the REINFORCE (Sutton et al., 1999) algorithm to improve the base policy. RL training is run for 10,000 iterations for the 2B model and 5000 iterations for the 9B model. For both, we use the Adam optimizer with a learning rate of 1e-7 and a batch size of 32. The maximum response length is set to 512. For both, we use a learning rate schedule with a linear warmup for 10% of the total training iterations, followed by a cosine decay. Our policy gradient implementation also uses a token-level value function, initialized from the base policy itself and trained with a squared loss. The value function is only used as a baseline during RL training, i.e., at any state $s$ it predicts $\mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^\pi(s, a) + \alpha A^\mu(s, a)\right]$.

Most importantly, we use a validation set to identify a good choice of $\alpha$ in the effective reward (in Eq. 5). For Gemma 2B this value is 5.0, and for the 9B policy $\alpha = 3.0$ works best. As with the choice of $\alpha$ for test-time search, we find that most values of $\alpha$ in the range 0.5 to 6.0 improve performance over ORM-RL, to different degrees. Both policies are also optimized with KL regularization against the initial iterate of the RL policy, where the strength of the KL penalty is set to 0.001 for both.
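The role of the effective reward inside REINFORCE can be illustrated on a one-step softmax bandit (a toy sketch of ours; unlike the full setup described above, it omits the value-function baseline and KL penalty):

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(reward, a_mu, n_actions=5, alpha=1.0, lr=0.5, iters=2000, seed=0):
    """Minimal REINFORCE on a one-step bandit, using the effective reward
    r(a) + alpha * A^mu(a) as the learning signal (a toy stand-in for Eq. 5)."""
    rng = random.Random(seed)
    theta = [0.0] * n_actions
    for _ in range(iters):
        probs = softmax(theta)
        a = rng.choices(range(n_actions), weights=probs)[0]
        g = reward(a) + alpha * a_mu(a)           # effective reward
        for i in range(n_actions):                # grad log pi = 1[a=i] - pi(i)
            theta[i] += lr * g * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(theta)
```

With a sparse outcome reward alone, updates only occur on the rare successful samples; a dense $A^\mu$ term supplies gradient signal on the remaining samples as well, which is the mechanism behind the sample-efficiency gains reported in Sec. 5.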
## F. Theoretical Analysis: Complementary Provers Amplify Base Policy Improvement
In this section, we present the proofs of our theoretical results in the main paper (Sec. 3.4). We begin by describing some notation used in our proofs and the natural policy gradient algorithm we use for the policy update, followed by the proof of Theorem 3.1. We also present a simple application of this result in Proposition F.1. Our results in this section are in the tabular setting, with a softmax parameterization of the policies. Note that for the deterministic Markov decision process induced by the LLM, we are indeed in a tabular setting, where the set of states and actions is discrete, but large and growing exponentially in the sequence length.
Notation and preliminaries. We use $d^\pi_h$, $d^\mu_h$ to denote the distribution over states $s_h$ at time step $h$, starting from the initial state distribution given by the empirical distribution over the questions in the dataset $\mathcal{D}$, and following the base policy $\pi$ or the prover policy $\mu$, respectively. The term $d^\pi_s$ denotes the distribution over future states, starting from state $s$ and following policy $\pi$. Here, $s$ can be a state at any time $h \in [0, \ldots, H]$. For convenience, we overload the notation $d^\pi_s$ (the distribution over future states induced by a policy starting from state $s$), and use $d^\pi_\rho$ to denote the mixture distribution over $d^\pi_s$ with the starting state $s$ drawn from a distribution $\rho$, following policy $\pi$.

The term $Q^\pi(s_h, a_h)$ refers to the value of the state-action pair $(s_h, a_h)$, i.e., the expected return in the future, starting the policy from state $s_h$ and taking action $a_h$:
$$Q^{\pi}(s_h, a_h) \coloneqq \mathbb{E}_{a_{h+1}, \dots, a_H \sim \pi(\cdot \mid s_h, a_h)} \left[ R_{ex}\left( (a_1, \dots, a_H), y^*_x \right) \right].$$
Note that $y^*_x$ is known on the dataset $\mathcal{D}$, and the state $s_h$ contains the question $x$ as part of it. Similarly, we can define the value function $V^\pi(s_h)$ of a state $s_h$ as:

$$V^{\pi}(s_h) \coloneqq \mathbb{E}_{a_h \sim \pi(\cdot \mid s_h)} \left[ Q^{\pi}(s_h, a_h) \right].$$
The advantage function is then given by:
$$A ^ { \pi } ( s _ { h } , a _ { h } ) \colon = Q ^ { \pi } ( s _ { h } , a _ { h } ) - V ^ { \pi } ( s _ { h } ) .$$
The policy gradient algorithm we use to update the base policy iteratively is natural policy gradient (Kakade, 2001b), and we use $\pi_t$ to refer to the base policy iterate at time $t$ of this iterative algorithm.
Finally, we use $\mathcal{S}$ to denote the set of all states (prefixes) and $\mathcal{A}$ the set of all actions (steps) that the LLM can take at any state.
Parameterization of the base policy. We adopt the softmax parameterization for the base policy:
$$\pi _ { \theta } ( a | s _ { h } ) = \frac { \exp ( \theta _ { s _ { h } , a } ) } { \sum _ { a ^ { \prime } \in \mathcal { A } } \exp ( \theta _ { s _ { h } , a ^ { \prime } } ) } .$$
Here $\theta_{s_h, a} \in \mathbb{R}$ controls the probability of taking action $a$ at state $s_h$. The full set of parameters across all states and actions is denoted by $\theta \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$. Whenever clear from context, we overload the notation $\pi_t$ to denote both the policy at iterate $t$, i.e., $\pi_{\theta_t}$, and the parameter $\theta_t$ itself. E.g., the gradient operator $\nabla_{\pi_t}[\cdot]$ refers to $\nabla_{\theta_t}[\cdot]$.
Defining policy improvement. Let $\rho$ be a distribution over all states $\{s_h : h \in [0, 1, \ldots, H]\}$; then $\mathbb{E}_{s \sim \rho} V^\pi(s)$ and $\mathbb{E}_{s \sim \rho} V^\mu(s)$ give the expected value functions over states across time steps, measured under $\rho$, for policies $\pi$ and $\mu$ respectively. We assume that $d^\pi_h$ and $d^\mu_h$ are both absolutely continuous with respect to $\rho$, and use the expected value function over $\rho$ as the quantity we track before and after a policy update. A positive change in $\mathbb{E}_{s \sim \rho} V^\pi(s)$ implies a net positive improvement in the base policy. Thus, progress is made at each update of the policy when:
$$\mathbb { E } _ { s \sim \rho } V ^ { \pi _ { t + 1 } } ( { s } ) - \mathbb { E } _ { s \sim \rho } V ^ { \pi _ { t } } ( { s } ) > 0 .$$
## F.1. Natural Policy Gradient
The natural policy gradient (NPG) algorithm (Kakade, 2001a) defines a Fisher information matrix (induced by the policy), and performs gradient updates in the geometry induced by the following matrix:
$$\begin{array} { r } { F _ { \rho } ( \pi ) = \mathbb { E } _ { s \sim d _ { \rho } ^ { \pi } } \mathbb { E } _ { a \sim \pi ( \cdot | s ) } \left [ \nabla _ { \pi } \log \pi ( a | s ) \left ( \nabla _ { \pi } \log \pi ( a | s ) \right ) ^ { \top } \right ] } \end{array}$$
Typically, the NPG update performs gradient updates on the objective $\ell_{\text{ORM-RL}}$ in Eq. 3, but in our case the objective of interest is $\ell_{\text{PAV-RL}}$ in Eq. 4, and thus the natural gradient update is given by:
$$\pi_{t+1} = \pi_t + \gamma \cdot F_{\rho}(\pi_t)^{\dagger} \left. \nabla_{\pi} \ell_{\text{PAV-RL}}(\pi) \right|_{\pi = \pi_t},$$
where $F^{\dagger}$ denotes the Moore-Penrose pseudoinverse of the matrix $F$. We restrict to using the initial state distribution $\rho$ in our update rule, i.e., we restrict attention to states reachable from $\rho$, since $\rho$ governs the performance measure of interest when evaluating the expected value of a policy. Thus, without loss of generality, we can exclude states that are not reachable under $\rho$. Specifically, we restrict the MDP to the set $\{s_h : \exists\, \pi \text{ such that } d^\pi_\rho(s_h) > 0, \; h \in [0, \ldots, H]\}$. The scalar $\gamma > 0$ determines the learning rate.
## F.2. Useful Lemmas
Lemma F.1 (The performance difference lemma; Kakade and Langford (2002)). For all policies $\pi, \pi'$ and states $s$,
$$\begin{array} { r l r } { V ^ { \pi } ( s ) - V ^ { \pi ^ { \prime } } ( s ) } & { = } & { \mathbb { E } _ { s _ { h } \sim d _ { s } ^ { \pi } } \mathbb { E } _ { a _ { h } \sim \pi ( \cdot | s _ { h } ) } \left [ A ^ { \pi ^ { \prime } } ( s _ { h } , a _ { h } ) \right ] . } \end{array}$$
Proof. See proof of Lemma 6.1 in Kakade and Langford (2002).
$\square$
Lemma F.2 (Natural policy gradient update) . For the natural policy gradient in Eq. 12, the corresponding policy update is given by:
$$\pi ^ { t + 1 } ( a \, | \, s ) = \pi ^ { t } ( a \, | \, s ) \cdot \frac { \exp \left ( \gamma \cdot ( Q ^ { t } ( s , a ) + \alpha \cdot A ^ { \mu } ( s , a ) ) \right ) } { Z ^ { t } ( s ) } ,$$
where $Z^t(s) = \sum_{a' \in \mathcal{A}} \pi^t(a' \mid s) \exp\left( \gamma \cdot (Q^t(s, a') + \alpha \cdot A^{\mu}(s, a')) \right)$ is the normalization constant.
Proof. We use arguments similar to the proof of Lemma 15 in Agarwal et al. (2021), with the key difference of separately accounting for the term $A^\mu(s, a)$ in the effective reward. For the sake of completeness, we reproduce some of the derivation, accounting for the $A^\mu$ term in the process. We follow the compatible function approximation of Sutton et al. (1999) and Kakade (2001a). For a vector $w \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, we define the error function
$$L^{\pi}(w) = \mathbb{E}_{s \sim d_{\rho}^{\pi}} \, \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ w^{\top} \nabla_{\pi} \log \pi(a \mid s) - \left( A^{\pi}(s,a) + \alpha A^{\mu}(s,a) - \alpha \mathbb{E}_{a' \sim \pi_t(\cdot|s)} A^{\mu}(s,a') \right) \right]^2.$$
Let $w^*$ be the minimizer of $L^{\pi}(w)$ with the smallest $\ell_2$ norm. Then, by the definition of the Moore-Penrose pseudoinverse:

$$w^* = F_{\rho}(\pi)^{\dagger} \, \mathbb{E}_{s \sim d_{\rho}^{\pi}, a \sim \pi(\cdot|s)} \left[ \nabla_{\pi} \log \pi(a|s) \left( A^{\pi}(s,a) + \alpha A^{\mu}(s,a) - \alpha \mathbb{E}_{a' \sim \pi_t(\cdot|s)} A^{\mu}(s,a') \right) \right] = F_{\rho}(\pi)^{\dagger} \nabla_{\pi} \ell_{\text{PAV-RL}}(\pi).$$
In other words, $w^*$ is precisely proportional to the NPG update direction. Note further that for the softmax policy parameterization, we have:
$$w ^ { \top } \nabla \log \pi ( a | s ) = w _ { s , a } - \sum _ { a ^ { \prime } \in \mathcal { A } } w _ { s , a ^ { \prime } } \pi ( a ^ { \prime } | s ) .$$
Since $\sum_{a \in \mathcal{A}} \pi(a \mid s) A^{\pi}(s, a) = 0$, this immediately yields that:
$$L ^ { \pi } ( A ^ { \pi } ( s , a ) + \alpha A ^ { \mu } ( s , a ) ) = 0 .$$
However, this might not be the unique minimizer of $L^{\pi}$, which is problematic since $w^*$, as defined in terms of the Moore-Penrose pseudoinverse, is formally the smallest-norm solution to the least-squares problem, which $A^{\pi} + \alpha A^{\mu}$ may not be. However, given any vector $v \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$, let us consider solutions of the form $A^{\pi} + \alpha A^{\mu} + v$. Due to the form of the derivatives of the policy for the softmax parameterization, we have for any state $s$ and action $a$ such that $s$ is reachable under $\rho$,
$$v ^ { \top } \nabla _ { \pi } \log \pi ( a | s ) = \sum _ { a ^ { \prime } \in \mathcal { A } } ( v _ { s , a ^ { \prime } } 1 [ a = a ^ { \prime } ] - v _ { s , a ^ { \prime } } \pi ( a ^ { \prime } | s ) ) = v _ { s , a } - \sum _ { a ^ { \prime } \in \mathcal { A } } v _ { s , a ^ { \prime } } \pi ( a ^ { \prime } | s ) .$$
This is because $\pi$ is a stochastic policy with $\pi(a \mid s) > 0$ for all actions $a$ in each state $s$, so that if a state is reachable under $\rho$, it will also be reachable using $\pi$, and hence the zero-derivative conditions apply at each reachable state. For $A^{\pi} + \alpha A^{\mu} + v$ to also minimize $L^{\pi}$, we require $v^{\top} \nabla_{\pi} \log \pi(a \mid s) = 0$ for all $s, a$, so that $v_{s,a}$ is independent of the action and can be written as a constant $c_s$ for each $s$ by the above equality. Hence, the minimizer of $L^{\pi}(w)$ is determined up to a state-dependent offset, and
$$F _ { \rho } ( \theta ) ^ { \dagger } \nabla _ { \pi } \ell _ { P A V - R L } ( \pi ) = Q ^ { \pi } + \alpha A ^ { \mu } + v ,$$
where $v_{s,a} = c_s$ for some $c_s \in \mathbb{R}$ for each state $s$ and action $a$. Finally, we observe that this yields the updates
$$\theta _ { t + 1 } = \theta ^ { t } + \gamma ( Q ^ { \pi } + \alpha A ^ { \mu } + v ) \quad a n d \quad \pi _ { t + 1 } ( a | s ) = \pi _ { t } ( a | s ) \frac { \exp ( \gamma Q ^ { t } ( s , a ) + \gamma \alpha A ^ { \mu } ( s , a ) ) } { Z ^ { t } ( s ) } .$$
Owing to the normalization factor $Z^t(s)$, the state-dependent offsets cancel in the updates for $\pi$, which yields the statement of the lemma. $\square$
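A small numerical illustration of the lemma (our own sanity check, not from the paper): the multiplicative update normalizes to a valid distribution, and state-dependent offsets added to $Q^t$ cancel, exactly as argued above.

```python
import math

def npg_update(pi_t, q_t, a_mu, gamma=0.1, alpha=1.0):
    """One-state NPG update from Lemma F.2:
    pi^{t+1}(a) proportional to pi^t(a) * exp(gamma * (Q^t(a) + alpha * A^mu(a)))."""
    unnorm = [p * math.exp(gamma * (q + alpha * am))
              for p, q, am in zip(pi_t, q_t, a_mu)]
    z = sum(unnorm)  # Z^t(s), the normalization constant
    return [u / z for u in unnorm]
```

Shifting every $Q^t(s, a)$ by the same constant multiplies each unnormalized weight by a common factor, which $Z^t(s)$ divides out, so the updated policy is unchanged.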
## F.3. Proof of Theorem 3.1
Proof. For notational convenience, we use $V^t, V^{t+1}$ to denote the value functions $V^{\pi_t}, V^{\pi_{t+1}}$ for the policies $\pi_t, \pi_{t+1}$ at times $t$ and $t+1$ respectively. Similarly, we use $A^t, A^{t+1}$ for $A^{\pi_t}, A^{\pi_{t+1}}$. For the distribution over states $d^{\pi_{t+1}}_{\rho}$ induced by the policy $\pi_{t+1}$, starting from an initial distribution of states given by $\rho$, we simplify the notation and write $d^{t+1}_{\rho}$. Similarly, we use $d^{t}_{\rho}$ for $d^{\pi_t}_{\rho}$.
Next, for simplicity, we set $\alpha = 1$ in the natural policy gradient update in Lemma F.2. It is easy to see that the lower bound we show holds for any value of $\alpha > 0$, and the term in the lower bound scales linearly with $\alpha$. Note that $\alpha$ is not a free variable: it has to be $O(1)$, since as we increase the value of $\alpha$, we would have to correspondingly reduce $\gamma$ for our result to hold. For now, we fix $\alpha = 1$.
From the performance difference lemma (Lemma F.1), we can write:
$$\mathbb { E } _ { s \sim \rho } V ^ { t + 1 } ( s ) - \mathbb { E } _ { s \sim \rho } V ^ { t } ( s ) = \mathbb { E } _ { s \sim d _ { \rho } ^ { t + 1 } } \mathbb { E } _ { a \sim \pi ^ { t + 1 } ( a | s ) } \left [ A ^ { t } ( s , a ) \right ]$$
Next, from the natural policy gradient update in Lemma F.2, we can write $A^t(s, a)$ as:

$$A^{t}(s, a) = \frac{1}{\gamma} \cdot \log \left( \frac{\pi^{t+1}(a|s) \cdot Z^{t}(s)}{\pi^{t}(a|s)} \right) - A^{\mu}(s, a)$$
Substituting Eq. 18 into Eq. 17, we get:

$$\mathbb{E}_{s \sim \rho} V^{t+1}(s) - \mathbb{E}_{s \sim \rho} V^{t}(s) = \frac{1}{\gamma} \mathbb{E}_{s \sim d_{\rho}^{t+1}} \left[ \text{KL}\left( \pi^{t+1}(\cdot|s) \,\|\, \pi^{t}(\cdot|s) \right) \right] + \frac{1}{\gamma} \mathbb{E}_{s \sim d_{\rho}^{t+1}} \left[ \log Z^{t}(s) \right] - \mathbb{E}_{s \sim d_{\rho}^{t+1}} \mathbb{E}_{a \sim \pi^{t+1}(\cdot|s)} \left[ A^{\mu}(s, a) \right].$$
$$\log Z ^ { t } ( s ) = \log \mathbb { E } _ { a \sim \pi ^ { t } ( \cdot | s ) } \exp \left ( \gamma \cdot ( A ^ { t } ( s , a ) + A ^ { \mu } ( s , a ) ) \right )$$
Applying Jensen's inequality we get:
$$\log Z^{t}(s) \geq \gamma \cdot \mathbb{E}_{a \sim \pi^{t}(\cdot|s)} \left[ A^{t}(s, a) + A^{\mu}(s, a) \right]$$
$$= \gamma \cdot \mathbb{E}_{a \sim \pi^{t}(\cdot|s)} \left[ A^{\mu}(s, a) \right],$$
since $\mathbb{E}_{a \sim \pi^{t}(\cdot|s)}[A^{t}(s, a)] = 0$. Note that in Eq. 20 the KL term is always non-negative. Thus, we can lower bound the policy improvement:
$$\mathbb{E}_{s \sim \rho} V^{t+1}(s) - \mathbb{E}_{s \sim \rho} V^{t}(s) \geq \mathbb{E}_{s \sim d_{\rho}^{t+1}} \left\langle \pi^{t+1}(\cdot|s) - \pi^{t}(\cdot|s),\, A^{\mu}(s, \cdot) \right\rangle, \tag{21}$$
where the inner product is the standard Euclidean inner product, since our action space $\mathcal{A}$ is discrete. In what follows, we treat the distribution $\pi^{t+1}(\cdot|s)$ as a vector. Recall that for $\alpha = 1$, the NPG update in Lemma F.2 gives:

$$\pi^{t+1}(a|s) = \frac{\pi^{t}(a|s) \cdot \exp \left( \gamma \cdot (A^{t}(s, a) + A^{\mu}(s, a)) \right)}{Z^{t}(s)}. \tag{24}$$
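To make this step concrete, here is a minimal sketch (our own illustration, not the paper's code) of one NPG update $\pi^{t+1}(a|s) \propto \pi^{t}(a|s) \exp(\gamma (A^{t}(s,a) + A^{\mu}(s,a)))$ over a discrete action set; the function name and all numeric values are hypothetical:

```python
import math

def npg_update(pi_t, adv_t, adv_mu, gamma=0.1):
    """One natural-policy-gradient step (alpha = 1):
    pi_{t+1}(a|s) is proportional to pi_t(a|s) * exp(gamma * (A^t + A^mu)),
    normalized by the partition function Z^t(s)."""
    unnorm = [p * math.exp(gamma * (at + am))
              for p, at, am in zip(pi_t, adv_t, adv_mu)]
    z = sum(unnorm)  # Z^t(s)
    return [u / z for u in unnorm]

# Three actions at some state s (hypothetical advantages in [-1, 1]).
pi_t   = [0.5, 0.3, 0.2]
adv_t  = [0.2, -0.1, -0.35]   # A^t(s, a); note E_{pi_t}[A^t] = 0
adv_mu = [0.3, 0.1, -0.4]     # A^mu(s, a) under the prover

pi_next = npg_update(pi_t, adv_t, adv_mu)
print(pi_next)  # probability mass shifts toward actions with high A^t + A^mu
```

Note that the update only reweights the current policy multiplicatively, which is why the proof can compare $\pi^{t+1}(a|s) - \pi^{t}(a|s)$ action by action.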
We note that for $\gamma \ll 1$, $\exp \left( \gamma (A^{t}(s, a) + A^{\mu}(s, a)) \right) = \Theta \left( 1 + \gamma (A^{t}(s, a) + A^{\mu}(s, a)) \right)$, where terms of order $O(\gamma^2)$ are ignored. Based on this, for $\gamma \ll 1$, there exist constants $0 < C_1 < C_2$ such that:
$$\exp ( \gamma A ^ { t } ( s , a ) + \gamma A ^ { \mu } ( s , a ) ) - 1 \in [ C _ { 1 } \gamma ( A ^ { t } ( s , a ) + A ^ { \mu } ( s , a ) ) , C _ { 2 } \gamma ( A ^ { t } ( s , a ) + A ^ { \mu } ( s , a ) ) ]$$
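This claim is easy to sanity-check numerically (an illustrative aside of ours, not part of the proof): for small $x = \gamma (A^{t}(s,a) + A^{\mu}(s,a))$, the ratio $(\exp(x) - 1)/x$ stays within fixed positive constants close to 1:

```python
import math

# Illustrative check: with gamma << 1 and advantages in [-1, 1], the
# argument x = gamma * (A^t + A^mu) is small, e.g. |x| <= 0.1 here.
xs = [x / 1000.0 for x in range(-100, 101) if x != 0]
ratios = [(math.exp(x) - 1.0) / x for x in xs]

# (exp(x) - 1) / x -> 1 as x -> 0, so C1 = min and C2 = max are both
# close to 1, matching the sandwich exp(x) - 1 in [C1 * x, C2 * x].
C1, C2 = min(ratios), max(ratios)
print(C1, C2)  # both near 1 for small |x|
```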
Applying the above claim in Eq. 24 gives us:
$$\begin{aligned} \pi^{t+1}(a|s) - \pi^{t}(a|s) &\geq \pi^{t}(a|s) \left( \frac{1 + C_1 \gamma (A^{t}(s, a) + A^{\mu}(s, a))}{1 + C_2 \gamma \, \mathbb{E}_{a \sim \pi^{t}(\cdot|s)} \left[ A^{t}(s, a) + A^{\mu}(s, a) \right]} - 1 \right) \\ &\geq C_3 \gamma \, \frac{\pi^{t}(a|s) \left( A^{t}(s, a) + A^{\mu}(s, a) \right) - \pi^{t}(a|s) \, \mathbb{E}_{a \sim \pi^{t}(\cdot|s)} \left[ A^{\mu}(s, a) \right]}{1 + C_2 \gamma \, \mathbb{E}_{a \sim \pi^{t}(\cdot|s)} \left[ A^{t}(s, a) + A^{\mu}(s, a) \right]}, \end{aligned} \tag{25}$$
where we reused $\mathbb{E}_{a \sim \pi^{t}(\cdot|s)}[A^{t}(s, a)] = 0$, and $C_3 > 0$ is a constant.
We now plug the above lower bound into Eq. 21 to get the final lower bound on the policy improvement in Theorem 3.1. For this, we once again use the assumption that the learning rate $\gamma \ll 1$, which allows us to use $1 + C_2 \gamma \, \mathbb{E}_{a \sim \pi^{t}(\cdot|s)}[A^{t}(s, a) + A^{\mu}(s, a)] \geq C_4$ for some constant $C_4 > 0$. This is because, in our setting, the range of the advantages is $[-1, 1]$.
$$\begin{aligned} \mathbb{E}_{s \sim \rho} \left[ V^{t+1}(s) - V^{t}(s) \right] &\gtrsim \gamma \, \mathbb{E}_{s \sim d_{\rho}^{t+1}} \left[ \mathbb{E}_{a \sim \pi^{t}(a|s)} \left[ A^{\mu}(s, a) A^{t}(s, a) \right] \right] \\ &\quad + \gamma \, \mathbb{E}_{s \sim d_{\rho}^{t+1}} \left[ \mathbb{E}_{a \sim \pi^{t}(a|s)} \left[ \left( A^{\mu}(s, a) \right)^2 \right] \right] \\ &\quad - \gamma \, \mathbb{E}_{s \sim d_{\rho}^{t+1}} \left[ \left( \mathbb{E}_{a \sim \pi^{t}(a|s)} \left[ A^{\mu}(s, a) \right] \right)^2 \right] \end{aligned} \tag{27}$$
Now Eq. 27 gives us,
$$\mathbb{E}_{s \sim \rho} \left[ V^{t+1}(s) - V^{t}(s) \right] \gtrsim \gamma \, \mathbb{E}_{s \sim d_{\rho}^{t+1}} \left[ \mathbb{V}_{a \sim \pi^{t}(a|s)} \left[ A^{\mu}(s, a) \right] - \mathbb{E}_{a \sim \pi^{t}(a|s)} \left[ A^{\mu}(s, a) A^{t}(s, a) \right] \right]$$
Now, for the last step, we note that $d_{\rho}^{t+1}$ is component-wise larger than $\rho$, and this gives the final result:
$$\mathbb{E}_{s \sim \rho} \left[ V^{t+1}(s) - V^{t}(s) \right] \gtrsim \gamma \, \mathbb{E}_{s \sim \rho} \left[ \mathbb{V}_{a \sim \pi^{t}(a|s)} \left[ A^{\mu}(s, a) \right] - \mathbb{E}_{a \sim \pi^{t}(a|s)} \left[ A^{\mu}(s, a) A^{t}(s, a) \right] \right]$$
∎
## F.4. Discussion on Remark 3.1
First, we note that if the $Q$-value of a base policy $\pi$ at state-action pair $(s, a)$ is $Q^{\pi}(s, a)$, then for the prover $\mu$ set to the best-of-K policy $\mathrm{BoK}(\pi)$, the $Q$-value at the same state-action pair is:
$$Q^{\mu}(s, a) = 1 - (1 - Q^{\pi}(s, a))^{K} \tag{30}$$
This is because, in our setup, the final outcome of correctness given by $R_{ex}$ is a random variable taking values in $\{0, 1\}$, with expectation $Q^{\pi}(s, a)$, when completing a rollout starting from the prefix $(s, a)$. Thus, we can treat the outcome of the function $R_{ex}$ as a Bernoulli random variable. Next, we can simply compute the probability of sampling at least one correct answer out of $K$ attempts, which is also a Bernoulli random variable, with mean given by $Q^{\mu}(s, a)$ in Eq. 30.
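A minimal sketch (ours, not the paper's code) that checks Eq. 30 by simulating $K$ independent Bernoulli rollouts; the function names and parameters below are our own:

```python
import random

def bok_q_value(q_pi: float, k: int) -> float:
    """Closed form from Eq. 30: probability that at least one of K
    i.i.d. Bernoulli(q_pi) rollouts is correct."""
    return 1.0 - (1.0 - q_pi) ** k

def bok_q_monte_carlo(q_pi: float, k: int, trials: int = 200_000,
                      seed: int = 0) -> float:
    """Estimate the same quantity by simulating K rollouts per trial."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < q_pi for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials

q, K = 0.1, 8
exact = bok_q_value(q, K)         # 1 - 0.9**8, roughly 0.5695
approx = bok_q_monte_carlo(q, K)
print(exact, approx)              # the two agree to about 1e-2
```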
Now for Remark 3.1, we observe that whenever $Q^{\pi}(s, a) \ll 1$, e.g., when $Q^{\pi}(s, a) = O(1/K)$ for all values of $(s, a)$, we can apply a Taylor approximation of $Q^{\mu}(s, a)$ around $0$, and note that $Q^{\mu}(s, a) = \Theta(K \cdot Q^{\pi}(s, a))$. Next, note the following calculation for the first term (distinguishability) in Theorem 3.1:
$$\begin{aligned} \mathbb{V}_{\pi} [A^{\mu}(s, a)] &= \mathbb{V}_{\pi} [Q^{\mu}(s, a)] \\ &= \mathbb{V}_{\pi} \left[ 1 - (1 - Q^{\pi}(s, a))^{K} \right] \\ &= \mathbb{V}_{\pi} \left[ (1 - Q^{\pi}(s, a))^{K} \right] \\ &= \Theta \left( K^2 \cdot \mathbb{V}_{\pi} \left[ Q^{\pi}(s, a) \right] \right), \end{aligned}$$

where the last step uses the Taylor approximation $(1 - Q^{\pi}(s, a))^{K} = 1 - \Theta(K \cdot Q^{\pi}(s, a))$.
This means that the first term in Theorem 3.1, which measures 'distinguishability', now increases by a factor of $K^2$. Similarly, we can see that the term which measures 'misalignment' can change in magnitude by at most a factor of $O(K)$, since the misalignment term is linear in $Q^{\mu}$. These two observations combined lead us to the conclusion in Remark 3.1.
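A small numeric illustration (ours, with hypothetical per-action $Q$-values) of this $K^2$ blow-up: when $Q^{\pi}(s, a) = O(1/K)$, the variance of $Q^{\mu}$ across actions is on the order of $K^2$ times that of $Q^{\pi}$:

```python
# Hypothetical per-action Q-values under the base policy, all O(1/K).
K = 50
q_pi = [0.0004, 0.0020, 0.0032, 0.0008, 0.0016]  # Q^pi(s, a), 5 actions
q_mu = [1 - (1 - q) ** K for q in q_pi]          # Q^mu under BoK(pi), Eq. 30

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Variance ratio is on the order of K**2 = 2500 (somewhat below it,
# because the map q -> 1 - (1 - q)**K is concave rather than linear).
ratio = variance(q_mu) / variance(q_pi)
print(ratio)
```

Here the variance is taken uniformly over the listed actions, rather than weighted by $\pi$, which is enough to illustrate the scaling.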
## F.5. Improving a Stronger Base Policy with a Weaker Prover Policy
In Proposition F.1, we consider the case where the prover $\mu$ and the base policy $\pi_t$ differ in performance, as measured under the distribution of states $\rho$, in the following way:
$$\mathbb{E}_{s \sim \rho} \left[ |V^{\mu}(s) - V^{\pi_t}(s)| \right] = \eta.$$
Next, whenever the prover's preference over actions is complementary to the base policy's, by a factor of $\eta$, i.e.,
$$\mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} \left[ |Q^{\pi_t}(s, a) - Q^{\mu}(s, a)| \right] = \Theta(\eta),$$
then the variance of $A^{\mu}$ or $A^{\pi_t}$ under $\pi_t$ should scale as $\eta^2$.
Thus, we see that when $\pi_t$ fails to distinguish actions (i.e., $\mathbb{E}_{s \sim \rho} \mathbb{V}_{a \sim \pi_t} [A^{\pi_t}(s, a)]$ is small), then regardless of the strength of the prover policy $\mu$, as long as it is sufficiently complementary to $\pi_t$, the prover policy induces an improvement in the base policy that scales as $\eta^2$.
Proposition F.1 (Complementary $\mu$ boosts improvements in $\pi$). Under the distribution over states given by $\rho$, let the prover $\mu$ and the base policy at iterate $t$, $\pi_t$, differ in absolute performance, i.e.,
$$\mathbb{E}_{s \sim \rho} \left[ |V^{\mu}(s) - V^{\pi_t}(s)| \right] = \eta.$$
When $\mathbb{E}_{s \sim \rho} \mathbb{V}_{a \sim \pi_t} [A^{\pi_t}(s, a)] < \mathbb{E}_{s \sim \rho} \mathbb{V}_{a \sim \pi_t} [A^{\mu}(s, a)]$, and $\mu$ is complementary to $\pi_t$, i.e.,
$$\mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} \left[ |Q^{\pi_t}(s, a) - Q^{\mu}(s, a)| \right] = \Theta(\eta),$$
then $\mathbb{E}_{s \sim \rho} \left[ V^{\pi_{t+1}}(s) - V^{\pi_t}(s) \right] \gtrsim \eta^2$.
Proof. We begin by proving an upper bound on the disagreement between prover and base policy:
$$\begin{aligned} &\mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} |Q^{\mu}(s, a) - Q^{\pi_t}(s, a)| \\ &\leq \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} |Q^{\mu}(s, a) - V^{\mu}(s)| + \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} |Q^{\pi_t}(s, a) - V^{\mu}(s)| \\ &\leq \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} |Q^{\mu}(s, a) - V^{\mu}(s)| + \mathbb{E}_{s \sim \rho} \left[ |V^{\mu}(s) - V^{\pi_t}(s)| \right] + \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} |Q^{\pi_t}(s, a) - V^{\pi_t}(s)| \\ &\leq \eta + \mathbb{E}_{s \sim \rho} \sqrt{\mathbb{V}_{\pi_t} [A^{\mu}(s, a)]} + \mathbb{E}_{s \sim \rho} \sqrt{\mathbb{V}_{\pi_t} [A^{\pi_t}(s, a)]}, \end{aligned}$$
where the last inequality uses Cauchy-Schwarz. Next, we apply Jensen's inequality to the terms inside the square roots, together with the condition that $\mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_t} |Q^{\pi_t}(s, a) - Q^{\mu}(s, a)| = \Omega(\eta)$, to conclude:
$$\sqrt{\mathbb{E}_{s \sim \rho} \mathbb{V}_{\pi_t} [A^{\mu}(s, a)]} \gtrsim \eta,$$
whenever $\mathbb{E}_{s \sim \rho} \mathbb{V}_{a \sim \pi_t} [A^{\pi_t}(s, a)] < \mathbb{E}_{s \sim \rho} \mathbb{V}_{a \sim \pi_t} [A^{\mu}(s, a)]$. Having lower bounded 'distinguishability', it is easy to see that a similar derivation would upper bound the magnitude of the misalignment term by $O(\eta^2)$. Invoking the result in Theorem 3.1 yields:
$$\mathbb{E}_{s \sim \rho} \left[ V^{\pi_{t+1}}(s) - V^{\pi_t}(s) \right] \gtrsim \eta^2.$$
## G. Examples Generated by Base Policy Trained on Rewards $R_{ex} + \alpha Q^{\mu}$
When we train the base policy with reinforcement learning where the reward is $R_{ex} + \alpha Q^{\mu}$, instead of the effective reward $R_{ex} + \alpha A^{\mu}$, we find that policy training converges to contrived solutions with degenerate failures.
For example, we see in multiple examples (like the four below) that for every question, the model learns to output "REPHRASE THE PROBLEM" as the first step. This is because, under $Q^{\mu}$, this trivial step achieves a positive reward. While there exists a better solution where the policy outputs the correct answer and achieves a higher reward, training fails to escape this local optimum and find the better solution. Across multiple training runs, we observed similar trends when training with $R_{ex} + \alpha Q^{\mu}$. Note that in the idealized update, where we observe the advantages $A^{\mu}$ or the value function $Q^{\mu}$ on all actions at a state (e.g., in the NPG update in Sec. 3.4), using $R_{ex} + \alpha Q^{\mu}$ is equivalent to $R_{ex} + \alpha A^{\mu}$. In practice, however, $A^{\mu}$ offers significant benefits in the form of variance reduction in the policy gradient iterates. In this instance, it does so by not assigning reward to steps like "REPHRASE THE PROBLEM", since the likelihood of a prover solving the problem before or after this step is expected to be roughly the same under any choice of the prover policy (i.e., the prover makes no progress).
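To make this concrete, the following toy sketch (ours, with made-up prover success probabilities) shows why $A^{\mu}$ assigns zero reward to a no-progress step that $Q^{\mu}$ still rewards:

```python
# Toy numbers (hypothetical): the prover's probability of eventually
# solving the problem from the prefix before and after a candidate step.
V_mu_before = 0.40                      # V^mu(s): prover success from s

steps = {
    "REPHRASE THE PROBLEM": 0.40,       # no progress: prover odds unchanged
    "apply combination formula": 0.70,  # real progress
    "introduce a wrong identity": 0.15, # negative progress
}

for step, q_mu in steps.items():
    advantage = q_mu - V_mu_before      # A^mu(s, a) = Q^mu(s, a) - V^mu(s)
    # Q^mu rewards even the no-progress step (0.40 > 0), while A^mu
    # gives it zero reward, removing this degenerate local optimum.
    print(f"{step!r}: Q^mu={q_mu:.2f}, A^mu={advantage:+.2f}")
```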
## Example 1
```
Q: My club has 25 members. In how many ways can I choose members to form a 4-person executive committee?
A:
>>>
REPHRASE THE PROBLEM
WE NEED TO FIND THE NUMBER OF WAYS TO CHOOSE 4 MEMBERS OUT OF 25 FOR THE EXECUTIVE COMMITTEE SO WE CAN USE THE COMBINATION FORMULA WHICH IS
N C R = N! / (R! * (N - R)!) WHERE N IS THE TOTAL NUMBER OF MEMBERS AND R IS THE NUMBER OF MEMBERS WE CHOOSE SO IN THIS CASE N = 25 AND R = 4 SO WE PLUG THEM IN AND SIMPLIFY 25 C 4 = 25! / (4! * (25 - 4)!) = (25 * 24 * 23 * 22) / (4 * 3 * 2 * 1) = 104005 # Answer
```
## Example 2
```
Q: We have a triangle $\triangle ABC$ and a point $K$ on $BC$ such that $AK$ is an altitude of $\triangle ABC$. If $AC = 10,$ $BK = 7$, and $BC = 13,$ then what is the area of $\triangle ABC$?
A:
>>>
REPHRASE THE PROBLEM
WE NEED TO FIND THE AREA OF $\triangle ABC$ SO WE CAN USE THE FORMULA $A = \frac{1}{2}bh$ WHERE $b$ IS THE BASE AND $h$ IS THE HEIGHT SO SINCE $AK$ IS AN ALTITUDE, THAT MEANS $AK \perp BC$ SO WE CAN USE $BC$ AS THE BASE AND $AK$ AS THE HEIGHT SO NOW WE NEED TO FIND $AK$ SO WE CAN USE THE PYTHAGORE THEOREM ON $\triangle AKC$ SINCE $AC = 10$ # Answer
```
## Example 3
```
Q: Suppose that $n, n+1, n+2, n+3, n+4$ are five consecutive integers.
Determine a simplified expression for the sum of these five consecutive integers.
A:
>>>
REPHRASE THE PROBLEM
WE NEED TO FIND THE SUM OF THESE FIVE CONSECUTIVE INTEGERS SO WE CAN USE THE FORMULA
$S_n = \frac{n}{2}(a_1 + a_n)$ WHERE $S_n$ IS THE SUM, $n$ IS THE NUMBER OF TERMS,
$a_1$ IS THE FIRST TERM, AND $a_n$ IS THE LAST TERM SO IN THIS CASE $n = 5, a_1 = n,
a_5 = n + 4$
# Answer
5n + 10
```
## Example 4
```
Q: What is the value of $x^2+y^2-z^2+2xy$ when $x=-3$, $y=5$, and $z=-4$?
A:
>>>
REPHRASE THE PROBLEM
WE NEED TO FIND THE VALUE OF $x^2+y^2-z^2+2xy$ WHEN $x=-3$, $y=5$, AND $z=-4$
SO WE JUST PLUG IN THE VALUES AND SIMPLIFY
# Answer
-2
```