# Improve Mathematical Reasoning in Language Models by Automated Process Supervision
**Authors**: Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
> Google DeepMind
> Google
Abstract
Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train Process Reward Models (PRMs). This fully automated process supervision, alongside the weighted self-consistency algorithm, is able to enhance LLMs’ math reasoning performance. We improved the success rates of the instruction-tuned Gemini Pro model from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K. Similarly, we boosted the success rates of Gemma2 27B from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K. The entire process operates without any human intervention or supervision, making our method both financially and computationally cost-effective compared to existing methods.
1 Introduction
Despite the impressive advancements achieved by scaling Large Language Models (LLMs) on established benchmarks (Wei et al., 2022a), cultivating more sophisticated reasoning capabilities, particularly in domains like mathematical problem-solving and code generation, remains an active research area. Chain-of-thought (CoT) generation is crucial for these reasoning tasks, as it decomposes complex problems into intermediate steps, mirroring human reasoning processes. Prompting LLMs with CoT examples (Wei et al., 2022b) and fine-tuning them on question-CoT solution pairs (Perez et al., 2021; Ouyang et al., 2022) have proven effective, with the latter demonstrating superior performance. Furthermore, the advent of Reinforcement Learning with Human Feedback (RLHF; Ouyang et al., 2022) has enabled the alignment of LLM behaviors with human preferences through reward models, significantly enhancing model capabilities.
Figure 1: Example tree structure built with our proposed OmegaPRM algorithm. Each node in the tree indicates a state of a partial chain-of-thought solution, with information including the accuracy of rollouts and other statistics. Each edge indicates an action, i.e., a reasoning step, taken from the previous state. Yellow edges indicate correct steps and blue edges indicate incorrect ones.
Beyond prompting and further training, developing effective decoding strategies is another crucial avenue for improvement. Self-consistency decoding (Wang et al., 2023) leverages multiple reasoning paths to arrive at a voted answer. Incorporating a verifier, such as an off-the-shelf LLM (Huang et al., 2022; Luo et al., 2023), can further guide LLMs in reasoning tasks by providing a feedback loop to verify final answers, identify errors, and suggest corrections. However, the gain of such approaches remains limited for complex multi-step reasoning problems. Reward models offer a promising alternative to verifiers, enabling the reranking of candidate outcomes based on reward signals to ensure higher accuracy. Two primary types of reward models have emerged: Outcome Reward Models (ORMs; Yu et al., 2024; Cobbe et al., 2021), which provide feedback only at the end of the problem-solving process, and Process Reward Models (PRMs; Li et al., 2023; Uesato et al., 2022; Lightman et al., 2023), which offer granular feedback at each reasoning step. PRMs have demonstrated superior effectiveness for complex reasoning tasks by providing such fine-grained supervision.
The primary bottleneck in developing PRMs lies in obtaining process supervision signals, which require supervised labels for each reasoning step. Current approaches rely heavily on costly and labor-intensive human annotation (Uesato et al., 2022; Lightman et al., 2023). Automating this process is crucial for scalability and efficiency. While recent efforts using per-step Monte Carlo estimation have shown promise (Wang et al., 2024a, b), their efficiency remains limited due to the vast search space. To address this challenge, we introduce OmegaPRM, a novel divide-and-conquer Monte Carlo Tree Search (MCTS) algorithm inspired by AlphaGo Zero (Silver et al., 2017) for automated process supervision data collection. For each question, we build a Monte Carlo Tree, as shown in Fig. 1, with the details explained in Section 3.3. This algorithm enables efficient collection of over 1.5 million high-quality process annotations without human intervention. Our PRM, trained on this dataset and combined with weighted self-consistency decoding, significantly improves the performance of instruction-tuned Gemini Pro from 51% to 69.4% on MATH500 (Lightman et al., 2023) and from 86.4% to 93.6% on GSM8K (Cobbe et al., 2021). We also boosted the success rates of Gemma2 27B from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K.
Our main contributions are as follows:
- We propose a novel divide-and-conquer style Monte Carlo Tree Search algorithm for automated process supervision data generation.
- The algorithm enables the efficient generation of over 1.5 million process supervision annotations, representing the largest and highest quality dataset of its kind to date. Additionally, the entire process operates without any human annotation, making our method both financially and computationally cost-effective.
- We combine our verifier with weighted self-consistency to further boost the performance of LLM reasoning, significantly improving the success rates of instruction-tuned Gemini Pro from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K. For Gemma2 27B, we also improved the success rates from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K.
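Weighted self-consistency combines majority voting over sampled solutions with verifier scores. As a minimal sketch of the idea (the exact aggregation rule is not spelled out here; summing verifier scores per distinct final answer is one common choice, and the function name is illustrative):

```python
from collections import defaultdict

def weighted_self_consistency(candidates):
    """Pick the final answer whose candidate solutions accumulate the
    highest total verifier score.

    candidates: list of (final_answer, verifier_score) pairs, one pair
    per sampled solution.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    # The answer backed by the largest accumulated score wins.
    return max(totals, key=totals.get)
```

Plain self-consistency is the special case where every score is 1, i.e., an unweighted majority vote.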
2 Related Work
Improving mathematical reasoning ability of LLMs.
Mathematical reasoning poses significant challenges for LLMs, and it is one of the key tasks for evaluating the reasoning ability of LLMs. With a huge amount of math problems in pretraining datasets, the pretrained LLMs (OpenAI, 2023; Gemini Team et al., 2024; Touvron et al., 2023) are able to solve simple problems, yet struggle with more complicated reasoning. To overcome that, the chain-of-thought (Wei et al., 2022b; Fu et al., 2023) type prompting algorithms were proposed. These techniques were effective in improving the performance of LLMs on reasoning tasks without modifying the model parameters. The performance was further improved by supervised fine-tuning (SFT; Cobbe et al., 2021; Liu et al., 2024; Yu et al., 2023) with high quality question-response pairs with full CoT reasoning steps.
Application of reward models in mathematical reasoning of LLMs.
To further improve the LLM’s math reasoning performance, verifiers can help to rank and select the best answer when multiple rollouts are available. Several works (Huang et al., 2022; Luo et al., 2023) have shown that using LLM as verifier is not suitable for math reasoning. For trained verifiers, two types of reward models are commonly used: Outcome Reward Model (ORM) and Process Reward Model (PRM). Both have shown performance boost on math reasoning over self-consistency (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023), yet evidence has shown that PRM outperforms ORM (Lightman et al., 2023; Wang et al., 2024a). Generating high quality process supervision data is the key for training PRM, besides expensive human annotation (Lightman et al., 2023), Math-Shepherd (Wang et al., 2024a) and MiPS (Wang et al., 2024b) explored Monte Carlo estimation to automate the data collection process with human involvement, and both observed large performance gain. Our work shared the essence with MiPS and Math-Shepherd, but we explore further in collecting the process data using MCTS.
Monte Carlo Tree Search (MCTS).
MCTS (Świechowski et al., 2021) has been widely adopted in reinforcement learning (RL). AlphaGo (Silver et al., 2016) and AlphaGo Zero (Silver et al., 2017) were able to achieve great performance with MCTS and deep reinforcement learning. For LLMs, there are planning algorithms that fall in the category of tree search, such as Tree-of-Thought (Yao et al., 2023) and Reasoning-via-Planing (Hao et al., 2023). Recently, utilizing tree-like decoding to find the best output during the inference-time has become a hot topic to explore as well, multiple works (Feng et al., 2023; Ma et al., 2023; Zhang et al., 2024; Tian et al., 2024; Feng et al., 2024; Kang et al., 2024) have observed improvements in reasoning tasks.
3 Methods
3.1 Process Supervision
Process supervision is a concept proposed to differentiate from outcome supervision. The reward models trained with these objectives are termed Process Reward Models (PRMs) and Outcome Reward Models (ORMs), respectively. In the ORM framework, given a query $q$ (e.g., a mathematical problem) and its corresponding response $x$ (e.g., a model-generated solution), an ORM is trained to predict the correctness of the final answer within the response. Formally, an ORM takes $q$ and $x$ and outputs the probability $p=\mathrm{ORM}(q,x)$ that the final answer in the response is correct. With a training set of question-answer pairs available, an ORM can be trained by sampling outputs from a policy model (e.g., a pretrained or fine-tuned LLM) using the questions and obtaining the correctness labels by comparing these outputs with the golden answers.
In contrast, a PRM is trained to predict the correctness of each intermediate step $x_{t}$ in the solution. Formally, $p_{t}=\mathrm{PRM}([q,x_{1:t-1}],x_{t})$ , where $x_{1:i}=[x_{1},...,x_{i}]$ represents the first $i$ steps in the solution. This provides more precise and fine-grained feedback than ORMs, as it identifies the exact location of errors. Process supervision has also been shown to mitigate incorrect reasoning in the domain of mathematical problem solving. Despite these advantages, obtaining the intermediate signal for each step’s correctness to train such a PRM is non-trivial. Previous work (Lightman et al., 2023) has relied on hiring domain experts to manually annotate the labels, which is expensive and difficult to scale.
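Given a solution whose steps carry correctness labels, turning it into PRM training triplets is mechanical. A minimal sketch, assuming supervision stops at the first incorrect step as in Lightman et al. (2023) (the function and field layout are illustrative, not the paper's actual data format):

```python
def prm_triplets(question, steps, labels):
    """Build (context, step, label) training triplets for a PRM.

    steps:  solution steps x_1..x_M
    labels: parallel list of booleans; True means the step is correct.
    Labeling stops after the first incorrect step, since later steps
    build on a flawed premise.
    """
    triplets = []
    context = [question]  # [q, x_1, ..., x_{t-1}] for the current step
    for step, correct in zip(steps, labels):
        triplets.append((list(context), step, correct))
        if not correct:
            break  # first error reached; stop labeling
        context.append(step)
    return triplets
```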
3.2 Process Annotation with Monte Carlo Method
In two closely related works, Math-Shepherd (Wang et al., 2024a) and MiPS (Wang et al., 2024b), the authors propose an automatic annotation approach to obtain process supervision signals using the Monte Carlo method. Specifically, a “completer” policy is established that takes a question $q$ and a prefix solution comprising the first $t$ steps $x_{1:t}$ , and outputs the completion — often referred to as a “rollout” in reinforcement learning — of the subsequent steps until the final answer is reached. As shown in Fig. 2 (a), for any step of a solution, the completer policy can be used to randomly sample $k$ rollouts from that step. The final answers of these rollouts are compared to the golden answer, providing $k$ correctness labels corresponding to the $k$ rollouts. Subsequently, the ratio of correct rollouts to total rollouts from the $t$ -th step, as represented in Eq. 1, estimates the “correctness level” of the prefix steps up to $t$ . Setting aside false positives, in the logical reasoning scenario $x_{1:t}$ should be considered correct as long as any of the rollouts is correct.
$$
c_{t}=\mathrm{MonteCarlo}(q,x_{1:t})=\frac{\mathrm{num}(\text{correct rollouts from $t$-th step})}{\mathrm{num}(\text{total rollouts from $t$-th step})} \tag{1}
$$
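Eq. 1 can be estimated directly by sampling completions. A minimal sketch, where `completer` stands in for the completer policy — an assumed callable that samples one rollout from the prefix and returns its final answer:

```python
def monte_carlo_estimate(question, prefix_steps, completer, golden_answer, k=8):
    """Estimate c_t (Eq. 1): the fraction of k sampled rollouts from
    the prefix x_{1:t} whose final answer matches the golden answer."""
    correct = 0
    for _ in range(k):
        final_answer = completer(question, prefix_steps)  # one sampled rollout
        if final_answer == golden_answer:
            correct += 1
    return correct / k
```

With the example of Fig. 2 (a), two correct rollouts out of three would yield an estimate of about 0.67.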
Taking a step forward, a straightforward strategy to annotate the correctness of intermediate steps in a solution is to perform rollouts for every step from the beginning to the end, as done in both Math-Shepherd and MiPS. However, this brute-force approach requires a large number of policy calls. To optimize annotation efficiency, we propose a binary-search-based Monte Carlo estimation.
Monte Carlo estimation using binary search. As suggested by Lightman et al. (2023), supervising up to the first incorrect step in a solution is sufficient to train a PRM. Therefore, our objective is to locate the first error efficiently. We achieve this by repeatedly dividing the solution and performing rollouts. Assuming no false positives or negatives, we start with a solution containing a potential error and split it at the midpoint $m$ . We then perform rollouts for $x_{1:m}$ with two possible outcomes: (1) $c_{m}>0$ , indicating that the first half of the solution is correct, as at least one correct answer can be rolled out from the $m$ -th step, and thus the error is in the second half; (2) $c_{m}=0$ , indicating the error is very likely in the first half, as none of the rollouts from the $m$ -th step is correct. This narrows down the error location to either the first or the second half of the solution. As shown in Fig. 2 (b), by repeating this process on the erroneous half until the partial solution is sufficiently small (i.e., short enough to be considered a single step), we can locate the first error with a time complexity of $O(k\log M)$ rather than the $O(kM)$ of the brute-force setting, where $M$ is the total number of steps in the original solution.
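Under the stated no-false-positive/negative assumption, the search reduces to a standard binary search over step indices. A minimal sketch, where `mc(t)` is an assumed callable returning the Monte Carlo estimate of Eq. 1 for the prefix $x_{1:t}$:

```python
def first_error_step(num_steps, mc):
    """Locate the 1-based index of the first incorrect step.

    mc(t) -> Monte Carlo estimate for prefix x_{1:t}; mc(t) == 0 means
    the first error lies at or before step t.  Assumes the full solution
    is known to contain an error, i.e., mc(num_steps) == 0.
    """
    lo, hi = 1, num_steps        # invariant: first error is in steps lo..hi
    while lo < hi:
        mid = (lo + hi) // 2
        if mc(mid) > 0:          # some rollout from step mid succeeds:
            lo = mid + 1         # prefix is correct, error is in second half
        else:                    # no correct rollout:
            hi = mid             # error is at or before mid
    return lo
```

Each loop iteration costs $k$ rollouts, so the total cost is $O(k\log M)$ as stated.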
3.3 Monte Carlo Tree Search
Although binary search improves the efficiency of locating the first error in a solution, we are still not fully utilizing policy calls as rollouts are simply discarded after stepwise Monte Carlo estimation. In practice, it is necessary to collect multiple PRM training examples (a.k.a., triplets of question, partial solution and correctness label) for a question (Lightman et al., 2023; Wang et al., 2024a). Instead of starting from scratch each time, we can store all rollouts during the process and conduct binary searches from any of these rollouts whenever we need to collect a new example. This approach allows for triplets with the same solution prefix but different completions and error locations. Such reasoning structures can be represented as a tree, as described in previous work like Tree of Thought (Yao et al., 2023).
Formally, consider a state-action tree representing detailed reasoning paths for a question, where a state $s$ contains the question and all preceding reasoning steps, and an action $a$ is a potential subsequent step from a specific state. The root state is the question without any reasoning steps: $s_{\text{root}}=q$ . The policy can be directly modeled by a language model as $\pi(a|s)=\mathrm{LM}(a|s)$ , and the state transition function is simply the concatenation of the preceding steps and the action step, i.e., $s^{\prime}=\mathrm{Concatenate}(s,a)$ .
Collecting PRM training examples for a question can now be formulated as constructing such a state-action tree. This is reminiscent of the classic Monte Carlo Tree Search (MCTS) algorithm, which has been successful in many deep reinforcement learning applications (Silver et al., 2016, 2017). However, there are some key differences when using a language model as the policy. First, MCTS typically handles an environment with a finite action space, such as the game of Go, which has fewer than $361$ possible actions per state (Silver et al., 2017). In contrast, an LM policy has an infinite action space, as it can generate an unlimited number of distinct actions (sequences of tokens) given a prompt. In practice, we use temperature sampling to generate a fixed number of $k$ completions for a prompt, treating the group of $k$ actions as an approximate action space. Second, an LM policy can sample a full rollout until the termination state (i.e., reaching the final answer) without much more overhead than generating a single step, enabling the possibility of binary search. Consequently, we propose an adaptation of the MCTS algorithm named OmegaPRM, primarily based on the one introduced in AlphaGo (Silver et al., 2016), but with modifications to better accommodate the scenario of PRM training data collection. We describe the algorithm details below.
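The approximate action space can be sketched as follows, where `sample_step` is an assumed callable that draws one next step from the LM at a given temperature; merging duplicate samples into a single action is an illustrative detail, not necessarily the paper's exact procedure:

```python
def approximate_actions(state, sample_step, k=8, temperature=1.0):
    """Approximate the LM policy's infinite action space with the set of
    k sampled next steps; duplicate samples are merged so each distinct
    step appears once, preserving first-seen order."""
    actions = []
    seen = set()
    for _ in range(k):
        step = sample_step(state, temperature)
        if step not in seen:
            seen.add(step)
            actions.append(step)
    return actions
```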
Tree Structure.
Each node $s$ in the tree contains the question $q$ and prefix solution $x_{1:t}$ , together with all previous rollouts $\{(s,r_{i})\}_{i=1}^{k}$ from the state. Each edge $(s,a)$ is either a single step or a sequence of consecutive steps from the node $s$ . The nodes also store a set of statistics,
$$
\{N(s),\mathrm{MC}(s),Q(s,r)\},
$$
where $N(s)$ denotes the visit count of a state, $\mathrm{MC}(s)$ represents the Monte Carlo estimation of a state as specified in Eq. 1, and $Q(s,r)$ is a state-rollout value function that is correlated to the chance of selecting a rollout during the selection phase of tree traversal. Specifically,
(a) Monte Carlo estimation of a prefix solution.
(b) Error locating using binary search.
(c) Three stages in an iteration of the MCTS process.
Figure 2: Illustration of the process supervision rollouts, Monte Carlo estimation using binary search, and the MCTS process. (a) An example of Monte Carlo estimation of a prefix solution. Two of the three rollouts are correct, producing the Monte Carlo estimation $\mathrm{MC}(q,x_{1:t})=2/3\approx 0.67$. (b) An example of error locating using binary search. The first error is located at the $7^{\text{th}}$ step after three divide-and-rollouts, where the rollout positions are indicated by the vertical dashed lines. (c) The MCTS process. The dotted lines in the Select stage represent the available rollouts for binary search. The bold colored edges represent steps with correctness estimations: yellow indicates a correct step, i.e., one whose preceding state $s$ satisfies $\mathrm{MC}(s)>0$, and blue indicates an incorrect step, i.e., one with $\mathrm{MC}(s)=0$. The number of dashes in each colored edge indicates the number of steps.
$$
Q(s,r)=\alpha^{1-\mathrm{MC}(s)}\cdot\beta^{\frac{\mathrm{len}(r)}{L}}, \tag{2}
$$
where $\alpha,\beta\in(0,1]$ and $L>0$ are constant hyperparameters, and $\mathrm{len}(r)$ denotes the length of a rollout in tokens. $Q(s,r)$ determines how likely a rollout is to be chosen in each iteration, and our goal is a heuristic that selects the most valuable rollout to search next. The most straightforward strategy, uniformly sampling from the rollout candidates generated by the policy in previous rounds, is clearly ineffective. Lightman et al. (2023) suggest surfacing convincing wrong-answer solutions to annotators during labeling. Inspired by this, we prioritize supposed-to-be-correct wrong-answer rollouts during selection. We use supposed-to-be-correct to refer to a state whose Monte Carlo estimation $\mathrm{MC}(s)$ is close to $1$, and wrong-answer to indicate that the rollout $r$ reaches an incorrect final answer. Such a rollout contains mistakes the policy should have avoided given its high $\mathrm{MC}(s)$, and we expect a PRM that learns to detect errors in these rollouts to be more useful in correcting the policy's mistakes. The first factor in Eq. 2, $\alpha^{1-\mathrm{MC}(s)}$, grows as $\mathrm{MC}(s)$ approaches $1$. The second factor, $\beta^{\frac{\mathrm{len}(r)}{L}}$, is a length penalty that discourages excessively long rollouts.
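As a concrete sketch (not the paper's code), the rollout value of Eq. 2 can be computed directly from the Monte Carlo estimate and the rollout length; the default hyperparameters below are those reported in Section 4 ($\alpha=0.5$, $\beta=0.9$, $L=500$):

```python
def rollout_value(mc, num_tokens, alpha=0.5, beta=0.9, L=500):
    """Q(s, r) = alpha^(1 - MC(s)) * beta^(len(r) / L), i.e. Eq. 2."""
    # First factor rewards supposed-to-be-correct states (MC close to 1);
    # second factor penalizes excessively long rollouts.
    return (alpha ** (1.0 - mc)) * (beta ** (num_tokens / L))
```

A rollout whose state has $\mathrm{MC}(s)=1$ and negligible length attains the maximum value of $1$; lowering $\mathrm{MC}(s)$ or lengthening the rollout shrinks the value multiplicatively.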
Select.
The selection phase of our algorithm is simpler than that of AlphaGo (Silver et al., 2016), which selects a sequence of actions from the root to a leaf node, forming a trajectory of multiple states and actions. In contrast, we maintain a pool of all rollouts $\{(s_{i},r^{i}_{j})\}$ from previous searches that satisfy $0<\mathrm{MC}(s_{i})<1$. In each selection, a rollout is popped from the pool according to the tree statistics, $(s,r)=\operatorname*{arg\,max}_{(s,r)}[Q(s,r)+U(s)]$, using a variant of the PUCT algorithm (Rosin, 2011),
$$
U(s)=c_{\text{puct}}\frac{\sqrt{\sum_{i}N(s_{i})}}{1+N(s)}, \tag{3}
$$
where $c_{\text{puct}}$ is a constant determining the level of exploration. This strategy initially favors rollouts with low visit counts but gradually shifts preference towards those with high rollout values.
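A minimal sketch of this selection rule, under the assumption that each pool entry carries its value $Q(s,r)$ and visit count $N(s)$ (the dictionary keys below are ours, not the paper's):

```python
import math

def exploration_term(visits, total_visits, c_puct=0.125):
    """U(s) = c_puct * sqrt(sum_i N(s_i)) / (1 + N(s)), i.e. Eq. 3."""
    return c_puct * math.sqrt(total_visits) / (1 + visits)

def pop_best_rollout(pool, c_puct=0.125):
    """Remove and return the pool entry maximizing Q(s, r) + U(s)."""
    total = sum(entry["N"] for entry in pool)
    best = max(pool, key=lambda e: e["Q"] + exploration_term(e["N"], total, c_puct))
    pool.remove(best)
    return best
```

Because $U(s)$ shrinks as $N(s)$ grows, rarely visited entries get a temporary boost, after which preference shifts to entries with high rollout values $Q(s,r)$.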
Binary Search.
We perform a binary search to identify the first error location in the selected rollout, as detailed in Section 3.2. Rollouts with $0<\mathrm{MC}(s)<1$ encountered during the process are added to the selection candidate pool. All divide-and-rollout positions before the first error become new states. For the example in Fig. 2 (b), the trajectory $s[q]\to s[q,x_{1:4}]\to s[q,x_{1:6}]\to s[q,x_{1:7}]$ is added to the tree after the binary search. The edges $s[q]\to s[q,x_{1:4}]$ and $s[q,x_{1:4}]\to s[q,x_{1:6}]$ are correct, with $\mathrm{MC}$ values of $0.25$ and $0.5$, respectively, while the edge $s[q,x_{1:6}]\to s[q,x_{1:7}]$ is incorrect, with an $\mathrm{MC}$ value of $0$.
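The first-error localization can be sketched as a standard binary search over prefix lengths (a simplification that ignores the re-division of steps; `mc_of_prefix` is a hypothetical callback returning the Monte Carlo estimate of the first $t$ steps, each call costing one round of rollouts):

```python
def first_error_step(num_steps, mc_of_prefix):
    """Return the 1-based index of the first incorrect step.

    Assumes mc_of_prefix(0) > 0 (the bare question is answerable) and
    mc_of_prefix(num_steps) == 0 (the full solution reaches a wrong
    answer), so a first error is guaranteed to exist.
    """
    lo, hi = 0, num_steps  # invariant: mc(lo) > 0 and mc(hi) == 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mc_of_prefix(mid) > 0:
            lo = mid  # prefix can still reach a correct answer; error is later
        else:
            hi = mid  # prefix already fails; first error is at or before mid
    return hi
```

For a solution of $8$ steps with the first error at step $7$, the probes fall at prefix lengths $4$, $6$, and then $7$, which is consistent with the three divide-and-rollout positions in Fig. 2 (b).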
Maintain.
After the binary search, the tree statistics $N(s)$ , $\mathrm{MC}(s)$ , and $Q(s,r)$ are updated. Specifically, $N(s)$ is incremented by $1$ for the selected $(s,r)$ . Both $\mathrm{MC}(s)$ and $Q(s,r)$ are updated for the new rollouts sampled from the binary search. This phase resembles the backup phase in AlphaGo but is simpler, as it does not require recursive updates from the leaf to the root.
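A sketch of this maintenance phase under the same dictionary-based bookkeeping assumptions (`q_of` is a hypothetical callback implementing Eq. 2 from an edge's $\mathrm{MC}$ estimate and token length):

```python
def maintain(selected, new_edges, q_of):
    """Update tree statistics after one binary search.

    selected: the (state, rollout) entry just searched; its visit count
    N is incremented. new_edges: edges created by the divide-and-rollouts,
    each carrying a fresh Monte Carlo estimate "MC" and token length "len".
    Unlike AlphaGo's backup phase, no recursive update to the root occurs.
    """
    selected["N"] += 1
    for edge in new_edges:
        edge["Q"] = q_of(edge["MC"], edge["len"])
    return new_edges
```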
Tree Construction.
By repeating the above process, we can construct a state-action tree like the example illustrated in Fig. 1. Construction ends either when the search count reaches a predetermined limit or when no rollout candidates remain in the pool.
3.4 PRM Training
Each edge $(s,a)$ with a single-step action in the constructed state-action tree can serve as a training example for the PRM. It can be trained using the standard classification loss
$$
\mathcal{L}_{\text{pointwise}}=\sum_{i=1}^{N}\hat{y}_{i}\log y_{i}+(1-\hat{y}_{i})\log(1-y_{i}), \tag{4}
$$
where $\hat{y}_{i}$ represents the correctness label and $y_{i}=\mathrm{PRM}(s,a)$ is the prediction score of the PRM. Wang et al. (2024b) used the Monte Carlo estimation as the correctness label, i.e., $\hat{y}=\mathrm{MC}(s)$. Alternatively, Wang et al. (2024a) employed a binary labeling approach, $\hat{y}=\mathbf{1}[\mathrm{MC}(s)>0]$, assigning $\hat{y}=1$ for any positive Monte Carlo estimation and $\hat{y}=0$ otherwise. We refer to the former as the pointwise soft label and to the latter as the pointwise hard label. In addition, since a common solution prefix often has multiple single-step actions, we can also minimize the cross-entropy loss between the PRM predictions and the normalized pairwise preferences following the Bradley-Terry model (Christiano et al., 2017). We refer to this training method as the pairwise approach; the detailed pairwise loss formula can be found in Appendix B.
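As a sketch (not the paper's training code), the pointwise objective can be written as a scalar loop over labeled steps; by convention the log-likelihood of Eq. 4 is negated here so the result is minimized:

```python
import math

def pointwise_prm_loss(preds, mc_labels, hard=False, eps=1e-12):
    """Pointwise PRM loss of Eq. 4, negated for minimization.

    hard=False uses the soft label y_hat = MC(s) (Wang et al., 2024b);
    hard=True uses the hard label y_hat = 1[MC(s) > 0] (Wang et al.,
    2024a). preds are PRM prediction scores in (0, 1).
    """
    total = 0.0
    for y, mc in zip(preds, mc_labels):
        y_hat = float(mc > 0) if hard else mc
        # eps guards against log(0) for saturated predictions
        total += y_hat * math.log(y + eps) + (1 - y_hat) * math.log(1 - y + eps)
    return -total
```

The two labeling schemes coincide whenever $\mathrm{MC}(s)$ is exactly $0$ or $1$ and differ only on fractional estimates.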
We use the pointwise soft label when evaluating the main results in Section 4.1; a comparison of the three objectives is discussed in Section 4.3.
4 Experiments
Data Generation.
We conduct our experiments on the challenging MATH dataset (Hendrycks et al., 2021). We use the same training and testing split as Lightman et al. (2023), which consists of $12$K training examples and a holdout subset of $500$ representative problems drawn from the original $5$K test examples introduced in Hendrycks et al. (2021); we observe similar policy performance on the full test set and the subset. To create the process annotation data, we use the questions from the training split and set the search limit to $100$ per question, resulting in $1.5$M per-step process supervision annotations. To reduce false positive and false negative noise, we filter out questions that are either too hard or too easy for the model; see Appendix A for details. We use $\alpha=0.5$, $\beta=0.9$, and $L=500$ for calculating $Q(s,r)$ in Eq. 2, and $c_{\text{puct}}=0.125$ in Eq. 3. We sample $k=8$ rollouts for each Monte Carlo estimation.
Models.
In previous studies (Lightman et al., 2023; Wang et al., 2024a, b), both proprietary models such as GPT-4 (OpenAI, 2023) and open-source models such as Llama2 (Touvron et al., 2023) were explored. In our study, we experiment with both the proprietary Gemini Pro (Gemini Team et al., 2024) and the open-source Gemma2 (Gemma Team et al., 2024) models. For Gemini Pro, following Lightman et al. (2023) and Wang et al. (2024a), we first fine-tune it on math instruction data, achieving an accuracy of approximately $51$% on the MATH test set; the instruction-tuned model is then used for solution sampling. For the open-source models, to maximize reproducibility, we directly use the pretrained Gemma2 27B checkpoint with the 4-shot prompt introduced in Gemini Team et al. (2024). The reward models are all trained from the pretrained checkpoints.
Metrics and baselines.
We evaluate PRM-weighted majority voting on GSM8K (Cobbe et al., 2021) and MATH500 (Lightman et al., 2023) using PRMs trained on different process supervision data. Following Lightman et al. (2023), we use the product of scores across all steps as the final solution score; their study compared the product and the minimum of step scores and found the difference to be minor. Baseline process supervision data include PRM800K (Lightman et al., 2023) and Math-Shepherd (Wang et al., 2024a), both publicly available. Additionally, we generate a process annotation dataset with our Gemini policy model using the brute-force approach described in Wang et al. (2024a, b), referred to as Math-Shepherd (our impl) in subsequent sections.
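This scoring and voting scheme can be sketched as follows (names are ours; `step_scores` is a hypothetical callback returning the per-step PRM scores of a solution):

```python
from collections import defaultdict

def weighted_majority_vote(solutions, step_scores):
    """Return the final answer with the largest total PRM weight.

    solutions: list of (final_answer, steps) pairs. Each solution's
    weight is the product of its per-step PRM scores, following
    Lightman et al. (2023); weights are summed per distinct answer.
    """
    votes = defaultdict(float)
    for answer, steps in solutions:
        weight = 1.0
        for s in step_scores(steps):
            weight *= s  # product of step scores as the solution score
        votes[answer] += weight
    return max(votes, key=votes.get)
```

Unlike plain majority voting, a single high-confidence solution can outweigh several low-confidence solutions that happen to agree on a wrong answer.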
4.1 Main Results
<details>
<summary>x4.png Details</summary>
Line chart of % problems solved vs. $N$ = number of solutions per problem ($2^2$–$2^6$) for Majority Vote, +OmegaPRM, +PRM800K, +Shepherd, and +Shepherd (ours); +OmegaPRM is highest throughout, reaching roughly 69% at $2^6$.
</details>
(a) Gemini Pro on MATH500.
<details>
<summary>x5.png Details</summary>
Line chart of % problems solved vs. $N$ = number of solutions per problem ($2^2$–$2^6$, y-axis 89–94%) for the same five methods; +OmegaPRM is highest throughout, reaching roughly 93.6% at $2^6$.
</details>
(b) Gemini Pro on GSM8K.
<details>
<summary>x6.png Details</summary>
Line chart of % problems solved vs. $N$ = number of solutions per problem ($2^2$–$2^6$, y-axis 40–60%) for the same five methods; +OmegaPRM leads throughout, reaching roughly 58% at $2^6$.
</details>
(c) Gemma2 27B on MATH500.
<details>
<summary>x7.png Details</summary>
Line chart of % problems solved vs. $N$ = number of solutions per problem ($2^2$–$2^6$, y-axis 70–95%) for the same five methods; +OmegaPRM leads throughout, reaching roughly 92% at $2^6$.
</details>
(d) Gemma2 27B on GSM8K.
Figure 3: A comparison of PRMs trained with different process supervision datasets, evaluated by their ability to search over many test solutions using a PRM-weighted majority voting. We visualize the variance across many sub-samples of the $128$ solutions we generated in total per problem.
Table 1: The performance comparison of PRMs trained with different process supervision datasets. The numbers represent the percentage of problems solved using PRM-weighted majority voting with $k=64$ .
| Method | MATH500 (Gemini Pro) | MATH500 (Gemma2 27B) | GSM8K (Gemini Pro) | GSM8K (Gemma2 27B) |
| --- | --- | --- | --- | --- |
| MajorityVote@64 | 67.2 | 54.7 | 92.7 | 90.6 |
| + Math-Shepherd | 67.2 | 57.4 | 92.7 | 90.5 |
| + Math-Shepherd (our impl) | 67.2 | 55.2 | 91.8 | 91.4 |
| + PRM800K | 67.6 | 57.2 | 92.9 | 91.7 |
| + OmegaPRM | 69.4 | 58.2 | 93.6 | 92.2 |
Table 1 and Fig. 3 present the performance comparison of PRMs trained on various process annotation datasets. OmegaPRM consistently outperforms the other process supervision datasets. Specifically, the fine-tuned Gemini Pro achieves $69.4$% and $93.6$% accuracy on MATH500 and GSM8K, respectively, using OmegaPRM-weighted majority voting. The pretrained Gemma2 27B likewise performs best with OmegaPRM, reaching $58.2$% and $92.2$% accuracy on MATH500 and GSM8K, respectively. OmegaPRM surpasses both the human-annotated PRM800K and the automatically annotated Math-Shepherd. More specifically, when the number of samples is small, almost all PRMs outperform the majority vote; as the number of samples increases, however, the performance of the other PRMs gradually converges to that of the majority vote, whereas our PRM maintains a clear margin in accuracy.
4.2 Step Distribution
An important factor in process supervision is the number of steps in a solution and the length of each step. Previous works (Lightman et al., 2023; Wang et al., 2024a, b) use rule-based strategies to split a solution into steps, e.g., using newlines as delimiters. In contrast, we propose a more flexible method of step division, treating any sequence of consecutive tokens in a solution as a valid step. We observe that many step divisions in Math-Shepherd lack semantic coherence, and we therefore hypothesize that semantically explicit cutting is not necessary for training a PRM.
<details>
<summary>x8.png Details</summary>
Histogram of the number of steps per solution in Math-Shepherd (x-axis 0–30); right-skewed, peaking around 5 steps, with few solutions above 20 steps.
</details>
<details>
<summary>x9.png Details</summary>
Histogram of the number of steps per solution in PRM800K (x-axis 0–40); right-skewed, peaking around 10 steps, with few solutions above 25 steps.
</details>
Figure 4: Number of steps distribution.
In practice, we first examine the distribution of the number of steps per solution in PRM800K and Math-Shepherd, as shown in Fig. 4, noting that most solutions have fewer than $20$ steps. During binary search, we aim to divide a full solution into $16$ pieces. To obtain the expected step length, we divide the average solution length by $16$; the binary search terminates once a step is shorter than this value. The resulting step-length distributions for OmegaPRM and the other two datasets are shown in Fig. 5. This flexible splitting strategy produces a step-length distribution similar to that of the rule-based strategy.
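The termination rule is simple arithmetic; as a sketch (the $512$-token average below is an illustrative number, not from the paper):

```python
def should_stop_splitting(step_tokens, avg_solution_tokens, target_pieces=16):
    """Stop subdividing once a step is shorter than the expected step
    length, i.e., the average solution length divided by the target
    number of pieces (16 in the paper)."""
    expected = avg_solution_tokens / target_pieces
    return step_tokens < expected
```

For example, a $512$-token average solution gives a $32$-token cutoff, so a $20$-token step is no longer divided.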
<details>
<summary>x10.png Details</summary>
Histogram of per-step length in tokens for Math-Shepherd (x-axis 0–200); right-skewed, peaking at 30–40 tokens, with negligible counts beyond 150 tokens.
</details>
<details>
<summary>x11.png Details</summary>

### Visual Description
## Bar Chart: PRM800K Token Length Distribution
### Overview
The chart displays a distribution of per-step token lengths for the PRM800K model, showing counts of occurrences across different token length intervals. The data follows a right-skewed distribution with a sharp peak at shorter token lengths.
### Components/Axes
- **X-axis**: "Per-step Length (in number of tokens)"
- Scale: 0 to 200 (increments of 50)
- Labels: 0, 50, 100, 150, 200
- **Y-axis**: "Count" (scaled ×10⁴)
- Scale: 0 to 8×10⁴ (increments of 2×10⁴)
- Labels: 0, 2×10⁴, 4×10⁴, 6×10⁴, 8×10⁴
- **Legend**: Not present
- **Bars**: Blue (uniform color, no sub-categories)
### Detailed Analysis
- **Token Length Intervals**:
- **0–50 tokens**: Counts range from ~1×10⁴ to ~8×10⁴ (peak at ~75,000 for 50 tokens).
- **50–100 tokens**: Counts decline steeply from ~75,000 to ~45,000.
- **100–150 tokens**: Counts drop to ~10,000–15,000.
- **150–200 tokens**: Counts near zero (~1,000–5,000).
### Key Observations
1. **Peak at 50 tokens**: The highest frequency (~75,000) occurs at 50 tokens, suggesting this is the most common per-step length.
2. **Rapid decline**: Counts decrease by ~40% between 50 and 75 tokens, then continue declining exponentially.
3. **Long token lengths**: Usage beyond 100 tokens is minimal (<20,000 total).
### Interpretation
The data indicates that PRM800K predominantly operates with shorter token lengths (≤50 tokens), likely reflecting optimization for efficiency or performance in typical use cases. The sharp drop after 50 tokens suggests longer sequences are either rare, computationally expensive, or less effective for the model's intended tasks. The absence of a legend implies a single data series, reinforcing the focus on token length frequency rather than categorical comparisons.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Histogram: OmegaPRM Per-step Length Distribution
### Overview
The image displays a histogram titled "OmegaPRM" showing the distribution of per-step lengths (in number of tokens) for a dataset or model. The y-axis represents counts scaled by 10⁵, while the x-axis measures per-step length in tokens. The distribution is heavily skewed, with a sharp decline in frequency as per-step length increases.
### Components/Axes
- **Title**: "OmegaPRM" (top center)
- **Y-axis**:
- Label: "Count"
- Scale: 0 to 2.0×10⁵ (increments of 0.5×10⁵)
- Units: Absolute counts (no explicit units beyond "Count")
- **X-axis**:
- Label: "Per-step Length (in number of tokens)"
- Scale: 0 to 200 (increments of 50)
- Units: Tokens (discrete intervals)
- **Bars**:
- Color: Blue (uniform across all bars)
- Orientation: Vertical
- **Legend**: Absent
### Detailed Analysis
1. **Per-step Length Intervals**:
- Bins are grouped in ranges (e.g., 0–10, 10–20, etc.), though exact bin widths are not labeled.
- The first bin (0–10 tokens) contains the highest count, approximately **2.0×10⁵**.
- Counts decrease monotonically with increasing per-step length:
- 10–20 tokens: ~1.8×10⁵
- 20–30 tokens: ~1.5×10⁵
- 30–40 tokens: ~1.2×10⁵
- 40–50 tokens: ~1.0×10⁵
- 50–60 tokens: ~0.8×10⁵
- Subsequent bins show gradual declines, with counts dropping below 0.1×10⁵ by 100 tokens.
- The last non-zero bar appears near 100 tokens, with counts near 0.01×10⁵.
2. **Data Trends**:
- The distribution follows a **long-tailed pattern**, with the majority of data concentrated in the first 50 tokens.
- After 50 tokens, counts drop sharply, forming a steep decline.
- No secondary peaks or anomalies are observed.
### Key Observations
- **Dominance of Short Per-step Lengths**: Over 90% of counts occur within the first 50 tokens.
- **Long Tail**: A small fraction of instances extend to 100+ tokens, but these are rare.
- **No Outliers**: No isolated bars or irregularities in the distribution.
### Interpretation
The histogram suggests that **OmegaPRM** predominantly operates with short per-step lengths, likely reflecting a design optimized for efficiency or specific task constraints. The long tail indicates occasional longer steps, which could represent complex reasoning or edge cases. The absence of a legend implies a single data series, and the uniform bar color reinforces this simplicity. The skewed distribution highlights the importance of short-step processing in the system’s behavior, with longer steps being statistically insignificant but potentially critical for understanding rare events.
</details>
Figure 5: Step length distribution in terms of number of tokens.
4.3 PRM Training Objectives
Table 2: Comparison of different training objectives for PRMs.
| Training Objective | Soft Label | Hard Label | Pairwise |
| --- | --- | --- | --- |
| PRM Accuracy (%) | 70.1 | 63.3 | 64.2 |
As outlined in Section 3.4, PRMs can be trained using multiple objectives. We construct a small process supervision test set using problems from the MATH test split. We train PRMs using the pointwise soft label, pointwise hard label, and pairwise loss respectively, and evaluate how accurately each classifies per-step correctness. Table 2 compares the objectives; the pointwise soft label performs best, with $70.1\%$ accuracy.
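As an illustration of the three objectives, here is a minimal numeric sketch. The function names are ours, not the paper's: `mc` stands for a step's Monte Carlo correctness estimate and `logit` for the PRM's raw score on that step; the exact loss formulations in the paper may differ in detail.

```python
import math

# Illustrative sketch of the three training objectives compared in Table 2.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_label_loss(logit, mc):
    """Pointwise soft label: cross-entropy with the MC value as a soft target."""
    p = sigmoid(logit)
    return -(mc * math.log(p) + (1 - mc) * math.log(1 - p))

def hard_label_loss(logit, mc):
    """Pointwise hard label: binarize the target (correct iff mc > 0)."""
    return soft_label_loss(logit, 1.0 if mc > 0 else 0.0)

def pairwise_loss(logit_pos, logit_neg):
    """Bradley-Terry style loss preferring the step with the higher MC value."""
    return -math.log(sigmoid(logit_pos - logit_neg))
```

The soft label keeps the graded information in the MC estimate (e.g. a target of 0.7 rather than 1), which is one plausible reading of why it outperforms the hard-label variant in Table 2.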
4.4 Algorithm Efficiency
As described in Sections 3.2 and 3.3, we use binary search and Monte Carlo Tree Search to improve the efficiency of OmegaPRM process supervision data collection, by quickly identifying the first incorrect step and reusing rollouts in the Monte Carlo estimation. To quantify this efficiency, we collected process supervision data with both the brute-force-style method (Wang et al., 2024a, b) and OmegaPRM under the same computational budget. The brute-force algorithm produced $200$K data points, whereas OmegaPRM produced $15$ million, a $75$-fold efficiency improvement. In practice, we randomly down-sampled the OmegaPRM data to $1.5$ million examples for PRM training.
5 Limitations
Our work has some limitations, which we reserve for future work:
Automatic process annotation is noisy.
Our method for automatic process supervision annotation introduces noise in the form of false positives and false negatives, but our experiments indicate that the resulting data can still effectively train a PRM: the PRM trained on our dataset outperforms one trained on the human-annotated PRM800K dataset. The precise impact of this noise on PRM performance remains uncertain. Future research should conduct a comprehensive comparison of human and automated annotations. Another direction is to integrate human and automated annotations, which could yield more robust and efficient process supervision.
Human supervision is still necessary.
Unlike the work presented in AlphaGo Zero (Silver et al., 2017), our method requires question and golden-answer pairs. The question is necessary for the LLM to start the MCTS, and the golden answer is required for the LLM to compare its rollouts against when determining the correctness of the current step. This limits the method to tasks where such pairs are available; adapting it to open-ended tasks is left for future work.
6 Conclusion
In conclusion, we introduce OmegaPRM, a divide-and-conquer Monte Carlo Tree Search algorithm designed to automate process supervision data collection for LLMs. By efficiently pinpointing the first error in the Chain of Thought and balancing positive and negative examples, OmegaPRM addresses the shortcomings of existing methods. Our automated approach enables the collection of over 1.5 million process supervision annotations, which we use to train a PRM. Leveraging this automated process supervision with the weighted self-consistency algorithm, we improve LLM mathematical reasoning performance, achieving a 69.4% success rate on the MATH500 benchmark, an 18.4% absolute increase over the base model and a 36% relative improvement. Additionally, our method significantly reduces data collection costs compared to human annotation and brute-force Monte Carlo sampling. These findings highlight OmegaPRM's potential to enhance LLM capabilities in complex multi-step reasoning tasks.
7 Acknowledgements
We would like to express our deepest gratitude to Matt Barnes, Jindong Chen and Rif A. Saurous, for their support and helpful feedback to our work. We also thank Peiyi Wang for clarifying the details in Math-Shepherd and providing insightful suggestions.
References
- Christiano et al. (2017) P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741, 2017.
- Cobbe et al. (2021) K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Feng et al. (2023) X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
- Feng et al. (2024) X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2024.
- Fu et al. (2023) Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-based prompting for multi-step reasoning. In Proceedings of the 11th International Conference on Learning Representations (ICLR), May 2023.
- Gemini Team et al. (2024) Gemini Team, M. Reid, N. Savinov, D. Teplyashin, Dmitry, Lepikhin, T. Lillicrap, J. baptiste Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, I. Antonoglou, R. Anil, S. Borgeaud, A. Dai, K. Millican, E. Dyer, M. Glaese, T. Sottiaux, B. Lee, F. Viola, M. Reynolds, Y. Xu, J. Molloy, J. Chen, M. Isard, P. Barham, T. Hennigan, R. McIlroy, M. Johnson, J. Schalkwyk, E. Collins, E. Rutherford, E. Moreira, K. Ayoub, M. Goel, C. Meyer, G. Thornton, Z. Yang, H. Michalewski, Z. Abbas, N. Schucher, A. Anand, R. Ives, J. Keeling, K. Lenc, S. Haykal, S. Shakeri, P. Shyam, A. Chowdhery, R. Ring, S. Spencer, E. Sezener, L. Vilnis, O. Chang, N. Morioka, G. Tucker, C. Zheng, O. Woodman, N. Attaluri, T. Kocisky, E. Eltyshev, X. Chen, T. Chung, V. Selo, S. Brahma, P. Georgiev, A. Slone, Z. Zhu, J. Lottes, S. Qiao, B. Caine, S. Riedel, A. Tomala, M. Chadwick, J. Love, P. Choy, S. Mittal, N. Houlsby, Y. Tang, M. Lamm, L. Bai, Q. Zhang, L. He, Y. Cheng, P. Humphreys, Y. Li, S. Brin, A. Cassirer, Y. Miao, L. Zilka, T. Tobin, K. Xu, L. Proleev, D. Sohn, A. Magni, L. A. Hendricks, I. Gao, S. Ontanon, O. Bunyan, N. Byrd, A. Sharma, B. Zhang, M. Pinto, R. Sinha, H. Mehta, D. Jia, S. Caelles, A. Webson, A. Morris, B. Roelofs, Y. Ding, R. Strudel, X. Xiong, M. Ritter, M. Dehghani, R. Chaabouni, A. Karmarkar, G. Lai, F. Mentzer, B. Xu, Y. Li, Y. Zhang, T. L. Paine, A. Goldin, B. Neyshabur, K. Baumli, A. Levskaya, M. Laskin, W. Jia, J. W. Rae, K. Xiao, A. He, S. Giordano, L. Yagati, J.-B. Lespiau, P. Natsev, S. Ganapathy, F. Liu, D. Martins, N. Chen, Y. Xu, M. Barnes, R. May, A. Vezer, J. Oh, K. Franko, S. Bridgers, R. Zhao, B. Wu, B. Mustafa, S. Sechrist, E. Parisotto, T. S. Pillai, C. Larkin, C. Gu, C. Sorokin, M. Krikun, A. Guseynov, J. Landon, R. Datta, A. Pritzel, P. Thacker, F. Yang, K. Hui, A. Hauth, C.-K. Yeh, D. Barker, J. Mao-Jones, S. Austin, H. Sheahan, P. Schuh, J. Svensson, R. Jain, V. Ramasesh, A. Briukhov, D.-W. Chung, T. von Glehn, C. 
Butterfield, P. Jhakra, M. Wiethoff, J. Frye, J. Grimstad, B. Changpinyo, C. L. Lan, A. Bortsova, Y. Wu, P. Voigtlaender, T. Sainath, S. Gu, C. Smith, W. Hawkins, K. Cao, J. Besley, S. Srinivasan, M. Omernick, C. Gaffney, G. Surita, R. Burnell, B. Damoc, J. Ahn, A. Brock, M. Pajarskas, A. Petrushkina, S. Noury, L. Blanco, K. Swersky, A. Ahuja, T. Avrahami, V. Misra, R. de Liedekerke, M. Iinuma, A. Polozov, S. York, G. van den Driessche, P. Michel, J. Chiu, R. Blevins, Z. Gleicher, A. Recasens, A. Rrustemi, E. Gribovskaya, A. Roy, W. Gworek, S. M. R. Arnold, L. Lee, J. Lee-Thorp, M. Maggioni, E. Piqueras, K. Badola, S. Vikram, L. Gonzalez, A. Baddepudi, E. Senter, J. Devlin, J. Qin, M. Azzam, M. Trebacz, M. Polacek, K. Krishnakumar, S. yiin Chang, M. Tung, I. Penchev, R. Joshi, K. Olszewska, C. Muir, M. Wirth, A. J. Hartman, J. Newlan, S. Kashem, V. Bolina, E. Dabir, J. van Amersfoort, Z. Ahmed, J. Cobon-Kerr, A. Kamath, A. M. Hrafnkelsson, L. Hou, I. Mackinnon, A. Frechette, E. Noland, X. Si, E. Taropa, D. Li, P. Crone, A. Gulati, S. Cevey, J. Adler, A. Ma, D. Silver, S. Tokumine, R. Powell, S. Lee, K. Vodrahalli, S. Hassan, D. Mincu, A. Yang, N. Levine, J. Brennan, M. Wang, S. Hodkinson, J. Zhao, J. Lipschultz, A. Pope, M. B. Chang, C. Li, L. E. Shafey, M. Paganini, S. Douglas, B. Bohnet, F. Pardo, S. Odoom, M. Rosca, C. N. dos Santos, K. Soparkar, A. Guez, T. Hudson, S. Hansen, C. Asawaroengchai, R. Addanki, T. Yu, W. Stokowiec, M. Khan, J. Gilmer, J. Lee, C. G. Bostock, K. Rong, J. Caton, P. Pejman, F. Pavetic, G. Brown, V. Sharma, M. Lučić, R. Samuel, J. Djolonga, A. Mandhane, L. L. Sjösund, E. Buchatskaya, E. White, N. Clay, J. Jiang, H. Lim, R. Hemsley, Z. Cankara, J. Labanowski, N. D. Cao, D. Steiner, S. H. Hashemi, J. Austin, A. Gergely, T. Blyth, J. Stanton, K. Shivakumar, A. Siddhant, A. Andreassen, C. Araya, N. Sethi, R. Shivanna, S. Hand, A. Bapna, A. Khodaei, A. Miech, G. Tanzer, A. Swing, S. Thakoor, L. Aroyo, Z. Pan, Z. Nado, J. Sygnowski, S. 
Winkler, D. Yu, M. Saleh, L. Maggiore, Y. Bansal, X. Garcia, M. Kazemi, P. Patil, I. Dasgupta, I. Barr, M. Giang, T. Kagohara, I. Danihelka, A. Marathe, V. Feinberg, M. Elhawaty, N. Ghelani, D. Horgan, H. Miller, L. Walker, R. Tanburn, M. Tariq, D. Shrivastava, F. Xia, Q. Wang, C.-C. Chiu, Z. Ashwood, K. Baatarsukh, S. Samangooei, R. L. Kaufman, F. Alcober, A. Stjerngren, P. Komarek, K. Tsihlas, A. Boral, R. Comanescu, J. Chen, R. Liu, C. Welty, D. Bloxwich, C. Chen, Y. Sun, F. Feng, M. Mauger, X. Dotiwalla, V. Hellendoorn, M. Sharman, I. Zheng, K. Haridasan, G. Barth-Maron, C. Swanson, D. Rogozińska, A. Andreev, P. K. Rubenstein, R. Sang, D. Hurt, G. Elsayed, R. Wang, D. Lacey, A. Ilić, Y. Zhao, A. Iwanicki, A. Lince, A. Chen, C. Lyu, C. Lebsack, J. Griffith, M. Gaba, P. Sandhu, P. Chen, A. Koop, R. Rajwar, S. H. Yeganeh, S. Chang, R. Zhu, S. Radpour, E. Davoodi, V. I. Lei, Y. Xu, D. Toyama, C. Segal, M. Wicke, H. Lin, A. Bulanova, A. P. Badia, N. Rakićević, P. Sprechmann, A. Filos, S. Hou, V. Campos, N. Kassner, D. Sachan, M. Fortunato, C. Iwuanyanwu, V. Nikolaev, B. Lakshminarayanan, S. Jazayeri, M. Varadarajan, C. Tekur, D. Fritz, M. Khalman, D. Reitter, K. Dasgupta, S. Sarcar, T. Ornduff, J. Snaider, F. Huot, J. Jia, R. Kemp, N. Trdin, A. Vijayakumar, L. Kim, C. Angermueller, L. Lao, T. Liu, H. Zhang, D. Engel, S. Greene, A. White, J. Austin, L. Taylor, S. Ashraf, D. Liu, M. Georgaki, I. Cai, Y. Kulizhskaya, S. Goenka, B. Saeta, Y. Xu, C. Frank, D. de Cesare, B. Robenek, H. Richardson, M. Alnahlawi, C. Yew, P. Ponnapalli, M. Tagliasacchi, A. Korchemniy, Y. Kim, D. Li, B. Rosgen, K. Levin, J. Wiesner, P. Banzal, P. Srinivasan, H. Yu, Çağlar Ünlü, D. Reid, Z. Tung, D. Finchelstein, R. Kumar, A. Elisseeff, J. Huang, M. Zhang, R. Aguilar, M. Giménez, J. Xia, O. Dousse, W. Gierke, D. Yates, K. Jalan, L. Li, E. Latorre-Chimoto, D. D. Nguyen, K. Durden, P. Kallakuri, Y. Liu, M. Johnson, T. Tsai, A. Talbert, J. Liu, A. Neitz, C. Elkind, M. Selvi, M. Jasarevic, L. B. 
Soares, A. Cui, P. Wang, A. W. Wang, X. Ye, K. Kallarackal, L. Loher, H. Lam, J. Broder, D. Holtmann-Rice, N. Martin, B. Ramadhana, M. Shukla, S. Basu, A. Mohan, N. Fernando, N. Fiedel, K. Paterson, H. Li, A. Garg, J. Park, D. Choi, D. Wu, S. Singh, Z. Zhang, A. Globerson, L. Yu, J. Carpenter, F. de Chaumont Quitry, C. Radebaugh, C.-C. Lin, A. Tudor, P. Shroff, D. Garmon, D. Du, N. Vats, H. Lu, S. Iqbal, A. Yakubovich, N. Tripuraneni, J. Manyika, H. Qureshi, N. Hua, C. Ngani, M. A. Raad, H. Forbes, J. Stanway, M. Sundararajan, V. Ungureanu, C. Bishop, Y. Li, B. Venkatraman, B. Li, C. Thornton, S. Scellato, N. Gupta, Y. Wang, I. Tenney, X. Wu, A. Shenoy, G. Carvajal, D. G. Wright, B. Bariach, Z. Xiao, P. Hawkins, S. Dalmia, C. Farabet, P. Valenzuela, Q. Yuan, A. Agarwal, M. Chen, W. Kim, B. Hulse, N. Dukkipati, A. Paszke, A. Bolt, K. Choo, J. Beattie, J. Prendki, H. Vashisht, R. Santamaria-Fernandez, L. C. Cobo, J. Wilkiewicz, D. Madras, A. Elqursh, G. Uy, K. Ramirez, M. Harvey, T. Liechty, H. Zen, J. Seibert, C. H. Hu, A. Khorlin, M. Le, A. Aharoni, M. Li, L. Wang, S. Kumar, N. Casagrande, J. Hoover, D. E. Badawy, D. Soergel, D. Vnukov, M. Miecnikowski, J. Simsa, P. Kumar, T. Sellam, D. Vlasic, S. Daruki, N. Shabat, J. Zhang, G. Su, J. Zhang, J. Liu, Y. Sun, E. Palmer, A. Ghaffarkhah, X. Xiong, V. Cotruta, M. Fink, L. Dixon, A. Sreevatsa, A. Goedeckemeyer, A. Dimitriev, M. Jafari, R. Crocker, N. FitzGerald, A. Kumar, S. Ghemawat, I. Philips, F. Liu, Y. Liang, R. Sterneck, A. Repina, M. Wu, L. Knight, M. Georgiev, H. Lee, H. Askham, A. Chakladar, A. Louis, C. Crous, H. Cate, D. Petrova, M. Quinn, D. Owusu-Afriyie, A. Singhal, N. Wei, S. Kim, D. Vincent, M. Nasr, C. A. Choquette-Choo, R. Tojo, S. Lu, D. de Las Casas, Y. Cheng, T. Bolukbasi, K. Lee, S. Fatehi, R. Ananthanarayanan, M. Patel, C. Kaed, J. Li, S. R. Belle, Z. Chen, J. Konzelmann, S. Põder, R. Garg, V. Koverkathu, A. Brown, C. Dyer, R. Liu, A. Nova, J. Xu, A. Walton, A. Parrish, M. Epstein, S. McCarthy, S. 
Petrov, D. Hassabis, K. Kavukcuoglu, J. Dean, and O. Vinyals. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Gemma Team et al. (2024) Gemma Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, J. Ferret, P. Liu, P. Tafti, A. Friesen, M. Casbon, S. Ramos, R. Kumar, C. L. Lan, S. Jerome, A. Tsitsulin, N. Vieillard, P. Stanczyk, S. Girgin, N. Momchev, M. Hoffman, S. Thakoor, J.-B. Grill, B. Neyshabur, O. Bachem, A. Walton, A. Severyn, A. Parrish, A. Ahmad, A. Hutchison, A. Abdagic, A. Carl, A. Shen, A. Brock, A. Coenen, A. Laforge, A. Paterson, B. Bastian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar, C. Perry, C. Welty, C. A. Choquette-Choo, D. Sinopalnikov, D. Weinberger, D. Vijaykumar, D. Rogozińska, D. Herbison, E. Bandy, E. Wang, E. Noland, E. Moreira, E. Senter, E. Eltyshev, F. Visin, G. Rasskin, G. Wei, G. Cameron, G. Martins, H. Hashemi, H. Klimczak-Plucińska, H. Batra, H. Dhand, I. Nardini, J. Mein, J. Zhou, J. Svensson, J. Stanway, J. Chan, J. P. Zhou, J. Carrasqueira, J. Iljazi, J. Becker, J. Fernandez, J. van Amersfoort, J. Gordon, J. Lipschultz, J. Newlan, J. yeong Ji, K. Mohamed, K. Badola, K. Black, K. Millican, K. McDonell, K. Nguyen, K. Sodhia, K. Greene, L. L. Sjoesund, L. Usui, L. Sifre, L. Heuermann, L. Lago, L. McNealus, L. B. Soares, L. Kilpatrick, L. Dixon, L. Martins, M. Reid, M. Singh, M. Iverson, M. Görner, M. Velloso, M. Wirth, M. Davidow, M. Miller, M. Rahtz, M. Watson, M. Risdal, M. Kazemi, M. Moynihan, M. Zhang, M. Kahng, M. Park, M. Rahman, M. Khatwani, N. Dao, N. Bardoliwalla, N. Devanathan, N. Dumai, N. Chauhan, O. Wahltinez, P. Botarda, P. Barnes, P. Barham, P. Michel, P. Jin, P. Georgiev, P. Culliton, P. Kuppala, R. Comanescu, R. Merhej, R. Jana, R. A. Rokni, R. Agarwal, R. Mullins, S. Saadat, S. M. Carthy, S. Perrin, S. M. R. Arnold, S. Krause, S. Dai, S. Garg, S. Sheth, S. Ronstrom, S. Chan, T. Jordan, T. Yu, T. Eccles, T. Hennigan, T. Kocisky, T. Doshi, V. Jain, V. Yadav, V. Meshram, V. Dharmadhikari, W. Barkley, W. Wei, W. Ye, W. Han, W. Kwon, X. Xu, Z. Shen, Z. Gong, Z. 
Wei, V. Cotruta, P. Kirk, A. Rao, M. Giang, L. Peran, T. Warkentin, E. Collins, J. Barral, Z. Ghahramani, R. Hadsell, D. Sculley, J. Banks, A. Dragan, S. Petrov, O. Vinyals, J. Dean, D. Hassabis, K. Kavukcuoglu, C. Farabet, E. Buchatskaya, S. Borgeaud, N. Fiedel, A. Joulin, K. Kenealy, R. Dadashi, and A. Andreev. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- Hao et al. (2023) S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
- Hendrycks et al. (2021) D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.
- Huang et al. (2022) J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
- Kang et al. (2024) J. Kang, X. Z. Li, X. Chen, A. Kazemi, Q. Sun, B. Chen, D. Li, X. He, Q. He, F. Wen, J. Hao, and J. Yao. Mindstar: Enhancing math reasoning in pre-trained llms at inference time. arXiv preprint arXiv:2405.16265, 2024.
- Li et al. (2023) Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen. Making large language models better reasoners with step-aware verifier. arXiv preprint arXiv:2206.02336, 2023.
- Lightman et al. (2023) H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
- Liu et al. (2024) H. Liu, Y. Zhang, Y. Luo, and A. C.-C. Yao. Augmenting math word problems via iterative question composing. arXiv preprint arXiv:2401.09003, 2024.
- Luo et al. (2023) L. Luo, Z. Lin, Y. Liu, L. Shu, Y. Zhu, J. Shang, and L. Meng. Critique ability of large language models. arXiv preprint arXiv:2310.04815, 2023.
- Ma et al. (2023) Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and H. Yang. Let’s reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080, 2023.
- OpenAI (2023) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ouyang et al. (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- Perez et al. (2021) E. Perez, D. Kiela, and K. Cho. True few-shot learning with language models. arXiv preprint arXiv:2105.11447, 2021.
- Rosin (2011) C. D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
- Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2016. URL https://doi.org/10.1038/nature16961.
- Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
- Tian et al. (2024) Y. Tian, B. Peng, L. Song, L. Jin, D. Yu, H. Mi, and D. Yu. Toward self-improvement of llms via imagination, searching, and criticizing. arXiv preprint arXiv:2404.12253, 2024.
- Touvron et al. (2023) H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Uesato et al. (2022) J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
- Wang et al. (2024a) P. Wang, L. Li, Z. Shao, R. X. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2024a.
- Wang et al. (2023) X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), May 2023.
- Wang et al. (2024b) Z. Wang, Y. Li, Y. Wu, L. Luo, L. Hou, H. Yu, and J. Shang. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision. arXiv preprint arXiv:2402.02658, 2024b.
- Wei et al. (2022a) J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus. Emergent abilities of large language models, 2022a.
- Wei et al. (2022b) J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), Dec 2022b.
- Yao et al. (2023) S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Yu et al. (2024) F. Yu, A. Gao, and B. Wang. Ovm, outcome-supervised value models for planning in mathematical reasoning. arXiv preprint arXiv:2311.09724, 2024.
- Yu et al. (2023) L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Zhang et al. (2024) D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816, 2024.
- Świechowski et al. (2021) M. Świechowski, K. Godlewski, B. Sawicki, and J. Mańdziuk. Monte carlo tree search: A review of recent modifications and applications. arXiv preprint arXiv:2103.04931, 2021.
Appendix
Appendix A Question Filtering
During the evaluation of partial-solution correctness using MC estimation, false negative noise may be introduced when a question is too hard for the model, so that no correct rollout can be found even from a correct partial solution. Conversely, false positive noise may be introduced when a question is too easy, so that the model can reach the correct answer even from a partial solution containing a wrong step. Such noise cannot be eliminated entirely, but we can reduce it by filtering out questions that are either too hard or too easy for the model. Specifically, we ran $k=32$ rollouts for each question in the 12K training data and filtered out the questions with no correct answer (too hard) or no wrong answer (too easy) among the 32 rollouts.
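The filtering rule can be sketched as follows. Here `sample_rollout` is a hypothetical stand-in for an actual LLM rollout that returns whether the final answer matches the golden answer, and the toy success probabilities are ours for illustration.

```python
import random

# Sketch of the Appendix A filtering: run k rollouts per question and drop
# questions with no correct answer (too hard) or no wrong answer (too easy).

def filter_questions(questions, sample_rollout, k=32):
    kept = []
    for q in questions:
        n_correct = sum(sample_rollout(q) for _ in range(k))
        if 0 < n_correct < k:  # neither all-wrong nor all-right
            kept.append(q)
    return kept

# Toy demonstration with fixed per-question success probabilities.
probs = {"too_hard": 0.0, "too_easy": 1.0, "useful": 0.5}
rollout = lambda q: random.random() < probs[q]
```

Running `filter_questions(list(probs), rollout)` keeps only the `"useful"` question, since the other two yield all-wrong or all-right rollouts.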
Appendix B Pairwise Loss Formula
When training with pairwise labels, we use the Bradley-Terry model, the standard objective for training reward models in RLHF, which expects two probability scalars that sum to 1. When we select two actions as a pair, there are two cases: either one sample has a zero MC value and the other a positive MC value, or both samples have positive MC values. The first case is straightforward; the second requires a normalization step.
Assume the two MC values are $p$ and $q$, and that they follow Bernoulli distributions: $P(X=1)=p$ and $P(Y=1)=q$. We need the probability that action $X$ is preferred over action $Y$, and vice versa.
$$
\begin{split}P(X>Y)&=P(X=1,Y=0)=p(1-q),\\
P(X<Y)&=P(X=0,Y=1)=(1-p)q,\\
P(X=Y)&=P(X=0,Y=0)+P(X=1,Y=1)=(1-p)(1-q)+pq.\end{split} \tag{5}
$$
For the tied situation, each action has half the chance of being preferred. Thus,
$$
\begin{split}P(\text{action }X\text{ is preferred})&=P(X>Y)+\tfrac{1}{2}\,P(X=Y)=\tfrac{1}{2}(1+p-q),\\
P(\text{action }Y\text{ is preferred})&=P(X<Y)+\tfrac{1}{2}\,P(X=Y)=\tfrac{1}{2}(1+q-p).\end{split} \tag{6}
$$
Now the MC values are normalized and we can train with the pairwise loss.
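A quick numeric check of the normalization in Eqs. (5) and (6), with illustrative MC values $p=0.8$ and $q=0.3$ (the function name is ours):

```python
# Converting two Bernoulli MC values p, q into Bradley-Terry preference
# probabilities, splitting ties evenly, as in Eqs. (5)-(6).

def preference_probs(p, q):
    """Return (P(X preferred), P(Y preferred)) for MC values p and q."""
    win_x = p * (1 - q)              # P(X=1, Y=0)
    win_y = (1 - p) * q              # P(X=0, Y=1)
    tie = (1 - p) * (1 - q) + p * q  # P(X=Y)
    return win_x + tie / 2, win_y + tie / 2

px, py = preference_probs(0.8, 0.3)
# px = 0.75 and py = 0.25, matching the closed forms
# 1/2 * (1 + p - q) and 1/2 * (1 + q - p); the pair sums to 1.
```

The two outputs always sum to 1, so they form a valid preference distribution for the Bradley-Terry loss.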