# Improve Mathematical Reasoning in Language Models by Automated Process Supervision
**Authors**: Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
> Google DeepMind
> Google
Abstract
Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train Process Reward Models (PRMs). This fully automated process supervision, alongside the weighted self-consistency algorithm, is able to enhance LLMs' math reasoning performance. We improved the success rates of the instruction-tuned Gemini Pro model from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K. Similarly, we boosted the success rates of Gemma2 27B from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K. The entire process operates without any human intervention or supervision, making our method both financially and computationally cost-effective compared to existing methods.
1 Introduction
Despite the impressive advancements achieved by scaling Large Language Models (LLMs) on established benchmarks (Wei et al., 2022a), cultivating more sophisticated reasoning capabilities, particularly in domains like mathematical problem-solving and code generation, remains an active research area. Chain-of-thought (CoT) generation is crucial for these reasoning tasks, as it decomposes complex problems into intermediate steps, mirroring human reasoning processes. Prompting LLMs with CoT examples (Wei et al., 2022b) and fine-tuning them on question-CoT solution pairs (Perez et al., 2021; Ouyang et al., 2022) have proven effective, with the latter demonstrating superior performance. Furthermore, the advent of Reinforcement Learning with Human Feedback (RLHF; Ouyang et al., 2022) has enabled the alignment of LLM behaviors with human preferences through reward models, significantly enhancing model capabilities.
Figure 1: Example tree structure built with our proposed OmegaPRM algorithm. Each node in the tree indicates a state of a partial chain-of-thought solution, with information including the accuracy of rollouts and other statistics. Each edge indicates an action, i.e., a reasoning step, taken from the last state. Yellow edges are correct steps and blue edges are incorrect ones.
Beyond prompting and further training, developing effective decoding strategies is another crucial avenue for improvement. Self-consistency decoding (Wang et al., 2023) leverages multiple reasoning paths to arrive at a voted answer. Incorporating a verifier, such as an off-the-shelf LLM (Huang et al., 2022; Luo et al., 2023), can further guide LLMs in reasoning tasks by providing a feedback loop to verify final answers, identify errors, and suggest corrections. However, the gains of such approaches remain limited for complex multi-step reasoning problems. Reward models offer a promising alternative to verifiers, enabling the reranking of candidate outcomes based on reward signals to ensure higher accuracy. Two primary types of reward models have emerged: Outcome Reward Models (ORMs; Yu et al., 2024; Cobbe et al., 2021), which provide feedback only at the end of the problem-solving process, and Process Reward Models (PRMs; Li et al., 2023; Uesato et al., 2022; Lightman et al., 2023), which offer granular feedback at each reasoning step. PRMs have demonstrated superior effectiveness for complex reasoning tasks by providing such fine-grained supervision.
The primary bottleneck in developing PRMs lies in obtaining process supervision signals, which require supervised labels for each reasoning step. Current approaches rely heavily on costly and labor-intensive human annotation (Uesato et al., 2022; Lightman et al., 2023). Automating this process is crucial for scalability and efficiency. While recent efforts using per-step Monte Carlo estimation have shown promise (Wang et al., 2024a, b), their efficiency remains limited due to the vast search space. To address this challenge, we introduce OmegaPRM, a novel divide-and-conquer Monte Carlo Tree Search (MCTS) algorithm inspired by AlphaGo Zero (Silver et al., 2017) for automated process supervision data collection. For each question, we build a Monte Carlo Tree, as shown in Fig. 1, with the details explained in Section 3.3. This algorithm enables efficient collection of over 1.5 million high-quality process annotations without human intervention. Our PRM, trained on this dataset and combined with weighted self-consistency decoding, significantly improves the performance of instruction-tuned Gemini Pro from 51% to 69.4% on MATH500 (Lightman et al., 2023) and from 86.4% to 93.6% on GSM8K (Cobbe et al., 2021). We also boosted the success rates of Gemma2 27B from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K.
Our main contributions are as follows:
- We propose a novel divide-and-conquer style Monte Carlo Tree Search algorithm for automated process supervision data generation.
- The algorithm enables the efficient generation of over 1.5 million process supervision annotations, representing the largest and highest quality dataset of its kind to date. Additionally, the entire process operates without any human annotation, making our method both financially and computationally cost-effective.
- We combine our verifier with weighted self-consistency to further boost the performance of LLM reasoning. We significantly improve the success rates from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K for instruction-tuned Gemini Pro. For Gemma2 27B, we also improve the success rates from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K.
2 Related Work
Improving mathematical reasoning ability of LLMs.
Mathematical reasoning poses significant challenges for LLMs, and it is one of the key tasks for evaluating the reasoning ability of LLMs. With a huge number of math problems in their pretraining datasets, pretrained LLMs (OpenAI, 2023; Gemini Team et al., 2024; Touvron et al., 2023) are able to solve simple problems, yet struggle with more complicated reasoning. To overcome this, chain-of-thought prompting algorithms (Wei et al., 2022b; Fu et al., 2023) were proposed. These techniques are effective in improving the performance of LLMs on reasoning tasks without modifying the model parameters. Performance was further improved by supervised fine-tuning (SFT; Cobbe et al., 2021; Liu et al., 2024; Yu et al., 2023) on high-quality question-response pairs with full CoT reasoning steps.
Application of reward models in mathematical reasoning of LLMs.
To further improve the LLM's math reasoning performance, verifiers can help rank and select the best answer when multiple rollouts are available. Several works (Huang et al., 2022; Luo et al., 2023) have shown that using an LLM as a verifier is not well suited for math reasoning. For trained verifiers, two types of reward models are commonly used: the Outcome Reward Model (ORM) and the Process Reward Model (PRM). Both have shown performance boosts on math reasoning over self-consistency (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023), yet evidence shows that PRMs outperform ORMs (Lightman et al., 2023; Wang et al., 2024a). Generating high-quality process supervision data is the key to training a PRM. Besides expensive human annotation (Lightman et al., 2023), Math-Shepherd (Wang et al., 2024a) and MiPS (Wang et al., 2024b) explored Monte Carlo estimation to automate the data collection process without human involvement, and both observed large performance gains. Our work shares its essence with MiPS and Math-Shepherd, but we go further by collecting the process data using MCTS.
Monte Carlo Tree Search (MCTS).
MCTS (Świechowski et al., 2021) has been widely adopted in reinforcement learning (RL). AlphaGo (Silver et al., 2016) and AlphaGo Zero (Silver et al., 2017) were able to achieve great performance by combining MCTS with deep reinforcement learning. For LLMs, there are planning algorithms that fall into the category of tree search, such as Tree-of-Thought (Yao et al., 2023) and Reasoning-via-Planning (Hao et al., 2023). Recently, utilizing tree-like decoding to find the best output at inference time has also become an active topic of exploration; multiple works (Feng et al., 2023; Ma et al., 2023; Zhang et al., 2024; Tian et al., 2024; Feng et al., 2024; Kang et al., 2024) have observed improvements on reasoning tasks.
3 Methods
3.1 Process Supervision
Process supervision is a concept proposed to differentiate from outcome supervision. The reward models trained with these objectives are termed Process Reward Models (PRMs) and Outcome Reward Models (ORMs), respectively. In the ORM framework, given a query $q$ (e.g., a mathematical problem) and its corresponding response $x$ (e.g., a model-generated solution), an ORM is trained to predict the correctness of the final answer within the response. Formally, an ORM takes $q$ and $x$ and outputs the probability $p=\mathrm{ORM}(q,x)$ that the final answer in the response is correct. With a training set of question-answer pairs available, an ORM can be trained by sampling outputs from a policy model (e.g., a pretrained or fine-tuned LLM) using the questions and obtaining the correctness labels by comparing these outputs with the golden answers.
In contrast, a PRM is trained to predict the correctness of each intermediate step $x_{t}$ in the solution. Formally, $p_{t}=\mathrm{PRM}([q,x_{1:t-1}],x_{t})$ , where $x_{1:i}=[x_{1},...,x_{i}]$ represents the first $i$ steps of the solution. This provides more precise and fine-grained feedback than an ORM, as it identifies the exact location of errors. Process supervision has also been shown to mitigate incorrect reasoning in the domain of mathematical problem solving. Despite these advantages, obtaining the intermediate signal for each step's correctness to train such a PRM is non-trivial. Previous work (Lightman et al., 2023) has relied on hiring domain experts to manually annotate the labels, which is costly and difficult to scale.
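To make this interface concrete, here is a minimal sketch of PRM step scoring under the formulation above; the `prm` callable and the decision threshold are placeholders for illustration, not part of the paper:

```python
from typing import Callable, List, Optional

def score_steps(
    prm: Callable[[str, List[str], str], float],  # hypothetical call: p_t = PRM([q, x_{1:t-1}], x_t)
    question: str,
    steps: List[str],
) -> List[float]:
    """Return per-step correctness scores p_1, ..., p_T for a solution."""
    return [prm(question, steps[:t], step) for t, step in enumerate(steps)]

def first_suspect_step(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step whose PRM score falls below the threshold,
    i.e., the suspected location of the first error (None if all steps pass)."""
    return next((t for t, p in enumerate(scores) if p < threshold), None)
```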
3.2 Process Annotation with Monte Carlo Method
In two closely related works, Math-Shepherd (Wang et al., 2024a) and MiPS (Wang et al., 2024b), the authors propose an automatic annotation approach to obtain process supervision signals using the Monte Carlo method. Specifically, a “completer” policy is established that can take a question $q$ and a prefix solution comprising the first $t$ steps $x_{1:t}$ and output the completion — often referred to as a “rollout” in reinforcement learning — of the subsequent steps until the final answer is reached. As shown in Fig. 2 (a), for any step of a solution, the completer policy can be used to randomly sample $k$ rollouts from that step. The final answers of these rollouts are compared to the golden answer, providing $k$ correctness labels corresponding to the $k$ rollouts. Subsequently, the ratio of correct rollouts to total rollouts from the $t$-th step, as represented in Eq. 1, estimates the “correctness level” of the prefix steps up to $t$. Setting aside false positives, in this logical reasoning scenario $x_{1:t}$ should be considered correct as long as any of the rollouts reaches the correct answer.
$$
c_{t}=\mathrm{MonteCarlo}(q,x_{1:t})=\frac{\mathrm{num}(\text{correct rollouts from the $t$-th step})}{\mathrm{num}(\text{total rollouts from the $t$-th step})} \tag{1}
$$
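As a reference, a minimal sketch of this estimator; the `rollout` completer call and the `is_correct` final-answer checker are hypothetical placeholders:

```python
from typing import Callable, List

def monte_carlo_estimate(
    question: str,
    prefix_steps: List[str],                    # x_{1:t}
    rollout: Callable[[str, List[str]], str],   # hypothetical completer policy call
    is_correct: Callable[[str], bool],          # compares final answer to the golden answer
    k: int = 8,                                 # the paper samples k = 8 rollouts per estimate
) -> float:
    """Estimate c_t of Eq. 1: the fraction of k sampled completions of the
    prefix solution that reach the correct final answer."""
    completions = [rollout(question, prefix_steps) for _ in range(k)]
    return sum(is_correct(c) for c in completions) / k
```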
Taking a step forward, a straightforward strategy to annotate the correctness of intermediate steps in a solution is to perform rollouts for every step from the beginning to the end, as done in both Math-Shepherd and MiPS. However, this brute-force approach requires a large number of policy calls. To optimize annotation efficiency, we propose a binary-search-based Monte Carlo estimation.
Monte Carlo estimation using binary search. As suggested by Lightman et al. (2023), supervising up to the first incorrect step in a solution is sufficient to train a PRM. Therefore, our objective is to locate the first error efficiently. We achieve this by repeatedly dividing the solution and performing rollouts. Assuming no false positives or negatives, we start with a solution with potential errors and split it at the midpoint $m$. We then perform rollouts for $x_{1:m}$ with two possible outcomes: (1) $c_{m}>0$, indicating that the first half of the solution is correct, as at least one correct answer can be rolled out from the $m$-th step, and thus the error is in the second half; (2) $c_{m}=0$, indicating that the error is very likely in the first half, as none of the rollouts from the $m$-th step is correct. This process narrows down the error location to either the first or the second half of the solution. As shown in Fig. 2 (b), by repeating this process on the erroneous half iteratively until the partial solution is sufficiently small (i.e., short enough to be considered a single step), we can locate the first error with a time complexity of $O(k\log M)$ rather than the $O(kM)$ of the brute-force setting, where $M$ is the total number of steps in the original solution.
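To make the search concrete, the following is a minimal sketch under the stated no-false-positive/negative assumption; `mc_estimate` stands for the Monte Carlo estimator above with the policy and answer checker already bound:

```python
from typing import Callable, List

def locate_first_error(
    question: str,
    steps: List[str],
    mc_estimate: Callable[[str, List[str]], float],  # e.g. monte_carlo_estimate with
                                                     # rollout/is_correct bound
) -> int:
    """Binary-search the first erroneous step of a wrong-answer solution.

    Assumes no false positives/negatives: a prefix is correct iff its Monte
    Carlo estimate is positive, and the full solution contains an error.
    Costs O(k log M) rollouts instead of the O(kM) of per-step estimation.
    """
    lo, hi = 0, len(steps)  # invariant: prefix x_{1:lo} is correct, x_{1:hi} contains an error
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mc_estimate(question, steps[:mid]) > 0:
            lo = mid   # at least one rollout from step mid is correct: error lies later
        else:
            hi = mid   # no correct rollout: the error is at or before step mid
    return hi          # the hi-th step (1-indexed) is the first error
```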
3.3 Monte Carlo Tree Search
Although binary search improves the efficiency of locating the first error in a solution, we still do not fully utilize policy calls, as rollouts are simply discarded after the stepwise Monte Carlo estimation. In practice, it is necessary to collect multiple PRM training examples (i.e., triplets of question, partial solution, and correctness label) per question (Lightman et al., 2023; Wang et al., 2024a). Instead of starting from scratch each time, we can store all rollouts during the process and conduct binary searches from any of these rollouts whenever we need to collect a new example. This approach allows for triplets with the same solution prefix but different completions and error locations. Such reasoning structures can be represented as a tree, as described in previous work such as Tree of Thought (Yao et al., 2023).
Formally, consider a state-action tree representing detailed reasoning paths for a question, where a state $s$ contains the question and all preceding reasoning steps, and an action $a$ is a potential subsequent step from a specific state. The root state is the question without any reasoning steps: $s_{\text{root}}=q$. The policy can be directly modeled by a language model as $\pi(a|s)=\mathrm{LM}(a|s)$, and the state transition function is simply the concatenation of the preceding steps and the action step, i.e., $s^{\prime}=\mathrm{Concatenate}(s,a)$.
Collecting PRM training examples for a question can now be formulated as constructing such a state-action tree. This is reminiscent of the classic Monte Carlo Tree Search (MCTS) algorithm, which has been successful in many deep reinforcement learning applications (Silver et al., 2016, 2017). However, there are some key differences when using a language model as the policy. First, MCTS typically handles an environment with a finite action space, such as the game of Go, which has at most $361$ possible actions per state (Silver et al., 2017). In contrast, an LM policy has an infinite action space, as it can generate an unlimited number of distinct actions (sequences of tokens) given a prompt. In practice, we use temperature sampling to generate a fixed number $k$ of completions for a prompt, treating the group of $k$ actions as an approximate action space, as sketched below. Second, an LM policy can sample a full rollout until the termination state (i.e., reaching the final answer) with little additional overhead compared to generating a single step, enabling the possibility of binary search. Consequently, we propose an adaptation of the MCTS algorithm named OmegaPRM, primarily based on the one introduced in AlphaGo (Silver et al., 2016), but with modifications to better accommodate the scenario of PRM training data collection. We describe the algorithm details below.
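A minimal sketch of this action-space approximation; `lm_sample` is a hypothetical call returning one temperature-sampled continuation:

```python
from typing import Callable, List

def approximate_actions(
    lm_sample: Callable[[str, float], str],  # hypothetical LM call: one sampled next step
    state: str,                              # question plus preceding reasoning steps
    k: int = 8,
    temperature: float = 1.0,
) -> List[str]:
    """Approximate the LM policy's unbounded action space by treating k
    temperature-sampled continuations as the candidate actions at state s."""
    return [lm_sample(state, temperature) for _ in range(k)]
```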
Tree Structure.
Each node $s$ in the tree contains the question $q$ and a prefix solution $x_{1:t}$, together with all previous rollouts $\{(s,r_{i})\}_{i=1}^{k}$ from that state. Each edge $(s,a)$ is either a single step or a sequence of consecutive steps starting from node $s$. Each node also stores a set of statistics,
$$
\{N(s),\mathrm{MC}(s),Q(s,r)\},
$$
where $N(s)$ denotes the visit count of a state, $\mathrm{MC}(s)$ represents the Monte Carlo estimation of a state as specified in Eq. 1, and $Q(s,r)$ is a state-rollout value function that is correlated to the chance of selecting a rollout during the selection phase of tree traversal. Specifically,
(a) Monte Carlo estimation of a prefix solution.
(b) Error locating using binary search.
(c) Three stages in an iteration of the MCTS process.
Figure 2: Illustration of the process supervision rollouts, Monte Carlo estimation using binary search, and the MCTS process. (a) An example of Monte Carlo estimation of a prefix solution. Two out of the three rollouts are correct, producing the Monte Carlo estimation $\mathrm{MC}(q,x_{1:t})=2/3\approx 0.67$. (b) An example of error locating using binary search. The first error step is located at the $7^{\text{th}}$ step after three divide-and-rollouts, where the rollout positions are indicated by the vertical dashed lines. (c) The MCTS process. The dotted lines in the Select stage represent the available rollouts for binary search. The bold colored edges represent steps with correctness estimations. The yellow color indicates a correct step, i.e., with a preceding state $s$ such that $\mathrm{MC}(s)>0$, and the blue color indicates an incorrect step, i.e., with $\mathrm{MC}(s)=0$. The number of dashes in each colored edge indicates the number of steps.
$$
Q(s,r)=\alpha^{1-\mathrm{MC}(s)}\cdot\beta^{\frac{\mathrm{len}(r)}{L}}, \tag{2}
$$
where $\alpha,\beta\in(0,1]$ and $L>0$ are constant hyperparameters, and $\mathrm{len}(r)$ denotes the length of a rollout in number of tokens. $Q$ indicates how likely a rollout is to be chosen in each iteration, and our goal is to define a heuristic that selects the most valuable rollout to search with. The most straightforward strategy is to uniformly choose among rollout candidates generated by the policy in previous rounds; however, this is clearly not effective. Lightman et al. (2023) suggest surfacing convincing wrong-answer solutions to annotators during labeling. Inspired by this, we propose to prioritize supposed-to-be-correct wrong-answer rollouts during selection. We use the term supposed-to-be-correct to refer to a state with a Monte Carlo estimation $\mathrm{MC}(s)$ close to $1$, and wrong-answer to indicate that the specific rollout $r$ has a wrong final answer. Such a rollout contains mistakes by the policy that should have been avoided given the high $\mathrm{MC}(s)$. We expect a PRM that learns to detect errors in such rollouts to be more useful in correcting the policy's mistakes. The first component in Eq. 2, $\alpha^{1-\mathrm{MC}(s)}$, takes a larger value as $\mathrm{MC}(s)$ gets closer to $1$. Additionally, we incorporate a length penalty factor $\beta^{\frac{\mathrm{len}(r)}{L}}$ to penalize excessively long rollouts.
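Eq. 2 transcribes directly into a small helper; the defaults below are the hyperparameters reported in Section 4, and the token count of the rollout is assumed to be supplied by the caller:

```python
def rollout_value(
    mc_s: float,          # MC(s) of the rollout's starting state
    rollout_len: int,     # len(r): rollout length in tokens
    alpha: float = 0.5,   # paper's settings (Section 4)
    beta: float = 0.9,
    L: float = 500.0,
) -> float:
    """Q(s, r) = alpha^(1 - MC(s)) * beta^(len(r) / L), Eq. 2: larger for
    supposed-to-be-correct states (MC(s) near 1), with a length penalty."""
    return (alpha ** (1.0 - mc_s)) * (beta ** (rollout_len / L))
```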
Select.
The selection phase in our algorithm is simpler than that of AlphaGo (Silver et al., 2016), which involves selecting a sequence of actions from the root to a leaf node, forming a trajectory with multiple states and actions. In contrast, we maintain a pool of all rollouts $\{(s_{i},r^{i}_{j})\}$ from previous searches that satisfy $0<\mathrm{MC}(s_{i})<1$ . During each selection, a rollout is popped and selected according to tree statistics, $(s,r)=\operatorname*{arg\,max}_{(s,r)}[Q(s,r)+U(s)]$ , using a variant of the PUCT (Rosin, 2011) algorithm,
$$
U(s)=c_{\text{puct}}\frac{\sqrt{\sum_{i}N(s_{i})}}{1+N(s)}, \tag{3}
$$
where $c_{\text{puct}}$ is a constant determining the level of exploration. This strategy initially favors rollouts with low visit counts but gradually shifts preference towards those with high rollout values.
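A minimal sketch of this selection step under the definitions above, using a small bookkeeping dataclass (field names are ours, not the paper's), per-rollout $Q$ values from Eq. 2, and the paper's $c_{\text{puct}}=0.125$:

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TreeNode:
    """A state s: the question plus a prefix solution, with tree statistics."""
    question: str
    prefix_steps: List[str]                                    # x_{1:t}
    visit_count: int = 0                                       # N(s)
    mc_value: float = 0.0                                      # MC(s), Eq. 1
    q_values: Dict[int, float] = field(default_factory=dict)   # Q(s, r) per rollout, Eq. 2

def select_rollout(pool: List[Tuple[TreeNode, int]],
                   c_puct: float = 0.125) -> Tuple[TreeNode, int]:
    """Pop (s, r) = argmax [Q(s, r) + U(s)] from the candidate pool,
    where U(s) = c_puct * sqrt(sum_i N(s_i)) / (1 + N(s)), Eq. 3."""
    total_visits = sum(node.visit_count for node, _ in pool)

    def score(item: Tuple[TreeNode, int]) -> float:
        node, r_idx = item
        u = c_puct * math.sqrt(total_visits) / (1 + node.visit_count)
        return node.q_values[r_idx] + u

    best = max(pool, key=score)
    pool.remove(best)  # the selected rollout is popped from the pool
    return best
```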
Binary Search.
We perform a binary search to identify the first error location in the selected rollout, as detailed in Section 3.2. The rollouts with $0<\mathrm{MC}(s)<1$ encountered during the process are added to the selection candidate pool. All divide-and-rollout positions before the first error become new states. For the example in Fig. 2 (b), the trajectory $s[q]\to s[q,x_{1:4}]\to s[q,x_{1:6}]\to s[q,x_{1:7}]$ is added to the tree after the binary search. The edges $s[q]\to s[q,x_{1:4}]$ and $s[q,x_{1:4}]\to s[q,x_{1:6}]$ are correct, with $\mathrm{MC}$ values of $0.25$ and $0.5$, respectively, while the edge $s[q,x_{1:6}]\to s[q,x_{1:7}]$ is incorrect, with an $\mathrm{MC}$ value of $0$.
Maintain.
After the binary search, the tree statistics $N(s)$ , $\mathrm{MC}(s)$ , and $Q(s,r)$ are updated. Specifically, $N(s)$ is incremented by $1$ for the selected $(s,r)$ . Both $\mathrm{MC}(s)$ and $Q(s,r)$ are updated for the new rollouts sampled from the binary search. This phase resembles the backup phase in AlphaGo but is simpler, as it does not require recursive updates from the leaf to the root.
Tree Construction.
By repeating the above process, we can construct a state-action tree like the example illustrated in Fig. 1. The construction ends either when the search count reaches a predetermined limit or when no additional rollout candidates are available in the pool.
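Putting the three stages together, a high-level sketch of one construction run; `initialize_pool`, `binary_search_rollout`, and `extract_training_triplets` are hypothetical helpers standing in for the root-rollout bootstrap, the binary search of Section 3.2, and the triplet extraction:

```python
def build_tree(question: str, search_limit: int = 100):
    """Repeat Select -> Binary Search -> Maintain until the search count
    reaches the limit (the paper uses 100 per question) or no rollout
    candidates remain in the pool."""
    pool = initialize_pool(question)   # hypothetical: root rollouts with 0 < MC(s) < 1
    triplets = []                      # (state, step, label) PRM training examples
    for _ in range(search_limit):
        if not pool:
            break
        node, r_idx = select_rollout(pool)                      # Select (Eqs. 2 and 3)
        node.visit_count += 1                                   # Maintain: N(s) += 1
        new_states = binary_search_rollout(node, r_idx, pool)   # Binary Search; rollouts with
                                                                # 0 < MC < 1 re-enter the pool
        triplets.extend(extract_training_triplets(new_states))  # hypothetical
    return triplets
```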
3.4 PRM Training
Each edge $(s,a)$ with a single-step action in the constructed state-action tree can serve as a training example for the PRM. It can be trained using the standard classification loss
$$
\mathcal{L}_{\text{pointwise}}=-\sum_{i=1}^{N}\left[\hat{y}_{i}\log y_{i}+(1-\hat{y}_{i})\log(1-y_{i})\right], \tag{4}
$$
where $\hat{y}_{i}$ represents the correctness label and $y_{i}=\mathrm{PRM}(s,a)$ is the prediction score of the PRM. Wang et al. (2024b) have used the Monte Carlo estimation as the correctness label, denoted as $\hat{y}=\mathrm{MC}(s)$. Alternatively, Wang et al. (2024a) have employed a binary labeling approach, where $\hat{y}=\mathbf{1}[\mathrm{MC}(s)>0]$, assigning $\hat{y}=1$ for any positive Monte Carlo estimation and $\hat{y}=0$ otherwise. We refer to the former option as the pointwise soft label and to the latter as the pointwise hard label. In addition, considering that there are many cases where a common solution prefix has multiple single-step actions, we can also minimize the cross-entropy loss between the PRM predictions and the normalized pairwise preferences following the Bradley-Terry model (Christiano et al., 2017). We refer to this training method as the pairwise approach; the detailed pairwise loss formula can be found in Appendix B.
We use the pointwise soft label when evaluating the main results in Section 4.1, and a comparison of the three objectives is discussed in Section 4.3.
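For illustration, a minimal sketch of the pointwise objective under either labeling scheme, written against PyTorch's binary cross-entropy (tensor names are ours):

```python
import torch
import torch.nn.functional as F

def pointwise_prm_loss(
    prm_scores: torch.Tensor,   # y_i in (0, 1): PRM predictions, one per (s, a) example
    mc_values: torch.Tensor,    # MC(s) estimates for the same examples
    label_type: str = "soft",
) -> torch.Tensor:
    """Binary cross-entropy of Eq. 4 under either labeling scheme:
    soft: y_hat = MC(s)          (as in Wang et al., 2024b)
    hard: y_hat = 1[MC(s) > 0]   (as in Wang et al., 2024a)."""
    targets = mc_values if label_type == "soft" else (mc_values > 0).float()
    return F.binary_cross_entropy(prm_scores, targets)
```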
4 Experiments
Data Generation.
We conduct our experiments on the challenging MATH dataset (Hendrycks et al., 2021). We use the same training and testing split as described in Lightman et al. (2023), which consists of $12$K training examples and a holdout subset of $500$ representative problems from the original $5$K test examples introduced in Hendrycks et al. (2021). We observe similar policy performance on the full test set and the subset. To create the process annotation data, we use the questions from the training split and set the search limit to $100$ per question, resulting in $1.5$M per-step process supervision annotations. To reduce false positive and false negative noise, we filter out questions that are either too hard or too easy for the model; please refer to Appendix A for details. We use $\alpha=0.5$, $\beta=0.9$, and $L=500$ for calculating $Q(s,r)$ in Eq. 2, and $c_{\text{puct}}=0.125$ in Eq. 3. We sample $k=8$ rollouts for each Monte Carlo estimation.
Models.
In previous studies (Lightman et al., 2023; Wang et al., 2024a, b), both proprietary models such as GPT-4 (OpenAI, 2023) and open-source models such as Llama2 (Touvron et al., 2023) were explored. In our study, we perform experiments with both the proprietary Gemini Pro (Gemini Team et al., 2024) and the open-source Gemma2 (Gemma Team et al., 2024) models. For Gemini Pro, we follow Lightman et al. (2023) and Wang et al. (2024a) and initially fine-tune it on math instruction data, achieving an accuracy of approximately $51$% on the MATH test set. The instruction-tuned model is then used for solution sampling. For the open-source models, to maximize reproducibility, we directly use the pretrained Gemma2 27B checkpoint with the 4-shot prompt introduced in Gemini Team et al. (2024). The reward models are all trained from the pretrained checkpoints.
Metrics and baselines.
We evaluate the PRM-based majority voting results on GSM8K (Cobbe et al., 2021) and MATH500 (Lightman et al., 2023) using PRMs trained on different process supervision data. Following Lightman et al. (2023), who compared using the product versus the minimum of step scores and found the difference to be minor, we choose the product of scores across all steps as the final solution score. Baseline process supervision data include PRM800K (Lightman et al., 2023) and Math-Shepherd (Wang et al., 2024a), both publicly available. Additionally, we generate a process annotation dataset with our Gemini policy model using the brute-force approach described in Wang et al. (2024a, b), referred to as Math-Shepherd (our impl) in subsequent sections.
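As a reference, a minimal sketch of this PRM-weighted majority voting, assuming each candidate solution's per-step PRM scores are already computed (names are ours):

```python
from collections import defaultdict
from math import prod
from typing import List, Tuple

def prm_weighted_vote(solutions: List[Tuple[str, List[float]]]) -> str:
    """PRM-weighted majority voting: each solution is weighted by the product
    of its per-step PRM scores, weights are summed per final answer, and the
    heaviest answer wins."""
    weight = defaultdict(float)
    for final_answer, step_scores in solutions:
        weight[final_answer] += prod(step_scores)
    return max(weight, key=weight.get)

# e.g. prm_weighted_vote([("24", [0.9, 0.8]), ("20", [0.4, 0.3]), ("24", [0.7, 0.9])])
# returns "24": total weight 0.72 + 0.63 = 1.35 vs. 0.12 for "20"
```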
4.1 Main Results
(a) Gemini Pro on MATH500.
(b) Gemini Pro on GSM8K.
(c) Gemma2 27B on MATH500.
(d) Gemma2 27B on GSM8K.
Figure 3: A comparison of PRMs trained with different process supervision datasets, evaluated by their ability to search over many test solutions using a PRM-weighted majority voting. We visualize the variance across many sub-samples of the $128$ solutions we generated in total per problem.
Table 1: The performance comparison of PRMs trained with different process supervision datasets. The numbers represent the percentage of problems solved using PRM-weighted majority voting with $k=64$ .
| Method | Gemini Pro MATH500 | Gemma2 27B MATH500 | Gemini Pro GSM8K | Gemma2 27B GSM8K |
| --- | --- | --- | --- | --- |
| Majority Vote@64 | 67.2 | 54.7 | 92.7 | 90.6 |
| + Math-Shepherd | 67.2 | 57.4 | 92.7 | 90.5 |
| + Math-Shepherd (our impl) | 67.2 | 55.2 | 91.8 | 91.4 |
| + PRM800K | 67.6 | 57.2 | 92.9 | 91.7 |
| + OmegaPRM | 69.4 | 58.2 | 93.6 | 92.2 |
Table 1 and Fig. 3 present the performance comparison of PRMs trained on various process annotation datasets. OmegaPRM consistently outperforms the other process supervision datasets. Specifically, the fine-tuned Gemini Pro achieves $69.4$% and $93.6$% accuracy on MATH500 and GSM8K, respectively, using OmegaPRM-weighted majority voting. The pretrained Gemma2 27B likewise performs best with OmegaPRM, at $58.2$% and $92.2$% accuracy on MATH500 and GSM8K, respectively. OmegaPRM shows superior performance compared to both the human-annotated PRM800K and the automatically annotated Math-Shepherd. More specifically, when the number of samples is small, almost all the PRMs outperform majority voting. However, as the number of samples increases, the performance of the other PRMs gradually converges to the level of majority voting. In contrast, our PRM continues to demonstrate a clear margin in accuracy.
4.2 Step Distribution
An important factor in process supervision is the number of steps in a solution and the length of each step. Previous works (Lightman et al., 2023; Wang et al., 2024a, b) use rule-based strategies to split a solution into steps, e.g., using newlines as delimiters. In contrast, we propose a more flexible method for step division, treating any sequence of consecutive tokens in a solution as a valid step. We observe that many step divisions in Math-Shepherd lack semantic coherence. We therefore hypothesize that semantically explicit step boundaries are not necessary for training a PRM.
<details>
<summary>x8.png Details</summary>

Histogram of the number of steps per solution in Math-Shepherd (x-axis: 0-30 steps; y-axis: count, peaking near 3.2 x 10^4). Roughly bell-shaped with a peak at 8-10 steps and a thin right tail; very few solutions use fewer than 4 or more than 20 steps.
</details>
<details>
<summary>x9.png Details</summary>

Histogram of the number of steps per solution in PRM800K (x-axis: 0 to ~40 steps). Unimodal, peaking around 8-12 steps, with a slight right skew; few solutions require more than 20 steps.
</details>
Figure 4: Distribution of the number of steps per solution.
In practice, we first examine the distribution of the number of steps per solution in PRM800K and Math-Shepherd, as shown in Fig. 4, noting that most solutions have fewer than $20$ steps. During binary search, we aim to divide a full solution into $16$ pieces. To calculate the expected step length, we divide the average solution length by $16$; the binary search terminates once a step is shorter than this value. The resulting step-length distributions for OmegaPRM and the other two datasets are shown in Fig. 5. This flexible splitting strategy produces a step-length distribution similar to that of the rule-based strategies.
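The termination rule fits in a few lines; the following is a minimal sketch under the description above, where the function name and the tokenized-input format are our own assumptions.

```python
def binary_search_threshold(solutions_tokens, target_pieces=16):
    """Expected step length used as the binary search termination rule.

    `solutions_tokens` is a list of tokenized solutions (lists of tokens).
    The threshold is the average solution length in tokens divided by the
    target number of pieces; the search stops subdividing any span shorter
    than this, so an average-length solution ends up split into roughly
    `target_pieces` steps.
    """
    avg_len = sum(len(s) for s in solutions_tokens) / len(solutions_tokens)
    return avg_len / target_pieces
```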
<details>
<summary>x10.png Details</summary>

Histogram of per-step length (in tokens) for Math-Shepherd (x-axis: 0-200 tokens; y-axis: count, up to ~3 x 10^5). Right-skewed, peaking at roughly 20-30 tokens; steps longer than ~80 tokens are rare.
</details>
<details>
<summary>x11.png Details</summary>

Histogram of per-step length (in tokens) for PRM800K (x-axis: 0-200 tokens). Right-skewed, peaking at roughly 20-30 tokens; very few steps exceed 100 tokens.
</details>
<details>
<summary>x12.png Details</summary>

Histogram of per-step length (in tokens) for OmegaPRM (x-axis: 0-200 tokens; y-axis: count, up to ~2 x 10^5). Heavily right-skewed, peaking around 10-20 tokens with a long tail; the vast majority of steps are under 50 tokens.
</details>
Figure 5: Step length distribution in terms of number of tokens.
4.3 PRM Training Objectives
Table 2: Comparison of different training objectives for PRMs.
| | Soft Label | Hard Label | Pairwise |
| --- | --- | --- | --- |
| PRM Accuracy (%) | 70.1 | 63.3 | 64.2 |
As outlined in Section 3.4, PRMs can be trained using multiple objectives. We construct a small process supervision test set from problems in the MATH test split. We train PRMs with the pointwise soft label, pointwise hard label, and pairwise losses, respectively, and evaluate how accurately each classifies per-step correctness. Table 2 compares the objectives; the pointwise soft label performs best, with 70.1% accuracy.
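For illustration, here is a hedged PyTorch sketch of the three objectives, assuming the PRM emits a single logit per step and the Monte Carlo estimate `mc` lies in $[0,1]$; the function names are ours.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, mc: torch.Tensor) -> torch.Tensor:
    # Pointwise soft label: regress toward the MC value itself.
    return F.binary_cross_entropy_with_logits(logits, mc)

def hard_label_loss(logits: torch.Tensor, mc: torch.Tensor) -> torch.Tensor:
    # Pointwise hard label: a step counts as correct if any rollout
    # from it reached the golden answer (MC value > 0).
    return F.binary_cross_entropy_with_logits(logits, (mc > 0).float())

def pairwise_bt_loss(logit_x: torch.Tensor, logit_y: torch.Tensor,
                     pref_x: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss; `pref_x` is the normalized probability
    # that step X is preferred over step Y (see Appendix B).
    return F.binary_cross_entropy_with_logits(logit_x - logit_y, pref_x)
```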
4.4 Algorithm Efficiency
As described in Sections 3.2 and 3.3, we use binary search and Monte Carlo Tree Search to improve the efficiency of OmegaPRM process supervision data collection, by efficiently identifying the first incorrect step and reusing rollouts in the Monte Carlo estimation. To quantify the efficiency of OmegaPRM, we collected process supervision data with both a brute-force-style method (Wang et al., 2024a, b) and OmegaPRM under the same computational budget. The brute-force algorithm generated $200$ K data points, compared to $15$ million with OmegaPRM, a $75$ -fold efficiency improvement. In practice, we randomly down-sampled the OmegaPRM data to $1.5$ million examples for PRM training.
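To see where the gain comes from, consider a simplified, MCTS-free sketch of the binary search for the first error; `mc_estimate` is an assumed helper that runs $k$ rollouts from a partial solution and returns the fraction reaching the golden answer. Each probe costs $k$ rollouts, so only $O(\log n)$ probes are needed instead of one probe per step; the full OmegaPRM algorithm additionally reuses rollouts across probes via the tree structure.

```python
def locate_first_error(question, tokens, mc_estimate, threshold):
    """Simplified binary search for the first incorrect step.

    Invariant: the prefix ending at `lo` is still recoverable
    (mc_estimate > 0), while the prefix ending at `hi` is not
    (mc_estimate == 0). Assumes the full solution is known to be wrong,
    i.e. mc_estimate(question, tokens) == 0.
    """
    lo, hi = 0, len(tokens)
    while hi - lo > threshold:
        mid = (lo + hi) // 2
        if mc_estimate(question, tokens[:mid]) > 0:
            lo = mid   # prefix still recoverable: error lies after mid
        else:
            hi = mid   # no rollout succeeds: first error at or before mid
    return hi          # approximate position of the first error
```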
5 Limitations
There are some limitations with our paper, which we reserve for future work:
Automatic process annotation is noisy.
Our method for automatic process supervision annotation introduces noise in the form of false positives and false negatives, yet our experiments indicate that it can still train an effective PRM: the PRM trained on our dataset outperforms one trained on the human-annotated PRM800K dataset. The precise impact of this noise on PRM performance remains uncertain. Future research should conduct a comprehensive comparison of human and automated annotations. Another direction is to integrate human and automated annotations, which could yield more robust and efficient process supervision.
Human supervision is still necessary.
Unlike the work presented in AlphaGo Zero (Silver et al., 2017), our method requires question and golden-answer pairs. The question is needed for the LLM to start the MCTS, and the golden answer is indispensable for judging the correctness of the current step by comparing rollout outcomes against it. This restricts the method to tasks where such pairs are available; the approach must therefore be adapted further to suit open-ended tasks.
6 Conclusion
In conclusion, we introduce OmegaPRM, a divide-and-conquer Monte Carlo Tree Search algorithm designed to automate process supervision data collection for LLMs. By efficiently pinpointing the first error in the Chain of Thought and balancing data quality, OmegaPRM addresses the shortcomings of existing methods. Our automated approach enables the collection of over 1.5 million process supervision annotations, which we use to train a PRM. Combining this automated process supervision with the weighted self-consistency algorithm, we improve LLM mathematical reasoning performance, achieving a 69.4% success rate on MATH500, an 18.4% absolute increase over the base model and a relative improvement of 36%. Additionally, our method significantly reduces data collection costs compared to human annotation and brute-force Monte Carlo sampling. These findings highlight OmegaPRM's potential to enhance LLM capabilities in complex multi-step reasoning tasks.
7 Acknowledgements
We would like to express our deepest gratitude to Matt Barnes, Jindong Chen and Rif A. Saurous, for their support and helpful feedback to our work. We also thank Peiyi Wang for clarifying the details in Math-Shepherd and providing insightful suggestions.
References
- Christiano et al. (2017) P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741, 2017.
- Cobbe et al. (2021) K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Feng et al. (2023) X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
- Feng et al. (2024) X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2024.
- Fu et al. (2023) Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-based prompting for multi-step reasoning. In Proceedings of the 11th International Conference on Learning Representations (ICLR), May 2023.
- Gemini Team et al. (2024) Gemini Team, M. Reid, N. Savinov, D. Teplyashin, Dmitry, Lepikhin, T. Lillicrap, J. baptiste Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, I. Antonoglou, R. Anil, S. Borgeaud, A. Dai, K. Millican, E. Dyer, M. Glaese, T. Sottiaux, B. Lee, F. Viola, M. Reynolds, Y. Xu, J. Molloy, J. Chen, M. Isard, P. Barham, T. Hennigan, R. McIlroy, M. Johnson, J. Schalkwyk, E. Collins, E. Rutherford, E. Moreira, K. Ayoub, M. Goel, C. Meyer, G. Thornton, Z. Yang, H. Michalewski, Z. Abbas, N. Schucher, A. Anand, R. Ives, J. Keeling, K. Lenc, S. Haykal, S. Shakeri, P. Shyam, A. Chowdhery, R. Ring, S. Spencer, E. Sezener, L. Vilnis, O. Chang, N. Morioka, G. Tucker, C. Zheng, O. Woodman, N. Attaluri, T. Kocisky, E. Eltyshev, X. Chen, T. Chung, V. Selo, S. Brahma, P. Georgiev, A. Slone, Z. Zhu, J. Lottes, S. Qiao, B. Caine, S. Riedel, A. Tomala, M. Chadwick, J. Love, P. Choy, S. Mittal, N. Houlsby, Y. Tang, M. Lamm, L. Bai, Q. Zhang, L. He, Y. Cheng, P. Humphreys, Y. Li, S. Brin, A. Cassirer, Y. Miao, L. Zilka, T. Tobin, K. Xu, L. Proleev, D. Sohn, A. Magni, L. A. Hendricks, I. Gao, S. Ontanon, O. Bunyan, N. Byrd, A. Sharma, B. Zhang, M. Pinto, R. Sinha, H. Mehta, D. Jia, S. Caelles, A. Webson, A. Morris, B. Roelofs, Y. Ding, R. Strudel, X. Xiong, M. Ritter, M. Dehghani, R. Chaabouni, A. Karmarkar, G. Lai, F. Mentzer, B. Xu, Y. Li, Y. Zhang, T. L. Paine, A. Goldin, B. Neyshabur, K. Baumli, A. Levskaya, M. Laskin, W. Jia, J. W. Rae, K. Xiao, A. He, S. Giordano, L. Yagati, J.-B. Lespiau, P. Natsev, S. Ganapathy, F. Liu, D. Martins, N. Chen, Y. Xu, M. Barnes, R. May, A. Vezer, J. Oh, K. Franko, S. Bridgers, R. Zhao, B. Wu, B. Mustafa, S. Sechrist, E. Parisotto, T. S. Pillai, C. Larkin, C. Gu, C. Sorokin, M. Krikun, A. Guseynov, J. Landon, R. Datta, A. Pritzel, P. Thacker, F. Yang, K. Hui, A. Hauth, C.-K. Yeh, D. Barker, J. Mao-Jones, S. Austin, H. Sheahan, P. Schuh, J. Svensson, R. Jain, V. Ramasesh, A. Briukhov, D.-W. Chung, T. von Glehn, C. Butterfield, P. Jhakra, M. Wiethoff, J. Frye, J. Grimstad, B. Changpinyo, C. L. Lan, A. Bortsova, Y. Wu, P. Voigtlaender, T. Sainath, S. Gu, C. Smith, W. Hawkins, K. Cao, J. Besley, S. Srinivasan, M. Omernick, C. Gaffney, G. Surita, R. Burnell, B. Damoc, J. Ahn, A. Brock, M. Pajarskas, A. Petrushkina, S. Noury, L. Blanco, K. Swersky, A. Ahuja, T. Avrahami, V. Misra, R. de Liedekerke, M. Iinuma, A. Polozov, S. York, G. van den Driessche, P. Michel, J. Chiu, R. Blevins, Z. Gleicher, A. Recasens, A. Rrustemi, E. Gribovskaya, A. Roy, W. Gworek, S. M. R. Arnold, L. Lee, J. Lee-Thorp, M. Maggioni, E. Piqueras, K. Badola, S. Vikram, L. Gonzalez, A. Baddepudi, E. Senter, J. Devlin, J. Qin, M. Azzam, M. Trebacz, M. Polacek, K. Krishnakumar, S. yiin Chang, M. Tung, I. Penchev, R. Joshi, K. Olszewska, C. Muir, M. Wirth, A. J. Hartman, J. Newlan, S. Kashem, V. Bolina, E. Dabir, J. van Amersfoort, Z. Ahmed, J. Cobon-Kerr, A. Kamath, A. M. Hrafnkelsson, L. Hou, I. Mackinnon, A. Frechette, E. Noland, X. Si, E. Taropa, D. Li, P. Crone, A. Gulati, S. Cevey, J. Adler, A. Ma, D. Silver, S. Tokumine, R. Powell, S. Lee, K. Vodrahalli, S. Hassan, D. Mincu, A. Yang, N. Levine, J. Brennan, M. Wang, S. Hodkinson, J. Zhao, J. Lipschultz, A. Pope, M. B. Chang, C. Li, L. E. Shafey, M. Paganini, S. Douglas, B. Bohnet, F. Pardo, S. Odoom, M. Rosca, C. N. dos Santos, K. Soparkar, A. Guez, T. Hudson, S. Hansen, C. Asawaroengchai, R. Addanki, T. Yu, W. Stokowiec, M. Khan, J. Gilmer, J. Lee, C. G. Bostock, K. Rong, J. Caton, P. Pejman, F. Pavetic, G. Brown, V. 
Sharma, M. Lučić, R. Samuel, J. Djolonga, A. Mandhane, L. L. Sjösund, E. Buchatskaya, E. White, N. Clay, J. Jiang, H. Lim, R. Hemsley, Z. Cankara, J. Labanowski, N. D. Cao, D. Steiner, S. H. Hashemi, J. Austin, A. Gergely, T. Blyth, J. Stanton, K. Shivakumar, A. Siddhant, A. Andreassen, C. Araya, N. Sethi, R. Shivanna, S. Hand, A. Bapna, A. Khodaei, A. Miech, G. Tanzer, A. Swing, S. Thakoor, L. Aroyo, Z. Pan, Z. Nado, J. Sygnowski, S. Winkler, D. Yu, M. Saleh, L. Maggiore, Y. Bansal, X. Garcia, M. Kazemi, P. Patil, I. Dasgupta, I. Barr, M. Giang, T. Kagohara, I. Danihelka, A. Marathe, V. Feinberg, M. Elhawaty, N. Ghelani, D. Horgan, H. Miller, L. Walker, R. Tanburn, M. Tariq, D. Shrivastava, F. Xia, Q. Wang, C.-C. Chiu, Z. Ashwood, K. Baatarsukh, S. Samangooei, R. L. Kaufman, F. Alcober, A. Stjerngren, P. Komarek, K. Tsihlas, A. Boral, R. Comanescu, J. Chen, R. Liu, C. Welty, D. Bloxwich, C. Chen, Y. Sun, F. Feng, M. Mauger, X. Dotiwalla, V. Hellendoorn, M. Sharman, I. Zheng, K. Haridasan, G. Barth-Maron, C. Swanson, D. Rogozińska, A. Andreev, P. K. Rubenstein, R. Sang, D. Hurt, G. Elsayed, R. Wang, D. Lacey, A. Ilić, Y. Zhao, A. Iwanicki, A. Lince, A. Chen, C. Lyu, C. Lebsack, J. Griffith, M. Gaba, P. Sandhu, P. Chen, A. Koop, R. Rajwar, S. H. Yeganeh, S. Chang, R. Zhu, S. Radpour, E. Davoodi, V. I. Lei, Y. Xu, D. Toyama, C. Segal, M. Wicke, H. Lin, A. Bulanova, A. P. Badia, N. Rakićević, P. Sprechmann, A. Filos, S. Hou, V. Campos, N. Kassner, D. Sachan, M. Fortunato, C. Iwuanyanwu, V. Nikolaev, B. Lakshminarayanan, S. Jazayeri, M. Varadarajan, C. Tekur, D. Fritz, M. Khalman, D. Reitter, K. Dasgupta, S. Sarcar, T. Ornduff, J. Snaider, F. Huot, J. Jia, R. Kemp, N. Trdin, A. Vijayakumar, L. Kim, C. Angermueller, L. Lao, T. Liu, H. Zhang, D. Engel, S. Greene, A. White, J. Austin, L. Taylor, S. Ashraf, D. Liu, M. Georgaki, I. Cai, Y. Kulizhskaya, S. Goenka, B. Saeta, Y. Xu, C. Frank, D. de Cesare, B. Robenek, H. Richardson, M. Alnahlawi, C. Yew, P. Ponnapalli, M. Tagliasacchi, A. Korchemniy, Y. Kim, D. Li, B. Rosgen, K. Levin, J. Wiesner, P. Banzal, P. Srinivasan, H. Yu, Çağlar Ünlü, D. Reid, Z. Tung, D. Finchelstein, R. Kumar, A. Elisseeff, J. Huang, M. Zhang, R. Aguilar, M. Giménez, J. Xia, O. Dousse, W. Gierke, D. Yates, K. Jalan, L. Li, E. Latorre-Chimoto, D. D. Nguyen, K. Durden, P. Kallakuri, Y. Liu, M. Johnson, T. Tsai, A. Talbert, J. Liu, A. Neitz, C. Elkind, M. Selvi, M. Jasarevic, L. B. Soares, A. Cui, P. Wang, A. W. Wang, X. Ye, K. Kallarackal, L. Loher, H. Lam, J. Broder, D. Holtmann-Rice, N. Martin, B. Ramadhana, M. Shukla, S. Basu, A. Mohan, N. Fernando, N. Fiedel, K. Paterson, H. Li, A. Garg, J. Park, D. Choi, D. Wu, S. Singh, Z. Zhang, A. Globerson, L. Yu, J. Carpenter, F. de Chaumont Quitry, C. Radebaugh, C.-C. Lin, A. Tudor, P. Shroff, D. Garmon, D. Du, N. Vats, H. Lu, S. Iqbal, A. Yakubovich, N. Tripuraneni, J. Manyika, H. Qureshi, N. Hua, C. Ngani, M. A. Raad, H. Forbes, J. Stanway, M. Sundararajan, V. Ungureanu, C. Bishop, Y. Li, B. Venkatraman, B. Li, C. Thornton, S. Scellato, N. Gupta, Y. Wang, I. Tenney, X. Wu, A. Shenoy, G. Carvajal, D. G. Wright, B. Bariach, Z. Xiao, P. Hawkins, S. Dalmia, C. Farabet, P. Valenzuela, Q. Yuan, A. Agarwal, M. Chen, W. Kim, B. Hulse, N. Dukkipati, A. Paszke, A. Bolt, K. Choo, J. Beattie, J. Prendki, H. Vashisht, R. Santamaria-Fernandez, L. C. Cobo, J. Wilkiewicz, D. Madras, A. Elqursh, G. Uy, K. Ramirez, M. Harvey, T. Liechty, H. Zen, J. Seibert, C. H. Hu, A. Khorlin, M. Le, A. Aharoni, M. Li, L. Wang, S. Kumar, N. Casagrande, J. 
Hoover, D. E. Badawy, D. Soergel, D. Vnukov, M. Miecnikowski, J. Simsa, P. Kumar, T. Sellam, D. Vlasic, S. Daruki, N. Shabat, J. Zhang, G. Su, J. Zhang, J. Liu, Y. Sun, E. Palmer, A. Ghaffarkhah, X. Xiong, V. Cotruta, M. Fink, L. Dixon, A. Sreevatsa, A. Goedeckemeyer, A. Dimitriev, M. Jafari, R. Crocker, N. FitzGerald, A. Kumar, S. Ghemawat, I. Philips, F. Liu, Y. Liang, R. Sterneck, A. Repina, M. Wu, L. Knight, M. Georgiev, H. Lee, H. Askham, A. Chakladar, A. Louis, C. Crous, H. Cate, D. Petrova, M. Quinn, D. Owusu-Afriyie, A. Singhal, N. Wei, S. Kim, D. Vincent, M. Nasr, C. A. Choquette-Choo, R. Tojo, S. Lu, D. de Las Casas, Y. Cheng, T. Bolukbasi, K. Lee, S. Fatehi, R. Ananthanarayanan, M. Patel, C. Kaed, J. Li, S. R. Belle, Z. Chen, J. Konzelmann, S. Põder, R. Garg, V. Koverkathu, A. Brown, C. Dyer, R. Liu, A. Nova, J. Xu, A. Walton, A. Parrish, M. Epstein, S. McCarthy, S. Petrov, D. Hassabis, K. Kavukcuoglu, J. Dean, and O. Vinyals. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Gemma Team et al. (2024) Gemma Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, J. Ferret, P. Liu, P. Tafti, A. Friesen, M. Casbon, S. Ramos, R. Kumar, C. L. Lan, S. Jerome, A. Tsitsulin, N. Vieillard, P. Stanczyk, S. Girgin, N. Momchev, M. Hoffman, S. Thakoor, J.-B. Grill, B. Neyshabur, O. Bachem, A. Walton, A. Severyn, A. Parrish, A. Ahmad, A. Hutchison, A. Abdagic, A. Carl, A. Shen, A. Brock, A. Coenen, A. Laforge, A. Paterson, B. Bastian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar, C. Perry, C. Welty, C. A. Choquette-Choo, D. Sinopalnikov, D. Weinberger, D. Vijaykumar, D. Rogozińska, D. Herbison, E. Bandy, E. Wang, E. Noland, E. Moreira, E. Senter, E. Eltyshev, F. Visin, G. Rasskin, G. Wei, G. Cameron, G. Martins, H. Hashemi, H. Klimczak-Plucińska, H. Batra, H. Dhand, I. Nardini, J. Mein, J. Zhou, J. Svensson, J. Stanway, J. Chan, J. P. Zhou, J. Carrasqueira, J. Iljazi, J. Becker, J. Fernandez, J. van Amersfoort, J. Gordon, J. Lipschultz, J. Newlan, J. yeong Ji, K. Mohamed, K. Badola, K. Black, K. Millican, K. McDonell, K. Nguyen, K. Sodhia, K. Greene, L. L. Sjoesund, L. Usui, L. Sifre, L. Heuermann, L. Lago, L. McNealus, L. B. Soares, L. Kilpatrick, L. Dixon, L. Martins, M. Reid, M. Singh, M. Iverson, M. Görner, M. Velloso, M. Wirth, M. Davidow, M. Miller, M. Rahtz, M. Watson, M. Risdal, M. Kazemi, M. Moynihan, M. Zhang, M. Kahng, M. Park, M. Rahman, M. Khatwani, N. Dao, N. Bardoliwalla, N. Devanathan, N. Dumai, N. Chauhan, O. Wahltinez, P. Botarda, P. Barnes, P. Barham, P. Michel, P. Jin, P. Georgiev, P. Culliton, P. Kuppala, R. Comanescu, R. Merhej, R. Jana, R. A. Rokni, R. Agarwal, R. Mullins, S. Saadat, S. M. Carthy, S. Perrin, S. M. R. Arnold, S. Krause, S. Dai, S. Garg, S. Sheth, S. Ronstrom, S. Chan, T. Jordan, T. Yu, T. Eccles, T. Hennigan, T. Kocisky, T. Doshi, V. Jain, V. Yadav, V. Meshram, V. Dharmadhikari, W. Barkley, W. Wei, W. Ye, W. Han, W. Kwon, X. Xu, Z. Shen, Z. Gong, Z. Wei, V. Cotruta, P. Kirk, A. Rao, M. Giang, L. Peran, T. Warkentin, E. Collins, J. Barral, Z. Ghahramani, R. Hadsell, D. Sculley, J. Banks, A. Dragan, S. Petrov, O. Vinyals, J. Dean, D. Hassabis, K. Kavukcuoglu, C. Farabet, E. Buchatskaya, S. Borgeaud, N. Fiedel, A. Joulin, K. Kenealy, R. Dadashi, and A. Andreev. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- Hao et al. (2023) S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
- Hendrycks et al. (2021) D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.
- Huang et al. (2022) J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
- Kang et al. (2024) J. Kang, X. Z. Li, X. Chen, A. Kazemi, Q. Sun, B. Chen, D. Li, X. He, Q. He, F. Wen, J. Hao, and J. Yao. Mindstar: Enhancing math reasoning in pre-trained llms at inference time. arXiv preprint arXiv:2405.16265, 2024.
- Li et al. (2023) Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen. Making large language models better reasoners with step-aware verifier. arXiv preprint arXiv:2206.02336, 2023.
- Lightman et al. (2023) H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
- Liu et al. (2024) H. Liu, Y. Zhang, Y. Luo, and A. C.-C. Yao. Augmenting math word problems via iterative question composing. arXiv preprint arXiv:2401.09003, 2024.
- Luo et al. (2023) L. Luo, Z. Lin, Y. Liu, L. Shu, Y. Zhu, J. Shang, and L. Meng. Critique ability of large language models. arXiv preprint arXiv:2310.04815, 2023.
- Ma et al. (2023) Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and H. Yang. Let’s reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080, 2023.
- OpenAI (2023) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ouyang et al. (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- Perez et al. (2021) E. Perez, D. Kiela, and K. Cho. True few-shot learning with language models. arXiv preprint arXiv:2105.11447, 2021.
- Rosin (2011) C. D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
- Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2016. URL https://doi.org/10.1038/nature16961.
- Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
- Tian et al. (2024) Y. Tian, B. Peng, L. Song, L. Jin, D. Yu, H. Mi, and D. Yu. Toward self-improvement of llms via imagination, searching, and criticizing. arXiv preprint arXiv:2404.12253, 2024.
- Touvron et al. (2023) H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Uesato et al. (2022) J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
- Wang et al. (2024a) P. Wang, L. Li, Z. Shao, R. X. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2024a.
- Wang et al. (2023) X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), May 2023.
- Wang et al. (2024b) Z. Wang, Y. Li, Y. Wu, L. Luo, L. Hou, H. Yu, and J. Shang. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision. arXiv preprint arXiv:2402.02658, 2024b.
- Wei et al. (2022a) J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus. Emergent abilities of large language models, 2022a.
- Wei et al. (2022b) J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), Dec 2022b.
- Yao et al. (2023) S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Yu et al. (2024) F. Yu, A. Gao, and B. Wang. Ovm, outcome-supervised value models for planning in mathematical reasoning. arXiv preprint arXiv:2311.09724, 2024.
- Yu et al. (2023) L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Zhang et al. (2024) D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816, 2024.
- Świechowski et al. (2021) M. Świechowski, K. Godlewski, B. Sawicki, and J. Mańdziuk. Monte carlo tree search: A review of recent modifications and applications. arXiv preprint arXiv:2103.04931, 2021.
Appendix
Appendix A Question Filtering
During the evaluation of partial solution correctness via MC estimation, false negative noise may be introduced when a question is too hard for the model, so that no correct rollout can be found even from a correct partial solution. Conversely, false positive noise may be introduced when a question is too easy, so that the model reaches the correct answer even from a partial solution containing a wrong step. Such noise cannot be excluded completely, but we can reduce it by filtering out questions that are either too hard or too easy for the model. Specifically, we run $k=32$ rollouts for each question in the 12K training data and filter out the questions with no correct answer (too hard) or no wrong answer (too easy) among the 32 rollouts.
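A minimal sketch of this filter, assuming a `sample_answer` helper (our own name) that draws one rollout from the policy and returns its final answer:

```python
def keep_question(question, golden_answer, sample_answer, k=32):
    """Drop questions that are too hard or too easy for the policy.

    A question is kept only if the k rollouts contain at least one
    correct answer (not too hard) and at least one wrong answer
    (not too easy).
    """
    n_correct = sum(sample_answer(question) == golden_answer
                    for _ in range(k))
    return 0 < n_correct < k
```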
Appendix B Pairwise Loss Formula
When training with pairwise labels, the Bradley-Terry model (the objective typically used to train reward models in RLHF) expects two preference probabilities that sum to 1. When we select two actions as a pair, there are two cases: either one sample has a zero MC value and the other a positive MC value, or both samples have positive MC values. The first case is straightforward; the second requires a normalization step.
Assume the two MC values are $p$ and $q$, and model correctness as Bernoulli random variables $X$ and $Y$ with $P(X=1)=p$ and $P(Y=1)=q$. We need to calculate the probability that action $X$ is preferred over action $Y$, and vice versa.
$$
\begin{split}P(X>Y)&=P(X=1,Y=0)=p(1-q),\\
P(X<Y)&=P(X=0,Y=1)=(1-p)q,\\
P(X=Y)&=P(X=0,Y=0)+P(X=1,Y=1)=(1-p)(1-q)+pq.\end{split} \tag{5}
$$
For the tied situation, each action has half the chance of being preferred. Thus,
$$
\begin{split}P(\text{action X is preferred})&=P(X>Y)+\tfrac{1}{2}P(X=Y)=\tfrac{1}{2}(1+p-q),\\
P(\text{action Y is preferred})&=P(X<Y)+\tfrac{1}{2}P(X=Y)=\tfrac{1}{2}(1+q-p).\end{split} \tag{6}
$$
Now the MC values are normalized and we can train with the pairwise loss.
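The normalization in Eq. (6) translates directly to code; a minimal sketch:

```python
def pairwise_preference(p: float, q: float) -> tuple[float, float]:
    """Normalize two MC values into Bradley-Terry preference
    probabilities per Eqs. (5)-(6), splitting ties evenly."""
    pref_x = 0.5 * (1 + p - q)  # P(X>Y) + P(X=Y)/2
    pref_y = 0.5 * (1 + q - p)  # P(X<Y) + P(X=Y)/2
    return pref_x, pref_y

# Example: MC values p = 0.8, q = 0.3 give (0.75, 0.25), which sum to 1
# and can be fed directly to a pairwise (Bradley-Terry) loss.
```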