## The Lessons of Developing Process Reward Models in Mathematical Reasoning
Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu†, Dayiheng Liu†, Jingren Zhou, Junyang Lin†
Qwen Team, Alibaba Group
https://hf.co/Qwen/Qwen2.5-Math-PRM-7B
https://hf.co/Qwen/Qwen2.5-Math-PRM-72B
## Abstract
Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, which can generate correct answers from incorrect steps or incorrect answers from correct steps, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task.
Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
Figure 1: Overview of evaluation results on the Best-of-8 strategy of the policy model Qwen2.5-Math-7B-Instruct and the benchmark PROCESSBENCH (Zheng et al., 2024) across multiple PRMs (see Table 6 and Table 7 for details).
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Bar Chart: Model Performance Comparison on Best-of-8 Accuracy and ProcessBench F1
### Overview
The image is a dual-axis bar chart comparing the performance of 11 different language models on two distinct evaluation metrics: "Best-of-8 Mean Accuracy" (left y-axis, blue bars) and "ProcessBench Mean F1" (right y-axis, orange bars). The chart includes reference lines for baseline performance levels.
### Components/Axes
* **Chart Type:** Grouped bar chart with dual y-axes.
* **X-Axis (Categories):** Lists 11 model names. From left to right:
1. `Math-Shepherd-PRM-7B`
2. `RLHFlow-PRM-Mistral-8B`
3. `RLHFlow-PRM-Deepseek-8B`
4. `Skywork-PRM-1.5B`
5. `Skywork-PRM-7B`
6. `EurusPRM-Stage1`
7. `EurusPRM-Stage2`
8. `Qwen2.5-Math-7B-Math-Shepherd`
9. `Qwen2.5-Math-7B-PRM800K`
10. `★ Qwen2.5-Math-PRM-7B`
11. `★ Qwen2.5-Math-PRM-72B`
* **Left Y-Axis (Primary):** Label: `Best-of-8 Mean Acc (%)`. Scale ranges from 60 to 76, with major ticks every 2 units.
* **Right Y-Axis (Secondary):** Label: `ProcessBench Mean F1 (%)`. Scale ranges from 20 to 100, with major ticks every 10 units.
* **Legend:** Located in the top-left quadrant.
* Blue square: `Best-of-8`
* Orange square: `ProcessBench`
* **Reference Lines:**
* Solid blue horizontal line at ~74.7% on the left axis, labeled `pass@8 (74.7)`.
* Dash-dot blue horizontal line at ~66.2% on the left axis, labeled `maj@8 (66.2)`.
* Dashed orange horizontal line at ~87.9% on the right axis, labeled `o1-mini (87.9)`.
### Detailed Analysis
**Data Series & Values (Model: Best-of-8 (Blue) | ProcessBench (Orange)):**
1. `Math-Shepherd-PRM-7B`: 64.2 | 31.5
2. `RLHFlow-PRM-Mistral-8B`: 64.2 | 28.4
3. `RLHFlow-PRM-Deepseek-8B`: 64.9 | 26.6
4. `Skywork-PRM-1.5B`: 64.2 | 36.4
5. `Skywork-PRM-7B`: 64.8 | 42.1
6. `EurusPRM-Stage1`: 61.6 | 31.3
7. `EurusPRM-Stage2`: 62.0 | 31.2
8. `Qwen2.5-Math-7B-Math-Shepherd`: 64.3 | 28.9
9. `Qwen2.5-Math-7B-PRM800K`: 64.9 | 56.5
10. `★ Qwen2.5-Math-PRM-7B`: 67.6 | 73.5
11. `★ Qwen2.5-Math-PRM-72B`: 69.3 | 78.3
**Trend Verification:**
* **Best-of-8 (Blue Bars):** The trend is generally flat or slightly increasing from left to right, with a notable dip for the `EurusPRM` models. The two `★ Qwen2.5-Math-PRM` models on the far right show a clear upward step.
* **ProcessBench (Orange Bars):** The trend is more variable. It starts low, peaks at `Skywork-PRM-7B`, dips again, then rises sharply for the last three models, culminating in the highest value for `★ Qwen2.5-Math-PRM-72B`.
### Key Observations
1. **Performance Gap:** There is a significant performance gap between the two metrics for most models. Best-of-8 Accuracy scores cluster between 61.6% and 69.3%, while ProcessBench F1 scores vary widely from 26.6% to 78.3%.
2. **Top Performers:** The models marked with a star (`★ Qwen2.5-Math-PRM-7B` and `72B`) are the top performers on both metrics, with the 72B variant achieving the highest scores overall.
3. **Reference Line Context:** No model reaches the `pass@8` (74.7%) or `o1-mini` (87.9%) reference lines. All models exceed the `maj@8` (66.2%) baseline for Best-of-8 Accuracy.
4. **Model Family Trends:** Within the `Qwen2.5-Math` family, performance on ProcessBench improves dramatically with the specialized PRM models (`PRM800K`, `PRM-7B`, `PRM-72B`), while Best-of-8 Accuracy sees a more modest improvement.
5. **Outlier:** `RLHFlow-PRM-Deepseek-8B` has the lowest ProcessBench F1 score (26.6%) despite a competitive Best-of-8 Accuracy (64.9%).
### Interpretation
This chart evaluates Process Reward Models (PRMs) for mathematical reasoning. The two metrics measure different aspects of capability:
* **Best-of-8 Mean Accuracy** likely measures the final answer correctness when sampling 8 reasoning paths and selecting the best one (a common technique for improving LLM outputs).
* **ProcessBench Mean F1** likely evaluates the model's ability to correctly identify the step-by-step quality or correctness of a reasoning process, a more granular task.
The data suggests that while many models achieve similar final answer accuracy (Best-of-8), their ability to understand and evaluate the reasoning *process* (ProcessBench) varies greatly. The superior performance of the `β Qwen2.5-Math-PRM` models, especially the 72B version, indicates that scaling and specialized training on process supervision data lead to significant gains in process evaluation capability, which is a crucial component for building reliable reasoning systems. The gap between a model's process evaluation skill and its final answer generation skill (e.g., `RLHFlow-PRM-Deepseek-8B`) highlights that these are distinct abilities that do not necessarily improve in tandem. The reference lines (`pass@8`, `o1-mini`) set a high bar that current PRMs have yet to meet, indicating room for improvement in the field.
</details>
† Corresponding authors.
## 1 Introduction
In recent years, Large Language Models (LLMs) have made remarkable advances in mathematical reasoning (OpenAI, 2023; Dubey et al., 2024; Shao et al., 2024; Zhu et al., 2024; Yang et al., 2024a;c;b), yet they can still make mistakes, such as miscalculations or logical errors, that lead to wrong conclusions. Moreover, even when reaching correct final answers, these powerful models can still regularly fabricate plausible reasoning steps, with final answers built upon flawed calculations or derivations, which undermines the reliability and trustworthiness of LLMs' reasoning processes. To address these challenges, Process Reward Models (PRMs; Lightman et al. 2023; Wang et al. 2024b) have been proposed as a representative and increasingly prominent approach to identify and mitigate process errors, thereby enabling finer-grained supervision of the reasoning process.
One critical challenge of developing PRMs lies in the data annotation for the correctness of reasoning processes, which is typically expensive and time-consuming. While Lightman et al. (2023) recruited human annotators with detailed instructions and elaborate procedures to achieve satisfactory annotation quality, the prohibitive cost pushes researchers to explore automated annotation methods. Among them, one commonly used approach is to assess process correctness by estimating the empirical probability of leading to the correct final answers through Monte Carlo (MC) methods, which has attracted great research interest and has also been commonly employed in practice (Xiong et al., 2024; Wang et al., 2024b; Luo et al., 2024). Another challenge lies in evaluating PRM performance, as previous studies (Lightman et al., 2023; Wang et al., 2024b; Luo et al., 2024) have predominantly relied on the Best-of-N (BoN) evaluation, which selects the highest-scored response from N candidates according to a PRM. Recently, PROCESSBENCH (Zheng et al., 2024) has emerged to evaluate the capability of PRMs in identifying step-wise correctness.
Nevertheless, while training our own PRM following the conventional principles of constructing data via MC estimation and evaluating with BoN, we gained several crucial lessons. In terms of MC estimation: (1) We observe that the PRM trained via MC estimation demonstrates significantly inferior performance and generalization capabilities compared to LLM-as-a-judge (Zheng et al., 2023) and human annotation. (2) We attribute the suboptimal performance of MC estimation to its fundamental limitation: it attempts to evaluate deterministic current-step correctness based on potential future outcomes. It relies heavily on the performance of the completion model, which may generate correct answers from incorrect steps, or incorrect answers from correct steps, introducing substantial noise and inaccuracy into step-wise correctness estimation. Regarding the BoN evaluation: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The limited process verification capability makes PRMs tolerant of these cases, resulting in inflated BoN performance. (3) We find that in the step-score distributions of existing PRMs, a significant proportion of minimum scores are concentrated on the final answer steps, indicating that PRMs have shifted from process-based to outcome-based assessment in BoN.
To address these challenges, we develop a consensus filtering mechanism that combines MC estimation with LLM-as-a-judge. Instances are retained only when both LLM-as-a-judge and MC estimation show consensus on the locations of erroneous reasoning steps in the solution. Our approach demonstrates more efficient data utilization and surpasses existing open-source PRMs in the conventional BoN evaluation. Furthermore, we advocate for complementing response-level BoN with step-wise evaluation methods. We employ the step-wise benchmark PROCESSBENCH (Zheng et al., 2024) to measure the ability to identify process errors in mathematical reasoning. Our trained PRMs exhibit substantially stronger error identification performance than other open-source models, from PRMs to general language models, confirming that our training approach genuinely teaches PRMs to assess the correctness of intermediate reasoning steps.
Our key contributions can be summarized as follows:
- We identify critical limitations in current data construction approaches for PRMs, demonstrating that MC estimation-based data construction yields inferior performance compared to LLM-as-a-judge and human annotation.
- We reveal the potential bias in using response-level BoN evaluation alone for PRMs and advocate for comprehensive evaluation strategies combining both response-level and step-level metrics.
- We propose a simple yet efficient consensus filtering mechanism that integrates MC estimation with LLM-as-a-judge, significantly improving both model performance and data efficiency in PRM training.
- We substantiate our findings through extensive empirical studies and also open source our trained PRMs, which can establish practical guidelines and best practices for future research and development for reasoning process supervision.
## 2 Preliminary Trials
In this section, we describe our preliminary attempts to train PRMs via MC estimation-based reasoning step annotation. Despite our efforts in scaling up training data and careful tuning of training objectives, we found that the MC estimation-based PRMs do not possess noticeable advantages over the one trained on human-annotated data (Lightman et al., 2023), and even lag significantly behind the latter in identifying specific erroneous reasoning steps.
## 2.1 Training Setup
Training Data Synthesis We followed the commonly used MC estimation approach, Math-Shepherd (Wang et al., 2024b), to construct the PRM training data. Specifically, we collected a large-scale dataset of approximately 500,000 queries with golden answers. For each query, we generate 6-8 diverse responses by mixing outputs from the Qwen2-Math-Instruct and Qwen2.5-Math-Instruct series models (Yang et al., 2024c), spanning model sizes of 7B and 72B parameters. These responses are systematically split into individual steps using the delimiter '\n\n'. To assess the correctness of each step, we conduct 8 independent completions starting from that step using the Qwen2.5-Math-Instruct series with the corresponding model size, estimating the step labels from the empirical probability of each step yielding the correct final answer. We trained PRMs with either hard labels or soft labels. For hard labels, we treat a step as correct if any one of the 8 completions yields the correct final answer, and incorrect otherwise. For soft labels, we set the value (between 0 and 1) as the proportion of completions leading to the correct final answer. Note that we eliminated all steps subsequent to those labeled as incorrect (label 0), as their validity becomes irrelevant after an error occurs. This removal was implemented to prevent potential model confusion during training.
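The labeling scheme above can be sketched as follows (a minimal illustration under our own function names; the only inputs are the per-completion correctness outcomes, as described in the paper):

```python
def step_label(completion_correct: list[bool], soft: bool) -> float:
    """Derive a step label from k independent completions (k = 8 in the paper).

    Hard label: 1 if any completion reaches the correct final answer, else 0.
    Soft label: fraction of completions reaching the correct final answer.
    """
    if soft:
        return sum(completion_correct) / len(completion_correct)
    return 1.0 if any(completion_correct) else 0.0


def truncate_after_first_error(labels: list[float]) -> list[float]:
    """Drop all steps after the first hard-0 step, since their validity
    becomes irrelevant once an error has occurred."""
    out = []
    for lab in labels:
        out.append(lab)
        if lab == 0:
            break
    return out
```

With 8 completions, a step whose first two completions succeed receives a soft label of 0.25 but a hard label of 1, which already hints at how differently the two schemes treat noisy completions.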
Training Details Our trained PRMs were initialized from the supervised fine-tuned Qwen2.5-Math-7B/72B-Instruct models (Yang et al., 2024c), where we replace the original language modeling head (used for next token prediction) with a scalar-value head consisting of two linear layers. We calculated the cross-entropy (CE) loss and mean squared error (MSE) loss on the last tokens of each step for the binary classification task using hard labels and for the regression task using soft labels, respectively.
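As a minimal, framework-free sketch of the two objectives (our function names; in practice these are computed over the step-end token positions of a batch):

```python
import math


def ce_loss(p: float, hard_label: int) -> float:
    """Binary cross-entropy at one step-end position (hard-label setup).
    p is the PRM's predicted probability that the step is correct."""
    eps = 1e-12  # numerical guard against log(0)
    return -(hard_label * math.log(p + eps)
             + (1 - hard_label) * math.log(1 - p + eps))


def mse_loss(p: float, soft_label: float) -> float:
    """Mean squared error at one step-end position (soft-label setup).
    soft_label is the MC-estimated fraction of successful completions."""
    return (p - soft_label) ** 2
```

The hard-label setup thus trains a binary classifier, while the soft-label setup regresses toward the MC success rate, i.e., a value-style target.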
## 2.2 Evaluation Setup
We evaluate our trained PRMs from two aspects: their utilities in straightforwardly improving downstream task performance and their abilities to identify specific erroneous steps in reasoning processes.
Best-of-N Consistent with previous work (Lightman et al., 2023; Wang et al., 2024b; Luo et al., 2024; Cobbe et al., 2021; Yang et al., 2024c), we employed the Best-of-N (BoN) sampling strategy for evaluation, which selects the highest-scored response from N candidates according to a PRM. We denote the evaluation metric as 'prm@N'. Following Yang et al. (2024c), we sampled eight responses (i.e., N = 8) from Qwen2.5-Math-7B-Instruct across multiple mathematical benchmarks, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), Minerva Math (Lewkowycz et al., 2022), GaoKao 2023 En (Liao et al., 2024), OlympiadBench (He et al., 2024), College Math (Tang et al., 2024), and MMLU STEM (Hendrycks et al., 2021a). Each candidate response is scored using the product of all the individual scores of each step within the response, as computed in Lightman et al. (2023). We also report the result of majority voting among eight samplings (maj@8) as the baseline, and pass@8 (i.e., the proportion of test samples where any of the eight samplings lead to the correct final answers) as the upper bound.
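The response-level scoring and selection described above can be sketched as (our function name; each candidate is represented only by its per-step PRM scores):

```python
from math import prod


def bon_select(candidates: list[list[float]]) -> int:
    """Best-of-N selection: score each candidate response as the product of
    its per-step PRM scores (as in Lightman et al., 2023) and return the
    index of the highest-scored response."""
    return max(range(len(candidates)), key=lambda i: prod(candidates[i]))
```

Note that the product aggregation penalizes any low-scored step, but a response with few steps can outscore a longer one even if its steps are individually weaker.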
PROCESSBENCH We also evaluated on PROCESSBENCH as a complement. PROCESSBENCH (Zheng et al., 2024) measures the capability of models to identify erroneous steps in mathematical reasoning. Models are required to identify the first step that contains an error or conclude that all steps are correct. Following the evaluation methods for PRMs in PROCESSBENCH, we locate the first erroneous step from the predicted scores yielded by PRMs.
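Locating the first erroneous step from PRM scores can be sketched as follows (the 0.5 threshold is our assumption for illustration; the concrete thresholding is up to the PRM evaluation setup):

```python
def first_error_step(step_scores: list[float], threshold: float = 0.5) -> int:
    """Return the 0-indexed position of the first step whose PRM score
    falls below the threshold, or -1 if all steps look correct
    (mirroring the PROCESSBENCH protocol of reporting the first error
    or concluding no step is erroneous)."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1
```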
## 2.3 Evaluation Results
As shown in Table 1 and Table 2, we denote the models trained on our MC estimated dataset as Qwen2.5-Math-7B-PRM-MC-hard (trained with hard labels) and Qwen2.5-Math-7B-PRM-MC-soft (trained with soft labels), respectively. As a baseline, we also trained a model exclusively on the human-annotated PRM800K (Lightman et al., 2023) dataset with its hard labels, named Qwen2.5-Math-7B-PRM800K. The experimental results reveal two critical limitations: (1) In the Best-of-8 evaluation, none of the PRMs achieved prm@8 scores superior to maj@8. (2) When evaluating on PROCESSBENCH for identifying erroneous
| Setting | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|-----------------------------|---------|--------|----------------|------------------|------------------|----------------|-------------|--------|
| pass@8 (Upper Bound) | 98.1 | 92.0 | 49.3 | 80.5 | 59.6 | 52.6 | 90.5 | 74.7 |
| maj@8 | 96.7 | 87.1 | 41.2 | 72.5 | 44.4 | 47.8 | 73.8 | 66.2 |
| Qwen2.5-Math-7B-PRM800K | 96.9 | 86.9 | 37.1 | 71.2 | 44.0 | 47.6 | 70.9 | 64.9 |
| Qwen2.5-Math-7B-PRM-MC-hard | 96.8 | 87.3 | 40.1 | 70.6 | 43.7 | 48.1 | 71.6 | 65.5 |
| Qwen2.5-Math-7B-PRM-MC-soft | 96.8 | 86.3 | 37.9 | 70.6 | 41 | 47.7 | 70.4 | 64.4 |
Table 1: Performance comparison on Best-of-8 using PRMs trained with MC estimated hard labels, MC estimated soft labels, and human-annotated PRM800K, denoted as Qwen2.5-Math-7B-PRM-MC-hard, Qwen2.5-Math-7B-PRM-MC-soft, and Qwen2.5-Math-7B-PRM800K, respectively.
| Model | GSM8K error | GSM8K correct | GSM8K F1 | MATH error | MATH correct | MATH F1 | OlympiadBench error | OlympiadBench correct | OlympiadBench F1 | Omni-MATH error | Omni-MATH correct | Omni-MATH F1 | Avg. F1 |
|-------|-------------|---------------|----------|------------|--------------|---------|---------------------|-----------------------|------------------|-----------------|-------------------|--------------|---------|
| Qwen2.5-Math-7B-PRM800K | 53.1 | 95.3 | 68.2 | 48.0 | 90.1 | 62.6 | 35.7 | 87.3 | 50.7 | 29.8 | 86.1 | 44.3 | 56.5 |
| Qwen2.5-Math-7B-PRM-MC-hard | 67.1 | 90.2 | 77.0 | 35.2 | 65.8 | 45.8 | 13.2 | 28.0 | 17.9 | 13.3 | 41.9 | 20.2 | 40.2 |
| Qwen2.5-Math-7B-PRM-MC-soft | 65.7 | 93.3 | 77.1 | 35.7 | 64.5 | 46.0 | 13.2 | 29.2 | 18.1 | 12.9 | 40.2 | 19.6 | 40.2 |
Table 2: Performance comparison on PROCESSBENCH using PRMs trained with MC estimated hard labels, MC estimated soft labels, and human-annotated PRM800K, denoted as Qwen2.5-Math-7B-PRM-MC-hard, Qwen2.5-Math-7B-PRM-MC-soft, and Qwen2.5-Math-7B-PRM800K, respectively.
reasoning steps, both Qwen2.5-Math-7B-PRM-MC-hard and Qwen2.5-Math-7B-PRM-MC-soft exhibit significantly inferior erroneous step localization capabilities compared to Qwen2.5-Math-7B-PRM800K, even though the former were trained on a larger scale of data.
These undesirable evaluation results pushed us to reflect on the currently prevalent data synthesis approach and evaluation strategy. Through the subsequent optimization process, we indeed gained several observations and lessons, which we present next.
## 3 The Lessons
In this section, we present the critical lessons gained during PRM training. Our discussion comprises two main aspects: (1) the limitations of commonly adopted MC estimation approaches in PRM training, and (2) the bias in using BoN as the sole evaluation metric for optimizing PRMs.
## 3.1 Limitations of MC Estimation for PRMs Training
## 3.1.1 Distinguishing PRMs from Value Models
Reward models in mathematical reasoning serve as correctness verifiers, and PRMs provide fine-grained supervision by evaluating the correctness of intermediate reasoning steps. In contrast, value models estimate the potential of reaching the correct final answer from the current step. The key difference is that PRMs function as deterministic evaluators of current-step correctness, while value models operate as predictive estimators of future solution potential.
MC estimation attempts to estimate the potential of reaching the correct final answer in the future from the current step. When we follow this approach to construct data and train PRMs, value-model principles are essentially incorporated into PRM training. This methodology potentially introduces performance and generalization limitations, which we discuss in subsequent sections.
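The contrast can be written compactly (our notation, not from the paper): a PRM should predict a deterministic step-correctness label, while MC estimation approximates a value function via completion rollouts.

```latex
% PRM target: deterministic correctness of step t
r_t = \mathbb{1}\left[\text{step } t \text{ is correct}\right]

% Value target: probability of eventually reaching the correct answer
v_t = \Pr\left(\text{final answer is correct} \mid s_1, \dots, s_t\right)

% MC estimate with k completions (k = 8 in this paper)
\hat{v}_t = \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}\left[\text{completion } j \text{ reaches the correct answer}\right]
```

Training on the MC estimate therefore fits the value target rather than the correctness target, which is exactly the mismatch described above.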
## 3.1.2 MC Estimation vs. LLM-as-a-judge vs. Human Annotation
We found that MC estimation methods limit the PRM's capability to identify erroneous steps, as demonstrated in the experiments of Section 2.3. For further investigation, we compare the performance of 3 distinct data construction approaches: MC estimation, LLM-as-a-judge, and human annotation. For the MC estimation approach, we respectively train the PRM on the 445k-sample open-source Math-Shepherd dataset (Wang et al., 2024b) and on our 860k similarly constructed dataset. For our constructed dataset, the MC estimation employs responses from Qwen2-Math-Instruct and completes subsequent reasoning processes with Qwen2.5-Math-Instruct. For the LLM-as-a-judge approach, we use the same 860k queries and responses and employ Qwen2.5-72B-Instruct to verify the correctness of each step in the responses with the prompt template shown in Appendix C. For the human annotation approach, we use the open-source dataset PRM800K (Lightman et al., 2023), which consists of approximately 265k samples after deduplication against the test set.
Table 3: PRMs performance comparison on the Best-of-8 strategy of the policy model Qwen2.5-Math-7BInstruct. The models are trained on the different data construction methods including MC estimation, LLM-as-a-judge, and human annotation.
| Setting | # samples | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|------------------------------|-------------|---------|--------|----------------|------------------|------------------|----------------|-------------|--------|
| MC Estimation (Math-Shepherd) | 440k | 96.9 | 86.5 | 36.8 | 71.4 | 41.6 | 47.7 | 69.3 | 64.3 |
| MC Estimation (our data) | 860k | 97.0 | 87.6 | 41.9 | 71.4 | 43.6 | 48.2 | 71.9 | 65.9 |
| LLM-as-a-judge (our data) | 860k | 96.9 | 86.8 | 39.0 | 71.2 | 43.7 | 47.7 | 71.9 | 65.3 |
| Human Annotation (PRM800K) | 264k | 96.9 | 86.9 | 37.1 | 71.2 | 44.0 | 47.6 | 70.9 | 64.9 |
Table 4: PRMs performance comparison on PROCESSBENCH. The models are trained on the different data construction methods including MC estimation, LLM-as-a-judge, and human annotation.
| Method | # samples | GSM8K error | GSM8K correct | GSM8K F1 | MATH error | MATH correct | MATH F1 | OlympiadBench error | OlympiadBench correct | OlympiadBench F1 | Omni-MATH error | Omni-MATH correct | Omni-MATH F1 | Avg. F1 |
|--------|-----------|-------------|---------------|----------|------------|--------------|---------|---------------------|-----------------------|------------------|-----------------|-------------------|--------------|---------|
| MC Estimation (Math-Shepherd) | 440k | 46.4 | 95.9 | 62.5 | 18.9 | 96.6 | 31.6 | 7.4 | 93.8 | 13.7 | 4.0 | 95.0 | 7.7 | 28.9 |
| MC Estimation (our data) | 860k | 62.3 | 91.2 | 74.0 | 35.2 | 71.9 | 47.3 | 12.7 | 41.3 | 19.4 | 12.1 | 54.4 | 19.8 | 40.1 |
| LLM-as-a-judge (our data) | 860k | 44.0 | 99.0 | 60.9 | 33.5 | 94.8 | 49.5 | 24.7 | 97.1 | 39.4 | 22.3 | 95.4 | 36.1 | 46.5 |
| Human Annotation (PRM800K) | 264k | 53.1 | 95.3 | 68.2 | 48.0 | 90.1 | 62.6 | 35.7 | 87.3 | 50.7 | 29.8 | 86.3 | 44.3 | 56.5 |
The experimental results of Best-of-8 and PROCESSBENCH are shown in Table 3 and Table 4, respectively. For Best-of-8, Table 3 shows that the PRM trained on our MC estimated data achieves the best average accuracy, while human annotation performs worst. For PROCESSBENCH, Table 4 demonstrates that human annotation achieves the best performance with the least amount of data, followed by LLM-as-a-judge, while MC estimation performs the worst despite having the largest dataset overall. Specifically, (1) human annotation, despite covering only the MATH dataset, exhibits superior generalization capabilities on the more complex tasks OlympiadBench and Omni-MATH. (2) Given identical data with different annotation approaches, LLM-as-a-judge demonstrates better generalization on challenging problems than MC estimation, although the latter shows favorable results on GSM8K. (3) For MC estimation, a comparison between our 860k dataset and the 440k Math-Shepherd data indicates that performance improvements can still be achieved through data scaling. The two models trained on MC estimated and human-annotated data exhibit inverse performance relationships between Best-of-8 and PROCESSBENCH, which caught our attention and is thoroughly investigated in Section 3.2.
## 3.1.3 Stringent Data Filtering Mechanisms Required in MC Estimation
We attribute the inferior performance of MC estimation compared to LLM-as-a-judge and human annotation to its high noise in reasoning step correctness estimation and inaccurate error position identification due to its heavy dependence on the policy model. For instance, the policy model may generate correct final answers but incorrect reasoning steps, which will be investigated thoroughly in Section 3.2.1.
Motivated by LLM-as-a-judge's encouraging results in Section 3.1.2, we naturally propose a simple yet efficient consensus filtering mechanism that integrates LLM-as-a-judge with MC estimation. Based on the aforementioned 860k samples, instances are retained only when both LLM-as-a-judge and MC estimation show consensus on the locations of erroneous reasoning steps in the solution. As demonstrated in Figure 2, only approximately 40% of the data is preserved after consensus filtering. On PROCESSBENCH, the results reveal that the reduced dataset after consensus filtering significantly outperforms MC estimation and, notably, achieves comparable performance to LLM-as-a-judge while using only 40% of the data. Regarding the BoN evaluation, the performance variations among these three models are marginal. The limitations of BoN evaluation for PRMs are elaborated on in Section 3.2.
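The filtering step can be sketched as follows (a minimal illustration; the dictionary keys are our hypothetical representation of a sample, with -1 denoting "no error found" by a given annotator):

```python
def consensus_filter(samples: list[dict]) -> list[dict]:
    """Keep only samples where MC estimation and LLM-as-a-judge agree on the
    first erroneous step (or both agree there is no error, encoded as -1).
    The agreed position is attached as the training label 'first_error'."""
    kept = []
    for s in samples:
        if s["mc_first_error"] == s["judge_first_error"]:
            kept.append({**s, "first_error": s["mc_first_error"]})
    return kept
```

Samples where the two annotators disagree are simply discarded, which is what shrinks the 860k pool to roughly 40% of its size.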
## 3.1.4 Hard Label vs. Soft Label in MC Estimation
Although we have previously demonstrated that MC estimation is not as effective as LLM-as-a-judge and human annotation, there remains a noteworthy aspect of MC estimation to discuss, i.e., whether to train with soft labels or hard labels. We construct 3 million training samples using MC estimation, where for each reasoning step we perform 8 completions. Subsequently, we apply the consensus filtering strategy discussed in Section 3.1.3 to the 3 million samples, which reduces the dataset to 1.5 million samples. We respectively train PRMs using both soft labels and hard labels on the 3 million and 1.5 million samples.
The performance of the trained PRMs on Best-of-8 and PROCESSBENCH is illustrated in Figure 3 and Figure 4, respectively. Before data filtering, the performance difference between soft and hard labels is not significant, which we attribute to the high noise level masking their distinctions. However, the difference becomes much more pronounced after data filtering, with hard labels substantially outperforming soft labels.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Dual-Axis Bar Chart: Performance Comparison of Three Methods
### Overview
This image is a dual-axis bar chart comparing the performance of three different methods across two distinct metrics. The chart uses a grouped bar format, with each method represented by two adjacent bars of different colors, each corresponding to a different performance metric measured on separate y-axes.
### Components/Axes
* **Chart Type:** Grouped Bar Chart with Dual Y-Axes.
* **X-Axis (Categories):** Three methods are listed from left to right:
1. `MC estimation (860k)`
2. `LM-as-a-judge (860k)`
3. `Consensus Filtering (350k)`
* **Primary Y-Axis (Left):**
* **Label:** `Best-of-8 Mean Acc (%)`
* **Scale:** Linear scale ranging from 63 to 68, with major tick marks at 63, 64, 65, 66, 67, and 68.
* **Secondary Y-Axis (Right):**
* **Label:** `ProcessBench Mean F1 (%)`
* **Scale:** Linear scale ranging from 36 to 52, with major tick marks at 36, 38, 40, 42, 44, 46, 48, 50, and 52.
* **Legend:** Located in the top-left corner of the chart area.
* A blue square is labeled `Best-of-8`.
* An orange square is labeled `ProcessBench`.
* **Data Labels:** Numerical values are printed directly above each bar.
### Detailed Analysis
The chart presents data for three methods, with each method having a blue bar (Best-of-8 Accuracy) and an orange bar (ProcessBench F1 Score).
**1. MC estimation (860k)**
* **Best-of-8 (Blue Bar):** The bar is the tallest among the blue bars. Its value is labeled as **65.9**. This indicates a high mean accuracy for the Best-of-8 metric.
* **ProcessBench (Orange Bar):** The bar is the shortest among the orange bars. Its value is labeled as **40.1**. This indicates a relatively low F1 score for the ProcessBench metric.
* **Trend:** This method shows a strong performance on the Best-of-8 metric but the weakest performance on the ProcessBench metric.
**2. LM-as-a-judge (860k)**
* **Best-of-8 (Blue Bar):** The bar is the shortest among the blue bars. Its value is labeled as **65.3**.
* **ProcessBench (Orange Bar):** The bar is the tallest among the orange bars. Its value is labeled as **46.5**.
* **Trend:** This method shows the opposite pattern to MC estimation: the lowest Best-of-8 accuracy but the highest ProcessBench F1 score.
**3. Consensus Filtering (350k)**
* **Best-of-8 (Blue Bar):** The bar's height is intermediate between the other two blue bars. Its value is labeled as **65.7**.
* **ProcessBench (Orange Bar):** The bar's height is slightly lower than the tallest orange bar. Its value is labeled as **46.3**.
* **Trend:** This method demonstrates balanced performance, with the second-highest score on both metrics. It uses a smaller dataset (350k) compared to the other two methods (860k).
### Key Observations
1. **Inverse Relationship:** There is a clear inverse relationship between the two metrics across the first two methods. The method with the highest Best-of-8 accuracy (MC estimation) has the lowest ProcessBench F1 score, and vice-versa (LM-as-a-judge).
2. **Consensus Filtering's Balance:** Consensus Filtering achieves a middle ground, sacrificing a small amount of Best-of-8 accuracy (0.2% less than MC estimation) for a significant gain in ProcessBench F1 score (+6.2 points over MC estimation).
3. **Dataset Size Note:** The labels indicate the dataset size used for each method. The first two methods used 860k samples, while Consensus Filtering used only 350k samples, suggesting it may be more data-efficient for the ProcessBench metric.
4. **Metric Sensitivity:** The performance ranking of the methods is completely different depending on which metric is used for evaluation.
### Interpretation
This chart illustrates a classic trade-off in model evaluation or system design. The "Best-of-8 Mean Accuracy" and "ProcessBench Mean F1 Score" appear to be measuring different, potentially conflicting, aspects of performance.
* **MC estimation** excels on Best-of-8 accuracy (final-answer accuracy when the PRM selects the best of 8 sampled responses) but performs poorly on the ProcessBench metric (which evaluates step-level error identification via F1 score).
* **LM-as-a-judge** shows the reverse pattern, suggesting its data produces PRMs that are better at verifying intermediate reasoning (high F1) but slightly less effective at the response-selection task measured by Best-of-8.
* **Consensus Filtering** represents a potential compromise. Its name suggests it aggregates multiple outputs to filter for consensus, which may help stabilize performance across different evaluation criteria. Notably, it achieves near-peak performance on both metrics while using less than half the data of the other methods, indicating potential efficiency and robustness.
The data suggests that the choice of method should be driven by the primary objective: if raw accuracy on a selection task is paramount, MC estimation is best. If the quality of an intermediate process is critical, LM-as-a-judge is superior. If a balance of both is needed, or if data efficiency is a concern, Consensus Filtering appears to be the most robust and efficient choice. The stark difference in rankings underscores the importance of using multiple, complementary metrics for comprehensive evaluation.
</details>
Figure 2: Performance comparison on Best-of-8 and PROCESSBENCH using PRMs trained with different data synthesis methods.
Figure 3: Performance comparison on Best-of-8 for the PRMs trained on soft and hard labels before and after consensus filtering.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Bar Chart: Best-of-8 Mean Accuracy Comparison (Soft vs. Hard Labels)
### Overview
The image is a grouped bar chart comparing the "Best-of-8 Mean Accuracy" percentage for two types of labels ("soft labels" and "hard labels") under two conditions: "Before Filtering" and "After Filtering." The chart demonstrates the impact of a data filtering process on model performance.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:**
* **Label:** "Best-of-8 Mean Acc (%)"
* **Scale:** Linear scale from 62 to 68, with major tick marks at 62, 64, 66, and 68.
* **X-Axis:**
* **Categories:** Two primary categories.
1. **Left Group:** "Before Filtering (3M)"
2. **Right Group:** "After Filtering (1.5M)"
* The parenthetical values "(3M)" and "(1.5M)" likely denote the dataset size in millions of samples before and after filtering, respectively.
* **Legend:**
* **Position:** Top-left corner of the chart area.
* **Items:**
* A blue square labeled "soft labels"
* An orange square labeled "hard labels"
* **Data Series & Values:**
* **Soft Labels (Blue Bars):**
* Before Filtering: 65.4%
* After Filtering: 65.4%
* **Hard Labels (Orange Bars):**
* Before Filtering: 65.4%
* After Filtering: 67.2%
### Detailed Analysis
The chart presents a direct comparison across two dimensions: label type and data filtering state.
1. **Before Filtering (3M dataset):**
* Both "soft labels" and "hard labels" achieve an identical Best-of-8 Mean Accuracy of **65.4%**. The blue and orange bars are of equal height.
2. **After Filtering (1.5M dataset):**
* The performance for "soft labels" (blue bar) remains unchanged at **65.4%**.
* The performance for "hard labels" (orange bar) shows a clear increase, rising to **67.2%**. This bar is visibly taller than its counterpart in the "Before Filtering" group and taller than the adjacent "soft labels" bar.
3. **Effect of Filtering:**
* The filtering process reduced the dataset size by 50% (from 3M to 1.5M samples).
* This reduction had no measurable effect on the accuracy metric for "soft labels."
* Conversely, it resulted in a **1.8 percentage point improvement** (from 65.4% to 67.2%) for "hard labels."
### Key Observations
* **Performance Parity then Divergence:** Initially, both label types perform identically. After filtering, their performance diverges significantly.
* **Filtering Benefit is Label-Dependent:** The primary observation is that the data filtering process selectively benefits the "hard labels," improving their accuracy, while the "soft labels" show no gain.
* **Efficiency Gain:** The "hard labels" achieve higher accuracy with half the data (1.5M vs. 3M), suggesting the filtering successfully removed noisy or uninformative samples that were particularly detrimental to the model's performance when using hard labels.
### Interpretation
This chart suggests that the nature of the label ("soft" vs. "hard") interacts critically with data curation processes. "Hard labels" (typically discrete, one-hot encoded targets) appear to benefit more from the removal of low-quality or ambiguous data points. The filtering likely created a cleaner, more consistent training set that better aligns with the crisp decision boundaries implied by hard labels.
In contrast, "soft labels" (which often represent probability distributions or smoothed targets) may be inherently more robust to noise or ambiguity in the data, explaining why their performance did not change. Alternatively, the specific filtering criteria used might have been less effective at identifying samples that are noisy for a soft-label-based training objective.
The key takeaway is that data filtering is not universally beneficial; its impact is mediated by the training paradigm (here, the label type). For tasks employing hard labels, aggressive data filtering can be a highly effective strategy to boost performance and training efficiency. For tasks using soft labels, the same filtering may yield diminishing returns, and different curation strategies might be required.
</details>
Figure 4: Performance comparison on PROCESSBENCH for PRMs trained on soft and hard labels before and after consensus filtering.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Grouped Bar Chart: ProcessBench Mean F1 (%) Before and After Filtering
### Overview
The image is a grouped bar chart comparing the performance (measured by Mean F1 score in percentage) of two labeling methods, "soft labels" and "hard labels", on a benchmark called "ProcessBench." The comparison is made at two stages: before a filtering process and after a filtering process. The chart clearly demonstrates the impact of filtering on model performance for both label types.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:**
* **Label:** "ProcessBench Mean F1 (%)"
* **Scale:** Linear scale ranging from 30 to 70, with major tick marks every 5 units (30, 35, 40, 45, 50, 55, 60, 65, 70).
* **X-Axis:**
* **Categories:** Two primary categories representing stages of data processing.
1. **Left Group:** "Before Filtering (3M)" - the "3M" likely denotes the dataset size (e.g., 3 million samples).
2. **Right Group:** "After Filtering (1.5M)" - the "1.5M" denotes the reduced dataset size after filtering.
* **Legend:**
* **Position:** Top-left corner of the chart area.
* **Items:**
* A blue rectangle labeled "soft labels".
* An orange rectangle labeled "hard labels".
* **Data Series & Values:** Each x-axis category contains two bars, one for each label type. The exact Mean F1 (%) value is annotated on top of each bar.
* **Before Filtering (3M):**
* **Soft labels (Blue bar):** 40.2
* **Hard labels (Orange bar):** 40.2
* **After Filtering (1.5M):**
* **Soft labels (Blue bar):** 49.3
* **Hard labels (Orange bar):** 66.5
### Detailed Analysis
1. **Baseline Performance (Before Filtering):** Both "soft labels" and "hard labels" achieve an identical Mean F1 score of 40.2% on the original, unfiltered dataset of 3 million samples. This establishes a common performance baseline.
2. **Impact of Filtering:** After applying a filtering process that reduces the dataset to 1.5 million samples, the performance of both methods improves.
* The **soft labels** score increases from 40.2% to 49.3%, a gain of approximately 9.1 percentage points.
* The **hard labels** score increases dramatically from 40.2% to 66.5%, a gain of approximately 26.3 percentage points.
3. **Comparative Trend:** While both lines (represented by the tops of the bars) slope upward from left to right, the slope for the "hard labels" series is significantly steeper. The performance gap between the two methods, which was zero before filtering, widens to 17.2 percentage points (66.5 - 49.3) after filtering.
### Key Observations
* **Identical Starting Point:** The most striking initial observation is that both label types performed identically on the larger, noisier dataset.
* **Differential Improvement:** Filtering benefits both methods, but the magnitude of improvement is not uniform. "Hard labels" benefit disproportionately more from the data filtering process.
* **Final Performance Disparity:** After filtering, "hard labels" achieve a substantially higher Mean F1 score (66.5%) compared to "soft labels" (49.3%), indicating a clear performance advantage in the refined dataset context.
### Interpretation
The data suggests that the filtering process is highly effective at improving model performance on the ProcessBench task, as measured by the F1 score. The key insight lies in the differential impact on the two labeling strategies.
The fact that both methods started equally but diverged sharply after filtering implies that the original 3M dataset contained a significant amount of noise or low-quality samples. This noise appears to have been equally detrimental to both "soft" and "hard" label learning initially.
However, the filtering process (which removed half the data) seems to have selectively removed samples that were particularly harmful to the models trained with "hard labels." This could mean:
1. **Hard labels are more sensitive to noise:** Models using hard, definitive labels may be more easily confused by ambiguous or incorrect examples in the training data. Filtering removes these confusing examples, allowing the hard-label model to learn more precise decision boundaries.
2. **Soft labels are more robust:** Models using soft, probabilistic labels might inherently handle noisy data better, as they don't commit fully to any single label. Therefore, while they still benefit from cleaner data, the relative gain is smaller because they were less impaired by the noise to begin with.
In essence, the chart demonstrates that **data quality (achieved via filtering) is a critical factor for performance, and its importance is amplified when using "hard" supervision signals.** For tasks where obtaining clean data is feasible, using hard labels with a rigorous filtering pipeline may yield superior results. If data is inherently noisy and cannot be cleaned, soft labels might offer more stable, though potentially lower-peak, performance.
</details>
on both Best-of-8 and PROCESSBENCH. We attribute the limitations of soft labels to two factors: (1) as discussed in Section 3.1.1, the correctness of steps (i.e., rewards) should be deterministic, so training PRMs with soft labels that encode future possibilities introduces additional noise. For instance, when many completely correct steps are assigned soft labels lower than 1, the model's ability to discriminate between positive and negative labels is reduced; (2) estimating step correctness from only 8 completions is relatively crude and exhibits high variance. Although estimation accuracy improves with more completions, the associated cost may outweigh the incremental benefit. Moreover, the experimental results indicate that the consensus filtering strategy yields performance benefits under both soft and hard label schemes.
Last but not least, we investigate the threshold for distinguishing between positive and negative labels based on the MC estimation result of 8 completions. Following our previous experimental setup, we conduct a series of experiments on the 3 million samples with threshold values from 0 to 7/8 at 1/8 intervals, with results shown in Figure 5. As the threshold increases, performance deteriorates on both Best-of-8 and PROCESSBENCH, indicating that treating an MC estimate of 0 as the negative label and all other values as positive yields the best results. Therefore, if MC estimation must be used for step-wise correctness verification, we suggest setting the threshold to 0, meaning that a step is considered correct if any completion starting from this step reaches the correct final answer. This threshold is also employed throughout all our experimental studies.
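The thresholding rule above can be sketched as follows (a minimal illustration; `hard_label` and its arguments are our own names, and 8 completions per step are assumed as in the experiments):

```python
def hard_label(n_correct, n_completions=8, threshold=0.0):
    """Binarize an MC estimate into a hard step label.

    Positive iff the fraction of completions reaching the correct
    final answer strictly exceeds the threshold; threshold 0 means
    an MC estimate of 0 is negative and anything else is positive.
    """
    return int(n_correct / n_completions > threshold)

print(hard_label(1))                 # 1: a single successful completion suffices at threshold 0
print(hard_label(0))                 # 0: an MC estimate of 0 is the negative label
print(hard_label(3, threshold=4/8))  # 0: 3/8 does not exceed the stricter 4/8 cutoff
```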
## 3.1.5 Summary
Through extensive experimentation, we have demonstrated that MC estimation yields inferior performance and generalization compared to both LLM-as-a-judge and human annotation. However, combining MC estimation with LLM-as-a-judge via a consensus filtering strategy leads to enhanced performance and improved data efficiency. Furthermore, optimal results are achieved when treating MC estimation values of 0 as negative labels and training with hard labels.
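The consensus filtering strategy summarized above can be sketched as follows (a simplified illustration, assuming both MC estimation and the LLM judge report the index of the first erroneous step, or `None` when all steps are judged correct; field names are hypothetical):

```python
def consensus_filter(samples):
    """Keep only samples where MC estimation and the LLM judge agree
    on the location of the first erroneous step (None = no error)."""
    return [s for s in samples if s["mc_first_error"] == s["judge_first_error"]]

data = [
    {"id": 0, "mc_first_error": None, "judge_first_error": None},  # both say fully correct: keep
    {"id": 1, "mc_first_error": 2, "judge_first_error": 2},        # agree on step 2: keep
    {"id": 2, "mc_first_error": 3, "judge_first_error": None},     # disagree: discard
]
kept_ids = [s["id"] for s in consensus_filter(data)]  # [0, 1]
```

In the experiments above, this style of agreement check roughly halves the data (e.g., 860k to 350k, or 3M to 1.5M) while improving ProcessBench performance.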
## 3.2 Bias in BoN Sampling for PRM Performance Evaluation
Although BoN evaluations are commonly used in PRM optimization, their effectiveness as a sole optimization criterion is worth careful consideration due to potential limitations in performance assessment.
## 3.2.1 Unreliable Policy Models Cause BoN-PRM Misalignment
In an ideal scenario, the responses generated by the policy model would exhibit both correct answers and accurate solution steps, or conversely, flawed processes would correspond to incorrect answers. However, existing policy models are prone to generating responses with correct answers but flawed processes, while BoN inherently focuses only on answers, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. To provide empirical evidence for this phenomenon, we sample 8 responses per query from GSM8K, MATH, OlympiadBench, and Omni-MATH using the policy model Qwen2.5-Math-7B-Instruct. We then randomly select correct-answer responses from them and manually annotate them thoroughly. As detailed in Figure 6, a substantial percentage of responses contain process errors while maintaining correct answers. Notably, comparing the easier GSM8K with the harder Omni-MATH, this phenomenon becomes more pronounced as problem complexity increases. This implies that an effective PRM may assign low scores to responses with correct answers but flawed processes, resulting in overall lower performance in the BoN evaluation.
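The misalignment is visible in how BoN accuracy is computed: only the final answer of the PRM's top-ranked response is checked, so a flawed process that lands on the right answer still earns full credit. A minimal sketch (field names are illustrative):

```python
def best_of_n_accuracy(queries):
    """For each query, pick the response the PRM scores highest and
    check only its final answer; process errors are invisible here."""
    correct = sum(
        max(q["responses"], key=lambda r: r["prm_score"])["answer_correct"]
        for q in queries
    )
    return correct / len(queries)

queries = [{
    "responses": [
        {"prm_score": 0.9, "answer_correct": True},  # correct answer, flawed steps
        {"prm_score": 0.7, "answer_correct": True},  # fully correct solution
    ],
}]
print(best_of_n_accuracy(queries))  # 1.0 either way: process quality never enters the metric
```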
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Line Chart: Best-of-8 Accuracy vs. ProcessBench F1 Score Across Thresholds
### Overview
The image is a dual-axis line chart comparing the performance of two metrics, "Best-of-8 Mean Acc (%)" and "ProcessBench Mean F1 (%)", as a function of an increasing "Threshold" parameter. The chart demonstrates a negative correlation between the threshold value and both performance metrics.
### Components/Axes
* **Chart Type:** Dual-axis line chart with markers.
* **X-Axis:**
* **Label:** `Threshold`
* **Scale:** Categorical, with 8 discrete points: `0`, `1/8`, `2/8`, `3/8`, `4/8`, `5/8`, `6/8`, `7/8`.
* **Primary Y-Axis (Left):**
* **Label:** `Best-of-8 Mean Acc (%)`
* **Scale:** Linear, ranging from `64.0` to `66.0`, with major ticks at 0.5 intervals.
* **Secondary Y-Axis (Right):**
* **Label:** `ProcessBench Mean F1 (%)`
* **Scale:** Linear, ranging from `28` to `42`, with major ticks at 2-unit intervals.
* **Legend:**
* **Position:** Bottom center, below the x-axis.
* **Series 1:** `Best-of-8` - Represented by a blue line with circular markers.
* **Series 2:** `ProcessBench` - Represented by an orange line with square markers.
### Detailed Analysis
**Data Series 1: Best-of-8 Mean Acc (%) (Blue Line, Left Axis)**
* **Trend:** The line shows a general downward trend as the threshold increases, with minor fluctuations.
* **Data Points (Threshold, Value):**
* (0, 65.5)
* (1/8, 65.3)
* (2/8, 65.0)
* (3/8, 64.8)
* (4/8, 64.9) *[Slight increase from previous point]*
* (5/8, 64.5)
* (6/8, 64.6) *[Slight increase from previous point]*
* (7/8, 64.4)
**Data Series 2: ProcessBench Mean F1 (%) (Orange Line, Right Axis)**
* **Trend:** The line shows a consistent and more pronounced downward trend compared to the Best-of-8 line. The rate of decrease accelerates after the 3/8 threshold.
* **Data Points (Threshold, Value):**
* (0, 40.7)
* (1/8, 40.7) *[Plateau]*
* (2/8, 39.9)
* (3/8, 37.9)
* (4/8, 36.6)
* (5/8, 35.6)
* (6/8, 32.6)
* (7/8, 30.0)
### Key Observations
1. **Correlated Decline:** Both metrics degrade as the threshold increases, suggesting a trade-off or a common sensitivity to this parameter.
2. **Differential Sensitivity:** The ProcessBench F1 score (orange) is significantly more sensitive to the threshold increase than the Best-of-8 Accuracy (blue). The F1 score drops by approximately 10.7 percentage points (from 40.7% to 30.0%), while the accuracy drops by 1.1 percentage points (from 65.5% to 64.4%).
3. **Non-Linear Drop in F1:** The ProcessBench decline is not linear. It is relatively flat between thresholds 0 and 1/8, then decreases steadily, with the steepest drop occurring between thresholds 5/8 (35.6%) and 6/8 (32.6%).
4. **Minor Fluctuations in Accuracy:** The Best-of-8 accuracy line is not perfectly monotonic, showing small rebounds at thresholds 4/8 and 6/8, though the overall direction is downward.
### Interpretation
The x-axis threshold here corresponds to the hard-label cutoff applied to MC estimates (see the caption of Figure 5): a step is labeled positive only if the fraction of its completions reaching the correct final answer exceeds the threshold. "Best-of-8" measures final-answer accuracy when the PRM selects among 8 sampled responses, while "ProcessBench" evaluates step-level error identification via F1 score.
The data shows that **raising the threshold degrades both metrics, and it disproportionately degrades process verification (F1) relative to final-answer accuracy.** A stricter cutoff flips more genuinely correct steps to negative labels, and this label noise hurts step-level discrimination the most. The plateau in F1 at the lowest thresholds (0 to 1/8) indicates little sensitivity in that region, while the accelerating decline after 3/8 marks the point beyond which the cutoff becomes clearly detrimental. This is consistent with the recommendation of threshold 0, i.e., treating only MC estimates of exactly 0 as negative.
</details>
Figure 5: PRM Performance changes on Best-of-8 and PROCESSBENCH across different hard label thresholds.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Bar Chart: Process Error Rate by Benchmark
### Overview
The image is a vertical bar chart comparing the "Process Error Rate (%)" across four different mathematical reasoning benchmarks. The chart uses a simple, clean design with blue bars on a white background, and each bar is labeled with its exact percentage value.
### Components/Axes
* **Chart Type:** Vertical Bar Chart
* **Y-Axis:**
* **Title:** "Process Error Rate (%)" (rotated vertically on the left side).
* **Scale:** Linear scale from 0 to 40, with major tick marks at intervals of 10 (0, 10, 20, 30, 40).
* **X-Axis:**
* **Categories (from left to right):** "GSM8K", "MATH", "Olympiad Bench", "Omni-MATH".
* **Data Series:** A single data series represented by blue bars. There is no legend, as only one metric is being compared across categories.
* **Data Labels:** Each bar has its exact numerical value displayed centered above it.
### Detailed Analysis
The chart presents the following data points for Process Error Rate:
1. **GSM8K:** The bar is the shortest, located at the far left. Its labeled value is **5.1%**.
2. **MATH:** The second bar from the left is taller than the first. Its labeled value is **11.9%**.
3. **Olympiad Bench:** The third bar is significantly taller than the previous two. Its labeled value is **27.4%**.
4. **Omni-MATH:** The bar on the far right is the tallest in the chart. Its labeled value is **43.4%**.
**Trend Verification:** The visual trend is a clear and consistent upward slope from left to right. Each subsequent benchmark shows a higher process error rate than the one before it, with the increase becoming more pronounced after the MATH benchmark.
### Key Observations
* **Monotonic Increase:** There is a strict, monotonic increase in error rate across the four benchmarks as presented on the x-axis.
* **Magnitude of Increase:** The jump in error rate is most substantial between "MATH" (11.9%) and "Olympiad Bench" (27.4%), an increase of 15.5 percentage points. The increase from "Olympiad Bench" to "Omni-MATH" is also large at 16.0 percentage points.
* **Relative Performance:** The error rate for "Omni-MATH" (43.4%) is more than 8 times higher than that for "GSM8K" (5.1%).
* **Visual Scaling:** The y-axis scale (0-40) is appropriate for the data, as the highest value (43.4%) slightly exceeds the top axis marker, drawing visual attention to it.
### Interpretation
This chart demonstrates a strong positive correlation between the presumed complexity or difficulty of a mathematical reasoning benchmark and the "Process Error Rate" of the system being evaluated. GSM8K, often considered a benchmark for grade-school level math, shows a very low error rate. The error rate more than doubles for the more advanced MATH benchmark. The rate then more than doubles again for Olympiad-level problems and peaks with Omni-MATH, which likely represents a comprehensive or extremely challenging suite of problems.
The data suggests that the evaluated system's reliability in its reasoning *process* degrades significantly as the mathematical problems become more complex. The high error rate on Omni-MATH (43.4%) indicates that for this most challenging category, the system's process fails nearly half the time, which is a critical insight for understanding its limitations. The chart effectively argues that benchmark difficulty is a primary driver of process failure for this system.
</details>
Figure 6: Proportion of cases where the policy model generates correct answers but incorrect reasoning steps.
Figure 7: Performance trends on BoN and PROCESSBENCH for models trained with different data sources.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Dual-Axis Line Chart: Comparison of Accuracy Metrics Across Methods
### Overview
The image displays a dual-axis line chart comparing two accuracy metrics, "Best-of-8 Mean Acc (%)" and "Extracted ProcessBench Mean Acc (%)", across three different methods or evaluation approaches. The chart illustrates how performance changes between a baseline method and two proposed methods ("ours").
### Components/Axes
* **Chart Type:** Dual-axis line chart with markers.
* **X-Axis (Categorical):** Lists three methods.
* Label 1 (Left): `MC (Math-Shepherd)`
* Label 2 (Center): `MC (ours)`
* Label 3 (Right): `LLM-as-Judge (ours)`
* **Primary Y-Axis (Left):**
* Title: `Best-of-8 Mean Acc (%)`
* Scale: Linear, ranging from 63.0 to 67.0, with major ticks at 0.5% intervals.
* **Secondary Y-Axis (Right):**
* Title: `Extracted ProcessBench Mean Acc (%)`
* Scale: Linear, ranging from 0 to 40, with major ticks at 5% intervals.
* **Legend:** Positioned at the bottom center of the chart area.
* Entry 1: `Best-of-8` - Represented by a blue line with square markers.
* Entry 2: `Extracted ProcessBench` - Represented by an orange line with circular markers.
* **Data Points (Approximate Values):**
* **Best-of-8 (Blue Line, Left Axis):**
* At `MC (Math-Shepherd)`: ~64.3%
* At `MC (ours)`: ~65.9%
* At `LLM-as-Judge (ours)`: ~64.9%
* **Extracted ProcessBench (Orange Line, Right Axis):**
* At `MC (Math-Shepherd)`: ~63.3%
* At `MC (ours)`: ~65.2%
* At `LLM-as-Judge (ours)`: ~66.7%
### Detailed Analysis
The chart plots two distinct performance trends across the three evaluated methods.
1. **Best-of-8 Accuracy (Blue Line, Left Axis):** This series shows an initial increase followed by a slight decrease.
* The line slopes upward from `MC (Math-Shepherd)` (~64.3%) to `MC (ours)` (~65.9%), indicating a performance gain of approximately 1.6 percentage points.
* It then slopes slightly downward from `MC (ours)` to `LLM-as-Judge (ours)` (~64.9%), a decrease of about 1.0 percentage point.
* **Trend:** Inverted-V shape, peaking at the central method.
2. **Extracted ProcessBench Accuracy (Orange Line, Right Axis):** This series shows a consistent upward trend.
* The line slopes upward from `MC (Math-Shepherd)` (~63.3%) to `MC (ours)` (~65.2%), a gain of approximately 1.9 percentage points.
* It continues to slope upward more steeply from `MC (ours)` to `LLM-as-Judge (ours)` (~66.7%), a further gain of about 1.5 percentage points.
* **Trend:** Consistently ascending line.
### Key Observations
* **Diverging Final Performance:** While both metrics improve from the baseline (`MC (Math-Shepherd)`) to the first proposed method (`MC (ours)`), their paths diverge at the final method (`LLM-as-Judge (ours)`). Best-of-8 accuracy dips slightly, whereas Extracted ProcessBench accuracy continues to rise.
* **Scale Disparity:** The two metrics operate on vastly different scales (0-40% vs. 63-67%), necessitating the dual-axis format. The absolute values for Extracted ProcessBench are significantly lower than for Best-of-8.
* **Peak Performance Points:** The highest value for Best-of-8 is achieved by `MC (ours)`. The highest value for Extracted ProcessBench is achieved by `LLM-as-Judge (ours)`.
### Interpretation
This chart evaluates PRMs trained on different data sources ("MC" denotes Monte Carlo estimation, the data synthesis method discussed in the text, and "Math-Shepherd" is an existing MC-estimated PRM dataset). The "ours" labels indicate data synthesized by the authors.
The data suggests that the authors' methods (`MC (ours)` and `LLM-as-Judge (ours)`) generally improve performance over the baseline (`MC (Math-Shepherd)`) on both metrics. However, the nature of the improvement differs:
* The `MC (ours)` method provides a balanced boost to both the "Best-of-8" metric (which may measure raw answer accuracy from multiple attempts) and the "Extracted ProcessBench" metric (which likely evaluates the quality of the reasoning process or steps).
* The `LLM-as-Judge (ours)` method shows a trade-off: it yields the best performance on the process-oriented metric (Extracted ProcessBench) but results in a slight regression on the final answer accuracy metric (Best-of-8) compared to `MC (ours)`.
This could imply that using an LLM as a judge is particularly effective for evaluating or improving the reasoning process itself, but this focus might come at a minor cost to the optimization of the final answer selection in a "best-of-N" setting. The chart effectively argues that the authors' contributions improve upon the baseline, with different methods excelling on different evaluation criteria.
</details>
Table 5: The accuracy in identifying erroneous steps on the test cases of PROCESSBENCH containing correct answers but erroneous reasoning steps. '# samples' represents the number of test cases.
| | GSM8K | MATH | OlympiadBench | Omni-MATH | Avg. |
|-------------------------------|---------|--------|-----------------|-------------|--------|
| # samples | 7 | 94 | 161 | 259 | |
| 1.5B | | | | | |
| Skywork-PRM-1.5B | 42.9 | 36.2 | 12.4 | 13.9 | 26.4 |
| 7B+ | | | | | |
| Math-Shepherd-PRM-7B | 14.3 | 12.8 | 13.7 | 14.7 | 13.9 |
| RLHFlow-PRM-Mistral-8B | 14.3 | 13.8 | 7.5 | 10.0 | 11.4 |
| RLHFlow-PRM-Deepseek-8B | 0.0 | 18.1 | 9.9 | 10.8 | 9.7 |
| Skywork-PRM-7B | 57.1 | 26.6 | 14.3 | 13.1 | 27.8 |
| EurusPRM-Stage1 | 28.6 | 25.5 | 19.9 | 20.1 | 23.5 |
| EurusPRM-Stage2 | 42.9 | 27.7 | 18.0 | 20.8 | 27.4 |
| Qwen2.5-Math-7B-Math-Shepherd | 0.0 | 9.6 | 4.3 | 1.2 | 3.8 |
| Qwen2.5-Math-7B-PRM800K | 42.9 | 50.0 | 31.7 | 28.2 | 38.2 |
| * Qwen2.5-Math-PRM-7B | 42.9 | 68.1 | 48.4 | 56.0 | 53.9 |
| 72B | | | | | |
| * Qwen2.5-Math-PRM-72B | 28.6 | 76.6 | 62.7 | 64.5 | 58.1 |
## 3.2.2 Limited Process Verification Capability in PRMs Leads to BoN Score Inflation
When the PRM cannot distinguish responses that have correct answers but flawed processes and assigns them high scores, the BoN evaluation overestimates performance, creating an overly optimistic and potentially misleading assessment of PRM capabilities. To investigate the discriminative capability of PRMs for such cases, we extract instances from PROCESSBENCH where the answer is correct but the process is erroneous and analyze the detection accuracy of PRMs on these cases. As shown in Figure 7, the PRMs trained on MC estimation, LLM-as-a-judge, and human annotation exhibit completely opposite performance trends in the BoN and extracted PROCESSBENCH evaluations. The model trained on our MC-estimated data shows limited process verification capability yet inflated BoN results.
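The extraction and scoring just described can be sketched as follows (field names and the toy predictor are illustrative; `error_step` stands for the labeled index of the first erroneous step, `None` if the process is fully correct):

```python
def extracted_detection_accuracy(cases, prm_first_error):
    """Among cases with a correct final answer but an erroneous process,
    measure how often the PRM locates the labeled first erroneous step."""
    extracted = [c for c in cases if c["answer_correct"] and c["error_step"] is not None]
    hits = sum(prm_first_error(c) == c["error_step"] for c in extracted)
    return hits / len(extracted)

cases = [
    {"answer_correct": True,  "error_step": 2},     # extracted
    {"answer_correct": True,  "error_step": None},  # fully correct: excluded
    {"answer_correct": False, "error_step": 1},     # wrong answer: excluded
    {"answer_correct": True,  "error_step": 4},     # extracted
]
acc = extracted_detection_accuracy(cases, lambda c: 2)  # toy PRM always blames step 2
print(acc)  # 0.5
```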
On the other hand, as shown in Table 5, except for our released PRMs Qwen2.5-Math-PRM-7B and Qwen2.5-Math-PRM-72B, all other open-source PRMs demonstrate detection accuracy below 50%. This limited discriminative capability indicates that PRMs struggle to differentiate between genuinely correct responses and those with merely superficial answer correctness in BoN evaluations. Consequently, beyond BoN evaluation, supplementary benchmarks are necessary to assess the actual capability of PRMs, especially in detecting process errors.
## 3.2.3 Process-to-Outcome Shift in BoN Optimized PRMs
The majority of current PRMs are optimized towards BoN. However, the limitations of BoN induce a process-to-outcome shift in PRMs. When selecting responses by PRM-predicted scores, following the response scoring method of Lightman et al. (2023), we find that regardless of whether the minimum step score or the product of step scores is used to evaluate the full solution, the lowest step score acts as the key limiting factor in the selection criteria of PRMs.
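The two solution-level scoring rules mentioned above can be sketched as follows; either aggregation is dominated by the lowest step score:

```python
import math

def solution_score(step_scores, method="min"):
    """Aggregate per-step PRM scores into one solution-level score."""
    if method == "min":
        return min(step_scores)
    if method == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown method: {method}")

steps = [0.95, 0.9, 0.1, 0.92]        # one low-scored step
print(solution_score(steps, "min"))   # 0.1
print(solution_score(steps, "prod"))  # ~0.079: the 0.1 step dominates the product too
```

Consequently, if a PRM systematically assigns its minimum score to the final answer step, both aggregations effectively reduce to outcome judging, which is exactly the shift Figure 8 quantifies.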
Figure 8: Percentage of responses where the minimum step score predicted by PRMs appears in the final step (among all Best-of-8 responses from Qwen2.5-Math-7B-Instruct).
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Horizontal Bar Chart: Minimum Score at Last Step (%) by AI Model
### Overview
The image displays a horizontal bar chart comparing the performance of various AI models based on a metric called "Minimum Score at Last Step (%)". The chart ranks models from highest to lowest score, with the model names listed on the vertical axis (y-axis) and the percentage score on the horizontal axis (x-axis).
### Components/Axes
* **Chart Type:** Horizontal Bar Chart.
* **Y-Axis (Vertical):** Lists the names of 11 different AI models. The labels are left-aligned.
* **X-Axis (Horizontal):** Labeled "Minimum Score at Last Step (%)". The scale runs from 0 to 60, with major tick marks at intervals of 10 (0, 10, 20, 30, 40, 50, 60).
* **Data Series:** A single data series represented by solid blue horizontal bars. Each bar's length corresponds to the model's score.
* **Data Labels:** The exact percentage value is printed at the end of each bar, to the right.
* **Legend:** Not present as a separate element; model identification is via the y-axis labels.
### Detailed Analysis
The models and their corresponding scores, listed from top (highest score) to bottom (lowest score), are as follows:
| Rank | Model Name | Score (%) |
| :--- | :--- | :--- |
| 1 | EurusPRM-Stage1 | 54.6 |
| 2 | EurusPRM-Stage2 | 52.9 |
| 3 | Math-Shepherd-PRM-7B | 44.5 |
| 4 | Skywork-PRM-7B | 42.2 |
| 5 | Skywork-PRM-1.5B | 30.9 |
| 6 | Qwen2.5-Math-7B-PRM800K | 26.8 |
| 7 | Qwen2.5-Math-PRM-72B | 18.0 |
| 8 | Qwen2.5-Math-PRM-7B | 17.5 |
| 9 | RLHF-PRM-Deepseek-8B | 17.3 |
| 10 | Qwen2.5-Math-7B-Math-Shepherd | 9.8 |
| 11 | RLHF-PRM-Mistral-8B | 9.1 |
**Trend Verification:** The visual trend is a clear, stepwise decrease in bar length from the top model to the bottom model. The two "EurusPRM" models form a distinct top tier, both scoring above 50%. There is a significant drop of over 8 percentage points between the 2nd and 3rd ranked models. Another notable drop occurs between the 5th and 6th models (from 30.9% to 26.8%). The bottom four models all score below 20%.
### Key Observations
* **Top Performers:** The "EurusPRM" series (Stage1 and Stage2) significantly outperforms all other models listed, with scores above 50%.
* **Model Family Clustering:** Models from the same family or with similar naming conventions tend to cluster together in performance. For example, the two "Skywork-PRM" models are adjacent, and three "Qwen2.5-Math" variants are grouped in the lower half.
* **Performance Gap:** There is a large performance gap (approximately 25.5 percentage points) between the highest-scoring model (EurusPRM-Stage1) and the lowest-scoring model (RLHF-PRM-Mistral-8B).
* **Parameter Size vs. Performance:** The chart does not show a simple correlation between model parameter size (e.g., 7B, 72B, 8B) and score. For instance, "Qwen2.5-Math-PRM-72B" (18.0%) scores lower than the smaller "Skywork-PRM-1.5B" (30.9%).
### Interpretation
In the context of this paper, a higher percentage here is undesirable: it means the PRM's minimum step score frequently falls on the final answer step, i.e., the model's BoN judgments are driven by the outcome rather than the intermediate reasoning process. The EurusPRM models (above 50%) thus exhibit the strongest outcome bias, whereas models such as Qwen2.5-Math-PRM-7B/72B and the RLHF-PRM variants (below 20%) remain more process-oriented. The wide spread of scores (from ~9% to ~55%) indicates that training methodology, rather than model size, determines how strongly a PRM's scoring concentrates on the final answer.
</details>
Figure 9: Performance on BoN across multiple PRMs with different scoring methods: minimum, product and last.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Grouped Bar Chart: Best-of-5 Mean Accuracy Comparison
### Overview
The image displays a grouped bar chart comparing the "Best-of-5 Mean Accuracy (%)" across six different methods or datasets. Each method is evaluated using three distinct metrics: "min", "product", and "last". The chart is designed to compare the performance of these evaluation strategies across various training or annotation paradigms.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:**
* **Label:** "Best-of-5 Mean Acc (%)"
* **Scale:** Linear, ranging from 63.0 to 67.5.
* **Major Ticks:** 63.0, 63.5, 64.0, 64.5, 65.0, 65.5, 66.0, 66.5, 67.0, 67.5.
* **X-Axis (Categories):** Six distinct categories are listed from left to right:
1. `MC-hard labels (860k)`
2. `MC-hard labels (3M)`
3. `MC-soft labels (3M)`
4. `MC-Math-Shepherd (3M)`
5. `human-annotation-PRM800K (860k)`
6. `LLM-as-a-judge (860k)`
* **Legend:** Located in the top-right corner of the chart area.
* **Blue Bar:** `min`
* **Orange Bar:** `product`
* **Green Bar:** `last`
### Detailed Analysis
The following table reconstructs the approximate numerical values for each bar, based on visual alignment with the y-axis grid lines. Values are reported to one decimal place.
| Category (X-Axis) | `min` (Blue) | `product` (Orange) | `last` (Green) |
| :--- | :--- | :--- | :--- |
| **MC-hard labels (860k)** | 64.1 | 65.9 | 66.7 |
| **MC-hard labels (3M)** | 64.0 | 65.5 | 66.9 |
| **MC-soft labels (3M)** | 63.7 | 64.4 | 65.5 |
| **MC-Math-Shepherd (3M)** | 64.9 | 64.3 | 65.4 |
| **human-annotation-PRM800K (860k)** | 65.6 | 64.9 | 64.7 |
| **LLM-as-a-judge (860k)** | 65.6 | 65.3 | 65.3 |
**Trend Verification per Category:**
* **MC-hard labels (860k & 3M):** Clear ascending trend from `min` to `product` to `last`. The `last` metric performs best.
* **MC-soft labels (3M):** Ascending trend (`min` < `product` < `last`).
* **MC-Math-Shepherd (3M):** Non-monotonic trend. `min` (64.9) is higher than `product` (64.3), with `last` (65.4) being the highest.
* **human-annotation-PRM800K (860k):** Descending trend. `min` (65.6) is the highest, followed by `product` (64.9), and then `last` (64.7). This is the only category where `min` is the top performer.
* **LLM-as-a-judge (860k):** `min` (65.6) is slightly higher than `product` and `last`, which are tied at 65.3.
### Key Observations
1. **Dominance of `last`:** The `last` metric (green bar) achieves the highest accuracy in four out of six categories, with a peak of 66.9% for `MC-hard labels (3M)`.
2. **Performance of `min`:** The `min` metric (blue bar) shows the most variance. It is the lowest performer in the first three categories but becomes the highest performer in the `human-annotation` and `LLM-as-a-judge` categories.
3. **Category Performance:** The `MC-hard labels` methods (both 860k and 3M) achieve the highest overall accuracy values on the chart (up to 66.9%). The `MC-soft labels` method shows the lowest overall performance, with its highest bar (`last`) at 65.5%.
4. **Notable Anomaly:** The `human-annotation-PRM800K (860k)` category breaks the common pattern. It is the only category where the `min` metric outperforms both `product` and `last`.
### Interpretation
This chart compares PRMs trained on labels from different sources; MC denotes Monte Carlo estimation, and the parenthesized figures (860k, 3M) give the training-data sizes. The three metrics (`min`, `product`, `last`) are strategies for aggregating per-step PRM scores into a response-level score for Best-of-5 selection: the minimum step score, the product of step scores, and the final step score, respectively.
The consistent superiority of `last` for the MC-estimation-trained PRMs, and of `min` for the PRMs trained on human-annotated (PRM800K) and LLM-as-a-judge labels, reflects the different semantics of the step scores: an MC-estimated step score predicts the probability of eventually reaching the correct answer, so the final step's estimate already integrates the whole solution, whereas per-step correctness labels support minimum or multiplicative aggregation. The chart thus provides an empirical basis for matching the scoring strategy to the data-construction method.
</details>
As shown in Figure 8, we analyze the distribution of minimum step scores assigned by multiple open-sourced PRMs, specifically focusing on cases where the lowest score occurs at the final step, which typically contains the final answer. The results show that EurusPRM-Stage1, EurusPRM-Stage2, Math-Shepherd-PRM-7B, and Skywork-PRM-7B exhibit notably high proportions in this category, all exceeding 40%. In contrast, our released PRMs Qwen2.5-Math-PRM-72B and Qwen2.5-Math-PRM-7B exhibit a significantly lower proportion of minimum scores at the final step.
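The statistic in Figure 8 can be reproduced with a few lines of code. The sketch below assumes responses are given as lists of per-step PRM scores; the function name and data layout are illustrative, not from our released code.

```python
# Sketch of the Figure 8 statistic: the share of responses whose minimum
# PRM step score falls on the final (answer) step. Data layout is assumed.

def min_score_at_last_step_pct(step_scores: list[list[float]]) -> float:
    """step_scores[i] holds the PRM score of every step of response i."""
    hits = sum(
        1
        for scores in step_scores
        if scores and min(range(len(scores)), key=scores.__getitem__) == len(scores) - 1
    )
    return 100.0 * hits / len(step_scores)

# Toy example: the lowest score is the final step in 2 of 3 responses.
demo = [[0.9, 0.8, 0.2], [0.3, 0.9, 0.9], [0.9, 0.7, 0.1]]
print(round(min_score_at_last_step_pct(demo), 1))  # 66.7
```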
This analysis reveals that some PRMs' performance in BoN evaluation is predominantly determined by final-answer scores rather than intermediate reasoning steps, indicating a degradation from process-based to outcome-oriented assessment. In other words, optimizing solely for the BoN evaluation has made current PRMs behave more like ORMs in practice. Hence, it is essential to supplement the response-level BoN evaluation with step-level assessment methods to avoid the process-to-outcome shift. Specifically, we can employ process error localization tasks such as PROCESSBENCH. Other commonly used step-wise BoN methodologies integrate PRMs or value models with search mechanisms, which provide a more granular assessment of process reliability. It is worth noting that the latter incurs higher computational costs.
## 3.2.4 Different PRMs, Different Optimal Scoring Strategies
In the BoN evaluation, the overall solution score is derived by combining individual step scores. When each step's score represents the probability of that specific step being correct, it's generally acceptable to combine these step-level scores (through methods like product or minimum) to calculate the overall solution score. However, the situation becomes different when using MC estimation. In this case, each step's score actually estimates the probability of reaching the correct final answer in the future from the current position. Given this forward-looking nature of MC estimation, we should neither multiply the estimated probabilities across steps (as these estimates are dependent on each other), nor simply take the minimum estimated value from a particular step as the overall score. Instead, the estimated value from the final step naturally integrates information from the entire solution process, making it more suitable as the final score for the complete solution.
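To make the three aggregation strategies concrete, here is a minimal sketch in Python; the function name and interface are illustrative rather than taken from any released implementation.

```python
import math

# The three response-level scoring strategies discussed above, applied to a
# list of per-step PRM scores.

def aggregate(step_scores: list[float], strategy: str) -> float:
    if strategy == "min":      # lowest step score bounds the response score
        return min(step_scores)
    if strategy == "product":  # treats step scores as per-step correctness probabilities
        return math.prod(step_scores)
    if strategy == "last":     # MC-style: the final estimate reflects the whole solution
        return step_scores[-1]
    raise ValueError(f"unknown strategy: {strategy}")

steps = [0.9, 0.8, 0.95]
print(aggregate(steps, "min"))      # 0.8
print(aggregate(steps, "last"))     # 0.95
print(aggregate(steps, "product"))  # ≈ 0.684
```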
To validate this, we evaluate BoN under different scoring strategies for the PRMs trained on MC estimation, LLM-as-a-judge, and human annotation data, as shown in Figure 9. We find that for MC estimation, using the last score performs significantly better than the product and minimum approaches across multiple PRMs, while the trend is the opposite for human annotation and LLM-as-a-judge. This suggests that if a PRM has to be trained via MC estimation and evaluated in BoN, the last-score strategy may be more reasonable and effective. However, it is worth noting that this use of a PRM in BoN deviates from the PRM's original intended purpose.
## 3.2.5 Summary
The above observations underscore critical limitations in BoN evaluation. Firstly, unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. Secondly, the limited process verification capability makes PRMs tolerant of responses with correct answers but flawed reasoning processes, resulting in inflated BoN performance. Thirdly, model optimization solely focused on BoN evaluation leads PRMs to drift towards prioritizing final answers over reasoning processes. Therefore, we argue that supplementary step-level evaluation plays a crucial role in PRM evaluation.
Finally, in BoN, different PRMs have different optimal scoring strategies. The last-score strategy may be more reasonable and effective for PRMs trained via MC estimation. In contrast, product and minimum scoring are more appropriate for LLM-as-a-judge and human annotation.
## 4 Our PRMs
This section presents our methodology for overcoming the previously discussed limitations and the details of our trained PRMs, which achieve state-of-the-art performance. Additionally, we outline our experimental settings, baseline models for comparison, and evaluation results.
## 4.1 Training Details
The data construction procedure comprises two primary phases: data expansion and data filtering. In the expansion phase, we follow the MC estimation approach to construct data as described in Section 2.1. We employ hard labels, where a response is classified as negative only if none of the 8 completions achieves the correct final answer. In the subsequent filtering phase, we employ an LLM instantiated by Qwen2.5-72B-Instruct (Yang et al., 2024b) to serve as a critic that verifies the reasoning process of all responses step by step, i.e., LLM-as-a-judge. We implement a simple yet efficient consensus filtering mechanism: instances where the LLM-annotated and MC-estimated process labels disagree are filtered out. This ensures the retained data maintains high quality and consistency in the reasoning process annotation. For the training task, we employ a cross-entropy loss on the tokens at the end of each step to train the binary classification task. We train both 7B- and 72B-parameter PRMs, initialized from Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct respectively.
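As a hedged illustration of the two phases described above, the sketch below shows the hard-labeling rule and the consensus filter under assumed data structures; the field names `mc_labels` and `judge_labels` are hypothetical placeholders, not names from our training pipeline.

```python
# Illustrative sketch of MC hard labeling and consensus filtering.
# Field names ("mc_labels", "judge_labels") are hypothetical.

def mc_hard_label(completions_correct: list[bool]) -> int:
    """Hard label for a step: negative (0) only if none of the 8 rollouts
    starting from it reaches the correct final answer."""
    return 1 if any(completions_correct) else 0

def consensus_filter(instances: list[dict]) -> list[dict]:
    """Keep an instance only if MC-estimated and LLM-judged step labels agree."""
    return [
        inst
        for inst in instances
        if inst["mc_labels"] == inst["judge_labels"]
    ]

data = [
    {"id": 0, "mc_labels": [1, 1, 0], "judge_labels": [1, 1, 0]},  # agree -> keep
    {"id": 1, "mc_labels": [1, 1, 1], "judge_labels": [1, 0, 1]},  # disagree -> drop
]
print([d["id"] for d in consensus_filter(data)])  # [0]
```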
## 4.2 Experimental Setup
To validate the effectiveness of our trained PRMs Qwen2.5-Math-PRM-7B and Qwen2.5-Math-PRM-72B, we conduct both the response-level BoN evaluation and the step-level process error identification task PROCESSBENCH (Zheng et al., 2024).
Best-of-N We follow the experimental setting in Section 2.2. In rm@8, we evaluate Outcome Reward Models (ORMs) and Process Reward Models (PRMs). For ORMs, we introduce Qwen2.5-Math-RM-72B (Yang et al., 2024c), which assigns a single score to each complete response. For PRMs, we compute the product of each step score as the final response score. We compare with the following PRMs:
- Math-Shepherd-PRM-7B (Wang et al., 2024b): determining process labels for each step by estimating the empirical probability of reaching the correct final answer.
- RLHFlow-PRM-Mistral-8B & RLHFlow-PRM-Deepseek-8B (Xiong et al., 2024): two LLaMA3.1-based PRMs that adopt Math-Shepherd's training methodology while implementing different solution generation models and optimization objectives.
- Skywork-PRM-1.5B & Skywork-PRM-7B (Skywork, 2024): two recently released Qwen2.5-Math-based PRMs by Skywork.
- EurusPRM-Stage1 & EurusPRM-Stage2 (Cui et al., 2025): two 7B-parameter PRMs trained using the Implicit PRM approach (Yuan et al., 2024), which obtain process rewards by relying on an ORM trained on response-level labels.
- Qwen2.5-Math-7B-Math-Shepherd & Qwen2.5-Math-7B-PRM800K: two additional PRMs we developed by fine-tuning Qwen2.5-Math-7B-Instruct separately on the PRM800K (Lightman et al., 2023) and Math-Shepherd (Wang et al., 2024b) open-source datasets.
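For concreteness, the rm@8 selection rule described above, scoring each response by the product of its step scores and keeping the best candidate, can be sketched as follows; the candidate dict format is an assumption for illustration.

```python
import math

# Sketch of Best-of-N selection with a PRM (rm@8 setting): each candidate
# response is scored by the product of its step scores; the best one is kept.

def best_of_n(candidates: list[dict]) -> dict:
    """Each candidate: {'answer': str, 'step_scores': [float, ...]}."""
    return max(candidates, key=lambda c: math.prod(c["step_scores"]))

candidates = [
    {"answer": "42", "step_scores": [0.9, 0.9, 0.9]},    # product ≈ 0.729
    {"answer": "41", "step_scores": [0.99, 0.5, 0.99]},  # product ≈ 0.490
]
print(best_of_n(candidates)["answer"])  # 42
```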
PROCESSBENCH The compared PRMs are consistent with the previously mentioned PRMs. For LLMs prompted as critic models, i.e., LLM-as-a-judge, we compare with the proprietary language models GPT-4o-0806 (Hurst et al., 2024) and o1-mini (OpenAI, 2024), and the open-source language models Llama-3.3-70B-Instruct (Dubey et al., 2024), Qwen2.5-Math-72B-Instruct (Yang et al., 2024c), Qwen2.5-72B-Instruct (Yang et al., 2024b), and QwQ-32B-Preview (Qwen, 2024). We also decompose the N-step response trajectory into N separate instances to enable individual scoring by the ORM Qwen2.5-Math-RM-72B.
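The trajectory decomposition for ORM scoring can be sketched as below, assuming each of the N instances is the cumulative prefix of the first i steps; this is one plausible realization, and the exact instance format used in our experiments may differ.

```python
# Sketch: turn an N-step response into N instances for individual ORM scoring,
# here realized as cumulative step prefixes (an assumption for illustration).

def decompose_trajectory(steps: list[str]) -> list[str]:
    return ["\n".join(steps[: i + 1]) for i in range(len(steps))]

print(decompose_trajectory(["Step 1: ...", "Step 2: ...", "Step 3: ..."]))
# 3 instances: step 1 alone, steps 1-2, steps 1-3
```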
## 4.3 Experimental Results
Best-of-N The evaluation on the policy model Qwen2.5-Math-7B-Instruct is shown in Table 6. Qwen2.5-Math-PRM-7B demonstrates superior performance compared to other PRMs of equivalent model scale. Notably, it outperforms maj@8 across all 7 tasks, achieving an average improvement of 1.4%. Furthermore,
| Setting | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|-------------------------------|---------|--------|----------------|------------------|------------------|----------------|-------------|--------|
| pass@8 (Upper Bound) | 98.1 | 92.0 | 49.3 | 80.5 | 59.6 | 52.6 | 90.5 | 74.7 |
| maj@8 | 96.7 | 87.1 | 41.2 | 72.5 | 44.4 | 47.8 | 73.8 | 66.2 |
| 1.5B | | | | | | | | |
| Skywork-PRM-1.5B | 96.9 | 86.7 | 37.9 | 70.1 | 42.1 | 47.9 | 67.9 | 64.2 |
| 7B+ | | | | | | | | |
| Math-Shepherd-PRM-7B | 97.3 | 85.4 | 37.9 | 70.6 | 40.4 | 47.2 | 70.5 | 64.2 |
| RLHFlow-PRM-Mistral-8B | 97.0 | 86.1 | 37.1 | 70.6 | 41.2 | 47.6 | 69.5 | 64.2 |
| RLHFlow-PRM-Deepseek-8B | 97.3 | 86.3 | 40.8 | 70.9 | 42.2 | 47.2 | 69.3 | 64.9 |
| Skywork-PRM-7B | 97.3 | 87.3 | 38.2 | 71.9 | 43.7 | 47.8 | 67.7 | 64.8 |
| EurusPRM-Stage1 | 95.6 | 83.0 | 35.7 | 66.2 | 38.2 | 46.2 | 66.6 | 61.6 |
| EurusPRM-Stage2 | 95.4 | 83.4 | 34.9 | 67.3 | 39.1 | 46.3 | 67.3 | 62.0 |
| Qwen2.5-Math-7B-Math-Shepherd | 96.9 | 86.5 | 36.8 | 71.4 | 41.6 | 47.7 | 69.3 | 64.3 |
| Qwen2.5-Math-7B-PRM800K | 96.9 | 86.9 | 37.1 | 71.2 | 44.0 | 47.6 | 70.9 | 64.9 |
| β Qwen2.5-Math-PRM-7B | 97.1 | 88.0 | 42.6 | 74.5 | 47.6 | 48.7 | 74.5 | 67.6 |
| 72B | | | | | | | | |
| Qwen2.5-Math-RM-72B | 97.9 | 88.5 | 42.6 | 75.1 | 49.9 | 49.6 | 78.7 | 68.9 |
| β Qwen2.5-Math-PRM-72B | 97.6 | 88.7 | 46.0 | 74.3 | 48.1 | 49.3 | 81.1 | 69.3 |
Table 6: Performance comparison on the Best-of-8 strategy of the policy model Qwen2.5-Math-7B-Instruct. β represents the models we trained.
Table 7: Performance comparison on PROCESSBENCH. β represents the models we trained. We report the results in the same calculation method with PROCESSBENCH.
| Model | GSM8K | GSM8K | GSM8K | MATH | MATH | MATH | OlympiadBench | OlympiadBench | OlympiadBench | Omni-MATH | Omni-MATH | Omni-MATH | Avg. F1 |
|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------|
| Model | error | correct | F1 | error | correct | F1 | error | correct | F1 | error | correct | F1 | Avg. F1 |
| LLM-as-judge, Proprietary language models | | | | | | | | | | | | | |
| GPT-4-0806 | 70.0 | 91.2 | 79.2 | 54.4 | 76.6 | 63.6 | 45.8 | 58.4 | 51.4 | 45.2 | 65.6 | 53.5 | 61.9 |
| o1-mini | 88.9 | 97.9 | 93.2 | 83.5 | 95.1 | 88.9 | 80.2 | 95.6 | 87.2 | 74.8 | 91.7 | 82.4 | 87.9 |
| LLM-as-judge, Open-source language models | | | | | | | | | | | | | |
| Llama-3.3-70B-Instruct | 72.5 | 96.9 | 82.9 | 43.3 | 83.2 | 59.4 | 31.0 | 94.1 | 46.7 | 28.2 | 90.5 | 43.0 | 58.0 |
| Qwen2.5-Math-72B-Instruct | 49.8 | 96.9 | 65.8 | 36.0 | 94.3 | 52.1 | 19.5 | 97.3 | 32.5 | 19.0 | 96.3 | 31.7 | 45.5 |
| Qwen2.5-72B-Instruct | 62.8 | 96.9 | 76.2 | 46.3 | 93.1 | 61.8 | 38.7 | 92.6 | 54.6 | 36.6 | 90.9 | 52.2 | 61.2 |
| QwQ-32B-Preview | 81.6 | 95.3 | 88.0 | 78.1 | 79.3 | 78.7 | 61.4 | 54.6 | 57.8 | 55.7 | 68.0 | 61.3 | 71.5 |
| PRMs | | | | | | | | | | | | | |
| 1.5B | | | | | | | | | | | | | |
| Skywork-PRM-1.5B | 50.2 | 71.5 | 59.0 | 37.9 | 65.2 | 48.0 | 15.4 | 26.0 | 19.3 | 13.6 | 32.8 | 19.2 | 36.4 |
| 7B+ | | | | | | | | | | | | | |
| Math-Shepherd-PRM-7B | 32.4 | 91.7 | 47.9 | 18.0 | 82.0 | 29.5 | 15.0 | 71.1 | 24.8 | 14.2 | 73.0 | 23.8 | 31.5 |
| RLHFlow-PRM-Mistral-8B | 33.8 | 99.0 | 50.4 | 21.7 | 72.2 | 33.4 | 8.2 | 43.1 | 13.8 | 9.6 | 45.2 | 15.8 | 28.4 |
| RLHFlow-PRM-Deepseek-8B | 24.2 | 98.4 | 38.8 | 21.4 | 80.0 | 33.8 | 10.1 | 51.0 | 16.9 | 10.9 | 51.9 | 16.9 | 26.6 |
| Skywork-PRM-7B | 61.8 | 82.9 | 70.8 | 43.8 | 62.2 | 53.6 | 17.9 | 31.9 | 22.9 | 14.0 | 41.9 | 21.0 | 42.1 |
| EurusPRM-Stage1 | 46.9 | 42.0 | 44.3 | 33.3 | 38.2 | 35.6 | 23.9 | 19.8 | 21.7 | 21.9 | 24.5 | 23.1 | 31.2 |
| EurusPRM-Stage2 | 51.2 | 44.0 | 47.3 | 36.4 | 35.0 | 35.7 | 25.7 | 18.0 | 21.2 | 23.1 | 19.1 | 20.9 | 31.3 |
| Qwen2.5-Math-7B-Math-Shepherd | 46.4 | 95.9 | 62.5 | 18.9 | 96.6 | 31.6 | 7.4 | 93.8 | 13.7 | 4.0 | 95.0 | 7.7 | 28.9 |
| Qwen2.5-Math-7B-PRM800K | 53.1 | 95.3 | 68.2 | 48.0 | 90.1 | 62.6 | 35.7 | 87.3 | 50.7 | 29.8 | 86.1 | 44.3 | 56.5 |
| β Qwen2.5-Math-PRM-7B | 72.0 | 96.4 | 82.4 | 68.0 | 90.4 | 77.6 | 55.7 | 85.5 | 67.5 | 55.2 | 83.0 | 66.3 | 73.5 |
| 72B | | | | | | | | | | | | | |
| Qwen2.5-Math-RM-72B | 41.1 | 46.1 | 43.5 | 39.7 | 58.1 | 47.2 | 28.1 | 56.6 | 37.6 | 18.8 | 50.2 | 27.4 | 38.9 |
| β Qwen2.5-Math-PRM-72B | 78.7 | 97.9 | 87.3 | 74.2 | 88.2 | 80.6 | 67.9 | 82.0 | 74.3 | 64.8 | 78.8 | 71.1 | 78.3 |
the Qwen2.5-Math-PRM-72B exhibits slightly better overall performance than Qwen2.5-Math-RM-72B, with particularly significant improvements observed in the Minerva Math and MMLU STEM tasks. Finally, supplementary BoN results, including BoN performance with the policy model Qwen2.5-Math-72B-Instruct, alternative scoring strategies, evaluations on Chinese benchmarks, BoN with larger N values, and BoN with LLM-as-a-judge, are comprehensively documented in Appendix B.
PROCESSBENCH The evaluation results are presented in Table 7. Compared with LLM-as-judge, Qwen2.5-Math-PRM-7B, despite its smaller model size, demonstrates superior performance over all open-source models. Among proprietary language models, Qwen2.5-Math-PRM-7B outperforms GPT-4o-0806, while a performance gap remains compared to o1-mini. Furthermore, compared with existing PRMs, both Qwen2.5-Math-PRM-7B and Qwen2.5-Math-PRM-72B exhibit substantial advantages over their counterparts. An interesting observation is that the ORM Qwen2.5-Math-RM-72B exhibits considerable capability in identifying step errors, even surpassing some open-source PRMs, which validates its potential as a complementary reward beyond solely rule-based mechanisms.
## 5 Related Work
Reward Model in Mathematical Reasoning To further improve mathematical reasoning accuracy, the reward model plays a crucial role in selecting the best answers. Two main types of reward models have emerged: (1) the Outcome Reward Model (ORM), which provides an evaluation score for the entire solution, especially the final answer; and (2) the Process Reward Model (PRM) (Uesato et al., 2022; Lightman et al., 2023), which evaluates each step in the reasoning process. Previous work (Lightman et al., 2023; Wang et al., 2024b) has demonstrated that PRMs outperform ORMs and exhibit greater potential, though they require more high-quality training data.
Mathematical Reasoning Step Verification There are two primary approaches to evaluating the correctness of reasoning steps. The first approach relies on human annotation (Lightman et al., 2023), which produces high-quality data but suffers from substantial costs. The second approach, which has attracted considerable research attention, focuses on automated evaluation of reasoning step correctness. Current automated methods can be categorized into two main types: (1) backward-propagation based methods that infer step correctness from solution outcomes, including MC estimation (Wang et al., 2024b; Luo et al., 2024; Chen et al., 2024), progressive ORM labeling (Xi et al., 2024), and credit assignment (Wang et al., 2024a; Cui et al., 2025; Yuan et al., 2024) techniques; (2) prompting-based methods that prompt LLMs to serve as critics, i.e., LLM-as-a-judge (Zhang et al., 2024; Gao et al., 2024; Xia et al., 2024), to assess step correctness directly. In this work, we integrate the two approaches: MC estimation and LLM-as-a-judge.
## 6 Conclusion
In this paper, we investigate Process Reward Models (PRMs) and release an effective PRM that demonstrates superior performance. Firstly, we discuss our unsatisfactory trials with MC estimation. Then, through extensive experiments, we demonstrate that data construction via MC estimation yields inferior performance and generalization compared to both LLM-as-a-judge and human annotation. Besides, we investigate the limitations of vanilla BoN evaluation for PRMs, which lead to inaccurate assessment of a PRM's ability and cause an optimization bias that shifts the focus from process-oriented to outcome-oriented verification. Finally, we propose a simple yet effective consensus filtering strategy combining MC estimation and LLM-as-a-judge to overcome the limitations of MC estimation. In terms of evaluation, we conduct the response-level BoN evaluation and the step-level process error identification task PROCESSBENCH to avoid the bias of relying solely on BoN. The experiments demonstrate that our strategy significantly improves both data efficiency and model performance. In the future, there remains substantial potential in data construction and evaluation for PRMs, driving the development of more robust and reliable PRMs.
Limitation Several limitations remain in our current work. Firstly, there exists a considerable performance gap between our PRMs and the BoN upper bound (pass@8), suggesting substantial optimization potential. Secondly, the best practices for utilizing PRMs in reinforcement learning remain unexplored. Finally, although our approach combines LLM-as-a-judge with MC estimation for consensus filtering, the efficient utilization of existing high-quality human annotation data is still largely underexplored. For instance, gradually expanding high-quality datasets through weakly supervised methods is a promising direction for future exploration.
## References
- Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: Process supervision without process, 2024. URL https://arxiv.org/abs/2405.03553 .
- Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021.
- Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards, 2025.
- Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024.
- Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Dayiheng Liu, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang. Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, 2024. URL https://arxiv.org/abs/2406.14024.
- Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008 , 2024.
- Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR . OpenReview.net, 2021a.
- Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 , 2021b.
- Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276 , 2024.
- Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 , 2022.
- Minpeng Liao, Chengxi Li, Wei Luo, Jing Wu, and Kai Fan. MARIO: math reasoning with code interpreter output - A reproducible pipeline. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024 , pages 905-924. Association for Computational Linguistics, 2024.
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050 , 2023.
- Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve mathematical reasoning in language models by automated process supervision, 2024. URL https://arxiv.org/abs/2406.06592 .
- OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023.
- OpenAI. Openai o1-mini: Advancing cost-efficient reasoning, 2024. URL https://openai.com/index/ openai-o1-mini-advancing-cost-efficient-reasoning/ .
- Team Qwen. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/.
- Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024.
- Skywork o1 Team. Skywork-o1 open series, November 2024. URL https://huggingface.co/Skywork.
- Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. Mathscale: Scaling instruction tuning for mathematical reasoning. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=Kjww7ZN47M.
- Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcomebased feedback, 2022. URL https://arxiv.org/abs/2211.14275 .
- Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, and Bo An. Q*: Improving multi-step reasoning for llms with deliberative planning, 2024a. URL https://arxiv.org/abs/2406.14283.
- Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426-9439, August 2024b. doi: 10.18653/v1/2024.acl-long.510. URL https://aclanthology.org/2024.acl-long.510.
- Tianwen Wei, Jian Luan, Wei Liu, Shuang Dong, and Bin Wang. CMATH: Can your language model pass Chinese elementary school math test? CoRR, abs/2306.16636, 2023.
- Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu, Xin Guo, Junzhe Wang, Honglin Guo, Wei Shen, Xiaoran Fan, Yuhao Zhou, Shihan Dou, Xiao Wang, Xinbo Zhang, Peng Sun, Tao Gui, Qi Zhang, and Xuanjing Huang. Training large language models for reasoning through reverse curriculum reinforcement learning, 2024.
- Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, and Pengfei Liu. Evaluating mathematical reasoning beyond accuracy, 2024. URL https://arxiv.org/abs/2404.05692 .
- Wei Xiong, Hanning Zhang, Nan Jiang, and Tong Zhang. An implementation of generative PRM. https://github.com/RLHFlow/RLHF-Reward-Modeling, 2024.
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671 , 2024a.
- An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 , 2024b.
- An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122 , 2024c.
- Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981 , 2024.
- Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction, 2024. URL https://arxiv.org/abs/2408.15240.
- Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559 , 2024.
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595-46623, 2023.
- Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. In NAACL-HLT (Findings) , pages 2299-2314. Association for Computational Linguistics, 2024.
- Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 , 2024.
## A PRM-Guided Search
We further integrate the PRM with greedy search by generating N candidate steps at each step, evaluating the candidates with PRM scoring, and selecting the highest-scoring step for subsequent expansion. For the policy model, we employ Qwen2.5-7B-Instruct, which generates with greater diversity, to sample 8 candidates at each step, with sampling parameters temperature = 1.0 and top_p = 1.0. We conduct comparative experiments against an ORM under the BoN approach. As shown in Table 8, Qwen2.5-Math-PRM-72B with greedy search@8 performs slightly better than Qwen2.5-Math-RM-72B with orm@8. We attribute the relatively small performance gap between the PRM and the ORM to the similar numbers of generated tokens under greedy search and BoN. Furthermore, although greedy search always selects the highest-scoring candidate at each step, that candidate is not necessarily the correct one. Therefore, Depth-First Search (DFS) with backtracking, or search approaches that incorporate score constraints, could prove more suitable for this setting.
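The search loop described above can be sketched as follows. Here `generate_candidates` and `prm_score` are hypothetical stand-ins for the policy model's step sampler and the PRM's step-scoring call, not the actual inference code.

```python
from typing import Callable, List


def prm_greedy_search(
    problem: str,
    generate_candidates: Callable[[str, List[str], int], List[str]],
    prm_score: Callable[[str, List[str], str], float],
    n_candidates: int = 8,
    max_steps: int = 32,
) -> List[str]:
    """Greedy step-level search guided by a PRM.

    At each step, sample N candidate next steps from the policy model,
    score each candidate with the PRM, and keep only the highest-scoring
    one for further expansion.
    """
    steps: List[str] = []
    for _ in range(max_steps):
        candidates = generate_candidates(problem, steps, n_candidates)
        if not candidates:  # policy model has nothing more to propose
            break
        best = max(candidates, key=lambda c: prm_score(problem, steps, c))
        steps.append(best)
        if "\\boxed" in best:  # stop once a final answer is produced
            break
    return steps
```

A DFS variant with backtracking would keep the unchosen candidates on a stack and backtrack whenever every candidate at a node scores below a threshold.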
In greedy search, we choose the highest-scoring candidate at each step, where the PRM-predicted score represents the correctness of that step. Such locally optimal choices, however, may not lead to the correct final answer. In contrast, value models predict the future probability of reaching the correct answer, rather than reflecting the correctness of the current step as rewards do, which makes them particularly well-suited for integration with search strategies. Based on these considerations, we believe there remains significant room to explore more appropriate search strategies, or to combine rewards and values so as to jointly consider the correctness of the current step and the likelihood of reaching a correct final outcome.
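As a minimal illustration of the reward-value combination suggested above (the interpolation weight `alpha` and the helper name are assumptions for exposition, not something we tuned):

```python
def combined_step_score(reward: float, value: float, alpha: float = 0.5) -> float:
    """Blend the PRM's correctness score for the current step (reward) with a
    value model's estimate of eventually reaching a correct answer (value).

    alpha = 1.0 recovers pure PRM-greedy search; alpha = 0.0 is pure value
    guidance. During search, candidates would be ranked by this blended score.
    """
    return alpha * reward + (1.0 - alpha) * value
```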
| Setting | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|------------------------|---------|--------|----------------|------------------|------------------|----------------|-------------|--------|
| pass@8 (Upper Bound) | 96.9 | 89.6 | 48.2 | 79.7 | 58.4 | 55.0 | 81.6 | 72.8 |
| pass@1 | 91.2 | 74.0 | 32.0 | 64.7 | 36.9 | 46.2 | 57.1 | 57.4 |
| maj@8 | 93.7 | 80.3 | 37.1 | 69.9 | 45.8 | 48.5 | 61.9 | 62.5 |
| orm@8 | | | | | | | | |
| Qwen2.5-Math-RM-72B | 95.4 | 84.2 | 38.6 | 73.0 | 48.6 | 50.1 | 75.6 | 66.5 |
| Greedy Search@8 | | | | | | | | |
| Skywork-PRM-7B | 95.3 | 83.2 | 33.8 | 70.4 | 44.1 | 48.2 | 60.1 | 62.2 |
| β Qwen2.5-Math-PRM-7B | 95.5 | 82.6 | 32.0 | 71.4 | 44.9 | 48.8 | 69.6 | 63.5 |
| β Qwen2.5-Math-PRM-72B | 95.9 | 84.7 | 37.9 | 73.2 | 48.9 | 50.0 | 75.3 | 66.6 |
Table 8: Performance of PRM-guided greedy search and ORM-based Best-of-8 with the policy model Qwen2.5-7B-Instruct. For greedy search, 8 candidates are proposed at each step.
## B Supplementary BoN Results
## B.1 The BoN Evaluation on Qwen2.5-Math-72B-Instruct
The BoN evaluation on the policy model Qwen2.5-Math-72B-Instruct is shown in Table 9. Qwen2.5-Math-PRM-7B outperforms other PRMs of equivalent model scale. However, its performance is inferior to maj@8, suggesting the challenge of employing a 7B PRM to supervise responses generated by a 72B policy model. Meanwhile, Qwen2.5-Math-PRM-72B surpasses maj@8 under prm@8 and is comparable with Qwen2.5-Math-RM-72B under orm@8.
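For reference, the maj@N and reward-based (prm@N / orm@N) selection baselines can be sketched as below; the helper names are illustrative, not the paper's evaluation code.

```python
from collections import Counter
from typing import Sequence


def maj_at_n(answers: Sequence[str]) -> str:
    """maj@N: pick the most frequent final answer among N sampled responses."""
    return Counter(answers).most_common(1)[0][0]


def rm_at_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """prm@N / orm@N: pick the answer whose response received the highest
    reward-model score."""
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]
```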
## B.2 The BoN Evaluation with Various Scoring Strategies
We report experimental results using the last step score, the minimum step score, or the product of step scores as the solution-level score. The BoN results with the policy models Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct are shown in Table 13 and Table 14, respectively.
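The three aggregation strategies can be written as a small helper; the function below is an illustrative sketch, not the evaluation code used in the paper.

```python
import math
from typing import List


def solution_score(step_scores: List[float], strategy: str = "last") -> float:
    """Aggregate per-step PRM scores into one solution-level score.

    'last'    - score of the final step
    'min'     - lowest step score (one bad step sinks the whole solution)
    'product' - product of all step scores
    """
    if not step_scores:
        raise ValueError("empty step score list")
    if strategy == "last":
        return step_scores[-1]
    if strategy == "min":
        return min(step_scores)
    if strategy == "product":
        return math.prod(step_scores)
    raise ValueError(f"unknown strategy: {strategy}")
```

BoN then selects the response whose aggregated score is highest.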
## B.3 The BoN Evaluation on Chinese Benchmarks
We evaluate on three Chinese benchmarks, CMATH (Wei et al., 2023), GaoKao Math Cloze (Zhong et al., 2024), and GaoKao Math QA (Zhong et al., 2024), following Yang et al. (2024c); results are shown in Table 15 and Table 16.
Table 9: Performance comparison on the Best-of-8 strategy of the policy model Qwen2.5-Math-72B-Instruct. β represents the models we trained.
| Setting | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|-------------------------------|---------|--------|----------------|------------------|------------------|----------------|-------------|--------|
| pass@8 | 97.3 | 93.2 | 56.6 | 83.6 | 62.4 | 54.1 | 95.3 | 77.5 |
| maj@8 | 96.0 | 88.6 | 47.8 | 73.8 | 50.1 | 50.2 | 84.9 | 70.2 |
| 1.5B | | | | | | | | |
| Skywork-PRM-1.5B | 96.5 | 88.1 | 45.2 | 74.3 | 48.4 | 49.7 | 79.7 | 68.8 |
| 7B+ | | | | | | | | |
| Math-Shepherd-PRM-7B | 96.5 | 86.8 | 45.6 | 71.9 | 49.2 | 49.5 | 77.5 | 68.1 |
| RLHFlow-PRM-Mistral-8B | 96.6 | 87.5 | 46.3 | 73.5 | 48.9 | 49.4 | 83.4 | 69.4 |
| RLHFlow-PRM-Deepseek-8B | 96.5 | 87.7 | 44.5 | 73.5 | 48.7 | 49.4 | 84.6 | 69.3 |
| Skywork-PRM-7B | 97.0 | 89.0 | 47.1 | 75.3 | 49.8 | 49.9 | 76.3 | 69.2 |
| EurusPRM-Stage1 | 95.4 | 85.6 | 44.1 | 72.5 | 46.5 | 49.2 | 80.3 | 67.7 |
| EurusPRM-Stage2 | 95.3 | 85.1 | 44.9 | 72.5 | 47.1 | 49.0 | 80.2 | 67.7 |
| Qwen2.5-Math-7B-Math-Shepherd | 96.9 | 88.5 | 46.0 | 75.8 | 49.9 | 49.5 | 79.7 | 69.5 |
| Qwen2.5-Math-7B-PRM800K | 96.5 | 88.9 | 47.4 | 75.3 | 50.7 | 50.1 | 76.6 | 69.4 |
| β Qwen2.5-Math-PRM-7B | 96.8 | 89.6 | 46.7 | 77.7 | 51.4 | 50.4 | 76.4 | 69.9 |
| 72B | | | | | | | | |
| Qwen2.5-Math-RM-72B | 96.4 | 89.8 | 47.4 | 76.9 | 54.5 | 50.6 | 80.1 | 70.8 |
| β Qwen2.5-Math-PRM-72B | 96.4 | 89.9 | 46.0 | 77.4 | 52.9 | 50.1 | 82.3 | 70.7 |
## B.4 BoN with Larger N Values
To validate the effectiveness of our PRMs in BoN with larger N values, we conduct additional Best-of-64 experiments on the policy model Qwen2.5-Math-7B-Instruct across diverse tasks, including MATH500 (Lightman et al., 2023), AIME24 1, AMC23 2, Minerva Math (Lewkowycz et al., 2022), GaoKao 2023 En (Liao et al., 2024), and OlympiadBench (He et al., 2024). The results are presented in Table 10; our PRMs maintain superior performance compared to other PRMs, especially on MATH500.
Table 10: Performance comparison on the Best-of-64 strategy of the policy model Qwen2.5-Math-7B-Instruct. β represents the models we trained.
| Setting | MATH500 | AIME24 | AMC23 | Minerva Math | GaoKao 2023 En | Olympiad Bench | Avg. |
|-------------------------------|-----------|----------|---------|----------------|------------------|------------------|--------|
| pass@64 | 96.0 | 50.0 | 95.0 | 56.6 | 86.8 | 73.5 | 76.3 |
| maj@64 | 84.2 | 16.7 | 77.5 | 34.6 | 73.8 | 51.1 | 56.3 |
| 1.5B | | | | | | | |
| Skywork-PRM-1.5B | 81.2 | 20.0 | 62.5 | 31.6 | 70.9 | 46.5 | 52.1 |
| 7B+ | | | | | | | |
| Math-Shepherd-PRM-7B | 79.6 | 20.0 | 62.5 | 32.4 | 70.1 | 43.9 | 51.4 |
| RLHFlow-PRM-Mistral-8B | 82.4 | 20.0 | 62.5 | 30.9 | 69.1 | 45.9 | 51.8 |
| RLHFlow-PRM-Deepseek-8B | 80.2 | 20.0 | 67.5 | 35.3 | 69.1 | 46.2 | 53.1 |
| Skywork-PRM-7B | 84.6 | 20.0 | 67.5 | 32.0 | 71.2 | 47.1 | 53.7 |
| EurusPRM-Stage1 | 76.0 | 10.0 | 55.0 | 27.6 | 66.5 | 40.0 | 45.9 |
| EurusPRM-Stage2 | 76.2 | 10.0 | 52.5 | 27.9 | 67.0 | 40.3 | 45.7 |
| Qwen2.5-Math-7B-Math-Shepherd | 84.2 | 23.3 | 67.5 | 34.6 | 72.5 | 47.4 | 54.9 |
| Qwen2.5-Math-7B-PRM800K | 83.6 | 23.3 | 67.5 | 33.8 | 74.8 | 48.3 | 55.2 |
| β Qwen2.5-Math-PRM-7B | 87.8 | 20.0 | 67.5 | 33.8 | 75.8 | 51.4 | 56.1 |
| 72B | | | | | | | |
| Qwen2.5-Math-RM-72B | 82.0 | 36.7 | 75.0 | 37.5 | 77.7 | 54.1 | 60.5 |
| β Qwen2.5-Math-PRM-72B | 87.8 | 23.3 | 72.5 | 38.6 | 77.4 | 55.3 | 59.2 |
## B.5 Best-of-8 with LLM-as-a-judge
Regarding BoN evaluation with LLMs, there are two implementation approaches: pairwise and pointwise. For pairwise comparison, we employ a single-elimination tournament method: for N responses, we conduct N-1 comparisons to determine the optimal response. In terms of pointwise comparison, we score each
1 https://huggingface.co/datasets/AI-MO/aimo-validation-aime
2 https://huggingface.co/datasets/AI-MO/aimo-validation-amc
step as 1 if correct or 0 if incorrect. We then calculate the proportion of correct steps across all steps and select the response with the highest proportion as the best response. The experiments are conducted on the policy models Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct, and the results are shown in Table 11 and Table 12, respectively.
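Both selection schemes can be sketched as follows; `judge_pair` (returning the index of the better response in a pair) and `judge_steps` (returning a 0/1 label per step) are hypothetical stand-ins for the LLM judge calls, not the prompts used in the paper.

```python
from typing import Callable, List


def tournament_best(responses: List[str],
                    judge_pair: Callable[[str, str], int]) -> str:
    """Single-elimination tournament: N-1 pairwise comparisons pick a winner.

    judge_pair(a, b) returns 0 if a is better, 1 if b is better.
    """
    winner = responses[0]
    for challenger in responses[1:]:
        if judge_pair(winner, challenger) == 1:
            winner = challenger
    return winner


def pointwise_best(responses: List[List[str]],
                   judge_steps: Callable[[List[str]], List[int]]) -> int:
    """Label every step 0/1 and return the index of the response with the
    highest fraction of correct steps (responses assumed non-empty)."""
    fractions = [sum(labels) / len(labels)
                 for labels in (judge_steps(steps) for steps in responses)]
    return max(range(len(responses)), key=fractions.__getitem__)
```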
Table 11: Performance comparison with LLM-as-a-judge on the Best-of-8 strategy of the policy model Qwen2.5-Math-7B-Instruct.
| Setting | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|
| pass@8 (Upper Bound) | 98.1 | 92 | 49.3 | 80.5 | 59.6 | 52.6 | 90.5 | 74.7 |
| maj@8 | 96.7 | 87.1 | 41.2 | 72.5 | 44.4 | 47.8 | 73.8 | 66.2 |
| LLM-as-a-judge, Open-source language models, POINTWISE | | | | | | | | |
| QwQ-32B-Preview | 97.0 | 86.0 | 39.3 | 70.1 | 46.2 | 47.9 | 70.5 | 65.3 |
| Qwen2.5-72B-Instruct | 97.0 | 85.6 | 40.1 | 70.9 | 43.4 | 47.9 | 73.4 | 65.5 |
| PAIRWISE | | | | | | | | |
| QwQ-32B-Preview | 97.6 | 89.2 | 40.8 | 75.8 | 50.4 | 48.9 | 70.5 | 67.6 |
| Qwen2.5-72B-Instruct | 97.3 | 86.8 | 40.8 | 73.5 | 45.0 | 48.4 | 74.5 | 66.6 |
| PRMs | | | | | | | | |
| Qwen2.5-Math-PRM-7B | 97.1 | 88.0 | 42.6 | 74.5 | 47.6 | 48.7 | 74.5 | 67.6 |
| Qwen2.5-Math-PRM-72B | 97.6 | 88.7 | 46.0 | 74.3 | 48.1 | 49.3 | 81.1 | 69.3 |
Table 12: Performance comparison with LLM-as-a-judge on the Best-of-8 strategy of the policy model Qwen2.5-Math-72B-Instruct.
| Setting | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------------|
| pass@8 (Upper Bound) | 97.3 | 93.2 | 56.6 | 83.6 | 62.4 | 54.1 | 95.3 | 77.5 |
| maj@8 | 96.0 | 88.6 | 47.8 | 73.8 | 50.1 | 50.2 | 84.9 | 70.2 |
| LLM-as-a-judge, Open-source language models, POINTWISE | | | | | | | | |
| QwQ-32B-Preview | 96.2 | 88.3 | 46.3 | 75.3 | 51.0 | 50.0 | 74.9 | 68.9 |
| Qwen2.5-72B-Instruct | 96.5 | 87.8 | 47.4 | 76.4 | 48.9 | 50.0 | 76.0 | 69.0 |
| PAIRWISE | | | | | | | | |
| QwQ-32B-Preview | 96.4 | 90.9 | 46.0 | 79.5 | 55.1 | 50.5 | 73.6 | 70.3 |
| Qwen2.5-72B-Instruct | 96.1 | 88.2 | 43.4 | 75.3 | 50.1 | 49.6 | 71.4 | 67.7 |
| PRMs | | | | | | | | |
| Qwen2.5-Math-PRM-7B | 96.8 | 89.6 | 46.7 | 77.7 | 51.4 | 50.4 | 76.4 | 69.9 |
| Qwen2.5-Math-PRM-72B | 96.4 | 89.9 | 46.0 | 77.4 | 52.9 | 50.1 | 82.3 | 70.7 |
## C Prompt Template for LLM-as-a-judge
To construct PRM training data via LLM-as-a-judge, we use the following prompt.
Prompt for constructing PRM training data via LLM-as-a-judge:

I will provide a math problem along with a solution. They will be formatted as follows:

```
[Math Problem]
<math_problem>
... (math problem)
</math_problem>
[Solution]
[paragraph_1]
...
```
The following is the math problem and the solution for your task:
[Math Problem]
Write a program that takes in two numbers as input from the user and calculates their sum. The program should then display the result to the user.
[Solution]
Here's a simple Python program that solves the problem:
```python
# Get the first number from the user
num1 = float(input("Enter the first number: "))
# Get the second number from the user
num2 = float(input("Enter the second number: "))
# Calculate the sum of the two numbers
sum_of_numbers = num1 + num2
# Display the result to the user
print("The sum of", num1, "and", num2, "is", sum_of_numbers)
```
[tagged_response]
This program first prompts the user to enter two numbers. It then calculates their sum and displays the result to the user. The solution is a simple Python program that solves the problem of finding the sum of two numbers entered by the user.
Table 13: Performance comparison on the Best-of-8 strategy of the policy model Qwen2.5-Math-7B-Instruct with 3 scoring strategies: last, product and minimum. β represents the models we trained.
| Setting | Scoring | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|-------------------------------|-------------|-----------|-----------|----------------|------------------|------------------|----------------|-------------|-----------|
| pass@8 (Upper Bound) | - | 98.1 | 92 | 49.3 | 80.5 | 59.6 | 52.6 | 90.5 | 74.7 |
| maj@8 | - | 96.7 | 87.1 | 41.2 | 72.5 | 44.4 | 47.8 | 73.8 | 66.2 |
| Math-Shepherd-PRM-7B | last | 96.8 | 85.2 | 39.0 | 70.1 | 42.8 | 47.2 | 67.7 | 64.1 |
| | product | 97.3 | 85.4 | 37.9 | 70.6 | 40.4 | 47.2 | 70.5 | 64.2 |
| | min | 96.9 | 85.3 | 39.0 | 69.9 | 42.2 | 47.4 | 70.6 | 64.5 |
| RLHFlow-PRM-Mistral-8B | last | 97.0 | 85.3 | 39.0 | 71.2 | 44.0 | 47.1 | 64.0 | 63.9 |
| RLHFlow-PRM-Mistral-8B | product | 97.0 | 86.1 | 37.1 | 70.6 | 41.2 | 47.6 | 69.5 | 64.2 |
| RLHFlow-PRM-Mistral-8B | min | 97.0 | 84.3 | 37.1 | 69.4 | 40.4 | 46.9 | 68.7 | 63.4 |
| RLHFlow-PRM-Deepseek-8B | last | 97.0 | 84.7 | 35.7 | 70.4 | 43.0 | 46.8 | 63.8 | 63.1 |
| RLHFlow-PRM-Deepseek-8B | product | 97.3 | 86.3 | 40.8 | 70.9 | 42.2 | 47.2 | 69.3 | 64.9 |
| RLHFlow-PRM-Deepseek-8B | min | 97.3 | 84.5 | 38.2 | 69.6 | 40.7 | 46.5 | 67.6 | 63.5 |
| Skywork-PRM-1.5B | last | 96.8 | 86.4 | 39.0 | 71.7 | 45.0 | 47.9 | 68.2 | 65.0 |
| Skywork-PRM-1.5B | product | 96.9 | 86.7 | 37.9 | 70.1 | 42.1 | 47.9 | 67.9 | 64.2 |
| Skywork-PRM-1.5B | min | 96.6 | 86.6 | 37.9 | 71.9 | 43.1 | 48.2 | 66.9 | 64.5 |
| Skywork-PRM-7B | last | 97.2 | 87.3 | 41.2 | 73.8 | 45.8 | 48.3 | 65.3 | 65.6 |
| Skywork-PRM-7B | product | 97.3 | 87.3 | 38.2 | 71.9 | 43.7 | 47.8 | 67.7 | 64.8 |
| Skywork-PRM-7B | min | 96.7 | 87.0 | 39.7 | 71.2 | 42.5 | 48.2 | 66.6 | 64.6 |
| EurusPRM-Stage1 | last | 94.7 | 79.7 | 32.7 | 61.6 | 33.8 | 45.7 | 63.4 | 58.8 |
| EurusPRM-Stage1 | product | 95.6 | 83.0 | 35.7 | 66.2 | 38.2 | 46.2 | 66.6 | 61.6 |
| EurusPRM-Stage1 | min | 95.8 | 83.3 | 39.0 | 67.8 | 37.9 | 46.6 | 67.4 | 62.5 |
| EurusPRM-Stage2 | last | 94.7 | 79.7 | 33.1 | 61.3 | 34.2 | 45.7 | 63.5 | 58.9 |
| EurusPRM-Stage2 | product | 95.4 | 83.4 | 34.9 | 67.3 | 39.1 | 46.3 | 67.3 | 62.0 |
| EurusPRM-Stage2 | min | 96.1 | 83.6 | 39.3 | 68.8 | 38.8 | 46.7 | 67.5 | 63.0 |
| Qwen2.5-Math-7B-Math-Shepherd | last | 97.1 | 87.7 | 38.6 | 73.8 | 44.6 | 48.1 | 68.0 | 65.4 |
| Qwen2.5-Math-7B-Math-Shepherd | product | 96.9 | 86.5 | 36.8 | 71.4 | 41.6 | 47.7 | 69.3 | 64.3 |
| Qwen2.5-Math-7B-Math-Shepherd | min | 97.0 | 86.7 | 36.8 | 72.5 | 43.1 | 47.6 | 70.7 | 64.9 |
| Qwen2.5-Math-7B-PRM800K | last | 96.7 | 86.3 | 37.9 | 71.9 | 44.3 | 47.6 | 68.1 | 64.7 |
| Qwen2.5-Math-7B-PRM800K | product | 96.9 | 86.9 | 37.1 | 71.2 | 44.0 | 47.6 | 70.9 | 64.9 |
| Qwen2.5-Math-7B-PRM800K | min | 96.9 | 86.6 | 39.7 | 71.7 | 45.6 | 47.8 | 71.1 | 65.6 |
| β Qwen2.5-Math-PRM-7B | last | 96.9 | 87.2 | 39.0 | 73.5 | 45.5 | 48.5 | 72.0 | 66.1 |
| β Qwen2.5-Math-PRM-7B | product | 97.1 | 88.0 | 42.6 | 74.5 | 47.6 | 48.7 | 74.5 | 67.6 |
| β Qwen2.5-Math-PRM-7B | min | 97.0 | 87.8 | 42.3 | 74.3 | 46.2 | 48.3 | 74.1 | 67.1 |
| β Qwen2.5-Math-PRM-72B | last | 97.6 | 88.9 | 43.4 | 73.8 | 49.2 | 49.6 | 76.8 | 68.5 |
| β Qwen2.5-Math-PRM-72B | product | 97.6 | 88.7 | 46.0 | 74.3 | 48.1 | 49.3 | 81.1 | 69.3 |
| β Qwen2.5-Math-PRM-72B | min | 97.6 | 88.8 | 45.2 | 74.5 | 48.1 | 49.2 | 80.9 | 69.2 |
Table 14: Performance comparison on the Best-of-8 strategy of the policy model Qwen2.5-Math-72B-Instruct with 3 scoring strategies: last, product and minimum. β represents the models we trained.
| Setting | Scoring | GSM8K | MATH | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|-------------------------------|-----------|---------|--------|----------------|------------------|------------------|----------------|-------------|-----------|
| pass@8 (Upper Bound) | - | 97.3 | 93.2 | 56.6 | 83.6 | 62.4 | 54.1 | 95.3 | 77.5 |
| maj@8 | - | 96.0 | 88.6 | 47.8 | 73.8 | 50.1 | 50.2 | 84.9 | 70.2 |
| Math-Shepherd-PRM-7B | last | 96.2 | 87.0 | 46.7 | 73.0 | 47.3 | 49.8 | 76.3 | 68.0 |
| | product | 96.5 | 86.8 | 45.6 | 71.9 | 49.2 | 49.5 | 77.5 | 68.1 |
| | min | 96.1 | 86.8 | 45.6 | 73.2 | 48.6 | 49.9 | 76.0 | 68.0 |
| RLHFlow-PRM-Mistral-8B | last | 96.3 | 86.6 | 44.9 | 74.3 | 47.6 | 49.3 | 67.1 | 66.6 |
| RLHFlow-PRM-Mistral-8B | product | 96.6 | 87.5 | 46.3 | 73.5 | 48.9 | 49.4 | 83.4 | 69.4 |
| RLHFlow-PRM-Mistral-8B | min | 96.4 | 86.3 | 44.5 | 71.9 | 47.9 | 49.3 | 76.0 | 67.5 |
| RLHFlow-PRM-Deepseek-8B | last | 96.1 | 86.6 | 46.3 | 73.2 | 49.2 | 49.2 | 71.7 | 67.5 |
| RLHFlow-PRM-Deepseek-8B | product | 96.5 | 87.7 | 44.5 | 73.5 | 48.7 | 49.4 | 84.6 | 69.3 |
| RLHFlow-PRM-Deepseek-8B | min | 96.6 | 87.4 | 44.1 | 74.0 | 48.6 | 49.3 | 74.8 | 67.8 |
| Skywork-PRM-1.5B | last | 96.1 | 88.6 | 44.9 | 72.2 | 47.9 | 50.1 | 74.2 | 67.7 |
| Skywork-PRM-1.5B | product | 96.5 | 88.1 | 45.2 | 74.3 | 48.4 | 49.7 | 79.7 | 68.8 |
| Skywork-PRM-1.5B | min | 96.0 | 88.3 | 45.6 | 73.8 | 48.6 | 50.1 | 75.9 | 68.3 |
| Skywork-PRM-7B | last | 97.0 | 89.0 | 46.0 | 74.8 | 51.0 | 49.7 | 66.7 | 67.7 |
| Skywork-PRM-7B | product | 97.0 | 89.0 | 47.1 | 75.3 | 49.8 | 49.9 | 76.3 | 69.2 |
| Skywork-PRM-7B | min | 96.9 | 89.2 | 46.7 | 73.5 | 49.8 | 49.8 | 73.2 | 68.4 |
| EurusPRM-Stage1 | last | 95.9 | 87.3 | 44.9 | 72.7 | 47.0 | 49.4 | 78.4 | 67.9 |
| EurusPRM-Stage1 | product | 95.4 | 85.6 | 44.1 | 72.5 | 46.5 | 49.2 | 80.3 | 67.7 |
| EurusPRM-Stage1 | min | 96.4 | 88.2 | 44.9 | 75.1 | 49.0 | 49.5 | 83.7 | 69.5 |
| EurusPRM-Stage2 | last | 96.0 | 87.7 | 44.5 | 73.5 | 47.0 | 49.4 | 78.1 | 68.0 |
| EurusPRM-Stage2 | product | 95.3 | 85.1 | 44.9 | 72.5 | 47.1 | 49.0 | 80.2 | 67.7 |
| EurusPRM-Stage2 | min | 96.5 | 88.6 | 45.2 | 75.3 | 48.9 | 49.6 | 83.3 | 69.6 |
| Qwen2.5-Math-7B-Math-Shepherd | last | 97.0 | 89.6 | 44.9 | 77.4 | 50.8 | 50.5 | 74.9 | 69.3 |
| Qwen2.5-Math-7B-Math-Shepherd | product | 96.9 | 88.5 | 46.0 | 75.8 | 49.9 | 49.5 | 79.7 | 69.5 |
| Qwen2.5-Math-7B-Math-Shepherd | min | 97.0 | 88.6 | 46.0 | 74.8 | 50.2 | 49.6 | 79.6 | 69.4 |
| Qwen2.5-Math-7B-PRM800K | last | 96.7 | 88.8 | 47.1 | 76.1 | 50.1 | 49.5 | 71.8 | 68.6 |
| Qwen2.5-Math-7B-PRM800K | product | 96.5 | 88.9 | 47.4 | 75.3 | 50.7 | 50.1 | 76.6 | 69.4 |
| Qwen2.5-Math-7B-PRM800K | min | 96.5 | 89.1 | 47.1 | 76.1 | 50.7 | 49.9 | 75.3 | 69.2 |
| β Qwen2.5-Math-PRM-7B | last | 96.8 | 89.0 | 46.7 | 75.3 | 49.8 | 50.3 | 78.4 | 69.5 |
| β Qwen2.5-Math-PRM-7B | product | 96.8 | 89.6 | 46.7 | 77.7 | 51.4 | 50.4 | 76.4 | 69.9 |
| β Qwen2.5-Math-PRM-7B | min | 96.7 | 89.6 | 46.3 | 77.9 | 50.8 | 50.3 | 76.0 | 69.7 |
| β Qwen2.5-Math-PRM-72B | last | 96.3 | 89.8 | 47.8 | 76.6 | 53.3 | 50.9 | 80.5 | 70.7 |
| β Qwen2.5-Math-PRM-72B | product | 96.4 | 89.9 | 46.0 | 77.4 | 52.9 | 50.1 | 82.3 | 70.7 |
| β Qwen2.5-Math-PRM-72B | min | 96.4 | 89.7 | 46.3 | 77.7 | 52.4 | 50.4 | 81.2 | 70.6 |
Table 15: Best-of-8 performance comparison on the Chinese benchmarks with the policy model Qwen2.5-Math-7B-Instruct with 3 scoring strategies: last, product and minimum. β represents the PRMs we trained.
| Setting | Scoring | CMATH | CNMiddle School 24 | GaoKao | Avg. |
|-------------------------------|-----------|-----------|----------------------|----------|--------|
| pass@8 (Upper Bound) | - | 95.3 | 82.2 | 84.3 | 87.3 |
| maj@8 | - | 92.7 | 78.2 | 68.1 | 79.7 |
| Math-Shepherd-PRM-7B | last | 91.8 | 80.2 | 63 | 78.3 |
| | product | 92.0 | 80.2 | 69.1 | 80.4 |
| | min | 91.5 | 80.2 | 69.8 | 80.5 |
| RLHFlow-PRM-Mistral-8B | last | 92.8 | 79.2 | 57.2 | 76.4 |
| | product | 92.7 | 77.2 | 65.8 | 78.6 |
| | min | 92.8 | 76.2 | 62.1 | 77 |
| | last | 93.2 | 75.2 | 56.9 | 75.1 |
| RLHFlow-PRM-Deepseek-8B | product | 92.7 | 76.2 | 63.6 | 77.5 |
| | min | 93.0 | 74.3 | 67.3 | 78.2 |
| | last | 93.8 | 80.2 | 66.6 | 80.2 |
| Skywork-PRM-1.5B | product | 92.8 | 79.2 | 66.3 | 79.4 |
| | min | 93.3 | 80.2 | 66.6 | 80 |
| | last | 94.0 | 81.2 | 66.7 | 80.6 |
| Skywork-PRM-7B | product | 93.3 | 79.2 | 68.1 | 80.2 |
| | min | 93.8 | 80.2 | 66.3 | 80.1 |
| | last | 91.8 | 77.2 | 55.4 | 74.8 |
| EurusPRM-Stage1 | product | 91.7 | 77.2 | 52.6 | 73.8 |
| | min | 91.7 | 78.2 | 64.4 | 78.1 |
| | last | 91.8 | 77.2 | 55.7 | 74.9 |
| EurusPRM-Stage2 | product | 92.0 | 77.2 | 52.4 | 73.9 |
| | min | 92.0 | 78.2 | 64.7 | 78.3 |
| | last | 93.0 | 81.2 | 65.4 | 79.9 |
| Qwen2.5-Math-7B-Math-Shepherd | product | 93.0 | 79.2 | 67.7 | 80 |
| | min | 92.5 | 80.2 | 69.8 | 80.8 |
| | last | 92.8 | 78.2 | 67.1 | 79.4 |
| Qwen2.5-Math-7B-PRM800K | product | 92.7 | 77.2 | 68.9 | 79.6 |
| | min | 93.0 | 77.2 | 69.4 | 79.9 |
| | last | 93.7 | 80.2 | 68.2 | 80.6 |
| β Qwen2.5-Math-PRM-7B | product | 93.3 | 80.2 | 70.1 | 81.3 |
| | min | 93.5 | 80.2 | 71.7 | 81.8 |
| | last | 94.3 | 80.2 | 72.1 | 82.2 |
| β Qwen2.5-Math-PRM-72B | product | 94.2 | 80.2 | 73.5 | 82.6 |
| | min | 94.2 | 80.2 | 73.1 | 82.5 |
Table 16: Best-of-8 performance comparison on the Chinese benchmarks with the policy model Qwen2.5-Math-72B-Instruct with 3 scoring strategies: last, product and minimum. β represents the PRMs we trained.
| Setting | Scoring | CMATH | CNMiddle School 24 | GaoKao | Avg. |
|-------------------------------|-----------|---------|----------------------|----------|--------|
| pass@8 (Upper Bound) | - | 96.8 | 83.2 | 86.2 | 88.7 |
| maj@8 | - | 95.3 | 79.2 | 75 | 83.2 |
| Math-Shepherd-PRM-7B | last | 93.7 | 78.2 | 73.2 | 81.7 |
| | product | 94 | 80.2 | 72.1 | 82.1 |
| | min | 93.5 | 80.2 | 73.9 | 82.5 |
| RLHFlow-PRM-Mistral-8B | last | 94.3 | 79.2 | 65.5 | 79.7 |
| | product | 93.8 | 79.2 | 72 | 81.7 |
| | min | 93.3 | 79.2 | 71.2 | 81.2 |
| | last | 94.3 | 79.2 | 63 | 78.8 |
| RLHFlow-PRM-Deepseek-8B | product | 94.3 | 79.2 | 72.5 | 82 |
| | min | 94.5 | 79.2 | 73.5 | 82.4 |
| | last | 94.8 | 80.2 | 74.3 | 83.1 |
| Skywork-PRM-1.5B | product | 93.8 | 79.2 | 69.7 | 80.9 |
| | min | 94.5 | 80.2 | 74.6 | 83.1 |
| | last | 95.3 | 80.2 | 72.6 | 82.7 |
| Skywork-PRM-7B | product | 94.7 | 80.2 | 71.5 | 82.1 |
| Skywork-PRM-7B | min | 94.8 | 80.2 | 76 | 83.7 |
| EurusPRM-Stage1 | last | 94 | 79.2 | 64.5 | 79.2 |
| | product | 93.8 | 80.2 | 64.5 | 79.5 |
| | min | 94.7 | 79.2 | 70.8 | 81.6 |
| | last | 94.2 | 79.2 | 63.4 | 78.9 |
| EurusPRM-Stage2 | product | 93.7 | 80.2 | 65.4 | 79.8 |
| | min | 94.3 | 79.2 | 69.7 | 81.1 |
| | last | 95 | 81.2 | 74.6 | 83.6 |
| Qwen2.5-Math-7B-Math-Shepherd | product | 94.5 | 80.2 | 73 | 82.6 |
| | min | 94.3 | 80.2 | 71.5 | 82 |
| | last | 94.2 | 79.2 | 76.5 | 83.3 |
| Qwen2.5-Math-7B-PRM800K | product | 94.2 | 82.2 | 70.8 | 82.4 |
| | min | 93.8 | 80.2 | 72.9 | 82.3 |
| | last | 94.7 | 79.2 | 74.5 | 82.8 |
| β Qwen2.5-Math-PRM-7B | product | 94.3 | 81.2 | 77.6 | 84.4 |
| | min | 94.5 | 81.2 | 77.6 | 84.4 |
| | last | 96 | 79.2 | 76.1 | 83.8 |
| β Qwen2.5-Math-PRM-72B | product | 96 | 80.2 | 77.2 | 84.5 |