# Efficient Process Reward Model Training via Active Learning
> Work done during an internship at Sea AI Lab. Email: k.duan@u.nus.edu. Corresponding authors.
Abstract
Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data. We then compute the loss w.r.t. the labels and update the PRM’s weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting, demonstrating that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M+ math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with models of the same size. The code is available at https://github.com/sail-sg/ActivePRM.
1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Line Chart: Average F1 Score vs. Estimated Annotation Cost
### Overview
The chart compares the performance (Average F1 Score) of three language models (ActPRM, UniversalPRM, Qwen2.5-Math-PRM-7B) against their estimated annotation costs (in generated tokens). Performance improves with higher annotation costs, but trade-offs exist between cost and efficiency.
### Components/Axes
- **X-axis**: "Est. Annotation Cost (Gen. Tokens)" (logarithmic scale: 2²⁸ to 2³⁴)
- **Y-axis**: "Average F1 Score" (linear scale: 0.68 to 0.76)
- **Legend**: Located in the bottom-right corner, mapping colors to models:
- Red star: ActPRM
- Teal star: UniversalPRM
- Blue star: Qwen2.5-Math-PRM-7B
- **Annotations**:
- Red line (ActPRM) peaks at 0.750 (2³⁰ tokens)
- Teal line (UniversalPRM) at 0.743 (2³² tokens)
- Blue line (Qwen2.5) at 0.735 (2³⁴ tokens)
- Dashed lines indicate cost multipliers: "4.8× Cost" (2³²) and "17.3× Cost" (2³⁴)
### Detailed Analysis
1. **ActPRM (Red Line)**:
- Starts at 0.68 (2²⁸ tokens) and rises sharply to 0.750 (2³⁰ tokens).
- Shows the steepest improvement in F1 score per token.
- Annotated with a star at its peak (0.750).
2. **UniversalPRM (Teal Line)**:
- Flat line at 0.743, positioned at 2³² tokens.
- Cost multiplier: 4.8× higher than ActPRM’s baseline (2³² vs. 2³⁰).
3. **Qwen2.5-Math-PRM-7B (Blue Line)**:
- Flat line at 0.735, positioned at 2³⁴ tokens.
- Cost multiplier: 17.3× higher than ActPRM’s baseline (2³⁴ vs. 2³⁰).
### Key Observations
- **ActPRM** achieves the highest F1 score (0.750) at the lowest cost (2³⁰ tokens), outperforming others in efficiency.
- **UniversalPRM** and **Qwen2.5** have lower F1 scores (0.743 and 0.735, respectively) but require significantly higher annotation costs (4.8× and 17.3× more tokens than ActPRM).
- All models plateau after their respective cost thresholds, suggesting diminishing returns beyond certain annotation budgets.
### Interpretation
- **Cost-Effectiveness**: ActPRM offers the best balance of performance and cost, making it the most efficient model for annotation tasks.
- **Trade-offs**: UniversalPRM and Qwen2.5 sacrifice cost efficiency for marginally lower performance gains, which may not justify their higher resource demands.
- **Diminishing Returns**: The flat lines for UniversalPRM and Qwen2.5 imply that increasing annotation costs beyond 2³² and 2³⁴ tokens yields no further F1 score improvements, highlighting potential inefficiencies in scaling.
- **Strategic Implications**: Organizations prioritizing cost savings should favor ActPRM, while those requiring absolute maximum performance might consider the trade-offs of UniversalPRM or Qwen2.5 despite their higher costs.
</details>
Figure 1: Average F1 score on ProcessBench (Zheng et al., 2024) versus estimated annotation cost. ActPRM outperforms prior SOTA models while requiring merely 20% of the annotation costs.
Recently, Large Language Models (LLMs) (DeepSeek-AI et al., 2025; Yang et al., 2024; OpenAI et al., 2024b) have shown remarkable advances in mathematical reasoning, yet they can make mistakes during chain-of-thought (CoT) reasoning despite correct final answers (Zheng et al., 2024). To address this challenge, process reward models (Lightman et al., 2023; Wang et al., 2024; Zhang et al., 2025) were proposed, aiming to identify process errors and provide finer-grained supervision of the training process.
The main challenge in training Process Reward Models (PRMs) lies in obtaining fine-grained step-level annotations, which remain prohibitively expensive. Lightman et al. (2023) pioneered PRM training by employing human experts to label 75K questions at the step level. While their approach achieved high-quality results (reaching 57.5% on ProcessBench (Zheng et al., 2024)), it does not scale due to its heavy reliance on manual annotation. To reduce human effort, Monte Carlo (M.C.) estimation methods (Wang et al., 2024; Wei et al., 2024; Luo et al., 2024) were proposed. However, these approaches come with high computational costs (massive rollouts are required for accurate estimation) and struggle to accurately identify the first error step (Zheng et al., 2024). To address this challenge, Qwen2.5-Math-PRM (Zhang et al., 2025) proposed using LLM-as-Judge — leveraging LLMs to detect the first error step — to filter out unreliable M.C. labels, significantly boosting PRM performance on both ProcessBench (Zheng et al., 2024) and PRMBench (Song et al., 2025). More recently, UniversalPRM (Tan et al., 2025) relies solely on LLM-as-Judge with ensemble prompting (via majority voting), achieving new SOTA performance on ProcessBench within the same model size. However, the annotation costs are still considerable. We estimate the labeling costs of Qwen2.5-Math-PRM (Zhang et al., 2025) and UniversalPRM (Tan et al., 2025) and illustrate them in Figure 1: Qwen2.5-Math-PRM-7B and UniversalPRM consume over $2^{34}$ and $2^{32}$ generated tokens, respectively. Refer to Appendix C for the estimation strategy.
To reduce annotation costs, we propose ActPRM, which uses a trained ensemble PRM to identify and select uncertain data for annotation by a high-capability reasoning model. Our approach trains a PRM with ensemble heads for uncertainty estimation. For each reasoning step, we compute the mean $\mu$ and standard deviation $\sigma$ of the ensemble predictions, flagging a step as uncertain when the mean prediction falls within $1-\delta_{pred}<\mu<\delta_{pred}$ or the variation exceeds $\delta_{std}$. We consider a CoT trajectory uncertain if any step up to and including the first predicted error meets these criteria. By annotating only uncertain data and training exclusively on this subset, we significantly reduce labeling costs while maintaining PRM performance.
To validate the effectiveness and efficiency of ActPRM, we conducted comprehensive experiments in multiple settings:
- Pool-based Evaluation (Section 5.1): Using 100K labeled samples, ActPRM achieved performance comparable to full-data tuning while reducing annotation costs by 50%. It consistently outperformed random selection under identical budget constraints.
- One-shot Active Learning (Section 5.2): Starting with our pool-based model, we applied ActPRM to select uncertain samples from 1M+ unlabeled CoT trajectories from NuminaMath (Li et al., 2024). After annotation and fine-tuning, we achieved new SOTA performance of 75.0% on ProcessBench. As shown in Figure 1, ActPRM surpasses prior SOTA models with significantly lower costs—outperforming UniversalPRM (Tan et al., 2025) by 0.7% using only 20% of its annotation budget and exceeding Qwen2.5-Math-PRM-7B by 1.5% with just 6% of its annotation budget.
Our contributions are summarized as follows: ❶ We propose an uncertainty-aware active learning approach ActPRM for PRM training that selectively annotates informative reasoning steps using ensemble-based uncertainty estimation, significantly reducing labeling costs while maintaining performance. ❷ ActPRM achieves state-of-the-art results (75.0% on ProcessBench, 65.5% on PRMBench) while requiring only 20% of the annotation budget compared to prior SOTA method UniversalPRM. ❸ We release all trained models, datasets, and code to ensure reproducibility and facilitate community adoption.
2 Preliminaries
2.1 Process Reward Models
Problem Formulation. Given a math problem $q$ and a corresponding solution trajectory $s=[s_{1},s_{2},...,s_{n}]$, where $s_{i}$ denotes the $i$-th step, we require a PRM to identify the correctness of each step until a wrong step is identified. Following prior works (Lightman et al., 2023; Zheng et al., 2024), we only label the steps up to and including the first error step, since the correctness of subsequent steps is genuinely difficult to define. As a result, in practice the labels for a solution trajectory are either $[1,1,...,1]$ or $[1,1,...,0]$. A PRM can then be trained using the standard BCE loss:
$$
\mathcal{L}_{BCE}(s,y|\theta)=-\frac{1}{|s|}\sum_{i=1}^{|s|}y_{i}\log(P_{\theta}(s_{i}|s_{[:i]},q))+(1-y_{i})\log(1-P_{\theta}(s_{i}|s_{[:i]},q)), \tag{1}
$$
where $P_{\theta}$ is the PRM parameterized by $\theta$ and $s_{[:i]}$ denotes the steps before $s_{i}$. When using the PRM for inference, we set a threshold $\delta$ (usually $0.5$) and identify the first step whose correctness probability $P_{\theta}(s_{i}|s_{[:i]},q)$ falls below $\delta$.
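As a minimal sketch in plain Python (function names are illustrative, not from the released code), the per-trajectory loss of Equation 1 and the thresholded inference rule can be written as:

```python
import math

def prm_bce_loss(probs, labels):
    """Step-level BCE loss of Equation 1 for one trajectory.

    probs:  PRM correctness probabilities, one per step.
    labels: 1 for each correct step, 0 for the first error step.
    """
    assert len(probs) == len(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(probs)

def first_error_step(probs, delta=0.5):
    """Index of the first step whose correctness probability falls below delta.

    Returns None when every step is judged correct.
    """
    for i, p in enumerate(probs):
        if p < delta:
            return i
    return None
```

For example, `first_error_step([0.9, 0.8, 0.3])` flags the third step (index 2) as the first predicted error.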
PRM Implementation Details. A typical PRM is built upon a pretrained generative LLM by replacing the causal language model head with a binary classification head that outputs the probability of correctness at the corresponding token position. In practice, we only need the prediction at the end of each reasoning step, so a prediction mask is used to mask out the predictions at all other positions.
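The masking step can be sketched as follows; `gather_step_predictions` is a hypothetical helper operating on plain lists rather than the token tensors used in practice:

```python
def gather_step_predictions(token_probs, step_end_mask):
    """Keep only the head outputs at step-final token positions.

    token_probs:   per-token correctness probabilities from the classification head.
    step_end_mask: True at the last token of each reasoning step, False elsewhere.
    """
    return [p for p, keep in zip(token_probs, step_end_mask) if keep]
```

With six tokens forming two steps that end at positions 2 and 5, only those two probabilities survive the mask.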
2.2 Uncertainty Estimation for Classification
Aleatoric Uncertainty. As mentioned above, a typical PRM $P_{\theta}$ is trained on a binary classification task. The simplest way to measure its uncertainty is aleatoric uncertainty (Valdenegro-Toro & Saromo, 2022). For simplicity, we write $P_{\theta}(s_{i})$ for the predicted correctness probability of the $i$-th step in the solution trajectory, whose full form is $P_{\theta}(s_{i}|s_{[:i]},q)$ as in Equation 1:
$$
\text{Aleatoric Uncertainty}\propto-P_{\theta}(s_{i})\log P_{\theta}(s_{i})-(1-P_{\theta}(s_{i}))\log(1-P_{\theta}(s_{i})). \tag{2}
$$
Epistemic Uncertainty. In addition, an ensemble of models is a common way to estimate epistemic uncertainty (Valdenegro-Toro & Saromo, 2022) by quantifying the disagreement among ensemble members. For example, Liang et al. (2022) and Gleave & Irving (2022) use an ensemble of reward models to estimate uncertainty in preference learning. For process reward modeling, we can leverage the standard deviation of the ensemble predictions as the uncertainty estimate:
$$
\text{Epistemic Uncertainty}\propto\text{Var}(\{P_{\theta}(s_{i})\}), \tag{3}
$$
where $\{P_{\theta}\}$ is a set of models. It is worth noting that employing an ensemble of heads built upon a shared backbone is a common strategy to mitigate computational costs. We empirically study the combination of aleatoric and epistemic uncertainty and find that they are complementary to each other. Experimental results are shown in Section 5.1.
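A toy sketch of the two quantities for a single step, given the predictions of the ensemble heads (`binary_entropy` stands in for the entropy-style aleatoric measure; names are illustrative):

```python
import math

def ensemble_stats(head_probs):
    """Mean and population standard deviation across ensemble heads."""
    n = len(head_probs)
    mu = sum(head_probs) / n
    sigma = math.sqrt(sum((p - mu) ** 2 for p in head_probs) / n)
    return mu, sigma

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli prediction: the aleatoric uncertainty proxy."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
```

Aleatoric uncertainty peaks when the mean prediction sits near 0.5, while epistemic uncertainty grows with disagreement among the heads, which is why the two are complementary.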
3 Related Work
Active Learning and Uncertainty Estimation. Active learning has been widely explored in the alignment of LLMs. Several studies adopt an online bandit formulation, leveraging uncertainty-aware reward models (RMs) for active exploration in response selection (Mehta et al., 2023; Dwaracherla et al., 2024; Liu et al., 2024; Melo et al., 2024; Gleave & Irving, 2022). For instance, Mehta et al. (2023) and Dwaracherla et al. (2024) use ensemble-based LLM heads to estimate epistemic uncertainty, prioritizing data most informative for preference learning. Similarly, Melo et al. (2024) propose an acquisition function combining both entropy (aleatoric uncertainty) and epistemic uncertainty. Our work builds on these approaches, empirically evaluating the role of both uncertainty types in the context of process reward modeling. Beyond active learning, ensemble methods—such as those in Coste et al. (2024) —have also proven effective in mitigating reward hacking (Amodei et al., 2016).
Process Reward Models. Different from outcome rewards (e.g., verifiable rewards (DeepSeek-AI et al., 2025) for mathematical reasoning problems), which are assigned based on the final outcome, process rewards are assigned based on the intermediate steps of the problem-solving process. For a question and a corresponding multi-step solution, a PRM provides finer-grained rewards for each step. To date, process reward modeling falls into two categories: $(i)$ Process Reward as Q-values and $(ii)$ Process Reward as Judge. The former (Wang et al., 2024; Luo et al., 2024; Wei et al., 2024; Li & Li, 2024) regards the process reward as the Q-value of a step, estimating the probability that the policy model reaches the final correct answer. Specifically, these methods use the policy model that generates the solution steps to perform Monte Carlo estimation for each step, and the estimated Q-values serve as the process rewards. However, recent works (Zhang et al., 2025; Zheng et al., 2024) show that this kind of process reward modeling struggles to identify process errors because it depends heavily on the policy model and is strongly biased relative to the ground-truth distribution. In contrast, the latter (Lightman et al., 2023; Zhang et al., 2025) regards the process reward model as a proxy for identifying intermediate process errors, and the corresponding trained models achieve better performance on several benchmarks (Zheng et al., 2024; Song et al., 2025). In this work, we follow the latter and regard the PRM as a judge that tries to identify the first error step in a solution, if any. In addition, other works are related to PRMs. For example, Yuan et al. (2024) trains a PRM in the fashion of outcome reward modeling (ORM). Cheng et al. (2025) and Cui et al. (2025) propose RL training frameworks that integrate PRMs as finer-grained supervision.
4 Efficient Process Reward Labeling via Active Learning
Labeling process rewards for a large-scale dataset is very expensive: it either requires human experts to annotate the correctness of each step of each solution, as in previous work (Lightman et al., 2023), or leverages highly capable generative models to imitate human experts (Zhang et al., 2025). Even though the latter is automated, it is still resource-consuming, since test-time compute scales with the difficulty of the math problems.
Algorithm 1 PRM Active Learning with Cold Start.
1: // The difference with full data tuning is colored.
2: Require: Ensemble PRM $P_{\theta}$, dataset $\mathcal{D}=\{(q,s)\}$, uncertainty thresholds $\delta_{pred}$ and $\delta_{std}$, generative LLM $M$, batch size $B$, learning rate $\eta$
3: for $\mathcal{B}\subset\mathcal{D}$ do
4: $P_{\theta}(\mathcal{B})\leftarrow\text{Forward}(\mathcal{B})$
5: $\widetilde{\mathcal{B}}=\{\}$
6: for $(q,s)\in\mathcal{B}$ do
7: if $\mathcal{U}^{\text{alea}}_{\theta}(s)\lor\mathcal{U}^{\text{epis}}_{\theta}(s)$ then $\triangleright$ Equation 5
8: $\widetilde{\mathcal{B}}\leftarrow\widetilde{\mathcal{B}}\cup\{(q,s)\}$
9: end if
10: end for
11: $Y_{\widetilde{\mathcal{B}}}\leftarrow\text{Labeling}(\widetilde{\mathcal{B}})$ $\triangleright$ Labeling using generative LLM $M$
12: $\mathcal{L}\leftarrow\frac{1}{|\widetilde{\mathcal{B}}|}\sum_{(s,y)\in(\widetilde{\mathcal{B}},Y_{\widetilde{\mathcal{B}}})}\mathcal{L}(s,y)$ $\triangleright$ Equation 4
13: $\nabla_{\theta}\mathcal{L}\leftarrow\text{Backward}(\mathcal{L})$
14: $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}$
15: end for
16: return $P_{\theta}$
To mitigate this issue, we propose to leverage active learning so that the PRM proactively selects the data that is most informative to train on. Specifically, we train a PRM with ensemble heads to enable uncertainty estimation, following Liang et al. (2022) and Gleave & Irving (2022). As shown in Algorithm 1, we forward the candidate data through the ensemble PRM (line 4) and estimate the prediction uncertainty of each data point (lines 6-7). We then omit the data that the ensemble PRM is confident about, retaining the rest (lines 7-8), and label only the retained data with a generative reasoning LLM (line 11). Finally, we backpropagate only the loss of the labeled data (lines 12-14). By doing so, we can considerably reduce the labeling cost while maintaining PRM performance. We now introduce the two key differences from standard fine-tuning: ensemble PRM training and uncertain data selection.
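The select-label-update loop of Algorithm 1 can be sketched as below; `is_uncertain`, `label_fn`, and `train_step` are hypothetical callbacks standing in for Equation 5, the generative LLM judge, and the gradient update, respectively:

```python
def active_learning_epoch(batches, is_uncertain, label_fn, train_step):
    """One pass of Algorithm 1: keep only uncertain samples, label them, then train.

    is_uncertain(sample) -> bool     : the uncertainty test of Equation 5.
    label_fn(samples)    -> labels   : the expensive generative LLM judge.
    train_step(samples, labels)      : one gradient update on Equation 4.
    Returns the number of samples actually sent for labeling.
    """
    labeled = 0
    for batch in batches:
        retained = [x for x in batch if is_uncertain(x)]
        if not retained:
            continue  # the PRM is confident about the whole batch: skip labeling
        labels = label_fn(retained)
        train_step(retained, labels)
        labeled += len(retained)
    return labeled
```

The returned count corresponds to the annotation budget actually consumed, which is the quantity the paper compares against full data tuning.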
Ensemble PRM Training. In this work, we use an ensemble of PRMs to estimate the epistemic uncertainty, following Gleave & Irving (2022) and Liang et al. (2022). Specifically, we use a shared LLM backbone and build multiple binary classification heads on top of it. In our training, the diversity of the ensemble is ensured in two ways: $(i)$ the random initialization of the head layers and $(ii)$ a diversity regularization term (Dwaracherla et al., 2024): $\mathcal{L}_{\text{div}}=\lambda\cdot\frac{1}{n}\sum_{i=1}^{n}||\phi^{i}-\phi_{\text{init}}^{i}||_{2},$ where $\{\phi^{i}\}$ are the parameters of the ensemble heads and $n$ is the number of heads. It is an $L_2$ term penalizing the distance between the ensemble head parameters and their initial values. Our training objective for the ensemble PRM is therefore formulated as follows:
$$
\mathcal{L}(y,s)=\frac{1}{n}\sum_{i=1}^{n}\left(\mathcal{L}_{BCE}(y,s|\theta,\phi^{i})+\lambda||\phi^{i}-\phi_{\text{init}}^{i}||_{2}\right), \tag{4}
$$
where $\theta$ denotes the backbone parameters and $\mathcal{L}_{BCE}$, from Equation 1, computes the loss for a given head.
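A minimal sketch of Equation 4, assuming the per-head BCE losses are already computed and representing head parameters as flat lists (a simplification of the real head weights):

```python
import math

def l2_distance(phi, phi_init):
    """Euclidean distance between current and initial head parameters."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(phi, phi_init)))

def ensemble_loss(head_bce, head_params, head_params_init, lam=0.01):
    """Average over heads of per-head BCE plus the diversity L2 penalty (Eq. 4)."""
    n = len(head_bce)
    return sum(
        bce + lam * l2_distance(phi, phi0)
        for bce, phi, phi0 in zip(head_bce, head_params, head_params_init)
    ) / n
```

The penalty discourages all heads from collapsing onto the same solution, which would make the ensemble's disagreement signal vanish.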
Uncertain Data Selection. Consider a batch of data candidates $\mathcal{D}=\{(q,s)\}$. We first forward the data through the ensemble PRM $P_{\theta}$ to obtain the ensemble predictions $P_{\theta}(\mathcal{D})\in\mathbb{R}^{n\times|\mathcal{D}|\times|s|}$. For each data point $(q,s)\in\mathcal{D}$, we can determine hard-valued aleatoric and epistemic uncertainty with preset thresholds. Briefly, the aleatoric (or epistemic) uncertainty is 1 if uncertainty occurs at any step up to the first predicted error step, and 0 otherwise. A formal definition is as follows:
$$
\mathcal{U}^{\text{alea}}_{\theta}(s)=\bigvee_{i=0}^{\mathcal{E}(s)}\left(\max\left(\mu(P_{\theta}(s_{i})),1-\mu(P_{\theta}(s_{i}))\right)<\delta_{pred}\right)\,;\quad\mathcal{U}^{\text{epis}}_{\theta}(s)=\bigvee_{i=0}^{\mathcal{E}(s)}\left(\sigma(P_{\theta}(s_{i}))>\delta_{std}\right), \tag{5}
$$
where $\mu(·)$ and $\sigma(·)$ are the mean and standard deviation of the predictions across ensemble heads and $\lor$ denotes the logical ‘OR’ operation. Here, $\mathcal{E}(s)$ denotes the first error step in the solution trajectory $s$, defined as $\mathcal{E}(s)=\min\{j\mid\mu(P_{\theta}(s_{j}))<\delta\}$, where $\delta$ is the correctness threshold, typically set to $0.5$. We restrict attention to the steps up to the first error step because the correctness of the steps afterwards is genuinely difficult to define. Following this uncertainty estimation strategy, we retain the data in $\mathcal{D}$ satisfying either $\mathcal{U}^{\text{alea}}_{\theta}$ or $\mathcal{U}^{\text{epis}}_{\theta}$ as $\widetilde{\mathcal{D}}$. We can then leverage expensive generative LLMs as judges (Zheng et al., 2024) to label the retained data in $\widetilde{\mathcal{D}}$.
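A sketch of this selection rule (illustrative names; `ensemble_probs` holds, per step, the list of all head predictions):

```python
import math

def is_uncertain(ensemble_probs, delta_pred=0.95, delta_std=0.005, delta=0.5):
    """True if any step up to and including the first predicted error is uncertain.

    ensemble_probs: list over steps; each entry is the list of head predictions.
    Implements the OR over Equation 5's aleatoric and epistemic tests.
    """
    for head_probs in ensemble_probs:
        n = len(head_probs)
        mu = sum(head_probs) / n
        sigma = math.sqrt(sum((p - mu) ** 2 for p in head_probs) / n)
        # aleatoric: the mean prediction is not confidently near 0 or 1
        if max(mu, 1.0 - mu) < delta_pred:
            return True
        # epistemic: the heads disagree too much
        if sigma > delta_std:
            return True
        if mu < delta:  # first predicted error step E(s): stop scanning
            break
    return False
```

Note that steps after the first predicted error are never inspected, matching the upper limit $\mathcal{E}(s)$ of the disjunctions in Equation 5.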
5 Experiments
In Section 5.1, we first validate ActPRM in a pool-based active learning setting using 100K labeled samples, including ablation studies on our uncertainty estimation strategy. Based on the optimal configuration, we then scale up to 1M unlabeled samples in Section 5.2, further proving our pipeline’s efficiency and effectiveness.
5.1 Pool-Based Active Learning
5.1.1 Experimental Settings
To evaluate our active learning strategy’s effectiveness, we first conduct experiments in an offline setting where ActPRM iteratively selects the most informative examples from a large unlabeled pool as detailed in Algorithm 1. We establish a strong baseline by comparing against full data tuning, where a model is trained on the complete dataset labeled by a single annotator. It is worth noting that as our data is randomly shuffled, the performance of full data tuning at intermediate training steps is essentially equivalent to the performance of random selection with the corresponding budget.
Evaluation Benchmark. We utilize ProcessBench (Zheng et al., 2024) to evaluate the effectiveness of PRMs. The test data in ProcessBench contains intermediate step errors and requires the PRM to identify the first error step. ProcessBench contains four subsets, and we report the average F1 score following the original work.
Models. We train ActPRM based on Qwen2.5-Math-7B-Instruct.
Training Dataset. For dataset construction, we randomly select 100K samples from the NuminaMath (Li et al., 2024) dataset after decontamination against ProcessBench (Zheng et al., 2024) and PRMBench (Song et al., 2025). We leverage Qwen2.5-Math-7B-Instruct to generate CoT reasoning trajectories for the selected data and further use QwQ-32B as a judge to annotate the process correctness of all trajectories, following Zhang et al. (2025). For completeness, we provide the prompt template in Appendix A.
5.1.2 Experimental Results
(a)
<details>
<summary>x2.png Details</summary>

### Visual Description
## Line Graph: Average F1 Score vs. Budget for Two Models
### Overview
The image is a line graph comparing the performance of two models (ActPRM and Full Data Tuning) across varying budget levels. The x-axis represents "Budget" (0.0 to 1.0), and the y-axis represents "Average F1 Score" (0.45 to 0.70). Two dashed reference lines are present: a vertical line at Budget = 0.5 and a horizontal line at F1 = 0.673. The graph includes data points with uncertainty (scatter) and trend lines for both models.
---
### Components/Axes
- **X-axis (Budget)**: Ranges from 0.0 to 1.0 in increments of 0.1. Labeled "Budget."
- **Y-axis (Average F1 Score)**: Ranges from 0.45 to 0.70 in increments of 0.05. Labeled "Average F1 Score."
- **Legend**: Located in the bottom-right corner.
- Red line with circular markers: "ActPRM"
- Blue line with circular markers: "Full Data Tuning"
- **Dashed Lines**:
- Vertical line at Budget = 0.5 (labeled "Budget = 0.5").
- Horizontal line at F1 = 0.673 (labeled "F1 = 0.673").
---
### Detailed Analysis
#### ActPRM (Red Line)
- **Trend**: Starts at ~0.51 F1 at Budget = 0.0, rises sharply to ~0.673 at Budget = 0.5, then fluctuates slightly downward to ~0.66 at Budget = 1.0.
- **Key Data Points**:
- Budget = 0.0: ~0.51 F1
- Budget = 0.2: ~0.59 F1
- Budget = 0.4: ~0.64 F1
- Budget = 0.5: ~0.673 F1 (peak)
- Budget = 0.8: ~0.66 F1
- Budget = 1.0: ~0.66 F1
#### Full Data Tuning (Blue Line)
- **Trend**: Starts at ~0.48 F1 at Budget = 0.0, rises steadily, surpassing ActPRM after Budget = 0.5, and plateaus near ~0.68 F1 at Budget = 1.0.
- **Key Data Points**:
- Budget = 0.0: ~0.48 F1
- Budget = 0.2: ~0.56 F1
- Budget = 0.4: ~0.63 F1
- Budget = 0.5: ~0.64 F1
- Budget = 0.8: ~0.67 F1
- Budget = 1.0: ~0.68 F1
#### Scatter Points
- Red and blue scatter points (with uncertainty) are distributed around the trend lines, showing variability in performance across trials.
---
### Key Observations
1. **Crossover Point**: Full Data Tuning catches up with ActPRM only around Budget = 0.8; the vertical dashed line marks Budget = 0.5, where ActPRM peaks.
2. **F1 Score Threshold**: The horizontal line at F1 = 0.673 indicates a performance benchmark. ActPRM reaches this threshold at Budget = 0.5, while Full Data Tuning exceeds it by Budget = 0.8.
3. **Performance Trends**:
- ActPRM shows diminishing returns after Budget = 0.5.
- Full Data Tuning demonstrates consistent improvement up to Budget = 1.0.
---
### Interpretation
- **Annotation Efficiency**: ActPRM reaches the F1 = 0.673 level at roughly half the budget that Full Data Tuning requires, indicating more effective use of the labeling budget.
- **Diminishing Returns**: ActPRM’s performance plateaus after Budget = 0.5, indicating limited gains from additional budget allocation.
- **Benchmark Significance**: The F1 = 0.673 threshold may represent a target for acceptable performance, with Full Data Tuning being more reliable for exceeding this benchmark.
- **Uncertainty**: Scatter points suggest variability in model performance, with ActPRM showing higher variance (wider spread of red points) compared to Full Data Tuning.
This analysis highlights the trade-offs between model complexity (ActPRM) and data-driven tuning (Full Data Tuning) in achieving optimal performance under budget constraints.
</details>
(b)
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Model Performance vs. Budget
### Overview
The chart illustrates the relationship between budget allocation and average F1 score for three distinct models. The x-axis represents budget (0.0 to 0.6), while the y-axis shows average F1 score (0.45 to 0.70). Three data series are plotted with distinct markers and colors, each corresponding to specific model configurations.
### Components/Axes
- **X-axis (Budget)**: Continuous scale from 0.0 to 0.6, labeled "Budget"
- **Y-axis (Average F1 Score)**: Continuous scale from 0.45 to 0.70, labeled "Average F1 Score"
- **Legend**: Located in bottom-right corner, with three entries:
- Red circles: (std: 0.005, p: 0.95)
- Blue triangles: (p: 0.95)
- Teal squares: (std: 0.005)
### Detailed Analysis
1. **Red Line (std: 0.005, p: 0.95)**:
- Starts at 0.50 when budget = 0.0
- Gradually increases to 0.68 at budget = 0.6
- Maintains consistent upward trajectory with minimal fluctuations
- Highest performance across all budget levels
2. **Blue Line (p: 0.95)**:
- Begins at 0.47 when budget = 0.0
- Sharp initial increase to 0.58 at budget = 0.1
- Plateaus between 0.60-0.62 from budget = 0.3 to 0.6
- Shows diminishing returns after initial budget increase
3. **Teal Line (std: 0.005)**:
- Starts at 0.48 when budget = 0.0
- Reaches 0.58 at budget = 0.1
- Exhibits moderate fluctuations (e.g., 0.60 → 0.59 → 0.62)
- Ends at 0.65 when budget = 0.6
- Maintains second-highest performance overall
### Key Observations
- Red line demonstrates optimal performance, maintaining a 0.08-0.10 F1 score advantage over other models at higher budgets
- Blue line shows strongest initial response to budget increases but fails to sustain momentum
- Teal line exhibits moderate performance with slight volatility but remains stable
- All models show diminishing returns beyond budget = 0.4
- Red line's performance suggests optimal configuration for std=0.005 and p=0.95
### Interpretation
The data indicates that model performance improves with increased budget allocation, with the red configuration (std=0.005, p: 0.95) achieving superior results across all budget levels. The blue line's initial performance spike suggests potential overfitting or configuration-specific advantages that don't translate to sustained gains. The teal line's moderate performance with std=0.005 implies that standard deviation alone doesn't guarantee optimal results without complementary parameter tuning. The consistent performance of the red line across budgets suggests that this configuration effectively balances model complexity and generalization capability. The plateau observed in the blue line after budget=0.3 may indicate resource saturation or diminishing returns in the model's optimization landscape.
</details>
(c)
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: Average F1 Score vs. Number of Heads
### Overview
The image displays a bar chart comparing the **Average F1 Score** across different numbers of **Heads** (2, 4, 8, 16, 32, 64). The chart uses vertical red bars with error bars to represent uncertainty. The y-axis ranges from 0.600 to 0.650, while the x-axis lists the number of heads in ascending order.
---
### Components/Axes
- **X-Axis**: Labeled "#Heads" with categories: 2, 4, 8, 16, 32, 64.
- **Y-Axis**: Labeled "Average F1 Score" with a scale from 0.600 to 0.650.
- **Bars**: Six vertical red bars, each annotated with an error bar (horizontal line with caps).
- **Error Bars**: Represent uncertainty in F1 scores (e.g., ±0.010, ±0.005).
- **No legend** is present, but all bars share the same red color.
---
### Detailed Analysis
1. **2 Heads**:
- Average F1 Score: **0.600**
- Error: ±0.010
- Position: Bottom-left bar.
2. **4 Heads**:
- Average F1 Score: **0.625**
- Error: ±0.005
- Position: Second bar from the left.
3. **8 Heads**:
- Average F1 Score: **0.635**
- Error: ±0.015
- Position: Third bar from the left.
4. **16 Heads**:
- Average F1 Score: **0.640**
- Error: ±0.010
- Position: Fourth bar from the left.
5. **32 Heads**:
- Average F1 Score: **0.655**
- Error: ±0.005
- Position: Fifth bar from the left (tallest bar).
6. **64 Heads**:
- Average F1 Score: **0.650**
- Error: ±0.010
- Position: Rightmost bar.
---
### Key Observations
- **Trend**: The Average F1 Score generally increases with the number of heads, peaking at **32 heads (0.655)**.
- **Anomaly**: A slight decline is observed at **64 heads (0.650)**, despite the highest head count.
- **Uncertainty**: Error bars are smallest for **4 heads (±0.005)** and **32 heads (±0.005)**, suggesting higher confidence in these measurements.
- **Consistency**: Error bars for **2, 8, 16, and 64 heads** are ±0.010, indicating moderate uncertainty.
---
### Interpretation
The data suggests a **positive correlation** between the number of heads and the Average F1 Score up to **32 heads**, after which performance plateaus or slightly declines. This could imply:
1. **Diminishing Returns**: Adding more heads beyond 32 may not improve performance proportionally.
2. **Overfitting Risk**: At 64 heads, the model might overfit, reducing generalization (as seen in the slight drop).
3. **Optimal Configuration**: **32 heads** achieves the highest F1 score with minimal uncertainty, making it the most reliable configuration.
The error bars highlight that measurements for **4 and 32 heads** are more precise, possibly due to larger sample sizes or better model stability at these configurations. The chart underscores the importance of balancing model complexity (number of heads) with performance metrics like F1 score.
</details>
Figure 2: (a) Comparison of the average F1 score on ProcessBench between ActPRM and random selection, plotted against the normalized budget positively correlated the number of labeled data instances consumed. The semi-transparent points represent all results in grid searching w.r.t. $\delta_{pred}$ and $\delta_{std}$ . For the highlighted ActPRM curve in the figure, $\delta_{pred}=0.95$ and $\delta_{std}=0.005$ . (b) Ablation: uncertainty estimation strategies. (c) Ablation: number of ensemble PRM heads.
ActPRM achieves comparable performance while reducing annotation costs by 50%. We compare ActPRM with full data tuning across a normalized budget, as illustrated in Figure 2 (a). The results demonstrate that ActPRM achieves an average F1 score of $0.673$ on ProcessBench, matching baseline performance while using only half the annotation budget. Furthermore, ActPRM consistently outperforms random selection under the same budget constraints; notably, at 50% budget, ActPRM surpasses random selection by a significant margin of 3.3%. At the end of pool-based active training, ActPRM achieves a better performance of 0.680 on ProcessBench while consuming only 62.5% of the budget.
ActPRM Consistently Outperforms Random Selection Under Diverse $\delta_{pred}$ and $\delta_{std}$ . As shown in Figure 2 (a), the semi-transparent blue points represent all results of a grid search over $\delta_{pred}\in\{0.9,0.95,0.97\}$ and $\delta_{std}\in\{0.01,0.005,0.002,0.001\}$ . Most blue points lie above the baseline (gray line) at the same budget, further demonstrating the effectiveness and robustness of ActPRM.
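The retention rule behind these thresholds can be sketched as follows. This is an illustrative reconstruction, not the released code; `head_probs` is an assumed name for the (num_heads, num_steps) array of per-step correctness probabilities from the ensemble PRM heads.

```python
import numpy as np

# Illustrative sketch of the uncertainty-based retention rule.
def retain_for_labeling(head_probs, delta_pred=0.95, delta_std=0.005):
    """Keep a trajectory for annotation if the PRM is uncertain about it."""
    mean_p = head_probs.mean(axis=0)   # ensemble-mean step probabilities
    std_p = head_probs.std(axis=0)     # head disagreement (epistemic signal)
    # Aleatoric: some step's confidence in its predicted label is <= delta_pred.
    aleatoric = np.maximum(mean_p, 1 - mean_p).min() <= delta_pred
    # Epistemic: heads disagree noticeably on at least one step.
    epistemic = std_p.max() >= delta_std
    return bool(aleatoric or epistemic)

confident = np.full((8, 4), 0.99)            # all heads agree, high confidence
uncertain = np.tile([[0.2], [0.8]], (4, 3))  # heads split 0.2 / 0.8 per step
print(retain_for_labeling(confident), retain_for_labeling(uncertain))
```

A sample is filtered out only when it is simultaneously confident (above $\delta_{pred}$) and low-disagreement (below $\delta_{std}$), matching the grid-searched thresholds above.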
Ablation Study on Uncertainty Estimation Strategies. We conduct an ablation study on uncertainty estimation strategies, i.e., using epistemic and aleatoric uncertainty. We select the best setting ( $\delta_{std}=0.005,\delta_{pred}=0.95$ ) found by the grid search in Figure 2 and ablate epistemic and aleatoric uncertainty by setting $\delta_{std}=\infty$ and $\delta_{pred}=0.5$ , respectively. As illustrated in Figure 2 (b), using either epistemic or aleatoric uncertainty alone under-performs using both, indicating that epistemic and aleatoric uncertainty are complementary to each other.
Ablation Study on the Number of Heads for the Ensemble PRM. The number of heads in the ensemble PRM controls how accurately we estimate epistemic uncertainty. To find the trade-off between estimation quality and computational overhead, we conduct an ablation study and show the results in Figure 2 (c), where we consider only epistemic uncertainty by setting $\delta_{std}=0.005,\delta_{pred}=0.5$ and report results averaged over 3 runs. We empirically find that performance grows with the number of heads and converges at about 32.
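A toy simulation gives intuition for this convergence; it models head outputs as Gaussian draws (an assumption for illustration only). The sample standard deviation across K heads is itself a noisy estimator, and its noise shrinks as K grows, consistent with diminishing returns beyond ~32 heads.

```python
import numpy as np

# Toy model: head predictions drawn around a shared mean with fixed spread.
rng = np.random.default_rng(0)
TRUE_STD = 0.05  # assumed true spread of head predictions (illustrative)

def std_estimator_noise(num_heads, trials=5000):
    # Each trial: draw `num_heads` head outputs and take their sample std;
    # return how much that std estimate itself varies across trials.
    draws = rng.normal(0.5, TRUE_STD, size=(trials, num_heads))
    return draws.std(axis=1).std()

noise_4, noise_32 = std_estimator_noise(4), std_estimator_noise(32)
print(noise_4 > noise_32)  # more heads -> steadier epistemic estimate
```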
5.2 Achieving New SOTA Performance on ProcessBench (75.0%) with Only 6% Annotation Cost
<details>
<summary>x5.png Details</summary>

### Visual Description
## Bar Chart: Estimated Annotation Cost Comparison
### Overview
The image is a bar chart comparing the estimated annotation costs (in "Gen. Token") for four different methods: "Ours," "Ensemble Prompting," "Math Shepherd," and "Consensus Filtering." The y-axis represents the cost in scientific notation (e.g., 1.0e+10), while the x-axis lists the methods. Each bar is labeled with a multiplier (e.g., "1x," "4.8x") indicating the relative cost compared to a baseline.
### Components/Axes
- **X-axis**: Labeled "Methods" with categories: "Ours," "Ensemble Prompting," "Math Shepherd," "Consensus Filtering."
- **Y-axis**: Labeled "Est. Annotation Cost (Gen. Token)" with values: 0.0e+00, 5.0e+09, 1.0e+10, 1.5e+10, 2.0e+10.
- **Bars**: Red-colored bars with multipliers (e.g., "1x," "4.8x") displayed above each bar.
- **No legend** is present in the image.
### Detailed Analysis
- **"Ours"**: The shortest bar, with a multiplier of **1x** (the baseline).
- **"Ensemble Prompting"**: A bar at **4.8x** the baseline cost.
- **"Math Shepherd"**: A bar at **15.9x** the baseline cost.
- **"Consensus Filtering"**: The tallest bar, at **17.3x** the baseline, nearly reaching the top of the y-axis (2.0e+10).
### Key Observations
1. **"Ours"** is the most cost-effective method, with a multiplier of **1x** (baseline).
2. **"Ensemble Prompting"** is **4.8x** more expensive than "Ours."
3. **"Math Shepherd"** and **"Consensus Filtering"** are significantly more expensive, with multipliers of **15.9x** and **17.3x**, respectively.
4. The tallest bar ("Consensus Filtering") nearly reaches the top of the y-axis (2.0e+10), which makes the baseline bar ("Ours") appear very short at this scale.
### Interpretation
The data suggests a clear hierarchy in annotation costs:
- **"Ours"** is the most efficient, requiring the least resources.
- **"Ensemble Prompting"** introduces a moderate cost increase.
- **"Math Shepherd"** and **"Consensus Filtering"** are the most resource-intensive, at more than an order of magnitude above the baseline.
- The multipliers imply that the methods vary widely in efficiency, with "Consensus Filtering" being the least efficient.
This analysis highlights the trade-offs between different annotation strategies, emphasizing the need for cost-aware method selection in resource-constrained environments.
</details>
Figure 3: Estimated annotation costs (generated tokens) comparison between ActPRM and popular methods, including Ensemble Prompting (Tan et al., 2025), MathShepherd (Wang et al., 2024) and Consensus Filtering (Zhang et al., 2025).
Obtaining high-quality process supervision labels is costly. To demonstrate the efficiency of ActPRM, we evaluate it in a one-shot active learning setting. Starting with the model trained in Section 5.1, we select the most uncertain samples from over 1M unlabeled examples and annotate them using a powerful reasoning model.
Figure 3 compares our estimated labeling costs with those of other real-world datasets for training PRMs, including MathShepherd (Wang et al., 2024), Consensus Filtering (Zhang et al., 2025), and Ensemble Prompting (Tan et al., 2025). Since the training data for Consensus Filtering is not publicly available, we estimate costs based on our data statistics. We introduce our estimation strategy in Appendix C.
Training Qwen2.5-Math-7B-Instruct on our data, the Ensemble Prompting data, the MathShepherd data, and the Consensus Filtering data yields ActPRM, UniversalPRM, Qwen2.5-Math-7B-Math-Shepherd, and Qwen2.5-Math-PRM-7B in Table 1, respectively. We evaluate the models trained on this labeled data on both the ProcessBench and PRMBench benchmarks.
5.2.1 Experimental Settings
Data Filtering with ActPRM. We used Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct to collect over 1 million (1,061,763) Chain-of-Thought (CoT) trajectories from the NuminaMath problem set (Li et al., 2024), after decontamination against the test benchmarks. ActPRM was then applied to filter out high-confidence ( $\delta_{pred}>0.95$ and $\delta_{std}<0.005$ , following Section 5.1) data instances that were unnecessary for training, retaining the remaining data for labeling and training. This process resulted in a final dataset of 563,030 PRM data points labeled by QwQ-32B, reducing annotation costs by 47.0%.
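The filtering pass above can be sketched as a simple pool scan. This is a hedged sketch, not the paper's pipeline: the tuple layout of `pool` (trajectory id, minimum per-step confidence, maximum per-step std) is hypothetical, while the thresholds follow Section 5.1.

```python
# Sketch: drop trajectories the PRM is already confident about; keep the
# rest for the expensive annotator (QwQ-32B in the paper's setting).
def filter_pool(pool, delta_pred=0.95, delta_std=0.005):
    keep = []
    for traj_id, min_step_conf, max_step_std in pool:
        confident = min_step_conf > delta_pred and max_step_std < delta_std
        if not confident:              # uncertain -> worth paying to label
            keep.append(traj_id)
    return keep

pool = [
    ("t1", 0.99, 0.001),  # confident on every step -> filtered out
    ("t2", 0.80, 0.001),  # low confidence on some step -> kept
    ("t3", 0.99, 0.020),  # heads disagree on some step -> kept
]
print(filter_pool(pool))  # ['t2', 't3']
```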
Models. Having obtained the dataset, we continue training our ActPRM from Section 5.1 on the filtered data. In addition, we empirically find that our retained data is generally useful to other PRMs. Specifically, we also continue training Qwen2.5-Math-PRM-7B (the previous SOTA model on ProcessBench) on our constructed data. The resulting model is named ActPRM-X (X stands for extended version).
Benchmarks. We use ProcessBench (Zheng et al., 2024) and PRMBench (Song et al., 2025) to evaluate the effectiveness of our trained models. Unlike ProcessBench, which collects intermediate errors from real-world generative models, PRMBench heuristically builds intermediate errors by manipulating correct steps.
Baselines. We compare with the following PRMs: ❶ Qwen2.5-Math-PRM-7B (Zhang et al., 2025): This model uses consensus filtering for labeling. It labels 860K data twice using two methods (LLM-as-judge [Zheng et al., 2024] and MathShepherd [Wang et al., 2024]) and filters out 40% of the data where the labels disagree. ❷ Pure-PRM-7B (Cheng et al., 2025): A Qwen2.5-Math-based PRM trained on PRM800K using a two-stage strategy: warming up the PRM head and then fine-tuning the entire model. ❸ EurusPRM-Stage2 (Cui et al., 2025): A PRM resulting from the Implicit PRM approach (Yuan et al., 2024), which derives process rewards from an ORM. ❹ Universal-PRM (Tan et al., 2025): A Qwen2.5-Math-based model trained with data augmentation techniques like ensemble prompting and reverse verification. ❺ Math-Shepherd-PRM-7B (Wang et al., 2024): a PRM trained on process labels that estimates hard Q-values for the policy model. ❻ Qwen2.5-Math-7B-Math-Shepherd (Zhang et al., 2025): a PRM trained on 860K data labeled using MathShepherd. ❼ Ensemble-PRM-PRM800K (ours): a model with ensemble heads trained by ourselves on PRM800K without active learning.
5.2.2 Experimental Results
| Models | GSM8K | MATH | Olympiad Bench | OmniMath | Average F1 |
| --- | --- | --- | --- | --- | --- |
| LLM-as-judge | | | | | |
| o1-Mini ⋄ | 0.932 | 0.889 | 0.872 | 0.824 | 0.879 |
| Deepseek-R1-Distill-32B | 0.817 | 0.739 | 0.659 | 0.585 | 0.700 |
| QwQ-32B | 0.871 | 0.834 | 0.787 | 0.771 | 0.816 |
| Process Reward Models (72B) | | | | | |
| Qwen2.5-Math-PRM-72B ⋄ | 0.873 | 0.806 | 0.743 | 0.711 | 0.783 |
| Process Reward Models (7B+) | | | | | |
| Math-Shepherd-PRM-7B ⋄ | 0.479 | 0.295 | 0.248 | 0.238 | 0.315 |
| Qwen2.5-Math-7B-Math-Shepherd ⋄ | 0.625 | 0.316 | 0.137 | 0.077 | 0.289 |
| EurusPRM-Stage2 ⋄ | 0.473 | 0.357 | 0.212 | 0.209 | 0.313 |
| Qwen2.5-Math-7B-PRM800K ⋄ | 0.683 | 0.626 | 0.507 | 0.443 | 0.565 |
| Ensemble-PRM-PRM800K (ours) | 0.705 | 0.630 | 0.472 | 0.433 | 0.560 |
| PURE-PRM-7B | 0.690 | 0.665 | 0.484 | 0.459 | 0.575 |
| Qwen2.5-Math-PRM-7B ⋄ | 0.824 | 0.776 | 0.675 | 0.663 | 0.735 |
| Universal-PRM | 0.858 | 0.777 | 0.676 | 0.664 | 0.743 |
| ActPRM (ours) | 0.816 | 0.798 | 0.714 | 0.670 | 0.750 |
| ActPRM-X (ours) | 0.827 | 0.820 | 0.720 | 0.673 | 0.760 |
Table 1: Performance comparison on ProcessBench. We report results computed with the same method as ProcessBench. ⋄ denotes results taken from the Qwen PRM report (Zhang et al., 2025).
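For reference, the "Average F1" values above follow ProcessBench-style scoring: per subset, the harmonic mean of the accuracy on erroneous samples (locating the first incorrect step) and the accuracy on fully correct samples, then averaged across the four subsets. A minimal sketch with toy accuracies (not numbers from the paper):

```python
# ProcessBench-style scoring sketch: per-subset harmonic mean of accuracies
# on erroneous and fully-correct samples, averaged across subsets.
# The accuracy values below are toy inputs, not reported results.
def subset_f1(acc_error, acc_correct):
    if acc_error + acc_correct == 0:
        return 0.0
    return 2 * acc_error * acc_correct / (acc_error + acc_correct)

toy = [(0.82, 0.80), (0.78, 0.81), (0.70, 0.72), (0.66, 0.68)]  # 4 subsets
average_f1 = sum(subset_f1(e, c) for e, c in toy) / len(toy)
print(round(average_f1, 3))
```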
| # | Models | Simplicity | Soundness | Sensitivity | Average |
| --- | --- | --- | --- | --- | --- |
| LLM-as-judge | | | | | |
| 1 | Gemini-2.0-thinking-exp-1219 | 0.662 | 0.718 | 0.753 | 0.688 |
| 1 | o1-mini | 0.646 | 0.721 | 0.755 | 0.688 |
| 4 | GPT-4o | 0.597 | 0.709 | 0.758 | 0.668 |
| 6 | Gemini-2.0-flash-exp | 0.627 | 0.673 | 0.754 | 0.660 |
| Process Reward Models (72B) | | | | | |
| 3 | Qwen-2.5-Math-PRM-72B | 0.546 | 0.739 | 0.770 | 0.682 |
| Process Reward Models (7B+) | | | | | |
| 7 | Qwen2.5-Math-PRM-7B | 0.521 | 0.710 | 0.755 | 0.655 |
| 9 | Pure-PRM-7B | 0.522 | 0.702 | 0.758 | 0.653 |
| 7 | ActPRM (ours) | 0.536 | 0.713 | 0.752 | 0.655 |
| 5 | ActPRM-X (ours) | 0.545 | 0.727 | 0.756 | 0.667 |
Table 2: Performance comparison on PRMBench. All results of the other models are from the official leaderboard. # denotes the ranking.
ActPRM and ActPRM-X achieve new SOTA performance on ProcessBench compared with same-size models. The evaluation results on ProcessBench are shown in Table 1. ActPRM achieves an average F1 score of $0.750$ , outperforming Qwen2.5-Math-PRM-7B by a margin of 1.5%. Furthermore, ActPRM-X, trained from Qwen2.5-Math-PRM-7B, achieves a new SOTA performance on ProcessBench with an average F1 of 0.760, outperforming the second-place model (Universal-PRM) by a margin of 1.7% and improving on Qwen2.5-Math-PRM-7B by a significant margin of 2.5%.
QwQ-32B (our PRM label annotator) outperforms all PRMs on ProcessBench. As shown in Table 1, QwQ-32B outperforms all PRMs, including 72B models. This indicates the reliability of using QwQ-32B as a PRM label annotator, as it provides a high empirical upper bound for the trained PRMs.
ActPRM-X achieves new SOTA performance on PRMBench, on par with GPT-4o. We further test our models on PRMBench and show the results in Table 2. On the leaderboard, ActPRM achieves the best performance among 7B PRMs, and ActPRM-X achieves new SOTA performance among same-size models (0.667), outperforming the other 7B PRMs by a margin of at least $1.2\%$ and performing on par with GPT-4o (OpenAI et al., 2024a).
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Graphs: Average F1 Score and Training Loss Comparison
### Overview
The image contains two line graphs comparing performance metrics for two data selection methods: "ActPRM Selected" (red) and "Randomly Selected" (blue). The left graph tracks **Average F1 Score** over training steps, while the right graph tracks **Training Loss** over the same steps.
---
### Components/Axes
#### Left Graph (Average F1 Score)
- **X-axis**: "Step" (discrete markers at 500, 1000, 1500, 2000)
- **Y-axis**: "Average F1 Score" (range: 0.67–0.73)
- **Legend**: Located in the bottom-right corner, with red = "ActPRM Selected" and blue = "Randomly Selected"
#### Right Graph (Training Loss)
- **X-axis**: "Step" (continuous from 0 to 2000)
- **Y-axis**: "Training Loss" (range: 0.09–0.15)
- **Legend**: Located in the top-right corner, matching colors to labels
---
### Detailed Analysis
#### Left Graph (Average F1 Score)
- **ActPRM Selected (Red Line)**:
- Starts at **0.68** at step 500.
- Increases steadily to **0.73** at step 2000.
- Slope: Approximately +0.005 per 500 steps.
- **Randomly Selected (Blue Line)**:
- Starts at **0.67** at step 500.
- Rises to **0.70** at step 1500, then plateaus at **0.70** at step 2000.
- Slope: +0.003 per 500 steps initially, then flat.
#### Right Graph (Training Loss)
- **ActPRM Selected (Red Line)**:
- Starts at **0.15** at step 0.
- Decreases steadily to **0.10** by step 2000.
- Slope: Approximately -0.00025 per step.
- **Randomly Selected (Blue Line)**:
- Starts at **0.10** at step 0.
- Fluctuates between **0.09–0.10** throughout, with no clear trend.
- Higher volatility compared to ActPRM.
---
### Key Observations
1. **ActPRM Selected** achieves a higher F1 Score than "Randomly Selected" (0.73 vs. 0.70).
2. **Training Loss** for ActPRM starts higher (by roughly 0.05) and declines smoothly, while Randomly Selected stays around 0.09–0.10 with erratic fluctuations.
3. **F1 Score** for ActPRM improves linearly, whereas Randomly Selected plateaus after step 1500.
---
### Interpretation
The data suggests that the **ActPRM Selected** method is more effective than random selection for the task analyzed. The ActPRM method demonstrates:
- **Improved generalization**: The higher F1 Score indicates a better balance between precision and recall.
- **More informative data**: The consistently higher but steadily declining training loss suggests ActPRM selects harder, more informative examples rather than easy, redundant ones.
- **Scalability**: The continued improvement in F1 Score implies that ActPRM benefits from additional training steps.
The Randomly Selected method fits its data easily (low, flat loss) yet plateaus in F1 after step 1500, suggesting that much of the randomly drawn data is redundant. The contrast in trends highlights the value of structured data selection (ActPRM) over random sampling in machine learning workflows.
</details>
Figure 4: ProcessBench performance (left) and training loss (right): ActPRM vs. random data selection on 1M NuminaMath rollouts.
5.2.3 Comparative Experiment with Random Selection
A potential concern is that while ActPRM achieves state-of-the-art (SOTA) performance on several benchmarks, this success might be attributed solely to the high quality of our collected data pool, rather than the method itself. To address this, we conducted a comparative study with random selection. Specifically, we randomly selected 256K data points from our retained dataset as the experimental group. For the control group, we randomly selected the same number of data points from the entire data pool (over 1M) and used the same annotator to label any unlabeled data (i.e., data not in the retained set). We then continue training the ActPRM checkpoint from Section 5.1 on both datasets. The results, including performance on ProcessBench and training loss, are shown in Figure 4.
ActPRM outperforms random selection of the same amount of data. As illustrated in Figure 4 (left), the model trained on data selected by ActPRM consistently achieves significantly better results than the model trained on randomly selected data. To further validate this, we compare their training losses in Figure 4 (right). The model trained on ActPRM-selected data exhibits a consistently higher training loss, with a margin of 0.05, suggesting that the data selected by ActPRM is more challenging and informative, thereby enhancing the learning process.
6 Conclusion and Future Work
In this work, we address the high annotation costs associated with training Process Reward Models (PRMs) by proposing ActPRM, an uncertainty-aware active learning framework that selectively annotates the most informative reasoning steps. By leveraging an ensemble PRM to estimate uncertainty and strategically labeling only uncertain data, ActPRM significantly reduces annotation costs while maintaining competitive performance. Extensive experiments demonstrate that ActPRM achieves a new state-of-the-art (75.0% on ProcessBench) with at most 20% of the labeling budget required by prior methods. Our results highlight the potential of efficient data selection for scalable PRM training, and we commit to releasing all models, datasets, and code to foster further research in this direction.
To further enhance PRM performance, several promising directions can be explored. First, leveraging larger base models and more advanced LLM judges (e.g., o1-mini) could yield significant improvements. Second, implementing the framework in an online setting would enable the PRM to iteratively refine its performance through active learning. Additionally, integrating online PRM training with reinforcement learning frameworks, such as actor-critic methods, presents an exciting avenue for research.
References
- Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URL https://arxiv.org/abs/1606.06565.
- Cheng et al. (2025) Jie Cheng, Lijun Li, Gang Xiong, Jing Shao, and Yisheng Lv. Stop gamma decay: Min-form credit assignment is all process reward model needs for reasoning, 2025. Notion Blog.
- Coste et al. (2024) Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward Model Ensembles Help Mitigate Overoptimization, 2024.
- Cui et al. (2025) Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- Dwaracherla et al. (2024) Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, and Benjamin Van Roy. Efficient exploration for llms. In International Conference on Machine Learning, 2024.
- Gleave & Irving (2022) Adam Gleave and Geoffrey Irving. Uncertainty Estimation for Language Reward Models, 2022.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874.
- Li et al. (2024) Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf), 2024.
- Li & Li (2024) Wendi Li and Yixuan Li. Process reward model with q-value rankings. arXiv preprint arXiv:2410.11287, 2024.
- Liang et al. (2022) Xinran Liang, Katherine Shu, Kimin Lee, and Pieter Abbeel. Reward Uncertainty for Exploration in Preference-based Reinforcement Learning, 2022.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step, 2023.
- Liu et al. (2024) Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, and Min Lin. Sample-efficient alignment for llms, 2024. URL https://arxiv.org/abs/2411.01493.
- Luo et al. (2024) Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve Mathematical Reasoning in Language Models by Automated Process Supervision, 2024.
- Mehta et al. (2023) Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, and Willie Neiswanger. Sample efficient reinforcement learning from human feedback via active exploration. arXiv preprint arXiv:2312.00267, 2023.
- Melo et al. (2024) Luckeciano C. Melo, Panagiotis Tigas, Alessandro Abate, and Yarin Gal. Deep bayesian active learning for preference modeling in large language models, 2024. URL https://arxiv.org/abs/2406.10023.
- OpenAI et al. (2024a) OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob McGrew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler, Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson, David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler, Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such, Filippo Raso, Francis Zhang, Fred von Lohmann, Freddie Sulit, Gabriel 
Goh, Gene Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao, Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney, Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren, Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian O’Connell, Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick, Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie Kiros, Jan Leike, Jane Park, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh, Ji Lin, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joanne Jang, Joaquin Quinonero Candela, Joe Beutler, Joe Landers, Joel Parish, Johannes Heidecke, John Schulman, Jonathan Lachman, Jonathan McKay, Jonathan Uesato, Jonathan Ward, Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Jos Kraaijeveld, Josh Gross, Josh Kaplan, Josh Snyder, Joshua Achiam, Joy Jiao, Joyce Lee, Juntang Zhuang, Justyn Harriman, Kai Fricke, Kai Hayashi, Karan Singhal, Katy Shi, Kavin Karthik, Kayla Wood, Kendra Rimbach, Kenny Hsu, Kenny Nguyen, Keren Gu-Lemberg, Kevin Button, Kevin Liu, Kiel Howe, Krithika Muthukumar, Kyle Luther, Lama Ahmad, Larry Kai, Lauren Itow, Lauren Workman, Leher Pathak, Leo Chen, Li Jing, Lia Guy, Liam Fedus, Liang Zhou, Lien Mamitsuka, Lilian Weng, Lindsay McCallum, Lindsey Held, Long Ouyang, Louis Feuvrier, Lu Zhang, Lukas Kondraciuk, Lukasz Kaiser, Luke Hewitt, Luke Metz, Lyric Doshi, Mada Aflak, Maddie Simens, Madelaine Boyd, Madeleine Thompson, Marat Dukhan, Mark Chen, Mark Gray, Mark Hudnall, Marvin Zhang, Marwan Aljubeh, Mateusz Litwin, Matthew Zeng, Max Johnson, Maya Shetty, Mayank Gupta, Meghan Shah, Mehmet Yatbaz, Meng Jia Yang, Mengchao Zhong, Mia Glaese, Mianna Chen, Michael Janner, Michael Lampe, Michael Petrov, Michael Wu, Michele Wang, Michelle Fradin, Michelle 
Pokrass, Miguel Castro, Miguel Oom Temudo de Castro, Mikhail Pavlov, Miles Brundage, Miles Wang, Minal Khan, Mira Murati, Mo Bavarian, Molly Lin, Murat Yesildal, Nacho Soto, Natalia Gimelshein, Natalie Cone, Natalie Staudacher, Natalie Summers, Natan LaFontaine, Neil Chowdhury, Nick Ryder, Nick Stathas, Nick Turley, Nik Tezak, Niko Felix, Nithanth Kudige, Nitish Keskar, Noah Deutsch, Noel Bundick, Nora Puckett, Ofir Nachum, Ola Okelola, Oleg Boiko, Oleg Murk, Oliver Jaffe, Olivia Watkins, Olivier Godement, Owen Campbell-Moore, Patrick Chao, Paul McMillan, Pavel Belov, Peng Su, Peter Bak, Peter Bakkum, Peter Deng, Peter Dolan, Peter Hoeschele, Peter Welinder, Phil Tillet, Philip Pronin, Philippe Tillet, Prafulla Dhariwal, Qiming Yuan, Rachel Dias, Rachel Lim, Rahul Arora, Rajan Troll, Randall Lin, Rapha Gontijo Lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Ricky Wang, Rob Donnelly, Rob Honsby, Rocky Smith, Rohan Sahai, Rohit Ramchandani, Romain Huet, Rory Carmichael, Rowan Zellers, Roy Chen, Ruby Chen, Ruslan Nigmatullin, Ryan Cheu, Saachi Jain, Sam Altman, Sam Schoenholz, Sam Toizer, Samuel Miserendino, Sandhini Agarwal, Sara Culver, Scott Ethersmith, Scott Gray, Sean Grove, Sean Metzger, Shamez Hermani, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shino Jomoto, Shirong Wu, Shuaiqi, Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tao Xu, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shadwell, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tom Rubin, Tom Stasi, Tomer Kaftan, Tristan Heywood, Troy Peterson, Tyce Walters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yu Zhang, Yuchen He, 
Yuchen Zhang, Yujia Jin, Yunxing Dai, and Yury Malkov. Gpt-4o system card, 2024a. URL https://arxiv.org/abs/2410.21276.
- OpenAI et al. (2024b) OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Kevin Liu, 
Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Y. Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, and Zhuohan Li. Openai o1 system card, 2024b. URL https://arxiv.org/abs/2412.16720.
- Song et al. (2025) Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, 2025.
- Tan et al. (2025) Xiaoyu Tan, Tianchu Yao, Chao Qu, Bin Li, Minghao Yang, Dakuan Lu, Haozhe Wang, Xihe Qiu, Wei Chu, Yinghui Xu, and Yuan Qi. AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification, 2025. URL https://arxiv.org/abs/2502.11520.
- Valdenegro-Toro & Saromo (2022) Matias Valdenegro-Toro and Daniel Saromo. A deeper look into aleatoric and epistemic uncertainty disentanglement, 2022. URL https://arxiv.org/abs/2204.09308.
- Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, 2024.
- Wei et al. (2024) Xiong Wei, Zhang Hanning, Jiang Nan, and Zhang Tong. An Implementation of Generative PRM, 2024. URL https://github.com/RLHFlow/RLHF-Reward-Modeling.
- Yang et al. (2024) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Yuan et al. (2024) Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024.
- Zhang et al. (2025) Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The Lessons of Developing Process Reward Models in Mathematical Reasoning, 2025.
- Zheng et al. (2024) Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. ProcessBench: Identifying Process Errors in Mathematical Reasoning, 2024.
Appendix A LLM-as-Judger Prompt Template
For the LLM-as-Judge annotation, we follow the prompt in Zhang et al. (2025).
```
I will provide a math problem along with a solution. They will be formatted as follows:

[Math Problem]

<math_problem>
...(math problem)...
</math_problem>

[Solution]

<paragraph_1>
...(paragraph 1 of solution)...
</paragraph_1>
...
<paragraph_n>
...(paragraph n of solution)...
</paragraph_n>

Your task is to review each paragraph of the solution in sequence, analyzing,
verifying, and critiquing the reasoning in detail. You need to provide the
analyses and the conclusion in the following format:

<analysis_1>
...(analysis of paragraph 1)...
</analysis_1>
...
<analysis_n>
...(analysis of paragraph n)...
</analysis_n>
<conclusion>
Correct / Incorrect
</conclusion>

* When you analyze each paragraph, you should use proper verification, recalculation, or reflection to indicate whether it is logically and mathematically valid. Please elaborate on the analysis process carefully.

* If an error is detected in any paragraph, you should describe the nature and cause of the error in detail, and suggest how to correct the error or the correct approach. Once a paragraph is found to contain any error, stop further analysis of subsequent paragraphs (as they may depend on the identified error) and directly provide the conclusion of "Incorrect." For instance, given a solution of five paragraphs, if an error is found in the third paragraph, you should reply in the following format:

<analysis_1>
...(analysis of paragraph 1)...
</analysis_1>

<analysis_2>
...(analysis of paragraph 2)...
</analysis_2>

<analysis_3>
...(analysis of paragraph 3; since an error is found here, also provide detailed critique and correction guideline)...
</analysis_3>

<conclusion>
Incorrect
</conclusion>

Note that the analyses of paragraphs 4 and 5 should be skipped as paragraph 3 has been found to contain an error.

* Respond with your analyses and conclusion directly.

--------------------------------------------------

The following is the math problem and the solution for your task:

[Math Problem]

{tagged_problem}

[Solution]

{tagged_response}
```
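The `{tagged_problem}` and `{tagged_response}` placeholders are filled before each judge call. Below is a minimal Python sketch of that step; the helper names `tag_problem` and `tag_response` and the truncated `TEMPLATE` are our own illustration (not the released code), assuming the solution has already been split into paragraphs.

```python
# Hypothetical helpers for filling the judge prompt's placeholders: the problem
# is wrapped in <math_problem> tags, each solution paragraph in <paragraph_i> tags.

TEMPLATE = (  # truncated stand-in for the full judge prompt
    "The following is the math problem and the solution for your task:\n"
    "[Math Problem]\n{tagged_problem}\n"
    "[Solution]\n{tagged_response}"
)

def tag_problem(problem: str) -> str:
    return f"<math_problem>\n{problem}\n</math_problem>"

def tag_response(paragraphs: list[str]) -> str:
    # Number paragraphs from 1, matching the <paragraph_1>...<paragraph_n> format.
    return "\n".join(
        f"<paragraph_{i}>\n{p}\n</paragraph_{i}>"
        for i, p in enumerate(paragraphs, start=1)
    )

prompt = TEMPLATE.format(
    tagged_problem=tag_problem("Compute 1 + 1."),
    tagged_response=tag_response(["1 + 1 = 2.", "So the answer is 2."]),
)
print(prompt)
```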
Appendix B More Experiment Results
B.1 Problem Diversity Is Important for Training PRMs
PRM800K (Lightman et al., 2023) is a widely used, human-annotated dataset for PRM training, containing 800K step-level labels across 75K tree-of-thought solutions to 12K problems from MATH (Hendrycks et al., 2021). Our empirical results show that models trained on our dataset (100K samples from 100K diverse questions) consistently and significantly outperform those trained on the deduplicated PRM800K (https://huggingface.co/datasets/HuggingFaceH4/prm800k-trl-dedup; 369K samples from only 12K questions) on ProcessBench. These findings suggest that problem diversity plays a more crucial role in PRM training than the sheer number of step-level annotations.
| Dataset | # Problems | # CoT Trajectories | ProcessBench F1 |
| --- | --- | --- | --- |
| PRM800K | 7,500 | 460,000 | 0.575 |
| NuminaMath (Random Selected) | 100,000 | 100,000 | 0.673 |
Table 3: Comparison between PRM800K and 100K samples collected from NuminaMath and labeled by Qwen-QwQ.
Appendix C Annotation Cost Estimation
We estimate the labeling cost based on the statistics of our 1M samples collected from NuminaMath (Li et al., 2024) using Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct. We report these statistics in Table 4.
| Statistic | Value | Source |
| --- | --- | --- |
| # Reasoning Steps ( $S$ ) | 8.845 | Qwen Models’ rollouts |
| # Tokens per Rollout ( $R$ ) | 625.098 | Qwen Models’ rollouts |
| # Tokens per Critic Response from Judge ( $C$ ) | 1,919.860 | Qwen-QwQ’s responses as LLM-as-Judge |
Table 4: Statistics of the 1M NuminaMath CoT trajectories collected with Qwen2.5-Math models.
In addition, we use $N$ to denote the number of labeled samples in each model's training dataset, reported in Table 5.
| Dataset | # Labeled Data |
| --- | --- |
| ActPRM | 624,000 (labeled in two stages) |
| Qwen2.5-Math-PRM-Math-shepherd | 860,000 |
| Qwen2.5-Math-PRM | 860,000 |
| UniversalPRM | 690,000 |
Table 5: Number of labeled samples per training dataset.
Using these statistics, we compute the estimated labeling cost for ActPRM, Qwen2.5-Math-PRM-Math-shepherd (Zhang et al., 2025), Qwen2.5-Math-PRM (Zhang et al., 2025), and UniversalPRM (Tan et al., 2025) as follows:
- Qwen2.5-Math-PRM-Math-shepherd: $N \times S \times 8 \times R / 2$, where $8$ is the number of rollouts per step used in Zhang et al. (2025). We divide by two because the number of rollout tokens depends on the position of the reasoning step: rollouts started from later steps require fewer tokens, so the expected tokens per rollout is roughly half the length of a complete rollout.
- Qwen2.5-Math-PRM: $N \times S \times 8 \times R / 2 + N \times C$. Since consensus filtering is applied to each sample, the cost combines Math-Shepherd-style rollouts ($S \times 8 \times R / 2$) and LLM-as-Judge critiques ($C$).
- UniversalPRM: $N \times C \times 4 + N \times S$, where $4$ is the number of ensemble prompts in the original paper (Tan et al., 2025) and the additional $N \times S$ accounts for its semantics-based step separation.
- ActPRM: $N \times C$. We use Qwen-QwQ as the sole judge and require no additional operations.
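The formulas above reduce to a few lines of arithmetic. The sketch below plugs the statistics from Tables 4 and 5 into each formula (the helper name `est_cost` is our own) and reports costs on the base-2 log scale used in Figure 1.

```python
import math

# Statistics from Table 4 (all costs are in generated tokens).
S = 8.845      # reasoning steps per trajectory
R = 625.098    # tokens per rollout
C = 1919.860   # tokens per critique from the LLM judge

# Labeled samples per training dataset, N, from Table 5.
DATASETS = {
    "ActPRM": 624_000,
    "Qwen2.5-Math-PRM-Math-shepherd": 860_000,
    "Qwen2.5-Math-PRM": 860_000,
    "UniversalPRM": 690_000,
}

def est_cost(name: str, n: int) -> float:
    """Estimated annotation cost in generated tokens (Appendix C formulas)."""
    if name == "Qwen2.5-Math-PRM-Math-shepherd":
        return n * S * 8 * R / 2          # 8 rollouts per step, halved in expectation
    if name == "Qwen2.5-Math-PRM":
        return n * S * 8 * R / 2 + n * C  # rollouts plus consensus-filtering critiques
    if name == "UniversalPRM":
        return n * C * 4 + n * S          # 4 ensemble prompts plus step separation
    if name == "ActPRM":
        return n * C                      # a single judge critique per sample
    raise ValueError(f"unknown dataset: {name}")

for name, n in DATASETS.items():
    print(f"{name}: 2^{math.log2(est_cost(name, n)):.1f} tokens")
```

With these numbers, ActPRM comes out around $2^{30}$ tokens versus roughly $2^{34}$ for the rollout-based Math-Shepherd labeling, consistent with the cost axis of Figure 1.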