# Efficient Process Reward Model Training via Active Learning
> Work done during an internship at Sea AI Lab.
## Abstract
Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data, after which we compute the loss w.r.t. the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting, demonstrating that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) among models of the same size. The code is available at https://github.com/sail-sg/ActivePRM.
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Line Chart: Average F1 Score vs. Estimated Annotation Cost
### Overview
This chart displays the relationship between the estimated annotation cost (in Generation Tokens) and the average F1 score for three different models: ActPRM, UniversalPRM, and Qwen2.5-Math-PRM-7B. The chart shows how the F1 score changes as the annotation cost increases. A dashed line indicates a cost of 17.3x.
### Components/Axes
* **X-axis:** Est. Annotation Cost (Gen. Tokens). Scale ranges from approximately 227 to 235. Marked values are 228, 230, 232, and 234.
* **Y-axis:** Average F1 Score. Scale ranges from approximately 0.68 to 0.76. Marked values are 0.68, 0.70, 0.72, 0.74, 0.75, and 0.76.
* **Legend:** Located in the bottom-right corner.
* ActPRM (Red circle marker)
* UniversalPRM (Blue star marker)
* Qwen2.5-Math-PRM-7B (Dark blue star marker)
* **Dashed Vertical Line:** Located at approximately x = 230, indicating a cost threshold.
* **Text Labels:**
* "0.750" at approximately (230, 0.75)
* "4.8x Cost" above the ActPRM line near x=230
* "17.3x Cost" above the dashed line near x=230
* "0.743" at approximately (232, 0.743)
* "0.735" at approximately (234, 0.735)
### Detailed Analysis
* **ActPRM (Red Line):** The line starts at approximately (228, 0.68) and increases sharply until reaching a peak at approximately (230, 0.750). After x=230, the line decreases slightly to approximately (232, 0.743) and then decreases further to approximately (234, 0.735).
* **UniversalPRM (Blue Line):** The line starts at approximately (228, 0.72) and increases to approximately (230, 0.74). After x=230, the line remains relatively flat at approximately 0.74.
* **Qwen2.5-Math-PRM-7B (Dark Blue Line):** The line starts at approximately (228, 0.70) and increases to approximately (230, 0.73). After x=230, the line remains relatively flat at approximately 0.73.
### Key Observations
* ActPRM demonstrates the most significant improvement in F1 score with increasing annotation cost, peaking at x=230. However, its performance declines slightly after this point.
* UniversalPRM and Qwen2.5-Math-PRM-7B show a more gradual increase in F1 score with annotation cost, and their performance plateaus after x=230.
* The dashed vertical line at x=230 appears to mark a point of diminishing returns for ActPRM, as the F1 score begins to decrease after this cost.
### Interpretation
The chart suggests that increasing the estimated annotation cost initially leads to substantial improvements in the F1 score, particularly for the ActPRM model. However, there is a point (around 230 Gen. Tokens) where further increases in annotation cost yield diminishing returns, and the F1 score may even decrease. This could indicate that the model has reached its optimal performance level with the available data and that additional annotation effort is not effectively improving its accuracy. The relatively stable performance of UniversalPRM and Qwen2.5-Math-PRM-7B suggests they may be less sensitive to annotation cost or have already reached their performance limits. The "4.8x Cost" and "17.3x Cost" labels likely refer to the relative cost increase compared to a baseline, and the dashed line highlights a cost threshold where the benefit of further annotation diminishes. The chart provides valuable insights into the trade-off between annotation cost and model performance, helping to optimize resource allocation for model training.
</details>
Figure 1: Average F1 score on ProcessBench (Zheng et al., 2024) versus estimated annotation cost. ActPRM outperforms prior SOTA models while requiring merely 20% of the annotation costs.
Recently, Large Language Models (LLMs) (DeepSeek-AI et al., 2025; Yang et al., 2024; OpenAI et al., 2024b) have shown remarkable advances in mathematical reasoning, yet they can make mistakes during chain-of-thought (CoT) reasoning despite correct final answers (Zheng et al., 2024). To address this challenge, process reward models (Lightman et al., 2023; Wang et al., 2024; Zhang et al., 2025) were proposed, aiming to identify process errors and provide finer-grained supervision of the training process.
The main challenge in training Process Reward Models (PRMs) lies in obtaining fine-grained step-level annotations, which remain prohibitively expensive. Lightman et al. (2023) pioneered PRM training by employing human experts to label 75K questions at the step level. While their approach achieved high-quality results (reaching 57.5% on ProcessBench (Zheng et al., 2024)), it does not scale due to its heavy reliance on manual annotation. To reduce human effort, Monte Carlo (M.C.) estimation methods (Wang et al., 2024; Wei et al., 2024; Luo et al., 2024) were proposed. However, these approaches come with high computational costs (massive rollouts are required for accurate estimation) and struggle to accurately identify the first error step (Zheng et al., 2024). To address this challenge, Qwen2.5-Math-PRM (Zhang et al., 2025) proposed using LLM-as-Judge — leveraging LLMs to detect the first error step — to filter out unreliable M.C. labels, which significantly boosts PRM performance on both ProcessBench (Zheng et al., 2024) and PRMBench (Song et al., 2025). More recently, UniversalPRM (Tan et al., 2025) relies solely on LLM-as-Judge with ensemble prompting (via majority voting), achieving new SOTA performance on ProcessBench among models of the same size. However, the annotation costs are still considerable. We estimate the labeling costs of Qwen2.5-Math-PRM (Zhang et al., 2025) and UniversalPRM (Tan et al., 2025) and illustrate them in Figure 1: Qwen2.5-Math-PRM-7B and UniversalPRM consume over $2^{34}$ and $2^{32}$ generated tokens, respectively. Refer to Appendix C for the estimation strategy.
To reduce annotation costs, we propose ActPRM, which uses a trained ensemble PRM to identify and select uncertain data for annotation by a highly capable reasoning model. Our approach trains a PRM with ensemble heads for uncertainty estimation. For each reasoning step, we compute the mean $\mu$ and standard deviation $\sigma$ of the ensemble predictions, flagging a step as uncertain when the mean prediction falls inside the interval $1-\delta_{pred}<\mu<\delta_{pred}$ or the standard deviation exceeds $\delta_{std}$. We consider a CoT trajectory uncertain if any step up to and including the first predicted error meets these criteria. By annotating only uncertain data and training exclusively on this subset, we significantly reduce labeling costs while maintaining PRM performance.
To validate the effectiveness and efficiency of ActPRM, we conducted comprehensive experiments in multiple settings:
- Pool-based Evaluation (Section 5.1): Using 100K labeled samples, ActPRM achieved performance comparable to full-data tuning while reducing annotation costs by 50%. It consistently outperformed random selection under identical budget constraints.
- One-shot Active Learning (Section 5.2): Starting with our pool-based model, we applied ActPRM to select uncertain samples from 1M+ unlabeled CoT trajectories from NuminaMath (Li et al., 2024). After annotation and fine-tuning, we achieved new SOTA performance of 75.0% on ProcessBench. As shown in Figure 1, ActPRM surpasses prior SOTA models with significantly lower costs—outperforming UniversalPRM (Tan et al., 2025) by 0.7% using only 20% of its annotation budget and exceeding Qwen2.5-Math-PRM-7B by 1.5% with just 6% of its annotation budget.
Our contributions are summarized as follows: ❶ We propose an uncertainty-aware active learning approach ActPRM for PRM training that selectively annotates informative reasoning steps using ensemble-based uncertainty estimation, significantly reducing labeling costs while maintaining performance. ❷ ActPRM achieves state-of-the-art results (75.0% on ProcessBench, 65.5% on PRMBench) while requiring only 20% of the annotation budget compared to prior SOTA method UniversalPRM. ❸ We release all trained models, datasets, and code to ensure reproducibility and facilitate community adoption.
## 2 Preliminaries
### 2.1 Process Reward Models
Problem Formulation. Given a math problem $q$ and a corresponding solution trajectory $s=[s_{1},s_{2},\dots,s_{n}]$, where $s_{i}$ denotes the $i$-th step, we require a PRM to identify the correctness of each step until a wrong step is found. Following prior works (Lightman et al., 2023; Zheng et al., 2024), we only label the steps up to and including the first error step, since the correctness of subsequent steps is genuinely difficult to define. As a result, in practice the labels for a solution trajectory are either $[1,1,\dots,1]$ or $[1,1,\dots,0]$. A PRM can then be trained using the standard BCE loss:
$$
\mathcal{L}_{BCE}(s,y|\theta)=-\frac{1}{|s|}\sum_{i=1}^{|s|}\left[y_{i}\log P_{\theta}(s_{i}|s_{[:i]},q)+(1-y_{i})\log\left(1-P_{\theta}(s_{i}|s_{[:i]},q)\right)\right], \tag{1}
$$
where $P_{\theta}$ is the PRM parameterized by $\theta$ and $s_{[:i]}$ denotes the steps before $s_{i}$ . When using PRM for inference, we set a threshold $\delta$ (usually $0.5$ ) to identify the first step that has a correctness probability $P_{\theta}(s_{i}|s_{[:i]},q)$ less than $\delta$ .
PRM Implementation Details. A typical PRM is built upon a pretrained generative LLM by replacing the causal language model head with a binary classification head that outputs the probability of correctness at the corresponding token position. In practice, we only need the prediction at the end of each reasoning step, so a prediction mask is used to mask out predictions at all other positions.
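As a hedged sketch of the two mechanics above (the function names, the `head` layer, and the mask layout are illustrative assumptions, not the released implementation), the masked step scoring and the thresholded first-error detection might look like:

```python
import torch

def step_scores(hidden_states, head, step_end_mask):
    """Score reasoning steps with a binary classification head (a sketch).

    hidden_states: (seq_len, d) backbone outputs for one trajectory.
    head: a linear layer replacing the causal LM head (illustrative).
    step_end_mask: (seq_len,) bool, True at the last token of each step.
    """
    logits = head(hidden_states).squeeze(-1)   # (seq_len,)
    probs = torch.sigmoid(logits)              # per-token correctness probability
    return probs[step_end_mask]                # keep only step-end positions

def first_error_step(step_probs, delta=0.5):
    """Index of the first step whose correctness probability is below delta, or -1."""
    below = (step_probs < delta).nonzero()
    return int(below[0]) if len(below) > 0 else -1
```

In this sketch the prediction mask simply indexes the step-end positions, so all other token predictions are discarded before any loss or thresholding is applied.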
### 2.2 Uncertainty Estimation for Classification
Aleatoric Uncertainty. As aforementioned, a typical PRM $P_{\theta}$ is trained on a binary classification task. The simplest way to measure its uncertainty is aleatoric uncertainty (Valdenegro-Toro & Saromo, 2022). For brevity, we write $P_{\theta}(s_{i})$ for the aleatoric probability of the $i$-th step in the solution trajectory, whose full form is $P_{\theta}(s_{i}|s_{[:i]},q)$ as in Equation 1:
$$
\mbox{Aleatoric Uncertainty}\propto -P_{\theta}(s_{i})\log P_{\theta}(s_{i}). \tag{2}
$$
Epistemic Uncertainty. In addition, an ensemble of models is a common way to estimate epistemic uncertainty (Valdenegro-Toro & Saromo, 2022) by quantifying the disagreement among ensemble members. For example, Liang et al. (2022); Gleave & Irving (2022) use an ensemble of reward models to estimate uncertainty in preference learning. For process reward modeling, we can leverage the standard deviation (or variance) of the ensemble predictions as the uncertainty estimate:
$$
\mbox{Epistemic Uncertainty}\propto\mbox{Var}(\{P_{\theta}(s_{i})\}), \tag{3}
$$
where $\{P_{\theta}\}$ is a set of models. It is worth noting that employing an ensemble of heads built upon a shared backbone is a common strategy to mitigate computational costs. We empirically study the combination of aleatoric and epistemic uncertainty and find that they are complementary to each other. Experimental results are shown in Section 5.1.
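Both signals can be read directly off the ensemble predictions. The sketch below (assumed tensor shapes and names; it uses the prediction confidence as a monotone stand-in for the entropy-style term in Equation 2) returns both quantities per step:

```python
import torch

def step_uncertainties(ensemble_probs):
    """Per-step uncertainties for one trajectory (a sketch).

    ensemble_probs: (n_heads, n_steps) correctness probabilities
    from ensemble heads sharing one backbone.
    """
    mu = ensemble_probs.mean(dim=0)
    # Aleatoric: low confidence of the mean prediction (cf. Equation 2);
    # 1 - max(mu, 1 - mu) is largest when mu is near 0.5.
    aleatoric = 1.0 - torch.maximum(mu, 1.0 - mu)
    # Epistemic: disagreement among the heads (Equation 3).
    epistemic = ensemble_probs.var(dim=0, unbiased=False)
    return aleatoric, epistemic
```

Because the heads share a backbone, a single forward pass yields `ensemble_probs`, which keeps the cost of uncertainty estimation close to that of a standard PRM forward pass.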
## 3 Related Work
Active Learning and Uncertainty Estimation. Active learning has been widely explored in the alignment of LLMs. Several studies adopt an online bandit formulation, leveraging uncertainty-aware reward models (RMs) for active exploration in response selection (Mehta et al., 2023; Dwaracherla et al., 2024; Liu et al., 2024; Melo et al., 2024; Gleave & Irving, 2022). For instance, Mehta et al. (2023) and Dwaracherla et al. (2024) use ensemble-based LLM heads to estimate epistemic uncertainty, prioritizing data most informative for preference learning. Similarly, Melo et al. (2024) propose an acquisition function combining both entropy (aleatoric uncertainty) and epistemic uncertainty. Our work builds on these approaches, empirically evaluating the role of both uncertainty types in the context of process reward modeling. Beyond active learning, ensemble methods—such as those in Coste et al. (2024) —have also proven effective in mitigating reward hacking (Amodei et al., 2016).
Process Reward Models. Different from outcome rewards (e.g., verifiable rewards (DeepSeek-AI et al., 2025) for mathematical reasoning problems) that are assigned based on the final outcome, process rewards are assigned based on the intermediate steps of the problem-solving process. For a question and a corresponding multi-step solution, a PRM provides finer-grained rewards for each step. To date, process reward modeling falls into two categories: $(i)$ process reward as Q-values and $(ii)$ process reward as judge. The former (Wang et al., 2024; Luo et al., 2024; Wei et al., 2024; Li & Li, 2024) regards the process reward as the Q-value of a step, estimating the probability that the policy model reaches the final correct answer. Specifically, these methods use the policy model that generates the solution steps to perform Monte Carlo estimation for each step, and the estimated Q-values serve as the process rewards. However, recent works (Zhang et al., 2025; Zheng et al., 2024) show that this kind of process reward modeling struggles to identify process errors because it depends heavily on the policy model and is strongly biased relative to the ground-truth distribution. In contrast, the latter (Lightman et al., 2023; Zhang et al., 2025) regards the process reward model as a proxy for identifying intermediate process errors, and the resulting models achieve better performance on several benchmarks (Zheng et al., 2024; Song et al., 2025). In this work, we follow the latter and regard the process reward model as a judge that tries to identify the first error step in a solution, if any. In addition, other works relate to PRMs: for example, Yuan et al. (2024) trains a PRM in the fashion of outcome reward modeling (ORM), and Cheng et al. (2025); Cui et al. (2025) propose RL training frameworks that integrate PRMs as finer-grained supervision.
## 4 Efficient Process Reward Labeling via Active Learning
Labeling the process rewards for a large-scale dataset is very expensive as it either requires human experts to annotate the correctness of each step for each solution as in the previous work (Lightman et al., 2023) or leverages highly capable generative models to imitate human experts (Zhang et al., 2025). Even though the latter one is automated, it is still resource-consuming since the test time scales up with the difficulty of math problems.
Algorithm 1 PRM Active Learning with Cold Start.
1: // The difference with full data tuning is colored.
2: Ensemble PRM $P_{\theta}$ , dataset $\mathcal{D}=\{(q,s)\}$ , uncertainty thresholds $\delta_{pred}$ and $\delta_{std}$ , generative LLM $M$ , batch size $B$ , learning rate $\eta$
3: for $\mathcal{B}\subset\mathcal{D}$ do
4: $P_{\theta}(\mathcal{B})\leftarrow\text{Forward}(\mathcal{B})$
5: $\widetilde{\mathcal{B}}=\{\}$
6: for $(q,s)\in\mathcal{B}$ do
7: if $\mathcal{U}^{\text{alea}}_{\theta}(s)\lor\mathcal{U}^{\text{epis}}_{\theta}(s)$ then $\triangleright$ Equation 5
8: $\widetilde{\mathcal{B}}\leftarrow\widetilde{\mathcal{B}}\cup\{(q,s)\}$
9: end if
10: end for
11: $Y_{\widetilde{\mathcal{B}}}\leftarrow\text{labeling}(\widetilde{\mathcal{B}})$ $\triangleright$ Labeling using generative LLM
12: $\mathcal{L}\leftarrow\frac{1}{|\widetilde{\mathcal{B}}|}\sum_{(s,y)\in( \widetilde{\mathcal{B}},Y_{\widetilde{\mathcal{B}}})}{\color[rgb]{ 0.65234375,0.125,0.08984375}\mathcal{L}(s,y)}$ $\triangleright$ Equation 4
13: $\nabla_{\theta}\mathcal{L}\leftarrow\text{Backward}(\mathcal{L})$
14: $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}$
15: end for
16: return $P_{\theta}$
To mitigate this issue, we propose to leverage active learning so that the PRM proactively selects the data most informative to train on. Specifically, we train a PRM with ensemble heads to enable uncertainty estimation following Liang et al. (2022); Gleave & Irving (2022). As shown in Algorithm 1, we forward the data candidates through the ensemble PRM (line 4) and estimate the prediction uncertainty for each data point (lines 6-10). We then omit the data that the ensemble PRM is confident about and label only the retained data with a generative reasoning LLM (line 11). Finally, we backpropagate only the loss of the labeled data (lines 12-14). By doing so, we can considerably reduce the labeling cost while maintaining PRM performance. We now introduce our two key differences from vanilla fine-tuning: ensemble PRM training and uncertain data selection.
Ensemble PRM Training. In this work, we use an ensemble of PRMs to estimate the epistemic uncertainty following Gleave & Irving (2022); Liang et al. (2022). Specifically, we use a shared LLM backbone and build multiple binary classification heads on top of it. In our training, the diversity of the ensemble is ensured in two ways: $(i)$ the random initialization of the head layers and $(ii)$ a diversity regularization term (Dwaracherla et al., 2024): $\mathcal{L}_{\mbox{div}}=\lambda\cdot\frac{1}{n}\sum_{i=1}^{n}||\phi^{i}-\phi_{\text{init}}^{i}||_{2},$ where $\{\phi^{i}\}$ are the parameters of the ensemble heads and $n$ is the number of ensemble heads. This is an $L_{2}$ term penalizing the distance between the ensemble head parameters and their initial values. Our training objective for the ensemble PRM is therefore formulated as follows:
$$
\mathcal{L}(y,s)=\frac{1}{n}\sum_{i=1}^{n}\left(\mathcal{L}_{BCE}(y,s|\theta,
\phi^{i})+\lambda||\phi^{i}-\phi_{\text{init}}^{i}||_{2}\right), \tag{4}
$$
where $\theta$ denotes the backbone parameters and $\mathcal{L}_{BCE}$ is from Equation 1, computing the loss for a given head.
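A hedged sketch of Equation 4 (tensor shapes and the default value of `lam` are assumptions for illustration, not values from the paper):

```python
import torch
import torch.nn.functional as F

def ensemble_prm_loss(step_logits, labels, heads, init_heads, lam=0.01):
    """Sketch of Eq. 4: mean per-head BCE plus the L2 diversity term.

    step_logits: (n_heads, n_steps) logits from the ensemble heads.
    labels:      (n_steps,) float step-correctness targets in {0, 1}.
    heads / init_heads: current / initial head parameter tensors.
    lam is a placeholder regularization weight (assumption).
    """
    # BCE averaged over heads and steps; each head sees the same labels.
    bce = F.binary_cross_entropy_with_logits(
        step_logits, labels.expand_as(step_logits))
    # L2 penalty keeping each head near its random initialization,
    # preserving ensemble diversity during training.
    div = sum(torch.linalg.vector_norm(w - w0)
              for w, w0 in zip(heads, init_heads)) / len(heads)
    return bce + lam * div
```

Averaging the BCE over heads rather than summing keeps the gradient scale independent of the ensemble size, so the same learning rate can be reused when the number of heads changes.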
Uncertain Data Selection. Consider a batch of data candidates $\mathcal{D}=\{(q,s)\}$. We first forward the data through the ensemble PRM $P_{\theta}$ to get the ensemble predictions $P_{\theta}(\mathcal{D})\in\mathbb{R}^{n\times|\mathcal{D}|\times|s|}$. For each data point $(q,s)\in\mathcal{D}$, we determine the hard-valued aleatoric and epistemic uncertainty with preset thresholds. Briefly, the aleatoric (or epistemic) uncertainty is 1 if uncertainty occurs at any step prior to the first predicted error step, and 0 otherwise. A formal definition follows:
$$
\mathcal{U}^{\text{alea}}_{\theta}(s)=\bigvee_{i=0}^{\mathcal{E}(s)}\left(\max\left(\mu(P_{\theta}(s_{i})),1-\mu(P_{\theta}(s_{i}))\right)<\delta_{pred}\right)\;;\quad\mathcal{U}^{\text{epis}}_{\theta}(s)=\bigvee_{i=0}^{\mathcal{E}(s)}\left(\sigma(P_{\theta}(s_{i}))>\delta_{std}\right), \tag{5}
$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ are the mean and standard deviation of the predictions across ensemble heads and $\bigvee$ denotes the logical 'OR' operation. $\mathcal{E}(s)$ denotes the first error step in the solution trajectory $s$, defined as $\mathcal{E}(s)=\min\{j\mid\mu(P_{\theta}(s_{j}))<\delta\}$, where $\delta$ is the correctness threshold, typically set to $0.5$. We restrict attention to the steps up to the first error step since the correctness of subsequent steps is genuinely difficult to define. Following this uncertainty estimation strategy, we retain the data in $\mathcal{D}$ that satisfy either $\mathcal{U}^{\text{alea}}_{\theta}$ or $\mathcal{U}^{\text{epis}}_{\theta}$ as $\widetilde{\mathcal{D}}$. We then leverage expensive generative LLMs as judges (Zheng et al., 2024) to label the retained data in $\widetilde{\mathcal{D}}$.
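Putting Equation 5 together, trajectory-level selection might be implemented as below (a sketch; the input shape is assumed, and the threshold defaults follow the values reported for the highlighted curve in Figure 2):

```python
import torch

def trajectory_is_uncertain(ensemble_probs, delta_pred=0.95,
                            delta_std=0.005, delta=0.5):
    """Eq. 5 sketch: flag a trajectory if any step up to and including
    the first predicted error is low-confidence or high-disagreement.

    ensemble_probs: (n_heads, n_steps) correctness probabilities.
    """
    mu = ensemble_probs.mean(dim=0)
    sigma = ensemble_probs.std(dim=0, unbiased=False)
    # First predicted error E(s); keep all steps if no error is predicted.
    wrong = (mu < delta).nonzero()
    last = int(wrong[0]) if len(wrong) > 0 else mu.numel() - 1
    mu, sigma = mu[:last + 1], sigma[:last + 1]
    u_alea = bool((torch.maximum(mu, 1 - mu) < delta_pred).any())  # low confidence
    u_epis = bool((sigma > delta_std).any())                       # head disagreement
    return u_alea or u_epis
```

Only trajectories for which this predicate is true are forwarded to the generative judge, which is where the reduction in labeling cost comes from.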
## 5 Experiments
In Section 5.1, we first validate ActPRM in a pool-based active learning setting using 100K labeled samples, including ablation studies on our uncertainty estimation strategy. Based on the optimal configuration, we then scale up to 1M unlabeled samples in Section 5.2, further proving our pipeline’s efficiency and effectiveness.
### 5.1 Pool-Based Active Learning
#### 5.1.1 Experimental Settings
To evaluate our active learning strategy’s effectiveness, we first conduct experiments in an offline setting where ActPRM iteratively selects the most informative examples from a large unlabeled pool as detailed in Algorithm 1. We establish a strong baseline by comparing against full data tuning, where a model is trained on the complete dataset labeled by a single annotator. It is worth noting that as our data is randomly shuffled, the performance of full data tuning at intermediate training steps is essentially equivalent to the performance of random selection with the corresponding budget.
Evaluation Benchmark. We utilize ProcessBench (Zheng et al., 2024) to evaluate the effectiveness of PRMs. The test data in ProcessBench contains intermediate step errors and requires the PRM to identify the first error step. ProcessBench contains four subsets, and we report the average F1 score following the original work.
Models. We train ActPRM based on Qwen2.5-Math-7B-Instruct.
Training Dataset. For dataset construction, we randomly select 100K problems from the NuminaMath (Li et al., 2024) dataset after decontamination against ProcessBench (Zheng et al., 2024) and PRMBench (Song et al., 2025). We leverage Qwen2.5-Math-7B-Instruct to generate CoT reasoning trajectories for the selected problems and further use QwQ-32B as a judge to annotate the process correctness of all trajectories following Zhang et al. (2025). For completeness, we provide the prompt template in Appendix A.
#### 5.1.2 Experimental Results
(a)
<details>
<summary>x2.png Details</summary>

### Visual Description
## Line Chart: Average F1 Score vs. Budget
### Overview
This chart displays the relationship between "Budget" (on the x-axis) and "Average F1 Score" (on the y-axis) for two different models: "ActPRM" and "Full Data Tuning". The chart uses lines and scatter plots to represent the data, with a dashed horizontal line indicating a reference F1 score of 0.673. A vertical dashed line is present at Budget = 0.5.
### Components/Axes
* **X-axis:** "Budget", ranging from 0.0 to 1.0, with markers at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Y-axis:** "Average F1 Score", ranging from 0.45 to 0.70, with markers at 0.45, 0.50, 0.55, 0.60, 0.65, and 0.70.
* **Legend:** Located in the top-right corner.
* "ActPRM" - Represented by a red line with circular markers.
* "Full Data Tuning" - Represented by a blue line with circular markers.
* **Horizontal Line:** Dashed black line at F1 = 0.673.
* **Vertical Line:** Dashed black line at Budget = 0.5.
* **Text Annotation:** "Budget = 0.5" positioned near the vertical dashed line.
* **Text Annotation:** "F1 = 0.673" positioned near the horizontal dashed line.
### Detailed Analysis
**ActPRM (Red Line):**
The ActPRM line starts at approximately (0.0, 0.50) and exhibits a steep upward slope until around Budget = 0.3, reaching a peak F1 score of approximately 0.68. After this peak, the line gradually declines, fluctuating around 0.65-0.67 for budgets between 0.4 and 1.0.
* (0.0, 0.50)
* (0.1, 0.56)
* (0.2, 0.60)
* (0.3, 0.66)
* (0.4, 0.68)
* (0.5, 0.67)
* (0.6, 0.66)
* (0.7, 0.66)
* (0.8, 0.66)
* (0.9, 0.67)
* (1.0, 0.66)
**Full Data Tuning (Blue Line):**
The Full Data Tuning line begins at approximately (0.0, 0.51) and steadily increases until around Budget = 0.6, reaching a peak F1 score of approximately 0.67. The line then plateaus, remaining relatively stable between 0.66 and 0.68 for budgets between 0.6 and 1.0.
* (0.0, 0.51)
* (0.1, 0.57)
* (0.2, 0.61)
* (0.3, 0.64)
* (0.4, 0.66)
* (0.5, 0.67)
* (0.6, 0.68)
* (0.7, 0.68)
* (0.8, 0.67)
* (0.9, 0.67)
* (1.0, 0.67)
### Key Observations
* ActPRM achieves its peak performance at a lower budget (around 0.4) compared to Full Data Tuning (around 0.6).
* Full Data Tuning demonstrates more stable performance at higher budgets, while ActPRM's performance declines slightly after its peak.
* Both models show a clear positive correlation between budget and F1 score, at least up to a certain point.
* The horizontal line at F1 = 0.673 serves as a benchmark, and Full Data Tuning surpasses this benchmark at budgets greater than 0.5.
### Interpretation
The chart suggests that increasing the budget generally improves the F1 score for both models. However, the relationship is not linear, and there appears to be a diminishing return on investment. ActPRM is more sensitive to budget changes, achieving high performance with a smaller budget but also experiencing a decline in performance at higher budgets. Full Data Tuning, while requiring a larger budget to reach its peak, maintains a more consistent level of performance. The vertical line at Budget = 0.5 may indicate a critical point where Full Data Tuning begins to outperform ActPRM consistently. The horizontal line at F1 = 0.673 could represent a target performance level, and the chart helps determine which model is more likely to achieve or exceed this target given a specific budget. The scatter points suggest some variance in performance for each budget level, indicating that the F1 score is not solely determined by the budget. This could be due to factors such as data variability or model initialization.
</details>
(b)
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Average F1 Score vs. Budget
### Overview
This line chart depicts the relationship between "Budget" and "Average F1 Score" for three different models. The chart shows how the average F1 score changes as the budget allocated to the model increases. A legend in the top-right corner identifies each model and provides associated statistical information.
### Components/Axes
* **X-axis:** "Budget" ranging from 0.0 to 0.7, with markers at 0.0, 0.2, 0.4, and 0.6.
* **Y-axis:** "Average F1 Score" ranging from 0.45 to 0.70, with markers at 0.45, 0.50, 0.55, 0.60, 0.65, and 0.70.
* **Legend:** Located in the top-right corner, labeled "Model". It contains three entries:
* Red line: "(std: 0.005, p: 0.95)"
* Blue line: "(p: 0.95)"
* Teal line: "(std: 0.005)"
### Detailed Analysis
* **Red Line (std: 0.005, p: 0.95):** This line represents a model with a standard deviation of 0.005 and a p-value of 0.95. The line slopes generally upward, indicating an increase in Average F1 Score as the Budget increases.
* At Budget = 0.0, Average F1 Score ≈ 0.49
* At Budget = 0.2, Average F1 Score ≈ 0.59
* At Budget = 0.4, Average F1 Score ≈ 0.64
* At Budget = 0.6, Average F1 Score ≈ 0.67
* **Blue Line (p: 0.95):** This line represents a model with a p-value of 0.95. The line initially decreases, then increases with the Budget.
* At Budget = 0.0, Average F1 Score ≈ 0.48
* At Budget = 0.2, Average F1 Score ≈ 0.55
* At Budget = 0.4, Average F1 Score ≈ 0.62
* At Budget = 0.6, Average F1 Score ≈ 0.65
* **Teal Line (std: 0.005):** This line represents a model with a standard deviation of 0.005. The line shows an initial increase, followed by fluctuations and a slight decrease.
* At Budget = 0.0, Average F1 Score ≈ 0.50
* At Budget = 0.2, Average F1 Score ≈ 0.61
* At Budget = 0.4, Average F1 Score ≈ 0.62
* At Budget = 0.6, Average F1 Score ≈ 0.63
### Key Observations
* The red line consistently demonstrates the highest Average F1 Score across all budget levels.
* The blue line exhibits the most significant initial drop in performance as the budget increases from 0.0 to 0.2.
* The teal line shows the most variability in performance, with fluctuations around the 0.62 mark.
* All three models show an overall positive trend, with increasing Average F1 Score as the Budget increases, although the rate of increase varies.
### Interpretation
The data suggests that increasing the budget generally improves the performance (as measured by Average F1 Score) of all three models. However, the red model consistently outperforms the other two, indicating it is the most effective at utilizing the increased budget. The initial dip in the blue model's performance suggests that there might be an initial cost or overhead associated with increasing the budget for that particular model. The statistical information provided in the legend (standard deviation and p-value) suggests that the red model's performance is more stable (lower standard deviation) and statistically significant (higher p-value) than the others. The chart highlights the importance of budget allocation in model performance and suggests that the red model is the most efficient choice for maximizing F1 score within the given budget range. The fluctuations in the teal line could indicate sensitivity to specific data subsets or a less robust training process.
</details>
(c)
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: Average F1 Score vs. Number of Heads
### Overview
This image presents a bar chart illustrating the relationship between the number of "Heads" and the corresponding "Average F1 Score". Each bar represents the average F1 score for a specific number of heads, with error bars indicating the variability or confidence interval around that average.
### Components/Axes
* **X-axis:** Labeled "#Heads", with markers at 2, 4, 8, 16, 32, and 64.
* **Y-axis:** Labeled "Average F1 Score", with a scale ranging from approximately 0.600 to 0.650, incrementing by 0.025.
* **Bars:** Red bars representing the average F1 score for each number of heads.
* **Error Bars:** Black vertical lines extending above and below each bar, indicating the standard error or confidence interval.
### Detailed Analysis
The chart displays six bars, each corresponding to a different number of heads. The trend shows a general increase in Average F1 Score as the number of heads increases, but the rate of increase appears to diminish at higher numbers of heads.
* **2 Heads:** Average F1 Score is approximately 0.600, with an error bar extending from roughly 0.575 to 0.625.
* **4 Heads:** Average F1 Score is approximately 0.625, with an error bar extending from roughly 0.600 to 0.650.
* **8 Heads:** Average F1 Score is approximately 0.630, with an error bar extending from roughly 0.605 to 0.655.
* **16 Heads:** Average F1 Score is approximately 0.640, with an error bar extending from roughly 0.615 to 0.665.
* **32 Heads:** Average F1 Score is approximately 0.650, with an error bar extending from roughly 0.625 to 0.675.
* **64 Heads:** Average F1 Score is approximately 0.645, with an error bar extending from roughly 0.620 to 0.670.
### Key Observations
* The F1 score increases significantly from 2 to 32 heads.
* The increase in F1 score between 32 and 64 heads is minimal, and the error bars overlap, suggesting the difference may not be statistically significant.
* The error bars are relatively consistent across all numbers of heads, indicating similar variability in the F1 scores for each condition.
### Interpretation
The data suggests that increasing the number of "Heads" generally improves the "Average F1 Score", up to a certain point. Beyond 32 heads, the improvement in F1 score plateaus, and adding more heads does not yield substantial gains. This could indicate a diminishing return effect, where the benefits of adding more heads are reduced as the system becomes more complex or reaches a point of saturation. The error bars provide a measure of uncertainty, and the overlap between the error bars for 32 and 64 heads suggests that the difference in F1 scores between these two conditions may not be statistically significant. This implies that the optimal number of heads for maximizing F1 score is likely around 32, and further increasing the number of heads does not provide a significant advantage. The "Heads" likely refer to attention heads in a neural network architecture, and the F1 score is a metric for evaluating the performance of a model.
</details>
Figure 2: (a) Comparison of the average F1 score on ProcessBench between ActPRM and random selection, plotted against the normalized budget, which is positively correlated with the number of labeled data instances consumed. The semi-transparent points represent all results of the grid search over $\delta_{pred}$ and $\delta_{std}$ . For the highlighted ActPRM curve, $\delta_{pred}=0.95$ and $\delta_{std}=0.005$ . (b) Ablation on uncertainty estimation strategies. (c) Ablation on the number of ensemble PRM heads.
ActPRM achieves comparable performance while reducing annotation costs by 50%. We compare ActPRM with full-data tuning across a normalized budget, as illustrated in Figure 2 (a). ActPRM achieves an average F1 score of $0.673$ on ProcessBench, matching baseline performance while using only half the annotation budget. Furthermore, ActPRM consistently outperforms random selection under the same budget constraints; notably, at 50% of the budget, ActPRM surpasses random selection by a significant margin of 3.3%. At the end of pool-based active training, ActPRM reaches a higher score of 0.680 on ProcessBench while consuming only 62.5% of the budget.
ActPRM Consistently Outperforms Random Selection Under Diverse $\delta_{pred}$ and $\delta_{std}$ . As shown in Figure 2 (a), the semi-transparent blue points represent all results of a grid search over $\delta_{pred}\in\{0.9,0.95,0.97\}$ and $\delta_{std}\in\{0.01,0.005,0.002,0.001\}$ . Most blue points lie above the baseline (gray line) at the same budget, further demonstrating the effectiveness and robustness of ActPRM.
Ablation Study on Uncertainty Estimation Strategies. We conduct an ablation study on uncertainty estimation strategies, i.e., using epistemic versus aleatoric uncertainty. We select the best setting ( $\delta_{std}=0.005,\delta_{pred}=0.95$ ) found by the grid search in Figure 2 and ablate epistemic and aleatoric uncertainty by setting $\delta_{std}=\infty$ and $\delta_{pred}=0.5$ , respectively. As illustrated in Figure 2 (b), using either epistemic or aleatoric uncertainty alone under-performs using both, indicating that the two kinds of uncertainty are complementary.
Ablation Study on the Number of Heads for the Ensemble PRM. The number of heads in the ensemble PRM controls how accurately we estimate epistemic uncertainty. To find the trade-off between estimation quality and computational overhead, we conduct an ablation study and show the results in Figure 2 (c), where we consider only epistemic uncertainty by setting $\delta_{std}=0.005,\delta_{pred}=0.5$ and report results averaged over 3 runs. We empirically find that performance grows with the number of heads and converges at about 32 heads.
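The uncertainty-based selection rule discussed above can be sketched as follows. This is a minimal illustration under our own assumptions (function name, data layout, and aggregation are ours, not the released implementation): a step is routed to the annotator when the aggregated prediction is ambiguous (aleatoric criterion, threshold $\delta_{pred}$) or when the ensemble heads disagree (epistemic criterion, threshold $\delta_{std}$).

```python
from statistics import mean, stdev

def select_uncertain(step_probs, delta_pred=0.95, delta_std=0.005):
    """Flag steps whose correctness prediction is uncertain.

    step_probs: per-head step-correctness probabilities with shape
    (n_heads, n_steps), e.g. from an ensemble of PRM heads.
    Returns one boolean per step: True = route to the annotator.
    """
    n_steps = len(step_probs[0])
    flags = []
    for j in range(n_steps):
        head_probs = [head[j] for head in step_probs]
        p = mean(head_probs)   # aggregated prediction for step j
        s = stdev(head_probs)  # disagreement across ensemble heads
        # Aleatoric-style criterion: the aggregated prediction is not
        # confidently correct or confidently incorrect.
        ambiguous = max(p, 1.0 - p) < delta_pred
        # Epistemic criterion: the heads disagree with each other.
        disagree = s > delta_std
        flags.append(ambiguous or disagree)
    return flags
```

Setting `delta_std=float("inf")` disables the epistemic criterion and setting `delta_pred=0.5` disables the aleatoric one, mirroring the two ablations above.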
### 5.2 Achieving New SOTA Performance on ProcessBench (75.0%) with Only 6% of the Annotation Cost
<details>
<summary>x5.png Details</summary>

### Visual Description
## Bar Chart: Estimated Annotation Cost Comparison
### Overview
This image presents a bar chart comparing the estimated annotation cost (in generated tokens) for four different methods: "Ours", "Ensemble Prompting", "Math Shepherd", and "Consensus Filtering". The cost is represented on the y-axis, while the methods are displayed on the x-axis. Each bar is labeled with a multiplier indicating the relative cost compared to the "Ours" method.
### Components/Axes
* **X-axis:** Methods - "Ours", "Ensemble Prompting", "Math Shepherd", "Consensus Filtering".
* **Y-axis:** Est. Annotation Cost (Gen. Tokens) - Scale ranges from 0.0e+00 to 2.0e+10.
* **Bars:** Represent the estimated annotation cost for each method.
* **Labels:** Each bar is labeled with a multiplier (e.g., "4.8x", "15.9x", "17.3x") indicating the cost relative to the "Ours" method.
### Detailed Analysis
The bar chart displays the following approximate values:
* **Ours:** The bar is very short, positioned near the bottom of the y-axis. The estimated annotation cost is approximately 2.0e+08 tokens.
* **Ensemble Prompting:** The bar is significantly taller than "Ours". The estimated annotation cost is approximately 4.8 times that of "Ours", or roughly 9.6e+08 tokens (2.0e+08 * 4.8).
* **Math Shepherd:** The bar is much taller than "Ensemble Prompting". The estimated annotation cost is approximately 15.9 times that of "Ours", or roughly 3.2e+09 tokens (2.0e+08 * 15.9).
* **Consensus Filtering:** The tallest bar, indicating the highest estimated annotation cost. The estimated annotation cost is approximately 17.3 times that of "Ours", or roughly 3.5e+09 tokens (2.0e+08 * 17.3).
The bars are arranged horizontally, with "Ours" on the left and "Consensus Filtering" on the right. The y-axis is in scientific notation.
### Key Observations
* The "Ours" method has the lowest estimated annotation cost by a substantial margin.
* "Consensus Filtering" and "Math Shepherd" have significantly higher annotation costs compared to "Ours" and "Ensemble Prompting".
* The annotation cost increases dramatically as you move from "Ours" to "Ensemble Prompting", and then continues to increase with "Math Shepherd" and "Consensus Filtering".
* The difference in cost between "Math Shepherd" and "Consensus Filtering" is relatively small.
### Interpretation
The data suggests that the "Ours" method is the most cost-effective in terms of annotation effort (measured in generated tokens). The other methods, particularly "Consensus Filtering" and "Math Shepherd", require significantly more annotation, potentially due to the complexity of the process or the need for more extensive validation. The multipliers provide a clear indication of the relative cost burden associated with each method.
The chart implies a trade-off between annotation cost and potentially the quality or reliability of the results. While "Ours" is the cheapest, it's unclear from this chart whether it achieves comparable performance to the more expensive methods. The chart focuses solely on the annotation cost, and doesn't provide information about the accuracy, precision, or other relevant metrics of each method.
The use of multipliers is effective in highlighting the relative differences in cost. The large differences in multipliers (4.8x, 15.9x, 17.3x) emphasize the substantial cost implications of choosing one method over another.
</details>
Figure 3: Comparison of estimated annotation costs (generated tokens) between ActPRM and popular methods, including Ensemble Prompting (Tan et al., 2025), MathShepherd (Wang et al., 2024), and Consensus Filtering (Zhang et al., 2025).
Obtaining high-quality process supervision labels is costly. To demonstrate the efficiency of ActPRM, we evaluate it in a one-shot active learning setting. Starting with the model trained in Section 5.1, we select the most uncertain samples from a pool of over 1M unlabeled examples and annotate them using a powerful reasoning model.
Figure 3 compares our estimated labeling costs with those of other real-world datasets for training PRMs, including MathShepherd (Wang et al., 2024), Consensus Filtering (Zhang et al., 2025), and Ensemble Prompting (Tan et al., 2025). Since the training data for Consensus Filtering is not publicly available, we estimate costs based on our data statistics. We introduce our estimation strategy in Appendix C.
Training Qwen2.5-Math-7B-Instruct on our data, the Ensemble Prompting data, the MathShepherd data, and the Consensus Filtering data yields ActPRM, Universal-PRM, Qwen2.5-Math-7B-Math-Shepherd, and Qwen2.5-Math-PRM-7B in Table 1, respectively. We evaluate the models trained on this labeled data on both the ProcessBench and PRMBench benchmarks.
#### 5.2.1 Experimental Settings
Data Filtering with ActPRM. We use Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct to collect over 1 million (1,061,763) Chain-of-Thought (CoT) trajectories from the NuminaMath problem set (Li et al., 2024), after decontamination against the test benchmarks. We then apply ActPRM to filter out high-confidence ( $\delta_{pred}>0.95$ and $\delta_{std}<0.005$ , following Section 5.1) data instances that are unnecessary for training, retaining the remaining data for labeling and training. This process results in a final dataset of 563,030 PRM data points labeled by QwQ-32B, reducing annotation costs by 47.0%.
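As a quick sanity check on the statistics above (an arithmetic illustration only; the variable names are ours):

```python
# Reproduce the filtering statistics reported in the text.
total_trajectories = 1_061_763    # CoT rollouts collected from NuminaMath
retained_for_labeling = 563_030   # uncertain instances labeled by QwQ-32B

saved = 1 - retained_for_labeling / total_trajectories
print(f"annotation cost reduced by {saved:.1%}")  # prints "annotation cost reduced by 47.0%"
```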
Models. Having obtained the dataset, we continually train the ActPRM model from Section 5.1 on this filtered data. In addition, we empirically find that our retained data is generally useful for other PRMs: we also continually train Qwen2.5-Math-PRM-7B (the previous SOTA model on ProcessBench) on our constructed data. The resultant model is named ActPRM-X (X stands for extended).
Benchmarks. We use ProcessBench (Zheng et al., 2024) and PRMBench (Song et al., 2025) to evaluate the effectiveness of our trained models. Different from ProcessBench, which collects intermediate errors from real-world generative models, PRMBench heuristically constructs intermediate errors by manipulating correct steps.
Baselines. We compare with the following PRMs: ❶ Qwen2.5-Math-PRM-7B (Zhang et al., 2025): This model uses consensus filtering for labeling. It labels 860K data points twice using two methods (LLM-as-judge [Zheng et al., 2024] and MathShepherd [Wang et al., 2024]) and filters out the 40% of the data where the labels disagree. ❷ Pure-PRM-7B (Cheng et al., 2025): A Qwen2.5-Math-based PRM trained on PRM800K using a two-stage strategy: warming up the PRM head and then fine-tuning the entire model. ❸ EurusPRM-Stage2 (Cui et al., 2025): A PRM resulting from the Implicit PRM approach (Yuan et al., 2024), which derives process rewards from an ORM. ❹ Universal-PRM (Tan et al., 2025): A Qwen2.5-Math-based model trained with data augmentation techniques such as ensemble prompting and reverse verification. ❺ Math-Shepherd-PRM-7B (Wang et al., 2024): A PRM trained on process labels that estimate hard Q-values for the policy model. ❻ Qwen2.5-Math-7B-Math-Shepherd (Zhang et al., 2025): A PRM trained on 860K data points labeled using MathShepherd. ❼ Ensemble-PRM-PRM800K (ours): A model with ensemble heads trained by ourselves on PRM800K without active learning.
#### 5.2.2 Experimental Results
| Models | GSM8K | MATH | Olympiad Bench | OmniMath | Average F1 |
| --- | --- | --- | --- | --- | --- |
| LLM-as-judge | | | | | |
| o1-Mini ⋄ | 0.932 | 0.889 | 0.872 | 0.824 | 0.879 |
| Deepseek-R1-Distill-32B | 0.817 | 0.739 | 0.659 | 0.585 | 0.700 |
| QwQ-32B | 0.871 | 0.834 | 0.787 | 0.771 | 0.816 |
| Process Reward Models (72B) | | | | | |
| Qwen2.5-Math-PRM-72B ⋄ | 0.873 | 0.806 | 0.743 | 0.711 | 0.783 |
| Process Reward Models (7B+) | | | | | |
| Math-Shepherd-PRM-7B ⋄ | 0.479 | 0.295 | 0.248 | 0.238 | 0.315 |
| Qwen2.5-Math-7B-Math-Shepherd ⋄ | 0.625 | 0.316 | 0.137 | 0.077 | 0.289 |
| EurusPRM-Stage2 ⋄ | 0.473 | 0.357 | 0.212 | 0.209 | 0.313 |
| Qwen2.5-Math-7B-PRM800K ⋄ | 0.683 | 0.626 | 0.507 | 0.443 | 0.565 |
| Ensemble-PRM-PRM800K (ours) | 0.705 | 0.630 | 0.472 | 0.433 | 0.560 |
| PURE-PRM-7B | 0.690 | 0.665 | 0.484 | 0.459 | 0.575 |
| Qwen2.5-Math-PRM-7B ⋄ | 0.824 | 0.776 | 0.675 | 0.663 | 0.735 |
| Universal-PRM | 0.858 | 0.777 | 0.676 | 0.664 | 0.743 |
| ActPRM (ours) | 0.816 | 0.798 | 0.714 | 0.670 | 0.750 |
| ActPRM-X (ours) | 0.827 | 0.820 | 0.720 | 0.673 | 0.760 |
Table 1: Performance comparison on ProcessBench. We report results using the same calculation method as ProcessBench. ⋄ denotes results taken from the Qwen PRM report (Zhang et al., 2025).
| # | Models | Simplicity | Soundness | Sensitivity | Average |
| --- | --- | --- | --- | --- | --- |
| LLM-as-judge | | | | | |
| 1 | Gemini-2.0-thinking-exp-1219 | 0.662 | 0.718 | 0.753 | 0.688 |
| 1 | o1-mini | 0.646 | 0.721 | 0.755 | 0.688 |
| 4 | GPT-4o | 0.597 | 0.709 | 0.758 | 0.668 |
| 6 | Gemini-2.0-flash-exp | 0.627 | 0.673 | 0.754 | 0.660 |
| Process Reward Models (72B) | | | | | |
| 3 | Qwen-2.5-Math-PRM-72B | 0.546 | 0.739 | 0.770 | 0.682 |
| Process Reward Models (7B+) | | | | | |
| 7 | Qwen2.5-Math-PRM-7B | 0.521 | 0.710 | 0.755 | 0.655 |
| 9 | Pure-PRM-7B | 0.522 | 0.702 | 0.758 | 0.653 |
| 7 | ActPRM (ours) | 0.536 | 0.713 | 0.752 | 0.655 |
| 5 | ActPRM-X (ours) | 0.545 | 0.727 | 0.756 | 0.667 |
Table 2: Performance comparison on PRMBench. All results of the other models are from the official leaderboard. # denotes the ranking.
ActPRM and ActPRM-X achieve new SOTA performance on ProcessBench compared with models of the same size. The evaluation results on ProcessBench are shown in Table 1. ActPRM achieves an average F1 score of $0.750$ , outperforming Qwen2.5-Math-PRM-7B by a margin of 1.5%. Furthermore, ActPRM-X, trained from Qwen2.5-Math-PRM-7B, achieves a new SOTA on ProcessBench with an average F1 of 0.760, outperforming the second-place model (Universal-PRM) by a margin of 1.7% and improving on Qwen2.5-Math-PRM-7B by a significant margin of 2.5%.
QwQ-32B (our PRM label annotator) outperforms all PRMs on ProcessBench. As shown in Table 1, QwQ-32B outperforms all PRMs, including 72B models. This indicates the reliability of using QwQ-32B as a PRM label annotator, as it provides a high empirical upper bound for the trained PRMs.
ActPRM-X achieves new SOTA performance on PRMBench, on par with GPT-4o. We further test our models on PRMBench and show the results in Table 2. As shown on the leaderboard, ActPRM achieves the best performance among 7B PRMs, and ActPRM-X achieves new SOTA performance (0.667), outperforming the other PRMs by a margin of at least $1.2\%$ and performing on par with GPT-4o (OpenAI et al., 2024a).
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Chart: Performance Comparison of ActPRM Selection vs. Random Selection
### Overview
The image presents two line charts side-by-side, comparing the performance of "ActPRM Selected" and "Randomly Selected" methods across 2000 steps. The left chart displays the Average F1 Score, while the right chart shows the Training Loss. Both charts share a common x-axis representing "Step".
### Components/Axes
* **Left Chart:**
* X-axis: "Step" (Scale: 500 to 2000, increments of 250)
* Y-axis: "Average F1 Score" (Scale: 0.67 to 0.73, increments of 0.01)
* Legend (bottom-left): "Data"
* "ActPRM Selected" (Red line with circular markers)
* "Randomly Selected" (Blue line with circular markers)
* **Right Chart:**
* X-axis: "Step" (Scale: 0 to 2000, increments of 250)
* Y-axis: "Training Loss" (Scale: 0.08 to 0.16, increments of 0.01)
* Legend (top-right): "Data"
* "ActPRM Selected" (Red line)
* "Randomly Selected" (Blue line)
### Detailed Analysis or Content Details
* **Left Chart (Average F1 Score):**
* **ActPRM Selected (Red):** The line slopes upward, indicating increasing F1 score with increasing steps.
* Step 500: Approximately 0.675
* Step 750: Approximately 0.685
* Step 1000: Approximately 0.698
* Step 1250: Approximately 0.715
* Step 1500: Approximately 0.722
* Step 1750: Approximately 0.728
* Step 2000: Approximately 0.73
* **Randomly Selected (Blue):** The line initially increases, then plateaus and slightly decreases.
* Step 500: Approximately 0.682
* Step 750: Approximately 0.692
* Step 1000: Approximately 0.705
* Step 1250: Approximately 0.708
* Step 1500: Approximately 0.704
* Step 1750: Approximately 0.702
* Step 2000: Approximately 0.701
* **Right Chart (Training Loss):**
* **ActPRM Selected (Red):** The line fluctuates significantly, generally decreasing over time but with substantial variance.
* Step 0: Approximately 0.155
* Step 500: Approximately 0.145
* Step 1000: Approximately 0.135
* Step 1500: Approximately 0.14
* Step 2000: Approximately 0.13
* **Randomly Selected (Blue):** The line also fluctuates, but remains consistently lower than the "ActPRM Selected" line.
* Step 0: Approximately 0.11
* Step 500: Approximately 0.095
* Step 1000: Approximately 0.10
* Step 1500: Approximately 0.11
* Step 2000: Approximately 0.105
### Key Observations
* The "ActPRM Selected" method consistently achieves a higher Average F1 Score than the "Randomly Selected" method.
* The "ActPRM Selected" method exhibits higher Training Loss compared to the "Randomly Selected" method.
* Both methods show decreasing Training Loss over time, but with considerable fluctuations.
* The F1 score for "Randomly Selected" plateaus after step 1000, while "ActPRM Selected" continues to improve.
### Interpretation
The data suggests that while the "ActPRM Selected" method results in a higher F1 score, indicating better predictive performance, it also leads to a higher Training Loss. This could indicate a more complex model that is potentially overfitting to the training data. The "Randomly Selected" method, while achieving a lower F1 score, maintains a lower Training Loss, suggesting a simpler model that generalizes better but may not capture the nuances of the data as effectively. The continued improvement in F1 score for "ActPRM Selected" even after the "Randomly Selected" method plateaus suggests that the former method benefits from continued training, despite the higher loss. The fluctuations in Training Loss for both methods suggest that the training process is not entirely stable and may benefit from techniques like regularization or learning rate scheduling. The difference in loss could also be due to the complexity of the model, or the amount of data used.
</details>
Figure 4: ProcessBench performance (left) and training loss (right): ActPRM vs. random data selection on 1M NuminaMath rollouts.
#### 5.2.3 Comparative Experiment with Random Selection
A potential concern is that while ActPRM achieves state-of-the-art (SOTA) performance on several benchmarks, this success might be attributed solely to the high quality of our collected data pool rather than to the method itself. To address this, we conduct a comparative study with random selection. Specifically, we randomly select 256K data points from our retained dataset as the experimental group. For the control group, we randomly select the same number of data points from the entire data pool (over 1M) and use the same annotator to label any previously unlabeled data (i.e., data not in the retained set). We then continually train the ActPRM checkpoint from Section 5.1 on both datasets. The results, including performance on ProcessBench and training loss, are shown in Figure 4.
ActPRM outperforms random selection of the same amount of data. As illustrated in Figure 4 (left), the model trained on data selected by ActPRM consistently achieves significantly better results than the model trained on randomly selected data. To further validate this, we compare their training losses in Figure 4 (right). The model trained on ActPRM-selected data exhibits a consistently higher training loss, with a margin of 0.05, suggesting that the data selected by ActPRM is more challenging and informative, thereby enhancing the learning process.
## 6 Conclusion and Future Work
In this work, we address the high annotation costs associated with training Process Reward Models (PRMs) by proposing ActPRM, an uncertainty-aware active learning framework that selectively annotates the most informative reasoning steps. By leveraging an ensemble PRM to estimate uncertainty and strategically labeling only uncertain data, ActPRM significantly reduces annotation costs while maintaining competitive performance. Extensive experiments demonstrate that ActPRM achieves a new state-of-the-art (75.0% on ProcessBench) with at most 20% of the labeling budget required by prior methods. Our results highlight the potential of efficient data selection for scalable PRM training, and we commit to releasing all models, datasets, and code to foster further research in this direction.
To further enhance PRM performance, several promising directions can be explored. First, leveraging larger base models and more advanced LLM judges (e.g., o1-mini) could yield significant improvements. Second, implementing the framework in an online setting would enable the PRM to iteratively refine its performance through active learning. Additionally, integrating online PRM training with reinforcement learning frameworks, such as actor-critic methods, presents an exciting avenue for research.
## References
- Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URL https://arxiv.org/abs/1606.06565.
- Cheng et al. (2025) Jie Cheng, Lijun Li, Gang Xiong, Jing Shao, and Yisheng Lv. Stop gamma decay: Min-form credit assignment is all process reward model needs for reasoning, 2025. Notion Blog.
- Coste et al. (2024) Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward Model Ensembles Help Mitigate Overoptimization, 2024.
- Cui et al. (2025) Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- Dwaracherla et al. (2024) Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, and Benjamin Van Roy. Efficient exploration for llms. In International Conference on Machine Learning, 2024.
- Gleave & Irving (2022) Adam Gleave and Geoffrey Irving. Uncertainty Estimation for Language Reward Models, 2022.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874.
- Li et al. (2024) Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf), 2024.
- Li & Li (2024) Wendi Li and Yixuan Li. Process reward model with q-value rankings. arXiv preprint arXiv:2410.11287, 2024.
- Liang et al. (2022) Xinran Liang, Katherine Shu, Kimin Lee, and Pieter Abbeel. Reward Uncertainty for Exploration in Preference-based Reinforcement Learning, 2022.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step, 2023.
- Liu et al. (2024) Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, and Min Lin. Sample-efficient alignment for llms, 2024. URL https://arxiv.org/abs/2411.01493.
- Luo et al. (2024) Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve Mathematical Reasoning in Language Models by Automated Process Supervision, 2024.
- Mehta et al. (2023) Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, and Willie Neiswanger. Sample efficient reinforcement learning from human feedback via active exploration. arXiv preprint arXiv:2312.00267, 2023.
- Melo et al. (2024) Luckeciano C. Melo, Panagiotis Tigas, Alessandro Abate, and Yarin Gal. Deep bayesian active learning for preference modeling in large language models, 2024. URL https://arxiv.org/abs/2406.10023.
- OpenAI et al. (2024a) OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob McGrew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler, Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson, David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler, Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such, Filippo Raso, Francis Zhang, Fred von Lohmann, Freddie Sulit, Gabriel 
Goh, Gene Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao, Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney, Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren, Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian O’Connell, Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick, Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie Kiros, Jan Leike, Jane Park, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh, Ji Lin, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joanne Jang, Joaquin Quinonero Candela, Joe Beutler, Joe Landers, Joel Parish, Johannes Heidecke, John Schulman, Jonathan Lachman, Jonathan McKay, Jonathan Uesato, Jonathan Ward, Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Jos Kraaijeveld, Josh Gross, Josh Kaplan, Josh Snyder, Joshua Achiam, Joy Jiao, Joyce Lee, Juntang Zhuang, Justyn Harriman, Kai Fricke, Kai Hayashi, Karan Singhal, Katy Shi, Kavin Karthik, Kayla Wood, Kendra Rimbach, Kenny Hsu, Kenny Nguyen, Keren Gu-Lemberg, Kevin Button, Kevin Liu, Kiel Howe, Krithika Muthukumar, Kyle Luther, Lama Ahmad, Larry Kai, Lauren Itow, Lauren Workman, Leher Pathak, Leo Chen, Li Jing, Lia Guy, Liam Fedus, Liang Zhou, Lien Mamitsuka, Lilian Weng, Lindsay McCallum, Lindsey Held, Long Ouyang, Louis Feuvrier, Lu Zhang, Lukas Kondraciuk, Lukasz Kaiser, Luke Hewitt, Luke Metz, Lyric Doshi, Mada Aflak, Maddie Simens, Madelaine Boyd, Madeleine Thompson, Marat Dukhan, Mark Chen, Mark Gray, Mark Hudnall, Marvin Zhang, Marwan Aljubeh, Mateusz Litwin, Matthew Zeng, Max Johnson, Maya Shetty, Mayank Gupta, Meghan Shah, Mehmet Yatbaz, Meng Jia Yang, Mengchao Zhong, Mia Glaese, Mianna Chen, Michael Janner, Michael Lampe, Michael Petrov, Michael Wu, Michele Wang, Michelle Fradin, Michelle 
Pokrass, Miguel Castro, Miguel Oom Temudo de Castro, Mikhail Pavlov, Miles Brundage, Miles Wang, Minal Khan, Mira Murati, Mo Bavarian, Molly Lin, Murat Yesildal, Nacho Soto, Natalia Gimelshein, Natalie Cone, Natalie Staudacher, Natalie Summers, Natan LaFontaine, Neil Chowdhury, Nick Ryder, Nick Stathas, Nick Turley, Nik Tezak, Niko Felix, Nithanth Kudige, Nitish Keskar, Noah Deutsch, Noel Bundick, Nora Puckett, Ofir Nachum, Ola Okelola, Oleg Boiko, Oleg Murk, Oliver Jaffe, Olivia Watkins, Olivier Godement, Owen Campbell-Moore, Patrick Chao, Paul McMillan, Pavel Belov, Peng Su, Peter Bak, Peter Bakkum, Peter Deng, Peter Dolan, Peter Hoeschele, Peter Welinder, Phil Tillet, Philip Pronin, Philippe Tillet, Prafulla Dhariwal, Qiming Yuan, Rachel Dias, Rachel Lim, Rahul Arora, Rajan Troll, Randall Lin, Rapha Gontijo Lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Ricky Wang, Rob Donnelly, Rob Honsby, Rocky Smith, Rohan Sahai, Rohit Ramchandani, Romain Huet, Rory Carmichael, Rowan Zellers, Roy Chen, Ruby Chen, Ruslan Nigmatullin, Ryan Cheu, Saachi Jain, Sam Altman, Sam Schoenholz, Sam Toizer, Samuel Miserendino, Sandhini Agarwal, Sara Culver, Scott Ethersmith, Scott Gray, Sean Grove, Sean Metzger, Shamez Hermani, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shino Jomoto, Shirong Wu, Shuaiqi, Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tao Xu, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shadwell, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tom Rubin, Tom Stasi, Tomer Kaftan, Tristan Heywood, Troy Peterson, Tyce Walters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yu Zhang, Yuchen He, 
Yuchen Zhang, Yujia Jin, Yunxing Dai, and Yury Malkov. Gpt-4o system card, 2024a. URL https://arxiv.org/abs/2410.21276.
- OpenAI et al. (2024b) OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Kevin Liu, 
Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Y. Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, and Zhuohan Li. Openai o1 system card, 2024b. URL https://arxiv.org/abs/2412.16720.
- Song et al. (2025) Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, 2025.
- Tan et al. (2025) Xiaoyu Tan, Tianchu Yao, Chao Qu, Bin Li, Minghao Yang, Dakuan Lu, Haozhe Wang, Xihe Qiu, Wei Chu, Yinghui Xu, and Yuan Qi. AURORA: Automated training framework of universal process reward models via ensemble prompting and reverse verification, 2025. URL https://arxiv.org/abs/2502.11520.
- Valdenegro-Toro & Saromo (2022) Matias Valdenegro-Toro and Daniel Saromo. A deeper look into aleatoric and epistemic uncertainty disentanglement, 2022. URL https://arxiv.org/abs/2204.09308.
- Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, 2024.
- Wei et al. (2024) Xiong Wei, Zhang Hanning, Jiang Nan, and Zhang Tong. An implementation of generative PRM, 2024. URL https://github.com/RLHFlow/RLHF-Reward-Modeling.
- Yang et al. (2024) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Yuan et al. (2024) Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024.
- Zhang et al. (2025) Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The Lessons of Developing Process Reward Models in Mathematical Reasoning, 2025.
- Zheng et al. (2024) Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. ProcessBench: Identifying Process Errors in Mathematical Reasoning, 2024.
## Appendix A LLM-as-Judge Prompt Template
For LLM-as-Judge, we follow the prompt in Zhang et al. (2025).
```
I will provide a math problem along with a solution. They will be formatted as follows:

[Math Problem]

<math_problem>
...(math problem)...
</math_problem>

[Solution]

<paragraph_1>
...(paragraph 1 of solution)...
</paragraph_1>

...

<paragraph_n>
...(paragraph n of solution)...
</paragraph_n>

Your task is to review each paragraph of the solution in sequence, analyzing, verifying, and critiquing the reasoning in detail. You need to provide the analyses and the conclusion in the following format:

<analysis_1>
...(analysis of paragraph 1)...
</analysis_1>
...
<analysis_n>
...(analysis of paragraph n)...
</analysis_n>

<conclusion>
Correct / Incorrect
</conclusion>

* When you analyze each paragraph, you should use proper verification, recalculation, or reflection to indicate whether it is logically and mathematically valid. Please elaborate on the analysis process carefully.

* If an error is detected in any paragraph, you should describe the nature and cause of the error in detail, and suggest how to correct the error or the correct approach. Once a paragraph is found to contain any error, stop further analysis of subsequent paragraphs (as they may depend on the identified error) and directly provide the conclusion of "Incorrect". For instance, given a solution of five paragraphs, if an error is found in the third paragraph, you should reply in the following format:

<analysis_1>
...(analysis of paragraph 1)...
</analysis_1>

<analysis_2>
...(analysis of paragraph 2)...
</analysis_2>

<analysis_3>
...(analysis of paragraph 3; since an error is found here, also provide detailed critique and correction guideline)...
</analysis_3>

<conclusion>
Incorrect
</conclusion>

Note that the analyses of paragraphs 4 and 5 should be skipped as paragraph 3 has been found to contain an error.

* Respond with your analyses and conclusion directly.

--------------------------------------------------

The following is the math problem and the solution for your task:

[Math Problem]

{tagged_problem}

[Solution]

{tagged_response}
```
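The placeholders `{tagged_problem}` and `{tagged_response}` are filled with the tagged problem and the solution paragraphs. A minimal Python sketch of the tagging and of parsing the judge's verdict (the helper names and the toy example are our own illustration, not code from the paper):

```python
import re

def tag_problem(problem: str) -> str:
    # Wrap the raw problem text in the <math_problem> tags used by the template.
    return f"<math_problem>\n{problem}\n</math_problem>"

def tag_response(steps: list[str]) -> str:
    # Wrap each solution paragraph in numbered <paragraph_i> tags.
    return "\n".join(
        f"<paragraph_{i}>\n{step}\n</paragraph_{i}>"
        for i, step in enumerate(steps, start=1)
    )

def parse_conclusion(judge_output: str) -> str:
    # Extract the final Correct/Incorrect verdict from the judge's reply.
    m = re.search(r"<conclusion>\s*(Correct|Incorrect)\s*</conclusion>", judge_output)
    return m.group(1) if m else "Unknown"

problem = tag_problem("Compute 1 + 2.")
response = tag_response(["Add the two numbers.", "1 + 2 = 3, so the answer is 3."])
verdict = parse_conclusion("<analysis_1>...</analysis_1>\n<conclusion>\nCorrect\n</conclusion>")
print(verdict)  # prints Correct
```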
## Appendix B More Experiment Results
### B.1 Problem Diversity Is Important for Training PRMs
PRM800K (Lightman et al., 2023) is a widely used, human-annotated dataset for PRM training, containing 800K step-level labels across 75K tree-of-thoughts solutions to 12K problems from MATH (Hendrycks et al., 2021). Our empirical results show that models trained on our dataset (100K samples drawn from 100K diverse questions) consistently and significantly outperform those trained on a deduplicated version of PRM800K (369K samples from only 12K questions; https://huggingface.co/datasets/HuggingFaceH4/prm800k-trl-dedup) on ProcessBench. These findings suggest that problem diversity plays a more crucial role in PRM training than the sheer number of step-level annotations.
| | # Problems | # CoT Trajectories | ProcessBench F1 score |
| --- | --- | --- | --- |
| PRM800K | 7,500 | 460,000 | 0.575 |
| NuminaMath (Random Selected) | 100,000 | 100,000 | 0.673 |
Table 3: Comparison between PRM800K and 100K samples collected from NuminaMath and labeled by Qwen-QwQ.
## Appendix C Annotation Cost Estimation
We estimate the labeling cost based on the statistics of our 1M trajectories collected from NuminaMath (Li et al., 2024) using Qwen2.5-Math-7B-Instruct and Qwen2.5-Math-72B-Instruct. The statistics are given in Table 4.
| | Value | Source |
| --- | --- | --- |
| # Reasoning Steps ( $S$ ) | 8.845 | Qwen Models’ rollouts |
| # Tokens per Rollout ( $R$ ) | 625.098 | Qwen Models’ rollouts |
| # Tokens per Critic Response from Judge ( $C$ ) | 1,919.860 | Qwen-QwQ’s responses as LLM-as-Judge |
Table 4: Statistics of 1M NuminaMath CoT Trajectories collected by Qwen2.5-Math Models.
In addition to these statistics, we use $N$ to denote the number of labeled samples in each model's training dataset; these numbers are listed in Table 5.
| Dataset | # Labeled Data |
| --- | --- |
| ActPRM | 624,000 (labeled in two stages) |
| Qwen2.5-Math-PRM-Math-shepherd | 860,000 |
| Qwen2.5-Math-PRM | 860,000 |
| UniversalPRM | 690,000 |
Table 5: Data number of datasets.
Using these statistics, we compute the estimated labeling cost for ActPRM, Qwen2.5-Math-PRM-Math-shepherd (Zhang et al., 2025), Qwen2.5-Math-PRM (Zhang et al., 2025), and UniversalPRM (Tan et al., 2025) as follows:
- Qwen2.5-Math-PRM-Math-shepherd: $N\times S\times 8\times R/2$ , where $8$ is the number of rollouts per step used in Zhang et al. (2025). We divide by two because the number of tokens per rollout depends on the position of the reasoning step: later steps require fewer completion tokens, so the expected number of tokens per rollout is half the length of a complete rollout.
- Qwen2.5-Math-PRM: $N\times S\times 8\times R/2+N\times C$ . Qwen2.5-Math-PRM applies consensus filtering to each sample, so the cost combines the Math-Shepherd rollouts ( $S\times 8\times R/2$ per sample) and the LLM-as-Judge critique ( $C$ per sample).
- UniversalPRM: $N\times C\times 4+N\times S$ , where 4 is the number of ensemble prompts in the original paper (Tan et al., 2025) and the additional $N\times S$ accounts for its semantics-based step separation.
- ActPRM: $N\times C$ . We use Qwen-QwQ as the judge only, with no additional operations.
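The per-method formulas above can be reproduced in a few lines of Python. This is our own sketch of the arithmetic, with the constants $S$, $R$, $C$, and $N$ copied from Tables 4 and 5:

```python
# Annotation-cost estimates per Appendix C (statistics from Tables 4 and 5).
S = 8.845      # avg. reasoning steps per trajectory
R = 625.098    # avg. tokens per rollout
C = 1919.860   # avg. tokens per critique from the judge

datasets = {   # N: number of labeled samples per method (Table 5)
    "ActPRM": 624_000,
    "Qwen2.5-Math-PRM-Math-shepherd": 860_000,
    "Qwen2.5-Math-PRM": 860_000,
    "UniversalPRM": 690_000,
}

def est_cost(name: str, n: int) -> float:
    # Each branch mirrors the corresponding formula in the bullet list above.
    if name == "Qwen2.5-Math-PRM-Math-shepherd":
        return n * S * 8 * R / 2
    if name == "Qwen2.5-Math-PRM":
        return n * S * 8 * R / 2 + n * C
    if name == "UniversalPRM":
        return n * C * 4 + n * S
    if name == "ActPRM":
        return n * C
    raise ValueError(f"unknown method: {name}")

for name, n in datasets.items():
    print(f"{name}: {est_cost(name, n):.3e} generation tokens")
```

With these numbers, Qwen2.5-Math-PRM's estimated cost comes out roughly 17 times ActPRM's, consistent with the 17.3x gap highlighted in Figure 1.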