# Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning
## Abstract
While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies two key OOD issues: step OOD, caused by differences in reasoning patterns across model types and sizes, and question OOD, which arises from dataset shifts between training data and real-world problems. To address these issues, we introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework built around a two-stage retrieval-enhanced mechanism. RetrievalPRM retrieves semantically similar questions and steps as a warmup, enhancing the PRM's ability to evaluate target steps and improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.
Jiachen Zhu 1, Congmin Zheng 1, Jianghao Lin 1, Kounianhua Du 1, Ying Wen 1*, Yong Yu 1, Jun Wang 2, Weinan Zhang 1 (*Corresponding author)
1 Shanghai Jiao Tong University, 2 University College London
{gebro13,desp.zcm,chiangel,kounianhuadu,ying.wen,wnzhang}@sjtu.edu.cn, yyu@apex.sjtu.edu.cn, jun.wang@cs.ucl.ac.uk
## 1 Introduction
While large language models (LLMs) have advanced mathematical reasoning OpenAI (2023); Dubey et al. (2024); Zhu et al. (2024); Shao et al. (2024); Yang et al. (2024b), they remain prone to critical flaws: explicit errors (e.g., miscalculations, logical inconsistencies) and implicit risks where correct answers mask flawed intermediate steps. Even when final results are accurate, LLMs often generate plausible-but-incorrect reasoning chains, eroding trust in their problem-solving processes Lightman et al. (2023). To address this, Process Reward Models (PRMs) Lightman et al. (2023); Wang et al. (2024b) have been developed to rigorously evaluate the logical validity of intermediate steps Cobbe et al. (2021), mirroring human pedagogical practices that prioritize reasoning quality over answer correctness.
Figure 1: The distribution differences across three datasets: GSM8K, MATH, and OlympiadBench. We use Sentence-BERT to encode the questions and perform t-SNE visualization.
Existing works Wang et al. (2024a); o1 Team (2024); Zheng et al. (2024) frame PRM as a binary classification problem. They train PRMs on open-source base LLMs such as Qwen Yang et al. (2024b) or Llama Dubey et al. (2024) using human-annotated datasets Lightman et al. (2023) or automated process supervision methods Wang et al. (2024b); Luo et al. (2024); Qin et al. (2024). Although these approaches show strong performance and empirical success, they still face various out-of-distribution challenges. We believe the out-of-distribution (OOD) problem can be viewed from the following perspectives:
Figure 2: Processes and problem-solving strategies for the same question vary across models along two perspectives: model type and model size. GPT tends to analyze and calculate, while Qwen-72B tends to solve equations. Qwen-1.5B, being small and relatively weak, can only enumerate; its reasoning chain is short and its answer is wrong.
Firstly, Step OOD may occur because different models generate different processes. Due to the high cost of manual annotation, there are very few accurately labeled PRM expert datasets, such as PRM800K and ProcessBench, with processes generated by GPT OpenAI (2023) and Qwen Yang et al. (2024b), respectively. However, different model types (e.g., GPT, Qwen, Llama Dubey et al. (2024)) approach problem-solving differently. As shown in Figure 2, when facing the same question, GPT-4o tends to analyze and calculate, while Qwen-72B tends to set up and solve equations; they have different solution styles. Therefore, using process data generated by one model to train a PRM and then applying it to guide another model leads to an OOD issue. Moreover, models of different sizes also exhibit different reasoning processes. Larger models, like exceptional students, tend to have clearer and more accurate reasoning steps, while smaller models tend to have very short reasoning chains, as shown in Figure 2.
Secondly, Question OOD emerges because of dataset shift. Current PRM datasets contain only a limited number of problems. For example, Math-Shepherd and PRM800K cover problems from the GSM8K and MATH datasets, with GSM8K at the elementary to middle school level and MATH at the high school to university level. However, real-world problems are far more diverse, such as those in the Olympiad math competition dataset He et al. (2024), leading to OOD issues on other datasets. As shown in Figure 1, we used Sentence-BERT Reimers (2019) to encode all the problems from the three datasets and visualized the distribution with t-SNE. It is evident that the distributions differ, and since both OlympiadBench and MATH problems are typically from high school-level exams, they are semantically closer to each other than to GSM8K.
To address this issue, we propose a new framework, the Retrieval-Augmented Process Reward Model (RetrievalPRM), which leverages a Two-stage Retrieval-enhanced Mechanism to help PRMs solve the OOD problem. We retrieve relevant questions and steps in these two stages to address question OOD and step OOD, respectively. Specifically, when predicting a step for a given question, we select semantically similar questions based on their embeddings, placing them at the beginning of the prompt. Additionally, we select more fine-grained, similar steps and use them as references when predicting the correctness of the target step. These retrieved questions and steps serve as a kind of warmup for the PRM, acting as example problems for reference. They not only help stimulate the PRM's potential by warming up but also allow the system to handle more difficult problems by identifying similarities, thus alleviating OOD issues.
Our main contributions are summarized as follows:
- To the best of our knowledge, we are the first to highlight the key OOD problems in Process Reward Models (PRMs), particularly the question OOD and step OOD, which arise due to differences in reasoning patterns across model types (e.g., GPT, Qwen), model sizes (1.5B, 72B) and varying problem difficulties in real-world datasets.
- We introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM) framework, which utilizes a Two-stage Retrieval-enhanced Mechanism to address OOD issues by incorporating both Question-level Retrieval and Step-level Retrieval, thereby enhancing PRM’s ability to generalize across diverse problem-solving scenarios.
- We build a Retrieval-enhanced dataset for training PRMs using the RetrievalPRM framework. Our code (https://anonymous.4open.science/r/RetrievalPRM-1C77), dataset (https://huggingface.co/datasets/gebro13/RetrievalPRM_Dataset), and model (https://huggingface.co/gebro13/RetrievalPRM) are open-sourced.
- Extensive experiments on ProcessBench Zheng et al. (2024), which covers four public real-world datasets, demonstrate that RetrievalPRM outperforms strong baselines and that the out-of-distribution issue is alleviated by our retrieval approach.
## 2 Preliminary
In this section, we formulate the whole problem and introduce PRM as a binary classification model.
### 2.1 Problem Formulation
We denote the math dataset as $D=\{(q_i,s_i,y_i)\}_{i=1}^N$, where $N$ is the number of data instances. The input $q_i$ is the $i^{th}$ math question. $s_i=\{s^1_i,s^2_i,\dots,s^{n_i}_i\}$ are the solution steps, where $n_i$ is the number of steps in solution $s_i$. $y_i=\{y^1_i,y^2_i,\dots,y^{n_i}_i\}$ are the labels, and the label $y^j_i$ indicates the correctness of the steps from the $1^{st}$ to the $j^{th}$.
$$
y^j_i=\begin{cases}1, & (s^1_i,\dots,s^j_i)~\text{is correct for}~q_i;\\
0, & \text{otherwise.}\end{cases} \tag{1}
$$
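For concreteness, a minimal sketch (our own illustration, not the paper's data pipeline) of how these labels follow from the position of the first erroneous step:

```python
def step_labels(num_steps: int, first_error: int | None) -> list[int]:
    """y^j = 1 while steps 1..j are all correct, 0 afterwards (Eq. 1).

    first_error is the 1-indexed position of the first wrong step,
    or None if the whole solution is correct.
    """
    if first_error is None:
        return [1] * num_steps
    return [1 if j < first_error else 0 for j in range(1, num_steps + 1)]

# Example: a 6-step solution whose first mistake occurs at step 4.
assert step_labels(6, 4) == [1, 1, 1, 0, 0, 0]
```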
### 2.2 ORM vs. PRM
Outcome-supervised Reward Models (ORMs) are introduced by Cobbe et al. (2021), where verifiers are trained to judge the final correctness of generated solutions. An ORM only predicts the final label $\hat{y}^{n_i}_i$, which can be formulated as
$$
\forall i,\quad \hat{y}^{n_i}_i=\mathrm{ORM}(q_i,s^1_i,\dots,s^{n_i}_i). \tag{2}
$$
Building on this, the concept of process reward models (PRMs) is introduced as a more granular and transparent approach. Not only does a PRM evaluate the final solution, but it also assesses intermediate processes, where $\hat{y}^j_i$ represents the predicted label for the $j^{th}$ step by the PRM.
$$
\forall i,j,\quad \hat{y}^j_i=\mathrm{PRM}(q_i,s^1_i,\dots,s^j_i). \tag{3}
$$
### 2.3 Large Language Model for PRM scoring
When directly adopting LLMs as the PRM for scoring, we need to convert the data $(q_i,s_i,y_i)$ with a hard prompt template. The whole template example is illustrated in Appendix B.2.
The textual input consists of the question $q_i$ and steps $s_i$ , followed by a binary question about the correctness of these steps.
To obtain a floating-point correctness estimate $\hat{y}_i^j\in[0,1]$ instead of the discrete word tokens '+' or '-', we apply a bidimensional softmax over the corresponding logits of the binary key answer tokens (i.e., '+' and '-') from the LLM during evaluation:
$$
\hat{y}_i^j=\frac{\exp(l_{i,+})}{\exp(l_{i,+})+\exp(l_{i,-})}\in(0,1). \tag{4}
$$
where $l_{i,+}$ and $l_{i,-}$ are the logits of tokens '+' and '-' for the $i^{th}$ instance, respectively.
It is important to note that the estimated PRM score $\hat{y}_i^j$ is used solely for evaluation on the testing set. If training is involved, we maintain the standard instruction tuning and causal language modeling paradigm for LLMs. In this way, we do not need to replace the language modeling head (the last layer of the LLM) with a binary classification head.
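As a minimal sketch of this scoring procedure, assuming a Hugging Face-style causal LM whose tokenizer maps '+' and '-' to single tokens (the model name below is illustrative, not the paper's exact backbone):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical backbone; any causal LM with single-token '+' / '-' works here.
model_name = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def prm_step_score(prompt: str) -> float:
    """Return P('+' | prompt) over the two binary answer tokens, as in Eq. (4)."""
    plus_id = tokenizer.convert_tokens_to_ids("+")
    minus_id = tokenizer.convert_tokens_to_ids("-")

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # (1, seq_len, vocab)
    last = logits[0, -1]                       # logits for the next token

    # Softmax restricted to the '+' and '-' logits.
    probs = torch.softmax(torch.stack([last[plus_id], last[minus_id]]), dim=0)
    return probs[0].item()
```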
## 3 Methodology
In this section, we introduce our proposed RetrievalPRM framework in detail.
Figure 3: The model structure of our proposed RetrievalPRM framework and its difference with traditional PRM. We design a Two-stage Retrieval Module to retrieve reference questions and steps in each stage.
### 3.1 Overview of RetrievalPRM
The RetrievalPRM is developed to address the problem of out-of-distribution (OOD) scenarios in mathematical problem-solving, specifically focusing on both question OOD and step OOD. According to Figure 3, traditional PRM models are constrained by predefined solution steps and are unable to handle unseen questions or steps effectively, especially when the problem context shifts or the solution process deviates from previously seen examples. RetrievalPRM overcomes this challenge by incorporating a Two-stage Retrieval-enhanced Mechanism that dynamically fetches relevant questions and steps from a large pool of questions and their solutions. These retrieved questions and steps serve as a kind of warm-up for PRM, acting as example problems for reference. They not only help stimulate PRM’s potential by warming up but also allow the system to handle more difficult problems by identifying similarities.
### 3.2 Two-stage Retrieval-enhanced Mechanism
The core of RetrievalPRM is the Two-stage Retrieval-enhanced Mechanism, which consists of two key phases: Question-level Retrieval and Step-level Retrieval.
#### 3.2.1 Question-level Retrieval
The first stage of retrieval tackles the question OOD issue. As shown in Figure 3, the retrieval pool is the question database $D_q=\{q_i\}_{i=1}^N$. During the retrieval process, we treat:
- Query: the target question $q_t$ .
- Key: all $q_i$ in the retrieval pool.
- Value: all the $(q_i,s_i)$ pair in the retrieval pool.
We calculate the similarities $\langle q_i,q_t\rangle$ to match the most similar questions. Specifically, all questions first pass through a Sentence-BERT model to obtain their semantic representations:
$$
\{e_{q_i}\}_{i=1}^N=\mathrm{SentenceBERT}(\{q_i\}_{i=1}^N) \tag{5}
$$
where $e_{q_i}\in\mathbb{R}^D$ is the embedding vector of question $q_i$.
Then all the embeddings undergo Principal Component Analysis (PCA) Kurita (2021) for dimensionality reduction to extract the most important dimensions.
$$
\{e^\prime_{q_i}\}_{i=1}^N=\mathrm{PCA}(\{e_{q_i}\}_{i=1}^N) \tag{6}
$$
where $e^\prime_{q_i}\in\mathbb{R}^d$ is the embedding after dimensionality reduction.
Finally, we compute the cosine similarity between the target question and the entire question pool, selecting the top-$k$ most similar questions and inputting them into the text.
$$
\langle q_i,q_t\rangle=\frac{e^\prime_{q_t}\cdot e^\prime_{q_i}}{\|e^\prime_{q_t}\|\,\|e^\prime_{q_i}\|}. \tag{7}
$$
Now we sort the similarity vector $\{\langle q_i,q_t\rangle\}_{i=1}^N$ and choose the top-$k$ $(q_i,s_i)$ pairs as reference questions $q_r$, putting them in RetrievalPRM's input together with the target question. Furthermore, we store all the solutions $\{s_i\}_{i=1}^m$ of the top-$m$ ($m>k$) questions in a new database to conduct a further step-level retrieval.
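The question-level retriever described above can be sketched as follows, using sentence-transformers and scikit-learn; the encoder name and the reduced dimension are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed Sentence-BERT variant

def build_question_index(questions: list[str], d: int = 64):
    """Encode the question pool (Eq. 5) and reduce dimensions with PCA (Eq. 6)."""
    emb = encoder.encode(questions)
    pca = PCA(n_components=min(d, len(questions))).fit(emb)
    return pca, pca.transform(emb)

def retrieve_questions(target_q: str, pca, pool_emb: np.ndarray, top_k: int = 2):
    """Return (indices of the top-k pool entries by cosine similarity, all sims), Eq. (7)."""
    q = pca.transform(encoder.encode([target_q]))[0]
    sims = pool_emb @ q / (np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(q) + 1e-8)
    return np.argsort(-sims)[:top_k], sims
```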
#### 3.2.2 Step-level Retrieval
We place step-level retrieval in the second stage of the two-stage retrieval process, rather than as a separate module, for two key reasons:
Firstly, for a solution to be meaningful, both the question and the steps must be similar. For example, two different types of questions might both use the letter "p" to represent an unknown variable, but in some problems, "p" represents a prime number, while in others, it represents probability. This results in steps that may appear similar but have entirely different meanings, rendering the retrieved steps potentially unhelpful.
Secondly, since there are many possible solutions to a question, the number of candidate steps is large. If the majority of these steps are irrelevant, the time spent computing similarities over all of them is wasted. By placing step-level retrieval in the second stage, we save both time and computational resources.
Therefore, after retrieving the top-$m$ most similar questions, we inject all their solutions into a new step database $D_s$. Then, we use the target step as the query to retrieve reference steps from this new database. The similarity for retrieval is still calculated using Sentence-BERT, PCA, and cosine similarity, as described in Section 3.2.1.
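The two-stage composition can then be sketched as below; this is an illustrative implementation under our own assumptions (hypothetical helper names, a fresh PCA fit per call), where `solutions[i]` is assumed to hold the list of solution steps for question `i`:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed Sentence-BERT variant

def top_k_similar(query: str, pool: list[str], k: int) -> list[int]:
    """Encode query and pool, reduce with PCA, and return indices of the top-k
    pool entries by cosine similarity (same machinery as Sec. 3.2.1)."""
    emb = encoder.encode([query] + pool)
    emb = PCA(n_components=min(64, len(pool))).fit_transform(emb)
    q, pool_emb = emb[0], emb[1:]
    sims = pool_emb @ q / (np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(q) + 1e-8)
    return np.argsort(-sims)[:k].tolist()

def two_stage_retrieve(target_q, target_step, questions, solutions, m=8, k=2):
    """Stage 1: retrieve the top-m similar questions.
    Stage 2: pool their solution steps into D_s and retrieve top-k reference steps."""
    top_m = top_k_similar(target_q, questions, m)
    step_pool = [step for i in top_m for step in solutions[i]]  # new step database D_s
    ref_steps = top_k_similar(target_step, step_pool, k)
    return [questions[i] for i in top_m[:k]], [step_pool[j] for j in ref_steps]
```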
### 3.3 Retrieval-based System Prompt
In RetrievalPRM, the system prompt serves as the instruction set for the model, framing the problem and directing it to evaluate each step of the solution. Beyond the traditional system prompt for PRM, the Retrieval-based System Prompt (RetSP) is extended with additional instructions, shown as the red sentence in Figure 3, which encourage the model to leverage knowledge from the reference questions. For example, we inform the PRM that step labels "+" and "-" represent correct and incorrect steps, respectively. At the same time, to avoid noise, we specify that if the reference question or step contains no relevant or helpful information, it should not be considered. These retrieval-based system prompts give the PRM a more flexible thinking process, enabling it to actively decide whether to use retrieval-based knowledge.
We define the reference questions of $q_i$ as $q_i^r$ and the reference steps as $s_i^r$. The whole input $x_i^j$ for predicting the $j^{th}$ step of $q_i$ in RetrievalPRM can be formulated as:
$$
x^j_i=(\mathrm{RetSP},\,q^r_i,\,q_i,\,s^1_i,\dots,s^{j-1}_i,\,s^r_i,\,s^j_i,\,y^j_i),\qquad \hat{y}^j_i=\mathrm{PRM}(x^j_i) \tag{8}
$$
where $s^j_i$ is the $j^{th}$ step of solution $s_i$.
According to the input template above, it is worth noting that when predicting step $n$, we assume that steps $1$ through $n-1$ are correct Luo et al. (2024); Zheng et al. (2024). At this point, the most important task for the PRM is to predict step $n$, so the PRM can only access the reference steps for step $n$ and cannot see the reference steps for steps $1\sim n-1$.
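To make this input layout concrete, below is an illustrative prompt assembler for Eq. (8); the system prompt wording is paraphrased from Figure 3, and the exact template used in the paper is given in its Appendix B.2:

```python
RET_SP = (
    "I want you to act as a math teacher and judge whether the target step is correct. "
    "First I will give you some similar questions and their steps for reference; "
    "reference steps labeled '+' are correct and steps labeled '-' are wrong. "
    "If the references are not relevant or helpful, answer by yourself."
)

def build_prm_input(ref_questions, target_q, prev_steps, ref_steps, target_step):
    """Assemble (RetSP, q^r, q, s^1..s^{j-1}, s^r, s^j) as in Eq. (8)."""
    parts = [RET_SP]
    parts += [f"Reference Question {i+1}: {q}" for i, q in enumerate(ref_questions)]
    parts.append(f"Question: {target_q}")
    parts += [f"Step {i+1}: {s}" for i, s in enumerate(prev_steps)]
    parts += [f"Reference Step {i+1}: {s}" for i, s in enumerate(ref_steps)]
    parts.append(f"Target Step: {target_step}")
    parts.append("Is the target step correct? Answer '+' or '-'.")
    return "\n".join(parts)
```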
## 4 Experiments
Table 1: The performance of different models on ProcessBench. The best result is given in bold, and the second-best value is underlined. See Table 3 in Appendix D for a breakdown of evaluation results.
| Category | Model | GSM8K ArithACC | GSM8K F1 | MATH ArithACC | MATH F1 | OlympiadBench ArithACC | OlympiadBench F1 | OmniMATH ArithACC | OmniMATH F1 | Avg. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-source PRM | RetrievalPRM-7B (Ours) | 76.0 | 74.6 | 70.6 | 71.1 | 59.1 | 60.2 | 55.2 | 57.3 | 65.8 |
| | Qwen2.5-Math-7B-PRM800K | 73.5 | 68.2 | 65.1 | 62.6 | 53.2 | 50.7 | 43.4 | 44.3 | 56.5 |
| | Skywork-PRM-7B | 71.6 | 70.8 | 54.5 | 53.6 | 25.6 | 22.9 | 23.7 | 21.0 | 42.1 |
| | Skywork-PRM-1.5B | 59.9 | 59.0 | 49.1 | 48.0 | 20.5 | 19.3 | 19.7 | 19.2 | 36.4 |
| | Math-Shepherd-PRM-7B | 58.3 | 47.9 | 45.1 | 29.5 | 39.7 | 24.8 | 34.8 | 23.8 | 31.5 |
| | RLHFlow-PRM-Mistral-8B | 62.3 | 50.4 | 42.1 | 33.4 | 22.3 | 13.8 | 19.1 | 15.8 | 28.4 |
| | RLHFlow-PRM-Deepseek-8B | 56.9 | 38.8 | 45.1 | 33.8 | 26.5 | 16.9 | 23.2 | 16.9 | 26.6 |
| Language Models as Critic | QwQ-32B-Preview | 87.9 | 88.0 | 78.5 | 78.7 | 59.2 | 57.8 | 61.1 | 61.3 | 71.5 |
| | GPT-4o | 80.2 | 79.2 | 63.4 | 63.6 | 50.1 | 51.4 | 50.1 | 53.5 | 61.9 |
| | Qwen2.5-72B-Instruct | 77.9 | 76.2 | 65.4 | 61.8 | 59.8 | 54.6 | 55.1 | 52.2 | 61.2 |
| | Llama-3.3-70B-Instruct | 83.7 | 82.9 | 63.7 | 59.4 | 54.3 | 46.7 | 51.0 | 43.0 | 58.0 |
| | Qwen2.5-Coder-32B-Instruct | 72.0 | 68.9 | 64.5 | 60.1 | 57.0 | 48.9 | 52.5 | 46.3 | 56.1 |
| | Llama-3.1-70B-Instruct | 75.3 | 74.9 | 52.6 | 48.2 | 50.0 | 46.7 | 43.2 | 41.0 | 52.7 |
| | Qwen2.5-14B-Instruct | 72.3 | 69.3 | 59.2 | 53.3 | 50.2 | 45.0 | 43.5 | 41.3 | 52.2 |
| | Qwen2-72B-Instruct | 67.8 | 67.6 | 52.3 | 49.2 | 43.3 | 42.1 | 39.3 | 40.2 | 49.8 |
| | Qwen2.5-32B-Instruct | 70.6 | 65.6 | 61.9 | 53.1 | 53.5 | 40.0 | 47.7 | 38.3 | 49.3 |
| | Qwen2.5-Math-72B-Instruct | 70.3 | 65.8 | 59.6 | 52.1 | 56.1 | 32.5 | 55.1 | 31.7 | 45.5 |
| | Qwen2.5-Coder-14B-Instruct | 61.9 | 50.1 | 54.2 | 39.9 | 51.4 | 34.0 | 55.6 | 27.3 | 37.8 |
| | Qwen2.5-7B-Instruct | 37.8 | 36.5 | 36.9 | 36.6 | 29.9 | 29.7 | 27.3 | 27.4 | 32.6 |
| | Meta-Llama-3-70B-Instruct | 62.4 | 52.2 | 48.3 | 22.8 | 46.2 | 21.2 | 44.8 | 20.0 | 29.1 |
| | Qwen2.5-Math-7B-Instruct | 54.4 | 26.8 | 50.3 | 25.7 | 43.1 | 14.2 | 41.6 | 12.7 | 19.9 |
| | Qwen2-7B-Instruct | 25.1 | 8.4 | 20.4 | 19.0 | 16.1 | 14.7 | 13.8 | 12.1 | 13.6 |
| | Meta-Llama-3-8B-Instruct | 27.1 | 13.1 | 17.3 | 13.8 | 14.2 | 4.8 | 19.7 | 12.6 | 11.1 |
| | Qwen2.5-Coder-7B-Instruct | 49.1 | 14.3 | 46.3 | 6.5 | 47.2 | 4.1 | 48.9 | 1.8 | 6.7 |
| | Llama-3.1-8B-Instruct | 27.3 | 10.9 | 20.5 | 5.1 | 16.0 | 2.8 | 15.0 | 1.6 | 5.1 |
Table 2: The performance of different variants of RetrievalPRM on ProcessBench. We remove different components of RetrievalPRM to evaluate the contribution of each part to the model. The best result is given in bold, and the second-best value is underlined. See Table 4 in Appendix D for a breakdown of evaluation results.
| Question-level | Step-level | GSM8K ArithACC | GSM8K F1 | MATH ArithACC | MATH F1 | OlympiadBench ArithACC | OlympiadBench F1 | OmniMATH ArithACC | OmniMATH F1 | Avg. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | ✓ | 76.0 | 74.6 | 70.6 | 71.1 | 59.1 | 60.2 | 55.2 | 57.3 | 65.8 |
| ✓ | × | 77.8 | 74.9 | 70.7 | 71.2 | 58.4 | 59.8 | 50.5 | 54.4 | 65.0 |
| × | ✓ | 73.8 | 67.5 | 69.5 | 69.2 | 58.2 | 58.9 | 52.2 | 56.3 | 63.0 |
| × | × | 71.0 | 65.6 | 67.3 | 67.5 | 54.3 | 55.8 | 47.2 | 50.9 | 59.9 |
Figure 4: We show the F1 scores of RetrievalPRM on four datasets and their average as the number of retrieval questions varies. Specifically, Top-0 means no retrieval questions.
In this section, we present the experimental settings and results. Our implementation code of RetrievalPRM is publicly available.
### 4.1 Experiment Setup
#### 4.1.1 Datasets
Datasets are categorized into two kinds: math reasoning datasets and PRM training datasets.
Math Reasoning Datasets
We conduct experiments on four public and widely used datasets in mathematical reasoning tasks: GSM8K Cobbe et al. (2021), which contains math problems from elementary to middle school level; MATH Hendrycks et al. (2021), which contains math problems from basic to university level; OlympiadBench He et al. (2024), which involves questions from the Mathematical Olympiad; and Omni-MATH Gao et al. (2024b), which covers multi-domain high-difficulty problems. Further details are provided in Appendix C.
Except for GSM8K, which focuses on grade school math problems, the other three datasets feature problems of competition or Olympiad-level difficulty.
PRM Training Datasets
We conduct experiments on two publicly available datasets for PRM:
PRM800K Lightman et al. (2023): Based on the MATH dataset, it contains 800,000 manually annotated step-level correctness labels for training the Process Reward Model. It relies on expensive manual annotations.
Math-Shepherd Wang et al. (2024b): It generates 400,000 machine-annotated step-level labels (covering MATH and GSM8K datasets) by automatically building process supervision data, without manual annotation.
#### 4.1.2 Evaluation Metrics
We evaluate our model on the public PRM benchmark ProcessBench Zheng et al. (2024), whose aim is to judge whether a PRM can find the first wrong step. It divides the data into two parts, samples with incorrect final answers and samples with correct final answers, and takes the harmonic mean of the accuracies on these two parts as the final F1 score. Moreover, since the sample numbers of the two parts are not balanced, we add an additional metric: the sample-weighted arithmetic mean of the two accuracies, reported as ArithACC in Table 1.
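Concretely, the F1 score is the harmonic mean of the two subset accuracies, and ArithACC is their sample-weighted arithmetic mean; a minimal sketch:

```python
def processbench_metrics(acc_err: float, acc_corr: float, n_err: int, n_corr: int):
    """F1: harmonic mean of the accuracies on the erroneous and correct subsets.
    ArithACC: arithmetic mean of the two accuracies weighted by subset sizes."""
    f1 = 2 * acc_err * acc_corr / (acc_err + acc_corr) if (acc_err + acc_corr) else 0.0
    arith_acc = (n_err * acc_err + n_corr * acc_corr) / (n_err + n_corr)
    return f1, arith_acc

# Hypothetical example: accuracy 0.70 on 300 wrong-answer samples and
# 0.80 on 100 correct-answer samples gives f1 ≈ 0.747 and arith_acc = 0.725.
f1, arith = processbench_metrics(0.70, 0.80, 300, 100)
```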
#### 4.1.3 Baselines
Following Zheng et al. (2024), we divide all baselines into two parts:
(1) Open-source PRM, including Skywork o1 Team (2024), Qwen2.5-PRM Zheng et al. (2024), Math-Shepherd Wang et al. (2024b) and RLHFlow Xiong et al. (2024). These models are binary classification PRMs.
(2) Language Models as Critic, including Llama Dubey et al. (2024), Qwen2 Yang et al. (2024b), Qwen2.5 Team (2024), Qwen2.5-MATH Yang et al. (2024a), Qwen2.5-Coder Hui et al. (2024), and GPT-4o OpenAI et al. (2024). These models are prompted to judge the steps with the help of majority voting.
Further details of these baselines are provided in Appendix A due to article length limitations.
#### 4.1.4 Implementation Details
Details like base models, hyperparameters, prompts, and training sizes are provided in Appendix B due to the article length limitations.
### 4.2 Overall Performance
We evaluate RetrievalPRM against existing baselines on ProcessBench, and the results are presented in Table 1. The findings are as follows:
- RetrievalPRM-7B surpasses all open-source PRM baselines, achieving the highest performance. Notably, the most significant improvement is observed on OmniMATH, the most challenging dataset, with performance gains increasing as dataset difficulty rises. This phenomenon may stem from the fact that most baseline PRMs are trained on human- or machine-annotated datasets such as PRM800K or Math-Shepherd, which primarily focus on GSM8K or MATH and exhibit OOD issues when applied to more complex datasets. In contrast, our RetrievalPRM effectively mitigates the OOD problem through its retrieval-based approach, demonstrating the efficacy of our Two-stage Retrieval-enhanced Mechanism.
- When comparing models of different scales, RetrievalPRM outperforms all evaluated language models, including Qwen2.5-72B-Instruct and Llama3.3-70B-Instruct, with the sole exception of QwQ-32B-Preview. Remarkably, RetrievalPRM achieves this with a model size of just 7B. This highlights that PRMs, being both lightweight and task-specific, maintain strong competitiveness and potential compared to LLMs as critics.
### 4.3 Ablation Study
We analyze the two main components of the Two-stage Retrieval-enhanced Mechanism, Question-level Retrieval and Step-level Retrieval, through the following ablations:
- RetrievalPRM (Ours): The complete version of our proposed method.
- RetrievalPRM (w/o Step-level Retrieval): This variant retains only Question-level Retrieval, removing Step-level Retrieval during both training and inference.
- RetrievalPRM (w/o Question-level Retrieval): This variant retains only Step-level Retrieval, removing Question-level Retrieval during both training and inference.
- RetrievalPRM (w/o Question-level and Step-level Retrieval): In this variant, both Question-level and Step-level Retrieval are removed during training and inference.
The performance of these variants is presented in Table 2, from which we can draw the following observations:
- The performance of RetrievalPRM (w/o Step-level Retrieval) remains almost identical to that of RetrievalPRM on GSM8K and MATH but exhibits a slight decline on OlympiadBench and OmniMATH. This can be attributed to the fact that Step-level Retrieval information is partially absorbed by Question-level Retrieval. As a result, Question-level Retrieval alone may be sufficiently effective for relatively easy datasets, as the reference steps it provides contain adequate knowledge for step prediction. However, for more challenging datasets, Step-level Retrieval becomes significantly more crucial, as it offers finer-grained guidance essential for handling complex problem-solving processes.
- RetrievalPRM (w/o Question-level Retrieval) shows lower performance, as it relies solely on Step-level Retrieval. The model lacks knowledge of reference questions, which helps alleviate question OOD, and this restricts its overall performance.
- RetrievalPRM (w/o both Retrieval) performs the worst, as expected, demonstrating the effectiveness of both Question-level and Step-level Retrieval.
### 4.4 Hyperparameter Study
Figure 4 illustrates the impact of the number of retrieval questions on the model’s performance. The findings are as follows:
Compared to Top-0, where no retrieval questions are used, models that incorporate retrieval questions show improved performance, highlighting the importance of Question-level Retrieval. This suggests that reference questions provide a useful warmup for the PRM, regardless of how many are retrieved.
The performance of Top-3 exhibits a slight decline, potentially due to two factors: (1) An excessive number of reference questions may lead to an overly long input prompt, making it difficult for PRMs to comprehend or extract key information effectively. (2) A limited retrieval pool might result in later reference questions being less relevant than earlier ones, increasing the likelihood of misjudgments in the model’s predictions.
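As a complementary illustration of Question-level Retrieval with a varying top-k, the sketch below retrieves the k most similar pool questions by cosine similarity over Sentence-BERT embeddings; the encoder checkpoint and the default k are illustrative assumptions rather than our exact settings.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder checkpoint; the paper only specifies Sentence-BERT.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def retrieve_top_k(target_question: str, pool: list[str], k: int = 2) -> list[str]:
    """Return the k pool questions most similar to the target question,
    ranked by cosine similarity of their sentence embeddings."""
    pool_emb = encoder.encode(pool, convert_to_tensor=True)
    query_emb = encoder.encode(target_question, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_emb)[0]            # shape: (len(pool),)
    top_idx = scores.topk(min(k, len(pool))).indices.tolist()
    return [pool[i] for i in top_idx]
```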
## 5 Related Works
### 5.1 Process Reward Models
Process reward models (PRMs) have demonstrated significant advantages over traditional outcome reward models (ORMs) Cobbe et al. (2021) in enhancing process-level reasoning accuracy and improving long-process reasoning abilities in model training Lightman et al. (2023); Uesato et al. (2022). A growing number of PRMs have been proposed for application in process-level reinforcement learning with human feedback (RLHF) Wang et al. (2024b); Qin et al. (2024); Xia et al. (2025); o1 Team (2024). For instance, Lightman et al. (2023) made a substantial contribution by releasing a large set of human-annotated data at the process level, opening up new research opportunities for multi-step reasoning.
Additionally, Wang et al. (2024b) introduces an automatic, self-supervised pipeline for generating process-level labels and training PRMs, enabling efficient data generation. Xia et al. (2025) employs PRMs as automatic evaluators to assess the accuracy of multi-step reasoning in language models (LMs). With the surge in PRM-focused research and data curation, numerous PRMs o1 Team (2024); Xiong et al. (2024); Sun et al. (2024); Gao et al. (2024a); Wang et al. (2024a) have been proposed. Several studies also focus on leveraging natural language feedback from large language models (LLMs) as rewards, termed critic models McAleese et al. (2024); Zhang et al. (2024); Gao et al. (2024a).
However, most existing PRMs trained on math datasets such as GSM8K and MATH inevitably encounter out-of-distribution (OOD) issues, which can be divided into two categories: question OOD, where PRMs trained on simpler or medium-difficulty datasets lack understanding of questions from more challenging datasets, and step OOD, where different base models and model sizes lead to different step distributions for the same question, reflected in differences in chain length, problem-solving approaches, and methods. To address these issues, we propose the RetrievalPRM framework to tackle the OOD problems in the current PRM field, achieving promising results.
### 5.2 Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) enhances language models by dynamically integrating external knowledge, pioneered by Lewis et al. (2021) through their joint retrieval-generation architecture. Subsequent advances refined this paradigm. Guu et al. (2020) introduced REALM to co-train retrieval and generation modules via masked language modeling, while Izacard and Grave (2021) proposed Fusion-in-Decoder (FiD) to process multi-document contexts efficiently. Research further optimized retrieval precision through dense passage embeddings Karpukhin et al. (2020) and scaled retrieval to web-level corpora Borgeaud et al. (2022).
## 6 Conclusion
In this paper, we have addressed the significant out-of-distribution (OOD) challenges faced by Process Reward Models (PRMs), particularly step OOD and question OOD. By introducing the Retrieval Augmented Process Reward Model (RetrievalPRM), we propose an effective solution that leverages a Two-stage Retrieval-enhanced Mechanism to improve the generalization of PRMs across diverse models and problems. Extensive experiments on multiple real-world datasets have shown that RetrievalPRM consistently outperforms existing methods, highlighting its effectiveness in tackling OOD issues.
## 7 Limitation
RetrievalPRM has two main limitations. First, the retrieval pool is currently constructed only from PRM800K and Math-Shepherd, which is relatively small and limits the diversity and breadth of the mathematical problems it covers. Second, embedding questions and steps with Sentence-BERT struggles to capture the full complexity of mathematical problems, since semantic similarity does not imply knowledge similarity in math. As a result, the naive cosine similarity computed over these embeddings may fail to accurately reflect the true similarity between two questions.
## References
- Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. Preprint, arXiv:2112.04426.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Gao et al. (2024a) Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Dayiheng Liu, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang. 2024a. Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. Preprint, arXiv:2406.14024.
- Gao et al. (2024b) Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. 2024b. Omni-math: A universal olympiad level mathematic benchmark for large language models. Preprint, arXiv:2410.07985.
- Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pre-training. Preprint, arXiv:2002.08909.
- He et al. (2024) Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. Preprint, arXiv:2402.14008.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
- Hui et al. (2024) Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-coder technical report. Preprint, arXiv:2409.12186.
- Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. Preprint, arXiv:2007.01282.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. 2020. Dense passage retrieval for open-domain question answering. Preprint, arXiv:2004.04906.
- Kurita (2021) Takio Kurita. 2021. Principal component analysis (pca). In Computer vision: a reference guide, pages 1013–1016. Springer.
- Lewis et al. (2021) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-augmented generation for knowledge-intensive nlp tasks. Preprint, arXiv:2005.11401.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. arXiv preprint arXiv:2305.20050.
- Luo et al. (2024) Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. 2024. Improve mathematical reasoning in language models by automated process supervision. Preprint, arXiv:2406.06592.
- McAleese et al. (2024) Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. 2024. Llm critics help catch llm bugs. Preprint, arXiv:2407.00215.
- o1 Team (2024) Skywork o1 Team. 2024. Skywork-o1 open series. https://huggingface.co/Skywork.
- OpenAI et al. (2024) OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, et al. 2024. Gpt-4o system card. Preprint, arXiv:2410.21276.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Qin et al. (2024) Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. 2024. O1 replication journey: A strategic progress report – part 1. Preprint, arXiv:2410.18982.
- Reimers (2019) N Reimers. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- Sun et al. (2024) Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, and Chuang Gan. 2024. Easy-to-hard generalization: Scalable alignment beyond human supervision. Preprint, arXiv:2403.09472.
- Team (2024) Qwen Team. 2024. Qwen2.5: A party of foundation models.
- Uesato et al. (2022) Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. Preprint, arXiv:2211.14275.
- Wang et al. (2024a) Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M. Ni, Linyi Yang, Ying Wen, and Weinan Zhang. 2024a. Openr: An open source framework for advanced reasoning with large language models. Preprint, arXiv:2410.09671.
- Wang et al. (2024b) Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2024b. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439.
- Xia et al. (2025) Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, and Pengfei Liu. 2025. Evaluating mathematical reasoning beyond accuracy. Preprint, arXiv:2404.05692.
- Xiong et al. (2024) Wei Xiong, Hanning Zhang, Nan Jiang, and Tong Zhang. 2024. An implementation of generative prm. https://github.com/RLHFlow/RLHF-Reward-Modeling.
- Yang et al. (2024a) An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. 2024a. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. Preprint, arXiv:2409.12122.
- Yang et al. (2024b) An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. 2024b. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.
- Zhang et al. (2024) Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. 2024. Generative verifiers: Reward modeling as next-token prediction. Preprint, arXiv:2408.15240.
- Zheng et al. (2024) Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2024. Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559.
- Zhu et al. (2024) Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931.
## Appendix A Baselines
### A.1 Open-source PRM
<details>
<summary>x5.png Details</summary>

### Visual Description
## Textual Content: Math Teacher Simulation
### Overview
This image contains a simulated interaction where a system is instructed to act as a math teacher. The system is provided with a mathematical question and a proposed solution process. The system's task is to evaluate the correctness of the provided steps and respond with either a '+' for correct or a '-' for incorrect. The image displays the "System" instructions, the "Input" (question and process), and the "Output" from the system.
### Components/Axes
This section is not applicable as the image does not contain a chart or diagram with axes. It is a block of text.
### Content Details
**System:**
"I want you to act as a math teacher. I will provide a mathematical question and several solution steps, and it will be your job to judge whether these steps are correct or not."
**Input:**
**Question:**
"How many seconds are in 5.5 minutes?"
**Process:**
"Step 1 : 5.5 minutes is the same as 5 minutes and 0.5 minutes.
Step 2 : Since there are 60 seconds in a minute, then there are 300 seconds in 5 minutes.
Step 3 : And since there are 60 seconds in a minute, there are 30 seconds in 0.5 minutes.
Is that Step Correct? You should ONLY tell me + or -."
**Output:**
"+."
### Key Observations
The system is designed to evaluate a mathematical problem-solving process. The specific problem is a unit conversion from minutes to seconds. The provided process breaks down 5.5 minutes into 5 minutes and 0.5 minutes, calculates the seconds for each part, and implies a summation. The system's output is a single character, '+', indicating that it judges the provided steps as correct.
### Interpretation
The data presented demonstrates a simple AI task of evaluating a mathematical process. The "System" section defines the role and objective. The "Input" section provides the specific problem and the steps taken to solve it. The question "How many seconds are in 5.5 minutes?" is a straightforward conversion.
The "Process" outlines a logical approach:
* **Step 1:** Decomposing 5.5 minutes into 5 minutes and 0.5 minutes is a valid strategy for simplifying the calculation.
* **Step 2:** Correctly calculates that 5 minutes * 60 seconds/minute = 300 seconds.
* **Step 3:** Correctly calculates that 0.5 minutes * 60 seconds/minute = 30 seconds.
The implicit next step, which the system is likely evaluating, would be to add these two results: 300 seconds + 30 seconds = 330 seconds. Since all individual steps are mathematically sound and lead to the correct overall answer, the system's output of '+' signifies that it has correctly identified the process as valid. The instruction to "ONLY tell me + or -" restricts the output to a binary judgment, which the system adheres to. The trailing period after the '+' in the output is a minor detail that might indicate a formatting convention or a slight deviation from the strict instruction, but the core judgment is positive.
</details>
Figure 5: The illustration of PRM input template.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Textual Content: Math Teacher Simulation
### Overview
This document outlines a system for a math teacher simulation. The system provides a mathematical question and solution steps, and the user's task is to judge the correctness of each step. Correct steps are labeled with a '+', and incorrect steps with a '-'. The system also provides reference questions and their steps for guidance. The specific task presented involves converting minutes to seconds.
### Components/Axes
This document does not contain charts or diagrams with axes. It is a block of text structured into sections:
* **System:** Describes the role of the user as a math teacher and the evaluation criteria for solution steps.
* **Input:** Contains the reference questions and the target question with its solution steps.
* **Reference Question 1:**
* **Question:** "What is the equivalent number of seconds in 7.8 minutes?"
* **Process:** "Since there are 60 seconds in a minute, we can find the number of seconds by multiplying the number of minutes by 60. (+) So, 7.8 minutes is equal to 7.8 * 60 = 46 seconds. The answer is: 46 (-)"
* **Reference Question 2:** (Content is omitted, indicated by "...")
* **Target Question:**
* **Question:** "How many seconds are in 5.5 minutes?"
* **Process:**
* **Step 1:** "5.5 minutes is the same as 5 minutes and 0.5 minutes."
* **Step 2:** "Since there are 60 seconds in a minute, then there are 300 seconds in 5 minutes."
* **Reference Step 1:** "0.3 hours equal to 0.3 * 60 = 18 minutes. This reference step is correct."
* **Reference Step 2:** (Content is omitted, indicated by "...")
* **Target Step 3:** "And since there are 60 seconds in a minute, there are 30 seconds in 0.5 minutes. Is the Step Correct? You should ONLY tell me + or -."
* **Output:** The system's judgment on the correctness of the last step.
### Detailed Analysis or Content Details
**Reference Question 1 Analysis:**
* **Question:** Convert 7.8 minutes to seconds.
* **Process Step:** "Since there are 60 seconds in a minute, we can find the number of seconds by multiplying the number of minutes by 60." This is a correct statement of the conversion factor.
* **Calculation:** "So, 7.8 minutes is equal to 7.8 * 60 = 46 seconds."
* **Calculation Check:** 7.8 * 60 = 468. The provided calculation result of 46 is incorrect.
* **Labeling:** The step is labeled with "(+)" before the calculation and "(-)" after the result. This indicates the system is marking the *entire step* as incorrect due to the erroneous calculation.
**Target Question Analysis:**
* **Question:** Convert 5.5 minutes to seconds.
* **Step 1:** "5.5 minutes is the same as 5 minutes and 0.5 minutes." This is a correct decomposition of 5.5 minutes.
* **Step 2:** "Since there are 60 seconds in a minute, then there are 300 seconds in 5 minutes."
* **Calculation Check:** 5 minutes * 60 seconds/minute = 300 seconds. This calculation is correct.
* **Target Step 3:** "And since there are 60 seconds in a minute, there are 30 seconds in 0.5 minutes."
* **Calculation Check:** 0.5 minutes * 60 seconds/minute = 30 seconds. This calculation is correct.
* **Question for User:** "Is the Step Correct? You should ONLY tell me + or -." This refers to Target Step 3.
**Reference Step 1 Analysis:**
* **Statement:** "0.3 hours equal to 0.3 * 60 = 18 minutes."
* **Calculation Check:** 0.3 hours * 60 minutes/hour = 18 minutes. This calculation is correct.
* **Labeling:** "This reference step is correct." This confirms the system's assessment of this reference step.
### Key Observations
* The system's evaluation of "Reference Question 1" is inconsistent. The process description is correct, but the calculation (7.8 * 60 = 46) is wrong. The system labels the entire step as incorrect with a "(-)" after the erroneous result, but the initial "(+)" before the calculation is confusing. It's possible the "(+)" was intended to mark the correct part of the process, and the "(-)" to mark the incorrect result.
* The "Target Question" requires the user to evaluate "Target Step 3".
* "Target Step 3" states that 0.5 minutes is equal to 30 seconds. This is mathematically correct.
* The "Output" section provides the system's judgment on "Target Step 3".
### Interpretation
The document describes a system designed to test a user's ability to identify correct mathematical steps. The system provides examples and a target problem.
The core of the task is to evaluate "Target Step 3" which states: "And since there are 60 seconds in a minute, there are 30 seconds in 0.5 minutes." This statement is mathematically sound, as 0.5 minutes * 60 seconds/minute = 30 seconds.
The "Output: +." indicates that the system has judged "Target Step 3" to be correct. This aligns with the mathematical verification.
The "Reference Question 1" serves as a demonstration of how the system labels steps. The incorrect calculation (7.8 * 60 = 46 instead of 468) is a clear error. The labeling with both "(+)" and "(-)" is ambiguous but likely signifies that while the initial premise was correct, the execution led to a wrong answer, thus the step as a whole is deemed incorrect.
The "Reference Step 1" (0.3 hours = 18 minutes) is a correct conversion and is explicitly labeled as correct, reinforcing the system's ability to identify accurate mathematical operations.
In essence, the system is presenting a scenario where the user acts as a judge for mathematical steps, and the provided output suggests that the user's role is to confirm the correctness of the final step in the target question, which is indeed correct.
</details>
Figure 6: The illustration of RetrievalPRM input template.
- Skywork-PRM o1 Team (2024) is a Qwen2.5-Math-based PRM published by KunLun.
- Qwen2.5-PRM Zheng et al. (2024) is trained by fine-tuning the Qwen2.5-Math-7B-Instruct model on the PRM800K dataset.
- Math-Shepherd Wang et al. (2024b) generates process labels for each step by estimating the empirical probability that a given step leads to the correct final answer, and trains a PRM on their published dataset (a minimal sketch of this labeling idea is given after this list).
- RLHFlow-PRM Xiong et al. (2024) is an 8-billion-parameter reward model trained with process supervision.
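The following is a minimal sketch of the Math-Shepherd-style labeling idea mentioned above, assuming a hypothetical LLM sampler `sample_completions` and answer checker `reaches_gold_answer`; it is not code from the released pipeline.

```python
def estimate_step_label(question: str, prefix_steps: list[str],
                        sample_completions, reaches_gold_answer,
                        n: int = 8) -> float:
    """Monte-Carlo estimate of a step's quality: roll out n completions from
    the current step prefix and measure how often they reach the correct
    final answer. Both helper callables are hypothetical placeholders."""
    completions = sample_completions(question, prefix_steps, n=n)
    hit_rate = sum(reaches_gold_answer(c) for c in completions) / n
    # Soft label: the hit rate itself; hard-label variant: 1 if any rollout succeeds.
    return hit_rate
```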
### A.2 Language Models as Critic
- Llama Dubey et al. (2024) is an open-source model developed by Meta (formerly Facebook), designed for natural language understanding and generation tasks.
- Qwen2 Yang et al. (2024b) is a large language model developed by Alibaba Cloud, offering multilingual support and strong capabilities in language understanding and generation.
- Qwen2.5 Team (2024) is an advanced iteration of the Qwen series, pretrained on 18 trillion tokens, enhancing knowledge retention, programming, and mathematical reasoning.
- Qwen2.5-MATH Yang et al. (2024a) is a specialized model for mathematical problem-solving, trained on extensive math-focused data and incorporating Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR).
- Qwen2.5-Coder Hui et al. (2024) is a programming-oriented model trained on 5.5 trillion code-related tokens, excelling in code generation, debugging, and multilingual programming tasks.
- GPT-4o OpenAI et al. (2024) is a multimodal AI model developed by OpenAI that processes and generates text, audio, and images in real-time, with enhanced speed and natural interaction capabilities.
## Appendix B Implementation Details
### B.1 Base Model and Training Hyperparameters
We selected Qwen2.5-Math-7B-Instruct Team (2024) as the foundational large language model (LLM) for our experiments. All computations were performed on H100 GPUs. To improve training resource efficiency, we employed the parameter-efficient fine-tuning technique LoRA. The LoRA configuration used a rank of 32, an alpha value of 64, and a dropout of 0.1, with LoRA update matrices applied to the query and value projection matrices within the attention blocks.
We use PRM800K as our training data and both PRM800K and Math-Shepherd as our retrieval pool. The training process was carried out with batch sizes chosen from $\{64, 128, 256, 512\}$ and initial learning rates selected from $\{1\times 10^{-3}, 1\times 10^{-4}, 3\times 10^{-4}, 1\times 10^{-5}, 3\times 10^{-5}\}$ using a linear scheduler.
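A minimal sketch of this setup with the Hugging Face transformers and peft libraries is given below; the model identifier, per-device batch split, epoch count, and output path are illustrative assumptions, while the LoRA rank, alpha, dropout, target modules, and scheduler follow the settings reported above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

base = "Qwen/Qwen2.5-Math-7B-Instruct"  # assumed Hub identifier of the base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA settings reported above: rank 32, alpha 64, dropout 0.1, applied to the
# query and value projections of the attention blocks.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# One point from the reported search grid (effective batch size 128, lr 1e-4);
# the per-device/accumulation split and epoch count are placeholders.
args = TrainingArguments(
    output_dir="retrievalprm-ckpt",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    bf16=True,
)
```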
### B.2 Prompts
In this section, we show our training prompts for the PRM in detail, as illustrated in Figure 5 and Figure 6.
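For reference, the sketch below assembles an input roughly following the RetrievalPRM template of Figure 6; only the system instruction and the final query wording are taken from the figure, while the field ordering, separators, and helper structure are illustrative.

```python
SYSTEM = ("I want you to act as a math teacher. I will provide a mathematical "
          "question and several solution steps, and it will be your job to "
          "judge whether these steps are correct or not.")


def build_prompt(ref_questions: list[dict], question: str, prev_steps: list[str],
                 ref_steps: list[str], target_step: str) -> str:
    """Assemble a RetrievalPRM-style input: retrieved reference questions,
    the target question with its previous steps, retrieved reference steps,
    and the target step to be judged."""
    parts = [SYSTEM]
    for i, ref in enumerate(ref_questions, 1):
        parts.append(f"Reference Question {i}: {ref['question']}\nProcess: {ref['process']}")
    parts.append(f"Question: {question}")
    parts.extend(f"Step {i}: {s}" for i, s in enumerate(prev_steps, 1))
    parts.extend(f"Reference Step {i}: {s}" for i, s in enumerate(ref_steps, 1))
    parts.append(f"Target Step: {target_step}\n"
                 "Is the Step Correct? You should ONLY tell me + or -.")
    return "\n\n".join(parts)
```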
## Appendix C Datasets
GSM8K Cobbe et al. (2021): Grade School Math is a dataset of basic to intermediate math problems covering arithmetic, algebra, geometry, and other fields, with difficulty appropriate for elementary to middle school.
MATH Hendrycks et al. (2021): The MATH dataset contains a variety of math problems from basic to university level, covering multiple mathematical fields such as algebra, geometry, calculus, number theory, etc.
OlympiadBench He et al. (2024): The OlympiadBench dataset contains questions from Mathematical Olympiads. The questions are highly difficult and involve complex combinatorial mathematics, number theory, geometry, and other advanced mathematical fields.
Omni-MATH Gao et al. (2024b): Omni-MATH is a general Olympiad-level mathematics benchmark dataset for large language models, covering multi-domain and high-difficulty mathematics problems, and is designed to evaluate the reasoning ability of models in various mathematical fields.
Except for GSM8K, which focuses on grade school math problems, the other three datasets feature problems of competition or Olympiad-level difficulty.
## Appendix D Supplementary Evaluation Results
In this section, we show the breakdown of our main results in Table 3 and of our ablation results in Table 4.
Table 3: Breakdown of evaluation results of different models on ProcessBench. The best result is given in bold, and the second-best value is underlined.
| Type | Model | GSM8K error | GSM8K correct | GSM8K F1 | MATH error | MATH correct | MATH F1 | OlympiadBench error | OlympiadBench correct | OlympiadBench F1 | OmniMATH error | OmniMATH correct | OmniMATH F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-source PRM | RetrievalPRM-7B (Ours) | 64.7 | 88.1 | 74.6 | 67.2 | 75.6 | 71.1 | 56.0 | 65.2 | 60.2 | 52.8 | 62.65 | 57.33 |
| | Qwen2.5-Math-7B-PRM800K | 53.1 | 95.3 | 68.2 | 48.0 | 90.1 | 62.6 | 35.7 | 87.3 | 50.7 | 29.8 | 86.3 | 44.3 |
| | Skywork-PRM-7B | 61.8 | 82.9 | 70.8 | 43.8 | 69.2 | 53.6 | 17.9 | 31.9 | 22.9 | 14.0 | 41.9 | 21.0 |
| | RLHFlow-PRM-Mistral-8B | 33.8 | 99.0 | 50.4 | 21.7 | 72.2 | 33.4 | 8.2 | 43.1 | 13.8 | 9.6 | 45.2 | 15.8 |
| | RLHFlow-PRM-Deepseek-8B | 24.2 | 98.4 | 38.8 | 21.4 | 80.0 | 33.8 | 10.1 | 51.0 | 16.9 | 10.1 | 51.9 | 16.9 |
| | Skywork-PRM-1.5B | 50.2 | 71.5 | 59.0 | 37.9 | 65.3 | 48.0 | 15.4 | 26.0 | 19.3 | 13.6 | 32.8 | 19.2 |
| | Math-Shepherd-PRM-7B | 32.4 | 91.7 | 47.9 | 18.0 | 82.0 | 29.5 | 15.0 | 71.1 | 24.8 | 14.2 | 73.0 | 23.8 |
| Language Models | QwQ-32B-Preview | 81.6 | 95.3 | 88.0 | 78.1 | 79.3 | 78.7 | 61.4 | 54.6 | 57.8 | 55.7 | 68.0 | 61.3 |
| | GPT-4o | 70.0 | 91.2 | 79.2 | 54.4 | 76.6 | 63.6 | 45.8 | 58.4 | 51.4 | 45.2 | 53.5 | 61.9 |
| | Qwen2.5-72B-Instruct | 62.8 | 96.9 | 76.2 | 46.3 | 93.1 | 61.8 | 38.7 | 92.6 | 54.6 | 36.6 | 90.9 | 52.2 |
| | Llama-3.3-70B-Instruct | 72.5 | 96.9 | 82.9 | 43.3 | 94.6 | 59.4 | 31.0 | 94.1 | 46.7 | 28.2 | 90.5 | 43.0 |
| | Qwen2.5-32B-Instruct | 49.3 | 97.9 | 65.6 | 36.7 | 95.8 | 53.1 | 25.3 | 95.9 | 40.0 | 24.1 | 92.5 | 38.3 |
| | Qwen2.5-14B-Instruct | 54.6 | 94.8 | 69.3 | 38.4 | 87.4 | 53.3 | 31.5 | 78.8 | 45.0 | 28.3 | 76.3 | 41.3 |
| | Qwen2.5-Coder-32B-Instruct | 54.1 | 94.8 | 68.9 | 44.9 | 90.6 | 60.1 | 33.4 | 91.2 | 48.9 | 31.5 | 87.6 | 46.3 |
| | Qwen2.5-Coder-14B-Instruct | 33.8 | 96.4 | 50.1 | 25.4 | 92.4 | 39.9 | 20.7 | 94.1 | 34.0 | 15.9 | 94.2 | 27.3 |
| | Qwen2.5-Coder-7B-Instruct | 7.7 | 100.0 | 14.3 | 3.4 | 98.3 | 6.5 | 2.1 | 99.1 | 4.1 | 0.9 | 98.3 | 1.8 |
| | Qwen2.5-Math-72B-Instruct | 49.8 | 96.9 | 65.8 | 36.0 | 94.3 | 52.1 | 19.5 | 97.3 | 32.5 | 19.0 | 96.3 | 31.7 |
| | Qwen2.5-Math-7B-Instruct | 15.5 | 100.0 | 26.8 | 14.8 | 96.8 | 25.7 | 7.7 | 91.7 | 14.2 | 6.9 | 88.0 | 12.7 |
| | Llama-3.1-70B-Instruct | 64.3 | 89.6 | 74.9 | 35.4 | 75.6 | 48.2 | 35.1 | 69.9 | 46.7 | 30.7 | 61.8 | 41.0 |
| | Meta-Llama-3-70B-Instruct | 35.7 | 96.9 | 52.2 | 13.0 | 93.3 | 22.8 | 12.0 | 92.0 | 21.2 | 11.2 | 91.7 | 20.0 |
| | Qwen2-72B-Instruct | 57.0 | 82.9 | 67.6 | 37.7 | 70.9 | 49.2 | 34.0 | 55.2 | 42.1 | 32.3 | 53.1 | 40.2 |
| | Qwen2.5-7B-Instruct | 40.6 | 33.2 | 36.5 | 30.8 | 45.1 | 36.6 | 26.5 | 33.9 | 29.7 | 26.2 | 28.6 | 27.4 |
| | Qwen2-7B-Instruct | 40.6 | 4.7 | 8.4 | 30.5 | 13.8 | 19.0 | 22.4 | 10.9 | 14.7 | 20.0 | 8.7 | 12.1 |
| | Llama-3.1-8B-Instruct | 44.4 | 6.2 | 10.9 | 41.9 | 2.7 | 5.1 | 32.4 | 1.5 | 2.8 | 32.0 | 0.8 | 1.6 |
| | Meta-Llama-3-8B-Instruct | 42.5 | 7.8 | 13.1 | 28.6 | 9.1 | 13.8 | 27.1 | 2.7 | 4.8 | 26.1 | 8.3 | 12.6 |
Table 4: Breakdown of evaluation results of different variants of RetrievalPRM on ProcessBench. We remove different components of RetrievalPRM to evaluate the contribution of each part to the model. The best result is given in bold, and the second-best value is underlined.
| Question-level | Step-level | GSM8K error | GSM8K correct | GSM8K F1 | MATH error | MATH correct | MATH F1 | OlympiadBench error | OlympiadBench correct | OlympiadBench F1 | OmniMATH error | OmniMATH correct | OmniMATH F1 | Avg. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | ✓ | 64.7 | 88.1 | 74.6 | 67.2 | 75.6 | 71.1 | 56.0 | 65.2 | 60.2 | 52.8 | 62.65 | 57.33 | 65.8 |
| ✓ | ✗ | 61.8 | 94.8 | 74.9 | 62.1 | 83.3 | 71.2 | 48.7 | 77.3 | 59.8 | 43.2 | 73.4 | 54.4 | 65.0 |
| ✗ | ✓ | 51.7 | 97.4 | 67.5 | 57.2 | 87.4 | 69.2 | 46.0 | 82.0 | 58.9 | 43.9 | 78.4 | 56.3 | 63.0 |
| ✗ | ✗ | 50.7 | 92.7 | 65.6 | 57.9 | 81.0 | 67.5 | 46.9 | 68.7 | 55.8 | 39.7 | 71.0 | 50.9 | 59.9 |