2506.18896v2

# ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

**Authors**:

- Jiaru Zou
- Ling Yang*†
- Jingwen Gu
- Jiahao Qiu
- Ke Shen
- Jingrui He
- Mengdi Wang

(UIUC, Princeton University, Cornell University, ByteDance Seed)

[*] Equal Contribution [†] Corresponding authors

## Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory–response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.
<details> <summary>plots/intro_res.png Details</summary>

![b02b46e2](/v1/image/b02b46e20d4196d9b9fa8e0a95c86a44bcdcfaec8051fbd6fd87e20ff9f275d6)

### Visual Description

Three panels comparing ReasonFlux-PRM-7B with baseline methods.

* **Chart 1 (radial plot, labeled AIME25):** legend entries are "Human Curated", "Random", "Skywork-PRM-7B", "Qwen2.5-Math-PRM-7B", "Qwen2.5-Math-PRM-72B", and "ReasonFlux-PRM-7B".
* **Chart 2 (Training Reward):** line graph with "Step" on the x-axis and "Training Reward" on the y-axis; legend entries are "GRPO", "Qwen2.5-Math-PRM-7B", and "ReasonFlux-PRM-7B".
* **Chart 3 (GPQA-Diamond):** line graph with "Number of Solutions (N)" on the x-axis and "Accuracy (%)" on the y-axis; legend entries are "Qwen2.5-Math-PRM-7B", "Skywork-PRM-7B", "Majority", and "ReasonFlux-PRM-7B".

</details>

Figure 1: Overview of ReasonFlux-PRM. ReasonFlux-PRM is designed to provide general-purpose reward supervision across multiple application scenarios. Left: Offline selection of high-quality distilled trajectory–response data to enhance downstream supervised fine-tuning of smaller models. Middle: Online reward modeling integrated into GRPO-based policy optimization. Right: Reward-guided Best-of-N test-time scaling to improve inference-time performance.

## 1 Introduction

Process Reward Models (PRMs) [1, 2, 3] have recently emerged as a powerful framework for providing process-level supervision over the reasoning of large language models (LLMs), particularly in complex domains such as mathematical problem solving [4, 5, 1]. Given a question and the corresponding model's final response, PRMs verify the reasoning step by step and assign fine-grained rewards to each step of the response.
Prior studies have leveraged PRMs both in post-training [6, 7], for example to provide dense rewards for online reinforcement learning (RL) [8], and in reward-guided inference-time scaling [9, 10]. However, existing PRMs are primarily trained on and applied to model-generated final responses, typically presented in an explicit and organized step-by-step chain-of-thought (CoT) format. Meanwhile, frontier reasoning models such as OpenAI-o1 [11] and Deepseek-R1 [12] have increasingly adopted a trajectory-response output format: a lengthy, comprehensive, and less organized intermediate thinking trajectory, followed by a concise, step-by-step final response conditioned on the prior thinking (as illustrated in Figure 2). Such trajectory–response pairs have been widely distilled from large reasoning models to support downstream training of smaller models, enabling them to emulate the larger models' ability to think first and then produce coherent, extended CoT rationales [13, 14, 15].

The increasing use of trajectory–response data raises an important question: Can PRMs provide supervision not only to the final responses of large reasoning models, but also to their intermediate thinking trajectories? Answering this question first requires assigning informative and correct rewards to a model's intermediate thinking trajectories. Unlike final responses, these trajectories are typically treated as silver-standard data [16], automatically generated by large reasoning models without rigorous quality control or standardized verification criteria, which makes their evaluation inherently noisy and less reliable.

To address this, we first revisit several state-of-the-art PRMs and evaluate their performance on trajectory–response pairs. Our analysis reveals that existing PRMs struggle to robustly supervise model thinking trajectories and can degrade downstream training on such data.
We further find that this degradation stems primarily from two key issues: a structural and formatting mismatch between intermediate thinking trajectories and final responses, and the lack of trajectory–response data with assigned rewards during PRM training.

<details> <summary>plots/data_example.png Details</summary>

![5ce024c6](/v1/image/5ce024c6c6e620dfc982d87ea7cd73367ad0f2d44b843d9e5cf7bd4b8302c3cd)

### Visual Description

Diagram titled "Trajectory-Response Data", laid out like a conversational interface with a "Thinking Trajectories" block and a "Final Response" block.

* **Question:** "Is the Fourier transform of the signal imaginary?"
* **Signal definition:** $x_{1}(t)=\sin(\omega_{0}t)$ for $-\pi/\omega_{0}\leq t\leq\pi/\omega_{0}$, and $x_{1}(t)=0$ otherwise.
* **Thinking Trajectories (transcription):** "The user is asking whether the Fourier transform of the given signal is imaginary … Let me first see the definition of the signal definition … The Fourier transform properties suggests that … Break the exponential into sine terms, we have … Recall from the previous derive integral expression… Let me revisit the given term for x₁(t)… Wait, if the function is odd … Recall the definition of imaginary and the expression … the Fourier transform is imaginary."
* **Final Response (transcription):** "Let me answer step-by-step. Step 1: The Fourier transform of the signal is defined as … Step 2: We use the identity as … Step 3: Substituting into the integral … Based on the following steps, the result is purely imaginary."
* **Step labels:** Step 1: 0.71, Step 2: 0.85, Step 3: 0.92.
* **Icons:** a "PRM" icon with a question mark sits in the top-right corner of the "Thinking Trajectories" block; a "PRM" icon with a lightbulb sits in the bottom-right corner of the "Final Response" block.

</details>

Figure 2: Illustration of the Trajectory-Response Data generated by Deepseek-R1. Existing PRMs can assign appropriate scores to final responses but often struggle to evaluate intermediate reasoning trajectories accurately.

Motivated by these observations, we propose a new trajectory-aware PRM, namely ReasonFlux-PRM, which incorporates both step-level and trajectory-level supervision to better align the models' intermediate thinking trajectories with their final responses. ReasonFlux-PRM is trained on a 10k curated dataset of high-quality trajectory–response pairs covering math and science reasoning. Unlike existing PRMs, ReasonFlux-PRM is explicitly tailored to intermediate thinking processes by providing fine-grained rewards as supervision signals for each step within the thinking trajectory.

We further adapt ReasonFlux-PRM for more general reward modeling scenarios, as illustrated in Figure 1. In offline settings, ReasonFlux-PRM assigns scores to filter high-quality trajectory–response pairs, facilitating effective training data curation for downstream supervised fine-tuning of smaller models. In online settings, ReasonFlux-PRM is integrated into the reward modeling process to provide fine-grained supervision signals during policy optimization, such as GRPO [17]. Moreover, ReasonFlux-PRM facilitates test-time scaling by evaluating multiple generated responses and selecting the most promising one via a reward-guided Best-of-N strategy.
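The reward-guided Best-of-N strategy mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `score_steps` is a hypothetical stand-in for a PRM, the toy rewards are invented, and ranking candidates by the mean of their step rewards is an assumed aggregation choice.

```python
from statistics import mean
from typing import Callable, Sequence


def best_of_n(
    candidates: Sequence[str],
    score_steps: Callable[[str], Sequence[float]],
) -> str:
    """Return the candidate whose mean step-level reward is highest.

    `score_steps` is a hypothetical PRM wrapper that maps one candidate
    response to a non-empty list of per-step rewards.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: mean(score_steps(c)))


# Toy rewards (invented for illustration; a real PRM would produce these).
TOY_REWARDS = {
    "response A": [0.2, 0.3],
    "response B": [0.9, 0.8, 0.7],
    "response C": [0.5],
}


def toy_score_steps(response: str) -> Sequence[float]:
    """Look up the invented per-step rewards for a candidate."""
    return TOY_REWARDS[response]


best = best_of_n(list(TOY_REWARDS), toy_score_steps)
print(best)  # response B
```

Increasing the number of sampled candidates N only helps if the scorer ranks correct responses above incorrect ones, which is why the quality of the reward model matters in this setting.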
In summary, our main contributions are:

- **In-Depth Trajectory-Response Data Analysis in Long-CoT Reasoning.** We identify, formulate, and analyze the problem of adapting several existing PRMs to supervise both models' intermediate reasoning trajectories and their final responses, motivated by the increasing prevalence of trajectory–response distillation data in downstream post-training and test-time scaling.
- **Trajectory-aware Reward Modeling for Data Selection, RL and Test-Time Scaling.** We introduce ReasonFlux-PRM, a trajectory-aware process reward model that incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment for model thinking trajectories. ReasonFlux-PRM can be integrated into both offline and online workflows for more generalized purposes, including offline selection of high-quality training data, online policy optimization in RL training, and test-time scaling.
- **Extensive Downstream Evaluations.** Across extensive evaluations on challenging reasoning benchmarks, ReasonFlux-PRM demonstrates superior data selection quality at smaller model scales, with ReasonFlux-PRM-7B outperforming strong baselines such as Qwen2.5-Math-PRM-72B [2] and datasets curated by human experts. On tasks such as AIME [18, 19], MATH500 [20], and GPQA-Diamond [21], ReasonFlux-PRM-7B achieves notable average accuracy improvements of 12.1% during supervised fine-tuning, 4.5% during reinforcement learning, and 6.3% during inference test-time scaling.

## 2 Preliminaries

**Trajectory-Response Data.** Let $f_{\text{oracle}}(\cdot)$ denote an oracle model, such as Deepseek-R1, capable of producing structured reasoning traces. Given a complex input prompt $x$, the oracle generates a sequence of intermediate thinking steps followed by a final response.
We represent each instance of such data as a tuple $(s,a)$, where $s=(s_{1},s_{2},\dots,s_{T})$ denotes a thinking trajectory consisting of $T$ intermediate steps, and $a=(a_{1},a_{2},\dots,a_{T})$ denotes the final response, which can also be structured as a chain-of-thought trace with $T$ formatted and organized steps. For large reasoning models, we assume that both $s$ and $a$ consist of $T$ reasoning steps. This structural alignment reflects the modeling assumption that the final output trace $a$ is generated in a step-by-step manner, strictly conditioned on the preceding intermediate reasoning steps $s$. Both the thinking trajectory and final response are generated auto-regressively by the oracle model, i.e.,

$$
s_{t}\sim f_{\text{oracle}}(x,s_{<t}),\quad a_{t}\sim f_{\text{oracle}}(x,s,a_{<t}), \tag{1}
$$

where $s_{<t}=(s_{1},\dots,s_{t-1})$ and $a_{<t}=(a_{1},\dots,a_{t-1})$ denote the reasoning and answer histories up to step $t$, respectively. In the trajectory-response outputs distillation setting, the full supervision target instance $y$ can be constructed as the concatenation of thinking trajectories and the final response, i.e., $y=s\oplus a$.

**Process Reward Modeling.** Given a trajectory-answer pair $(s,a)$, where both $s=(s_{1},\dots,s_{T})$ and $a=(a_{1},\dots,a_{T})$ are structured as reasoning traces, the goal of a process reward model is to evaluate each intermediate reasoning step $s_{t}\in s$ with respect to its utility in achieving a correct and coherent final response. We first define a reference reward function $R_{\text{ref}}$ that provides step-level supervision:

$$
r_{t}=R_{\text{ref}}(s_{t}\mid x,s_{<t},a), \tag{2}
$$

where $R_{\text{ref}}(\cdot)$ scores the $t$-th step conditioned on the input $x$, the prior thinking trajectory steps, and the full final response $a$.
The total reward for the trajectory is then computed by aggregating the step-by-step scores:

$$
R_{\text{total}}=\mathcal{A}(r_{1},r_{2},\dots,r_{T}), \tag{3}
$$

where $\mathcal{A}(\cdot)$ denotes an aggregation function such as the mean or the sum. The training objective for PRMs is to learn a scoring function $R_{\phi}(\cdot)$, parameterized by $\phi$, that approximates the reference reward for each step. This is formulated as minimizing the discrepancy between predicted and reference rewards over a training dataset $\mathcal{D}=\{(x^{(i)},s^{(i)},a^{(i)},r^{(i)}_{1:T})\}_{i=1}^{N}$, where $r_{t}^{(i)}$ denotes the target reward for step $s_{t}^{(i)}$. Formally, the training objective can be written as:

$$
\min_{\phi}\ \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T^{(i)}}\mathcal{L}\left(R_{\phi}(s_{t}^{(i)}\mid x^{(i)},s_{<t}^{(i)},a^{(i)}),\ r_{t}^{(i)}\right). \tag{4}
$$

<details> <summary>x1.png Details</summary>

![2a237c8c](/v1/image/2a237c8c38a0eb1dc448e35ea88120e8a82f7379dc636ccabe0e1554bd8302d4)

### Visual Description

Two side-by-side histograms of reward-score distributions for Deepseek-R1 and Gemini Flash Thinking.

* **X-axis:** "Reward Score", ranging from 0.0 to 1.0.
* **Y-axis:** "Density".
* **Legend (top-left of each histogram):** Deepseek-R1 (light blue) and Gemini Flash Thinking (light orange), overlaid in each panel.

</details>

Figure 3: Distributions of scores assigned by Qwen2.5-Math-PRM-72B to 1,000 trajectory–response pairs distilled from Deepseek-R1 and the Gemini Flash Thinking API. Left: Distribution of scores computed over thinking trajectories. Right: Distribution of scores based on final responses.

## 3 Existing PRMs Are Not Prepared for Rewarding Thinking Trajectories

To examine whether existing frontier PRMs can be directly applied to reward the trajectory-response data, we first conduct a preliminary study to investigate two key questions:

- **RQ1:** Can PRMs distinguish the quality of thinking trajectories distilled from different oracle models?
- **RQ2:** What is the effectiveness of using the PRM-selected trajectory-response data on the downstream fine-tuning of smaller models?

For brevity, we defer detailed experimental setups to Appendix A.1.
To investigate RQ1, we evaluate Qwen2.5-Math-PRM-72B on 1,000 sampled problems in s1k [13] with trajectory-response traces generated by the Gemini Flash Thinking API [22] and Deepseek-R1 [12], respectively. For each data trace, we apply the PRM to compute step-level rewards (with steps split by "\n\n"), and then aggregate these rewards by taking the mean to obtain a final trajectory-level reward. Figure 3 (left) compares the distribution of PRM scores across the two oracle models. The histogram shows a significant overlap in the score distributions, though Deepseek-R1 traces tend to receive higher rewards on average, with a longer tail toward high-reward regions (e.g., scores above 0.6). The results suggest that while Qwen2.5-Math-PRM-72B captures some signal for differentiating between the two sources, its discriminative ability remains limited.

**Takeaway 1.** Several existing PRMs exhibit limitations in distinguishing reasoning traces distilled from different oracle models and often struggle to clearly separate high- and low-quality model thinking trajectories.

Next, to investigate RQ2, we evaluate how PRM-selected data affects the downstream supervised fine-tuning of smaller models. We apply four different PRMs to assign a reward score to each of the 59K raw trajectory-response traces generated by Gemini [22] in s1 [13], using the same mean aggregation over step-level rewards to compute a trajectory-level score. Based on these scores, we rank all traces and select the top 1,000 samples from each PRM as a fine-tuning dataset for the downstream small model. As a reference, we also directly use the 1K human-curated examples in s1k [13]. Table 1 presents the accuracy of the fine-tuned Qwen2.5-14B-Instruct on four challenging downstream tasks.
We observe that the four PRM-selected training sets underperform the human-curated baseline on nearly every task, often by a wide margin (only Qwen2.5-Math-PRM-72B ties it, on AIME24). This suggests that existing PRMs are not yet sufficiently calibrated to identify high-quality trajectory-response data, and can even degrade downstream model performance by selecting suboptimal or misaligned training samples.

**Takeaway 2.** Direct reliance on current PRMs for trajectory-response selection can yield misaligned training data, which in turn diminishes the effectiveness of downstream supervised fine-tuning for smaller models.

Table 1: Performance of Qwen2.5-14B-Instruct on four challenging reasoning tasks after fine-tuning on the trajectory-response data selected by four different PRMs. We also compare the fine-tuning performance of using PRM-selected data with using randomly sampled data (1k from 59k) and the s1k human-curated data [13]. Arrows give the change relative to the human-curated (s1k) row.

| SFT Data Source | AIME24 | AIME25 | MATH500 | GPQA-Diamond |
| --- | --- | --- | --- | --- |
| Random | 16.7 ($\downarrow$ 16.6) | 20.0 ($\downarrow$ 13.3) | 68.4 ($\downarrow$ 10.4) | 34.8 ($\downarrow$ 6.6) |
| Math-Shepherd-PRM-7B | 13.3 ($\downarrow$ 20.0) | 6.7 ($\downarrow$ 26.6) | 67.8 ($\downarrow$ 11.0) | 33.3 ($\downarrow$ 8.1) |
| Skywork-PRM-7B | 13.3 ($\downarrow$ 20.0) | 13.3 ($\downarrow$ 20.0) | 71.8 ($\downarrow$ 7.0) | 37.9 ($\downarrow$ 3.5) |
| Qwen2.5-Math-PRM-7B | 26.7 ($\downarrow$ 6.6) | 20.0 ($\downarrow$ 13.3) | 73.2 ($\downarrow$ 5.6) | 39.4 ($\downarrow$ 2.0) |
| Qwen2.5-Math-PRM-72B | 33.3 ($\downarrow$ 0.0) | 26.7 ($\downarrow$ 6.6) | 77.0 ($\downarrow$ 1.8) | 39.4 ($\downarrow$ 2.0) |
| Qwen2.5-Math-PRM-72B (on model responses) | 36.7 ($\uparrow$ 3.4) | 26.7 ($\downarrow$ 6.6) | 77.8 ($\downarrow$ 1.0) | 40.9 ($\downarrow$ 0.5) |
| Human-curated (s1k) | 33.3 | 33.3 | 78.8 | 41.4 |

As most existing PRMs are trained on reasoning traces derived from model final output responses rather than intermediate thinking trajectories [2, 23], we take a closer look at the distinctions between genuine thinking trajectories and post-hoc generated responses. As detailed in Appendix A.2, these two types of data exhibit several fundamental differences: (i) Thinking trajectories often include branching, where the model revisits earlier steps, explores alternative paths, and revises prior assumptions; such behavior is rarely observed in the linear and polished structure of final responses. (ii) Thinking trajectories tend to exhibit weaker global coherence across steps, as each step is often locally focused and not optimized for narrative continuity.

To further validate that the performance degradation of existing PRMs stems from this data mismatch, we conduct an additional experiment in which Qwen2.5-Math-PRM-72B scores each data instance based solely on the model response, rather than on the intermediate thinking trajectory. As shown in Figure 3 (right), the PRM produces a relatively clearer separation in score distributions between the two oracle models. As also shown in Table 1 (row: on model responses), the performance drop is reduced when training on data selected by scoring final responses, suggesting that existing PRMs are better aligned with model-response-level supervision.

**Takeaway 3.** Thinking trajectories inherently differ from final responses, and existing PRMs are more accustomed to scoring final outputs than intermediate reasoning steps.

**Motivation for ReasonFlux-PRM.** Our findings above highlight the need for a more general reward model that can effectively evaluate both intermediate model thinking trajectories and final responses. As thinking trajectories become integral to supervised and RL-based fine-tuning, existing PRMs, trained primarily on final responses, struggle to provide reliable supervision. To address this, we propose and train a new thinking-aware process reward model tailored to the supervision of trajectory-response data.
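The selection protocol used in the preliminary study of this section (split each trace into steps on blank lines, score every step, average the step rewards, keep the top-ranked traces) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: `toy_step_scorer` is a hypothetical stand-in for a real PRM, and the dictionary keys `prompt` and `trajectory` as well as the sample texts are invented for illustration.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical scorer signature: (prompt, previous_steps, current_step) -> reward.
StepScorer = Callable[[str, Sequence[str], str], float]


def split_steps(trajectory: str) -> list[str]:
    """Split a trace into steps on blank lines, dropping empty fragments."""
    return [s.strip() for s in trajectory.split("\n\n") if s.strip()]


def trajectory_score(prompt: str, trajectory: str, score_step: StepScorer) -> float:
    """Mean of the step-level rewards (mean aggregation, as in the study)."""
    steps = split_steps(trajectory)
    if not steps:
        return 0.0  # assumption: an empty trace gets the lowest score
    rewards = [score_step(prompt, steps[:t], steps[t]) for t in range(len(steps))]
    return mean(rewards)


def select_top_k(samples: list[dict], score_step: StepScorer, k: int) -> list[dict]:
    """Rank samples by trajectory-level score and keep the k highest."""
    ranked = sorted(
        samples,
        key=lambda s: trajectory_score(s["prompt"], s["trajectory"], score_step),
        reverse=True,
    )
    return ranked[:k]


def toy_step_scorer(prompt: str, previous: Sequence[str], step: str) -> float:
    """Invented stand-in for a PRM: reward 1.0 if the step contains a digit."""
    return 1.0 if any(ch.isdigit() for ch in step) else 0.0


samples = [
    {"prompt": "q1", "trajectory": "Compute 2+2.\n\nSo the answer is 4."},
    {"prompt": "q2", "trajectory": "Hmm, not sure.\n\nMaybe guess.\n\nAnswer: 7"},
    {"prompt": "q3", "trajectory": "Let x = 3.\n\nWait, reconsider."},
]
selected = select_top_k(samples, toy_step_scorer, k=2)
print([s["prompt"] for s in selected])  # ['q1', 'q3']
```

In the study, k is 1,000 and the pool holds 59K traces; scoring only the final response instead of the thinking trajectory amounts to passing a different text field to the same routine.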
<details> <summary>plots/method_pipeline.png Details</summary> ![6152bb5c](/v1/image/6152bb5cec372016a8f33ee98013cf6114282372afd8536e921e73484faf6b8e) ### Visual Description \n ## Diagram: ReasonFlux-PRM Training and Inference ### Overview This diagram illustrates the training and inference processes of the ReasonFlux-PRM system. The diagram is divided into two main sections: "ReasonFlux-PRM Training" on the left and "ReasonFlux-PRM Inference" on the right. The training section details data curation and reward design, while the inference section outlines offline and online settings. The diagram uses a flowchart style with icons representing different components and processes. ### Components/Axes The diagram is structured around several key components: * **Training Data Curation:** Includes "Question", "Thinking Trajectories" (Step 1 to Step t), and "Final Response". * **Reward Design:** Includes "Quality Reward", "Coherence Reward", and "Alignment Reward", each with associated "Step-level reward" and "Trajectory-level reward". * **ReasonFlux-PRM:** Represented by a stylized brain icon. * **Expert LLM:** Represented by a head icon, used for judging and verifying. * **Policy Model:** Represented by a box with "Generate" and "Instruct" labels. * **Offline Setting:** Includes "Distilled Trajectory-Response Pairs", "High-quality Data Selection", and "Downstream Training". * **Online Setting:** Divided into "1. RL Training" and "2. Test-Time-Scaling". * **RL Training:** Shows a flow from "ReasonFlux-PRM" to "Anew" to "JGRPO" with "RL Policy Optimization". * **Test-Time-Scaling:** Displays "Response 1", "Response 2", and "Response 3" with associated scores. ### Detailed Analysis or Content Details **Training Data Curation:** * A "Question" initiates the process. * The question leads to "Thinking Trajectories" consisting of multiple steps (Step 1 to Step t). * These trajectories culminate in a "Final Response". * The trajectory-response data is used for reward design. 
**Reward Design:** * **Quality Reward:** An "Expert LLM" judges the quality. * **Coherence Reward:** Evaluates the coherence of "Thinking Trajectories" (Step 1 to Step 3, with ellipsis indicating more steps). * **Alignment Reward:** Assesses the alignment of "Thinking Trajectories" (Step 1 to Step 3, with ellipsis indicating more steps). * Rewards are provided at both "Step-level" and "Trajectory-level". **ReasonFlux-PRM Inference - Offline Setting:** * "Distilled Trajectory-Response Pairs" are used for "High-quality Data Selection". * This selection feeds into "Downstream Training". **ReasonFlux-PRM Inference - Online Setting:** * **RL Training:** "ReasonFlux-PRM" is optimized via "RL Policy Optimization" based on "Anew" and "JGRPO". * **Test-Time-Scaling:** * Response 1: Score = 0.19 * Response 2: Score = 0.54 * Response 3: Score = 0.97 **Policy Model:** * The "Policy Model" receives input and generates responses. * It is instructed by the "ReasonFlux-PRM" system. ### Key Observations * The diagram emphasizes the iterative nature of the training process, with multiple steps in the thinking trajectories. * The reward design incorporates multiple dimensions (quality, coherence, alignment) to guide the learning process. * The inference process has both offline (data-driven) and online (RL-based) components. * The test-time scaling shows a clear improvement in scores from Response 1 to Response 3, suggesting successful optimization. * The use of an "Expert LLM" for judging and verifying highlights the importance of human-level evaluation. ### Interpretation The diagram illustrates a sophisticated framework for training and deploying a reasoning model (ReasonFlux-PRM). The system leverages a combination of trajectory-response data, reward signals, and reinforcement learning to improve its performance. The separation of training and inference into distinct stages allows for both data-driven learning and real-time adaptation. 
The increasing scores in the test-time scaling section suggest that the RL training is effective in optimizing the policy model. The use of an Expert LLM indicates a focus on aligning the model's reasoning with human expectations. The diagram suggests a closed-loop system where the model learns from its interactions and continuously improves its reasoning capabilities. The diagram is a high-level overview and does not provide specific details about the algorithms or implementation details. However, it effectively conveys the key components and processes involved in the ReasonFlux-PRM system. </details>

Figure 4: Illustration of the overall method design. ReasonFlux-PRM is trained on trajectory–response data pairs with a novel reward design that integrates both step-level and trajectory-level signals. As a general-purpose PRM, ReasonFlux-PRM supports both offline data selection for supervised fine-tuning of small models and online reward modeling including policy optimization in RL training and test-time scaling.

## 4 ReasonFlux-PRM

In this section, we introduce ReasonFlux-PRM, a trajectory-aware process reward model, as illustrated in Figure 4. We first present a new reward design tailored for thinking trajectories in Section 4.1, which incorporates both step-level and trajectory-level signals to reflect fine-grained and holistic reasoning quality. We then elaborate how ReasonFlux-PRM is applied in a more general reward supervision setting in Section 4.2, covering both offline data selection and online reward modeling.

### 4.1 How Should We Define Process Rewards and Why?

We first propose a new reward design to train ReasonFlux-PRM from the trajectory–response data. Our formulation integrates both step-level and trajectory-level rewards to better address the discrepancy between intermediate thinking trajectories and final responses, and to align ReasonFlux-PRM with the underlying thinking process through more targeted reward signals during training.
Step-level reward for thinking trajectories. As discussed in Section 3, we observe that thinking trajectories are often more complex than final responses, frequently involving branching logic, self-corrections, and redundant reasoning. To better align these two, we incorporate a straightforward alignment score $r_{t}^{\text{align}}$ that measures the semantic similarity between each step in the intermediate thinking trajectories $s_{t}$ and each step in the final response $a_{t}$:

$$ r_{t}^{\text{align}}=\text{sim}(\Phi(s_{t}),\Phi(a_{t})), \tag{5} $$

where $\Phi$ is a pretrained encoder and $\text{sim}(\cdot,\cdot)$ denotes cosine similarity. This alignment score uses the final response as a learning signal for earlier thinking trajectories, encouraging those that are topically relevant to the final response and penalizing hallucinated or off-topic content.

Concurrently, to avoid over-penalizing complex yet meaningful thinking trajectory steps that may not be semantically aligned with the final response, we incorporate a complementary quality score $r_{t}^{\text{qual}}$. Inspired by the LLM-as-a-judge paradigm [24, 25, 26], we employ a strong expert model (e.g., GPT-4o) as a judge $J$ to evaluate the logical soundness of each step $s_{t}$ in context:

$$ r_{t}^{\text{qual}}=J(s_{t}\mid x,s_{<t},a). \tag{6} $$

The quality score is designed to capture deeper aspects inside reasoning traces, including step correctness, internal coherence, and progression toward the final response.

In addition to alignment with the final model output and logical step quality, we apply a step-by-step coherence score $r_{t}^{\text{coh}}$ to ensure contextual compatibility between adjacent reasoning steps using a contrastive mutual information formulation.
Specifically, we model the coherence between each thinking trajectory step $s_{t}$ and its predecessor $s_{t-1}$ by contrasting their embedding similarity against a set $\mathcal{N}$ of negative samples drawn from unrelated trajectories:

$$ r_{t}^{\text{coh}}=\log\frac{\exp(\text{sim}(\Phi(s_{t-1}),\Phi(s_{t}))/\tau)}{\sum_{s^{\prime}\in\mathcal{N}}\exp(\text{sim}(\Phi(s_{t-1}),\Phi(s^{\prime}))/\tau)}, \tag{7} $$

where $\tau$ is the temperature parameter. By penalizing incoherent transitions or topic shifts, the coherence score encourages each step to be semantically and logically consistent with its immediate predecessor while remaining distinct from unrelated or disjoint reasoning steps.

Finally, to aggregate the alignment, quality, and coherence scores into a unified reward signal, we apply softmax-based weighting over the three components:

$$ r_{t}^{\text{step}}=\sum_{k\in\{\text{align},\,\text{qual},\,\text{coh}\}}\text{softmax}(r_{t}^{\text{align}},r_{t}^{\text{qual}},r_{t}^{\text{coh}})_{k}\cdot r_{t}^{k}. \tag{8} $$

Template-Guided Trajectory-level Reward. While the step-level rewards offer fine-grained supervision on the completeness and coherence of individual reasoning steps, they may not fully assess whether the overall problem-solving strategy encoded in the model’s thinking trajectory reliably leads to correct solutions, as derived from the final response. We thus introduce a template-guided trajectory-level reward to evaluate each trajectory-response pair at a higher level of abstraction [27, 15]. Specifically, given an input problem $x$ and the distilled trajectory-response $y=s\oplus a$, we employ a strong expert LLM (e.g., GPT-4o) as a verifier $v$. The verifier processes the complete output $y$ and extracts a reasoning template $\mathcal{T}$, which captures the high-level strategy underlying the original trajectory-response trace. By abstracting the high-level strategy, the template provides a structured guide for subsequent reasoning.
The detailed prompt used for template generation is provided in Appendix B. Next, a policy model $\pi_{\theta}$ is conditioned on the extracted template $\mathcal{T}$ and tasked with solving the input problem $x$ by strictly adhering to the prescribed template $\mathcal{T}$. The model generates $N$ chain-of-thought responses as follows:

$$ y^{(1)},\dots,y^{(N)}\sim\pi_{\theta}(\cdot\mid x,\mathcal{T}). $$

Then, we define the trajectory-level reward $r^{\text{final}}$ as the average correctness of the generated responses:

$$ r^{\text{final}}=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}\big(y^{(j)}\text{ is correct}\big). \tag{9} $$

The template-guided trajectory-level reward evaluates whether the high-level reasoning strategy can be generalized and executed by the policy model independent of the low-level execution in the original trace.

Joint Training Objective. To fully leverage both step-level and trajectory-level supervision signals, we integrate the previously defined rewards and propose the following joint training objective:

$$ \mathcal{L}_{\text{total}}=\lambda_{\text{step}}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{\text{step}}\left(R_{\phi}(s_{t}\mid x,s_{<t},a),\ r_{t}^{\text{step}}\right)+\lambda_{\text{final}}\cdot\mathcal{L}_{\text{final}}\left(R_{\phi}(x,y),\ r^{\text{final}}\right), \tag{10} $$

where we adopt mean squared error (MSE) as the loss function for both the step and trajectory reward supervision, and $\lambda_{\text{step}}$ and $\lambda_{\text{final}}$ are tunable parameters that balance the relative contributions of fine-grained step supervision and high-level strategic feedback. We train ReasonFlux-PRM with this joint objective as the practical surrogate for the optimization objective in Eq. 4 to align with both token-level and trajectory-level reward signals, thereby enabling effective supervision on the trajectory-response data.
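The reward definitions in Eqs. 5 and 7 to 10 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation: the function names, the default temperature `tau=0.1`, and the default loss weights are assumptions, the embeddings stand in for the encoder outputs $\Phi(\cdot)$, and the judge score of Eq. 6 is passed in as a precomputed number because it requires an external LLM call.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def alignment_reward(step_emb, resp_emb):
    """Eq. 5: similarity between a thinking step and a response step."""
    return cosine(step_emb, resp_emb)

def coherence_reward(prev_emb, step_emb, negative_embs, tau=0.1):
    """Eq. 7: log-ratio of the adjacent-step pair against negative samples.

    tau=0.1 is a placeholder value, not one reported in the paper.
    """
    pos = cosine(prev_emb, step_emb) / tau
    negs = np.array([cosine(prev_emb, n) / tau for n in negative_embs])
    # log(exp(pos) / sum(exp(negs))), computed with a stable log-sum-exp
    m = negs.max()
    return float(pos - (m + np.log(np.exp(negs - m).sum())))

def step_reward(r_align, r_qual, r_coh):
    """Eq. 8: softmax-weighted sum of the three component scores."""
    r = np.array([r_align, r_qual, r_coh], dtype=float)
    w = np.exp(r - r.max())
    w /= w.sum()
    return float(np.dot(w, r))

def trajectory_reward(is_correct):
    """Eq. 9: fraction of the N template-guided responses that are correct."""
    return float(np.mean(is_correct))

def joint_loss(pred_steps, target_steps, pred_final, target_final,
               lam_step=1.0, lam_final=1.0):
    """Eq. 10 with MSE for both the step-level and trajectory-level terms."""
    pred_steps = np.asarray(pred_steps, dtype=float)
    target_steps = np.asarray(target_steps, dtype=float)
    step_term = np.mean((pred_steps - target_steps) ** 2)
    final_term = (pred_final - target_final) ** 2
    return float(lam_step * step_term + lam_final * final_term)
```

Because the softmax weights in `step_reward` grow with the component values, the largest of the three scores dominates the aggregate, which is the behavior Eq. 8 describes.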
### 4.2 Offline Data Selection and Online Reward Modeling

We elaborate on the utilities of ReasonFlux-PRM from two perspectives: (i) Offline trajectory-response data selection, where ReasonFlux-PRM is used to identify and select high-quality reasoning traces for downstream supervised fine-tuning and reinforcement learning; and (ii) Online reward modeling, where ReasonFlux-PRM provides token-level and trajectory-level reward signals during RL training, and enables efficient reward estimation for test-time scaling.

Offline Data Selection. For offline data selection, ReasonFlux-PRM assigns each trajectory–response pair $(x,y=s\oplus a)$ a step-level reward sequence $\{\hat{r}_{t}^{\text{step}}\}_{t=1}^{T}$ over its reasoning steps and a trajectory-level reward $\hat{r}^{\text{final}}$. The overall score is computed as:

$$ \hat{r}=\frac{1}{T}\sum_{t=1}^{T}\hat{r}_{t}^{\text{step}}+\alpha\cdot\hat{r}^{\text{final}}, \tag{11} $$

where $\alpha$ balances the contributions of local and global reward signals. The aggregated score $\hat{r}$ is used to filter samples for downstream supervised fine-tuning of smaller models.

Online Reward Modeling. We first leverage ReasonFlux-PRM to produce a composite reward signal that guides policy optimization through process-level supervision during reinforcement learning. Specifically, during RL training, we incorporate ReasonFlux-PRM into Group Relative Policy Optimization (GRPO) [17]. By default, GRPO optimizes for the outcome-level reward $r_{\text{out}}$, which reflects the task accuracy of the policy $\pi_{\theta}$ on each training sample. To incorporate process-level supervision from ReasonFlux-PRM, we augment this reward with the PRM-based reward $\hat{r}$ in Eq. 11.
Given input $x$ and sampled response $y\sim\pi_{\theta}(\cdot\mid x)$, the new composite reward used for policy training after incorporating ReasonFlux-PRM then becomes:

$$ r_{\text{new}}=(1-\beta)\cdot r_{\text{out}}+\beta\cdot\hat{r}, \tag{12} $$

where $\beta$ controls the relative weight of supervision from $\hat{r}$. With a group size of $G$ (i.e., the number of sampled responses per input), we proceed with group-normalized advantage estimation as:

$$ A_{\text{new}}=\frac{r_{\text{new}}-\text{mean}(\{r_{\text{new}}\}_{j=1}^{G})}{\text{std}(\{r_{\text{new}}\}_{j=1}^{G})}. \tag{13} $$

With the ReasonFlux-PRM-derived advantage term $A_{\text{new}}$, we then update the GRPO objective by using $A_{\text{new}}$ in place of the original advantage term. Note that ReasonFlux-PRM can be seamlessly integrated into other online RL policy optimization algorithms such as PPO [28] and REINFORCE++ [29] by replacing the reward signal with ReasonFlux-PRM’s composite rewards.

Reward-guided Test-Time Scaling. During inference, we further apply ReasonFlux-PRM to test-time-scaling strategies such as Best-of-N to identify the most promising output from a set of generated candidates. For each new input question and its corresponding set of sampled model responses, ReasonFlux-PRM assigns a score to each response based on the formulation in Eq. 11, and selects the response with the highest score as the final output.

## 5 Empirical Evaluations

We empirically evaluate ReasonFlux-PRM, focusing on two core applications: (i) Offline data selection, where ReasonFlux-PRM identifies high-quality reasoning traces to improve supervised fine-tuning; and (ii) Online reward modeling, where ReasonFlux-PRM offers reward signals for the Best-of-N decoding strategy in test-time scaling and for GRPO-based policy optimization.

Benchmarks.
We evaluate ReasonFlux-PRM on four representative and challenging reasoning benchmarks: MATH500 [20], a diverse set of 500 mathematical problems of varying difficulty; AIME24 [18], consisting of 30 problems from the 2024 American Invitational Mathematics Examination (AIME); AIME25, which includes 15 problems from the 2025 AIME [19]; and GPQA-Diamond [21], a benchmark of 198 PhD-level science questions to assess advanced scientific reasoning.

Implementation Details. We train ReasonFlux-PRM using two off-the-shelf base models, Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct [30], resulting in ReasonFlux-PRM-1.5B and ReasonFlux-PRM-7B, respectively. The training data is primarily sourced from public trajectory-response reasoning traces such as OpenThoughts-114K [31]. All experiments are conducted on 8 A100 GPUs. Additional experimental setups, including ReasonFlux-PRM training details and model configurations for downstream tasks, are provided in Appendix C.

Baselines and Models. For offline data selection, we compare ReasonFlux-PRM with the four frontier PRMs introduced in Section 3, using Qwen2.5-14B-Instruct [32, 30] as the generator model for standard supervised fine-tuning evaluations. For online reward modeling, constrained by computational resources, we primarily use 7B-scale models as policy models for reinforcement learning, including Qwen2.5-7B and DeepSeek-R1-Distill-Qwen-7B [12]. For test-time Best-of-N scaling, we adopt Qwen2.5-14B as the generator model to evaluate inference-time performance.

Table 2: Offline Data Selection Comparison. We fine-tune the generator model Qwen2.5-14B-Instruct using data selected by ReasonFlux-PRM-7B and additional baselines. The highest performance of the generators trained on each data source is shown in bold. ReasonFlux-PRM-7B achieves better performance than the strongest human-curated baseline.
| SFT Data Source | AIME24 | AIME25 | MATH500 | GPQA-Diamond |
| --- | --- | --- | --- | --- |
| Human-curated (s1k) | 33.3 | 33.3 | 78.8 | 41.4 |
| Random | 16.7 ( $\downarrow$ 16.6) | 20.0 ( $\downarrow$ 13.3) | 68.4 ( $\downarrow$ 10.4) | 34.8 ( $\downarrow$ 6.6) |
| Math-Shepherd-PRM-7B | 13.3 ( $\downarrow$ 20.0) | 6.7 ( $\downarrow$ 26.6) | 67.8 ( $\downarrow$ 11.0) | 33.3 ( $\downarrow$ 8.1) |
| Skywork-PRM-7B | 13.3 ( $\downarrow$ 20.0) | 13.3 ( $\downarrow$ 20.0) | 71.8 ( $\downarrow$ 7.0) | 37.9 ( $\downarrow$ 3.5) |
| Qwen2.5-Math-PRM-7B | 26.7 ( $\downarrow$ 6.6) | 20.0 ( $\downarrow$ 13.3) | 73.2 ( $\downarrow$ 5.6) | 39.4 ( $\downarrow$ 2.0) |
| Qwen2.5-Math-PRM-72B | 33.3 ( $\downarrow$ 0.0) | 26.7 ( $\downarrow$ 6.6) | 77.0 ( $\downarrow$ 1.8) | 39.4 ( $\downarrow$ 2.0) |
| on model responses | 36.7 ( $\uparrow$ 3.4) | 26.7 ( $\downarrow$ 6.6) | 77.8 ( $\downarrow$ 1.0) | 40.9 ( $\downarrow$ 0.5) |
| ReasonFlux-PRM-7B | 40.0 ( $\uparrow$ 6.7) | 33.3 ( $\uparrow$ 0.0) | 84.8 ( $\uparrow$ 6.0) | 47.5 ( $\uparrow$ 6.1) |

<details> <summary>plots/TAP_data.png Details</summary> ![823edf8a](/v1/image/823edf8ac9ce9b4b4db83b133d9ccd60478bf2b11abb93894de24bfb804043f5) ### Visual Description ## Histogram: Reward Score Distribution ### Overview This image presents a histogram comparing the distribution of "Reward Scores" for two models: "Deepseek-R1" and "Gemini Flash Thinking". The histogram displays the density of scores across a range from 0.0 to 1.0. ### Components/Axes * **X-axis:** "Reward Score" ranging from 0.0 to 1.0. * **Y-axis:** "Density", ranging from 0.0 to approximately 4.0. * **Legend:** Located at the top-right corner of the chart. * "Deepseek-R1" - represented by a blue color. * "Gemini Flash Thinking" - represented by an orange color. ### Detailed Analysis The chart shows two overlapping histograms. **Deepseek-R1 (Blue):** The distribution appears to be bimodal, with a peak around 0.75-0.85 and a smaller peak near 0.95.
The density increases from approximately 0.5 at a reward score of 0.0, reaches a maximum of approximately 3.5 at a reward score of 0.8, and then decreases to approximately 0.5 at a reward score of 1.0. **Gemini Flash Thinking (Orange):** The distribution is unimodal, peaking around 0.2-0.4. The density is highest at approximately 3.2 at a reward score of 0.2, and then gradually decreases to approximately 0.2 at a reward score of 1.0. There is a small secondary peak around 0.6. Here's an approximate breakdown of density values at specific reward scores: | Reward Score | Deepseek-R1 Density | Gemini Flash Thinking Density | |--------------|----------------------|-------------------------------| | 0.0 | ~0.5 | ~2.8 | | 0.2 | ~1.0 | ~3.2 | | 0.4 | ~1.5 | ~2.0 | | 0.6 | ~2.5 | ~1.0 | | 0.8 | ~3.5 | ~0.5 | | 1.0 | ~0.5 | ~0.2 | ### Key Observations * Deepseek-R1 generally achieves higher reward scores than Gemini Flash Thinking. * The distribution of reward scores for Deepseek-R1 is more spread out, indicating greater variability in performance. * Gemini Flash Thinking's distribution is more concentrated in the lower reward score range. * There is some overlap between the two distributions, particularly in the 0.6-0.8 range. ### Interpretation The data suggests that Deepseek-R1 consistently outperforms Gemini Flash Thinking in terms of reward scores. The bimodal distribution of Deepseek-R1 could indicate the presence of two distinct modes of operation or two different types of tasks where the model excels. Gemini Flash Thinking, on the other hand, appears to be more consistently performing at a lower level. The overlap in the distributions suggests that there are instances where Gemini Flash Thinking can achieve comparable scores to Deepseek-R1, but these are less frequent. 
The difference in the spread of the distributions could indicate that Deepseek-R1 is more sensitive to the specific input or task, while Gemini Flash Thinking is more robust but less capable of achieving very high scores. This data is likely derived from a reinforcement learning or reward-based evaluation framework, where the "Reward Score" represents the quality or desirability of the model's output. </details>

Figure 5: Distributions of the scores assigned by ReasonFlux-PRM-7B to 1,000 trajectory-response pairs generated by DeepSeek-R1 and Gemini.

Table 3: Performance of PRMs as reward signals in policy optimization. For each of the two policy models, i.e., DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-7B-Instruct, we run GRPO with three different reward signals: entirely rule-based, Qwen2.5-Math-PRM-7B, and ReasonFlux-PRM. The latter two non-rule-based rewards are factored into the overall reward signal according to Eq. 12. We report the mean accuracy averaged over 16 independent runs.

| Policy Model | Reward Signal Source | AIME24 | AIME25 | MATH500 | GPQA-Diamond |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | Rule-based | 12.9 | 11.1 | 73.6 | 32.7 |
| | Qwen2.5-Math-PRM-7B | 12.9 | 13.3 | 74.8 | 32.4 |
| | ReasonFlux-PRM-7B | 16.3 | 17.1 | 77.2 | 34.9 |
| DeepSeek-R1-Distill-Qwen-7B | Rule-based | 50.2 | 38.3 | 89.6 | 47.1 |
| | Qwen2.5-Math-PRM-7B | 51.2 | 40.8 | 92.8 | 49.1 |
| | ReasonFlux-PRM-7B | 54.6 | 44.2 | 94.8 | 51.6 |

### 5.1 Offline Data Selection

Table 2 presents the supervised fine-tuning results of Qwen2.5-14B-Instruct, with training data selected by different strategies, including ReasonFlux-PRM-7B, baseline PRMs, and human-curated examples. Notably, ReasonFlux-PRM-7B outperforms the high-quality human-curated s1k dataset. Specifically, our model achieves a 6.0% gain on MATH500 and a 6.1% improvement on GPQA-Diamond relative to the human-curated baseline.
We also plot the score distribution over the 1,000 trajectory-response pairs generated by DeepSeek-R1 and Gemini, as shown in Figure 5. The clearly separated score distributions in the figure demonstrate that ReasonFlux-PRM-7B effectively distinguishes the quality of trajectory-response pairs generated by different models, providing a reliable reward signal for high-quality data selection.

### 5.2 Online Reward Modeling

Reward Signal for RL training. Figure 6 and Table 3 present the training dynamics and downstream reasoning performance after incorporating different reward signals into policy optimization via GRPO. We evaluate two 7B-scale policy models: Qwen2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B. For each model, we compare three reward signal sources: a fully rule-based heuristic following the original GRPO approach, Qwen2.5-Math-PRM-7B, and ReasonFlux-PRM-7B. Across both policy models and all evaluated tasks, ReasonFlux-PRM-7B consistently delivers superior gains over both the rule-based and prior PRM-based reward signals. On Qwen2.5-7B-Instruct, ReasonFlux-PRM-7B improves performance by 3.4% on AIME24 and 6.0% on AIME25 relative to the rule-based baseline. On the stronger DeepSeek-R1-Distill-Qwen-7B model, ReasonFlux-PRM-7B further advances results, raising MATH500 accuracy from 89.6% to 94.8% and GPQA-Diamond from 47.1% to 51.6%. In addition, when directly comparing ReasonFlux-PRM-7B against Qwen2.5-Math-PRM-7B, we observe consistent improvements, for example a 3.8% gain on AIME25 with Qwen2.5-7B-Instruct and a 2.5% gain on GPQA-Diamond with DeepSeek-R1-Distill-Qwen-7B. These results demonstrate that the high-quality learned reward signals from ReasonFlux-PRM substantially enhance policy optimization, outperforming both heuristic and strong PRM baselines, and ultimately yielding more capable reasoning models through RL training.
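The reward signal behind these RL runs follows Eqs. 12 and 13. A minimal sketch of that computation for a single group of sampled responses is given below; the function names, the example value `beta=0.5`, and the small `eps` added to the denominator to avoid division by zero are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def composite_rewards(r_out, r_prm, beta=0.5):
    """Eq. 12: blend outcome rewards with PRM rewards for one group.

    beta=0.5 is a placeholder; the paper treats beta as a tunable weight.
    """
    r_out = np.asarray(r_out, dtype=float)
    r_prm = np.asarray(r_prm, dtype=float)
    return (1.0 - beta) * r_out + beta * r_prm

def group_advantages(r_new, eps=1e-8):
    """Eq. 13: normalize rewards by the group mean and standard deviation.

    eps is an added numerical guard (assumption) for groups in which
    every response receives the same reward.
    """
    r_new = np.asarray(r_new, dtype=float)
    return (r_new - r_new.mean()) / (r_new.std() + eps)

# Example: a group of G = 4 responses to the same prompt.
r_out = [1.0, 0.0, 1.0, 0.0]   # rule-based correctness
r_prm = [0.9, 0.4, 0.6, 0.1]   # PRM scores from Eq. 11 (made-up values)
adv = group_advantages(composite_rewards(r_out, r_prm))
```

With a purely rule-based reward, the two correct responses in this example would receive identical advantages; the PRM term separates them according to the quality of their reasoning.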
<details> <summary>plots/GRPO.png Details</summary> ![a226db89](/v1/image/a226db89bfb583ec17b6c2763510c980d26c14f10a2ac1bdc59579a66daf5a8b) ### Visual Description ## Line Charts: Training Performance of Language Models ### Overview The image presents three line charts (labeled a, b, and c) illustrating the training performance of three different language models: GRPO (Rule-based), Qwen2.5-Math-PRM-7B, and ReasonFlux-7B. Chart (a) shows the training reward over steps, chart (b) shows validation accuracy over steps, and chart (c) shows response length over steps. All charts share a common x-axis representing "Step," ranging from 0 to approximately 180. ### Components/Axes * **Chart (a): Training Reward vs. Step** * X-axis: Step (0 to 180, approximately) * Y-axis: Training Reward (0.1 to 0.45, approximately) * Data Series: * GRPO (Rule-based) - Orange * Qwen2.5-Math-PRM-7B - Green * ReasonFlux-7B - Blue * **Chart (b): Validation Accuracy vs. Step** * X-axis: Step (0 to 180, approximately) * Y-axis: Validation Accuracy (0 to 0.3, approximately) * Data Series: * ReasonFlux-7B - Blue * **Chart (c): Response Length vs. Step** * X-axis: Step (0 to 180, approximately) * Y-axis: Response Length (600 to 1600, approximately) * Data Series: * GRPO (Rule-based) - Orange * Qwen2.5-Math-PRM-7B - Green * ReasonFlux-7B - Blue ### Detailed Analysis or Content Details **Chart (a): Training Reward vs. Step** * **GRPO (Orange):** Starts at approximately 0.28, dips to around 0.12 at step 10, then fluctuates between 0.24 and 0.32 for the remainder of the steps. The line exhibits significant volatility. * **Qwen2.5-Math-PRM-7B (Green):** Begins at approximately 0.18, rises to around 0.30 by step 20, and then oscillates between 0.25 and 0.35. The line is less volatile than GRPO. * **ReasonFlux-7B (Blue):** Starts at approximately 0.32, increases steadily to around 0.42 by step 100, then fluctuates between 0.38 and 0.45. This line shows a clear upward trend. 
**Chart (b): Validation Accuracy vs. Step** * **ReasonFlux-7B (Blue):** Starts at approximately 0.24, decreases slightly to around 0.21 at step 20, then increases to approximately 0.28 by step 180. The line is relatively stable. **Chart (c): Response Length vs. Step** * **GRPO (Orange):** Starts at approximately 800, increases rapidly to around 1200 by step 40, then plateaus around 1200-1300 for the remainder of the steps. * **Qwen2.5-Math-PRM-7B (Green):** Begins at approximately 700, increases to around 900 by step 20, then fluctuates between 800 and 1000 before increasing to around 1400 by step 180. * **ReasonFlux-7B (Blue):** Starts at approximately 750, increases steadily to around 1500 by step 180. This line shows a consistent upward trend. ### Key Observations * ReasonFlux-7B consistently outperforms GRPO and Qwen2.5-Math-PRM-7B in terms of training reward (Chart a) and exhibits the highest validation accuracy (Chart b). * GRPO shows the most volatile training reward and a relatively stable response length. * Qwen2.5-Math-PRM-7B demonstrates moderate performance across all metrics. * Response length increases over time for all models, with ReasonFlux-7B showing the most significant increase. ### Interpretation The data suggests that ReasonFlux-7B is the most effective model among the three tested, achieving higher training rewards and validation accuracy. The increasing response length across all models indicates that they are learning to generate more complex outputs as training progresses. The volatility of GRPO's training reward might indicate instability in its learning process. The consistent upward trend in ReasonFlux-7B's training reward and response length suggests a stable and effective learning process. The validation accuracy of ReasonFlux-7B, while increasing, remains relatively low (around 0.28), indicating potential for further improvement. 
The differences in performance between the models could be attributed to variations in their architectures, training data, or hyperparameters. The charts provide a quantitative assessment of the models' learning dynamics and highlight the strengths and weaknesses of each approach. </details>

Figure 6: Training dynamics of GRPO policy optimization using ReasonFlux-PRM-7B as reward signals and Qwen2.5-7B-Instruct as the policy model. (a) Training reward vs. step: We compare the training reward evolution across original rule-based GRPO, Qwen2.5-Math-PRM-7B, and ReasonFlux-PRM-7B; (b) Validation accuracy vs. step: We report the validation accuracy during training with ReasonFlux-PRM-7B; (c) Response length vs. step: We report the evolution of generated response lengths over training steps with ReasonFlux-PRM-7B.

<details> <summary>plots/tts.png Details</summary> ![250a3c2e](/v1/image/250a3c2e991156df51698408bb16881df7cbb2ba1b0217e32acfa8d2cc902e6e) ### Visual Description ## Line Charts: Model Accuracy vs. Number of Solutions ### Overview This image presents three separate line charts, each displaying the accuracy of different models (ReasonFlux-PRM-7B, Qwen2.5-Math-PRM-72B, Skywork-PRM-7B, and Majority) as a function of the number of solutions (N). The charts are arranged horizontally, one next to the other, and represent data for three different datasets: AIME24, MATH500, and GPQA-Diamond. The y-axis represents accuracy in percentage (%), and the x-axis represents the number of solutions (N) on a logarithmic scale ($2^{1}$ to $2^{4}$). ### Components/Axes * **X-axis (all charts):** Number of Solutions (N), ranging from $2^{1}$ (2) to $2^{4}$ (16). * **Y-axis (all charts):** Accuracy (%), ranging from approximately 46% to 94%.
* **Legend (top-left, spanning all charts):** * ReasonFlux-PRM-7B (Blue line with diamond markers) * Qwen2.5-Math-PRM-72B (Green line with circle markers) * Skywork-PRM-7B (Brown line with square markers) * Majority (Red dashed line with triangle markers) * **Chart Titles (top-center of each chart):** * AIME24 * MATH500 * GPQA-Diamond ### Detailed Analysis or Content Details **AIME24 Chart:** * **ReasonFlux-PRM-7B (Blue):** The line slopes upward, starting at approximately 46% accuracy at N=2 and reaching approximately 52% at N=16. * N=2: ~46% * N=4: ~48% * N=8: ~50% * N=16: ~52% * **Qwen2.5-Math-PRM-72B (Green):** The line is relatively flat, starting at approximately 47% accuracy at N=2 and ending at approximately 48% at N=16. * N=2: ~47% * N=4: ~47% * N=8: ~47% * N=16: ~48% * **Skywork-PRM-7B (Brown):** The line slopes upward, starting at approximately 46% accuracy at N=2 and reaching approximately 49% at N=16. * N=2: ~46% * N=4: ~47% * N=8: ~48% * N=16: ~49% * **Majority (Red):** The line slopes upward, starting at approximately 40% accuracy at N=2 and reaching approximately 44% at N=16. * N=2: ~40% * N=4: ~42% * N=8: ~43% * N=16: ~44% **MATH500 Chart:** * **ReasonFlux-PRM-7B (Blue):** The line slopes sharply upward, starting at approximately 84% accuracy at N=2 and reaching approximately 93% at N=16. * N=2: ~84% * N=4: ~88% * N=8: ~91% * N=16: ~93% * **Qwen2.5-Math-PRM-72B (Green):** The line slopes upward, starting at approximately 83% accuracy at N=2 and reaching approximately 87% at N=16. * N=2: ~83% * N=4: ~85% * N=8: ~86% * N=16: ~87% * **Skywork-PRM-7B (Brown):** The line slopes upward, starting at approximately 82% accuracy at N=2 and reaching approximately 86% at N=16. * N=2: ~82% * N=4: ~84% * N=8: ~85% * N=16: ~86% * **Majority (Red):** The line slopes downward, starting at approximately 86% accuracy at N=2 and reaching approximately 84% at N=16. 
* N=2: ~86% * N=4: ~85% * N=8: ~85% * N=16: ~84% **GPQA-Diamond Chart:** * **ReasonFlux-PRM-7B (Blue):** The line slopes upward, starting at approximately 47% accuracy at N=2 and reaching approximately 54% at N=16. * N=2: ~47% * N=4: ~49% * N=8: ~51% * N=16: ~54% * **Qwen2.5-Math-PRM-72B (Green):** The line slopes upward, starting at approximately 47% accuracy at N=2 and reaching approximately 49% at N=16. * N=2: ~47% * N=4: ~48% * N=8: ~48% * N=16: ~49% * **Skywork-PRM-7B (Brown):** The line is relatively flat, starting at approximately 48% accuracy at N=2 and ending at approximately 50% at N=16. * N=2: ~48% * N=4: ~49% * N=8: ~49% * N=16: ~50% * **Majority (Red):** The line is relatively flat, starting at approximately 48% accuracy at N=2 and ending at approximately 48% at N=16. * N=2: ~48% * N=4: ~48% * N=8: ~48% * N=16: ~48% ### Key Observations * **MATH500** shows the highest overall accuracy across all models. * **ReasonFlux-PRM-7B** consistently performs well, particularly on MATH500 and GPQA-Diamond, showing a positive correlation between accuracy and the number of solutions. * The **Majority** model generally exhibits the lowest accuracy, and in the MATH500 dataset, its accuracy *decreases* with an increasing number of solutions. * The accuracy gains from increasing the number of solutions appear to diminish for some models (e.g., Qwen2.5-Math-PRM-72B on AIME24). ### Interpretation The charts demonstrate the performance of different language models on three distinct mathematical reasoning datasets as the complexity of the problem (represented by the number of solutions) increases. The significant performance difference across datasets suggests that the models' strengths and weaknesses are dataset-dependent. The positive correlation between accuracy and the number of solutions for ReasonFlux-PRM-7B on MATH500 and GPQA-Diamond indicates that this model benefits from more computational resources or a more thorough search process. 
The decreasing accuracy of the Majority model on MATH500 is a notable anomaly, potentially indicating that the majority vote approach is less effective for more complex problems. The relatively flat performance of Qwen2.5-Math-PRM-72B and Skywork-PRM-7B on AIME24 suggests they may have reached a performance plateau for this dataset, or that the benefit of increasing the number of solutions is minimal. These results provide valuable insights into the capabilities and limitations of each model and highlight the importance of selecting the appropriate model for a given task and dataset. </details> Figure 7: Test-time performance of Best-of-N selection using ReasonFlux-PRM-7B, Qwen2.5-Math-PRM-72B, and Skywork-PRM-7B across reasoning tasks. We also report results using the majority voting method. <details> <summary>x2.png Details</summary> ![8193445c](/v1/image/8193445c128ece61d8e1859f28363b41d3d8bbeabe96e3c87665e1e16d0f8b12) ### Visual Description ## Charts: SFT on Qwen2.5-14B-Instruct Performance ### Overview The image presents two line charts comparing the performance of different models on the MATH500 dataset. The first chart shows the impact of the number of training samples on accuracy, while the second chart shows the impact of parameter size on accuracy. Both charts use the same y-axis scale representing accuracy as a percentage. 

### Components/Axes

**Chart 1: Number of Training Samples**

* **Title:** SFT on Qwen2.5-14B-Instruct
* **X-axis:** Number of Training Samples (labeled with values: 1000, 5000, 10000, 59000)
* **Y-axis:** Accuracy on MATH500 (%) (labeled with values: 60, 70, 80, 90)
* **Legend:**
  * ReasonFlux-PRM-7B (Blue Circle)
  * Human selected (≤1k) (Orange Triangle)
  * Raw Data (59k) (Purple Square)

**Chart 2: Parameter Size of PRMs**

* **Title:** SFT on Qwen2.5-14B-Instruct
* **X-axis:** Parameter Size of PRMs (labeled with values: 1.5B, 7B, 72B)
* **Y-axis:** Accuracy on MATH500 (%) (labeled with values: 60, 70, 80, 90)
* **Legend:**
  * ReasonFlux-PRM-7B (Blue Circle)
  * Qwen2.5-Math-PRM-72B (Orange Triangle)
  * Qwen2.5-Math-PRM-7B (Purple Square)
  * Skywork-PRM-7B (Purple Square)
  * ReasonFlux-PRM-1.5B (Gray Circle)

### Detailed Analysis or Content Details

**Chart 1: Number of Training Samples**

* **ReasonFlux-PRM-7B (Blue):** The line slopes upward, indicating increasing accuracy with more selected training samples.
  * At 1000 samples: Approximately 84% accuracy.
  * At 5000 samples: Approximately 89% accuracy.
  * At 10000 samples: Approximately 92% accuracy.
  * At 59000 samples: Approximately 92% accuracy.
* **Human selected (≤1k) (Orange):** A single data point at approximately 82% accuracy.
* **Raw Data (59k) (Purple):** A single data point at approximately 87% accuracy.

**Chart 2: Parameter Size of PRMs**

* **ReasonFlux-PRM-7B (Blue):** Approximately 85% accuracy.
* **Qwen2.5-Math-PRM-72B (Orange):** Approximately 81% accuracy.
* **Qwen2.5-Math-PRM-7B (Purple):** Approximately 73% accuracy.
* **Skywork-PRM-7B (Purple):** Approximately 70% accuracy.
* **ReasonFlux-PRM-1.5B (Gray):** Approximately 76% accuracy.

### Key Observations

* In the first chart, increasing the number of training samples generally improves accuracy for ReasonFlux-PRM-7B, but the improvement plateaus after 10,000 samples.
* In the second chart, ReasonFlux-PRM-7B achieves the highest accuracy among the PRMs tested.
* Qwen2.5-Math-PRM-72B, despite having the largest parameter size, does not achieve the highest accuracy.
* Skywork-PRM-7B has the lowest accuracy.

### Interpretation

The data suggests that, when fine-tuning Qwen2.5-14B-Instruct, the number of selected training samples is a significant factor in MATH500 performance up to a certain point. Beyond 10,000 samples, the gains in accuracy diminish. The second chart indicates that the parameter size of the PRM alone does not guarantee higher downstream accuracy; how the PRM is trained also matters. The superior result for ReasonFlux-PRM-7B suggests that its data selection is more effective than that of the other PRMs tested, even though it has fewer parameters than Qwen2.5-Math-PRM-72B. The relatively low accuracy obtained with Skywork-PRM-7B could indicate that its scores are less suited to ranking this kind of data. In the first chart, the approximate readings place the human-selected point (about 82%) below the raw-data point (about 87%), while the ReasonFlux-PRM-7B curve reaches about 92% at 10,000 selected samples.

</details>

Figure 8: Efficiency Analyses on ReasonFlux-PRM-7B. Left: Accuracy on MATH500 improves steadily as the number of ReasonFlux-PRM-7B selected training samples increases, outperforming both human-selected (1k) and full raw data (59k) baselines with fewer total training instances. Right: ReasonFlux-PRM-7B achieves higher accuracy than other PRMs at the 7B scale and even than the larger 72B-parameter PRM.

Best-of-N in Test-Time Scaling. In Figure 7, we present Best-of-N selection results using ReasonFlux-PRM-7B and baseline PRMs across four reasoning tasks. For the generator model, we use the fine-tuned Qwen2.5-14B-Instruct with the same checkpoint as in Section 5.1. ReasonFlux-PRM-7B consistently leads to greater accuracy gains as N increases, outperforming all baselines by notable margins.
While other PRMs show diminishing or flat returns with increased sampling, ReasonFlux-PRM-7B maintains a strong upward trend, demonstrating its superior ability to identify high-quality reasoning traces.

Additional Performance Analyses. We leave further performance analyses on ReasonFlux-PRM and case studies to Appendix D and Appendix E.

### 5.3 Efficiency Analyses

In this section, we evaluate the efficiency of ReasonFlux-PRM-7B in both offline data selection for SFT and online RL settings by comparing the training performance and overhead under different data and reward supervision strategies.

<details> <summary>x3.png Details</summary>

![2d9c7dc5](/v1/image/2d9c7dc57b755a031bb1d7445c45a56d29924202e3247677aeb6f775bf009666)

### Visual Description

## Bar Charts: Training Time Comparison

### Overview

The image presents two bar charts comparing total training time under two training paradigms: Supervised Fine-tuning and Policy Optimization (Reinforcement Learning, RL). Each chart compares a bar labeled "59k" with a bar labeled "ReasonFlux-PRM-7B (1K)".

### Components/Axes

* **X-axis:** Setting ("59k", "ReasonFlux-PRM-7B (1K)")
* **Y-axis:** Total training time (unspecified units, but assumed to be consistent across both charts)
* **Chart 1 Title:** "Supervised Fine-tuning" (top-left)
* **Chart 2 Title:** "Policy Optimization (RL)" (top-right)
* **Bar Colors:**
  * "59k": Dark Blue
  * "ReasonFlux-PRM-7B (1K)": Orange-Red

### Detailed Analysis or Content Details

**Chart 1: Supervised Fine-tuning**

* **59k:** The dark blue bar reaches approximately 1.75 on the Y-axis.
* **ReasonFlux-PRM-7B (1K):** The orange-red bar reaches approximately 0.3 on the Y-axis.
* **Trend:** Training on the "59k" data takes significantly longer than training on the "ReasonFlux-PRM-7B (1K)" data.

**Chart 2: Policy Optimization (RL)**

* **59k:** The dark blue bar reaches approximately 1.2 on the Y-axis.
* **ReasonFlux-PRM-7B (1K):** The orange-red bar reaches approximately 1.8 on the Y-axis.
* **Trend:** The "ReasonFlux-PRM-7B (1K)" setting takes longer to train than the "59k" setting.

### Key Observations

* The relative training times of the two settings are reversed between the two training paradigms.
* In Supervised Fine-tuning, "ReasonFlux-PRM-7B (1K)" is much faster (about 0.3 versus about 1.75).
* In Policy Optimization (RL), "ReasonFlux-PRM-7B (1K)" is slower (about 1.8 versus about 1.2).
* The difference in training time is more pronounced in the Supervised Fine-tuning chart.

### Interpretation

The data suggests that ReasonFlux-PRM-7B affects training cost differently in the two paradigms. In Supervised Fine-tuning, training on the small set of selected samples takes far less time than training on the full 59k raw data. In Policy Optimization (RL), the bar associated with ReasonFlux-PRM-7B is higher, which is consistent with the additional computation needed to score reasoning traces during training. The gap is larger in the Supervised Fine-tuning chart, so the time saved by data selection in SFT is greater than the time added in RL.

</details>

Figure 9: Time overhead of ReasonFlux-PRM-7B during SFT and RL stages. For SFT, we compare the training time using 1k selected samples versus the full 59k raw data. For RL training, we evaluate the overall time with/without incorporating ReasonFlux-PRM-7B.

As shown in Figure 8, the data selected by ReasonFlux-PRM-7B reduces the amount of training data required while achieving superior model performance.
When fine-tuning Qwen2.5-14B-Instruct on only 1k samples selected by ReasonFlux-PRM-7B, the model outperforms the baseline trained on 59k raw trajectories by a substantial margin on MATH500. This highlights ReasonFlux-PRM's ability to identify high-quality, informative samples that yield greater performance per data point. The result aligns with recent findings on the power of curated supervision in data-efficient post-training, and further shows that ReasonFlux-PRM-7B can outperform even human-selected samples under similar data scales.

We further investigate the overhead of incorporating ReasonFlux-PRM-7B into policy optimization using the GRPO framework. As shown in the right panel of Figure 9, although ReasonFlux-PRM-7B introduces additional computation for step- and trajectory-level reward modeling, the increase in total training time remains moderate compared to standard GRPO. Crucially, this additional cost leads to consistent improvements in downstream reasoning performance, as demonstrated in our main experiments.

Our experiments in both the online and offline settings above demonstrate that ReasonFlux-PRM not only improves model performance across both SFT and RL regimes, but does so with modest computational overhead, achieving superior efficiency in reasoning-centric fine-tuning and optimization pipelines.

### 5.4 Ablation Study

Table 4: Ablation study on the $\alpha$ parameter.

| $\alpha$ | AIME24 | AIME25 | MATH500 |
| --- | --- | --- | --- |
| 0.1 | 26.7 | 6.7 | 81.2 |
| 0.8 | 40.0 | 33.3 | 83.6 |
| 1.0 | 33.3 | 33.3 | 84.8 |
| 1.5 | 33.3 | 40.0 | 83.2 |

Table 5: Ablation study on the $\beta$ parameter.

| $\beta$ | AIME24 | AIME25 | MATH500 |
| --- | --- | --- | --- |
| 0.1 | 10.0 | 6.7 | 73.6 |
| 0.3 | 13.3 | 13.3 | 74.4 |
| 0.5 | 13.3 | 6.7 | 75.2 |
| 0.8 | 20.0 | 16.7 | 76.8 |

Ablation on $\alpha$. As described in Eq. 11, the parameter $\alpha$ controls the balance between step-level rewards and the trajectory-level reward during ReasonFlux-PRM's reward aggregation. To assess the impact of this weighting, we conduct an ablation study by varying $\alpha\in\{0.1,0.8,1.0,1.5\}$, and use ReasonFlux-PRM-7B to select offline fine-tuning data accordingly. The Qwen2.5-14B-Instruct model is then fine-tuned on the top 1,000 selected examples and evaluated across AIME24, AIME25, and MATH500. As shown in Table 4, performance improves when more weight is placed on the trajectory-level reward: every setting with $\alpha\geq 0.8$ outperforms $\alpha=0.1$ on all three benchmarks. Notably, $\alpha=0.8$ gives the best result on AIME24, $\alpha=1.0$ achieves the best result on MATH500, and $\alpha=1.5$ yields the highest accuracy on AIME25. These results suggest that combining both local (step-level) and global (trajectory-level) reward signals is essential, and that moderate emphasis on trajectory-level reasoning is particularly beneficial for complex tasks. We also observe that the optimal value of $\alpha$ may be influenced by the underlying data distribution. As part of future work, we plan to make $\alpha$ learnable by introducing a lightweight neural module that dynamically adapts the weight between step-level and trajectory-level rewards based on the characteristics of each input sample.

Ablation on $\beta$. In Eq. 12, we introduce $\beta$ as a weighting coefficient to balance the original rule-based GRPO reward and the process-level reward provided by ReasonFlux-PRM-7B. To understand its influence, we conduct an ablation study by varying $\beta\in\{0.1,0.3,0.5,0.8\}$ and applying GRPO with ReasonFlux-PRM-7B reward integration on the Qwen2.5-7B-Instruct policy model. As shown in Table 5, we evaluate the resulting models across AIME24, AIME25, and MATH500. Performance improves overall with increasing $\beta$, apart from a dip on AIME25 at $\beta=0.5$, indicating the effectiveness of ReasonFlux-PRM's process-level supervision.
The highest gains are achieved at $\beta=0.8$, which yields 20.0% accuracy on AIME24, 16.7% on AIME25, and 76.8% on MATH500. The result demonstrates that a stronger emphasis on ReasonFlux-PRM rewards leads to more effective RL training.

## 6 Related Works

Offline Data Selection for CoT Reasoning at Scale. The quality of data has proven pivotal in the model training process [33, 34]. Recent studies further demonstrate that small subsets of high-quality data can outperform much larger unfiltered datasets in enhancing model reasoning abilities during post-training stages such as supervised fine-tuning [35, 13, 14, 36]. In contrast to online batch data selection methods [37, 38], which select samples based on updated model signals such as gradient norms or maximum sample loss during training, offline data selection approaches aim to select data once prior to the model training process. Motivated by the need for efficiency at scale, recent works have increasingly explored offline data selection as a means of curating high-quality datasets for LLM training. Beyond simple rejection sampling, these approaches either train an additional model for data selection [39, 40], or adaptively select data based on natural language quality indicators [41], dataset diversity [42], or model-specific quality labels [43, 44]. More recently, model distillation [45, 46] has been widely adopted to leverage longer reasoning traces distilled from large-scale reasoning models as training data for improving the capabilities of downstream smaller models. Methods such as s1 [13], LIMO [14], and ReasonFlux [15] adopt smaller subsets of human-selected high-quality distilled data, enabling smaller models to perform better on sophisticated reasoning tasks compared to training on much larger quantities of raw distilled data.
Building on these insights, instead of incurring additional computational costs by focusing solely on training data selection, our work extends the applicability of process reward models from traditional reward supervision to offline data selection, particularly in the context of raw model-distilled chain-of-thought reasoning trajectories [47, 48]. Leveraging the step-by-step supervision capability of PRMs, we utilize them as a metric to select high-quality reasoning traces from raw "silver" distilled data [16], with the goal of improving downstream post-training performance for smaller models.

Process Reward Models. Process Reward Models (PRMs) [5] provide step-level supervision for model reasoning answers, assigning intermediate rewards to each reasoning step [2, 49, 50, 51, 52, 7, 53, 54]. Existing PRMs, such as Math-Shepherd [55], Skywork-PRM [23], and the Qwen2.5-Math-PRM series [2], are trained on either human-annotated rewards [5] or synthesized supervision signals [4] to provide fine-grained step-level rewards for model-generated reasoning solutions across different tasks such as math problem solving [18, 56], science reasoning [21], and programming [57]. More recent work such as Think-PRM [9] introduces a generative PRM to produce long CoT verification. Prior works have integrated PRMs as reward signals during training [6, 1, 58, 7], such as step-by-step verified online RL policy optimization [7, 8] or iterative generator improvement through verifier-guided self-training [59]. Others apply PRMs during inference-time scaling [60, 10, 9, 61, 62, 15, 27] by integrating the models with step-level search and decoding strategies, including beam search [60], reward-guided tree search [63], and Best-of-N sampling [64].
However, since current PRMs are mostly trained on model-generated final solutions, they struggle to provide effective reward supervision for the internal reasoning trajectories produced by large reasoning models [12] prior to generating final answers. To address this, we design a new trajectory-aware PRM specifically aimed at providing reward supervision for such trajectory–response formatted long CoT data. ## 7 Conclusion We present ReasonFlux-PRM, a trajectory-aware PRM that delivers fine-grained step-level and trajectory-level supervision for trajectory-response long chain-of-thought reasoning traces. Through extensive empirical evaluations, ReasonFlux-PRM consistently improves downstream model performance across multiple challenging benchmarks and application settings. Specifically, ReasonFlux-PRM surpasses strong baselines and human-curated data in offline training data selection, enhances policy optimization during reinforcement learning via dense process-level rewards, and demonstrates superior test-time scaling in Best-of-N inference. Our results highlight the importance of trajectory-aware reward modeling for supervising model intermediate reasoning processes. The discussion of limitations and broader impacts is provided in Appendix F. ## References - [1] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024. - [2] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025. - [3] Wendi Li and Yixuan Li. Process reward model with q-value rankings. In The Thirteenth International Conference on Learning Representations, 2025. 
- [4] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2024. - [5] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024. - [6] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022. - [7] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025. - [8] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025. - [9] Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think, 2025. - [10] Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, and Bowen Zhou. Genprm: Scaling test-time compute of process reward models via generative reasoning, 2025. - [11] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024. - [12] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. 
- [13] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. - [14] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025. - [15] Ling Yang, Zhaochen Yu, Bin Cui, and Mengdi Wang. Reasonflux: Hierarchical llm reasoning via scaling thought templates. arXiv preprint arXiv:2502.06772, 2025. - [16] Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025. - [17] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. - [18] Maxwell-Jia. AIME 2024 dataset. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024, 2024. Accessed: 2025-05-15. - [19] math ai. AIME 2025 dataset. https://huggingface.co/datasets/math-ai/aime25, 2025. Accessed: 2025-05-15. - [20] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. - [21] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. - [22] Google. Gemini Flash Thinking API. Accessed via the Google Cloud Platform, 2025. Accessed on May 15, 2025. - [23] Jujie He, Tianwen Wei, Rui Yan, Jiacai Liu, Chaojie Wang, Yimeng Gan, Shiwen Tu, Chris Yuhao Liu, Liang Zeng, Xiaokun Wang, Boyang Wang, Yongcong Li, Fuxiang Zhang, Jiacheng Xu, Bo An, Yang Liu, and Yahui Zhou. Skywork-o1 open series. https://huggingface.co/Skywork, November 2024. 
- [24] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023. - [25] Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. arXiv preprint arXiv:2408.11791, 2024. - [26] Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, and Tianlu Wang. Learning to plan & reason for evaluation with thinking-llm-as-a-judge. arXiv preprint arXiv:2501.18099, 2025. - [27] Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E Gonzalez, and Bin Cui. Buffer of thoughts: Thought-augmented reasoning with large language models. Advances in Neural Information Processing Systems, 37:113519–113544, 2024. - [28] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. - [29] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025. - [30] Qwen Team. Qwen2.5: A party of foundation models, September 2024. - [31] OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025. 
- [32] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. - [33] Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358–50376, 2023. - [34] Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets llm finetuning: The effect of data, model and finetuning method. arXiv preprint arXiv:2402.17193, 2024. - [35] Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, and Haizhou Li. Take the essence and discard the dross: A rethinking on data selection for fine-tuning large language models. arXiv preprint arXiv:2406.14115, 2024. - [36] Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024. - [37] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In International conference on machine learning, pages 2525–2534. PMLR, 2018. - [38] Jiachen Tianhao Wang, Tong Wu, Dawn Song, Prateek Mittal, and Ruoxi Jia. 
Greats: Online selection of high-quality data for llm training in every iteration. Advances in Neural Information Processing Systems, 37:131197–131223, 2024. - [39] Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36:69798–69818, 2023. - [40] Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2024. - [41] Yihan Cao, Yanbin Kang, Chi Wang, and Lichao Sun. Instruction mining: Instruction data selection for tuning large language models, 2024. - [42] Minghao Wu, Thuy-Trang Vu, Lizhen Qu, and Gholamreza Haffari. The best of both worlds: Bridging quality and diversity in data selection with bipartite graph, 2024. - [43] Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, and Haizhou Li. Take the essence and discard the dross: A rethinking on data selection for fine-tuning large language models, 2025. - [44] Jiaru Zou, Mengyu Zhou, Tao Li, Shi Han, and Dongmei Zhang. Promptintern: Saving inference costs by internalizing recurrent prompt during large language model fine-tuning. arXiv preprint arXiv:2407.02211, 2024. - [45] Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, et al. Synthetic data (almost) from scratch: Generalized instruction tuning for language models. arXiv preprint arXiv:2402.13064, 2024. - [46] Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024. - [47] Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, and Dong Yu. 
Dots: Learning to reason dynamically in llms via optimal reasoning trajectories search. arXiv preprint arXiv:2410.03864, 2024. - [48] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, and Ludwig Schmidt. Openthoughts: Data recipes for reasoning models, 2025. - [49] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022. - [50] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816, 2024. - [51] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024. - [52] Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, et al. Exploring the limit of outcome reward for learning mathematical reasoning. arXiv preprint arXiv:2502.06781, 2025. - [53] Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. 
Co-evolving llm coder and unit tester via reinforcement learning. arXiv preprint arXiv:2506.03136, 2025. - [54] Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, and Heng Ji. Rm-r1: Reward modeling as reasoning, 2025. - [55] Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024. - [56] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2025. - [57] Xinyi He, Jiaru Zou, Yun Lin, Mengyu Zhou, Shi Han, Zejian Yuan, and Dongmei Zhang. CoCoST: Automatic complex code generation with online searching and correctness testing. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19433–19451, Miami, Florida, USA, November 2024. Association for Computational Linguistics. - [58] Wendi Li and Yixuan Li. Process reward model with q-value rankings. arXiv preprint arXiv:2410.11287, 2024. - [59] Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457, 2024. - [60] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024. - [61] Jiaru Zou, Qing Wang, Pratyush Thakur, and Nickvash Kani. Stem-pom: Evaluating language models math-symbol reasoning in document parsing. arXiv preprint arXiv:2411.00387, 2024. - [62] Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. 
Towards thinking-optimal scaling of test-time compute for llm reasoning. arXiv preprint arXiv:2502.18080, 2025. - [63] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724, 2024. - [64] Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling. arXiv preprint arXiv:2502.06703, 2025. - [65] Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf), 2024. - [66] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with alphacode. arXiv preprint arXiv:2203.07814, 2022. - [67] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society, 2023. - [68] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. 
- [69] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer Reinforcement Learning. Table of Contents 1. 1 Introduction 1. 2 Preliminaries 1. 3 Existing PRMs Are Not Prepared for Rewarding Thinking Trajectories 1. 4 ReasonFlux-PRM 1. 4.1 How Should We Define Process Rewards and Why? 1. 4.2 Offline Data Selection and Online Reward Modeling 1. 5 Empirical Evaluations 1. 5.1 Offline Data Selection 1. 5.2 Online Reward Modeling 1. 5.3 Efficiency Analyses 1. 5.4 Ablation Study 1. 6 Related Works 1. 7 Conclusion 1. A Details on the Preliminary Study in Section 3 1. A.1 Preliminary Study Setups 1. A.2 Difference between Model Thinking Trajectories and Final Responses 1. B Template guided trajectory-level reward design 1. C Additional Experimental Setups 1. C.1 ReasonFlux-PRM Training 1. C.2 Downstream Tasks 1. D Additional Analyses 1. D.1 Scaling up ReasonFlux-PRM Model Size on Policy Optimization 1. D.2 End-to-End Training with ReasonFlux-PRM (SFT+RL) 1. E Case Study on ReasonFlux-PRM 1. F Limitations and Broader Impacts ## Appendix A Details on the Preliminary Study in Section 3 ### A.1 Preliminary Study Setups Process Reward Models. We evaluate four state-of-the-art process reward models for scoring the quality of the thinking trajectories data: Math-Shepherd-PRM-8B [55], Skywork-PRM-7B [23], Qwen2.5-Math-PRM-7B [2], and Qwen2.5-Math-PRM-72B [2]. The details description for each model is shown below: - Math-Shepherd-PRM-8B [55]: A 7B PRM based on Mistral, trained with data auto-generated from Mistral-7B fine-tuned on MetaMath. It emphasizes verification of step-level reasoning through process-level rewards without human annotations. - Skywork-PRM-7B [23]: A PRM built on Qwen2.5-Math-7B-Instruct and trained on data derived from LLaMA-2 fine-tuned on math tasks. 
It shows strong generalization for verifying reasoning trajectories across models and supports efficient TTS with low inference overhead.
- Qwen2.5-Math-PRM-7B [2]: Trained on Qwen2.5-Math-7B-Instruct using data from the Qwen2.5-Math series, this PRM offers robust step-by-step reward signals and high compatibility with Qwen family models, demonstrating superior supervision ability in TTS tasks among 7B-scale PRMs.
- Qwen2.5-Math-PRM-72B [2]: A high-capacity verifier trained on Qwen2.5-Math-72B-Instruct and Qwen-generated data. It achieves state-of-the-art process supervision and excels in guiding both sampling- and search-based TTS strategies across a range of mathematical reasoning tasks.

Data Sources. Following s1k [13], we use its collected datasets, which consist of 59K raw model thinking trajectories distilled from the Google Gemini Flash Thinking API [22], along with 1K human-curated samples from the same source and an additional 1K human-curated samples from Deepseek-R1 [12]. These trajectories span a broad range of topics, including math and scientific reasoning. For downstream tasks, we choose four challenging benchmarks: AIME24 [18], AIME25 [19], MATH500 [20], and GPQA-Diamond [21].

Training Details in RQ2. We adopt Qwen2.5-14B-Instruct as the downstream generator and as the base model for fine-tuning evaluation. We perform supervised fine-tuning on it using various 1,000-sample training datasets, each either selected by a different PRM-based ranking or curated by human annotators in s1k [13]. We fine-tune the model for 5 epochs using a learning rate of $1\times 10^{-5}$, weight decay of $1\times 10^{-4}$, and a maximum sequence length of 32,768. All experiments are conducted on a server node with 8 A100-80G GPUs.
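The PRM-based selection step described above (rank candidate samples by reward, keep 1,000) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `aggregate_score` and `select_top_k` and the toy scores are hypothetical, and averaging step scores into one sample-level score is an assumption, since each PRM may aggregate differently.

```python
from statistics import mean


def aggregate_score(step_scores):
    """Collapse per-step PRM scores into one sample-level score.

    Using the mean is an assumption made for this sketch.
    """
    return mean(step_scores)


def select_top_k(samples, k=1000):
    """Rank samples by aggregated PRM score (descending) and keep the top k."""
    ranked = sorted(
        samples,
        key=lambda s: aggregate_score(s["step_scores"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data with hypothetical step scores (not taken from the paper).
toy_samples = [
    {"id": "a", "step_scores": [0.9, 0.8, 0.7]},
    {"id": "b", "step_scores": [0.4, 0.5, 0.3]},
    {"id": "c", "step_scores": [0.95, 0.9, 0.99]},
]

selected = select_top_k(toy_samples, k=2)
selected_ids = [s["id"] for s in selected]
print(selected_ids)
```

In the setting above, `k` would be 1,000 and the pool would be the 59K raw trajectories; the selected subset then serves as the SFT training set.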
### A.2 Difference between Model Thinking Trajectories and Final Responses

As mentioned in Section 3, the data in model thinking trajectories differ from final responses in two key ways:

- Branching steps across thinking trajectories: For instance, a thinking trajectory might initially assume an incorrect variable assignment in a math problem, detect the inconsistency, and backtrack to re-derive the solution.
- Weaker global coherence across steps: This manifests in speculative or uncertain statements (e.g., “if we assume X, then Y”) that may not resolve within the same reasoning path, and in disjointed or redundant logic, such as repeating subgoals or prematurely concluding without fully integrating prior steps.

In contrast, final responses are typically globally fluent and logically unified, aiming to deliver a streamlined and conclusive answer.

In Examples 1 and 2 below, we provide detailed data examples that highlight the distinctions between model thinking trajectories and final output responses. These examples showcase how the thinking trajectories often contain exploratory, speculative, or partially resolved reasoning steps, in contrast to the more polished and coherent final outputs. To better illustrate these differences, we highlight all branching steps, backtracking behaviors, and internal deliberations in blue, emphasizing the non-linear and less deterministic nature of the model’s intermediate reasoning process.

Example 1

Question: Two capacitors with capacitance values $C_{1}=2000\pm 10\ \mathrm{pF}$ and $C_{2}=3000\pm 15\ \mathrm{pF}$ are connected in series. The voltage applied across this combination is $V=5.00\pm 0.02\ \mathrm{V}$. What is the percentage error in the calculation of the energy stored in this combination of capacitors?
Thinking Trajectories:

Let $C_{1}$ and $C_{2}$ be the capacitance values of the two capacitors, and $\Delta C_{1}$, $\Delta C_{2}$ be their absolute errors. Given: $C_{1}=2000\pm 10\ \mathrm{pF}$, $C_{2}=3000\pm 15\ \mathrm{pF}$. Voltage applied across the combination, $V=5.00\pm 0.02\ \mathrm{V}$.

When capacitors are connected in series, the equivalent capacitance $C_{\text{eq}}$ is given by: $\frac{1}{C_{\text{eq}}}=\frac{1}{C_{1}}+\frac{1}{C_{2}}\Rightarrow C_{\text{eq}}=\frac{C_{1}C_{2}}{C_{1}+C_{2}}$

Let me double check, initially I thought the formula was $C_{1}+C_{2}$, but that’s for parallel combination. Since it’s in series, we should use the reciprocal formula.

Calculate the nominal value: $C_{\text{eq}}=\frac{2000\times 3000}{2000+3000}=\frac{6\times 10^{6}}{5000}=1200\ \mathrm{pF}$

To find the error in $C_{\text{eq}}$, we use: $\Delta C_{\text{eq}}=\sqrt{\left(\frac{\partial C_{\text{eq}}}{\partial C_{1}}\Delta C_{1}\right)^{2}+\left(\frac{\partial C_{\text{eq}}}{\partial C_{2}}\Delta C_{2}\right)^{2}}$

Partial derivatives: $\frac{\partial C_{\text{eq}}}{\partial C_{1}}=\frac{C_{2}^{2}}{(C_{1}+C_{2})^{2}},\quad\frac{\partial C_{\text{eq}}}{\partial C_{2}}=\frac{C_{1}^{2}}{(C_{1}+C_{2})^{2}}$

Evaluate: $\frac{9}{25},\quad\frac{4}{25}$

$\Delta C_{\text{eq}}=\sqrt{\left(\frac{9}{25}\times 10\right)^{2}+\left(\frac{4}{25}\times 15\right)^{2}}=\frac{1}{25}\sqrt{90^{2}+60^{2}}=\frac{1}{25}\sqrt{8100+3600}=\frac{1}{25}\sqrt{11700}\approx 4.33\ \mathrm{pF}$

Hmm, but is this the only way? Maybe logarithmic differentiation would give a cleaner expression?
Let me try an alternative method: $\frac{1}{C_{\text{eq}}}=\frac{1}{C_{1}}+\frac{1}{C_{2}}\Rightarrow-\frac{1}{C_{\text{eq}}^{2}}dC_{\text{eq}}=-\frac{1}{C_{1}^{2}}dC_{1}-\frac{1}{C_{2}^{2}}dC_{2}$

So: $\frac{dC_{\text{eq}}}{C_{\text{eq}}^{2}}=\frac{dC_{1}}{C_{1}^{2}}+\frac{dC_{2}}{C_{2}^{2}}\Rightarrow\frac{\Delta C_{\text{eq}}}{C_{\text{eq}}}=C_{\text{eq}}\left(\frac{\Delta C_{1}}{C_{1}^{2}}+\frac{\Delta C_{2}}{C_{2}^{2}}\right)$

This seems messier, maybe it is generalizable. Now let me compute it: $=\frac{3000\cdot 10}{2000\cdot 5000}+\frac{2000\cdot 15}{3000\cdot 5000}=0.003+0.002=0.005\Rightarrow\Delta C_{\text{eq}}=0.005\cdot 1200=6\ \mathrm{pF}$

Now compute energy: $E=\frac{1}{2}C_{\text{eq}}V^{2}$

Let me double-check: voltage is squared here, so the error contribution from voltage gets doubled in fractional terms.

Fractional error: $\frac{\Delta E}{E}=\frac{\Delta C_{\text{eq}}}{C_{\text{eq}}}+2\cdot\frac{\Delta V}{V}=0.005+2\cdot 0.004=0.013\Rightarrow\text{Percentage error}=1.3\%$

Alternative verification might help… I might try distribute the weight of errors proportionally: $\frac{\Delta C_{\text{eq}}}{C_{\text{eq}}}=\frac{\Delta C_{1}}{C_{1}}\cdot\frac{C_{2}}{C_{1}+C_{2}}+\frac{\Delta C_{2}}{C_{2}}\cdot\frac{C_{1}}{C_{1}+C_{2}}=0.005\cdot 0.6+0.005\cdot 0.4=0.005$

Let me Final check: $\ln E=\ln\left(\frac{1}{2}\right)+\ln C_{\text{eq}}+2\ln V\Rightarrow\frac{\Delta E}{E}=\frac{\Delta C_{\text{eq}}}{C_{\text{eq}}}+2\cdot\frac{\Delta V}{V}$

The final answer is: $\Rightarrow\boxed{1.3\%}$

Final Response:

Step 1: Calculate the Equivalent Capacitance

When capacitors are connected in series, the equivalent capacitance $C_{\text{eq}}$ is given by: $\frac{1}{C_{\text{eq}}}=\frac{1}{C_{1}}+\frac{1}{C_{2}}=\frac{C_{1}+C_{2}}{C_{1}C_{2}}\quad\Rightarrow\quad C_{\text{eq}}=\frac{C_{1}C_{2}}{C_{1}+C_{2}}$

$C_{\text{eq}}=\frac{2000\times 3000}{2000+3000}=\frac{6\times 10^{6}}{5000}=1200\ \mathrm{pF}$

Step 2: Calculate the Error in
$C_{\text{eq}}$

Using the formula for propagation of uncertainty: $\Delta C_{\text{eq}}=\left|\frac{\partial C_{\text{eq}}}{\partial C_{1}}\right|\Delta C_{1}+\left|\frac{\partial C_{\text{eq}}}{\partial C_{2}}\right|\Delta C_{2}$

Compute the partial derivatives: $\frac{\partial C_{\text{eq}}}{\partial C_{1}}=\frac{C_{2}^{2}}{(C_{1}+C_{2})^{2}},\quad\frac{\partial C_{\text{eq}}}{\partial C_{2}}=\frac{C_{1}^{2}}{(C_{1}+C_{2})^{2}}$

Substitute the given values: $\frac{\partial C_{\text{eq}}}{\partial C_{1}}=\frac{3000^{2}}{5000^{2}}=\frac{9}{25},\quad\frac{\partial C_{\text{eq}}}{\partial C_{2}}=\frac{2000^{2}}{5000^{2}}=\frac{4}{25}$

Now compute the error: $\Delta C_{\text{eq}}=\frac{9}{25}\cdot 10+\frac{4}{25}\cdot 15=\frac{90}{25}+\frac{60}{25}=\frac{150}{25}=6\ \mathrm{pF}$

Step 3: Compute the Fractional Error in $C_{\text{eq}}$

$\frac{\Delta C_{\text{eq}}}{C_{\text{eq}}}=\frac{6}{1200}=0.005$

Step 4: Energy Stored in the Capacitor System

The energy stored in a capacitor is given by: $E=\frac{1}{2}C_{\text{eq}}V^{2}$

Step 5: Propagate Error in Energy Calculation

The fractional error in energy $E$ is: $\frac{\Delta E}{E}=\frac{\Delta C_{\text{eq}}}{C_{\text{eq}}}+2\cdot\frac{\Delta V}{V}$

Given: $\frac{\Delta C_{\text{eq}}}{C_{\text{eq}}}=0.005,\quad\frac{\Delta V}{V}=\frac{0.02}{5.00}=0.004$

Substitute into the error formula: $\frac{\Delta E}{E}=0.005+2\cdot 0.004=0.005+0.008=0.013$

Step 6: Convert to Percentage Error

$\text{Percentage error}=\frac{\Delta E}{E}\times 100\%=1.3\%$

Final Answer: The percentage error in the energy stored is $\boxed{1.3\%}$.

Example 2

Question: One base of a trapezoid is $100$ units longer than the other base. The segment that joins the midpoints of the legs divides the trapezoid into two regions whose areas are in the ratio $2:3$. Let $x$ be the length of the segment, parallel to the bases, that divides the trapezoid into two regions of equal area. Find the greatest integer that does not exceed $\dfrac{x^{2}}{100}$.
Thinking Trajectories: The problem asks for the value of $\lfloor x^{2}/100\rfloor$ , where $x$ is the length of a segment parallel to the bases that divides the trapezoid into two regions of equal area. We are given information about the lengths of the bases and the ratio of the areas created by the segment joining the midpoints of the legs. Let the lengths of the two bases be $b_{1}$ and $b_{2}$ , with $b_{1}>b_{2}$ . We are told that: $b_{1}=b_{2}+100$ Let the height of the trapezoid be $h$ . The midline of the trapezoid has length: $m=\frac{b_{1}+b_{2}}{2}$ It divides the trapezoid into two smaller trapezoids, each of height $h/2$ . The area of the upper trapezoid is: $A_{1}=\frac{1}{2}(b_{2}+m)\cdot\frac{h}{2}=\frac{h}{8}(b_{1}+3b_{2})$ The area of the lower trapezoid is: $A_{2}=\frac{1}{2}(b_{1}+m)\cdot\frac{h}{2}=\frac{h}{8}(3b_{1}+b_{2})$ We are told that the ratio of areas is $2:3$ (upper to lower), so: $\frac{A_{1}}{A_{2}}=\frac{b_{1}+3b_{2}}{3b_{1}+b_{2}}=\frac{2}{3}\Rightarrow 3(b_{1}+3b_{2})=2(3b_{1}+b_{2})\Rightarrow 3b_{1}+9b_{2}=6b_{1}+2b_{2}\Rightarrow 7b_{2}=3b_{1}$ Substitute $b_{1}=b_{2}+100$ : $7b_{2}=3(b_{2}+100)\Rightarrow 7b_{2}=3b_{2}+300\Rightarrow 4b_{2}=300\Rightarrow b_{2}=75\Rightarrow b_{1}=175$ Let me also test the other case just in case it’s a typo, what if the area ratio was actually $3:2$ ? $\displaystyle\frac{b_{1}+3b_{2}}{3b_{1}+b_{2}}=\frac{3}{2}$ $\displaystyle\Rightarrow 2(b_{1}+3b_{2})=3(3b_{1}+b_{2})\Rightarrow 2b_{1}+6b_{2}=9b_{1}+3b_{2}\Rightarrow 3b_{2}=7b_{1}$ $\displaystyle\Rightarrow 3b_{2}=7(b_{2}+100)\Rightarrow 3b_{2}=7b_{2}+700\Rightarrow-4b_{2}=700\Rightarrow b_{2}=-175$ That’s invalid as the length can’t be negative. So the correct ratio is $2:3$ and $b_{2}=75$ , $b_{1}=175$ . Now consider a segment of length $x$ , parallel to the bases, that divides the trapezoid into two equal-area regions. Let the height from $b_{2}$ up to this segment be $h^{\prime}$ . 
The total area is: $A_{\text{total}}=\frac{1}{2}(b_{1}+b_{2})h=\frac{1}{2}(175+75)h=125h$ We want: $\frac{1}{2}(b_{2}+x)h^{\prime}=\frac{125h}{2}\Rightarrow(b_{2}+x)h^{\prime}=125h$ We also know from similar triangles: $\frac{x-b_{2}}{h^{\prime}}=\frac{b_{1}-b_{2}}{h}\Rightarrow\frac{h^{\prime}}{h}=\frac{x-b_{2}}{b_{1}-b_{2}}$ Substitute into the area equation: $(b_{2}+x)\cdot h\cdot\frac{x-b_{2}}{b_{1}-b_{2}}=125h\Rightarrow(b_{2}+x)(x-b_{2})=125(b_{1}-b_{2})$ $x^{2}-b_{2}^{2}=125(b_{1}-b_{2})\Rightarrow x^{2}=125(b_{1}-b_{2})+b_{2}^{2}$ Substitute $b_{1}=175$ , $b_{2}=75$ : $x^{2}=125(100)+75^{2}=12500+5625=18125\Rightarrow\left\lfloor\frac{x^{2}}{100}\right\rfloor=\left\lfloor\frac{18125}{100}\right\rfloor=\boxed{181}$ I also saw a different derivation that leads directly to this result. Let me verify it: $2x^{2}=b_{1}^{2}+b_{2}^{2}\Rightarrow x^{2}=\frac{b_{1}^{2}+b_{2}^{2}}{2}\Rightarrow x^{2}=\frac{175^{2}+75^{2}}{2}=\frac{30625+5625}{2}=18125\Rightarrow\left\lfloor\frac{x^{2}}{100}\right\rfloor=\boxed{181}$ Final Response: Step 1: Set up variables and use the area ratio condition Let the lengths of the two bases of the trapezoid be $b_{1}$ and $b_{2}$ , with $b_{1}>b_{2}$ . We are given that $b_{1}=b_{2}+100$ . The segment that joins the midpoints of the legs (the midline) divides the trapezoid into two smaller trapezoids with heights $h/2$ . The lengths of the parallel sides of the upper trapezoid are $b_{2}$ and $m=\frac{b_{1}+b_{2}}{2}$ , and its area is $A_{1}=\frac{h}{8}(b_{1}+3b_{2}).$ The lengths of the parallel sides of the lower trapezoid are $m$ and $b_{1}$ , and its area is $A_{2}=\frac{h}{8}(3b_{1}+b_{2}).$ The ratio of the areas is $A_{1}:A_{2}=(b_{1}+3b_{2}):(3b_{1}+b_{2})$ . Given that this ratio is $2:3$ or $3:2$ . 
If $\frac{b_{1}+3b_{2}}{3b_{1}+b_{2}}=\frac{2}{3},$ then $3b_{1}+9b_{2}=6b_{1}+2b_{2},$ so $7b_{2}=3b_{1}.$ Substituting $b_{1}=b_{2}+100$, we get $7b_{2}=3(b_{2}+100),$ so $4b_{2}=300,\quad b_{2}=75,\quad\text{and}\quad b_{1}=175.$

Step 2: Verify the alternate area ratio leads to contradiction

If $\frac{b_{1}+3b_{2}}{3b_{1}+b_{2}}=\frac{3}{2},$ we get $2b_{1}+6b_{2}=9b_{1}+3b_{2},$ so $3b_{2}=7b_{1},$ which leads to a negative value for $b_{2}$, which is impossible. The lengths of the bases are $b_{1}=175$ and $b_{2}=75$.

Step 3: Express the condition for equal-area division

Let the segment of length $x$ parallel to the bases divide the trapezoid into two regions of equal area. The square of the length of such a segment is given by $x^{2}=\frac{b_{1}^{2}+b_{2}^{2}}{2}.$

$x^{2}=\frac{175^{2}+75^{2}}{2}=\frac{30625+5625}{2}=\frac{36250}{2}=18125.$

Step 4: Compute the final result

We need to find the greatest integer that does not exceed $x^{2}/100$. $\frac{x^{2}}{100}=\frac{18125}{100}=181.25.$ The greatest integer that does not exceed $181.25$ is $\lfloor 181.25\rfloor=181.$

Final Answer: The final answer is $\boxed{181}$.

## Appendix B Template guided trajectory-level reward design

Prompt Design of the Template (LLM-as-a-Verifier)

You are given a long chain-of-thought (CoT) response to a challenging math problem. Your task is to summarize the response into a structured sequence of reasoning steps that can serve as a clear and guided template for use by a smaller model.

Problem: [problem]

Response: [response]

Instructions: Please summarize the response as a concise list of reasoning steps, each capturing a distinct part of the thought process. These may include restating the problem, defining variables, constructing mathematical models, performing calculations, verifying results, handling different cases, correcting mistakes, and drawing the final conclusion. Focus on preserving the logical flow while keeping each step clear and concise.
Here are a few template examples you should strictly follow:

[Template Example 1]

[Template Example 2]

[Template Example 3]

Write your answer below.

## Appendix C Additional Experimental Setups

### C.1 ReasonFlux-PRM Training

For ReasonFlux-PRM training, we initialize from the off-the-shelf Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct models [30], which serve as our 1.5B-scale and 7B-scale ReasonFlux-PRM backbones. We then train ReasonFlux-PRM on the OpenThoughts-114k [31] collection of datasets, which contains rich, model-generated thinking trajectories paired with their corresponding final responses. OpenThoughts-114k is a publicly available synthetic reasoning dataset comprising 114,000 high-quality examples across four domains: mathematics, science, code, and puzzles. Each example includes a problem statement, a thinking trajectory generated by Deepseek-R1, and a corresponding model response. The dataset was constructed by curating prompts from existing datasets, such as AI-MO/NuminaMath-CoT [65] for math, DeepMind/code-contests [66] for code, and camel-ai/chemistry [67] for science.

We use the model-generated thinking trajectories and final responses from these datasets as raw training data, and assign step-level and trajectory-level rewards according to the reward design detailed in Section 4.1. Training with the step-level reward follows the description in Section 4.1. To train with the template-guided trajectory-level reward, we first randomly sample 1000 problem-response samples from OpenThoughts-114k, and prompt GPT-4o to extract the reasoning template from each CoT sample using the prompt in Appendix B. For each problem-template pair, we choose Qwen2.5-7B-Instruct as our generator $\pi_{\theta}$ and generate $N=5$ responses that attempt to solve the problem while adhering to the reasoning template.
The ground truth trajectory-level reward is then computed as the average accuracy across the 5 responses, as shown in Eq. 9. We then combine the step-level and trajectory-level rewards to obtain the ground truth reward values for the 1000 samples, and train ReasonFlux-PRM to learn these reward values using the joint training objective in Eq. 10. To train our reward model, we use a learning rate of 1e-5 and train for 3 epochs.

### C.2 Downstream Tasks

For offline data selection and subsequent supervised fine-tuning, we follow the exact experimental setup described in Appendix A.1 to ensure a fair comparison with baseline models. Specifically, we begin by assigning reward scores to each trajectory–response pair in OpenThoughts-114k using the designated reward model. We then rank all samples based on their aggregated reward scores and select the top 1,000 examples to serve as the training set for downstream fine-tuning.

For online policy optimization, we use a training dataset comprising 10k competition-level mathematical reasoning problems collected from MATH [20] and the DAPO [68] training set. These training data contain math problems spanning a wide range of topics, including algebra, geometry, probability, and precalculus. Our GRPO training framework is built on the original Hugging Face GRPO Trainer [69]. We train with a batch size of 32, generating 6 samples per prompt, and run training for 3 epochs. As mentioned above, vanilla GRPO relies on a rule-based reward that evaluates only the correctness of the final answer. In contrast, we replace the rule-based reward with the learned reward signal obtained by passing the training prompt and the policy model’s output through ReasonFlux-PRM.

For the Best-of-N test-time scaling experiments, we use Qwen2.5-14B-Instruct as the generator model. Given an input problem $x$, the generator produces $N$ candidate reasoning trajectories using nucleus sampling with temperature $T=0.3$, where $N\in\{2,4,8,16\}$.
Each candidate trajectory is then scored by ReasonFlux-PRM, which provides a scalar reward reflecting the trajectory’s quality in terms of correctness, coherence, and reasoning structure. The final output is selected as the trajectory with the highest ReasonFlux-PRM assigned reward. We evaluate performance by measuring final-answer accuracy over the selected outputs.

## Appendix D Additional Analyses

### D.1 Scaling up ReasonFlux-PRM Model Size on Policy Optimization

Table 6: Scaling Effects of ReasonFlux-PRM model size on GRPO online policy optimization performance. Larger ReasonFlux-PRM reward models (7B vs. 1.5B) consistently yield better downstream performance on MATH500 and GPQA-Diamond across both Qwen2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B policy models.

| Policy Model | ReasonFlux-PRM Size | MATH500 | GPQA-Diamond |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 1.5B | 73.8 | 30.8 |
| Qwen2.5-7B-Instruct | 7B | 77.6 | 34.3 |
| DeepSeek-R1-Distill-Qwen-7B | 1.5B | 90.4 | 48.5 |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 93.8 | 51.5 |

To investigate the impact of reward model capacity, we vary the size of the ReasonFlux-PRM model used to provide rewards for GRPO-based policy optimization. As shown in Table 6, using a larger ReasonFlux-PRM model consistently improves performance across both policy models, Qwen2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B. Specifically, scaling ReasonFlux-PRM from 1.5B to 7B leads to a 3.8% gain on MATH500 and 3.5% on GPQA-Diamond for Qwen2.5-7B-Instruct. Likewise, for DeepSeek-R1-Distill-Qwen-7B, we observe a 3.4% improvement on MATH500 and 3.0% on GPQA-Diamond. These results indicate that larger reward models provide more accurate and informative signals for RL, thereby enabling stronger policy optimization.

### D.2 End-to-End Training with ReasonFlux-PRM (SFT+RL)

Table 7: Effect of ReasonFlux-PRM-7B selected supervised fine-tuning on downstream RL.
We compare the original backbone model (Checkpoint 1) and the model fine-tuned on 1k ReasonFlux-PRM-7B selected data (Checkpoint 2), each evaluated under different reward signal sources.

| Policy Model (Qwen2.5-7B-Instruct) | Reward Signal Source | MATH500 |
| --- | --- | --- |
| Checkpoint 1: Original backbone model | Rule-based | 74.0 |
| Checkpoint 1: Original backbone model | Qwen2.5-Math-PRM-7B | 75.4 |
| Checkpoint 1: Original backbone model | ReasonFlux-PRM-7B | 77.0 |
| Checkpoint 2: SFT on 1k ReasonFlux-PRM-7B selected data | Rule-based | 84.8 |
| Checkpoint 2: SFT on 1k ReasonFlux-PRM-7B selected data | Qwen2.5-Math-PRM-7B | 87.6 |
| Checkpoint 2: SFT on 1k ReasonFlux-PRM-7B selected data | ReasonFlux-PRM-7B | 89.8 |

As supervised fine-tuning followed by reinforcement learning (SFT+RL) has become a dominant paradigm for aligning large language models with reasoning-intensive tasks, we evaluate whether ReasonFlux-PRM can serve as a general-purpose reward model that is effective across both stages of training. Table 7 presents a comparative analysis on the Qwen2.5-7B-Instruct policy model, where we evaluate two checkpoints: (i) the original backbone model, and (ii) the same model after SFT on 1k ReasonFlux-PRM-7B selected data from the 59K raw data in s1 [13]. Both checkpoints are then further optimized with different reward signal sources during RL.

The results demonstrate that ReasonFlux-PRM-7B consistently improves downstream performance at both the SFT and RL stages. Across all reward signal sources, fine-tuning on 1k ReasonFlux-PRM-7B selected data improves performance over the original backbone model. Notably, the combination of ReasonFlux-PRM-7B based supervised fine-tuning and ReasonFlux-PRM-7B guided reinforcement learning yields the highest MATH500 accuracy of 89.8%, a 12.8% accuracy improvement over the original backbone model trained with the same ReasonFlux-PRM-7B reward (77.0%). These results highlight the end-to-end effectiveness of ReasonFlux-PRM as a general reward model for enhancing reasoning capabilities throughout the full training pipeline.
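The gains quoted in Sections D.1 and D.2 are absolute differences between table entries. The short sketch below recomputes them from the values in Tables 6 and 7; the dictionary layout and the helper name `gain` are illustrative choices, not part of the original work.

```python
# Accuracy values copied from Table 6: (MATH500, GPQA-Diamond).
table6 = {
    "Qwen2.5-7B-Instruct": {"1.5B": (73.8, 30.8), "7B": (77.6, 34.3)},
    "DeepSeek-R1-Distill-Qwen-7B": {"1.5B": (90.4, 48.5), "7B": (93.8, 51.5)},
}

# MATH500 accuracy copied from Table 7, keyed by reward signal source.
table7 = {
    "checkpoint1": {"Rule-based": 74.0, "Qwen2.5-Math-PRM-7B": 75.4, "ReasonFlux-PRM-7B": 77.0},
    "checkpoint2": {"Rule-based": 84.8, "Qwen2.5-Math-PRM-7B": 87.6, "ReasonFlux-PRM-7B": 89.8},
}


def gain(new, old):
    """Absolute gain in accuracy points, rounded to one decimal place."""
    return round(new - old, 1)


# Gain from scaling the reward model from 1.5B to 7B, per policy model.
scaling_gains = {
    policy: tuple(gain(n, o) for n, o in zip(rows["7B"], rows["1.5B"]))
    for policy, rows in table6.items()
}

# Gain from the SFT stage (Checkpoint 2 vs. Checkpoint 1), per reward source.
sft_gains = {
    source: gain(table7["checkpoint2"][source], table7["checkpoint1"][source])
    for source in table7["checkpoint1"]
}

print(scaling_gains)
print(sft_gains)
```

The 12.8-point figure in the text corresponds to the `ReasonFlux-PRM-7B` entry of `sft_gains`, that is, both checkpoints optimized with the same reward source.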
## Appendix E Case Study on ReasonFlux-PRM

In Case Studies 1 and 2 below, we present two responses to the same mathematical problem, one incorrect and one correct. For each response, we show the step-level and trajectory-level rewards assigned by ReasonFlux-PRM-7B. In the incorrect response, ReasonFlux-PRM-7B effectively identifies the point at which the reasoning begins to deviate by assigning significantly lower rewards to the subsequent steps. In contrast, the correct response receives consistently high rewards throughout the trajectory. Our study on the example suggests that ReasonFlux-PRM is not only capable of distinguishing between correct and incorrect solutions but also of pinpointing the precise step where an error in reasoning occurs.

## Appendix F Limitations and Broader Impacts

Limitations. While ReasonFlux-PRM demonstrates robust improvements in reasoning quality across diverse mathematical and scientific benchmarks, one limitation lies in its reliance on high-quality trajectory–response pairs for training. Although ReasonFlux-PRM is designed to be general-purpose and adaptable, its current instantiation is tailored toward domains with well-structured logic (e.g., math and science), where reward signals based on alignment, quality, and coherence can be meaningfully computed. Applying ReasonFlux-PRM in more open-ended tasks such as commonsense dialogue or code generation may require redefining trajectory-level reward criteria and further tuning of the reward decomposition strategy. However, these domains fall outside the scope of this work and do not detract from the validity or strength of our conclusions in the reasoning-focused setting.

Broader Impact. This paper introduces ReasonFlux-PRM, a trajectory-aware reward modeling framework that advances LLM reasoning by providing fine-grained supervision over both intermediate thinking steps and high-level solution strategies.
This improves LLM alignment, interpretability, and efficiency in structured reasoning tasks such as math and science, while supporting both supervised fine-tuning and reinforcement learning. While our work has broad implications, we do not foresee significant negative impacts or societal concerns that warrant specific emphasis. Case Study 1 Question: If $-6\leq a\leq-2$ and $3\leq b\leq 5$ , what is the greatest possible value of $\displaystyle\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ ? Thinking Trajectories: The problem asks for the greatest possible value of the expression $\displaystyle\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ given the ranges for $a$ and $b$ . [Step-level reward: 0.997] First, let’s expand the expression: $\displaystyle\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)=a\left(\frac{1}{b}-a\right)+\frac{1}{b}\left(\frac{1}{b}-a\right)$ $=a\cdot\frac{1}{b}-a^{2}+\frac{1}{b}\cdot\frac{1}{b}-\frac{1}{b}\cdot a$ $=\frac{a}{b}-a^{2}+\frac{1}{b^{2}}-\frac{a}{b}$ $=-a^{2}+\frac{1}{b^{2}}$ [Step-level reward: 0.983] Now we need to find the maximum value of $-a^{2}+\frac{1}{b^{2}}$ given the constraints $-6\leq a\leq-2$ and $3\leq b\leq 5$ . [Step-level reward: 0.996] Consider the terms $-a^{2}$ and $\frac{1}{b^{2}}$ separately. Since $-6\leq a\leq-2$ , we have $0\leq(-a)^{2}\leq(-(-6))^{2}=36$ . Therefore, $-36\leq-a^{2}\leq-4$ . [Step-level reward: 0.503] Since $3\leq b\leq 5$ , we have $9\leq b^{2}\leq 25$ . Therefore, $\frac{1}{25}\leq\frac{1}{b^{2}}\leq\frac{1}{9}$ . [Step-level reward: 0.895] We want to maximize $-a^{2}+\frac{1}{b^{2}}$ . This is equivalent to minimizing $-a^{2}$ and maximizing $\frac{1}{b^{2}}$ . [Step-level reward: 0.297] (This is the place where the thinking trajectory went wrong, because in order to maximize $-a^{2}+\frac{1}{b^{2}}$ , $-a^{2}$ needs to be maximized, not minimized. 
ReasonFlux-PRM assigns low reward starting from this segment within the thinking trajectory, which indicates that ReasonFlux-PRM is able to identify errors in reasoning traces.) The minimum value of $-a^{2}$ is $-36$ . The maximum value of $\frac{1}{b^{2}}$ is $\frac{1}{9}$ . [Step-level reward: 0.238] (The step-level rewards drop significantly since the last wrong step.) So the greatest possible value of $-a^{2}+\frac{1}{b^{2}}$ is $-36+\frac{1}{9}$ . Let’s calculate this value: $-36+\frac{1}{9}=-\frac{36\times 9}{9}+\frac{1}{9}=-\frac{324}{9}+\frac{1}{9}=-\frac{324-1}{9}=-\frac{323}{9}$ [Step-level reward: 0.280] Final Response: Given the ranges $-6\leq a\leq-2$ and $3\leq b\leq 5$ . We want to find the greatest possible value of the expression $\displaystyle\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ . [Step-level reward: 0.995] First, we expand the expression: $\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)=a\left(\frac{1}{b}-a\right)+\frac{1}{b}\left(\frac{1}{b}-a\right)$ $=a\cdot\frac{1}{b}-a^{2}+\frac{1}{b}\cdot\frac{1}{b}-\frac{1}{b}\cdot a$ $=\frac{a}{b}-a^{2}+\frac{1}{b^{2}}-\frac{a}{b}$ $=-a^{2}+\frac{1}{b^{2}}$ [Step-level reward: 0.997] Now we need to find the maximum value of $-a^{2}+\frac{1}{b^{2}}$ given the constraints for $a$ and $b$ . Consider the term $-a^{2}$ . Since $-6\leq a\leq-2$ , the values of $a^{2}$ range from $(-6)^{2}=36$ to $(-2)^{2}=4$ . Therefore, the values of $-a^{2}$ range from $-36$ to $-4$ . [Step-level reward: 0.988] Consider the term $\frac{1}{b^{2}}$ . Since $3\leq b\leq 5$ , the values of $b^{2}$ range from $3^{2}=9$ to $5^{2}=25$ . Therefore, the values of $\frac{1}{b^{2}}$ range from $\frac{1}{25}$ to $\frac{1}{9}$ . [Step-level reward: 0.897] To maximize $-a^{2}+\frac{1}{b^{2}}$ , we need to maximize $\frac{1}{b^{2}}$ and minimize $-a^{2}$ . The maximum value of $\frac{1}{b^{2}}$ is $\frac{1}{9}$ , which occurs when $b=3$ . The minimum value of $-a^{2}$ is $-36$ , which occurs when $a=-6$ . 
[Step-level reward: 0.108] (The low score here aligns with the low score in the model thinking trajectories.) Thus, the greatest possible value of $-a^{2}+\frac{1}{b^{2}}$ is $-36+\frac{1}{9}$ . $-36+\frac{1}{9}=-\frac{36\times 9}{9}+\frac{1}{9}=-\frac{324}{9}+\frac{1}{9}=-\frac{324-1}{9}=-\frac{323}{9}$ Final Answer: The final answer is $\boxed{-\frac{323}{9}}$ [Step-level reward: 0.319] [Trajectory-level reward: 0.197] (As the final answer is incorrect, ReasonFlux-PRM assigns a low trajectory-level reward to the overall trajectory-response data.) Case Study 2 Question: If $-6\leq a\leq-2$ and $3\leq b\leq 5$ , what is the greatest possible value of $\displaystyle\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ ? Thinking Trajectories: The problem asks for the greatest possible value of the expression $\displaystyle\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ given the ranges for $a$ and $b$ . [Step-level reward: 0.998] Let $E=\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ . We can expand the expression: $E=a\left(\frac{1}{b}\right)+a\left(-a\right)+\frac{1}{b}\left(\frac{1}{b}\right)+\frac{1}{b}\left(-a\right)$ $E=\frac{a}{b}-a^{2}+\frac{1}{b^{2}}-\frac{a}{b}$ $E=-a^{2}+\frac{1}{b^{2}}$ [Step-level reward: 0.985] We need to find the maximum value of $E$ given the constraints $-6\leq a\leq-2$ and $3\leq b\leq 5$ . [Step-level reward: 0.998] Since $E=-a^{2}+\frac{1}{b^{2}}$ , we need to minimize $a^{2}$ and maximize $\frac{1}{b^{2}}$ . [Step-level reward: 0.974] For $a$ , the range is $-6\leq a\leq-2$ . The minimum value of $a^{2}$ occurs when $a$ is closest to 0. The values of $a$ are negative, so the minimum value of $a^{2}$ occurs when $a$ is the least negative, which is $-2$ . When $a=-2$ , $a^{2}=(-2)^{2}=4$ . [Step-level reward: 0.962] For $b$ , the range is $3\leq b\leq 5$ . The maximum value of $\frac{1}{b^{2}}$ occurs when $b$ is the smallest. When $b=3$ , $\frac{1}{b^{2}}=\frac{1}{3^{2}}=\frac{1}{9}$ . 
[Step-level reward: 0.985] So, the maximum value of $E$ occurs when $a^{2}$ is minimized and $\frac{1}{b^{2}}$ is maximized. Maximum $E=-a^{2}+\frac{1}{b^{2}}=-4+\frac{1}{9}$ . [Step-level reward: 0.818] To add these values, we need a common denominator: Maximum $E=-4+\frac{1}{9}=-\frac{4\times 9}{9}+\frac{1}{9}=-\frac{36}{9}+\frac{1}{9}=-\frac{36-1}{9}=-\frac{35}{9}$ . [Step-level reward: 0.815] Let’s double check the calculations. The expression is $\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ . Expand the expression: $a\cdot\frac{1}{b}+a\cdot(-a)+\frac{1}{b}\cdot\frac{1}{b}+\frac{1}{b}\cdot(-a)$ $=\frac{a}{b}-a^{2}+\frac{1}{b^{2}}-\frac{a}{b}$ $=-a^{2}+\frac{1}{b^{2}}$ [Step-level reward: 0.980] We want to maximize $-a^{2}+\frac{1}{b^{2}}$ . Since $a^{2}\geq 0$ , $-a^{2}\leq 0$ . Since $b^{2}>0$ , $\frac{1}{b^{2}}>0$ . The maximum value of $\frac{1}{b^{2}}$ is $\frac{1}{3^{2}}=\frac{1}{9}$ , which occurs when $b=3$ . We need to find the minimum value of $a^{2}$ in the range $[-6,-2]$ . The minimum value of $a^{2}$ occurs at the boundary of the interval, which is $a=-2$ . When $a=-2$ , $a^{2}=(-2)^{2}=4$ . [Step-level reward: 0.927] We need to maximize $\frac{1}{b^{2}}$ in the range $[3,5]$ . The maximum value of $\frac{1}{b^{2}}$ occurs at the boundary of the interval, which is $b=3$ . When $b=3$ , $\frac{1}{b^{2}}=\frac{1}{3^{2}}=\frac{1}{9}$ . [Step-level reward: 0.989] So the maximum value of the expression is $-a^{2}+\frac{1}{b^{2}}=-4+\frac{1}{9}=-\frac{36}{9}+\frac{1}{9}=-\frac{35}{9}$ . Final Response: Given the ranges for $a$ and $b$ : $-6\leq a\leq-2$ $3\leq b\leq 5$ [Step-level reward: 0.958] We want to find the greatest possible value of the expression $\displaystyle\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)$ . 
[Step-level reward: 0.999]

First, we expand the expression:
$\left(a+\frac{1}{b}\right)\left(\frac{1}{b}-a\right)=a\left(\frac{1}{b}\right)+a\left(-a\right)+\frac{1}{b}\left(\frac{1}{b}\right)+\frac{1}{b}\left(-a\right)$
$=\frac{a}{b}-a^{2}+\frac{1}{b^{2}}-\frac{a}{b}$
$=-a^{2}+\frac{1}{b^{2}}$ [Step-level reward: 0.998]

To find the greatest possible value of this expression, we need to minimize $a^{2}$ and maximize $\frac{1}{b^{2}}$. [Step-level reward: 0.987]

For the term $-a^{2}$, we want to minimize $a^{2}$. Since $a^{2}\geq 0$ for all real numbers $a$, the minimum value of $a^{2}$ occurs when $a=0$. However, the given range for $a$ is $[-6,-2]$, so $a=0$ is not within the range. The minimum value of $a^{2}$ in the given range occurs at the boundary of the interval, which is $a=-2$. When $a=-2$, $a^{2}=(-2)^{2}=4$. [Step-level reward: 0.967]

For the term $\frac{1}{b^{2}}$, we want to maximize $\frac{1}{b^{2}}$. Since $b^{2}>0$ for all real numbers $b\neq 0$, the maximum value of $\frac{1}{b^{2}}$ occurs when $b^{2}$ is the smallest. The smallest value of $b^{2}$ in the given range occurs when $b$ is the largest in absolute value within the interval $[3,5]$, which is $b=3$. When $b=3$, $b^{2}=3^{2}=9$, and $\frac{1}{b^{2}}=\frac{1}{9}$. [Step-level reward: 0.667]

Now, we substitute the minimum value of $a^{2}$ and the maximum value of $\frac{1}{b^{2}}$ into the expanded expression:
$-a^{2}+\frac{1}{b^{2}}=-4+\frac{1}{9}$
To combine these terms, we find a common denominator:
$-4+\frac{1}{9}=-\frac{4\times 9}{9}+\frac{1}{9}=-\frac{36}{9}+\frac{1}{9}=-\frac{36-1}{9}=-\frac{35}{9}$ [Step-level reward: 0.993]

Thus, the greatest possible value of the given expression is $-\frac{35}{9}$.
[Step-level reward: 0.988]

Final Answer: The final answer is $\boxed{-\frac{35}{9}}$ [Step-level reward: 0.999]

[Trajectory-level reward: 0.949] (As the final answer is correct, ReasonFlux-PRM assigns a high trajectory-level reward to the overall trajectory-response data.)
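
The arithmetic behind the two final answers can be checked with exact rational arithmetic. The following is a minimal sketch using Python's standard `fractions` module. The helper name `expression` and the corner enumeration are illustrative assumptions and are not part of ReasonFlux-PRM. Because $-a^{2}+\frac{1}{b^{2}}$ is increasing in $a$ on $[-6,-2]$ and decreasing in $b$ on $[3,5]$, it suffices to compare the four corner points of the feasible box.

```python
from fractions import Fraction


def expression(a, b):
    """Evaluate (a + 1/b) * (1/b - a) exactly as a rational number."""
    inv_b = Fraction(1) / Fraction(b)
    return (a + inv_b) * (inv_b - a)


# Illustrative assumption: the expression is monotone in each variable on the
# given ranges, so the maximum is attained at one of the four corners.
corners = [(a, b) for a in (-6, -2) for b in (3, 5)]
best_a, best_b = max(corners, key=lambda ab: expression(*ab))
best_value = expression(best_a, best_b)

# Value obtained when a = -6 is used in place of a = -2 (with b = 3),
# which matches the incorrect answer in the first case study.
wrong_value = expression(-6, 3)

print(best_a, best_b, best_value)  # -2 3 -35/9
print(wrong_value)                 # -323/9
```

The maximizing corner $(a,b)=(-2,3)$ reproduces the answer $-\frac{35}{9}$ of Case Study 2, while substituting $a=-6$ yields $-\frac{323}{9}$, the value reported in the low-reward response.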
