2509.21154
# GRPO is Secretly a Process Reward Model
**Authors**:
- Michael Sullivan (Department of Language Science and Technology, Saarland Informatics Campus, Saarland University, SaarbrĂŒcken, Germany)
Abstract
We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
1 Introduction
Process reward models (PRMs), i.e. models that assign reward to intermediate steps (see Section 2.2), allow for finer-grained reward assignment than outcome-level signals, thereby yielding improved multi-step reasoning performance (Lightman et al., 2024). PRMs are therefore particularly applicable to RL training for step-by-step processes such as mathematical reasoning.
However, training neural PRMs requires costly, step-level human annotation (Zhang et al., 2025), and such models are particularly susceptible to reward hacking (Cui et al., 2025). These shortcomings have resulted in the limited adoption of learned PRMs for RL training (Setlur et al., 2025), leading to the development of Monte-Carlo-based and other heuristic, non-neural PRMs (e.g. Wang et al., 2024; Kazemnejad et al., 2025; Hou et al., 2025).
PRMs are typically employed with RL algorithms such as Proximal Policy Optimization (PPO; Schulman et al., 2017) that employ a critic model and/or generalized advantage estimation (GAE). Although Group Relative Policy Optimization (GRPO; Shao et al., 2024) greatly simplifies and reduces the memory consumption of RL training, leading to its adoption for a wide range of applications, e.g. tool use (Qian et al., 2025; Sullivan et al., 2025), RLHF (Yang et al., 2025b), and, in particular, mathematical reasoning (Shao et al., 2024; DeepSeek-AI, 2025), it does so by eliminating the critic model and GAE of PPO (see Section 2.1). GRPO has therefore not been widely used with PRMs: to the best of our knowledge, Shao et al. (2024), Yang et al. (2025a), and Feng et al. (2025) are the only instances in which GRPO is employed with step-level rewards, and these approaches necessitate the modification of the algorithm to accommodate finer-grained reward signals.
In this paper, we show that, under certain mild assumptions, GRPO induces a Monte-Carlo-based PRM (Section 3). Specifically, we prove theoretically that GRPO assigns rewards (and advantages) derived from outcome-level rewards and Monte-Carlo-sampled completions to sub-trajectories, whenever subsets of trajectories within each group share identical prefixes. We then show empirically that this identical-prefix condition is almost always met under real-world conditions, yielding rich step-level process reward structures. These two findings definitively demonstrate that the GRPO objective covertly assigns and optimizes for complex, structured step-level rewards and advantages.
An investigation of the properties of GRPO's hidden PRM reveals a defect in the objective function that hinders both exploration and exploitation (under different conditions) during RL training (Section 4): namely, a vulnerability to non-uniformly-distributed process steps within a group. To mitigate this shortcoming, we propose incorporating a PRM-aware normalization factor into the GRPO loss function ($\lambda$-GRPO).
We show that $\lambda$ -GRPO results in higher validation accuracy and a $\sim$ 2x training speedup over standard GRPO (Section 5). On downstream reasoning benchmarks, $\lambda$ -GRPO consistently improves over standard GRPO, demonstrating the superiority of our method.
Our findings call into question the utility of dedicated PRMs for GRPO, and suggest that future work may benefit instead from exploiting the step-level reward signal already available to the GRPO algorithm.
2 Background
2.1 GRPO
GRPO is a variant of PPO that discards the critic model and GAE of the latter. Instead, for each query $q$ in the training set, GRPO nondeterministically samples a group ${\mathbb{G}}$ of $k$ trajectories (completions to $q$) and computes the advantage ${a}_{i}$ for the completion $g^{(i)}\in{\mathbb{G}}$ relative to the mean (outcome) reward of ${\mathbb{G}}$, as in Equation 1 (where ${r}_{i}$ is the reward for $g^{(i)}$).
$$
{a}_{i}=\frac{{r}_{i}-r_{\text{{mean}}}({\mathbb{G}})}{r_{\text{{std}}}({\mathbb{G}})} \tag{1}
$$
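As a concrete sketch (illustrative code, not the authors' implementation), the group-relative advantage of Equation 1 amounts to standardizing each outcome reward against the group statistics:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages (Eq. 1): normalize each outcome reward
    against the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

# A group of four completions: two correct (reward 1), two incorrect (reward 0).
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # roughly [1, -1, -1, 1]
```

Correct completions receive positive advantage and incorrect ones negative advantage, with the group advantages summing to (approximately) zero.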
In our theoretical analysis in Section 3, we make two key assumptions: first, we assume the use of the DAPO token-level policy gradient objective (Yu et al., 2025), rather than sample-level loss. Although it differs from the original GRPO formulation laid out in Shao et al. (2024), Yu et al. (2025) show that this objective leads to more stable training, and it is the standard GRPO loss function employed in commonly-used RL packages (e.g. the TRL GRPO trainer https://huggingface.co/docs/trl/main/en/grpo_trainer).
Second, we assume that the number of update iterations per batch ( $\mu$ ) is set to $\mu=1$ . Under this assumption, the ratio ${P}_{i,t}$ is fixed at $1.0$ (see Equation 2b), allowing us to ignore the clipping factor of the GRPO loss function in our theoretical analysis.
Under these two assumptions, the per-group GRPO loss $L_{\text{GRPO}}({\mathbb{G}})$ reduces to that in Equation 2.
$$
\displaystyle L_{\text{GRPO}}({\mathbb{G}})=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t} \tag{2a}
$$
$$
{P}_{i,t}=\frac{\pi_{\theta}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}{\pi_{\theta_{\textit{old}}}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})} \tag{2b}
$$
$$
{D}_{i,t}=\beta\cdot\left(\frac{\pi_{\theta_{\textit{ref}}}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}{\pi_{\theta}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}-\textit{ln}\frac{\pi_{\theta_{\textit{ref}}}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}{\pi_{\theta}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}-1\right) \tag{2c}
$$
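A minimal sketch of this objective under the two assumptions above ($\mu=1$, so every ratio $P_{i,t}=1$); the dict layout for completions is an assumption of this sketch, not the paper's code:

```python
import math

def kl_term(logp_ref, logp_cur, beta):
    """D_{i,t} of Eq. 2c: beta * (exp(d) - d - 1), where d is the
    reference/current log-probability ratio. Always >= 0."""
    d = logp_ref - logp_cur
    return beta * (math.exp(d) - d - 1.0)

def grpo_loss(group, beta=0.0):
    """Token-level GRPO objective (Eq. 2a) with mu = 1, so P_{i,t} = 1.
    `group` is a list of completions, each a dict (assumed layout) with an
    outcome advantage "a" and per-token (logp_ref, logp_cur) pairs."""
    n_tokens = sum(len(g["tokens"]) for g in group)
    total = 0.0
    for g in group:
        for logp_ref, logp_cur in g["tokens"]:
            total += 1.0 * g["a"] - kl_term(logp_ref, logp_cur, beta)
    return total / n_tokens
```

Note the token-level (DAPO-style) normalization: the sum is divided by the total token count of the group, not averaged per completion.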
2.2 Process Reward Models (PRMs)
Given an alphabet $\Sigma$ (i.e. set of tokens), we formally define a PRM as a function $f_{\phi}\colon\Sigma^{*}\to(\Sigma^{*}\times\mathbb{R})^{*}$ parameterized by $\phi$ that maps a trajectory $y\in\Sigma^{*}$ to the sequence $f_{\phi}(y)=((y_{:i_{1}},r^{(y)}_{0}),(y_{i_{1}:i_{2}},r^{(y)}_{1}),...,(y_{i_{n-1}:},r^{(y)}_{n-1}))$ of pairs of process steps (sub-trajectories) $y_{i_{k}:i_{k+1}}$ and step-level rewards $r^{(y)}_{k}$.
While PRMs are typically contrasted with outcome reward models (ORMs), which assign a single reward to the entire trajectory, under the above definition an ORM $f_{\phi^{\prime}}^{\prime}$ is simply a trivial PRM: i.e. $f_{\phi^{\prime}}^{\prime}(y)=((y,r^{(y)}_{0}))$.
Both the division of the trajectory $y$ into steps and the assignment of rewards to those steps depend on the PRM in question. When trajectories are clearly delineated into individual steps, e.g. via ReACT-style prompting (Yao et al., 2023) or by instructing the model to divide its reasoning into demarcated steps, the PRM can simply be directed to assign a reward to each pre-defined step (e.g. Li & Li, 2025). In other cases, trajectories are heuristically split into steps, for example at high-entropy tokens (e.g. Hou et al., 2025).
Although the assignment of step-level reward can be performed by a model with learned parameters $\phi$ (e.g. Uesato et al., 2022), Kazemnejad et al. (2025) and Hou et al. (2025) combine Monte Carlo estimation with outcome-level rewards to yield heuristic PRMs that avoid the labor-intensive annotation required by their learned counterparts and are less susceptible to reward hacking. In cases such as these, in which the PRM $f_{\phi}$ is not learned, we simply consider $\phi$ to be fixed/trivial.
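The definitions above can be made concrete in a few lines. This is an illustrative sketch: the delimiter-based splitter is a hypothetical heuristic in the spirit of the pre-delimited-step PRMs cited above, not a method proposed in this paper.

```python
def orm_as_prm(y, r):
    """An ORM viewed as a trivial PRM (Sec. 2.2): the whole trajectory
    is a single process step carrying the outcome reward."""
    return [(y, r)]

def marker_prm(y, step_rewards, marker="\n\n"):
    """A non-trivial PRM over pre-delimited steps (hypothetical heuristic):
    split the trajectory at a marker and pair each resulting step with a
    supplied step-level reward."""
    steps = y.split(marker)
    if len(steps) != len(step_rewards):
        raise ValueError("one reward per step is required")
    return list(zip(steps, step_rewards))
```

Both functions return the same `(step, reward)` pair sequence demanded by the formal definition; the ORM case simply has length one.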
3 The PRM in the Machine
Figure 1: Toy example of a group ${\mathbb{G}}=\{g^{(1)},...,g^{(6)}\}$ (left) and its corresponding ${\mathcal{B}}({\mathbb{G}})$ tree (right). Tokens (boxes) are numbered for readability: subscripted numbers within boxes only indicate position. Each process set (node in the ${\mathcal{B}}({\mathbb{G}})$ tree) is a set of trajectories that share a common prefix, and corresponds to a process step (sub-trajectory) spanning those shared tokens: in this figure, colored nodes in ${\mathcal{B}}({\mathbb{G}})$ correspond to those subsequences in ${\mathbb{G}}$ that span tokens/boxes of the same color. Same-colored boxes in ${\mathbb{G}}$ indicate identical sequences of tokens across trajectories only: for example, $g^{(3)}_{:4}=g^{(4)}_{:4}$ and $g^{(4)}_{4:6}=g^{(5)}_{4:6}$, but it is not necessarily the case that e.g. $g^{(3)}_{0}=g^{(3)}_{1}$. GRPO implicitly assigns a step-level reward and advantage to the tokens of each process step, which are computed as functions of the mean outcome-level reward of each trajectory in the corresponding process set.
In Section 3.1, we prove that GRPO theoretically induces a PRM (given the assumptions of Section 2.1) as defined in Section 2.2. However, this PRM is only non-trivial (i.e. not equivalent to an ORM) if subsets of trajectories within each group share identical initial sub-trajectories.
In Section 3.2, we empirically demonstrate that such rich, overlapping prefix structures arise very frequently under real-world conditions: this shows that GRPO is "secretly" a non-trivial PRM.
3.1 Theoretical Analysis
Let ${\mathcal{B}}({\mathbb{G}})=\{\lambda\subseteq{\mathbb{G}}\mid\exists n\geq 0\ \forall g^{(i)},g^{(k)}\in\lambda\colon g^{(i)}_{:n}=g^{(k)}_{:n}\}$ be the set of all process sets: sets $\lambda\subseteq{\mathbb{G}}$ of completions such that all $g^{(i)}\in\lambda$ are identical up to the $n^{th}$ token, for some $n\geq 0$ (see Figure 1). Note that there is a natural tree structure on ${\mathcal{B}}({\mathbb{G}})$, which is induced by the $\subseteq$ relation.
Each $\lambda\in{\mathcal{B}}({\mathbb{G}})$ defines a process step within each $g^{(i)}\in\lambda$, spanning the subsequence $g^{(i)}_{s(\lambda):e(\lambda)}$ from the $s(\lambda)^{th}$ to the $e(\lambda)^{th}$ tokens of $g^{(i)}$. The endpoint $e(\lambda)$ is defined as the largest $n$ such that $g^{(i)}_{:n}=g^{(k)}_{:n}$ for all $g^{(i)},g^{(k)}\in\lambda$, and $s(\lambda)$ is defined as the endpoint of the immediate parent $Pa_{{\mathcal{B}}({\mathbb{G}})}(\lambda)$ of $\lambda$ in the tree structure induced on ${\mathcal{B}}({\mathbb{G}})$.
$$
\begin{split}&e(\lambda)=\textit{max}\{n\geq 0\hskip 2.84526pt|\hskip 2.84526pt\forall g^{(i)},g^{(k)}\in\lambda:g^{(i)}_{:n}=g^{(k)}_{:n}\}\\
&s(\lambda)=\begin{cases}0&\text{if }\lambda=\textit{root}({\mathcal{B}}({\mathbb{G}}))\\
e(Pa_{{\mathcal{B}}({\mathbb{G}})}(\lambda))&\text{otherwise}\end{cases}\end{split}
$$
For example, $s(\{g^{(4)},g^{(5)}\})=e(\{g^{(3)},g^{(4)},g^{(5)}\})=4$ in Figure 1, and $e(\{g^{(4)},g^{(5)}\})=6$: the process step corresponding to $\{g^{(4)},g^{(5)}\}$ spans $g^{(4)}_{4:6}$ and $g^{(5)}_{4:6}$.
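To make the construction concrete, the tree ${\mathcal{B}}({\mathbb{G}})$ together with the spans $(s(\lambda),e(\lambda))$ can be enumerated with a short recursion. This is an illustrative sketch over hashable token sequences (not the authors' code); it assumes no completion is a strict prefix of another, which holds in practice when every completion ends with an EOS token.

```python
from collections import defaultdict

def process_sets(group):
    """Enumerate B(G): every process set with its span, as a list of
    (frozenset of trajectory indices, s, e). `group` is a list of token
    sequences. Assumes no completion is a strict prefix of another."""
    out = []

    def recurse(indices, start):
        # e(lambda): extend while every member agrees on the next token
        e = start
        while (all(e < len(group[i]) for i in indices)
               and len({group[i][e] for i in indices}) == 1):
            e += 1
        out.append((frozenset(indices), start, e))
        if len(indices) == 1:
            return  # terminal node {g^(i)}
        # children: partition the members by their (differing) token at e
        children = defaultdict(list)
        for i in indices:
            if e < len(group[i]):  # guards against duplicate completions
                children[group[i][e]].append(i)
        for child in children.values():
            recurse(child, e)

    recurse(list(range(len(group))), 0)
    return out
```

On the toy group `[(1, 2, 3), (1, 2, 4), (5,)]`, this yields the root with span $(0,0)$, the process set $\{0,1\}$ with span $(0,2)$, and the three singleton leaves.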
For each $g^{(i)}\in{\mathbb{G}}$ and each $0\leq t<\textit{len}(g^{(i)})$, let $\lambda^{(i,t)}\in{\mathcal{B}}({\mathbb{G}})$ denote the unique process set such that $g^{(i)}\in\lambda^{(i,t)}$ and $s(\lambda^{(i,t)})\leq t<e(\lambda^{(i,t)})$. In other words, $\lambda^{(i,t)}$ is the process set whose process step contains the token $g^{(i)}_{t}$. In Figure 1, $\lambda^{(i,t)}$ corresponds to the set whose color matches that of $g^{(i)}_{t}$: $\lambda^{(1,0)}=\{g^{(1)},g^{(2)}\}$, $\lambda^{(1,3)}=\{g^{(1)}\}$, $\lambda^{(5,5)}=\{g^{(4)},g^{(5)}\}$, etc.
Now, for each process step defined by some $\lambda\in{\mathcal{B}}({\mathbb{G}})$, we define the step-level reward $\hat{R}(\lambda)$ via Monte Carlo estimation (Equation 3): $\hat{R}(\lambda)$ is the mean of the outcome-level rewards of the trajectories in $\lambda$. Equivalently, $\hat{R}(\lambda)$ is the mean reward over the leaf nodes dominated by $\lambda$ in the tree structure induced on ${\mathcal{B}}({\mathbb{G}})$, i.e. over the sampled completions to the process step defined by $\lambda$.
$$
\hat{R}(\lambda)=\frac{\sum_{g^{(i)}\in\lambda}{r}_{i}}{|\lambda|} \tag{3}
$$
For each trajectory $g^{(i)}\in{\mathbb{G}}$ and each $0\leq t<\textit{len}(g^{(i)})$, define the reward ${R}_{i,t}$ for the token $g^{(i)}_{t}$ as the reward of the process step to which $g^{(i)}_{t}$ belongs: ${R}_{i,t}=\hat{R}(\lambda^{(i,t)})$. For example, in Figure 1, the step-level reward for the sub-trajectories $g^{(3)}_{:4}$, $g^{(4)}_{:4}$, and $g^{(5)}_{:4}$ is the mean of the outcome-level rewards for $g^{(3)}$, $g^{(4)}$, and $g^{(5)}$: ${R}_{3,0}=...={R}_{5,3}=\textit{mean}(\{{r}_{3},{r}_{4},{r}_{5}\})$.
By the definition given in Section 2.2, ${R}_{i,t}$ and ${\mathcal{B}}({\mathbb{G}})$ clearly define a PRM: each $g^{(i)}\in{\mathbb{G}}$ is mapped to the sequence $((g^{(i)}_{:s(\lambda_{1})},{R}_{i,0}),(g^{(i)}_{s(\lambda_{1}):s(\lambda_{2})},{R}_{i,s(\lambda_{1})}),...,(g^{(i)}_{s(\lambda_{n}):},{R}_{i,s(\lambda_{n})}))$, where $(\lambda_{0}={\mathbb{G}})\to...\to(\lambda_{n}=\{g^{(i)}\})$ is the unique path in the tree structure induced on ${\mathcal{B}}({\mathbb{G}})$ from the root ${\mathbb{G}}$ to the node $\{g^{(i)}\}$.
Now, define the step-level advantage ${A}_{i,t}$ for the token $g^{(i)}_{t}$ in a manner analogous to the original GRPO definition in Equation 1, i.e. as the normalized difference between the step-level reward ${R}_{i,t}$ for $g^{(i)}_{t}$ and the mean reward of ${\mathbb{G}}$: ${A}_{i,t}=({R}_{i,t}-r_{\textit{mean}}({\mathbb{G}}))/r_{\textit{std}}({\mathbb{G}})$.
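Equation 3 and the step-level advantage just defined reduce to a few lines (an illustrative sketch; whether the group standard deviation is the population or sample estimate varies by implementation):

```python
import statistics

def step_reward(lam, rewards):
    """R_hat(lambda) of Eq. 3: the mean outcome reward over the process
    set `lam`, given as a collection of trajectory indices."""
    return sum(rewards[i] for i in lam) / len(lam)

def step_advantage(lam, rewards):
    """A_{i,t} for every token in the process step of `lam`: the step
    reward, normalized against the whole group exactly as in Eq. 1."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return (step_reward(lam, rewards) - mean) / std
```

For instance, a prefix shared by one reward-1 and one reward-0 completion in the group `[1, 0, 0, 1]` gets step reward 0.5, which equals the group mean, so its step-level advantage is zero.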
Replacing the term ${a}_{i}$ with ${A}_{i,t}$ in Equation 2a yields a PRM-aware RL objective (Equation 4).
$$
L_{\text{PRM}}({\mathbb{G}})=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{A}_{i,t})-{D}_{i,t} \tag{4}
$$
We now show that the standard GRPO objective of Equation 2a with outcome-level rewards ($L_{\text{GRPO}}$) is equivalent to the PRM-aware objective defined in Equations 3-4 ($L_{\text{PRM}}$).
**Theorem 1**
*For any query $q$, policy $\pi_{\theta}$, and group ${\mathbb{G}}\sim\pi_{\theta}(\cdot\mid q)$ with outcome-level rewards $\{r_{i}\}_{g^{(i)}\in{\mathbb{G}}}$: $L_{\text{GRPO}}({\mathbb{G}})=L_{\text{PRM}}({\mathbb{G}})$.*
*Proof.*
Let $\lambda$ be any process set in ${\mathcal{B}}({\mathbb{G}})$, and let $t$ be any integer such that $s(\lambda)\leq t<e(\lambda)$. We first prove that the sum of the PRM loss terms ${P}_{i,t}\cdot{A}_{i,t}-{D}_{i,t}$ over the trajectories $g^{(i)}\in\lambda$ at the token index $t$ is equal to the sum of the standard GRPO loss terms ${P}_{i,t}\cdot{a}_{i}-{D}_{i,t}$ (Equation 5). Recall that for any $g^{(i)},g^{(k)}\in\lambda$, $g^{(i)}_{:t+1}=g^{(k)}_{:t+1}$ by definition: therefore, ${P}_{i,t}={P}_{k,t}$ and ${D}_{i,t}={D}_{k,t}$ (Equations 2b-2c). Again by definition, $g^{(i)},g^{(k)}\in\lambda$ implies that ${R}_{i,t}={R}_{k,t}=\hat{R}(\lambda)$ (Equation 3), and so ${A}_{i,t}={A}_{k,t}$. As such, we may define $\hat{P}_{t}(\lambda)={P}_{k,t}$, $\hat{D}_{t}(\lambda)={D}_{k,t}$, and $\hat{A}(\lambda)={A}_{k,t}$ in Equation 5, choosing any arbitrary $g^{(k)}\in\lambda$. $$
\begin{split}&\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{A}_{i,t})-{D}_{i,t}=|\lambda|\cdot((\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda))-\hat{D}_{t}(\lambda))\\
&=|\lambda|\cdot\left(\left(\hat{P}_{t}(\lambda)\cdot\frac{\hat{R}(\lambda)-r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-\hat{D}_{t}(\lambda)\right)=|\lambda|\cdot\left(\left(\hat{P}_{t}(\lambda)\cdot\frac{\sum_{g^{(i)}\in\lambda}\frac{{r}_{i}}{|\lambda|}-r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-\hat{D}_{t}(\lambda)\right)\\
&=\left(\hat{P}_{t}(\lambda)\frac{|\lambda|(\sum_{g^{(i)}\in\lambda}\frac{{r}_{i}}{|\lambda|}-r_{\textit{mean}}({\mathbb{G}}))}{r_{\textit{std}}({\mathbb{G}})}\right)-|\lambda|\hat{D}_{t}(\lambda)=\left(\hat{P}_{t}(\lambda)\frac{\sum_{g^{(i)}\in\lambda}{r}_{i}-\sum_{g^{(i)}\in\lambda}r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-|\lambda|\hat{D}_{t}(\lambda)\\
&=\left(\sum_{g^{(i)}\in\lambda}{P}_{i,t}\frac{{r}_{i}-r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-\sum_{g^{(i)}\in\lambda}{D}_{i,t}=\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}\end{split} \tag{5}
$$ Now, letting $t_{\textit{max}}=\max_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})$, for each $0\leq t<t_{\textit{max}}$ we can define a partition ${\mathbb{X}}_{t}\subseteq{\mathcal{B}}({\mathbb{G}})$ of $\{g^{(i)}\in{\mathbb{G}}\mid\textit{len}(g^{(i)})>t\}$ such that ${\mathbb{X}}_{t}=\{\lambda\in{\mathcal{B}}({\mathbb{G}})\mid s(\lambda)\leq t<e(\lambda)\}$ is the set of all process sets corresponding to a token span containing the index $t$. The GRPO loss term $L_{\text{GRPO}}({\mathbb{G}})$ (Equation 2a) can be equivalently expressed as in Equation 6 (and analogously for $L_{\text{PRM}}({\mathbb{G}})$ of Equation 4). $$
L_{\text{GRPO}}({\mathbb{G}})=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t} \tag{6}
$$ We then have the following equalities by Equations 5 and 6: $$
\begin{split}L_{\text{GRPO}}({\mathbb{G}})&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}=\\
\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}&\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{A}_{i,t})-{D}_{i,t}\\
&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{A}_{i,t})-{D}_{i,t}=L_{\text{PRM}}({\mathbb{G}})\end{split}
$$
$\square$
In other words, the standard GRPO objective defined in Equation 2a automatically induces the PRM and PRM-aware objective defined in Equations 3 - 4.
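Theorem 1 can be checked numerically on a toy group. This is an illustrative sketch with $P_{i,t}=1$ and $\beta=0$ (i.e. the ratio and KL terms dropped), exploiting the fact that the members of $\lambda^{(i,t)}$ are exactly the trajectories sharing $g^{(i)}$'s first $t+1$ tokens:

```python
import statistics

group = [(1, 2, 3), (1, 2, 4), (5, 6)]     # toy token sequences
rewards = [1.0, 0.0, 1.0]                   # outcome-level rewards

mean = statistics.mean(rewards)
std = statistics.pstdev(rewards)
adv = [(r - mean) / std for r in rewards]   # outcome advantages (Eq. 1)

def step_adv(i, t):
    """A_{i,t}: mean reward over lambda^(i,t) (Eq. 3), normalized as in Eq. 1.
    lambda^(i,t) = trajectories sharing g^(i)'s first t+1 tokens."""
    lam = [k for k in range(len(group)) if group[k][:t + 1] == group[i][:t + 1]]
    r_hat = sum(rewards[k] for k in lam) / len(lam)
    return (r_hat - mean) / std

n_tok = sum(len(g) for g in group)
L_grpo = sum(adv[i] for i, g in enumerate(group) for _ in g) / n_tok
L_prm = sum(step_adv(i, t) for i, g in enumerate(group) for t in range(len(g))) / n_tok
assert abs(L_grpo - L_prm) < 1e-9           # Theorem 1 holds on this group
```

Token for token the two objectives differ (the shared prefix of the first two completions receives the averaged advantage), but the group-level sums coincide exactly, as the theorem states.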
3.2 Empirical Analysis
The theoretical analysis in Section 3.1 shows that the GRPO objective induces a PRM: it remains to be shown, however, that this induced PRM is non-trivial. We refer to the set ${\mathcal{B}}({\mathbb{G}})$ of process sets as trivial if it contains only singleton sets (and, by definition, ${\mathbb{G}}$ itself, with $s({\mathbb{G}})=e({\mathbb{G}})=0$), i.e. if ${\mathcal{B}}({\mathbb{G}})=\{{\mathbb{G}}\}\cup\{\{g^{(i)}\}\mid g^{(i)}\in{\mathbb{G}}\}$.
If ${\mathcal{B}}({\mathbb{G}})$ is trivial, then there are no meaningful process steps within each trajectory, and only the outcome reward has an effect on learning. On the other hand, if ${\mathcal{B}}({\mathbb{G}})$ is not trivial, then Theorem 1 entails that the standard GRPO objective of Equation 2a induces a non-trivial PRM.
To analyze the complexity of real-world, GRPO-derived step-level rewards, we empirically evaluated ${\mathcal{B}}({\mathbb{G}})$ structures generated during standard GRPO training: for each group ${\mathbb{G}}$ , we computed its ${\mathcal{B}}({\mathbb{G}})$ tree, and measured the number of intermediate nodes between the root ${\mathbb{G}}$ and each terminal node $\{g^{(i)}\}$ (path depth), as a proxy for the complexity of the ${\mathcal{B}}({\mathbb{G}})$ structure.
In addition, for each completion $g^{(i)}\in{\mathbb{G}}$, we counted the number of tokens $n^{(i)}_{\textit{term}}=e(\{g^{(i)}\})-s(\{g^{(i)}\})$ contained in the process step corresponding to the terminal node $\{g^{(i)}\}$, i.e. the number of tokens unique to $g^{(i)}$, and calculated the intermediate proportion ${p}_{i}$ of $g^{(i)}$: ${p}_{i}=(\textit{len}(g^{(i)})-n^{(i)}_{\textit{term}})/\textit{len}(g^{(i)})$. Higher values of ${p}_{i}$ indicate that a greater proportion of the trajectory $g^{(i)}$ belongs to intermediate process steps and is therefore assigned non-trivial step-level reward.
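The intermediate proportion can be computed directly from pairwise longest-common-prefix lengths, since the tokens unique to $g^{(i)}$ begin right after the longest prefix it shares with any other completion in the group (illustrative sketch; assumes no completion duplicates or is a strict prefix of another):

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def intermediate_proportions(group):
    """p_i for each completion: s({g^(i)}) equals the longest prefix g^(i)
    shares with any other completion, so p_i = shared / len(g^(i))."""
    props = []
    for i, g in enumerate(group):
        shared = max(lcp_len(g, group[k]) for k in range(len(group)) if k != i)
        props.append(shared / len(g))
    return props
```

A trivial ${\mathcal{B}}({\mathbb{G}})$ corresponds to every $p_i = 0$: no completion shares a prefix with any other.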
Figure 2: Validation reward (exact-match accuracy; left), ${\mathcal{B}}({\mathbb{G}})$ root-to-terminal path depth (center), and proportions of trajectories spanned by intermediate (non-terminal) process steps (right) for GRPO runs with group sizes of 36 (top) and 6 (bottom).
Experimental Setup.
We trained two DeepSeek-R1-Distill-Qwen-1.5B models (DeepSeek-AI, 2025) on the OpenRS (Dang & Ngo, 2025) dataset using the standard GRPO algorithm and objective of Equation 2. We selected 125 OpenRS examples at random to serve as a validation set.
The first model was trained for 1675 steps with a group size of six and a learning rate of $6\times 10^{-6}$. The second was trained with a group size of 36 and a learning rate of $10^{-6}$ for 275 steps (fewer steps, due to the larger group size). Both models were trained with a maximum new token limit of 4096, a batch size of four, and a temperature of 0.75. Additional training details are located in Appendix B.
Results.
Figure 2 shows that both path depth and intermediate proportion increase drastically as validation reward saturates, for group sizes of six and 36. These results are supported by Yu et al. (2025), who find that entropy decreases sharply as GRPO training progresses: this indicates that increasingly rich PRM-inducing structures arise as the model converges on a locally optimal policy.
In addition, we found that only twelve of 6,700 ${\mathcal{B}}({\mathbb{G}})$ structures were trivial with a group size of six ($\sim$0.2%). With a group size of 36, zero trivial ${\mathcal{B}}({\mathbb{G}})$ structures arose out of the 1,100 generated groups. Examples of non-trivial ${\mathcal{B}}({\mathbb{G}})$ structures from this experiment are given in Appendix C.
In conjunction with the theoretical analysis of Section 3.1, the results of this experiment demonstrate that GRPO induces a non-trivial PRM under real-world conditions. In Section 4, we show that this induced PRM carries a serious flaw that is detrimental to RL training, and propose a minor modification to the GRPO algorithm to mitigate this shortcoming.
4 Proposed Approach: $\lambda$ -GRPO
Viewing the GRPO objective in terms of process set partitions ${\mathbb{X}}_{t}$ (see Equation 6), we note that the contribution of each trajectory $g^{(i)}\in{\mathbb{G}}$ to the loss at index $t$ is identical to that of all other trajectories in its process set $\lambda^{(i,t)}$ (where $\hat{P}_{t}(\lambda)$ , $\hat{D}_{t}(\lambda)$ , and $\hat{A}(\lambda)$ are defined as in Equation 5):
$$
\begin{split}\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}\\
&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}|\lambda|\cdot((\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda))-\hat{D}_{t}(\lambda))\end{split} \tag{7}
$$
The contribution of each process set $\lambda\in{\mathcal{B}}({\mathbb{G}})$ to the overall loss, $\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda)-\hat{D}_{t}(\lambda)$ , is scaled by $|\lambda|$ : this carries the danger of harming exploration (for $\hat{A}(\lambda)>0$ ) and exploitation (for $\hat{A}(\lambda)<0$ ). Consider some process set $\lambda$ with $|\lambda|\gg 1$ . If $\hat{A}(\lambda)>0$ , then the increase in probability assigned to the process step corresponding to $\lambda$ by $\pi_{\theta}$ under GRPO is compounded by a factor of $|\lambda|$ , decreasing the likelihood of exploring process steps that are dissimilar from $\lambda$ in subsequent training episodes.
Conversely, if $\hat{A}(\lambda)<0$ , then the decrease in probability assigned to $\lambda$ under GRPO is compounded by a factor of $|\lambda|$ , decreasing the likelihood of exploiting high-reward trajectories in $\lambda$ . To illustrate, consider the group ${\mathbb{G}}$ in Figure 1: assume ${r}_{1}={r}_{2}={r}_{6}=0.5$ , ${r}_{4}={r}_{5}=0$ , and ${r}_{3}=1$ , and let $\lambda=\{g^{(3)},g^{(4)},g^{(5)}\}$ . Then $\hat{A}(\lambda)=-0.22$ : despite the fact that $g^{(3)}$ has the highest reward, the probability of the sub-trajectory $g^{(3)}_{:4}$ is decreased under the GRPO objective, thereby decreasing the overall likelihood of generating the completion $g^{(3)}$ . The term $|\lambda|$ in Equation 7 then scales this decrease in probability by a factor of three.
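The value $\hat{A}(\lambda)=-0.22$ in this example can be reproduced directly. This is a sketch; we assume the group-normalized advantage $a_i = (r_i - \mu_{\mathbb{G}})/\sigma_{\mathbb{G}}$ with the *sample* standard deviation, which matches the reported figure:

```python
import statistics

# Rewards for the six completions in the running example:
# r1 = r2 = r6 = 0.5, r4 = r5 = 0, r3 = 1.
r = {1: 0.5, 2: 0.5, 3: 1.0, 4: 0.0, 5: 0.0, 6: 0.5}

mu = statistics.mean(r.values())
sigma = statistics.stdev(r.values())  # sample standard deviation

# Group-normalized advantage per trajectory: a_i = (r_i - mu) / sigma
A = {i: (r_i - mu) / sigma for i, r_i in r.items()}

# Mean advantage of the process set lambda = {g^(3), g^(4), g^(5)}
A_lambda = (A[3] + A[4] + A[5]) / 3
print(round(A_lambda, 2))  # -0.22
```

Even though $g^{(3)}$ has a strongly positive advantage on its own, the two zero-reward members of the process set drag the set's mean advantage below zero.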
We propose scaling the token-level loss for $g^{(i)}_{t}$ by $|\lambda^{(i,t)}|^{-1}$ ( $\lambda$ -GRPO; Equation 8): this has the effect of canceling out the term $|\lambda|$ in Equation 7, so that each process set contributes equally to the loss at index $t$ .
$$
\begin{split}L_{\lambda\text{-GRPO}}({\mathbb{G}})&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}\frac{({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}}{|\lambda^{(i,t)}|}\\
&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}((\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda))-\hat{D}_{t}(\lambda))\end{split} \tag{8}
$$
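This reweighting can be sketched as follows, assuming process sets $\lambda^{(i,t)}$ are detected as shared token prefixes within the group and that `token_losses` holds the per-token loss terms; the helper names are ours, not the released implementation.

```python
from collections import defaultdict

def lambda_weights(group):
    """For each trajectory i and index t, compute 1 / |lambda^{(i,t)}|: the
    reciprocal of the number of group members sharing g^{(i)}'s first t+1
    tokens. `group` is a list of token-id tuples."""
    weights = [[0.0] * len(g) for g in group]
    t_max = max(len(g) for g in group)
    for t in range(t_max):
        counts = defaultdict(int)
        for g in group:
            if len(g) > t:
                counts[tuple(g[: t + 1])] += 1
        for i, g in enumerate(group):
            if len(g) > t:
                weights[i][t] = 1.0 / counts[tuple(g[: t + 1])]
    return weights

def lambda_grpo_loss(token_losses, group):
    """Scale each token-level loss term by 1 / |lambda^{(i,t)}| and average
    over the total token count, as in Equation 8."""
    w = lambda_weights(group)
    total_tokens = sum(len(g) for g in group)
    return sum(
        w[i][t] * token_losses[i][t]
        for i, g in enumerate(group)
        for t in range(len(g))
    ) / total_tokens
```

Because the weights only require counting prefix matches within an already-generated group, this adds negligible cost on top of the standard GRPO loss computation.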
5 Experiments
To evaluate our proposed approach, we trained DeepSeek-R1-Distill-Qwen-1.5B and Llama-3.2-1B-Instruct (https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) with the $\lambda$ -GRPO objective (Equation 8) on the OpenRS dataset of Section 3.2, and compared them to standard GRPO (Equation 2a) models trained with an identical setup. All models were evaluated on an OpenRS validation set and five downstream reasoning benchmarks (see Section 5.1).
5.1 Experimental Setup
All models were trained for 1,000 steps with a group size of six, a batch size of four, a maximum of 4,096 new tokens, and a temperature of 0.75. We conducted two sets of trials across the two models (for a total of four trials): in the first, we set the KL coefficient $\beta=0.0$ , and in the second $\beta=0.04$ . The Qwen models were trained with a learning rate of $10^{-6}$ ; the Llama models were trained with a learning rate of $5\times 10^{-7}$ for the $\beta=0.0$ trial and $10^{-7}$ for $\beta=0.04$ (as training was highly unstable with higher learning rates for Llama). Additional training details are located in Appendix B.
We evaluated the models on the AIME24 (https://huggingface.co/datasets/AI-MO/aimo-validation-aime), MATH-500 (Hendrycks et al., 2021; Lightman et al., 2024), AMC23 (https://huggingface.co/datasets/AI-MO/aimo-validation-amc), Minerva (Lewkowycz et al., 2022), and OlympiadBench (He et al., 2024) benchmarks, using the LightEval framework (Habib et al., 2023; Dang & Ngo, 2025). All models were evaluated at the checkpoint at which they achieved maximum validation accuracy. As in the experiment in Section 3.2, we withheld 125 examples as a validation set.
5.2 Results and Discussion
<details>
<summary>x3.png Details</summary>

### Visual Description
Four line graphs plot validation reward against training step (0–875) for GRPO (cyan) and $\lambda$ -GRPO (purple), one panel per configuration: Qwen ( $\beta=0.0$ ), Qwen ( $\beta=0.04$ ), Llama ( $\beta=0.0$ ), and Llama ( $\beta=0.04$ ). Vertical dashed lines mark each run's peak. In every panel, $\lambda$ -GRPO reaches a higher peak validation reward in fewer steps than GRPO: Qwen ( $\beta=0.0$ ) peaks at 0.429 (step 325); Qwen ( $\beta=0.04$ ) at 0.381 (step 400) vs. GRPO's 0.373 (step 600); Llama ( $\beta=0.0$ ) at 0.19 (step 150) vs. 0.159 (step 350); and Llama ( $\beta=0.04$ ) at 0.143 (step 325) vs. 0.135 (step 700). Both methods fluctuate after their initial peaks, and the Qwen runs attain higher reward overall than the Llama runs.
</details>
Figure 3: Models' validation accuracy across training steps. Peak accuracy is highlighted by vertical, dashed lines.
| Model | $\beta$ | Version | AIME24 | MATH-500 | AMC23 | Minerva | OB | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen | – | Base | 0.2000 ±0.0743 | 0.8300 ±0.0168 | 0.7500 ±0.0693 | 0.2978 ±0.0278 | 0.5096 ±0.0193 | 0.5175 |
| | 0.0 | GRPO | 0.3333 ±0.0875 | 0.7660 ±0.0190 | 0.6250 ±0.0775 | 0.2500 ±0.0263 | 0.4444 ±0.0191 | 0.4837 |
| | | $\lambda$ -GRPO (ours) | 0.3667 ±0.0895 | 0.8460 ±0.0162 | 0.7500 ±0.0693 | 0.2904 ±0.0276 | 0.5348 ±0.0192 | 0.5576 |
| | 0.04 | GRPO | 0.3000 ±0.0851 | 0.8660 ±0.0152 | 0.7500 ±0.0693 | 0.2610 ±0.0267 | 0.5200 ±0.0192 | 0.5394 |
| | | $\lambda$ -GRPO (ours) | 0.4000 ±0.0910 | 0.8340 ±0.0167 | 0.8000 ±0.0641 | 0.2978 ±0.0278 | 0.5378 ±0.0192 | 0.5739 |
| Llama | – | Base | 0.0000 ±0.0000 | 0.2280 ±0.0188 | 0.0750 ±0.0422 | 0.0478 ±0.0130 | 0.0563 ±0.0089 | 0.0814 |
| | 0.0 | GRPO | 0.0000 ±0.0000 | 0.2300 ±0.0188 | 0.0750 ±0.0422 | 0.0551 ±0.0130 | 0.0607 ±0.0089 | 0.0842 |
| | | $\lambda$ -GRPO (ours) | 0.0000 ±0.0000 | 0.2620 ±0.0197 | 0.1250 ±0.0530 | 0.0515 ±0.0134 | 0.0622 ±0.0092 | 0.1001 |
| | 0.04 | GRPO | 0.0000 ±0.0000 | 0.2180 ±0.0185 | 0.1750 ±0.0608 | 0.0515 ±0.0134 | 0.0533 ±0.0087 | 0.0996 |
| | | $\lambda$ -GRPO (ours) | 0.0333 ±0.0333 | 0.2560 ±0.0195 | 0.0750 ±0.0422 | 0.0735 ±0.0159 | 0.0489 ±0.0083 | 0.0973 |
Table 1: Exact-match accuracy for the base and GRPO-/ $\lambda$ -GRPO-trained Llama and Qwen models on downstream reasoning datasets ( $\text{OB}=\text{OlympiadBench}$ ). The best results in each column are indicated in bold, and the best results within each model type (i.e. Llama or Qwen) are underlined. Confidence intervals are subscripted. For each $\lambda$ -GRPO-trained model, results are given in green if it outperforms its GRPO-trained counterpart and the base model; yellow if it outperforms only its GRPO-trained counterpart; orange if it only improves over the base model; and red if it fails to outperform either model (see Table 2 in the Appendix for exact differences).
All four $\lambda$ -GRPO models reach higher validation accuracy in fewer steps than their GRPO counterparts (see Figure 3): on average, $\lambda$ -GRPO yields a more than 10% increase over standard GRPO validation accuracy, in less than half the number of training steps.
This increase in validation accuracy corresponds to improved performance on downstream reasoning tasks (see Table 1). In total, the $\lambda$ -GRPO models outperform standard GRPO on 15/20 cells (excluding average performance) in Table 1, and they improve over the base Llama/Qwen models on 14/20 cells. Only the Llama $\lambda$ -GRPO model with $\beta=0.04$ failed to outperform its GRPO counterpart on average downstream performance; even this model outperformed standard GRPO on a majority (3/5) of the tasks.
In addition, these performance gains come with effectively zero training slowdown: as we merely detect ${\mathcal{B}}({\mathbb{G}})$ structures that occur during training, rather than generating them (e.g. Kazemnejad et al., 2025; Hou et al., 2025; Yang et al., 2025a), the added computational cost of $\lambda$ -GRPO over standard GRPO is negligible.
6 Related Work
Monte Carlo Sampling for Finer-Grained Rewards.
As discussed in Section 2.2, the labor-intensive human annotation required to obtain step-level rewards for PRM training has driven the development of heuristic methods for PRMs and PRM-like finer-grained reward signals, in particular those based on Monte Carlo estimation. Kazemnejad et al. (2025) replace the critic model in PPO with Monte Carlo estimation: multiple completions are sampled for each step, and the value for that step is derived from their mean outcome reward. On the other hand, Wang et al. (2024) train a neural PRM, using Monte Carlo estimation in a similar manner to Kazemnejad et al. (2025) in order to obtain step-level training rewards, and thereby avoid the need for costly human annotation.
Xie et al. (2024) generate step-level preference data for Direct Preference Optimization (DPO; Rafailov et al., 2023) training via Monte Carlo Tree Search and outcome-level rewards: sequences of overlapping trajectories are obtained by forking trajectories at defined split points to construct tree structures similar to the ${\mathcal{B}}({\mathbb{G}})$ trees introduced in Section 3.1. The daughter nodes with the highest and lowest mean reward are then selected as the preferred and dispreferred (respectively) sequences for step-level DPO. Similarly, Hou et al. (2025) construct ${\mathcal{B}}({\mathbb{G}})$ -like trees by splitting generated trajectories at high-entropy tokens to create multiple completions of the same initial trajectory prefix. Subtrajectory rewards are then derived from the mean reward of the corresponding node's daughters in the tree structure. Yang et al. (2025a) employ an analogous approach to generate step-level rewards for GRPO training. Unlike standard GRPO, however, the advantages for each node (step) are computed relative to the rewards of the node's sisters in the tree structure, rather than the entire group.
These methods are orthogonal to our approach: they apply Monte Carlo estimation to explicitly construct step-level reward signals from outcome-level rewards, while we leverage the implicit step-level rewards already present in standard GRPO.
PRMs with GRPO.
Aside from Yang et al. (2025a), GRPO has been employed with PRMs in Shao et al. (2024) and Feng et al. (2025). Shao et al. (2024) modify the advantage computation of GRPO to account for step-level rewards: normalized rewards are computed relative to all step-level rewards of all trajectories in the group, and the advantage for each step is the sum of the normalized reward of each subsequent step in its trajectory. Feng et al. (2025) construct a two-level variant of GRPO, in which standard, trajectory-level GRPO advantage is combined with a step-level GRPO advantage, and step-level groups are dynamically computed according to the similarity of steps across trajectories.
Our results in Sections 3 and 5 call into question the necessity of adapting GRPO to step-level rewards, given the rich step-level reward signal already present in the outcome-level variant of the algorithm.
7 Conclusion
In this paper, we demonstrated both theoretically and empirically that the standard GRPO algorithm induces a PRM that derives step-level rewards via Monte Carlo estimation. We then showed that this hidden PRM is faulty, and as a result is potentially detrimental to exploration and exploitation. To mitigate this flaw, we introduced a process-step-aware scaling factor to GRPO to derive the $\lambda$ -GRPO objective. Models trained with $\lambda$ -GRPO reach higher validation accuracy faster than with standard GRPO, and achieve improved performance on downstream reasoning tasks.
Our results indicate that it is possible to leverage the existing PRM structure inherent in the outcome-based GRPO algorithm, rather than employing costly, explicitly-defined PRMs. The limitations of this work are discussed in Appendix A.
Reproducibility Statement
We provide the complete proof of Theorem 1 in the main body of the paper. All settings and hyperparameters for the experiments conducted in this work are given in Sections 3.2 / 5.1 and Appendix B. All code is available on GitHub: https://github.com/coli-saar/grpo-prm/.
References
- Cui et al. (2025) Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- Dang & Ngo (2025) Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn't. arXiv preprint arXiv:2503.16219, 2025.
- DeepSeek-AI (2025) DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Feng et al. (2025) Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025.
- Habib et al. (2023) Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. Lighteval: A lightweight framework for llm evaluation, 2023. URL https://github.com/huggingface/lighteval.
- He et al. (2024) Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3828–3850, 2024.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
- Hou et al. (2025) Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. TreeRL: LLM reinforcement learning with on-policy tree search. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12355–12369, 2025.
- Kazemnejad et al. (2025) Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs. In Forty-second International Conference on Machine Learning, 2025.
- Lewkowycz et al. (2022) Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- Li & Li (2025) Wendi Li and Yixuan Li. Process reward model with q-value rankings. In The Thirteenth International Conference on Learning Representations, 2025.
- Lightman et al. (2024) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024.
- Luo et al. (2025) Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- Qian et al. (2025) Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 53728–53741, 2023.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Setlur et al. (2025) Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. In The Thirteenth International Conference on Learning Representations, 2025.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Sullivan et al. (2025) Michael Sullivan, Mareike Hartmann, and Alexander Koller. Procedural environment generation for tool-use agents. arXiv preprint arXiv:2506.11045, 2025.
- Uesato et al. (2022) Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
- Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426â9439, 2024.
- Xie et al. (2024) Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning. In The First Workshop on System-2 Reasoning at Scale, NeurIPSâ24, 2024.
- Yang et al. (2025a) Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, and Jing Tang. Treerpo: Tree relative policy optimization. arXiv preprint arXiv:2506.05183, 2025a.
- Yang et al. (2025b) Zonglin Yang, Zhexuan Gu, Houduo Qi, and Yancheng Yuan. Accelerating rlhf training with reward variance increase. arXiv preprint arXiv:2505.23247, 2025b.
- Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
- Yu et al. (2025) Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- Zhang et al. (2025) Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025.
| Model | $\beta$ | Comparison | AIME24 | MATH-500 | AMC23 | Minerva | OB | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen | 0.0 | Base | +0.1667 | +0.0160 | 0.0000 | -0.0074 | +0.0252 | +0.0401 |
| | | GRPO | +0.0334 | +0.0800 | +0.1250 | +0.0404 | +0.0904 | +0.0738 |
| | 0.04 | Base | +0.2000 | +0.0040 | +0.0500 | 0.0000 | +0.0282 | +0.0564 |
| | | GRPO | +0.1000 | -0.0320 | +0.0500 | +0.0368 | +0.0178 | +0.0345 |
| Llama | 0.0 | Base | 0.0000 | +0.0340 | +0.0500 | +0.0037 | +0.0059 | +0.0187 |
| | | GRPO | 0.0000 | +0.0320 | +0.0500 | -0.0036 | +0.0015 | +0.0160 |
| | 0.04 | Base | +0.0333 | +0.0280 | 0.0000 | +0.0257 | -0.0074 | +0.0159 |
| | | GRPO | +0.0333 | +0.0380 | -0.1000 | +0.0220 | -0.0044 | -0.0022 |
Table 2: Difference in accuracy between the $\lambda$ -GRPO-trained models, and their corresponding base models and standard-GRPO-trained counterparts. Positive differences (i.e. $\lambda$ -GRPO outperforms the comparison model) are highlighted in green; negative differences (i.e. the comparison model outperforms $\lambda$ -GRPO) are highlighted in red. For example, the top-most entry in the AIME24 column indicates that the $\lambda$ -GRPO Qwen model with $\beta=0.0$ outperformed the base DeepSeek-R1-Distill-Qwen-1.5B by 0.1667 on the AIME24 benchmark.
Appendix A Limitations
Due to computational resource constraints, we were only able to conduct the experiments in Sections 3.2 and 5 with relatively small models: 1.5 billion (Qwen) and 1 billion (Llama) parameters. Similarly, we only use one dataset for RL training in both experiments, although OpenRS is a combination of the s1 (Muennighoff et al., 2025) and DeepScaleR (Luo et al., 2025) datasets. Future work should extend our findings regarding the non-triviality of the GRPO-induced PRM and the effectiveness of $\lambda$ -GRPO to larger models and more diverse (training) datasets.
Finally, the objective of this work is to expose the PRM induced by the GRPO algorithm, and to highlight the deficiencies of that PRM as described in Section 4. To that end, our proposed $\lambda$ -GRPO method does not actually remedy the anti-exploitation effect of the GRPO-induced PRMâit merely lessens its impact. In future work, we intend to investigate more extensive modifications to the GRPO algorithm, with the goal of entirely solving the problems laid out in Section 4.
Appendix B Experimental Setup
All experiments were conducted on a single NVIDIA H100 GPU. We trained all models with 24 gradient accumulation steps per optimizer step and a generation batch size of six. The models were evaluated on the validation split every 25 training steps.
We additionally hard-coded the generation procedure to halt once "\boxed{
}" was detected: this was to prevent the model from generating multiple boxed answers for a single prompt.
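Such a halting check can be sketched as a brace-balanced test for a closed `\boxed{
}` expression. This is a sketch only; the exact halting logic used in training may differ.

```python
def boxed_complete(text: str) -> bool:
    """Return True once the text contains a fully closed \\boxed{...}
    expression (brace-balanced), signalling that generation can halt."""
    start = text.find(r"\boxed{")
    if start == -1:
        return False
    depth = 0
    for ch in text[start + len(r"\boxed"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # the \boxed argument is closed
                return True
    return False
```

Tracking brace depth (rather than searching for the first `}`) ensures nested LaTeX such as `\boxed{\frac{1}{2}}` does not trigger a premature stop.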
Appendix C ${\mathcal{B}}({\mathbb{G}})$ Structure Examples
The following figures (Figures 4, 5, 6) show ${\mathcal{B}}({\mathbb{G}})$ structures on groups generated during the group-size-6 trial of the experiment in Section 3.2. A full trajectory is reconstructed by tracing the unique path from the root to a terminal node. The root (red) corresponds to the prompt/query $q$ . Terminal nodes (yellow) denote singleton process steps $\{g^{(i)}\}$ ; each non-terminal node $\lambda$ (white; including the root) denotes the process step corresponding to the set of all terminal nodes dominated by $\lambda$ . For the sake of presentation, overlong terminal steps are truncated with "
".
<details>
<summary>x4.png Details</summary>

### Visual Description
A ${\mathcal{B}}({\mathbb{G}})$ tree whose root contains the prompt: "Find, with proof, all pairs of positive integers $(n,d)$ with the following property: for every integer $S$ , there exists a unique non-decreasing sequence of $n$ integers $a_1, a_2, \ldots, a_n$ such that $a_1 + a_2 + \cdots + a_n = S$ and $a_n - a_1 = d$ ." Several completions diverge from the root immediately with distinct openings (e.g. "Okay, so I have this problem where I need to find all pairs of positive integers (n, d)
"), while others share short non-terminal steps such as "Alright, so I" and "have this problem" before diverging into separate terminal restatements of the problem.
</details>
Figure 4: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1 (see the beginning of Appendix C for additional details).
<details>
<summary>x5.png Details</summary>

### Visual Description
A ${\mathcal{B}}({\mathbb{G}})$ tree whose root contains the prompt: "In a bag, there are 5 balls of the same size, including 3 black balls and 2 red balls. If two balls are drawn at random, the probability of drawing two balls of the same color is ( )." with options A: $\frac{1}{5}$ , B: $\frac{3}{10}$ , C: $\frac{2}{5}$ , D: $\frac{1}{2}$ . The completions branch into two non-terminal restatements of the problem, then into terminal steps that compute $\binom{5}{2} = 10$ total draws, $\binom{3}{2} = 3$ black pairs, and $\binom{2}{2} = 1$ red pair, yielding $3 + 1 = 4$ same-color draws and the corresponding probability.
8. **Bottom-Left Node (Yellow):**
* Text: "\[\binom{5}{2} = \frac{5!}{2!(5-2)!} = \frac{5 \times 4}{2 \times 1} = 10\] Next, we calculate the number of ways to draw 2 black balls from the 3 available: \[\binom{3}{2} = ...\]"
9. **Bottom-Middle Node (Yellow):**
* Text: "\[\binom{5}{2} = \frac{5!}{2!(5-2)!} = \frac{5 \times 4}{2 \times 1} = 10\] Next, we calculate the number of ways to draw 2 black balls from the 3 black balls: ..."
10. **Bottom-Left-Most Node (Yellow):**
* Text: "ways to draw 2 same color} {\text{Total number of ways to draw 2 balls}} = \frac{4}{10} = \frac{2}{5} Thus, the probability of drawing two balls of the same color is: \[\boxed{\frac{2}{5}}\]"
11. **Bottom-Right-Most Node (Yellow):**
* Text: "favorable outcomes}} {\text{Total number of outcomes}} = \frac{4}{10} = \frac{2}{5} Thus, the probability of drawing two balls of the same color is: \[\boxed{C}\]"
### Key Observations
* The diagram uses combinations to calculate probabilities.
* It breaks down the problem into smaller, manageable steps.
* The final answer is $\frac{2}{5}$, which corresponds to option C in the initial question.
### Interpretation
The diagram provides a step-by-step guide to calculating the probability of drawing two balls of the same color from a bag. It demonstrates the application of combinatorial principles to probability problems. The diagram clearly shows how to calculate the total possible outcomes and the favorable outcomes, leading to the final probability. The boxed answer at the bottom confirms that option C is the correct answer to the initial multiple-choice question.
</details>
Figure 5: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1001 (see the beginning of Appendix C for additional details).
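As a sanity check on the arithmetic transcribed in the figure above (not part of the original diagram), the full calculation can be reproduced in a few lines:

```python
from math import comb
from fractions import Fraction

# Total ways to draw 2 balls from the 5 in the bag.
total = comb(5, 2)                     # C(5,2) = 10
# Favorable outcomes: 2 black from 3, or 2 red from 2.
same_color = comb(3, 2) + comb(2, 2)   # 3 + 1 = 4
# Probability of drawing two balls of the same color.
prob = Fraction(same_color, total)
print(prob)  # prints 2/5, matching option C
```

This confirms the boxed answer $\frac{2}{5}$ reached in the diagram's leaf nodes.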
<details>
<summary>x6.png Details</summary>

### Visual Description
## Flowchart: Probability Problem Decomposition
### Overview
The image is a flowchart outlining the decomposition of a probability problem. It starts with a general problem statement and branches into different approaches and considerations for solving it. The flowchart uses text boxes connected by arrows to represent the flow of logic and problem-solving steps.
### Components/Axes
* **Nodes:** Rectangular boxes with rounded corners containing text.
* **Edges:** Arrows indicating the flow of thought or steps.
* **Root Node (Top):** Contains the initial problem statement.
* **Intermediate Nodes:** Represent different perspectives or sub-problems.
* **Leaf Nodes (Bottom):** Represent specific approaches or considerations.
### Detailed Analysis
1. **Root Node (Top, Red):**
* Text: "Let $a$ and $b$ be given positive real numbers, with $a<b.$ If two points are selected at random from a straight line segment of length $b,$ what is the probability that the distance between them is at least $a?$".
* Color: Red.
* Shape: Rounded rectangle.
2. **Second Node (Center, White):**
* Text: "<think>\n</think>\n\nGiven two"
* Color: White.
* Shape: Rounded rectangle.
* Position: Directly below the root node, connected by a dashed arrow.
3. **Third Node (Left Branch, White):**
* Text: "points selected at random from a straight line segment of length \(b \), we need to find the probability that the distance between them is at least (a ), where \("
* Color: White.
* Shape: Rounded rectangle.
* Position: Below and to the left of the second node.
4. **Third Node (Right Branch, White):**
* Text: "positive real numbers \(a\) and \(b\) such that \(a < b\). We need to find the probability that the distance between two randomly selected points on a"
* Color: White.
* Shape: Rounded rectangle.
* Position: Below and to the right of the second node.
5. **Fourth Node (Left-Left Branch, Yellow):**
* Text: "a \) and \(b) are positive real numbers with \(a < b \).\n\nFirst, we recognize that the problem can be modeled using a uniform distribution over the interval \([0, b]) ..."
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below and to the left of the third node (left branch).
6. **Fourth Node (Left-Right Branch, Yellow):**
* Text: "0 < a < b \).\n\nFirst, we"
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below and to the right of the third node (left branch).
7. **Fourth Node (Right-Left Branch, Yellow):**
* Text: "line segment of length \ (b\) is at least \ (a\).\n\nFirst, we note that the total number of possible pairs of points is \(b^2\), since each point can be anywhere along the segment of length \ (bl)..."
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below and to the left of the third node (right branch).
8. **Fourth Node (Right-Right Branch, Yellow):**
* Text: "straight line segment of length \(b\) is at least \ (a\).\n\nFirst, we note that the total number of possible pairs of points is represented by the number of ways to choose 2 points from a segment of length \(b\)"
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below and to the right of the third node (right branch).
9. **Fifth Node (Left-Left-Down Branch, Yellow):**
* Text: "note that the total number of ways to select two points from a segment of length \(b \) is represented by the number of possible positions for these points. Since the points are selected continuously, we can model this using a unit square where each axis ..."
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below the fourth node (left-left branch).
10. **Fifth Node (Right-Left-Down Branch, Yellow):**
* Text: "recognize that the problem can be modeled using a uniform distribution over the interval \ ([0, b]\). Let the positions of the two points be (X) and \(YI), where \(X \) and \(Y \) are independent random variables uniformly distributed over \([0, b]\).\n\nThe distance between the two points is \(IX - Y 1). We need to find the probability that \(IX - Y \geq a \).\n\nTo solve this, we consider the sample space"
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below the fourth node (right-left branch).
11. **Sixth Node (Left-Left-Down-Down Branch, Yellow):**
* Text: "\([0, b] \times [0, b]\), where each point \((X, Y)\) represents a possible outcome. The area of this sample space is \(b^2 \).\n\nWe are interested in the region where \(IX - Y \geq a 1). This can be split into two regions:\n1...."
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below the fifth node (left-left-down branch).
12. **Sixth Node (Right-Left-Down-Down Branch, Yellow):**
* Text: ", which is the set of all possible pairs \((X, Y)\). This can be represented as a square in the \ (XY)-plane with side length \(b \). The area of this square is \(b^2 \).\n\nWe are interested in the region where \(X - Y \geq a \) ..."
* Color: Yellow.
* Shape: Rounded rectangle.
* Position: Below the fifth node (right-left-down branch).
### Key Observations
* The flowchart starts with a general probability problem and breaks it down into smaller, more manageable sub-problems.
* Different approaches to modeling the problem are explored, including using uniform distributions and considering the sample space.
* The flowchart highlights the importance of recognizing the constraints and conditions of the problem (e.g., \(0 < a < b\)).
### Interpretation
The flowchart represents a structured approach to solving a probability problem, showing how a complex problem can be broken down into smaller, more manageable parts. The different branches represent alternative framings of the problem and highlight the key considerations for reaching a solution. The use of uniform distributions and sample spaces indicates a quantitative, geometric approach. Overall, the flowchart is a visual record of the problem-solving process, guiding the reader through the steps needed to arrive at a solution.
</details>
Figure 6: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1673 (see the beginning of Appendix C for additional details).
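The geometric-probability reasoning transcribed in the figure above leads to the standard result \(P(|X - Y| \geq a) = \left(\frac{b-a}{b}\right)^2\): the favorable region consists of two corner triangles of the \(b \times b\) sample-space square, each with legs of length \(b - a\). A minimal Monte Carlo sketch (not from the paper; function name and parameters are illustrative) agrees with this closed form:

```python
import random

def prob_distance_at_least(a, b, trials=200_000, seed=0):
    """Monte Carlo estimate of P(|X - Y| >= a) for X, Y ~ Uniform(0, b)."""
    rng = random.Random(seed)
    hits = sum(abs(rng.uniform(0, b) - rng.uniform(0, b)) >= a
               for _ in range(trials))
    return hits / trials

a, b = 1.0, 3.0
estimate = prob_distance_at_least(a, b)
exact = ((b - a) / b) ** 2  # two corner triangles of area (b-a)^2/2 each, over b^2; = 4/9 here
print(estimate, exact)
```

For \(a = 1\), \(b = 3\) the estimate lands close to the exact value \(4/9 \approx 0.444\).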