# GRPO is Secretly a Process Reward Model
**Authors**:
- Michael Sullivan (Department of Language Science and Technology, Saarland Informatics Campus, Saarland University, SaarbrĂŒcken, Germany)
Abstract
We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ( $\lambda$ -GRPO), and show that LLMs trained with $\lambda$ -GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
1 Introduction
Process reward models (PRMs), i.e. models that assign reward to intermediate steps (see Section 2.2), allow for finer-grained reward assignment than outcome-level signals, thereby yielding improved multi-step reasoning performance (Lightman et al., 2024). PRMs are therefore particularly applicable to RL training for step-by-step processes such as mathematical reasoning.
However, training neural PRMs requires costly, step-level human annotation (Zhang et al., 2025), and such models are particularly susceptible to reward hacking (Cui et al., 2025). These shortcomings have resulted in the limited adoption of learned PRMs for RL training (Setlur et al., 2025), leading to the development of Monte-Carlo-based and other heuristic, non-neural PRMs (e.g. Wang et al., 2024; Kazemnejad et al., 2025; Hou et al., 2025).
PRMs are typically employed with RL algorithms such as Proximal Policy Optimization (PPO; Schulman et al., 2017) that employ a critic model and/or generalized advantage estimation (GAE). Although Group Relative Policy Optimization (GRPO; Shao et al., 2024) greatly simplifies and reduces the memory consumption of RL training, leading to its adoption for a wide range of applications, e.g. tool use (Qian et al., 2025; Sullivan et al., 2025), RLHF (Yang et al., 2025b), and, in particular, mathematical reasoning (Shao et al., 2024; DeepSeek-AI, 2025), it does so by eliminating the critic model and GAE of PPO (see Section 2.1). GRPO has therefore not been widely used with PRMs: to the best of our knowledge, Shao et al. (2024), Yang et al. (2025a), and Feng et al. (2025) are the only instances in which GRPO is employed with step-level rewards, and these approaches necessitate modifying the algorithm to accommodate finer-grained reward signals.
In this paper, we show that, under certain mild assumptions, GRPO induces a Monte-Carlo-based PRM (Section 3). Specifically, we prove theoretically that GRPO assigns rewards (and advantages) derived from outcome-level rewards and Monte-Carlo-sampled completions to sub-trajectories, whenever subsets of trajectories within each group share identical prefixes. We then show empirically that this identical-prefix condition is almost always met under real-world conditions, yielding rich step-level process reward structures. These two findings definitively demonstrate that the GRPO objective covertly assigns and optimizes for complex, structured step-level rewards and advantages.
An investigation of the properties of GRPO's hidden PRM reveals a defect in the objective function that hinders both exploration and exploitation (under different conditions) during RL training (Section 4): namely, a vulnerability to non-uniformly-distributed process steps within a group. To mitigate this shortcoming, we propose incorporating a PRM-aware normalization factor into the GRPO loss function ( $\lambda$ -GRPO).
We show that $\lambda$ -GRPO results in higher validation accuracy and a $\sim$ 2x training speedup over standard GRPO (Section 5). On downstream reasoning benchmarks, $\lambda$ -GRPO consistently improves over standard GRPO, demonstrating the superiority of our method.
Our findings call into question the utility of dedicated PRMs for GRPO, and suggest that future work may benefit instead from exploiting the step-level reward signal already available to the GRPO algorithm.
2 Background
2.1 GRPO
GRPO is a variant of PPO that discards the critic model and GAE of the latter. Instead, for each query $q$ in the training set, GRPO nondeterministically samples a group ${\mathbb{G}}$ of $k$ trajectories (completions to $q$ ) and computes the advantage ${a}_{i}$ for the completion $g^{(i)}\in{\mathbb{G}}$ relative to the mean (outcome) reward of ${\mathbb{G}}$ , as in Equation 1 (where ${r}_{i}$ is the reward for $g^{(i)}$ ).
$$
{a}_{i}=\frac{{r}_{i}-r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})} \tag{1}
$$
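As a minimal illustration, the group-relative advantage of Equation 1 can be sketched as follows; the $\epsilon$ in the denominator is a numerical-stability convention common in GRPO implementations, not part of the equation itself:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Equation 1: normalize each outcome reward against the group's
    mean and (population) standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. a group of four completions, two rewarded and two not
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that the advantages of a group always sum to (approximately) zero, since each reward is centered on the group mean.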
In our theoretical analysis in Section 3, we make two key assumptions: first, we assume the use of the DAPO token-level policy gradient objective (Yu et al., 2025), rather than sample-level loss. Although it differs from the original GRPO formulation laid out in Shao et al. (2024), Yu et al. (2025) show that this objective leads to more stable training, and it is the standard GRPO loss function employed in commonly-used RL packages (e.g. the TRL GRPO trainer https://huggingface.co/docs/trl/main/en/grpo_trainer).
Second, we assume that the number of update iterations per batch ( $\mu$ ) is set to $\mu=1$ . Under this assumption, the ratio ${P}_{i,t}$ is fixed at $1.0$ (see Equation 2b), allowing us to ignore the clipping factor of the GRPO loss function in our theoretical analysis.
Under these two assumptions, the per-group GRPO loss $L_{\text{GRPO}}({\mathbb{G}})$ reduces to that in Equation 2.
$$
\displaystyle L_{\text{GRPO}}({\mathbb{G}})=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t} \tag{2a}
$$
$$
{P}_{i,t}=\frac{\pi_{\theta}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}{\pi_{\theta_{\textit{old}}}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})} \tag{2b}
$$
$$
{D}_{i,t}=\beta\cdot\left(\frac{\pi_{\theta_{\textit{ref}}}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}{\pi_{\theta}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}-\ln\frac{\pi_{\theta_{\textit{ref}}}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}{\pi_{\theta}(g^{(i)}_{t}\hskip 2.84526pt|\hskip 2.84526ptq,g^{(i)}_{:t})}-1\right) \tag{2c}
$$
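Under the two assumptions above ( $\mu=1$ , so every $P_{i,t}=1$ ), the per-group loss can be sketched as below. The log-probability arrays and the $\beta$ value are illustrative stand-ins, not the paper's training setup; the KL term follows the estimator in Equation 2c.

```python
import math

def kl_term(logp_ref, logp_theta, beta):
    """KL estimator of Equation 2c: beta * (ratio - ln(ratio) - 1)."""
    ratio = math.exp(logp_ref - logp_theta)
    return beta * (ratio - (logp_ref - logp_theta) - 1.0)

def grpo_loss(group_logps, ref_logps, advantages, beta=0.04):
    """Token-averaged per-group loss of Equation 2a with P_{i,t} = 1.

    group_logps[i][t] is log pi_theta of token t in completion i;
    ref_logps has the same shape for the reference policy.
    """
    total_tokens = sum(len(seq) for seq in group_logps)
    loss = 0.0
    for lp_seq, ref_seq, a in zip(group_logps, ref_logps, advantages):
        for lp, ref in zip(lp_seq, ref_seq):
            loss += 1.0 * a - kl_term(ref, lp, beta)  # P_{i,t} = 1
    return loss / total_tokens
```

When the policy equals the reference (as at the start of training), the KL term vanishes and the loss reduces to a length-weighted average of the outcome advantages.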
2.2 Process Reward Models (PRMs)
Given an alphabet $\Sigma$ (i.e. set of tokens), we formally define a PRM as a function $f_{\phi}\colon\Sigma^{*}\to(\Sigma^{*}\times\mathbb{R})^{*}$ parameterized by $\phi$ that maps a trajectory $y\in\Sigma^{*}$ to the sequence $f_{\phi}(y)=((y_{:i_{1}},r^{(y)}_{0}),(y_{i_{1}:i_{2}},r^{(y)}_{1}),...,(y_{i_{n-1}:},r^{(y)}_{n-1}))$ of pairs of process steps (sub-trajectories) $y_{i_{k}:i_{k+1}}$ and step-level rewards $r^{(y)}_{k}$ .
While PRMs are typically contrasted with outcome reward models (ORMs), which assign a single reward to the entire trajectory, under the above definition an ORM $f_{\phi^{\prime}}^{\prime}$ is simply a trivial PRM: i.e. $f_{\phi^{\prime}}^{\prime}(y)=((y,r^{(y)}_{0}))$ .
Both the division of the trajectory $y$ into steps and the assignment of rewards to those steps are dependent upon the PRM in question. When trajectories are clearly delineated into individual steps, e.g. via ReAct-style prompting (Yao et al., 2023) or by instructing the model to divide its reasoning into demarcated steps, the PRM can simply be directed to assign a reward to each pre-defined step (e.g. Li & Li, 2025). In other cases, trajectories are heuristically split into steps, for example at high-entropy tokens (e.g. Hou et al., 2025).
Although the assignment of step-level reward can be performed by a model with learned parameters $\phi$ (e.g. Uesato et al., 2022), Kazemnejad et al. (2025) and Hou et al. (2025) combine Monte Carlo estimation with outcome-level rewards to yield heuristic PRMs that do not require the labor-intensive annotation of their learned counterparts and are less susceptible to reward hacking. In cases such as these, in which the PRM $f_{\phi}$ is not learned, we simply consider $\phi$ to be fixed/trivial.
3 The PRM in the Machine
Figure 1: Toy example of a group ${\mathbb{G}}=\{g^{(1)},...,g^{(6)}\}$ (left) and its corresponding ${\mathcal{B}}({\mathbb{G}})$ tree (right). Tokens (boxes) are numbered for readability: subscripted numbers within boxes only indicate position. Each process set (node in the ${\mathcal{B}}({\mathbb{G}})$ tree) is a set of trajectories that share a common prefix, and corresponds to a process step (sub-trajectory) spanning those shared tokens: in this figure, colored nodes in ${\mathcal{B}}({\mathbb{G}})$ correspond to those subsequences in ${\mathbb{G}}$ that span tokens/boxes of the same color. Same-colored boxes in ${\mathbb{G}}$ indicate identical sequences of tokens across trajectories only: for example, $g^{(3)}_{:4}=g^{(4)}_{:4}$ and $g^{(4)}_{4:6}=g^{(5)}_{4:6}$ , but it is not necessarily the case that e.g. $g^{(3)}_{0}=g^{(3)}_{1}$ . GRPO implicitly assigns a step-level reward and advantage to the tokens of each process step, which are computed as functions of the mean outcome-level reward of each trajectory in the corresponding process set.
In Section 3.1, we prove that GRPO theoretically induces a PRM (given the assumptions of Section 2.1) as defined in Section 2.2. However, this PRM is non-trivial, i.e. not equivalent to an ORM, only if subsets of trajectories within each group share identical initial sub-trajectories.
In Section 3.2, we empirically demonstrate that such rich, overlapping prefix structures arise very frequently under real-world conditions: this shows that GRPO is âsecretlyâ a non-trivial PRM.
3.1 Theoretical Analysis
Let ${\mathcal{B}}({\mathbb{G}})=\{\lambda\subseteq{\mathbb{G}}\hskip 2.84526pt|\hskip 2.84526pt\exists n\geq 0\hskip 2.84526pt\forall g^{(i)},g^{(k)}\in\lambda\colon g^{(i)}_{:n}=g^{(k)}_{:n}\}$ be the set of all process sets: sets $\lambda\subseteq{\mathbb{G}}$ of completions such that all $g^{(i)}\in\lambda$ are identical up to the $n^{th}$ token, for some $n\geq 0$ (see Figure 1). Note that there is a natural tree structure on ${\mathcal{B}}({\mathbb{G}})$ , which is induced by the $\subseteq$ relation.
Each $\lambda\in{\mathcal{B}}({\mathbb{G}})$ defines a process step within each $g^{(i)}\in\lambda$ , spanning the subsequence $g^{(i)}_{s(\lambda):e(\lambda)}$ from the $s(\lambda)^{th}$ to the $e(\lambda)^{th}$ tokens of $g^{(i)}$ . The endpoint $e(\lambda)$ is defined as the largest $n$ such that $g^{(i)}_{:n}=g^{(k)}_{:n}$ for all $g^{(i)},g^{(k)}\in\lambda$ , and $s(\lambda)$ is defined as the endpoint of the immediate parent $Pa_{{\mathcal{B}}({\mathbb{G}})}(\lambda)$ of $\lambda$ in the tree structure induced on ${\mathcal{B}}({\mathbb{G}})$ .
$$
\begin{split}&e(\lambda)=\textit{max}\{n\geq 0\hskip 2.84526pt|\hskip 2.84526pt\forall g^{(i)},g^{(k)}\in\lambda:g^{(i)}_{:n}=g^{(k)}_{:n}\}\\
&s(\lambda)=\begin{cases}0&\text{if }\lambda=\textit{root}({\mathcal{B}}({\mathbb{G}}))\\
e(Pa_{{\mathcal{B}}({\mathbb{G}})}(\lambda))&\text{otherwise}\end{cases}\end{split}
$$
For example, $s(\{g^{(4)},g^{(5)}\})=e(\{g^{(3)},g^{(4)},g^{(5)}\})=4$ in Figure 1, and $e(\{g^{(4)},g^{(5)}\})=6$ : the process step corresponding to $\{g^{(4)},g^{(5)}\}$ spans $g^{(4)}_{4:6}$ and $g^{(5)}_{4:6}$ .
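The tree structure on ${\mathcal{B}}({\mathbb{G}})$ can be materialized with a simple recursive prefix split. The sketch below treats completions as plain token sequences; the function and tuple layout are illustrative choices of ours, not the paper's implementation:

```python
def common_prefix_len(trajs):
    """Largest n such that all trajectories agree on their first n tokens."""
    n = 0
    while all(len(t) > n for t in trajs) and len({t[n] for t in trajs}) == 1:
        n += 1
    return n

def build_process_tree(trajs, start=0):
    """Return (s, e, members, children) for the process set `trajs`,
    with s and e as defined above. Children are formed by grouping
    trajectories on their first token after the shared prefix."""
    end = common_prefix_len(trajs)
    children = []
    if len(trajs) > 1:
        groups = {}
        for t in trajs:
            key = t[end] if len(t) > end else None
            groups.setdefault(key, []).append(t)
        # guard: duplicate completions would otherwise recurse forever
        children = [build_process_tree(g, start=end)
                    for g in groups.values() if len(g) < len(trajs)]
    return (start, end, trajs, children)
```

For instance, the group `["abc", "abd", "xyz"]` yields a root with $e=0$ , one child $\{$ `abc`, `abd` $\}$ spanning positions 0–2, and singleton leaves below it.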
For each $g^{(i)}\in{\mathbb{G}}$ and each $0\leq t<\textit{len}(g^{(i)})$ , let $\lambda^{(i,t)}\in{\mathcal{B}}({\mathbb{G}})$ denote the unique process set such that $g^{(i)}\in\lambda^{(i,t)}$ and $s(\lambda^{(i,t)})\leq t<e(\lambda^{(i,t)})$ . In other words, $\lambda^{(i,t)}$ is the process step to which the token $g^{(i)}_{t}$ belongs. In Figure 1, $\lambda^{(i,t)}$ corresponds to the set whose color matches that of $g^{(i)}_{t}$ : $\lambda^{(1,0)}=\{g^{(1)},g^{(2)}\}$ , $\lambda^{(1,3)}=\{g^{(1)}\}$ , $\lambda^{(5,5)}=\{g^{(4)},g^{(5)}\}$ , etc.
Now, for each process step defined by some $\lambda\in{\mathcal{B}}({\mathbb{G}})$ , we define the step-level reward $\hat{R}(\lambda)$ via Monte Carlo estimation (Equation 3): $\hat{R}(\lambda)$ is the mean outcome-level reward of the trajectories in $\lambda$ . In other words, $\hat{R}(\lambda)$ is the mean reward of the leaf nodes dominated by $\lambda$ in the tree structure induced on ${\mathcal{B}}({\mathbb{G}})$ , i.e. of the sampled completions to the process step defined by $\lambda$ .
$$
\hat{R}(\lambda)=\frac{\sum_{g^{(i)}\in\lambda}{r}_{i}}{|\lambda|} \tag{3}
$$
For each trajectory $g^{(i)}\in{\mathbb{G}}$ and each $0\leq t<\textit{len}(g^{(i)})$ , define the reward ${R}_{i,t}$ for the token $g^{(i)}_{t}$ as the reward of the process step to which $g^{(i)}_{t}$ belongs: ${R}_{i,t}=\hat{R}(\lambda^{(i,t)})$ . For example, in Figure 1, the step-level reward for the sub-trajectories $g^{(3)}_{:4}$ , $g^{(4)}_{:4}$ , $g^{(5)}_{:4}$ is the mean of the outcome-level rewards for $g^{(3)}$ , $g^{(4)}$ , and $g^{(5)}$ : ${R}_{3,0}=...={R}_{5,3}=\textit{mean}(\{{r}_{3},{r}_{4},{r}_{5}\})$ .
By the definition given in Section 2.2, ${R}_{i,t}$ and ${\mathcal{B}}({\mathbb{G}})$ clearly define a PRM: each $g^{(i)}\in{\mathbb{G}}$ is mapped to the sequence $(g^{(i)}_{:s(\lambda_{1})},{R}_{i,0}),(g^{(i)}_{s(\lambda_{1}):s(\lambda_{2})},{R}_{i,s(\lambda_{1})}),...,(g^{(i)}_{s(\lambda_{n}):},{R}_{i,s(\lambda_{n})})$ , where $(\lambda_{0}={\mathbb{G}})\supset...\supset(\lambda_{n}=\{g^{(i)}\})$ is the unique path in the tree structure induced on ${\mathcal{B}}({\mathbb{G}})$ from the root ${\mathbb{G}}$ to the node $\{g^{(i)}\}$ .
Now, define the step-level advantage ${A}_{i,t}$ for the token $g^{(i)}_{t}$ in a manner analogous to the original GRPO definition in Equation 1, i.e. as the normalized difference between the step-level reward ${R}_{i,t}$ for $g^{(i)}_{t}$ and the mean reward of ${\mathbb{G}}$ : ${A}_{i,t}=({R}_{i,t}-r_{\textit{mean}}({\mathbb{G}}))/r_{\textit{std}}({\mathbb{G}})$ .
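A minimal sketch of the step-level reward (Equation 3) and advantage, assuming a process set is represented simply by the list of outcome rewards of its member trajectories:

```python
import statistics

def step_reward(member_rewards):
    """Equation 3: mean outcome reward over the trajectories in the
    process set (Monte Carlo estimate of the step's value)."""
    return sum(member_rewards) / len(member_rewards)

def step_advantage(member_rewards, group_rewards, eps=1e-8):
    """Step-level advantage A_{i,t}: the step reward normalized against
    the whole group's mean and standard deviation."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return (step_reward(member_rewards) - mean) / (std + eps)
```

A process set whose members' mean reward matches the group mean thus receives an advantage of (approximately) zero, while a set of uniformly rewarded members receives a strongly positive one.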
Replacing the term ${a}_{i}$ with ${A}_{i,t}$ in Equation 2a yields a PRM-aware RL objective (Equation 4).
$$
L_{\text{PRM}}({\mathbb{G}})=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot\hbox{\pagecolor{yellow}$\displaystyle{A}_{i,t}$})-{D}_{i,t} \tag{4}
$$
We now show that the standard GRPO objective defined in Equation 2a with outcome-level rewards ( $L_{\text{GRPO}}$ ) is equivalent to the PRM-aware objective defined in Equations 3 and 4 ( $L_{\text{PRM}}$ ).
**Theorem 1**
*For any query $q$ , policy $\pi_{\theta}$ , and group ${\mathbb{G}}\sim\pi_{\theta}(\cdot\hskip 2.84526pt|\hskip 2.84526ptq)$ with outcome-level rewards $\{r_{i}\}_{g^{(i)}\in{\mathbb{G}}}$ : $L_{\text{GRPO}}({\mathbb{G}})=L_{\text{PRM}}({\mathbb{G}})$ .*
*Proof.*
Let $\lambda$ be any process set in ${\mathcal{B}}({\mathbb{G}})$ , and let $t$ be any integer such that $s(\lambda)\leq t<e(\lambda)$ . We first prove that the sum of the PRM loss terms ${P}_{i,t}\cdot{A}_{i,t}-{D}_{i,t}$ for each trajectory $g^{(i)}\in\lambda$ at the token $t$ is equivalent to the sum of the standard GRPO loss terms ${P}_{i,t}\cdot{a}_{i}-{D}_{i,t}$ (Equation 5). Recall that for any $g^{(i)},g^{(k)}\in\lambda$ , $g^{(i)}_{:t+1}=g^{(k)}_{:t+1}$ by definition: therefore, ${P}_{i,t}={P}_{k,t}$ and ${D}_{i,t}={D}_{k,t}$ (Equations 2b - 2c). Again by definition, $g^{(i)},g^{(k)}\in\lambda$ implies that ${R}_{i,t}={R}_{k,t}=\hat{R}(\lambda)$ (Equation 3), and so ${A}_{i,t}={A}_{k,t}$ . As such, we may define $\hat{P}_{t}(\lambda)={P}_{k,t}$ , $\hat{D}_{t}(\lambda)={D}_{k,t}$ , and $\hat{A}(\lambda)={A}_{k,t}$ in Equation 5, choosing any arbitrary $g^{(k)}\in\lambda$ . $$
\begin{split}&\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{A}_{i,t})-{D}_{i,t}=|\lambda|\cdot((\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda))-\hat{D}_{t}(\lambda))\\
&=|\lambda|\cdot\left(\left(\hat{P}_{t}(\lambda)\cdot\frac{\hat{R}(\lambda)-r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-\hat{D}_{t}(\lambda)\right)=|\lambda|\cdot\left(\left(\hat{P}_{t}(\lambda)\cdot\frac{\sum_{g^{(i)}\in\lambda}\frac{{r}_{i}}{|\lambda|}-r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-\hat{D}_{t}(\lambda)\right)\\
&=\left(\hat{P}_{t}(\lambda)\frac{|\lambda|(\sum_{g^{(i)}\in\lambda}\frac{{r}_{i}}{|\lambda|}-r_{\textit{mean}}({\mathbb{G}}))}{r_{\textit{std}}({\mathbb{G}})}\right)-|\lambda|\hat{D}_{t}(\lambda)=\left(\hat{P}_{t}(\lambda)\frac{\sum_{g^{(i)}\in\lambda}{r}_{i}-\sum_{g^{(i)}\in\lambda}r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-|\lambda|\hat{D}_{t}(\lambda)\\
&=\left(\sum_{g^{(i)}\in\lambda}{P}_{i,t}\frac{{r}_{i}-r_{\textit{mean}}({\mathbb{G}})}{r_{\textit{std}}({\mathbb{G}})}\right)-\sum_{g^{(i)}\in\lambda}{D}_{i,t}=\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}\end{split} \tag{5}
$$ Now, letting $t_{\textit{max}}=\textit{max}_{g^{(i)}\in{\mathbb{G}}}\hskip 2.84526pt\textit{len}(g^{(i)})$ , for each $0\leq t<t_{\textit{max}}$ we can define a partition ${\mathbb{X}}_{t}\subseteq{\mathcal{B}}({\mathbb{G}})$ of $\{g^{(i)}\in{\mathbb{G}}\hskip 2.84526pt|\hskip 2.84526pt\textit{len}(g^{(i)})>t\}$ such that ${\mathbb{X}}_{t}=\{\lambda\in{\mathcal{B}}({\mathbb{G}})\hskip 2.84526pt|\hskip 2.84526pts(\lambda)\leq t<e(\lambda)\}$ is the set of all process sets corresponding to a token span containing the index $t$ . The GRPO loss term $L_{\text{GRPO}}({\mathbb{G}})$ (Equation 2a) can be equivalently expressed as in Equation 6 (and analogously for $L_{\text{PRM}}({\mathbb{G}})$ of Equation 4). $$
L_{\text{GRPO}}({\mathbb{G}})=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t} \tag{6}
$$ We then have the following equalities by Equations 5 and 6: $$
\begin{split}L_{\text{GRPO}}({\mathbb{G}})&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}=\\
\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}&\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{a}_{i})-{D}_{i,t}=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{A}_{i,t})-{D}_{i,t}\\
&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{A}_{i,t})-{D}_{i,t}=L_{\text{PRM}}({\mathbb{G}})\end{split}
$$
$\blacksquare$
In other words, the standard GRPO objective defined in Equation 2a automatically induces the PRM and PRM-aware objective defined in Equations 3 and 4.
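Theorem 1 can also be spot-checked numerically under the Section 2.1 assumptions ( $P_{i,t}=1$ , and $\beta=0$ so $D_{i,t}=0$ ). The sketch below exploits the fact that $\lambda^{(i,t)}$ , the process set containing token $t$ of trajectory $i$ , is exactly the set of trajectories sharing its first $t+1$ tokens; the function names and toy data are ours:

```python
import statistics

def outcome_loss(trajs, rewards, eps=1e-8):
    """Token-summed loss with outcome advantages (L_GRPO, P=1, D=0)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) + eps
    total = sum(len(t) for t in trajs)
    return sum((rewards[i] - mean) / std * len(t)
               for i, t in enumerate(trajs)) / total

def step_loss(trajs, rewards, eps=1e-8):
    """Token-summed loss with step-level advantages (L_PRM, P=1, D=0)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) + eps
    total = sum(len(t) for t in trajs)
    acc = 0.0
    for i, t in enumerate(trajs):
        for pos in range(len(t)):
            # lambda^(i,t): all trajectories sharing the prefix t[:pos+1]
            members = [k for k, u in enumerate(trajs)
                       if u[:pos + 1] == t[:pos + 1]]
            r_hat = sum(rewards[k] for k in members) / len(members)
            acc += (r_hat - mean) / std
    return acc / total
```

On any toy group, the two losses agree to floating-point precision, as Theorem 1 predicts.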
3.2 Empirical Analysis
The theoretical analysis in Section 3.1 shows that the GRPO objective induces a PRM: it remains to be shown, however, that this induced PRM is non-trivial. We refer to the set ${\mathcal{B}}({\mathbb{G}})$ of process sets as trivial if it contains only singleton sets (and, by definition, ${\mathbb{G}}$ itself, with $s({\mathbb{G}})=e({\mathbb{G}})=0$ ), i.e. if ${\mathcal{B}}({\mathbb{G}})=\{{\mathbb{G}}\}\cup\{\{g^{(i)}\}\hskip 2.84526pt|\hskip 2.84526ptg^{(i)}\in{\mathbb{G}}\}$ .
If ${\mathcal{B}}({\mathbb{G}})$ is trivial, then there are no meaningful process steps within each trajectory, and only the outcome reward has an effect on learning. On the other hand, if ${\mathcal{B}}({\mathbb{G}})$ is not trivial, then Theorem 1 entails that the standard GRPO objective of Equation 2a induces a non-trivial PRM.
To analyze the complexity of real-world, GRPO-derived step-level rewards, we empirically evaluated ${\mathcal{B}}({\mathbb{G}})$ structures generated during standard GRPO training: for each group ${\mathbb{G}}$ , we computed its ${\mathcal{B}}({\mathbb{G}})$ tree, and measured the number of intermediate nodes between the root ${\mathbb{G}}$ and each terminal node $\{g^{(i)}\}$ (path depth), as a proxy for the complexity of the ${\mathcal{B}}({\mathbb{G}})$ structure.
In addition, for each completion $g^{(i)}\in{\mathbb{G}}$ , we counted the number of tokens $n^{(i)}_{\textit{term}}=e(\{g^{(i)}\})-s(\{g^{(i)}\})$ contained in the process step corresponding to the terminal node $\{g^{(i)}\}$ , i.e. the number of tokens unique to $g^{(i)}$ , and calculated the intermediate proportion ${p}_{i}$ of $g^{(i)}$ : ${p}_{i}=(\textit{len}(g^{(i)})-n^{(i)}_{\textit{term}})/\textit{len}(g^{(i)})$ . Higher values of ${p}_{i}$ indicate that a greater proportion of the trajectory $g^{(i)}$ belongs to intermediate process steps and is therefore assigned non-trivial step-level reward.
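The intermediate proportion can be computed without materializing the full ${\mathcal{B}}({\mathbb{G}})$ tree, since $s(\{g^{(i)}\})$ equals the length of the longest prefix that $g^{(i)}$ shares with any other trajectory in the group. A sketch, with illustrative names:

```python
def intermediate_proportion(trajs, i):
    """p_i = (len - n_term) / len, where n_term counts the tokens
    unique to trajectory i (its terminal process step)."""
    t = trajs[i]
    shared = 0  # s({g^(i)}): longest prefix shared with another trajectory
    for k, u in enumerate(trajs):
        if k == i:
            continue
        n = 0
        while n < min(len(t), len(u)) and t[n] == u[n]:
            n += 1
        shared = max(shared, n)
    n_term = len(t) - shared
    return (len(t) - n_term) / len(t)
```

A trajectory sharing no prefix with any other completion has $p_{i}=0$ ; a trajectory that diverges only on its final tokens has $p_{i}$ close to 1.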
Figure 2: Validation reward (exact-match accuracy; left), ${\mathcal{B}}({\mathbb{G}})$ root-to-terminal path depth (center), and proportions of trajectories spanned by intermediate (non-terminal) process steps (right) for GRPO runs with group sizes of 36 (top) and 6 (bottom).
Experimental Setup.
We trained two DeepSeek-R1-Distill-Qwen-1.5B models (DeepSeek-AI, 2025) on the OpenRS (Dang & Ngo, 2025) dataset using the standard GRPO algorithm and objective of Equation 2. We selected 125 OpenRS examples at random to serve as a validation set.
The first model was trained for 1675 steps with a group size of six and a learning rate of $6\times 10^{-6}$ . The second was trained with a group size of 36 and a learning rate of $10^{-6}$ for 275 steps (fewer steps, owing to the larger group size). Both models were trained with a maximum new token limit of 4096, a batch size of four, and a temperature of 0.75. Additional training details are located in Appendix B.
Results.
Figure 2 shows that both path depth and intermediate proportion increase drastically as validation reward saturates, for group sizes of six and 36. These results are supported by Yu et al. (2025), who find that entropy decreases sharply as GRPO training progresses: this indicates that increasingly rich PRM-inducing structures arise as the model converges on a locally optimal policy.
In addition, we found that only twelve of the 6,700 ${\mathcal{B}}({\mathbb{G}})$ structures generated with a group size of six were trivial ( $\sim$ 0.2%). With a group size of 36, none of the 1,100 generated groups yielded a trivial ${\mathcal{B}}({\mathbb{G}})$ structure. Examples of non-trivial ${\mathcal{B}}({\mathbb{G}})$ structures from this experiment are given in Appendix C.
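The triviality check behind this count is straightforward to implement. Below is a minimal sketch based on our reading of the definition (not the authors' released code): a group's ${\mathcal{B}}({\mathbb{G}})$ structure is trivial exactly when no two completions share a non-empty token prefix, i.e. the prefix tree branches completely at the root, which reduces to checking whether any first token repeats.

```python
from typing import Sequence

def is_trivial(group: Sequence[Sequence[int]]) -> bool:
    """A B(G) structure is trivial when no two completions in the group
    share a non-empty token prefix, so every branch splits at the root.
    Two completions share a non-empty prefix iff their first tokens match."""
    first_tokens = [g[0] for g in group if len(g) > 0]
    # A repeated first token means at least two trajectories share a prefix,
    # producing an intermediate (non-root, non-terminal) process step.
    return len(first_tokens) == len(set(first_tokens))

# Toy example: two completions starting with the same token make the
# structure non-trivial; fully distinct first tokens keep it trivial.
print(is_trivial([[17, 3, 5], [17, 9], [42, 1]]))  # -> False (non-trivial)
print(is_trivial([[1, 2], [3, 4], [5, 6]]))        # -> True (trivial)
```

In practice one would run this over token-ID tensors from each generated group; the check is O(|group|) per group and adds negligible overhead.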
In conjunction with the theoretical analysis of Section 3.1, the results of this experiment demonstrate that GRPO induces a non-trivial PRM under real-world conditions. In Section 4, we show that this induced PRM carries a serious flaw that is detrimental to RL training, and propose a minor modification to the GRPO algorithm to mitigate this shortcoming.
4 Proposed Approach: $\lambda$ -GRPO
Viewing the GRPO objective in terms of process set partitions ${\mathbb{X}}_{t}$ (see Equation 6), we note that the contribution of each trajectory $g^{(i)}\in{\mathbb{G}}$ to the loss at index $t$ is identical to that of all other trajectories in the process set $\lambda^{(i,t)}$ (where $\hat{P}_{t}(\lambda)$ , $\hat{D}_{t}(\lambda)$ , and $\hat{A}(\lambda)$ are defined as in Equation 5):
$$
\begin{split}\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}&\sum_{t=0}^{\textit{len}(g^{(i)})-1}({P}_{i,t}\cdot{A}_{i})-{D}_{i,t}=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}\sum_{g^{(i)}\in\lambda}({P}_{i,t}\cdot{A}_{i})-{D}_{i,t}\\
=&\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}|\lambda|\cdot((\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda))-\hat{D}_{t}(\lambda))\end{split} \tag{7}
$$
The contribution of each process set $\lambda\in{\mathcal{B}}({\mathbb{G}})$ to the overall loss, $\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda)-\hat{D}_{t}(\lambda)$ , is scaled by $|\lambda|$ : this carries the danger of harming exploration (for $\hat{A}(\lambda)>0$ ) and exploitation (for $\hat{A}(\lambda)<0$ ). Consider some process set $\lambda$ with $|\lambda|\gg 1$ . If $\hat{A}(\lambda)>0$ , then the increase in probability assigned to the process step corresponding to $\lambda$ by $\pi_{\theta}$ under GRPO is compounded by a factor of $|\lambda|$ , decreasing the likelihood of exploring process steps that are dissimilar from $\lambda$ in subsequent training episodes.
Conversely, if $\hat{A}(\lambda)<0$ , then the decrease in probability assigned to $\lambda$ under GRPO is compounded by a factor of $|\lambda|$ , decreasing the likelihood of exploiting high-reward trajectories in $\lambda$ . To illustrate, consider the group ${\mathbb{G}}$ in Figure 1, assume ${r}_{1}={r}_{2}={r}_{6}=0.5$ , ${r}_{4}={r}_{5}=0$ , ${r}_{3}=1$ , and let $\lambda=\{g^{(3)},g^{(4)},g^{(5)}\}$ . Then $\hat{A}(\lambda)=-0.22$ : despite the fact that $g^{(3)}$ has the highest reward, the probability of the sub-trajectory $g^{(3)}_{:4}$ is decreased under the GRPO objective, thereby decreasing the overall likelihood of generating the completion $g^{(3)}$ . The term $|\lambda|$ in Equation 7 then scales this decrease in probability by a factor of three.
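The value $\hat{A}(\lambda)=-0.22$ can be verified numerically. The sketch below assumes the standard GRPO group normalization $A_i=(r_i-\text{mean}(r))/\text{std}(r)$ with the *sample* standard deviation, and takes $\hat{A}(\lambda)$ to be the mean advantage over the trajectories in $\lambda$:

```python
import statistics

# Rewards for the six trajectories in the worked example:
# r1 = r2 = r6 = 0.5, r3 = 1, r4 = r5 = 0.
rewards = [0.5, 0.5, 1.0, 0.0, 0.0, 0.5]

mean = statistics.mean(rewards)
std = statistics.stdev(rewards)  # sample standard deviation

# Trajectory-level GRPO advantages: A_i = (r_i - mean) / std.
advantages = [(r - mean) / std for r in rewards]

# Process set lambda = {g3, g4, g5}; A_hat(lambda) is its mean advantage.
lam = [advantages[2], advantages[3], advantages[4]]
a_hat = sum(lam) / len(lam)
print(round(a_hat, 2))  # -> -0.22
```

Note that $g^{(3)}$ alone has a strongly positive advantage ($\approx 1.55$), yet the shared prefix it sits on receives a negative update because its two low-reward siblings dominate the process-set mean.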
We propose scaling the token-level loss for $g^{(i)}_{t}$ by $|\lambda^{(i,t)}|^{-1}$ ( $\lambda$ -GRPO; Equation 8): this has the effect of canceling out the term $|\lambda|$ in Equation 7, so that each process set contributes equally to the loss at index $t$ .
$$
\begin{split}L_{\lambda\text{-GRPO}}({\mathbb{G}})&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{g^{(i)}\in{\mathbb{G}}}\sum_{t=0}^{\textit{len}(g^{(i)})-1}\frac{({P}_{i,t}\cdot{A}_{i})-{D}_{i,t}}{|\lambda^{(i,t)}|}\\
&=\frac{1}{\sum_{g^{(i)}\in{\mathbb{G}}}\textit{len}(g^{(i)})}\cdot\sum_{t=0}^{t_{\textit{max}}-1}\sum_{\lambda\in{\mathbb{X}}_{t}}(\hat{P}_{t}(\lambda)\cdot\hat{A}(\lambda))-\hat{D}_{t}(\lambda)\end{split} \tag{8}
$$
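The scaling factors $|\lambda^{(i,t)}|$ in Equation 8 are cheap to compute from the generated token sequences. The sketch below assumes (our reading of Equation 6) that the process set $\lambda^{(i,t)}$ collects the trajectories in the group that share the prefix $g^{(i)}_{:t+1}$:

```python
from typing import List, Sequence

def process_set_sizes(group: Sequence[Sequence[int]]) -> List[List[int]]:
    """For each trajectory g_i and token index t, count how many trajectories
    in the group share the prefix g_i[:t+1]. This count is |lambda^{(i,t)}|,
    the divisor applied to the token-level loss in lambda-GRPO."""
    sizes = []
    for g in group:
        row = []
        for t in range(len(g)):
            prefix = tuple(g[:t + 1])
            row.append(sum(1 for h in group if tuple(h[:t + 1]) == prefix))
        sizes.append(row)
    return sizes

# Two trajectories share the prefix [1, 2]; the third is disjoint.
group = [[1, 2, 3], [1, 2, 4], [5, 6]]
print(process_set_sizes(group))  # -> [[2, 2, 1], [2, 2, 1], [1, 1]]
```

This naive version recompares prefixes and is quadratic in the group size; a single pass over a shared prefix tree gives the same counts in linear time, and either way the cost is negligible next to generation.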
5 Experiments
To evaluate our proposed approach, we trained DeepSeek-R1-Distill-Qwen-1.5B and Llama-3.2-1B-Instruct (https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) with the $\lambda$ -GRPO (Equation 8) objective on the OpenRS dataset of Section 3.2, and compared them to standard GRPO (Equation 2a) models trained with an identical setup. All models were evaluated on an OpenRS validation set and five downstream reasoning benchmarks (see Section 5.1).
5.1 Experimental Setup
All models were trained for 1,000 steps with a group size of six, a batch size of four, a maximum of 4,096 new tokens, and a temperature of 0.75. We conducted two sets of trials across the two models (for a total of four trials): in the first, we set the KL coefficient $\beta=0.0$ , and in the second $\beta=0.04$ . The Qwen models were trained with a learning rate of $10^{-6}$ ; the Llama models were trained with a learning rate of $5\times 10^{-7}$ for the $\beta=0.0$ trial and $10^{-7}$ for $\beta=0.04$ (as training was highly unstable with higher learning rates for Llama). Additional training details are located in Appendix B.
We evaluated the models on the AIME24 (https://huggingface.co/datasets/AI-MO/aimo-validation-aime), MATH-500 (Hendrycks et al., 2021; Lightman et al., 2024), AMC23 (https://huggingface.co/datasets/AI-MO/aimo-validation-amc), Minerva (Lewkowycz et al., 2022), and OlympiadBench (He et al., 2024) benchmarks, using the LightEval framework (Habib et al., 2023; Dang & Ngo, 2025). All models were evaluated at the checkpoint corresponding to the step at which they achieved maximum validation accuracy. As in the experiment in Section 3.2, we withheld 125 examples as a validation set.
5.2 Results and Discussion
Figure 3: Models' validation accuracy across training steps. Peak accuracy is highlighted by vertical dashed lines.
| Model | $\beta$ | Version | AIME24 | MATH-500 | AMC23 | Minerva | OB | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen | — | Base | 0.2000 ±0.0743 | 0.8300 ±0.0168 | 0.7500 ±0.0693 | 0.2978 ±0.0278 | 0.5096 ±0.0193 | 0.5175 |
| Qwen | 0.0 | GRPO | 0.3333 ±0.0875 | 0.7660 ±0.0190 | 0.6250 ±0.0775 | 0.2500 ±0.0263 | 0.4444 ±0.0191 | 0.4837 |
| Qwen | 0.0 | $\lambda$ -GRPO (ours) | 0.3667 ±0.0895 | 0.8460 ±0.0162 | 0.7500 ±0.0693 | 0.2904 ±0.0276 | 0.5348 ±0.0192 | 0.5576 |
| Qwen | 0.04 | GRPO | 0.3000 ±0.0851 | 0.8660 ±0.0152 | 0.7500 ±0.0693 | 0.2610 ±0.0267 | 0.5200 ±0.0192 | 0.5394 |
| Qwen | 0.04 | $\lambda$ -GRPO (ours) | 0.4000 ±0.0910 | 0.8340 ±0.0167 | 0.8000 ±0.0641 | 0.2978 ±0.0278 | 0.5378 ±0.0192 | 0.5739 |
| Llama | — | Base | 0.0000 ±0.0000 | 0.2280 ±0.0188 | 0.0750 ±0.0422 | 0.0478 ±0.0130 | 0.0563 ±0.0089 | 0.0814 |
| Llama | 0.0 | GRPO | 0.0000 ±0.0000 | 0.2300 ±0.0188 | 0.0750 ±0.0422 | 0.0551 ±0.0130 | 0.0607 ±0.0089 | 0.0842 |
| Llama | 0.0 | $\lambda$ -GRPO (ours) | 0.0000 ±0.0000 | 0.2620 ±0.0197 | 0.1250 ±0.0530 | 0.0515 ±0.0134 | 0.0622 ±0.0092 | 0.1001 |
| Llama | 0.04 | GRPO | 0.0000 ±0.0000 | 0.2180 ±0.0185 | 0.1750 ±0.0608 | 0.0515 ±0.0134 | 0.0533 ±0.0087 | 0.0996 |
| Llama | 0.04 | $\lambda$ -GRPO (ours) | 0.0333 ±0.0333 | 0.2560 ±0.0195 | 0.0750 ±0.0422 | 0.0735 ±0.0159 | 0.0489 ±0.0083 | 0.0973 |
Table 1: Exact-match accuracy for the base and GRPO-/ $\lambda$ -GRPO-trained Llama and Qwen models on downstream reasoning datasets ( $\text{OB}=\text{OlympiadBench}$ ). The best results in each column are indicated in bold, and the best results within each model type (i.e. Llama or Qwen) are underlined. Confidence intervals are subscripted. For each $\lambda$ -GRPO-trained model, results are given in green if it outperforms its GRPO-trained counterpart and the base model; yellow if it outperforms only its GRPO-trained counterpart; orange if it only improves over the base model; and red if it fails to outperform either model (see Table 2 in the Appendix for exact differences).
All four $\lambda$ -GRPO models reach a higher validation accuracy in fewer steps than their GRPO counterparts (see Figure 3): on average, $\lambda$ -GRPO yields a more than 10% increase over the standard GRPO validation accuracy, in less than half the number of training steps.
This increase in validation accuracy corresponds to improved performance on downstream reasoning tasks (see Table 1). In total, the $\lambda$ -GRPO models outperform standard GRPO on 15/20 cells (excluding average performance) in Table 1, and they improve over the base Llama/Qwen models on 14/20 cells. Only the Llama $\lambda$ -GRPO model with $\beta=0.04$ failed to outperform its GRPO counterpart on average downstream performance; even so, it still outperformed standard GRPO on a majority (3/5) of the tasks.
In addition, these performance gains come with effectively zero training slowdown: since we merely detect ${\mathcal{B}}({\mathbb{G}})$ structures that occur during training, rather than generating them (e.g. Kazemnejad et al., 2025; Hou et al., 2025; Yang et al., 2025a), the added computational cost of $\lambda$ -GRPO (vs. standard GRPO) is negligible.
6 Related Work
Monte Carlo Sampling for Finer-Grained Rewards.
As discussed in Section 2.2, the labor-intensive human annotation required to obtain step-level rewards for PRM training has driven the development of heuristic methods for PRMs and PRM-like finer-grained reward signals, in particular those based on Monte Carlo estimation. Kazemnejad et al. (2025) replace the critic model in PPO with Monte Carlo estimation: multiple completions are sampled for each step, and the value for that step is derived from their mean outcome reward. On the other hand, Wang et al. (2024) train a neural PRM, using Monte Carlo estimation in a similar manner to Kazemnejad et al. (2025) in order to obtain step-level training rewards, and thereby avoid the need for costly human annotation.
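The Monte Carlo step-value estimation shared by these methods can be sketched abstractly. In the snippet below, `sample_completion` and `outcome_reward` are hypothetical stand-ins for the policy rollout and the outcome-level reward function; the specific estimators in VinePPO and Math-Shepherd differ in detail:

```python
import random
from typing import Callable, Sequence

def mc_step_value(prefix: Sequence[int],
                  sample_completion: Callable[[Sequence[int]], Sequence[int]],
                  outcome_reward: Callable[[Sequence[int]], float],
                  k: int = 8) -> float:
    """Estimate the value of an intermediate step by sampling k completions
    from its prefix and averaging their outcome-level rewards."""
    total = 0.0
    for _ in range(k):
        completion = sample_completion(prefix)
        total += outcome_reward(list(prefix) + list(completion))
    return total / k

# Toy check: a degenerate sampler appends one random token, and the reward
# is 1 iff the trajectory ends in token 0, so the value should be near 0.5.
random.seed(0)
v = mc_step_value(
    prefix=[1, 2],
    sample_completion=lambda p: [random.choice([0, 1])],
    outcome_reward=lambda traj: 1.0 if traj[-1] == 0 else 0.0,
    k=1000,
)
print(0.4 < v < 0.6)  # -> True
```

The contrast with this paper's observation is that GRPO already performs such an estimate implicitly: whenever completions in a group overlap, the shared prefix's effective value is the mean outcome reward of the trajectories passing through it, with no extra rollouts required.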
Xie et al. (2024) generate step-level preference data for Direct Preference Optimization (DPO; Rafailov et al., 2023) training via Monte Carlo Tree Search and outcome-level rewards: sequences of overlapping trajectories are obtained by forking trajectories at defined split points to construct tree structures similar to the $\mathcal{B}(\mathbb{G})$ trees introduced in Section 3.1. The daughter nodes with the highest and lowest mean reward are then selected as the preferred and dispreferred (respectively) sequences for step-level DPO. Similarly, Hou et al. (2025) construct $\mathcal{B}(\mathbb{G})$ -like trees by splitting generated trajectories at high-entropy tokens to create multiple completions of the same initial trajectory prefix. Sub-trajectory rewards are then derived from the mean reward of the corresponding node's daughters in the tree structure. Yang et al. (2025a) employ an analogous approach to generate step-level rewards for GRPO training. Unlike standard GRPO, however, the advantages for each node (step) are computed relative to the rewards of the node's sisters in the tree structure, rather than the entire group.
These methods are orthogonal to our approach: they apply Monte Carlo estimation to explicitly construct step-level reward signals from outcome-level rewards, while we leverage the implicit step-level rewards already present in standard GRPO.
PRMs with GRPO.
Aside from Yang et al. (2025a), GRPO has been employed with PRMs in Shao et al. (2024) and Feng et al. (2025). Shao et al. (2024) modify the advantage computation of GRPO to account for step-level rewards: normalized rewards are computed relative to all step-level rewards of all trajectories in the group, and the advantage for each step is the sum of the normalized reward of each subsequent step in its trajectory. Feng et al. (2025) construct a two-level variant of GRPO, in which standard, trajectory-level GRPO advantage is combined with a step-level GRPO advantage, and step-level groups are dynamically computed according to the similarity of steps across trajectories.
Our results in Sections 3 and 5 call into question the necessity of adapting GRPO to step-level rewards, given the rich step-level reward signal already present in the outcome-level variant of the algorithm.
7 Conclusion
In this paper, we demonstrated both theoretically and empirically that the standard GRPO algorithm induces a PRM that derives step-level rewards via Monte Carlo estimation. We then showed that this hidden PRM is faulty, and as a result is potentially detrimental to exploration and exploitation. To mitigate this flaw, we introduced a process-step-aware scaling factor to GRPO to derive the $\lambda$ -GRPO objective. Models trained with $\lambda$ -GRPO reach higher validation accuracy faster than with standard GRPO, and achieve improved performance on downstream reasoning tasks.
Our results indicate that it is possible to leverage the existing PRM structure inherent in the outcome-based GRPO algorithm, rather than employing costly, explicitly-defined PRMs. The limitations of this work are discussed in Appendix A.
Reproducibility Statement
We provide the complete proof of Theorem 1 in the main body of the paper. All settings and hyperparameters for the experiments conducted in this work are given in Sections 3.2 / 5.1 and Appendix B. All code is available on GitHub: https://github.com/coli-saar/grpo-prm/.
References
- Cui et al. (2025) Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- Dang & Ngo (2025) Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn't. arXiv preprint arXiv:2503.16219, 2025.
- DeepSeek-AI (2025) DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Feng et al. (2025) Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025.
- Habib et al. (2023) Nathan Habib, ClĂ©mentine Fourrier, Hynek KydlĂÄek, Thomas Wolf, and Lewis Tunstall. Lighteval: A lightweight framework for llm evaluation, 2023. URL https://github.com/huggingface/lighteval.
- He et al. (2024) Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3828–3850, 2024.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
- Hou et al. (2025) Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. TreeRL: LLM reinforcement learning with on-policy tree search. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12355–12369, 2025.
- Kazemnejad et al. (2025) Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs. In Forty-second International Conference on Machine Learning, 2025.
- Lewkowycz et al. (2022) Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- Li & Li (2025) Wendi Li and Yixuan Li. Process reward model with q-value rankings. In The Thirteenth International Conference on Learning Representations, 2025.
- Lightman et al. (2024) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024.
- Luo et al. (2025) Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel CandĂšs, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- Qian et al. (2025) Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-TĂŒr, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 53728–53741, 2023.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Setlur et al. (2025) Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. In The Thirteenth International Conference on Learning Representations, 2025.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Sullivan et al. (2025) Michael Sullivan, Mareike Hartmann, and Alexander Koller. Procedural environment generation for tool-use agents. arXiv preprint arXiv:2506.11045, 2025.
- Uesato et al. (2022) Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
- Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426â9439, 2024.
- Xie et al. (2024) Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning. In The First Workshop on System-2 Reasoning at Scale, NeurIPSâ24, 2024.
- Yang et al. (2025a) Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, and Jing Tang. Treerpo: Tree relative policy optimization. arXiv preprint arXiv:2506.05183, 2025a.
- Yang et al. (2025b) Zonglin Yang, Zhexuan Gu, Houduo Qi, and Yancheng Yuan. Accelerating rlhf training with reward variance increase. arXiv preprint arXiv:2505.23247, 2025b.
- Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
- Yu et al. (2025) Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- Zhang et al. (2025) Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025.
| Model | $\beta$ | Version | AIME24 | MATH-500 | AMC23 | Minerva | OB | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen | 0.0 | Base | +0.1667 | +0.0160 | 0.0000 | -0.0074 | +0.0252 | +0.0401 |
| Qwen | 0.0 | GRPO | +0.0334 | +0.0800 | +0.1250 | +0.0404 | +0.0904 | +0.0738 |
| Qwen | 0.04 | Base | +0.2000 | +0.0040 | +0.0500 | 0.0000 | +0.0282 | +0.0564 |
| Qwen | 0.04 | GRPO | +0.1000 | -0.0320 | +0.0500 | +0.0368 | +0.0178 | +0.0345 |
| Llama | 0.0 | Base | 0.0000 | +0.0340 | +0.0500 | +0.0037 | +0.0059 | +0.0187 |
| Llama | 0.0 | GRPO | 0.0000 | +0.0320 | +0.0500 | -0.0036 | +0.0015 | +0.0160 |
| Llama | 0.04 | Base | +0.0333 | +0.0280 | 0.0000 | +0.0257 | -0.0074 | +0.0159 |
| Llama | 0.04 | GRPO | +0.0333 | +0.0380 | -0.1000 | +0.0220 | -0.0044 | -0.0022 |
Table 2: Difference in accuracy between the $\lambda$ -GRPO-trained models, and their corresponding base models and standard-GRPO-trained counterparts. Positive differences (i.e. $\lambda$ -GRPO outperforms the comparison model) are highlighted in green; negative differences (i.e. the comparison model outperforms $\lambda$ -GRPO) are highlighted in red. For example, the top-most entry in the AIME24 column indicates that the $\lambda$ -GRPO Qwen model with $\beta=0.0$ outperformed the base DeepSeek-R1-Distill-Qwen-1.5B by 0.1667 on the AIME24 benchmark.
Appendix A Limitations
Due to computational resource constraints, we were only able to conduct the experiments in Sections 3.2 and 5 with relatively small models: 1.5 billion (Qwen) and 1 billion (Llama) parameters. Similarly, we use only one dataset for RL training in both experiments, although OpenRS is itself a combination of the s1 (Muennighoff et al., 2025) and DeepScaleR (Luo et al., 2025) datasets. Future work should extend our findings regarding the non-triviality of the GRPO-induced PRM and the effectiveness of $\lambda$ -GRPO to larger models and more diverse (training) datasets.
Finally, the objective of this work is to expose the PRM induced by the GRPO algorithm, and to highlight the deficiencies of that PRM as described in Section 4. To that end, our proposed $\lambda$ -GRPO method does not actually remedy the anti-exploitation effect of the GRPO-induced PRM; it merely lessens its impact. In future work, we intend to investigate more extensive modifications to the GRPO algorithm, with the goal of entirely solving the problems laid out in Section 4.
Appendix B Experimental Setup
All experiments were conducted on a single NVIDIA H100 GPU. We trained all models with 24 gradient accumulation steps and a generation batch size of six. The models were evaluated on the validation split every 25 training steps.
We additionally hard-coded the generation procedure to halt once "\boxed{...}" was detected: this was to prevent the model from generating multiple boxed answers for a single prompt.
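This halting check can be sketched as a test on the incrementally decoded text; how it hooks into the generation loop depends on the framework, so the following is only a minimal, framework-agnostic illustration:

```python
import re

# Stop once a complete \boxed{...} span appears in the decoded text.
# The pattern allows one level of nested braces, covering typical
# answers such as \boxed{\frac{1}{2}}.
BOXED = re.compile(r"\\boxed\{(?:[^{}]|\{[^{}]*\})*\}")

def should_stop(decoded_text: str) -> bool:
    """Return True when generation has produced a closed \\boxed{...} span."""
    return BOXED.search(decoded_text) is not None

print(should_stop(r"The answer is \boxed{\frac{1}{2}}"))  # -> True
print(should_stop(r"Let x = \boxed{unclosed"))            # -> False
```

Running this predicate on the running decode after each step (or each small chunk of tokens) suffices, since generation only needs to stop at the first complete boxed answer.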
Appendix C ${\mathcal{B}}({\mathbb{G}})$ Structure Examples
The following (Figures 4, 5, 6) represent ${\mathcal{B}}({\mathbb{G}})$ structures on groups generated during the group-size-6 trial of the experiment in Section 3.2. A full trajectory is reconstructed by tracing the unique path from the root to a terminal node. The root (red) corresponds to the prompt/query $q$ . Terminal nodes (yellow) denote singleton process steps $\{g^{(i)}\}$ ; each non-terminal node $\lambda$ (white; including the root) denotes the process step corresponding to the set of all terminal nodes dominated by $\lambda$ . For the sake of presentation, overlong terminal steps are truncated with "...".
Figure 4: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1 (see the beginning of Appendix C for additional details).
<details>
<summary>x5.png Details</summary>

### Visual Description
## Mathematical Problem Sheet: Probability of Drawing Balls
### Overview
The image presents a handwritten sheet detailing the calculation of the probability of drawing two balls of the same color from a bag containing 3 black balls and 2 red balls. It uses combinatorial formulas and step-by-step explanations. The document is primarily in English, with some mathematical notation.
### Components/Axes
The sheet is structured as a series of calculations and explanations. There are no traditional axes or charts. The key components are:
* **Problem Statement:** Initial description of the scenario.
* **Formula Introduction:** Explanation of the combination formula.
* **Step-by-Step Calculation:** Breakdown of the probability calculation into smaller steps.
* **Final Probability:** The calculated probability of drawing two balls of the same color.
* **Mathematical Notation:** Use of LaTeX-like symbols for fractions and combinations (e.g., `\frac{2}{5}`, `\binom{n}{k}`).
### Detailed Analysis or Content Details
The problem begins with the statement: "In a bag, there are 5 balls of the same size, including 3 black balls and 2 red balls. If two balls are drawn at random, the probability of drawing two balls of the same color is ( )."
The solution proceeds as follows:
1. **Total Number of Balls:** The total number of balls is stated as 5.
2. **Combination Formula:** The combination formula is given as `\binom{n}{k}`, where 'n' is the total number of balls and 'k' is the number of balls to draw.
3. **Total Ways to Draw 2 Balls:** The total number of ways to draw 2 balls from 5 is calculated as `\binom{5}{2} = \frac{5!}{2!(5-2)!} = \frac{5!}{2!3!} = \frac{5 \times 4}{2 \times 1} = 10`.
4. **Ways to Draw 2 Black Balls:** The number of ways to draw 2 black balls from 3 is calculated as `\binom{3}{2} = \frac{3!}{2!(3-2)!} = \frac{3!}{2!1!} = \frac{3 \times 2}{2 \times 1} = 3`.
5. **Ways to Draw 2 Red Balls:** The number of ways to draw 2 red balls from 2 is calculated as `\binom{2}{2} = \frac{2!}{2!(2-2)!} = \frac{2!}{2!0!} = \frac{2 \times 1}{2 \times 1} = 1`.
6. **Total Ways to Draw 2 Balls of the Same Color:** The total number of ways to draw 2 balls of the same color is the sum of the ways to draw 2 black balls and 2 red balls: `3 + 1 = 4`.
7. **Probability Calculation:** The probability of drawing 2 balls of the same color is calculated as the number of ways to draw 2 balls of the same color divided by the total number of ways to draw 2 balls: `\frac{4}{10} = \frac{2}{5}`.
The final answer is stated as: "Therefore, the probability of drawing two balls of the same color is \frac{2}{5}."
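The calculation described above can be checked with a short Python sketch (ours, not part of the original sheet) using `math.comb` and `fractions.Fraction` for exact arithmetic:

```python
from math import comb
from fractions import Fraction

total = comb(5, 2)       # ways to draw 2 balls from 5 -> 10
same_black = comb(3, 2)  # ways to draw 2 black balls from 3 -> 3
same_red = comb(2, 2)    # ways to draw 2 red balls from 2 -> 1

# Probability of drawing two same-colored balls, as an exact fraction.
p_same = Fraction(same_black + same_red, total)
print(p_same)  # 2/5
```

Working with `Fraction` rather than floats reproduces the simplified fraction 2/5 given on the sheet directly.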
### Key Observations
* The solution uses a clear, step-by-step approach.
* The mathematical notation is consistent throughout the calculation.
* The final answer is a simplified fraction.
* The handwriting is legible, but contains some minor inconsistencies in symbol formation.
### Interpretation
The document demonstrates a standard application of combinatorics to solve a probability problem. The problem is well-defined, and the solution is logically sound. The use of the combination formula is appropriate for calculating the number of ways to choose balls without regard to order. The final probability of 2/5 (or 0.4) indicates that there is a 40% chance of drawing two balls of the same color from the bag. The document serves as a clear example of how to apply mathematical principles to real-world scenarios. The consistent use of the combination formula and the step-by-step breakdown of the calculation make the solution easy to follow and understand.
</details>
Figure 5: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1001 (see the beginning of Appendix C for additional details).
<details>
<summary>x6.png Details</summary>

### Visual Description
## Diagram: Probability Problem Breakdown
### Overview
The image presents a diagram outlining the steps to solve a probability problem. The problem involves finding the probability that the distance between two points randomly selected on a line segment of length *b* is at least *a*, where 0 < *a* < *b*. The diagram breaks down the problem into smaller steps, using geometric reasoning and uniform distribution concepts. The diagram is structured as a flow chart with text blocks explaining each step.
### Components/Axes
The diagram consists of several text blocks connected by arrows, visually representing a logical flow. There are no explicit axes or scales. The diagram is organized into a series of steps, starting with the problem statement and progressing through the solution approach.
### Detailed Analysis or Content Details
**Top Block:**
"Let $a$ and $b$ be given positive real numbers, with $a < b$. If two points are selected at random from a straight line segment of length $b$, what is the probability that the distance between them is at least $a$?"
**First Branch (Left):**
"points selected at random from a straight line segment of length (b), we need to find the probability that the distance between them is at least (a), where ("
"a) and (b) are positive real numbers with (a < b). First, we recognize that the problem can be modeled using a uniform distribution over the interval (0, b)..."
A rectangle containing "0 < a < b" is present.
**First Branch (Right):**
"positive real numbers (a) and (b) such that (a < b). We need to find the probability that the distance between two randomly selected points on a"
"line segment of length (b) is at least (a). First, we note that the total number of possible pairs of points is (b^2), since each point can be anywhere along the segment of length (b)..."
**Second Branch (Left):**
"note that the total number of ways to select two points from a segment of length (b) is represented by the number of possible positions for these points. Since the points are selected continuously, we can model this using a unit square where each axis..."
**Second Branch (Right):**
"recognize that the problem can be modeled using a uniform distribution over the interval (0, b). First, let the positions of the two points be (X) and (Y), where (X) and (Y) are independent random variables uniformly distributed over (0, b). The distance between the two points is |X - Y|. We need to find the probability that |X - Y| ≥ a. To solve this, we consider the sample space"
**Bottom Block:**
"(0, b) × (0, b), where each point ((X), (Y)) represents a possible outcome. The area of this sample space is (b^2). We are interested in the region where |X - Y| ≥ a. This region can be split into two regions: 1. Are..."
"...which is the set of all possible pairs ((X), (Y)). This can be represented as a square in the XY-plane with side length (b). The area of this square is (b^2). We are interested in the region where..."
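The area argument sketched in the diagram leads to the standard closed form P(|X − Y| ≥ a) = ((b − a)/b)²: the favourable region consists of two right triangles with legs (b − a), for a combined area of (b − a)² inside the b × b square. A small Python sketch (our own check, not taken from the figure) confirms this against a Monte Carlo estimate:

```python
import random

def p_exact(a: float, b: float) -> float:
    # Two right triangles with legs (b - a) give favourable area (b - a)^2
    # inside the b x b sample-space square.
    return ((b - a) / b) ** 2

def p_monte_carlo(a: float, b: float, n: int = 200_000, seed: int = 0) -> float:
    # Sample pairs (X, Y) uniformly on (0, b) and count |X - Y| >= a.
    rng = random.Random(seed)
    hits = sum(abs(rng.uniform(0, b) - rng.uniform(0, b)) >= a for _ in range(n))
    return hits / n

a, b = 1.0, 3.0
print(p_exact(a, b))        # (2/3)^2 = 0.444...
print(p_monte_carlo(a, b))  # close to the exact value
```

The agreement between the two estimates illustrates the geometric-probability translation the diagram describes: a question about distances becomes a question about areas.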
### Key Observations
The diagram presents a step-by-step approach to solving a probability problem using geometric and statistical concepts. The use of a uniform distribution and the representation of the problem as an area within a square are key elements of the solution. The problem is broken down into smaller, manageable steps, making it easier to understand.
### Interpretation
The diagram illustrates a common problem-solving technique in probability: translating a geometric problem into a statistical one. By representing the possible positions of the two points as coordinates in a square, the problem becomes one of finding the area of a specific region within that square. This area, when divided by the total area of the square, gives the desired probability. The diagram's structure emphasizes the logical flow of reasoning, starting with the problem statement and progressing through the steps needed to arrive at a solution. The use of mathematical notation and precise language indicates a technical audience familiar with probability and geometry. The diagram is a visual aid for understanding the solution process, rather than a presentation of numerical results.
</details>
Figure 6: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1673 (see the beginning of Appendix C for additional details).