# Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training
> ⋆ IBM Research, † IBM Quantum, ∘ MIT-IBM Watson Lab
## Abstract
We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
## 1. Introduction
Proximal Policy Optimization (PPO) (Schulman et al., 2015, 2017) is a widely used algorithm in reinforcement learning. Reinforcement Learning from Human Feedback (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022) and Reinforcement Learning from Verifiable Rewards (Lambert et al., 2024; Shao et al., 2024) are cornerstones of post-training for large language models, aligning their preferences with human values and enabling reasoning and coding capabilities through verifiable rewards.
Group Relative Policy Optimization (GRPO), introduced in Shao et al. (2024), alleviates the need for training a critic network as in PPO and instead uses Monte-Carlo samples, referred to as “a group”, to estimate the advantage function via a standardized reward, where the mean and standard deviation statistics are estimated over the group. GRPO was used to train the DeepSeek R1 reasoning models (Guo et al., 2025) and has been adopted by the open-source community as a method of choice for post-training of large language models, with open-source implementations in several libraries such as HuggingFace's TRL (von Werra et al., 2020b) and VERL (Luo et al., 2025).
Several recent works analyzed the loss implemented in GRPO such as Vojnovic and Yun (2025); Mroueh (2025). The study in Mroueh (2025) suggests that the iterative GRPO of Shao et al. (2024) with sample reuse (i.e. for $\mu>1$ in Shao et al. (2024)) leads to an off-policy estimation of the advantage and to a success rate amplification when using verifiable rewards. Indeed, it has been observed empirically that this off-policy advantage estimation leads to an improved performance (HuggingFace, 2025b).
Motivated by these observations and by the rich literature on off-policy PPO and RL (Queeney et al., 2021; Meng et al., 2023; Gan et al., 2024; Fakoor et al., 2020, to cite a few; see the related work in Section 4 for a fuller account), we explore in this paper the extension of GRPO to the off-policy regime, where the advantage is estimated using statistics coming from a policy different from the current one.
The main contributions of this paper are:
- We review in Section 2 the iterative GRPO algorithm proposed in Shao et al. (2024) and introduce in Section 3 the off-policy GRPO.
- We show in Section 3 that the on-policy and off-policy advantages provide lower bounds on the improvement of the expected reward (Theorem 1 and Corollary 1).
- We state conditions under which optimizing the advantage leads to improvement in the off-policy regime: provided the off-policy stays in the vicinity of the current policy and the variance of the reward under the off-policy is nonzero, maximizing the regularized off-policy advantage leads to policy improvement. The regularization ensures that the updated policy stays close to the off-policy.
- Armed with these results, we state the constrained policy optimization problem for off-policy GRPO in Section 3.2, derive a clipped surrogate similar to the ones in off-policy PPO (Gan et al., 2024), and obtain the on-policy GRPO clipped objective as a particular case.
- We validate experimentally that training LLMs with off-policy GRPO leads to either improved or on-par performance, while potentially reducing the communication burden of serving the model for inference in each iteration.
## 2. On-Policy GRPO
Let $\mathcal{X}$ be the space of inputs (prompts in the context of LLMs) and $\mathcal{Y}$ the space of responses. We denote by $\rho_{\mathcal{X}}$ the distribution on inputs. We refer to the policy we want to optimize as $\pi(\cdot|x)$ , which is a distribution on $\mathcal{Y}$ conditioned on $x\sim\rho_{\mathcal{X}}$ . For $k\geq 0$ , let $\pi_{k}$ be the policy at the current step $k$ .
The Group Relative Policy Optimization (GRPO) clipped objective introduced in Shao et al. (2024) is a variant of Proximal Policy Optimization (PPO) (Schulman et al., 2017, 2015), where the advantage is computed as a standardized reward function, with mean and standard deviation computed over a group of Monte-Carlo samples of size $G$ sampled from the current policy $\pi_{k}(\cdot|x)$ for each $x$ independently. For $\epsilon,\beta>0$ and given a reference policy $\pi_{\mathrm{ref}}$ , the clipped objective optimization in GRPO is defined as follows:
$$
\max_{\pi}\mathbb{E}_{y\sim\pi_{k}(\cdot|x)}\min\left(\frac{\pi(y|x)}{\pi_{k}(
y|x)}A_{\pi_{k}}(x,y),~{}\text{clip}\left(\frac{\pi(y|x)}{\pi_{k}(y|x)},1-
\epsilon,1+\epsilon\right)A_{\pi_{k}}(x,y)\right)-\beta\mathsf{KL}(\pi||\pi_{
\mathrm{ref}}),
$$
where $\mathsf{KL}$ is the Kullback-Leibler divergence, and $A_{\pi_{k}}$ is the GRPO advantage function:
$$
A_{\pi_{k}}(x,y)=\frac{r(x,y)-\mathbb{E}_{\pi_{k}}r(x,y)}{\sqrt{\mathbb{E}_{
\pi_{k}}(r(x,y)-\mathbb{E}_{\pi_{k}}r(x,y))^{2}+\varepsilon}}.
$$
The advantage can be estimated from samples in “a group” of size $G$ : for each $x$ , we sample $y_{1},\ldots,y_{G}\sim\pi_{k}(\cdot|x)$ and compute $r_{\ell}=r(x,y_{\ell}),$ $\ell=1,\ldots,G$ . We refer to the group of rewards conditioned on $x$ as $\{r_{\ell}\}$ , and the estimated GRPO advantage is therefore (Shao et al., 2024):
$$
\hat{A}_{\pi_{k}}(x,y_{i})=\frac{r_{i}-\texttt{mean}(\{r_{\ell}\})}{\sqrt{
\texttt{std}^{2}(\{r_{\ell}\})+\varepsilon}},
$$
where mean and std are empirical mean and standard deviation respectively. The statistics used to normalize the reward leading to the advantage function are estimated using the current policy $\pi_{k}$ , and hence we refer to $A_{\pi_{k}}$ as the on-policy advantage.
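As a concrete illustration, here is a minimal NumPy sketch of this group-based advantage estimator (the reward values and group size below are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-4):
    """Estimate on-policy GRPO advantages for one prompt x from a group of
    G sampled rewards r_1..r_G via the standardized reward (Shao et al., 2024)."""
    r = np.asarray(rewards, dtype=float)
    mean = r.mean()
    std = r.std()  # empirical standard deviation over the group
    return (r - mean) / np.sqrt(std ** 2 + eps)

# Example: binary (verifiable) rewards for a group of G = 8 responses.
adv = grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0])
```

Note that the advantages within a group always sum to zero, and a zero-variance group (all-correct or all-wrong responses) yields identically zero advantages, a point that will matter in the interpretation of the lower bound of Section 3.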
When compared with PPO, GRPO alleviates the need for training a critic network to compute the advantage and relies instead on standardized rewards, which can be estimated efficiently using inference frameworks such as vLLM (Kwon et al., 2023) in the context of large language models.
#### GRPO with Verifiable Rewards and Success Rate Amplification
The iterative GRPO (Shao et al., 2024) has two overlooked features:
- The algorithm optimizes the policy $\pi$ for $\mu$ iterations while fixing the samples from $\pi_{k}$ , which inherently leads to an off-policy estimation of the advantage.
- The algorithm performs training in stages, resetting $\pi_{\mathrm{ref}}$ to the latest GRPO-optimized policy at the end of each stage.
A recent analysis of GRPO with verifiable rewards, i.e. with binary rewards (Mroueh, 2025), suggests that this off-policy advantage estimation in Shao et al. (2024) leads to an implicit fixed-point iteration that guarantees that the success rate of the GRPO-optimized policy is higher than that of the reference policy. This also explains the multi-stage nature of iterative GRPO, which changes the reference along the training iterations.
Motivated by these observations, we propose to take a step back and analyze on-policy and off-policy GRPO. In practice, in our proposed off-policy GRPO, instead of fixing the samples from $\pi_{k}$ for $\mu$ iterations as suggested in Shao et al. (2024), we use the policy $\pi_{k-\mu}$ to estimate the advantage for $\mu$ iterations with fresh samples in each iteration; we refer to this as the off-policy advantage.
## 3. Off-Policy and On-Policy GRPO Reward Improvement
We introduce in this section off-policy GRPO and analyze conditions under which policy reward improvement is possible in both the on-policy and off-policy regimes. Toward that goal, we start with some preliminary definitions.
Define the expected reward of a policy given $x\sim\rho_{\mathcal{X}}$ :
$$
J(\pi(\cdot|x))=\mathbb{E}_{y\sim\pi(\cdot|x)}r(x,y) \tag{1}
$$
For $k\geq 0$ , let $\pi_{k}$ be the policy at the current step $k$ and $\alpha(\cdot|x)$ be a policy used for off-policy sampling, where typically we consider $\alpha(\cdot|x)=\pi_{k-v}(\cdot|x)$ , for $0\leq v<k$ . Note that in Section 2 we referred to this as $\pi_{k-\mu}$ , keeping close to the notation of the original GRPO paper; we use $v$ instead of $\mu$ in the rest of the paper.
Define the mean and standard deviation of the off-policy reward, i.e. under policy $\alpha$ :
$$
{\mu_{\alpha,r}(x)=\mathbb{E}_{y\sim\alpha(\cdot|x)}r(x,y)}
$$
and
$$
\sigma_{\alpha,r}(x)=\sqrt{\mathbb{E}_{y\sim\alpha(\cdot|x)}(r(x,y)-\mu_{
\alpha,r}(x))^{2}},
$$
and denote for ${0<\varepsilon<1}$ :
$$
\sigma_{\alpha,r,\varepsilon}(x)=\sqrt{\sigma^{2}_{\alpha,r}(x)+\varepsilon}.
$$
The GRPO advantage function computed using the off-policy distribution $\alpha$ is defined as the whitened reward, as follows:
$$
A_{\alpha}(x,y)=\frac{r(x,y)-\mu_{\alpha,r}(x)}{\sigma_{\alpha,r,\varepsilon}(
x)}. \tag{2}
$$
Our goal is to maximize the expected advantage function using importance sampling under the policy $\alpha$ :
$$
\mathcal{L}_{\alpha}(\pi(\cdot|x))=\mathbb{E}_{y\sim\alpha(\cdot|x)}\frac{\pi(
y|x)}{\alpha(y|x)}A_{\alpha}(x,y) \tag{3}
$$
If $\alpha=\pi_{k}$ , we recover the on-policy objective function of GRPO, where the advantage is computed with the current policy $\pi_{k}$ , i.e. using $A_{\pi_{k}}(x,y)$ .
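To make the importance-sampling objective in eq. (3) concrete, here is a small Monte-Carlo check on a toy three-response example (all distributions and reward values below are illustrative). Sampling under $\alpha$ and reweighting by $\pi/\alpha$ recovers the exact value $\sum_{y}\pi(y|x)A_{\alpha}(x,y)$ :

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 possible responses for a fixed prompt x (illustrative numbers).
alpha = np.array([0.5, 0.3, 0.2])    # off-policy sampling distribution
pi = np.array([0.2, 0.3, 0.5])       # policy being optimized
reward = np.array([0.0, 0.5, 1.0])   # r(x, y) for each response

# Off-policy GRPO advantage A_alpha: reward whitened with statistics under alpha.
eps = 1e-4
mu = (alpha * reward).sum()
sigma_eps = np.sqrt((alpha * (reward - mu) ** 2).sum() + eps)
A_alpha = (reward - mu) / sigma_eps

# Monte-Carlo estimate of L_alpha(pi) = E_{y~alpha}[ pi(y)/alpha(y) * A_alpha(y) ].
ys = rng.choice(3, size=200_000, p=alpha)
estimate = np.mean(pi[ys] / alpha[ys] * A_alpha[ys])

# Exact value for comparison: sum_y pi(y) * A_alpha(y).
exact = (pi * A_alpha).sum()
```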
### 3.1. Policy Improvement in GRPO
Our goal is to optimize the expected reward $J(\pi)$ under $\pi$ given in eq. (1); instead, we use the expected advantage $\mathcal{L}_{\alpha}(\pi)$ given in eq. (3), where the advantage is computed using $\alpha$ . In what follows we therefore provide a lower bound on $J(\pi(\cdot|x))-J(\pi_{k}(\cdot|x))$ that involves $\mathcal{L}_{\alpha}(\pi)$ , which guarantees that maximizing the expected advantage function leads to improvement in expected reward over the current policy $\pi_{k}$ .
Our lower bounds are given in Theorem 1 and Corollary 1 and they involve the total variation distance $\mathbb{TV}$ defined as follows:
$$
\mathbb{TV}(m_{1},m_{2})=\frac{1}{2}\int|m_{1}-m_{2}|.
$$
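For discrete distributions, the total variation distance is half the $\ell_{1}$ distance between the probability vectors; a one-line sketch:

```python
import numpy as np

def tv_distance(m1, m2):
    """Total variation distance between two discrete distributions:
    TV(m1, m2) = (1/2) * sum_y |m1(y) - m2(y)|."""
    return 0.5 * np.abs(np.asarray(m1, float) - np.asarray(m2, float)).sum()

d = tv_distance([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
```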
**Theorem 1 (Policy Improvement Lower Bound in Off-Policy GRPO)**
*Assume that the reward is positive and bounded in $0\leq r\leq 1$ . Let $\alpha$ be the off-policy distribution and $\pi_{k}$ the current policy. Then for any policy $\pi$ we have for all $x$ ( $\rho_{\mathcal{X}}$ a.s.):*
$$
J(\pi(\cdot|x))-J(\pi_{k}(\cdot|x))\geq\mathcal{L}_{\alpha}(\pi(\cdot|x))-2\,\frac{1-\sigma_{\alpha,r,\varepsilon}(x)}{\sigma_{\alpha,r,\varepsilon}(x)}\,\mathbb{TV}(\pi(\cdot|x),\alpha(\cdot|x))-2\,\mathbb{TV}(\pi_{k}(\cdot|x),\alpha(\cdot|x)).
$$
If the reward is not bounded by $1$ , we can scale it by $\left\lVert{r}\right\rVert_{\infty}$ so that it lies in $[0,1]$ , without impacting the overall optimization problem. This condition on the reward ensures that $\sigma_{\alpha,r}(x)\leq 1$ , which is needed in the GRPO case to obtain the policy improvement lower bound. Indeed, for a bounded random variable in $[a,b]$ the variance is bounded by $\frac{(b-a)^{2}}{4}$ , and hence $\sigma_{\alpha,r}(x)\leq\frac{1}{2}$ ; since $\varepsilon$ is small, this guarantees that the term $\frac{1-\sigma_{\alpha,r,\varepsilon}(x)}{\sigma_{\alpha,r,\varepsilon}(x)}\geq 0$ .
For on-policy GRPO i.e. setting $\alpha=\pi_{k}$ in Theorem 1 we have the following corollary:
**Corollary 1 (Policy Improvement Lower Bound in On-Policy GRPO)**
*Assume that the reward is positive and bounded, $0\leq r\leq 1$ . Let $\pi_{k}$ be the current policy; then for any policy $\pi$ we have for all $x$ ( $\rho_{\mathcal{X}}$ a.s.):*
$$
J(\pi(\cdot|x))-J(\pi_{k}(\cdot|x))\geq\mathcal{L}_{\pi_{k}}(\pi(\cdot|x))-2\,\frac{1-\sigma_{\pi_{k},r,\varepsilon}(x)}{\sigma_{\pi_{k},r,\varepsilon}(x)}\,\mathbb{TV}(\pi(\cdot|x),\pi_{k}(\cdot|x)).
$$
Define:
$$
M_{\alpha,r,\varepsilon}=\sqrt{\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\frac{(1-
\sigma_{\alpha,r,\varepsilon}(x))^{2}}{\sigma^{2}_{\alpha,r,\varepsilon}(x)}}
$$
Integrating Theorem 1 over $x$ (prompts) and applying the Cauchy-Schwarz inequality, we obtain:
$$
\mathbb{E}_{x\sim\rho_{\mathcal{X}}}J(\pi(\cdot|x))-\mathbb{E}_{x\sim\rho_{\mathcal{X}}}J(\pi_{k}(\cdot|x))\geq\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathcal{L}_{\alpha}(\pi(\cdot|x))-2M_{\alpha,r,\varepsilon}\left(\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathbb{TV}^{2}(\pi(\cdot|x),\alpha(\cdot|x))\right)^{\frac{1}{2}}-2\,\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathbb{TV}(\pi_{k}(\cdot|x),\alpha(\cdot|x)). \tag{4}
$$
#### Interpreting the lower bound
When compared with lower bounds for policy improvement in PPO (Theorem 1 in TRPO (Schulman et al., 2015)) and for off-policy PPO (Lemma 3.1 in transductive PPO (Gan et al., 2024) and Theorem 1 in Generalized PPO (Queeney et al., 2021)), we observe similar lower bounds with a crucial difference: the constants weighting the total variation terms are absolute constants for PPO, whereas they are policy- and data-dependent for GRPO. In particular, the dependency of the lower bound on:
$$
\frac{1-\sigma_{\alpha,r,\varepsilon}(x)}{\sigma_{\alpha,r,\varepsilon}(x)}
$$
is of particular interest. We can examine this quantity for verifiable rewards: for each $x$ , the verifiable reward is a Bernoulli random variable with parameter $p$ , the probability of success of the policy given $x$ (Mroueh, 2025). Hence we have:
$$
\frac{1-\sigma_{\alpha,r,\varepsilon}(x)}{\sigma_{\alpha,r,\varepsilon}(x)}=
\frac{1-\sqrt{p(1-p)+\varepsilon}}{\sqrt{p(1-p)+\varepsilon}}
$$
Plotting this quantity as a function of $p$ (Figure 1), we observe that it diverges for fully correct and fully incorrect answers, which can hurt the lower bound since the negative terms then dominate. DAPO (Yu et al., 2025) suggests filtering out prompts with fully correct or fully incorrect answers; this controls the corresponding term in the lower bound and keeps the quantity bounded away from infinity.
Figure 1. $\frac{1-\sigma_{\alpha,r,\varepsilon}(x)}{\sigma_{\alpha,r,\varepsilon}(x)}$ explodes when the variance is zero, i.e. for fully correct or fully wrong policies; this term then dominates the lower bound.
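The quantity plotted in Figure 1 is easy to reproduce; a small sketch (the $\varepsilon$ value is illustrative):

```python
import math

def grpo_std_ratio(p, eps=1e-4):
    """(1 - sigma_eps) / sigma_eps for a Bernoulli(p) verifiable reward,
    with sigma_eps = sqrt(p * (1 - p) + eps)."""
    sigma_eps = math.sqrt(p * (1.0 - p) + eps)
    return (1.0 - sigma_eps) / sigma_eps

# The ratio is minimized (close to 1) at p = 0.5, is symmetric about p = 0.5,
# and grows rapidly as p approaches 0 or 1 (nearly deterministic policies).
```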
### 3.2. GRPO: From Constrained Optimization to Clipped Surrogate Objectives
#### From Penalized to $\mathsf{KL}$ Constrained Optimization
To maximize the lower bound in eq. (4), we see that the off-policy $\alpha$ needs to be in the vicinity of the current policy $\pi_{k}$ , i.e. $\mathbb{TV}(\alpha,\pi_{k})\leq\delta$ , and that $M_{\alpha,r,0}<\infty$ (variance terms not exploding). Under these assumptions, we can solve the following penalized problem:
$$
\max_{\pi}\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathcal{L}_{\alpha}(\pi(\cdot|x
))-2~{}M_{\alpha,r,\varepsilon}\sqrt{\mathbb{E}_{x\sim\rho_{\mathcal{X}}}
\mathbb{TV}^{2}(\pi(\cdot|x),\alpha(\cdot|x))}.
$$
By virtue of Theorem 1, maximizing this objective leads to policy reward improvement.
We can write this as a constrained optimization: there exists $\Delta>0$ such that the following constrained problem is equivalent:
$$
\max_{\pi}\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathcal{L}_{\alpha}(\pi(\cdot|x
))\text{ subject to }\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathbb{TV}^{2}(\pi(
\cdot|x),\alpha(\cdot|x))\leq\Delta^{2}.
$$
By Pinsker's inequality, for two measures $m_{1},m_{2}$ we have $\mathbb{TV}(m_{1},m_{2})\leq\sqrt{\frac{1}{2}\mathsf{KL}(m_{1},m_{2})}$ , and hence we can instead bound the $\mathsf{KL}$ divergence as follows:
$$
\boxed{\max_{\pi}\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathcal{L}_{\alpha}(\pi(
\cdot|x))\text{ subject to }\frac{1}{2}~{}\mathbb{E}_{x\sim\rho_{\mathcal{X}}}
\mathsf{KL}(\pi(\cdot|x),\alpha(\cdot|x))\leq\Delta^{2}.} \tag{5}
$$
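As a numerical sanity check of the Pinsker relaxation used above, for Bernoulli distributions (where $\mathbb{TV}(\mathrm{Bern}(p),\mathrm{Bern}(q))=|p-q|$ ) we can verify $\mathbb{TV}\leq\sqrt{\mathsf{KL}/2}$ on a few pairs:

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def tv_bernoulli(p, q):
    """TV(Bern(p), Bern(q)) = |p - q|."""
    return abs(p - q)

# Pinsker: TV(m1, m2) <= sqrt(KL(m1, m2) / 2); hence a KL budget of 2 * Delta^2
# in eq. (5) enforces the TV constraint TV <= Delta.
pairs = [(0.7, 0.5), (0.9, 0.5), (0.6, 0.55)]
bounds = [(tv_bernoulli(p, q), math.sqrt(kl_bernoulli(p, q) / 2)) for p, q in pairs]
```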
#### From Constrained Optimization to Clipped Surrogate Objectives
The objective in (5) is the same as in the original constrained PPO formulation (Schulman et al., 2015) with two key differences: the advantage is the whitened reward of GRPO where the statistics are computed using the off-policy $\alpha$ , and the advantage objective is computed using importance sampling from the off-policy $\alpha$ , instead of $\pi_{k}$ in both cases. This is indeed related to objectives in off-policy PPO (Queeney et al., 2021; Gan et al., 2024). A practical implementation of these objectives is through clipped surrogates (Schulman et al., 2015).
For $\epsilon\in[0,1]$ , following Gan et al. (2024); Queeney et al. (2021), let us define:
$$
f_{\epsilon}(r,r^{\prime},a)=\min(ra,~{}\text{clip}(r,\max(r^{\prime}-\epsilon
,0),r^{\prime}+\epsilon)~{}a).
$$
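A direct implementation of $f_{\epsilon}$ takes a few lines; with $r^{\prime}=1$ it reduces to the standard PPO/GRPO clipped term (the default $\epsilon$ below is illustrative):

```python
def f_eps(r, r_prime, a, eps=0.2):
    """Off-policy clipped surrogate of Gan et al. (2024); Queeney et al. (2021):
    f_eps(r, r', a) = min(r * a, clip(r, max(r' - eps, 0), r' + eps) * a)."""
    lo, hi = max(r_prime - eps, 0.0), r_prime + eps
    clipped = min(max(r, lo), hi)  # clip(r, lo, hi)
    return min(r * a, clipped * a)
```

For instance, with $r^{\prime}=1$ and a positive advantage, a ratio above $1+\epsilon$ is clipped; with a negative advantage, a ratio below $1-\epsilon$ yields the (more pessimistic) clipped value.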
The clipped off-policy GRPO objective, for $\alpha$ such that $\mathbb{TV}(\alpha,\pi_{k})\leq\delta$ and $M_{\alpha,r,0}<\infty$ , is therefore defined as follows:
$$
\mathcal{L}^{c}_{\alpha}(\pi(\cdot|x))=\mathbb{E}_{y\sim\alpha(\cdot|x)}f_{
\epsilon}\left(\frac{\pi(y|x)}{\alpha(y|x)},\frac{\pi_{k}(y|x)}{\alpha(y|x)},A
_{\alpha}(x,y)\right) \tag{6}
$$
Let us unpack this; we have:
$$
f_{\epsilon}\left(\frac{\pi(y|x)}{\alpha(y|x)},\frac{\pi_{k}(y|x)}{\alpha(y|x)},A_{\alpha}(x,y)\right)=\begin{cases}A_{\alpha}(x,y)\min\left(\frac{\pi(y|x)}{\alpha(y|x)},\frac{\pi_{k}(y|x)}{\alpha(y|x)}+\epsilon\right),&r(x,y)\geq\mu_{\alpha,r}(x)\\
A_{\alpha}(x,y)\max\left(\frac{\pi(y|x)}{\alpha(y|x)},\max\left(\frac{\pi_{k}(y|x)}{\alpha(y|x)}-\epsilon,0\right)\right),&r(x,y)<\mu_{\alpha,r}(x).\end{cases}
$$
The clipping ensures that the ratio $\frac{\pi}{\alpha}$ remains bounded and is a relaxation of the $\mathsf{KL}$ (or total variation) constraint. Since $\alpha$ needs to remain close to $\pi_{k}$ in order to ensure improvement, the clipped objective incentivizes the difference $\frac{\pi}{\alpha}-\frac{\pi_{k}}{\alpha}$ to not exceed $\epsilon$ (Gan et al., 2024).
In practice, the off-policy is $\alpha=\pi_{k-v}$ for a small $v\in[0,k)$ . Given a small learning rate and a small $v$ , the assumption that $\pi_{k-v}$ does not deviate from $\pi_{k}$ is reasonable, and for small $v$ we can approximate $\frac{\pi_{k}}{\pi_{k-v}}$ by $1$ . We use this approximation in practice as we found it more stable; it is also used in off-policy PPO with sample reuse, as discussed in Section 4.1 of Gan et al. (2024).
#### Back to On-Policy GRPO Clipped Objective
For $\alpha=\pi_{k}$ , we obtain the clipped objective for on-policy GRPO (Shao et al., 2024):
$$
\mathcal{L}^{c}_{\pi_{k}}(\pi(\cdot|x))=\mathbb{E}_{y\sim\pi_{k}(\cdot|x)}f_{\epsilon}\left(\frac{\pi(y|x)}{\pi_{k}(y|x)},1,A_{\pi_{k}}(x,y)\right)
$$
| Method name | SGD iterations per fixed batch ($i$) | Policy update interval on server ($v$) |
| --- | --- | --- |
| On-Policy GRPO (Shao et al., 2024) | $i=1$ | $v=1$ |
| Off-policy GRPO (Shao et al., 2024) | $i>1$ | $v=1$ |
| Off-policy GRPO (this work) | $i=1$ | $v>1$ |
Table 1. Training configurations in alg. 1: (v1-i1) is on-policy GRPO and (v1-i10) is an example of off-policy GRPO in (Shao et al., 2024). Our off-policy GRPO corresponds e.g. to (v10-i1).
Algorithm 1 Iterative GRPO with verifiable rewards, modified from Shao et al. (2024)
1: Input initial policy model $\pi_{\theta_{\text{init}}}$ ; verifiable reward $r$ ; task prompts $\mathcal{D}$ ;
2: Hyperparameters $\epsilon$ , $\beta$ , $S$
3: $(i,v)$ = (number of SGD iterations per fixed batch, model update interval on the vLLM server)
4: Policy model $\pi_{\theta}\leftarrow\pi_{\theta_{\text{init}}}$ $\pi_{\mathrm{ref}}\leftarrow\pi_{\theta_{\text{init}}}$
5: for $s=1,\dots,S$ do
6: for $k=1,\dots,M$ do
7: Sample a batch $\mathcal{D}_{b}$ from $\rho_{\mathcal{X}}$
8: if $k\bmod v=0$ then
9: Update the old policy model on the vLLM server $\pi_{\theta_{\textrm{old}}}\leftarrow\pi_{\theta}$
10: Sample $G$ outputs $\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\textrm{old}}}(\cdot\mid x)$ for each question $x\in\mathcal{D}_{b}$
11: Compute rewards $\{r_{i}\}_{i=1}^{G}$ for each sampled output $y_{i}$ by running verifiable reward $r$
12: $\alpha\leftarrow\pi_{\theta_{\textrm{old}}}$
13: Compute $A_{\alpha}(x,y_{i})$ using Equation (2)
14: for GRPO iteration = 1, …, $i$ do $\triangleright$ $i$ is referred to as $\mu$ in Original GRPO
15: Update the policy model $\pi_{\theta}$ by maximizing the GRPO objective (7) with gradient ascent
16: $\pi_{\mathrm{ref}}\leftarrow\pi_{\theta}$ $\triangleright$ Swap reference with the latest model
17: Output $\pi_{\theta}$
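To illustrate how the $(v,i)$ toggles of Algorithm 1 and Table 1 interact, here is a self-contained toy sketch on a single-prompt bandit with a softmax policy and a verifiable reward. It is a simplified illustration only (REINFORCE-style gradient, no clipping, no KL term), not the paper's LLM training setup; all names and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation of Algorithm 1 for a single prompt: a softmax policy over
# K candidate responses, with a verifiable reward of 1 for response 0, else 0.
K, G, LR, EPS_ADV = 4, 16, 0.5, 1e-4

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grpo_toy(v=10, i=1, steps=100):
    """v: refresh the served (old) policy every v steps; i: SGD updates per batch.
    (v=1, i=1) is on-policy GRPO; (v>1, i=1) is this work's off-policy variant;
    (v=1, i>1) is the sample-reuse variant of Shao et al. (2024)."""
    theta = np.zeros(K)
    theta_old = theta.copy()
    for k in range(steps):
        if k % v == 0:
            theta_old = theta.copy()           # line 9: update served policy
        alpha = softmax(theta_old)
        ys = rng.choice(K, size=G, p=alpha)    # line 10: sample the group
        r = (ys == 0).astype(float)            # line 11: verifiable reward
        adv = (r - r.mean()) / np.sqrt(r.var() + EPS_ADV)  # line 13: A_alpha
        for _ in range(i):                     # line 14: i SGD iterations
            pi = softmax(theta)
            grad = np.zeros(K)
            for y, a in zip(ys, adv):
                # gradient of (1/G) * sum_j pi(y_j)/alpha(y_j) * A_j w.r.t. theta,
                # using d pi(y)/d theta = pi(y) * (e_y - pi)
                coef = a * pi[y] / alpha[y] / G
                grad[y] += coef
                grad -= coef * pi
            theta = theta + LR * grad
    return softmax(theta)[0]  # final probability of the correct response
```

On this toy problem both the on-policy configuration (v=1, i=1) and the off-policy one (v=10, i=1) drive the probability of the correct response well above its initial value of 1/K, while the off-policy run refreshes the served policy ten times less often.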
#### $\mathsf{KL}$-Regularized RL & On-Policy / Off-Policy Algorithms
Finally putting together our clipped surrogate objective with the $\mathsf{KL}$ regularizer we obtain our final objective:
$$
\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathcal{L}^{c}_{\alpha}(\pi(\cdot|x))-
\beta\mathsf{KL}(\pi||\pi_{\mathrm{ref}}). \tag{7}
$$
We present the GRPO algorithm in Algorithm 1 and the configurations that allow toggling between on-policy and off-policy GRPO in Table 1. Within the RL loop, the model is served for inference using vLLM (Kwon et al., 2023). The parameter $v$ controls how often the model is updated on the vLLM server (which corresponds to off-policy with $\alpha=\pi_{k-v+1}$ ). The parameter $i$ controls how many SGD iterations are applied to each batch sampled from the policy. For $v=1$ and $i=1$ , the model is continuously served, and each batch of samples is used once in SGD. This corresponds to on-policy GRPO. For $i>1$ and $v=1$ , the model is still continuously served, but each batch is used $i$ times in the SGD loop; this corresponds to an “off-policy” GRPO variant, as proposed in Shao et al. (2024). For large models that require tensor parallelism and multi-GPU serving, continuous model serving incurs additional communication costs. Our off-policy GRPO mitigates these costs by serving the model every $v>1$ iterations (line 8 in Algorithm 1) and fixing $i=1$ . Our theory guarantees reward improvement as long as $v$ is not too large.
#### Computational and Communication Costs
Updating the model served by vLLM during GRPO training incurs varying costs depending on model size, update frequency ( $v$ ), and parallelism settings. When the training model and the vLLM instance reside on different GPUs, or when vLLM uses tensor parallelism (TP), model updates may trigger deep copies and inter-GPU communication. These involve either full weight transfers or partitioned broadcasts, which scale linearly with model size. Frequent updates (e.g., $v=1$ ) can dominate the runtime, especially for large models (see the recent benchmark vLLM (2025) for latencies when serving large models with tensor parallelism using vLLM). To mitigate this, we update the vLLM model every $v>1$ iterations, which amortizes the copy cost while maintaining the reward improvement guarantees from our theory. In our experiments (Section 5), we are limited to single-node setups with relatively small models, and therefore cannot fully demonstrate the potential speedups, particularly those that would become more pronounced at larger scales. In our setups the speedups are modest, given that there is no inter-GPU or inter-node communication for serving the models. See Section A for further discussion.
#### On-Policy Clipped Objective with Zero-Variance Masking à la DAPO (Yu et al., 2025)
As discussed earlier in the interpretation of the lower bound, samples with zero variance may lead the total variation terms to dominate the lower bound. Hence, similarly to DAPO (Yu et al., 2025), we propose to mask these samples. In the on-policy case this yields the following masked objective:
$$
\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathbbm{1}_{\sigma_{\pi_{k},r}(x)\neq 0}
\left(\mathcal{L}^{c}_{\pi_{k}}(\pi(\cdot|x))-\beta\mathsf{KL}(\pi||\pi_{
\mathrm{ref}})\right). \tag{8}
$$
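A minimal sketch of this zero-variance filter on groups of verifiable rewards (the example batches below are illustrative):

```python
import numpy as np

def mask_zero_variance(group_rewards):
    """DAPO-style filter (Yu et al., 2025): keep only prompts whose group of
    verifiable rewards is not all-correct or all-wrong (nonzero variance),
    so the (1 - sigma) / sigma term in the lower bound stays bounded."""
    return [g for g in group_rewards if np.asarray(g, float).std() > 0]

batches = [[1, 1, 1, 1],   # fully correct -> masked
           [0, 0, 0, 0],   # fully wrong   -> masked
           [1, 0, 1, 0]]   # mixed         -> kept
kept = mask_zero_variance(batches)
```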
## 4. Related Work
#### Proximal Policy Optimization (PPO) and Extensions
Proximal Policy Optimization (PPO) is a widely used on-policy reinforcement learning algorithm that improves training stability through clipped surrogate objectives. While PPO is effective in diverse settings, it is inherently limited by its on-policy nature, which constrains sample efficiency. To address these limitations, several off-policy adaptations and extensions of PPO have been proposed. Generalized Proximal Policy Optimization (G-PPO) (Queeney et al., 2021) enables sample reuse while maintaining convergence guarantees. Transductive off-Policy PPO (ToPPO) (Gan et al., 2024) builds on G-PPO by incorporating transductive learning principles, bridging the gap between off-policy learning and theoretical guarantees of on-policy methods. Off-Policy PPO (OPPO) (Meng et al., 2023) proposes novel corrections to integrate replay buffer samples in PPO-style updates.
#### On-Policy and Off-Policy Actor-Critic Methods
Actor-critic methods blend the strengths of policy gradients and value function estimation. Off-policy variants aim to improve sample efficiency by learning from a replay buffer. The Off-Policy Actor-Critic algorithm (Degris et al., 2012) introduces importance weighting to enable stable updates from off-policy data. ACER (Wang et al., 2016) extends this with trust-region optimization and truncated importance sampling, thereby enhancing learning efficiency in discrete action spaces. Mixing on-policy and off-policy methods aims to combine the stability of on-policy updates with the efficiency of off-policy learning. P3O (Fakoor et al., 2020) provides a principled approach that interleaves policy updates from both on- and off-policy data.
#### Off-Policy RLHF and other variants of GRPO
Noukhovitch et al. (2025) introduced, within the iterative DPO framework, an asynchronous RLHF scheme that uses off-policy data and ensures faster convergence to the optimal policy. New variants of GRPO have been proposed recently, such as DAPO (Yu et al., 2025) and DR-GRPO (Liu et al., 2025). DAPO proposes zero-variance masking without theoretical backing; our work roots it in the improvement lower bound. DR-GRPO proposes to only center the reward, without variance normalization.
## 5. Experiments
### 5.1. Ablation Studies on GSM8K
#### Setup, Model, and Data
In our first set of experiments, we use the GSM8K dataset from Cobbe et al. (2021) (MIT license) and Qwen/Qwen2.5-0.5B-Instruct (Apache 2.0 license) by Yang et al. (2024). We integrate our changes of Algorithm 1 into the GRPO implementation in TRL (von Werra et al., 2020b), and train our models on the training split of GSM8K on a node with 8 GPUs (GPU 0 for the vLLM server and the 7 other GPUs for distributed training). See Appendix B for the hardware specification. We use a learning rate of $5\times 10^{-6}$ for all experiments and a KL regularizer $\beta=0.1$ in Equation (7). We use the correctness of the LLM output as the reward. For GRPO training, the hyperparameters are the following: group size $G=16$ and per-device batch size $16$ (meaning each GPU processes a single prompt $x$ with $16$ responses). To increase the overall batch size we use gradient accumulation of $4$ , yielding an effective batch size of $28$ prompts. The context length used for this experiment is $200$ , and the sampling temperature is set to $\tau=0.1$ .
#### Ablations and Results
We train our models with GRPO using Algorithm 1 with a verifiable reward for answer correctness. We use for GRPO the different configurations given in Table 1 and report Pass@1 on the test split of GSM8K using 50 samples (i.e., the frequency of success over 50 generations for each question) with the same sampling configuration as in training. We report results in Figure 2: Fig. 2(a) for on-policy GRPO ($i=1,v=1$) with the objective given in Equation (7) and $S=3$ (i.e., 4 epochs, with $\pi_{\mathrm{ref}}$ swapped to the latest model at the end of each epoch); Fig. 2(b) for on-policy GRPO ($i=1,v=1$) with masking of zero-variance samples, i.e., using the objective given in Equation (8) with $S=3$; Fig. 2(c) for our off-policy GRPO ($v=10,i=1$) with $S=3$; and Fig. 2(d) for Shao et al. (2024)'s off-policy GRPO, i.e., ($v=1,i=10$), for a single epoch. We see in Fig. 2(a) that while on-policy GRPO converges to a maximum Pass@1 of $45\%$, it is unstable. Masking zero-variance samples in Fig. 2(b) stabilizes on-policy GRPO and improves performance to $50\%$; this is in line with our theoretical grounding through the improvement lower bound. Our off-policy GRPO in Fig. 2(c) also stabilizes training and reaches an improved Pass@1 of $50\%$ on the test set. In all three cases, we see that by resetting $\pi_{\mathrm{ref}}$ to the latest model, GRPO amplifies the success rate above that of the current $\pi_{\mathrm{ref}}$, which concurs with the theoretical findings in Mroueh (2025). Finally, the off-policy variant of Shao et al. (2024) in Fig. 2(d) shows slower convergence over an epoch.
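The $(v, i)$ configurations can be read as a sampling schedule: $v$ controls how often fresh weights are pushed to the vLLM server, and $i$ controls how many gradient iterations reuse each sampled batch. A minimal sketch of this schedule (the helper `sampler_refresh_schedule` is hypothetical, not part of Algorithm 1 or TRL):

```python
def sampler_refresh_schedule(total_iters, v=10, i=1):
    """For each optimizer iteration t, report whether fresh weights are pushed
    to the sampler (every v iterations, so samples are off-policy when v > 1)
    and whether a fresh batch is drawn (every i iterations, so each batch is
    reused i times, as in Shao et al. (2024)'s off-policy variant when i > 1)."""
    events = []
    for t in range(total_iters):
        refresh_sampler = (t % v == 0)
        fresh_batch = (t % i == 0)
        events.append((t, refresh_sampler, fresh_batch))
    return events
```

Under this reading, $(v=10, i=1)$ draws fresh samples every step from a sampler that lags by up to 10 iterations, while $(v=1, i=10)$ keeps the sampler current but reuses each batch 10 times.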
*(Figure, figs/v1-i1-swap.png: Pass@1 vs. iteration (0–1000) for on-policy GRPO with $\pi_{\mathrm{ref}}$ swaps. The curve trends upward from ~0.21 to ~0.44 but is volatile, collapsing to ~0.10 near iteration 500 at the second swap before recovering sharply.)*
(a) On-Policy GRPO with $\pi_{\mathrm{ref}}$ swap at the end of each epoch. ($v=1$, $i=1$, $S=3$)
*(Figure, figs/v1-i1-swap-gm.png: Pass@1 vs. iteration (0–1000) for on-policy GRPO with zero-variance masking. Steady rise from ~0.20 to ~0.50 with a brief dip near iteration 175; $\pi_{\mathrm{ref}}$ swaps marked near iterations 250, 500, 750, and 900.)*
(b) On-Policy GRPO with masking of samples with variance $\sigma_{\pi_{k},r}=0$, and with $\pi_{\mathrm{ref}}$ swap at the end of each epoch. ($v=1$, $i=1$, $S=3$)
*(Figure, figs/v10-i1-swap.png: Pass@1 vs. iteration (0–1000) for off-policy GRPO with $v=10$ and $\pi_{\mathrm{ref}}$ swaps. Volatile but upward trend from ~0.20 to ~0.49, with a dip near iteration 580 at the second swap followed by recovery.)*
(c) Off-Policy GRPO using $v=10$ (this amounts to fixing the model on the vLLM server for $10$ iterations and drawing fresh samples for new batches), and with $\pi_{\mathrm{ref}}$ swap. ($v=10$, $i=1$, $S=3$)
*(Figure, figs/v1-i10-1ep.png: Pass@1 vs. iteration (0–2500) for off-policy GRPO with each batch fixed for $10$ iterations. Smooth rise from ~0.205 to a plateau of ~0.375 after roughly 2000 iterations.)*
(d) Off-Policy GRPO using fixed samples from $\pi_{k}$ for $10$ iterations. This makes $1$ epoch $10\times$ slower. ($v=1$, $i=10$, $S=1$)
Figure 2. We train different variants of GRPO on the training split of GSM8K and report Pass@1 on the GSM8K test set using $50$ samples per question, for various on-policy and off-policy GRPO variants. As predicted by our theory, masking samples with zero variance stabilizes on-policy training and leads to better performance. For off-policy training, using $v=10,i=1$ likewise stabilizes training and leads to better performance.
### 5.2. Finetuning Qwen Distill R1 model (1.5 B) on Deepscaler Data
In this section we use GRPO to finetune DeepSeek-R1-Distill-Qwen-1.5B (Guo et al., 2025) on the DeepScaleR-Preview-Dataset from Luo et al. (2025), consisting of roughly $40K$ math questions with known answers. We use math-verify as the verifiable reward. We use a learning rate of $1\times 10^{-6}$ in the same distributed setting as before (GPU 0 for vLLM and 7 GPUs for distributed training). We use a context length of $4096$, a group size $G=16$, a per-device batch size of $16$, and a KL regularizer $\beta=0.001$. The sampling temperature is $0.7$. We compare on-policy GRPO ($v=1,i=1$) to our off-policy GRPO ($v=10,i=1$) and report the performance of the trained model over a single epoch (around 24 hours on a single node). We report in Tables 3 and 2 Aime24 and Math500 performance using Hugging Face lighteval (Habib et al., 2023). Aime24 is evaluated with Pass@1 using 32 samples, and Math500 with extractive matching as recommended in lighteval, with an evaluation context length of $32K$ (all other sampling hyperparameters are set to the defaults in OpenR1 for this model). Plots of evaluation as a function of iterations are given in Appendix D. We see that both on-policy and off-policy GRPO improve the performance of DeepSeek-R1-Distill-Qwen-1.5B, raising its Aime24 from $29\%$ to at most $32\%$ (over iterations) and its Math500 from $83\%$ to $87\%$. This confirms our theoretical result that going off-policy does not lose overall performance.
| Model/Aime24 | Min | Max | Median | Mean |
| --- | --- | --- | --- | --- |
| v1-i1-length-4096 | 0.2802 | 0.3229 | 0.3021 | 0.3022 |
| v10-i1-length-4096 | 0.2781 | 0.3250 | 0.3047 | 0.3049 |
Table 2. Aime24 using lighteval with on-policy (v1-i1) and off-policy (v10-i1) GRPO.
| Model/Math500 | Min | Max | Median | Mean |
| --- | --- | --- | --- | --- |
| v1-i1-length-4096 | 0.830 | 0.870 | 0.854 | 0.8519 |
| v10-i1-length-4096 | 0.822 | 0.872 | 0.846 | 0.8474 |
Table 3. Math500 extractive matching using lighteval (Habib et al., 2023) with on-policy (v1-i1) and off-policy (v10-i1) GRPO.
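Pass@1 with $N$ samples, as reported in these experiments, is the empirical success frequency averaged over questions. A minimal sketch (the helper name `pass_at_1` is ours, not lighteval's API):

```python
def pass_at_1(correct_flags):
    """Pass@1 from N sampled generations per question: the fraction of
    correct generations for each question, averaged over questions.
    `correct_flags` is a list of per-question lists of 0/1 correctness flags."""
    per_question = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_question) / len(per_question)
```

For example, one question solved in 2 of 4 samples and another solved in 4 of 4 gives a Pass@1 of 0.75.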
## 6. Conclusion and Discussion
We revisited (on-policy) GRPO (Shao et al., 2024) and showed that its clipping objective can be derived from first principles as a lower bound for reward improvement. We also gave theoretical grounding to the masking of zero-variance samples suggested in DAPO (Yu et al., 2025). We introduced off-policy GRPO and laid out conditions under which it leads to policy improvement. Our off-policy GRPO has the advantage of reducing the communication cost of serving the model for inference within the GRPO loop at each iteration, as required by the on-policy counterpart, while not sacrificing performance. We showed that off-policy GRPO stabilizes training and achieves performance on par with or better than its on-policy counterpart.
The main takeaways of our paper for practitioners are: (1) zero-variance masking stabilizes on-policy GRPO training; (2) off-policy GRPO attains its full potential, maintaining performance while lowering latency and communication overhead, in larger-scale training where models are served using tensor parallelism (see vLLM (2025)).
We hope our proof of concept for off-policy GRPO will help enable stable and efficient reinforcement learning at scale.
## References
- Bai et al. [2022] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Christiano et al. [2017] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.
- Cobbe et al. [2021] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Degris et al. [2012] T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
- Fakoor et al. [2020] R. Fakoor, P. Chaudhari, and A. J. Smola. P3o: Policy-on policy-off policy optimization. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, pages 1017–1027. PMLR, 2020.
- Gan et al. [2024] Y. Gan, R. Yan, X. Tan, Z. Wu, and J. Xing. Transductive off-policy proximal policy optimization. arXiv preprint arXiv:2406.03894, 2024.
- Guo et al. [2025] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Habib et al. [2023] N. Habib, C. Fourrier, H. Kydlíček, T. Wolf, and L. Tunstall. Lighteval: A lightweight framework for llm evaluation, 2023. URL https://github.com/huggingface/lighteval.
- HuggingFace [2025a] HuggingFace. Open r1: A fully open reproduction of deepseek-r1, January 2025a. URL https://github.com/huggingface/open-r1.
- HuggingFace [2025b] HuggingFace. Open r1: Update #3, Mar. 2025b. URL https://huggingface.co/blog/open-r1/update-3. Accessed: 2025-05-11.
- Kwon et al. [2023] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- Lambert et al. [2024] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
- Liu et al. [2025] Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783.
- Luo et al. [2025] M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, T. Zhang, L. E. Li, R. A. Popa, and I. Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://tinyurl.com/5e9rs33z, 2025. Notion Blog.
- Meng et al. [2023] W. Meng, Q. Zheng, G. Pan, and Y. Yin. Off-policy proximal policy optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8):9162–9170, 2023.
- Mroueh [2025] Y. Mroueh. Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification, 2025. URL https://arxiv.org/abs/2503.06639.
- Noukhovitch et al. [2025] M. Noukhovitch, S. Huang, S. Xhonneux, A. Hosseini, R. Agarwal, and A. Courville. Faster, more efficient RLHF through off-policy asynchronous learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=FhTAG591Ve.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library, Dec. 2019.
- Queeney et al. [2021] J. Queeney, I. C. Paschalidis, and C. G. Cassandras. Generalized proximal policy optimization with sample reuse. In Advances in Neural Information Processing Systems, volume 34, 2021.
- Schulman et al. [2015] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015.
- Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Shao et al. [2024] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Stiennon et al. [2020] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- vLLM [2025] vLLM Project. PyTorch CI HUD: vLLM benchmark dashboard. https://hud.pytorch.org/benchmark/llms?repoName=vllm-project/vllm, 2025. Accessed: 2025-05-14.
- Vojnovic and Yun [2025] M. Vojnovic and S.-Y. Yun. What is the alignment objective of grpo?, 2025. URL https://arxiv.org/abs/2502.18548.
- von Werra et al. [2020a] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020a.
- von Werra et al. [2020b] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020b.
- Wang et al. [2016] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. De Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
- Wolf et al. [2020] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. HuggingFace’s Transformers: State-of-the-art Natural Language Processing, July 2020.
- Yang et al. [2024] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, X. Liu, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Guo, and Z. Fan. Qwen2 Technical Report, Sept. 2024.
- Yu et al. [2025] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, W. Dai, Y. Song, X. Wei, H. Zhou, J. Liu, W.-Y. Ma, Y.-Q. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476.
## Appendix A Broader Impact and Limitations
Our work analyzes the celebrated GRPO algorithm and develops an adaptation to the off-policy setting, motivated by recent efforts for PPO that demonstrated higher stability and efficiency. Our primary contributions are theoretical: we provide formal conditions under which advantage optimization guarantees policy improvement in both the on-policy and off-policy regimes. These insights yield lower bounds on policy improvement and directly inform a practical clipped surrogate objective for large language model (LLM) policy training that inherits our theoretical guarantees in both regimes. In the on-policy regime, our lower bound sheds light on, and gives theoretical backing to, the benefit of masking samples with zero variance, as suggested in the DAPO paper [Yu et al., 2025]. Our formulation also clarifies the theoretical relationships between our newly introduced off-policy GRPO, PPO variants, and general off-policy optimization frameworks, a linkage previously underexplored in the literature. Our derived off-policy GRPO algorithm is validated experimentally, demonstrating improved performance compared to standard GRPO while having the potential to reduce the cross-device communication overhead incurred when serving large models for sampling in GRPO. The broader impacts we anticipate from our work (besides those directly inherited from GRPO and reinforcement fine-tuning of LLMs, and the risks associated with the dual use of the enabled reasoning models) are therefore generally positive, as it enhances RL efficiency, reducing computational costs and improving stability.
The main limitation of our work is that the empirical validation remains constrained to smaller datasets, smaller model architectures, and smaller context sizes (at most 4096 tokens) that can be trained on our hardware setup, a single compute node with eight NVIDIA H100 GPUs (one used for the vLLM server and seven for training the policy LLM). Our 1.5B experimental setup with the DeepScaleR data is at the limit of what fits in the memory of a single node.
This limitation primarily reflects the resource constraints commonly associated with provisioning large-scale distributed training environments, rather than any inherent restriction of the algorithm itself. Note that for larger contexts, larger batch sizes, and larger architectures than those used in our paper, multi-node training is required.
While our main contribution remains theoretical and is backed by ablation studies on a single node, we leave scaling up to larger training runs for future work. We expect the benefits of our off-policy algorithm in terms of efficiency and reduced communication to become even more pronounced in the large-scale distributed regime, as has already been shown in multiple off-policy RL works.
## Appendix B Assets
#### Hardware setup
All our experiments were run on a single compute node with dual 48-core Intel Xeon 8468 CPUs, 2TB of RAM, 8 NVIDIA HGX H100 80GB SXM5 GPUs, 8x 3.4TB enterprise NVMe U.2 Gen4 drives, and 10x NVIDIA Mellanox InfiniBand single-port NDR adapters, running Red Hat Enterprise Linux 9.5.
#### Libraries
Our experiments rely on the open-source libraries PyTorch [Paszke et al., 2019] (BSD license), HuggingFace Transformers [Wolf et al., 2020] (Apache 2.0 license), and HuggingFace TRL [von Werra et al., 2020a] (Apache 2.0 license). We also relied on Open-R1 [HuggingFace, 2025a] as well as LightEval [Habib et al., 2023] for the evaluation on Aime 24 and Math 500.
#### Code re-use
Our GRPO training code is based on the public Github repository https://github.com/huggingface/open-r1 [HuggingFace, 2025a].
#### Data and Models
In our experiments, we use the following publicly available datasets: (1) the GSM8K dataset from Cobbe et al. [2021] (MIT license), and (2) the DeepScaleR-Preview-Dataset from Luo et al. [2025] (MIT license). The models we used were Qwen/Qwen2.5-0.5B-Instruct (Apache 2.0 license) by Yang et al. [2024] and DeepSeek-R1-Distill-Qwen-1.5B (MIT license) by Guo et al. [2025].
## Appendix C Reward Improvement Lower Bound
### C.1. Proof of Theorem 1
We have:
$$
J(\pi(\cdot|x))=\mathbb{E}_{y\sim\pi(\cdot|x)}r(x,y)
$$
Let $\pi_{k}$ be the current policy and $\alpha(\cdot|x)$ another policy; typically one considers $\alpha(\cdot|x)=\pi_{k-i}(\cdot|x)$ .
Define the mean and standard deviation of the reward under the behavior policy $\alpha$ :
$\mu_{\alpha}(x)=\mathbb{E}_{y\sim\alpha(\cdot|x)}r(x,y)$ and $\sigma_{\alpha}(x)=\sqrt{\mathbb{E}_{y\sim\alpha(\cdot|x)}(r(x,y)-\mu_{\alpha} (x))^{2}},$ and denote for $0<\varepsilon<1$ : $\sigma_{\alpha,\varepsilon}(x)=\sqrt{\sigma^{2}_{\alpha}(x)+\varepsilon}$ .
Note that the reward is bounded, $0\leq r(x,y)\leq\left\lVert{r}\right\rVert_{\infty}$ , which implies $\sigma^{2}_{\alpha}(x)\leq\frac{\left\lVert{r}\right\rVert^{2}_{\infty}}{4}$ , and hence we have:
$$
\sigma_{\alpha,\varepsilon}(x)\leq\sqrt{\frac{\left\lVert{r}\right\rVert^{2}_{
\infty}}{4}+\varepsilon}.
$$
We assume the reward is normalized so that: $\sigma_{\alpha,\varepsilon}(x)\leq\sqrt{\frac{\left\lVert{r}\right\rVert^{2}_{ \infty}}{4}+\varepsilon}\leq 1.$
We denote the GRPO advantage function by:
$$
A_{\alpha}(x,y)=\frac{r(x,y)-\mu_{\alpha}(x)}{\sigma_{\alpha,\varepsilon}(x)}
$$
The associated off-policy surrogate objective is:
$$
\mathcal{L}_{\alpha}(\pi(\cdot|x))=\mathbb{E}_{y\sim\alpha(\cdot|x)}\frac{\pi(
y|x)}{\alpha(y|x)}A_{\alpha}(x,y)
$$
If $\alpha=\pi_{k}$ , we obtain the on-policy objective function of GRPO, where the advantage is computed with the current policy $\pi_{k}$ , i.e., using $A_{\pi_{k}}(x,y)$ .
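For concreteness, here is a minimal NumPy sketch of these two quantities (the function names `grpo_advantages` and `off_policy_surrogate` are ours, and sequence-level log-probabilities are assumed precomputed):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-4):
    """Group-relative advantage: standardize the rewards of one group
    (the completions sampled for a single prompt x),
    A(x, y) = (r(x, y) - mu) / sqrt(var + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    mu = rewards.mean()
    sigma_eps = np.sqrt(rewards.var() + eps)
    return (rewards - mu) / sigma_eps

def off_policy_surrogate(logp_pi, logp_alpha, rewards, eps=1e-4):
    """Monte-Carlo estimate of L_alpha(pi(.|x)) = E_{y~alpha}[(pi/alpha) A_alpha]
    from a group of rewards sampled under the behavior policy alpha."""
    ratio = np.exp(np.asarray(logp_pi) - np.asarray(logp_alpha))
    return float(np.mean(ratio * grpo_advantages(rewards, eps)))
```

When $\alpha=\pi_{k}$ the two log-probabilities coincide, the importance ratio is identically 1, and the estimate reduces to the mean group-relative advantage, recovering the on-policy GRPO objective.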
We have:
$$
\mathcal{L}_{\alpha}(\pi(\cdot|x))=\frac{1}{\sigma_{\alpha,\varepsilon}(x)}\left(\mathbb{E}_{y\sim\pi(\cdot|x)}r(x,y)-\mu_{\alpha}(x)\right)
$$
Our goal is to provide an upper bound on:
$$
\mathcal{L}_{\alpha}(\pi(\cdot|x))-\left(J(\pi(\cdot|x))-J(\pi_{k}(\cdot|x))\right)
$$
Hence we have:
$$
\begin{aligned}
\mathcal{L}_{\alpha}(\pi(\cdot|x))-\left(J(\pi(\cdot|x))-J(\pi_{k}(\cdot|x))\right)&=\left(\frac{1}{\sigma_{\alpha,\varepsilon}(x)}-1\right)J(\pi(\cdot|x))+J(\pi_{k}(\cdot|x))-\frac{1}{\sigma_{\alpha,\varepsilon}(x)}J(\alpha(\cdot|x))\\
&=\left(\frac{1}{\sigma_{\alpha,\varepsilon}(x)}-1\right)\left(J(\pi(\cdot|x))-J(\alpha(\cdot|x))\right)+J(\pi_{k}(\cdot|x))-J(\alpha(\cdot|x))
\end{aligned}
$$
**Lemma 1 (Kantorovich-Rubinstein duality of the total variation distance)**
*The Kantorovich-Rubinstein duality (variational representation) of the total variation distance is as follows:
$$
\mathbb{TV}(m_{1},m_{2})=\frac{1}{2L}\sup_{g\in\mathcal{G}_{L}}\left\{\mathbb{
E}_{Z\sim m_{1}}[g(Z)]-\mathbb{E}_{Z\sim m_{2}}[g(Z)]\right\}, \tag{9}
$$
where $\mathcal{G}_{L}=\{g:\mathcal{Z}\rightarrow\mathbb{R},||g||_{\infty}\leq L\}$ .*
On the other hand, applying Lemma 1 with $g=r(x,\cdot)$ and $L=\left\lVert{r}\right\rVert_{\infty}$ , we have:
$$
J(\pi(\cdot|x))-J(\alpha(\cdot|x))\leq 2\left\lVert{r}\right\rVert_{\infty}
\mathbb{TV}(\pi(\cdot|x),\alpha(\cdot|x))
$$
and
$$
J(\pi_{k}(\cdot|x))-J(\alpha(\cdot|x))\leq 2\left\lVert{r}\right\rVert_{\infty
}\mathbb{TV}(\pi_{k}(\cdot|x),\alpha(\cdot|x))
$$
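To make the role of Lemma 1 concrete, the following toy numerical check (our own illustration, with discrete policies on a finite support) verifies that the reward gap between two policies never exceeds $2\left\lVert{r}\right\rVert_{\infty}\mathbb{TV}$:

```python
import numpy as np

# Toy check of the TV-based reward-gap bound: for discrete policies on a
# finite support and a bounded reward, J(pi) - J(alpha) <= 2 ||r||_inf TV.
rng = np.random.default_rng(0)
n = 6  # support size of the toy policies

for _ in range(100):
    pi = rng.dirichlet(np.ones(n))      # candidate policy pi(.|x)
    alpha = rng.dirichlet(np.ones(n))   # behavior policy alpha(.|x)
    r = rng.uniform(0.0, 1.0, size=n)   # bounded reward, ||r||_inf <= 1

    gap = float(np.dot(pi, r) - np.dot(alpha, r))  # J(pi) - J(alpha)
    tv = 0.5 * float(np.abs(pi - alpha).sum())     # total variation distance
    bound = 2.0 * r.max() * tv                     # 2 ||r||_inf TV
    assert gap <= bound + 1e-12
```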
By our assumption on the reward we have:
$$
\frac{1-\sigma_{\alpha,\varepsilon}(x)}{\sigma_{\alpha,\varepsilon}(x)}\geq 0
$$
so that we obtain the final bound as follows:
$$
\mathcal{L}_{\alpha}(\pi(\cdot|x))-\left(J(\pi(\cdot|x))-J(\pi_{k}(\cdot|x))\right)\leq 2\frac{1-\sigma_{\alpha,\varepsilon}(x)}{\sigma_{\alpha,\varepsilon}(x)}\left\lVert{r}\right\rVert_{\infty}\mathbb{TV}(\pi(\cdot|x),\alpha(\cdot|x))+2\left\lVert{r}\right\rVert_{\infty}\mathbb{TV}(\pi_{k}(\cdot|x),\alpha(\cdot|x))
$$
We finally obtain our lower bound on policy improvement:
$$
J(\pi(\cdot|x))-J(\pi_{k}(\cdot|x))\geq\mathcal{L}_{\alpha}(\pi(\cdot|x))-2\frac{1-\sigma_{\alpha,\varepsilon}(x)}{\sigma_{\alpha,\varepsilon}(x)}\left\lVert{r}\right\rVert_{\infty}\mathbb{TV}(\pi(\cdot|x),\alpha(\cdot|x))-2\left\lVert{r}\right\rVert_{\infty}\mathbb{TV}(\pi_{k}(\cdot|x),\alpha(\cdot|x))
$$
Integrating over $x$ (the prompts) we have:
$$
\mathbb{E}_{x\sim\rho_{\mathcal{X}}}J(\pi(\cdot|x))-\mathbb{E}_{x\sim\rho_{\mathcal{X}}}J(\pi_{k}(\cdot|x))\geq\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathcal{L}_{\alpha}(\pi(\cdot|x))-2\left\lVert{r}\right\rVert_{\infty}\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\frac{1-\sigma_{\alpha,\varepsilon}(x)}{\sigma_{\alpha,\varepsilon}(x)}\mathbb{TV}(\pi(\cdot|x),\alpha(\cdot|x))-2\left\lVert{r}\right\rVert_{\infty}\mathbb{E}_{x\sim\rho_{\mathcal{X}}}\mathbb{TV}(\pi_{k}(\cdot|x),\alpha(\cdot|x))
$$
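As a numerical sanity check of the per-prompt lower bound (a toy discrete example of our own, not part of the proof), one can verify that the surrogate minus the two TV penalty terms never exceeds the true improvement $J(\pi)-J(\pi_{k})$:

```python
import numpy as np

# Toy check of the per-prompt lower bound: for discrete policies pi, pi_k,
# alpha on a finite support and rewards normalized to [0, 1],
# J(pi) - J(pi_k) >= L_alpha(pi) - TV penalty terms.
rng = np.random.default_rng(1)
n, eps = 5, 1e-2

def tv(p, q):
    return 0.5 * float(np.abs(p - q).sum())

for _ in range(200):
    pi, pi_k, alpha = (rng.dirichlet(np.ones(n)) for _ in range(3))
    r = rng.uniform(0.0, 1.0, size=n)

    mu = float(np.dot(alpha, r))                         # mu_alpha(x)
    sigma_eps = float(np.sqrt(np.dot(alpha, (r - mu) ** 2) + eps))
    A = (r - mu) / sigma_eps                             # GRPO advantage
    L = float(np.dot(pi, A))  # E_{y~alpha}[(pi/alpha) A] = E_{y~pi}[A]

    rinf = r.max()
    lower = (L
             - 2.0 * (1.0 - sigma_eps) / sigma_eps * rinf * tv(pi, alpha)
             - 2.0 * rinf * tv(pi_k, alpha))
    improvement = float(np.dot(pi, r) - np.dot(pi_k, r))
    assert improvement >= lower - 1e-12
```

Note that with rewards in $[0,1]$ we have $\sigma_{\alpha,\varepsilon}(x)\leq 1$, as required by the normalization assumption above.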
## Appendix D Experiments
![Math 500 extractive match over training iterations for models v10-l1-length-4096 and v1-l1-length-4096, with variance bands.](extracted/6494595/figs/math500.png)
(a) Aime 24
![Aime 24 pass@1 over training iterations for models v10-l1-length-4096 and v1-l1-length-4096, with variance bands.](extracted/6494595/figs/aime24.png)
(b) Math 500.
Figure 3. Aime 24 / Math 500 performance over training iterations.