# Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Abstract
Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovation in RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance for off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms — Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) — as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.

*Equal contribution. Contact: chaorui@ucla.edu, chenyanxi.cyx@alibaba-inc.com*
*UCLA. Work done during an internship at Alibaba Group.*
*Alibaba Group.*
1 Introduction
The past few years have witnessed rapid progress in reinforcement learning (RL) for large language models (LLMs). This began with reinforcement learning from human feedback (RLHF) (Bai et al., 2022; Ouyang et al., 2022) that aligns pre-trained LLMs with human preferences, followed by reasoning-oriented RL that enables LLMs to produce long chains of thought (OpenAI, 2024; DeepSeek-AI, 2025; Kimi-Team, 2025b; Zhang et al., 2025b). More recently, agentic RL (Kimi-Team, 2025a; Gao et al., 2025; Zhang et al., 2025a) aims to train LLMs for agentic capabilities such as tool use, long-horizon planning, and multi-step task execution in dynamic environments.
Alongside these developments, off-policy RL has been attracting growing interest. In the “era of experience” (Silver and Sutton, 2025), LLM-powered agents need to be continually updated through interaction with the environment. Practical constraints in real-world deployment and the complexity of LLM-RL infrastructure often render on-policy training impractical (Noukhovitch et al., 2025): rollout generation and model training can proceed at mismatched speeds, data might be collected from different policies, reward feedback might be irregular or delayed, and the environment may be too costly or unstable to query for fresh trajectories. Moreover, in pursuit of higher sample efficiency and model performance, it is desirable to go beyond the standard paradigm of independent rollout sampling, e.g., via replaying past experiences (Schaul et al., 2016; Rolnick et al., 2019; An et al., 2025), synthesizing higher-quality experiences based on auxiliary information (Da et al., 2025; Liang et al., 2025; Guo et al., 2025), or incorporating expert demonstrations into online RL (Yan et al., 2025; Zhang et al., 2025c) — all of which incur off-policyness.
However, the prominent algorithms in LLM-RL — Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Group Relative Policy Optimization (GRPO) (Shao et al., 2024) — are essentially on-policy methods: as modern variants of REINFORCE (Williams, 1992), their fundamental rationale is to produce unbiased estimates of the policy gradient, which requires fresh data sampled from the current policy. PPO and GRPO can handle a limited degree of off-policyness via importance sampling, but require that the current policy remains sufficiently close to the behavior policy. Truly off-policy LLM-RL often demands ad-hoc analysis and algorithm design; worse still, as existing RL infrastructure (Sheng et al., 2024; Hu et al., 2024; von Werra et al., 2020; Wang et al., 2025; Pan et al., 2025; Fu et al., 2025a) is typically optimized for REINFORCE-style algorithms, their support for specialized off-policy RL algorithms could be limited. All these have motivated our investigation into principled and infrastructure-friendly algorithm design for off-policy RL.
Core finding: a native off-policy interpretation for group-relative REINFORCE.
Consider a one-step RL setting and a group-relative variant of REINFORCE that, like GRPO, assumes access to multiple responses $\{y_{1},...,y_{K}\}$ for the same prompt $x$ and uses the group mean reward $\overline{r}$ as the baseline in the advantage calculation. Each response is a sequence of tokens $y_{i}=(y^{1}_{i},y^{2}_{i},...)$, and receives a response-level reward $r_{i}=r(x,y_{i})$. Let $\pi_{\bm{\theta}}(\cdot|x)$ denote an autoregressive policy parameterized by $\bm{\theta}$. The update rule for each iteration of group-relative REINFORCE is $\bm{\theta}^{\prime}=\bm{\theta}+\eta\bm{g}$, where $\eta$ is the learning rate, and $\bm{g}$ is the sum of updates from multiple prompts and their corresponding responses.

*Footnote: For notational simplicity and consistency, we use the same normalization factor $1/K$ for both the response-wise and token-wise formulas in Eq. (1a) and (1b). In practical implementations, the gradient is calculated with samples from a mini-batch and is typically normalized by the total number of response tokens; this mismatch does not affect our theoretical studies in this work. Interestingly, our analysis of REINFORCE provides certain justification for calculating the token-mean loss within a mini-batch, instead of first taking the token-mean loss within each sequence and then averaging across sequences (Shao et al., 2024); our perspective is complementary to the rationales explained in prior works like DAPO (Yu et al., 2025), although a deeper understanding of this aspect is beyond our current focus.*

For a specific prompt $x$, the update is
$$
\begin{aligned}
\bm{g}\big(\bm{\theta};x,\{y_{i},r_{i}\}_{1\leq i\leq K}\big)&=\frac{1}{K}\sum_{1\leq i\leq K}(r_{i}-\overline{r})\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}\,|\,x)\qquad&&\text{(response-wise)}\qquad\text{(1a)}\\
&=\frac{1}{K}\sum_{1\leq i\leq K}\sum_{1\leq t\leq|y_{i}|}(r_{i}-\overline{r})\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})&&\text{(token-wise)}.\qquad\text{(1b)}
\end{aligned}
$$
Here, the response-wise and token-wise formulas are linked by the elementary decomposition $\log\pi_{\bm{\theta}}(y_{i}\,|\,x)=\sum_{t}\log\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})$ , where $y^{<t}_{i}$ denotes the first $t-1$ tokens of $y_{i}$ .
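The response-wise/token-wise equivalence can be checked numerically. The sketch below is our illustration (not the paper's code): it uses a toy "context-free" softmax policy over a small vocabulary, for which $\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(v)=\bm{e}_{v}-\pi_{\bm{\theta}}$, and verifies that the two forms of Eq. (1) yield the same update direction.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V = 5                       # toy vocabulary size
theta = rng.normal(size=V)  # "context-free" policy logits (illustrative)
p = softmax(theta)

# K responses (token-id sequences) with scalar rewards
responses = [rng.integers(0, V, size=rng.integers(2, 6)) for _ in range(4)]
rewards = rng.normal(size=4)
baseline = rewards.mean()   # group mean reward as baseline

# Response-wise form: grad log pi(y|x) = counts(y) - |y| * p for this policy
g_resp = np.zeros(V)
for y, r in zip(responses, rewards):
    counts = np.bincount(y, minlength=V)
    g_resp += (r - baseline) * (counts - len(y) * p)
g_resp /= len(responses)

# Token-wise form: sum of per-token score functions (e_{y^t} - p)
g_tok = np.zeros(V)
for y, r in zip(responses, rewards):
    for t in y:
        e = np.zeros(V); e[t] = 1.0
        g_tok += (r - baseline) * (e - p)
g_tok /= len(responses)

assert np.allclose(g_resp, g_tok)  # the two forms of Eq. (1) coincide
```

Since every per-token score function sums to zero across the vocabulary, the resulting update direction also sums to zero, as expected for a softmax parameterization.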
A major finding of this work is that group-relative REINFORCE admits a native off-policy interpretation. We establish this in Section 2 via a novel, first-principles derivation that makes no explicit assumption about the sampling distribution of the responses $\{y_{i}\}$ , in contrast to the standard policy gradient theory. Our derivation provides a new perspective for understanding how REINFORCE makes its way towards the optimal policy by constructing a series of surrogate objectives and taking gradient steps for the corresponding surrogate losses. Such analysis can be extended to multi-step RL settings as well, with details deferred to Appendix A.
Implications: principles and concrete methods for augmenting REINFORCE.
While the proposed off-policy interpretation does not imply that vanilla REINFORCE should converge to the optimal policy when given arbitrary training data (which is too good to be true), our analysis in Section 3 identifies two general principles for augmenting REINFORCE in off-policy settings: (1) regularize the policy update step to stabilize learning, and (2) actively shape the training data distribution to steer the policy update direction. As we will see in Section 4, this unified framework demystifies common myths about the rationales behind many recent RL algorithms: (1) It reveals that in GRPO, clipping (as a form of regularization) plays a much more essential role than importance sampling, and it is often viable to enlarge the clipping range far beyond conventional choices for accelerated convergence without sacrificing stability. (2) Two recent algorithms — Kimi’s Online Policy Mirror Descent (OPMD) (Kimi-Team, 2025b) and Meta’s Asymmetric REINFORCE (AsymRE) (Arnal et al., 2025) — can be reinterpreted as adding a regularization loss to the standard REINFORCE loss, which differs substantially from the rationales explained in their original papers. (3) Our framework justifies heuristic data-weighting strategies like discarding certain low-reward samples or up-weighting high-reward ones, even though they violate assumptions in policy gradient theory and often require ad-hoc analysis in prior works.
Extensive empirical studies in Section 4 and Appendix B validate these insights and demonstrate the efficacy and/or limitations of various algorithms under investigation. By revealing the off-policy nature of group-relative REINFORCE, our work opens up new opportunities for principled, infrastructure-friendly algorithm design in off-policy LLM-RL with solid theoretical foundation.
2 Two interpretations for REINFORCE
Consider the standard reward-maximization objective in reinforcement learning:
$$
\max_{\bm{\theta}}\quad J(\bm{\theta})\coloneqq{\mathbb{E}}_{x\sim D}\;J(\bm{\theta};x),\quad\text{where}\quad J(\bm{\theta};x)\coloneqq{\mathbb{E}}_{y\sim\pi_{\bm{\theta}}(\cdot|x)}\;r(x,y), \tag{2}
$$
where $D$ is a distribution over the prompts $x$ .
We first recall the standard on-policy interpretation of REINFORCE in Section 2.1, and then present our proposed off-policy interpretation in Section 2.2.
2.1 Recap: on-policy interpretation via policy gradient theory
In the classical on-policy view, REINFORCE updates policy parameters $\bm{\theta}$ using samples that are drawn directly from $\pi_{\bm{\theta}}$ . The policy gradient theorem (Sutton et al., 1998) tells us that
$$
\nabla_{\bm{\theta}}J(\bm{\theta};x)=\nabla_{\bm{\theta}}\;{\mathbb{E}}_{y\sim\pi_{\bm{\theta}}(\cdot|x)}\;r(x,y)={\mathbb{E}}_{y\sim\pi_{\bm{\theta}}(\cdot|x)}\!\Big[\big(r(x,y)-b(x)\big)\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y|x)\Big],
$$
where $b(x)$ is a baseline for reducing variance when $∇_{\bm{\theta}}J(\bm{\theta};x)$ is estimated with finite samples. If samples are drawn from a different behavior policy $\pi_{\textsf{b}}$ instead, the gradient can be rewritten as
$$
\nabla_{\bm{\theta}}J(\bm{\theta};x)={\mathbb{E}}_{y\sim\pi_{\textsf{b}}(\cdot|x)}\!\bigg[\big(r(x,y)-b(x)\big)\,\frac{\pi_{\bm{\theta}}(y\,|\,x)}{\pi_{\textsf{b}}(y\,|\,x)}\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y\,|\,x)\bigg].
$$

While the raw importance-sampling weight ${\pi_{\bm{\theta}}(y|x)}/{\pi_{\textsf{b}}(y|x)}$ yields an unbiased policy gradient estimate, it may be unstable when $\pi_{\bm{\theta}}$ and $\pi_{\textsf{b}}$ diverge. Modern variants of REINFORCE address this by modifying the probability ratios (e.g., via clipping or normalization), which achieves a better bias-variance trade-off in the policy gradient estimate and leads to a more stable learning process.
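As a concrete illustration of ratio modification, here is a minimal sketch (ours, not tied to any particular codebase) of the pessimistic PPO-style clipping that these variants apply to the IS-weighted advantage:

```python
import numpy as np

def clipped_weight(ratio, adv, eps=0.2):
    """Pessimistic PPO-style clipping of the IS-weighted advantage
    (a generic sketch, not any specific library's implementation)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# Far-off-policy ratios are capped, bounding the effective update
assert clipped_weight(5.0, 1.0) == 1.2    # large ratio, positive advantage: capped
assert clipped_weight(5.0, -1.0) == -5.0  # pessimistic min keeps the worse term
assert clipped_weight(1.0, 0.7) == 0.7    # on-policy (ratio = 1): untouched
```

The `min` makes the objective pessimistic: upside from large ratios is capped, while downside is not, which bounds how far a single update can push the policy.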
In the LLM context, we have $∇_{\bm{\theta}}\log\pi_{\bm{\theta}}(y\,|\,x)=\sum_{t}∇_{\bm{\theta}}\log\pi_{\bm{\theta}}(y^{t}\,|\,x,y^{<t})$ , but the response-wise probability ratio $\pi_{\bm{\theta}}(y|x)/\pi_{\textsf{b}}(y|x)$ can blow up or shrink exponentially with the sequence length. Practical implementations typically adopt token-wise probability ratio instead:
$$
\widetilde{g}(\bm{\theta};x)={\mathbb{E}}_{y\sim\pi_{\textsf{b}}(\cdot|x)}\!\bigg[\big(r(x,y)-b(x)\big)\sum_{1\leq t\leq|y|}\frac{\pi_{\bm{\theta}}(y^{t}\,|\,x,y^{<t})}{\pi_{\textsf{b}}(y^{t}\,|\,x,y^{<t})}\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y^{t}\,|\,x,y^{<t})\bigg].
$$
Although this becomes a biased approximation of $∇_{\bm{\theta}}J(\bm{\theta};x)$ , classical RL theory still offers policy improvement guarantees if $\pi_{\bm{\theta}}$ is sufficiently close to $\pi_{\textsf{b}}$ (Kakade and Langford, 2002; Fragkiadaki, 2018; Schulman et al., 2015, 2017; Achiam et al., 2017).
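The exponential behavior of the response-wise ratio is easy to see numerically. In the toy sketch below (illustrative lognormal token ratios, not real LLM probabilities), per-token ratios stay close to 1, yet their product spreads over many orders of magnitude as the sequence length grows:

```python
import numpy as np

def response_ratio(T, rng):
    """Response-wise IS ratio as the product of T token-wise ratios.
    Token ratios are illustrative lognormal draws near 1, standing in for
    pi_theta / pi_b at each generated token of a length-T response."""
    return np.exp(rng.normal(loc=0.02, scale=0.1, size=T)).prod()

rng = np.random.default_rng(0)
for T in (8, 64, 512):
    samples = [response_ratio(T, rng) for _ in range(1000)]
    print(f"T={T:4d}  min={min(samples):.2e}  max={max(samples):.2e}")
```

Even a mild per-token drift compounds multiplicatively, which is why practical implementations work with token-wise ratios rather than the response-wise product.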
2.2 A new interpretation: REINFORCE is inherently off-policy
We now provide an alternative off-policy interpretation for group-relative REINFORCE. Let us think of policy optimization as an iterative process $\bm{\theta}_{1},\bm{\theta}_{2},...$ , and focus on the $t$ -th iteration that updates the policy model parameters from $\bm{\theta}_{t}$ to $\bm{\theta}_{t+1}$ . Our derivation consists of three steps: (1) define a KL-regularized surrogate objective, and show that its optimal solution must satisfy certain consistency conditions; (2) define a surrogate loss (with finite samples) that enforces such consistency conditions; and (3) take one gradient step of the surrogate loss, which turns out to be exactly the update of the group-relative REINFORCE method.
Step 1: surrogate objective and consistency condition.
Consider the following KL-regularized surrogate objective that incentivizes the policy to make a stable improvement over $\pi_{\bm{\theta}_{t}}$ :
$$
\max_{\bm{\theta}}\quad J(\bm{\theta};\pi_{\bm{\theta}_{t}})\coloneqq{\mathbb{E}}_{x\sim D}\Big[{\mathbb{E}}_{y\sim\pi_{\bm{\theta}}(\cdot|x)}[r(x,y)]-\tau\cdot D_{\textsf{KL}}\big(\pi_{\bm{\theta}}(\cdot|x)\,\|\,\pi_{\bm{\theta}_{t}}(\cdot|x)\big)\Big], \tag{3}
$$
where $\tau$ is a regularization coefficient. It is a well-known fact that the optimal policy $\pi$ for this surrogate objective satisfies the following (Nachum et al., 2017; Korbak et al., 2022; Rafailov et al., 2023; Richemond et al., 2024; Kimi-Team, 2025b): for any prompt $x$ and response $y$ ,
$$
\displaystyle\pi(y|x)=\frac{\pi_{\bm{\theta}_{t}}(y|x)e^{r(x,y)/\tau}}{Z(x,\pi_{\bm{\theta}_{t}})},\,\,\text{where}\,\,Z(x,\pi_{\bm{\theta}_{t}})\coloneqq\int\pi_{\bm{\theta}_{t}}(y^{\prime}|x)e^{r(x,y^{\prime})/\tau}\mathop{}\!\mathrm{d}y^{\prime}. \tag{4}
$$
Note that Eq. (4) is equivalent to the following: for any pair of responses $y_{1}$ and $y_{2}$ ,
$$
\frac{\pi(y_{1}|x)}{\pi(y_{2}|x)}=\frac{\pi_{\bm{\theta}_{t}}(y_{1}|x)}{\pi_{\bm{\theta}_{t}}(y_{2}|x)}\exp\bigg(\frac{r(x,y_{1})-r(x,y_{2})}{\tau}\bigg).
$$
Taking the logarithm of both sides, we obtain the following pairwise consistency condition:
$$
\displaystyle r_{1}-\tau\cdot\big(\log\pi(y_{1}|x)-\log\pi_{\bm{\theta}_{t}}(y_{1}|x)\big)=r_{2}-\tau\cdot\big(\log\pi(y_{2}|x)-\log\pi_{\bm{\theta}_{t}}(y_{2}|x)\big). \tag{5}
$$
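The consistency condition can be verified directly on a discrete toy action space. The snippet below (our sketch) builds the closed-form optimum of Eq. (4) and checks that $a_{i}\coloneqq r_{i}-\tau(\log\pi(y_{i}|x)-\log\pi_{\bm{\theta}_{t}}(y_{i}|x))$ is constant across actions, exactly as Eq. (5) requires:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 6, 0.7                      # discrete toy action space, KL coefficient
pi_t = rng.dirichlet(np.ones(n))     # current policy pi_{theta_t}(.|x)
r = rng.normal(size=n)               # rewards r(x, y) for each action

# Closed-form optimum of the KL-regularized objective, Eq. (4)
w = pi_t * np.exp(r / tau)
pi_star = w / w.sum()                # normalization by Z(x, pi_{theta_t})

# Pairwise consistency, Eq. (5): a_i is the same for every action
a = r - tau * (np.log(pi_star) - np.log(pi_t))
assert np.allclose(a, a[0])          # every pair (i, j) satisfies a_i == a_j
```

In fact the common value is $\tau\log Z(x,\pi_{\bm{\theta}_{t}})$, which follows by substituting Eq. (4) into the definition of $a_{i}$.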
Step 2: surrogate loss with finite samples.
Given a prompt $x$ and $K$ responses $y_{1},...,y_{K}$ , we define the following mean-squared surrogate loss that enforces the consistency condition:
$$
\widehat{L}({\bm{\theta}};x,\pi_{\bm{\theta}_{t}})\coloneqq\frac{1}{K^{2}}\sum_{1\leq i<j\leq K}\frac{(a_{i}-a_{j})^{2}}{(1+\tau)^{2}},\,\,\text{where}\,\,a_{i}\coloneqq r_{i}-\tau\Big(\log\pi_{\bm{\theta}}(y_{i}|x)-\log\pi_{\bm{\theta}_{t}}(y_{i}|x)\Big). \tag{6}
$$
Here, we normalize $a_{i}-a_{j}$ by $1+\tau$ to account for the loss scale. In theory, if this surrogate loss were defined with infinitely many samples that sufficiently cover the action space, its minimizer would coincide with the optimal policy for the surrogate objective in Eq. (3).
Step 3: one gradient step of the surrogate loss.
Let us conduct further analysis for $(a_{i}-a_{j})^{2}$ . The trick here is that, if we take only one gradient step of this loss at $\bm{\theta}=\bm{\theta}_{t}$ , then the values of $\log\pi_{\bm{\theta}}(y_{i}|x)-\log\pi_{\bm{\theta}_{t}}(y_{i}|x)$ and $\log\pi_{\bm{\theta}}(y_{j}|x)-\log\pi_{\bm{\theta}_{t}}(y_{j}|x)$ are simply zero. As a result,
$$
\nabla_{\bm{\theta}}(a_{i}-a_{j})^{2}\big|_{\bm{\theta}_{t}}=-2\tau\,(r_{i}-r_{j})\Big(\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x)\big|_{\bm{\theta}_{t}}-\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{j}|x)\big|_{\bm{\theta}_{t}}\Big).
$$

Summing over all pairs and invoking the elementary identity $\sum_{1\leq i<j\leq K}(r_{i}-r_{j})(\bm{v}_{i}-\bm{v}_{j})=K\sum_{1\leq i\leq K}(r_{i}-\overline{r})\,\bm{v}_{i}$ with $\bm{v}_{i}=\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x)\big|_{\bm{\theta}_{t}}$, then putting these back into the surrogate loss defined in Eq. (6) and taking the ascent direction $\bm{g}=-\nabla_{\bm{\theta}}\widehat{L}\big|_{\bm{\theta}_{t}}$, we end up with this policy update step:
$$
\bm{g}\big(\bm{\theta};x,\{y_{i},r_{i}\}_{1\leq i\leq K}\big)=\frac{2\tau}{(1+\tau)^{2}}\cdot\frac{1}{K}\sum_{1\leq i\leq K}\big(r_{i}-\overline{r}\big)\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}\,|\,x). \tag{7}
$$
That’s it! We have just derived the group-relative REINFORCE method, but without any on-policy assumption about the distribution of training data $\{x,\{y_{i},r_{i}\}_{1≤ i≤ K}\}$ . The regularization coefficient $\tau>0$ controls the update step size; a larger $\tau$ effectively corresponds to a smaller learning rate.
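The derivation can be sanity-checked numerically: for a softmax policy over a few actions, a finite-difference gradient of the surrogate loss in Eq. (6) at $\bm{\theta}=\bm{\theta}_{t}$ should match the group-relative REINFORCE direction of Eq. (7), regardless of how the responses were sampled. A minimal sketch (ours):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
n, K, tau = 4, 5, 0.5
theta_t = rng.normal(size=n)          # current parameters theta_t
ys = rng.integers(0, n, size=K)       # K responses, from ANY distribution
r = rng.normal(size=K)                # their rewards

def L_hat(theta):
    # Surrogate loss of Eq. (6) for a softmax policy over n actions
    logp = np.log(softmax(theta))
    logp_t = np.log(softmax(theta_t))
    a = r - tau * (logp[ys] - logp_t[ys])
    pairs = sum((a[i] - a[j]) ** 2 for i in range(K) for j in range(i + 1, K))
    return pairs / (K**2 * (1 + tau) ** 2)

# Central-difference gradient of L_hat at theta = theta_t
eps, grad = 1e-6, np.zeros(n)
for d in range(n):
    e = np.zeros(n); e[d] = eps
    grad[d] = (L_hat(theta_t + e) - L_hat(theta_t - e)) / (2 * eps)

# Group-relative REINFORCE direction of Eq. (7)
p = softmax(theta_t)
g = np.zeros(n)
for y, ri in zip(ys, r):
    onehot = np.zeros(n); onehot[y] = 1.0
    g += (ri - r.mean()) * (onehot - p)
g *= 2 * tau / ((1 + tau) ** 2 * K)

assert np.allclose(-grad, g, atol=1e-5)  # one step of Eq. (6) equals Eq. (7)
```

Note that no on-policy assumption enters anywhere: `ys` can come from an arbitrary behavior distribution and the identity still holds.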
Summary and remarks.
Figure 1 visualizes the proposed interpretation of what REINFORCE is actually doing. The curve going through $\bm{\theta}_{t}→\bm{\theta}_{t+1}→\widetilde{\bm{\theta}}_{t+1}→\bm{\theta}^{\star}$ stands for the ideal optimization trajectory from $\bm{\theta}_{t}$ to the optimal policy model parametrized by $\bm{\theta}^{\star}$ , if the algorithm solves each intermediate surrogate objective $J(\bm{\theta};\pi_{\bm{\theta}_{t}})$ / surrogate loss $\widehat{L}(\bm{\theta};\pi_{\bm{\theta}_{t}})$ exactly at each iteration $t$ . In comparison, REINFORCE is effectively taking a single gradient step of the surrogate loss and immediately moving on to the next iteration $\bm{\theta}_{t+1}$ with a new surrogate objective.
Two remarks are in order. (1) Our derivation of group-relative REINFORCE can be generalized to multi-step RL settings, by replacing a response $y$ in the previous analysis with a full trajectory consisting of multiple turns of agent-environment interaction. For example, regarding the surrogate objective in Eq. (3), we need to replace the response-level reward and KL divergence with their trajectory-level counterparts. Interested readers may refer to Appendix A for the full analysis. (2) The above analysis suggests that we might interpret group-relative REINFORCE from a pointwise or pairwise perspective. While the policy update in Eq. (7) is stated in a pointwise manner, we have also seen that, at each iteration, REINFORCE is implicitly enforcing the pairwise consistency condition in Eq. (5) among multiple responses. This allows us the flexibility to choose whichever perspective offers more intuition for our analysis later in this work.
<details>
<summary>x1.png Details</summary>

Figure content: starting from $\bm{\theta}_{t}$, a single gradient step $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\nabla_{\bm{\theta}}\widehat{L}(\bm{\theta};\pi_{\bm{\theta}_{t}})\big|_{\bm{\theta}=\bm{\theta}_{t}}$ (blue arrow) is contrasted with the exact surrogate solution $\widetilde{\bm{\theta}}_{t+1}=\arg\max_{\bm{\theta}}J(\bm{\theta};\pi_{\bm{\theta}_{t}})=\arg\min_{\bm{\theta}}\widehat{L}(\bm{\theta};\pi_{\bm{\theta}_{t}})$ (purple, dashed), along the idealized trajectory $\bm{\theta}_{t}\rightarrow\widetilde{\bm{\theta}}_{t+1}\rightarrow\bm{\theta}^{\star}$ toward the optimal parameters $\bm{\theta}^{\star}$.

</details>
Figure 1: A visualization of our off-policy interpretation for group-relative REINFORCE. Here $\widehat{L}(\bm{\theta};\pi_{\bm{\theta}_{t}})={\mathbb{E}}_{x\sim\widehat{D}}[\widehat{L}(\bm{\theta};x,\pi_{\bm{\theta}_{t}})]$ , where $\widehat{D}$ is the sampling distribution for prompts, and $\widehat{L}(\bm{\theta};x,\pi_{\bm{\theta}_{t}})$ is the surrogate loss defined in Eq. (6) for a specific prompt $x$ .
3 Pitfalls and augmentations
Although we have provided a native off-policy interpretation for REINFORCE, it certainly does not guarantee convergence to the optimal policy when given arbitrary training data. This section identifies pitfalls that could undermine vanilla REINFORCE, which motivate two principles for augmentations in off-policy settings.
Pitfalls of vanilla REINFORCE.
In Figure 1, we might expect that ideally, (1) $\widetilde{\bm{\theta}}_{t+1}-\bm{\theta}_{t}$ aligns with the direction of $\bm{\theta}^{\star}-\bm{\theta}_{t}$ ; and (2) $\bm{\theta}_{t+1}-\bm{\theta}_{t}$ aligns with the direction of $\widetilde{\bm{\theta}}_{t+1}-\bm{\theta}_{t}$ . One pitfall, however, is that even if both conditions hold, they do not necessarily imply that $\bm{\theta}_{t+1}-\bm{\theta}_{t}$ should align well with $\bm{\theta}^{\star}-\bm{\theta}_{t}$ . That is, $\langle\widetilde{\bm{\theta}}_{t+1}-\bm{\theta}_{t},\bm{\theta}^{\star}-\bm{\theta}_{t}\rangle>0$ and $\langle\bm{\theta}_{t+1}-\bm{\theta}_{t},\widetilde{\bm{\theta}}_{t+1}-\bm{\theta}_{t}\rangle>0$ do not imply $\langle\bm{\theta}_{t+1}-\bm{\theta}_{t},\bm{\theta}^{\star}-\bm{\theta}_{t}\rangle>0$ . Moreover, it is possible that $\bm{\theta}_{t+1}-\bm{\theta}_{t}$ might not align well with $\widetilde{\bm{\theta}}_{t+1}-\bm{\theta}_{t}$ . Recall from Eq. (7) that, from $\bm{\theta}_{t}$ to $\bm{\theta}_{t+1}$ , we take one gradient step for a surrogate loss that enforces the pairwise consistency condition among a finite number of samples. Given the enormous action space of an LLM, some implicit assumptions about the training data (e.g., balancedness and coverage) would be needed to ensure that the gradient aligns well with the direction towards the optimum of the surrogate objective, namely $\widetilde{\bm{\theta}}_{t+1}-\bm{\theta}_{t}$ .
In fact, without a mechanism that ensures boundedness of policy update under a sub-optimal data distribution, vanilla REINFORCE could eventually converge to a sub-optimal policy. Let us show this with a minimal example in a didactic 3-arm bandit setting. Suppose that there are three actions $\{a_{j}\}_{1≤ j≤ 3}$ with rewards $\{r(a_{j})\}$ . Consider $K$ training samples $\{y_{i}\}_{1≤ i≤ K}$ , where $y_{i}∈\{a_{j}\}_{1≤ j≤ 3}$ is sampled from some behavior policy $\pi_{\textsf{b}}$ . Denote by $\mu_{r}\coloneqq\sum_{1≤ j≤ 3}\pi_{\textsf{b}}(a_{j})r(a_{j})$ the expected reward under $\pi_{\textsf{b}}$ , and $\overline{r}\coloneqq\sum_{i}r(y_{i})/K$ the average reward of training samples. We consider the softmax parameterization, i.e., $\pi_{\bm{\theta}}(a_{j})=e^{\theta_{j}}/\sum_{\ell}e^{\theta_{\ell}}$ for a policy parameterized by $\bm{\theta}∈{\mathbb{R}}^{3}$ . A standard fact is that $∇_{\bm{\theta}}\log\pi_{\bm{\theta}}(a_{j})=\bm{e}_{j}-\pi_{\bm{\theta}}$ , where $\bm{e}_{j}∈{\mathbb{R}}^{3}$ is a one-hot vector with value 1 at entry $j$ . Now we examine the policy update direction of REINFORCE, as $K→∞$ :
$$
\bm{g}=\frac{1}{K}\sum_{1\leq i\leq K}(r(y_{i})-\overline{r})\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i})\;\longrightarrow\;\sum_{1\leq j\leq 3}\pi_{\textsf{b}}(a_{j})(r(a_{j})-\mu_{r})\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(a_{j}).
$$
For example, if $\bm{r}=[r(a_{j})]_{1\leq j\leq 3}=[0,0.8,1]$ and $\pi_{\textsf{b}}=[0.3,0.6,0.1]$, then a basic calculation gives $\mu_{r}=0.58$, $\bm{r}-\mu_{r}=[-0.58,0.22,0.42]$, and finally $g_{2}=\pi_{\textsf{b}}(a_{2})(r(a_{2})-\mu_{r})=0.132>0.042=\pi_{\textsf{b}}(a_{3})(r(a_{3})-\mu_{r})=g_{3}$, which implies that the policy will converge to the sub-optimal action $a_{2}$ rather than the optimal action $a_{3}$.
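This counterexample is easy to reproduce. The sketch below (our illustration) runs the population ($K\to\infty$) REINFORCE update under the fixed behavior policy and confirms that the softmax policy concentrates on the sub-optimal arm $a_{2}$:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

r = np.array([0.0, 0.8, 1.0])        # arm rewards; a_3 is optimal
pi_b = np.array([0.3, 0.6, 0.1])     # fixed behavior policy
mu_r = pi_b @ r                      # expected reward under pi_b (= 0.58)

theta = np.zeros(3)
for _ in range(2000):
    # Population REINFORCE update under pi_b with baseline mu_r
    g = np.zeros(3)
    for j in range(3):
        onehot = np.zeros(3); onehot[j] = 1.0
        g += pi_b[j] * (r[j] - mu_r) * (onehot - softmax(theta))
    theta += 0.1 * g

pi = softmax(theta)
assert pi.argmax() == 1              # converges to sub-optimal a_2, not a_3
```

Because the weights $\pi_{\textsf{b}}(a_{j})(r(a_{j})-\mu_{r})$ sum to zero, the update direction here is constant across iterations, so $\theta_{2}$ grows fastest and the policy locks onto $a_{2}$.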
Two principles for augmenting REINFORCE.
The identified pitfalls of vanilla REINFORCE suggest two general principles for augmenting REINFORCE in off-policy scenarios:
- One is to regularize the policy update step, ensuring that the optimization trajectory remains bounded and reasonably stable when given training data from a sub-optimal distribution;
- The other is to steer the policy update direction, by actively weighting the training samples rather than naively using them as is.
These two principles are not mutually exclusive, and might be integrated within a single algorithm. We will see in the next section that many RL algorithms can be viewed as instantiations of them.
4 Rethinking the rationales behind recent RL algorithms
This section revisits various RL algorithms through a unified lens — the native off-policy interpretation of group-relative REINFORCE and its augmentations — and demystifies some common myths about their working mechanisms. Our main findings are summarized as follows:
| ID | Finding | Analysis & Experiments |
| --- | --- | --- |
| F1 | GRPO’s effectiveness in off-policy settings stems from clipping as regularization rather than importance sampling. A wider clipping range than usual often accelerates training without harming stability. | Section 4.1, Figures 3, 4, 6, 9, 10 |
| F2 | Kimi’s OPMD and Meta’s AsymRE can be interpreted as REINFORCE loss + regularization loss, a perspective that is complementary to the rationales in their original papers. | Section 4.2, Figure 11 |
| F3 | Data-oriented heuristics — such as dropping excess negatives or up-weighting high-reward rollouts — fit naturally into our off-policy view and show strong empirical performance. | Section 4.3, Figures 5, 6, 7 |
Experimental setup.
We primarily consider two off-policy settings in our experiments, specified by the sync_interval and sync_offset parameters in the Trinity-RFT framework (Pan et al., 2025). sync_interval specifies the number of generated rollout batches (each corresponding to one gradient step) between two model synchronization operations, while sync_offset specifies the relative lag between the generation and consumption of each batch. These parameters can be deliberately set to large values in practice, for improving training efficiency via pipeline parallelism and reduced frequency of model synchronization. In addition, $\texttt{sync\_offset}>1$ serves to simulate realistic scenarios where environmental feedback could be delayed. We also consider a stress-test setting that only allows access to offline data generated by the initial policy model. See Figure 2 for an illustration of these off-policy settings, and Appendix B.2 for further details.
We conduct experiments on math reasoning tasks like GSM8k (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) and Guru (math subset) (Cheng et al., 2025), as well as tool-use tasks like ToolACE (Liu et al., 2025a). We consider models of different families and scales, including Qwen2.5-1.5B-Instruct, Qwen2.5-7B-Instruct (Qwen-Team, 2025), Llama-3.1-8B-Instruct, and Llama-3.2-3B-Instruct (Dubey et al., 2024). Additional experiment details can be found in Appendix B.
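To make the two off-policy modes concrete, the hypothetical helper below encodes our reading of Figure 2: staleness is the number of gradient steps separating the policy that generated a batch from the policy trained on it (the function and its semantics are illustrative, not part of Trinity-RFT):

```python
def staleness(batch_id, sync_interval=1, sync_offset=0):
    """Gradient steps between the policy that generated a batch and the
    policy being trained on it (an illustrative reading of Figure 2).

    Model weights reach the rollout workers once every sync_interval
    trained batches; sync_offset adds a constant generation/consumption lag."""
    return batch_id % sync_interval + sync_offset

# Figure 2 (left): sync_interval = 4, sync_offset = 0 -> staleness cycles 0..3
print([staleness(b, 4, 0) for b in range(8)])
# Figure 2 (right): sync_interval = 1, sync_offset = 4 -> constant lag of 4
print([staleness(b, 1, 4) for b in range(8)])
```

Under this reading, large `sync_interval` yields saw-tooth staleness within each chunk, while large `sync_offset` yields uniform staleness across all batches.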
<details>
<summary>x2.png Details</summary>

Figure content: two rollout-buffer-training pipelines over batches 0-7. Left (sync_interval = 4, sync_offset = 0): model weights are pushed to the rollout workers once every four trained batches, so batches are generated and synchronized in chunks of four. Right (sync_interval = 1, sync_offset = 4): weights are synchronized after every batch, but training on each batch lags its generation by four batches, giving a constant degree of off-policyness.

</details>
Figure 2: A visualization of the rollout-training scheduling in sync_interval = 4 (left) or sync_offset = 4 (right) modes. Each block denotes one batch of samples for one gradient step, and the number in it denotes the corresponding batch id. Training blocks are color-coded by data freshness, with lighter color indicating increasing off-policyness.
4.1 Demystifying myths about GRPO
Recall that in GRPO, the advantage for each response $y_{i}$ is defined as $A_{i}={(r_{i}-\overline{r})}/{\sigma_{r}}$ , where $\overline{r}$ and $\sigma_{r}$ denote the within-group mean and standard deviation of the rewards $\{r_{i}\}_{1≤ i≤ K}$ , respectively. We consider the practical implementation of GRPO with token-wise importance-sampling (IS) weighting and clipping. (In our experiments with GRPO, we omit KL regularization with respect to an extra reference model, as well as entropy regularization that encourages output diversity; recent works (Yu et al., 2025; Liu et al., 2025b) have shown that these practical techniques are often unnecessary.) Its loss function for a specific prompt $x$ and responses $\{y_{i}\}$ is
| | $\displaystyle\widehat{L}=\frac{1}{K}\sum_{1≤ i≤ K}\sum_{1≤ t≤|y_{i}|}\min\bigg\{\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}A_{i},\,\operatorname{clip}\Big(\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})},1-\epsilon_{\textsf{low}},1+\epsilon_{\textsf{high}}\Big)A_{i}\bigg\},$ | |
| --- | --- | --- |
where $\pi_{\textsf{old}}$ denotes the older policy version that generated this group of rollout data. The gradient of this loss can be written as (Schulman et al., 2017)
$$
\bm{g}\big(\bm{\theta};x,\{y_{i},r_{i}\}_{1\leq i\leq K}\big)=\frac{1}{K}\sum_{1\leq i\leq K}\sum_{1\leq t\leq|y_{i}|}\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})\cdot A_{i}\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}M_{i}^{t},
$$
where $M_{i}^{t}$ denotes a one-sided clipping mask:
$$
M_{i}^{t}=\mathbbm{1}\bigg(A_{i}>0,\;\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}\leq 1+\epsilon_{\textsf{high}}\bigg)+\mathbbm{1}\bigg(A_{i}<0,\;\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}\geq 1-\epsilon_{\textsf{low}}\bigg). \tag{8}
$$
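For concreteness, the per-token clipped objective and the induced gradient mask of Eq. (8) can be sketched in plain Python (a minimal sketch in our own notation; `ratio` stands for $\pi_{\bm{\theta}}/\pi_{\textsf{old}}$ at one token):

```python
def grpo_token_objective(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Per-token clipped objective: min(ratio * A, clip(ratio) * A)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)

def one_sided_mask(ratio, adv, eps_low=0.2, eps_high=0.2):
    """The mask M_i^t of Eq. (8): 1 where the unclipped branch of the min
    is active, so the ratio still receives gradient; 0 otherwise."""
    if adv > 0:
        return 1.0 if ratio <= 1.0 + eps_high else 0.0
    if adv < 0:
        return 1.0 if ratio >= 1.0 - eps_low else 0.0
    return 0.0
```

Note the one-sided behavior: a positive advantage is clipped only from above (ratio past $1+\epsilon_{\textsf{high}}$), and a negative advantage only from below (ratio under $1-\epsilon_{\textsf{low}}$).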
Ablation study with the REC series.
To isolate the roles of importance sampling and clipping, we consider a series of REINFORCE-with-Clipping (REC) algorithms. Due to space limitations, we defer our studies of additional clipping mechanisms to Appendix B.3 and focus on REC with one-sided clipping in this section. More specifically, REC-OneSide-IS removes advantage normalization in GRPO (to reduce variability), and REC-OneSide-NoIS further removes IS weighting:
| | $\displaystyle\text{{REC-OneSide-IS}:}\;\;\bm{g}$ | $\displaystyle=\frac{1}{K}\sum_{1≤ i≤ K}\sum_{1≤ t≤|y_{i}|}∇_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})·(r_{i}-\overline{r})\,\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}\,M_{i}^{t},$ | |
| --- | --- | --- | --- |
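In code, the only difference between the two ablations is whether the token-level ratio multiplies the centered reward (a minimal sketch in our own notation):

```python
def rec_oneside_weight(ratio, reward, mean_reward, eps_low=0.2, eps_high=0.2,
                       use_is=True):
    """Scalar multiplying grad log pi_theta at one token, for REC-OneSide.

    ratio       : pi_theta / pi_old at this token
    use_is=True : REC-OneSide-IS; use_is=False drops the ratio (NoIS)
    """
    adv = reward - mean_reward  # group-centered reward, no std normalization
    gated = (adv > 0 and ratio <= 1.0 + eps_high) or \
            (adv < 0 and ratio >= 1.0 - eps_low)
    if not gated:
        return 0.0
    return adv * (ratio if use_is else 1.0)
```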
(Figure content: evaluation accuracy and training reward vs. training steps under sync_interval = 20, sync_offset = 10, and offline settings, comparing REINFORCE, GRPO, and REC-OneSide variants.)
Figure 3: Empirical results for REC algorithms on GSM8k with Qwen2.5-1.5B-Instruct. Training reward curves are smoothed with a running-average window of size 3. Numbers in the legend denote clipping parameters $\epsilon_{\textsf{low}},\epsilon_{\textsf{high}}$ .
(Figure content: training reward and clipping fraction (log scale) vs. training steps with sync_interval = 20, comparing REC-OneSide, REC-TwoSide, and REC-Ring variants.)
Figure 4: Empirical results for REC on ToolACE with Llama-3.2-3B-Instruct. Training reward curves are smoothed with a running-average window of size 3. Details about REC-TwoSide and REC-Ring are provided in Appendix B.3.
Experiments.
Figure 3 presents GSM8k results with Qwen2.5-1.5B-Instruct in various off-policy settings. REC-OneSide-IS / NoIS and GRPO (with the same $\epsilon_{\textsf{low}}=\epsilon_{\textsf{high}}=0.2$ ) achieve nearly identical performance, indicating that importance sampling is non-essential, whereas the collapse of REINFORCE highlights the critical role of clipping. Radically enlarging $(\epsilon_{\textsf{low}},\epsilon_{\textsf{high}})$ to $(0.6,2.0)$ accelerates REC-OneSide-NoIS without compromising stability in both the sync_interval = 20 and sync_offset = 10 settings. Similar patterns appear in Figure 4 (ToolACE with Llama-3.2-3B-Instruct) and other results in Appendix B. As for the stress-test (“offline”) setting, Figure 3 reveals an intrinsic trade-off between the speed and stability of policy improvement, motivating future work toward algorithms that achieve both.
4.2 Understanding Kimi’s OPMD and Meta’s AsymRE
Besides clipping, another natural method is to add a regularization loss $R(·)$ to vanilla REINFORCE:
| | $\displaystyle\widehat{L}\big(\bm{\theta};x,\{y_{i},r_{i}\}_{1≤ i≤ K}\big)$ | $\displaystyle=-\frac{1}{K}\sum_{i∈[K]}(r_{i}-\overline{r})\log\pi_{\bm{\theta}}(y_{i}\,|\,x)+\tau· R\big(\bm{\theta};x,\{y_{i},r_{i}\}_{1≤ i≤ K}\big),$ | |
| --- | --- | --- | --- |
and take $\bm{g}=-∇_{\bm{\theta}}\widehat{L}$ . We show below that Kimi’s OPMD and Meta’s AsymRE are indeed special cases of this unified formula, with empirical validation of their efficacy deferred to Appendix B.5.
Kimi’s OPMD.
Kimi-Team (2025b) derives an OPMD variant by taking the logarithm of both sides of Eq. (4), which leads to a consistency condition and motivates the following surrogate loss:
$$
\widetilde{L}=\frac{1}{K}\sum_{1\leq i\leq K}\bigg(r_{i}-\tau\log Z(x,\pi_{\bm{\theta}_{t}})-\tau\,\Big(\log\pi_{\bm{\theta}}(y_{i}\,|\,x)-\log\pi_{\bm{\theta}_{t}}(y_{i}|x)\Big)\bigg)^{2}.
$$
With $K$ responses generated by $\pi_{\textsf{old}}=\pi_{\bm{\theta}_{t}}$ , the term $\tau\log Z(x,\pi_{\bm{\theta}_{t}})$ can be approximated by a finite-sample estimate $\tau\log(\sum_{i}e^{r_{i}/\tau}/K)$ , which can be further approximated by the mean reward $\overline{r}=\sum_{i}r_{i}/K$ if $\tau$ is large. With these approximations, the gradient of $\widetilde{L}$ becomes equivalent to that of the following loss (which is the final version of Kimi’s OPMD):
$$
\widehat{L}=-\frac{1}{K}\sum_{1\leq i\leq K}(r_{i}-\overline{r})\log\pi_{\bm{\theta}}(y_{i}\,|\,x)+\frac{\tau}{2K}\sum_{1\leq i\leq K}\Big(\log\pi_{\bm{\theta}}(y_{i}\,|\,x)-\log\pi_{\textsf{old}}(y_{i}\,|\,x)\Big)^{2}.
$$
In comparison, our analysis in Sections 2 and 3 suggests that this is in itself a principled loss function for off-policy RL, adding a mean-squared regularization loss to the vanilla REINFORCE loss.
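Both the large-$\tau$ approximation of $\tau\log Z$ and the final loss are easy to reproduce numerically (a sketch in our own notation, with per-response log-probabilities as scalar placeholders):

```python
import math

def soft_baseline(rewards, tau):
    """Finite-sample estimate of tau * log Z: tau * log(mean(exp(r_i / tau)))."""
    return tau * math.log(sum(math.exp(r / tau) for r in rewards) / len(rewards))

def opmd_loss(logp, logp_old, rewards, tau):
    """Final OPMD loss: REINFORCE term + (tau/2) * mean squared log-ratio.

    logp, logp_old : per-response log pi_theta(y_i|x) and log pi_old(y_i|x)
    """
    k = len(rewards)
    r_bar = sum(rewards) / k
    reinforce = -sum((r - r_bar) * lp for r, lp in zip(rewards, logp)) / k
    reg = (tau / (2 * k)) * sum((lp - lo) ** 2 for lp, lo in zip(logp, logp_old))
    return reinforce + reg
```

For large $\tau$ the soft baseline approaches the mean reward $\overline{r}$, which is the approximation used above; the regularizer vanishes when $\pi_{\bm{\theta}}=\pi_{\textsf{old}}$.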
Meta’s AsymRE.
AsymRE (Arnal et al., 2025) modifies REINFORCE by lowering the baseline in the advantage calculation (from $\overline{r}$ to $\overline{r}-\tau$ ), motivated by the intuition of prioritizing learning from positive samples and justified by a multi-armed bandit analysis in the original paper. We offer an alternative interpretation of AsymRE by rewriting its loss function:
| | $\displaystyle\widehat{L}$ | $\displaystyle=-\frac{1}{K}\sum_{i}\Big(r_{i}-(\overline{r}-\tau)\Big)\log\pi_{\bm{\theta}}(y_{i}\,|\,x)=-\frac{1}{K}\sum_{i}(r_{i}-\overline{r})\log\pi_{\bm{\theta}}(y_{i}\,|\,x)-\frac{\tau}{K}\sum_{i}\log\pi_{\bm{\theta}}(y_{i}\,|\,x).$ | |
| --- | --- | --- | --- |
Note that the first term on the right-hand side is the REINFORCE loss, and the second term serves as regularization, enforcing imitation of responses from an older version of the policy model. For the latter, we may also add a term that is independent of $\bm{\theta}$ to it and take the limit $K→∞$ :
| | $\displaystyle-\frac{1}{K}\sum_{1≤ i≤ K}\log\pi_{\bm{\theta}}(y_{i}\,|\,x)+\frac{1}{K}\sum_{1≤ i≤ K}\log\pi_{\textsf{old}}(y_{i}\,|\,x)=\frac{1}{K}\sum_{1≤ i≤ K}\log\frac{\pi_{\textsf{old}}(y_{i}\,|\,x)}{\pi_{\bm{\theta}}(y_{i}\,|\,x)}$ | |
| --- | --- | --- |
which turns out to be a finite-sample approximation of KL regularization.
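The rewriting above can be confirmed numerically (hypothetical log-probabilities; a sketch in our own notation):

```python
def asymre_loss(logp, rewards, tau):
    """AsymRE: REINFORCE with the baseline lowered from r_bar to r_bar - tau."""
    k = len(rewards)
    r_bar = sum(rewards) / k
    return -sum((r - (r_bar - tau)) * lp for r, lp in zip(rewards, logp)) / k

def reinforce_plus_imitation(logp, rewards, tau):
    """The same loss, rewritten as REINFORCE + tau * imitation of the samples."""
    k = len(rewards)
    r_bar = sum(rewards) / k
    reinforce = -sum((r - r_bar) * lp for r, lp in zip(rewards, logp)) / k
    imitation = -sum(logp) / k
    return reinforce + tau * imitation
```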
4.3 Understanding data-weighting methods
We now shift our attention to the second principle for augmenting REINFORCE, i.e., actively shaping the training data distribution.
Pairwise weighting.
Recall from Section 2 that we define the surrogate loss in Eq. (6) as an unweighted sum of pairwise mean-squared losses. However, if we have certain knowledge about which pairs are more informative for RL training, we may assign higher weights to them. This motivates generalizing $\sum_{i<j}(a_{i}-a_{j})^{2}$ to $\sum_{i<j}w_{i,j}(a_{i}-a_{j})^{2}$ , where $\{w_{i,j}\}$ are non-negative weights. Assuming that $w_{i,j}=w_{j,i}$ and following the steps in Section 2, we end up with
$$
\bm{g}\big(\bm{\theta};x,\{y_{i},r_{i}\}_{1\leq i\leq K}\big)=\frac{1}{K}\sum_{1\leq i\leq K}\Big(\sum_{1\leq j\leq K}w_{i,j}\Big)\bigg(r_{i}-\frac{\sum_{j}w_{i,j}r_{j}}{\sum_{j}w_{i,j}}\bigg)\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}\,|\,x).
$$
In the special case where $w_{i,j}=w_{i}w_{j}$ , this becomes
$$
\bm{g}=\Big(\sum_{j}w_{j}\Big)\;\frac{1}{K}\sum_{1\leq i\leq K}w_{i}\big(r_{i}-\overline{r}_{w}\big)\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}\,|\,x),\,\,\text{where}\,\,\overline{r}_{w}\coloneqq\frac{\sum_{j}w_{j}r_{j}}{\sum_{j}w_{j}}. \tag{9}
$$
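A quick way to see what the factorized weighting does: it replaces the plain group mean with a weighted baseline while keeping the per-response coefficients centered (sketch; names are ours):

```python
def weighted_coeffs(rewards, weights):
    """Per-response coefficients on grad log pi in Eq. (9), up to the
    overall factor (sum_j w_j) / K: w_i * (r_i - weighted mean)."""
    r_bar_w = sum(w * r for w, r in zip(weights, rewards)) / sum(weights)
    return [w * (r - r_bar_w) for w, r in zip(weights, rewards)]
```

The coefficients always sum to zero, and uniform weights recover the ordinary advantages $r_{i}-\overline{r}$.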
Based on this, we investigate two REINFORCE-with-data-weighting (RED) methods.
RED-Drop: sample dropping.
The idea is to use a filtered subset $\mathcal{S}\subseteq[K]$ of responses for training; for example, the Kimi-Researcher technical blog (Kimi-Team, 2025a) proposes to “discard some negative samples strategically”, as negative gradients increase the risk of entropy collapse. This is indeed a special case of Eq. (9), obtained by setting $w_{i}=\sqrt{K}/|\mathcal{S}|$ for $i∈\mathcal{S}$ and $w_{i}=0$ otherwise:
$$
\bm{g}\big(\bm{\theta};x,\{y_{i},r_{i}\}_{1\leq i\leq K}\big)=\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}(r_{i}-\overline{r}_{\mathcal{S}})\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}\,|\,x),\,\,\text{where}\,\,\overline{r}_{\mathcal{S}}=\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}r_{i}. \tag{10}
$$
While this is no longer an unbiased estimate of policy gradient even if all responses are sampled from the current policy, it is still well justified by our off-policy interpretation of REINFORCE.
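One can check that the weights $w_{i}=\sqrt{K}/|\mathcal{S}|$ indeed reduce Eq. (9) to Eq. (10) (a numeric sanity check with hypothetical rewards):

```python
import math

def eq9_coeffs(rewards, weights):
    """Full Eq. (9) coefficients: (sum_j w_j / K) * w_i * (r_i - r_bar_w)."""
    k, w_sum = len(rewards), sum(weights)
    r_bar_w = sum(w * r for w, r in zip(weights, rewards)) / w_sum
    return [(w_sum / k) * w * (r - r_bar_w) for w, r in zip(weights, rewards)]

def eq10_coeffs(rewards, kept):
    """Eq. (10): (r_i - r_bar_S) / |S| on the kept indices, 0 elsewhere."""
    r_bar_s = sum(rewards[i] for i in kept) / len(kept)
    return [(rewards[i] - r_bar_s) / len(kept) if i in kept else 0.0
            for i in range(len(rewards))]

# Weights sqrt(K)/|S| on S and 0 elsewhere reduce Eq. (9) to Eq. (10):
rewards = [1.0, 0.0, 1.0, 0.0, 0.0]
kept = {0, 1, 2}
w = [math.sqrt(len(rewards)) / len(kept) if i in kept else 0.0
     for i in range(len(rewards))]
```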
RED-Weight: pointwise loss weighting.
Another approach for prioritizing high-reward responses is to directly up-weight their gradient terms in Eq. (1a). To better understand the working mechanism of this seemingly heuristic method, we rewrite its policy update:
| | $\displaystyle\bm{g}$ | $\displaystyle=\sum_{1≤ i≤ K}w_{i}(r_{i}-\overline{r})∇_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x)=\sum_{1≤ i≤ K}w_{i}(r_{i}-\overline{r}_{w})∇_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x)+(\overline{r}_{w}-\overline{r})\sum_{1≤ i≤ K}w_{i}∇_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x).$ | |
| --- | --- | --- | --- |
This is the pairwise-weighted REINFORCE gradient in Eq. (9), plus a regularization term (weighted by $\overline{r}_{w}-\overline{r}>0$ ) that resembles the one in AsymRE but prioritizes imitating higher-reward responses, echoing the finding from the offline RL literature (Hong et al., 2023a, b) that regularizing toward high-reward trajectories can be more effective than conservatively imitating all trajectories in the dataset.
Implementation details.
Below are the concrete instantiations adopted in our empirical studies:
- RED-Drop: When the number of negative samples in a group exceeds the number of positive ones, we randomly drop the excess negatives so that positives and negatives are balanced. After this subsampling step, we recompute the advantages using the remaining samples, which are then fed into the loss.
- RED-Weight: Each sample $i$ is weighted by $w_{i}=\exp({A_{i}}/{\tau})$ , where $A_{i}$ denotes its advantage estimate and $\tau>0$ is a temperature parameter controlling the sharpness of weighting. This scheme amplifies high-advantage samples while down-weighting low-advantage ones. We fix $\tau=1$ for all experiments.
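The two instantiations above can be sketched as follows (our own minimal Python; details such as random tie-breaking among excess negatives are assumptions, not from the paper):

```python
import math
import random

def red_drop(samples):
    """Drop excess negatives so positives and negatives are balanced, then
    recompute advantages on the kept subset. samples: list of (y, r) pairs."""
    pos = [s for s in samples if s[1] > 0]
    neg = [s for s in samples if s[1] <= 0]
    if len(neg) > len(pos):
        neg = random.sample(neg, len(pos))  # assumed tie-breaking rule
    kept = pos + neg
    if not kept:
        return []
    r_bar = sum(r for _, r in kept) / len(kept)
    return [(y, r - r_bar) for y, r in kept]

def red_weight(advantages, tau=1.0):
    """Per-sample weights w_i = exp(A_i / tau)."""
    return [math.exp(a / tau) for a in advantages]
```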
(Figure content: evaluation accuracy and training reward vs. training steps, in on-policy and sync_interval = 20 settings, comparing REINFORCE, RED-Drop, RED-Weight, and REC-OneSide-NoIS (0.6, 2.0).)
Figure 5: Empirical performance of RED-Drop and RED-Weight on GSM8k with Qwen2.5-1.5B-Instruct, in both on-policy and off-policy settings. Training reward curves are smoothed with a running-average window of size 3.
Experiments.
Figure 5 presents GSM8k results with Qwen2.5-1.5B-Instruct, which confirm the efficacy of RED-Drop and RED-Weight in on/off-policy settings, comparable to REC-OneSide-NoIS with enlarged $(\epsilon_{\textsf{low}},\epsilon_{\textsf{high}})$ . Figure 6 reports larger-scale experiments on Guru-Math with Qwen2.5-7B-Instruct, where RED-Weight achieves higher rewards than GRPO, with similar KL distance to the initial policy. Figure 7 further validates the efficacy of RED-Weight on MATH with Llama-3.1-8B-Instruct; compared to GRPO and REC-OneSide-NoIS, RED-Weight achieves higher rewards with lower KL divergence, while maintaining more stable entropy and response lengths.
(Figure content: training reward and KL divergence (log scale) vs. training steps with sync_interval = 20, comparing GRPO, REC-OneSide-NoIS (0.2), and RED-Weight.)
Figure 6: Empirical results on Guru-Math with Qwen2.5-7B-Instruct. Training reward curves are smoothed with a running-average window of size 3.
<details>
<summary>x7.png Details</summary>

(Figure panels, each vs. training steps with sync_interval = 20: training reward, KL divergence, entropy, and response length for GRPO, REC-OneSide-NoIS (0.2), and RED-Weight. All three reach similar training reward; GRPO shows the highest KL divergence and RED-Weight the lowest; REC-OneSide-NoIS (0.2) shows the sharpest decreases in entropy and response length.)
</details>
<details>
<summary>x8.png Details</summary>

(Figure panel: MATH500 accuracy vs. training steps with sync_interval = 20, for GRPO, REC-OneSide-NoIS (0.2), and RED-Weight. RED-Weight improves fastest initially; all three methods reach roughly 0.50–0.51 accuracy by 400 training steps.)
</details>
Figure 7: Comparison of RED-Weight, REC-OneSide-NoIS, and GRPO on MATH with Llama-3.1-8B-Instruct. Reported metrics for training include reward, KL distance to the initial model, entropy, and response length. We also report evaluation accuracy on the MATH500 subset.
5 Related work
Off-policy RL for LLMs has been studied from various perspectives. Importance sampling has long been considered one foundational mechanism for off-policy RL; besides PPO and GRPO, recent extensions include GSPO (Zheng et al., 2025) and GMPO (Zhao et al., 2025) that work with sequence-wise probability ratios, CISPO (Chen et al., 2025) that clips probability ratios rather than token updates, decoupled PPO (Fu et al., 2025a) that adapts PPO to asynchronous RL, among others. AsymRE (Arnal et al., 2025) offers an alternative baseline-shift approach (with ad-hoc analysis for discrete bandit settings), while OPMD (Kimi-Team, 2025b) partly overlaps with our analysis up to Eq. (4) before diverging, as discussed earlier in Section 4.2. Contrastive Policy Gradient (Flet-Berliac et al., 2024) overlaps with our analysis up to Eq. (6), but it requires paired responses within the same micro-batch (in order to optimize the pairwise surrogate loss), rendering it less infra-friendly than REINFORCE variants. Other perspectives include learning dynamics of DPO and SFT (Ren and Sutherland, 2025), training offline loss functions with negative gradients on on-policy data (Tajwar et al., 2024), or improving generalization of SFT via probability-aware rescaling (Wu et al., 2025). Another line of research integrates expert data into online RL (Yan et al., 2025; Zhang et al., 2025c; Fu et al., 2025b). Our work contributes complementary perspectives to this growing toolkit for off-policy LLM-RL.
6 Limitations and future work
While our work offers a new off-policy interpretation for group-relative REINFORCE and shows its broad implications for LLM-RL, several limitations remain. (1) Our current analysis covers single/multi-step RL with response/trajectory-level rewards, and assumes access to multiple rollouts per query. Future work may expand its scope and applicability, e.g., generalizing to settings with step-level rewards or only one rollout per query. (2) Our analysis lacks formal guarantees for policy improvement or convergence. Future work may identify distributional assumptions that yield provable guarantees for REINFORCE variants in off-policy settings. (3) Our experiments focus on settings where training data is generated by older policy versions. Extensions to broader off-policy settings (e.g., advanced experience synthesis or incorporation of expert data) may reveal new insights. Addressing these limitations will further solidify the theoretical foundation and advance principled algorithm design for off-policy LLM-RL.
References
- Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 22–31. PMLR, 2017.
- An et al. [2025] Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. POLARIS: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris.
- Arnal et al. [2025] Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, and Remi Munos. Asymmetric REINFORCE for off-policy reinforcement learning: Balancing positive and negative rewards. arXiv preprint arXiv:2506.20520, 2025.
- Bai et al. [2022] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Chen et al. [2025] Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025.
- Cheng et al. [2025] Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, and Zhiting Hu. Revisiting reinforcement learning for LLM reasoning from a cross-domain perspective. arXiv preprint arXiv:2506.14965, 2025.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv, 2021.
- Da et al. [2025] Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, and Sean Hendryx. Agent-RLVR: Training software engineering agents via guidance and environment rewards. arXiv, 2025.
- DeepSeek-AI [2025] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv, 2025.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv, 2024.
- Flet-Berliac et al. [2024] Yannis Flet-Berliac, Nathan Grinsztajn, Florian Strub, Eugene Choi, Bill Wu, Chris Cremer, Arash Ahmadian, Yash Chandak, Mohammad Gheshlaghi Azar, Olivier Pietquin, and Matthieu Geist. Contrastive policy gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21353–21370, 2024.
- Fragkiadaki [2018] Katerina Fragkiadaki. Natural policy gradients, TRPO, PPO. https://www.andrew.cmu.edu/course/10-703/slides/Lecture_NaturalPolicyGradientsTRPOPPO.pdf, 2018.
- Fu et al. [2025a] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv, 2025a.
- Fu et al. [2025b] Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. SRFT: A single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv, 2025b.
- Gao et al. [2025] Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv, 2025.
- Guo et al. [2025] Yongxin Guo, Wenbo Deng, Zhenglin Cheng, and Xiaoying Tang. G 2 RPO-A: Guided group relative policy optimization with adaptive guidance. arXiv, 2025.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.
- Hong et al. [2023a] Zhang-Wei Hong, Pulkit Agrawal, Remi Tachet des Combes, and Romain Laroche. Harnessing mixed offline reinforcement learning datasets via trajectory weighting. In The Eleventh International Conference on Learning Representations, 2023a.
- Hong et al. [2023b] Zhang-Wei Hong, Aviral Kumar, Sathwik Karnik, Abhishek Bhandwaldar, Akash Srivastava, Joni Pajarinen, Romain Laroche, Abhishek Gupta, and Pulkit Agrawal. Beyond uniform sampling: Offline reinforcement learning with imbalanced datasets. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
- Hu et al. [2024] Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024.
- Kakade and Langford [2002] Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
- Kimi-Team [2025a] Kimi-Team. Kimi-Researcher. https://moonshotai.github.io/Kimi-Researcher, 2025a.
- Kimi-Team [2025b] Kimi-Team. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025b.
- Korbak et al. [2022] Tomasz Korbak, Ethan Perez, and Christopher L. Buckley. RL with KL penalties is better viewed as bayesian inference. In Conference on Empirical Methods in Natural Language Processing, 2022.
- Liang et al. [2025] Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, and Weizhu Chen. SwS: Self-aware weakness-driven problem synthesis in reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.08989, 2025.
- Liu et al. [2025a] Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, and Enhong Chen. ToolACE: Winning the points of LLM function calling. In The Thirteenth International Conference on Learning Representations, 2025a.
- Liu et al. [2025b] Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, and Bo Zheng. Part I: Tricks or traps? a deep dive into RL for LLM reasoning. arXiv preprint arXiv:2508.08221, 2025b.
- Nachum et al. [2017] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In NIPS, 2017.
- Noukhovitch et al. [2025] Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous RLHF: Faster and more efficient off-policy RL for language models. In The Thirteenth International Conference on Learning Representations, 2025.
- OpenAI [2024] OpenAI. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- Ouyang et al. [2022] Long Ouyang, Pamela Mishkin, Jeff Wu, C L Mar, Jacob Hilton, Amanda Askell, and Paul Christiano. Training language models to follow instructions with human feedback. arXiv, 2022.
- Pan et al. [2025] Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. Trinity-RFT: A general-purpose and unified framework for reinforcement fine-tuning of large language models. arXiv preprint arXiv:2505.17826, 2025.
- Qwen-Team [2025] Qwen-Team. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Ren and Sutherland [2025] Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. In The Thirteenth International Conference on Learning Representations, 2025.
- Richemond et al. [2024] Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, and Bilal Piot. Offline regularised reinforcement learning for large language models alignment. arXiv preprint arXiv:2405.19107, 2024.
- Rolnick et al. [2019] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, volume 32, 2019.
- Schaul et al. [2016] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2016.
- Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Junxiao Song, Runxin Xu, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024.
- Sheng et al. [2024] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv, 2024.
- Silver and Sutton [2025] David Silver and Richard S. Sutton. Welcome to the era of experience. https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf, 2025.
- Sutton et al. [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
- Tajwar et al. [2024] Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of LLMs should leverage suboptimal, on-policy data. In Forty-first International Conference on Machine Learning, 2024.
- Vaswani et al. [2022] Sharan Vaswani, Olivier Bachem, Simone Totaro, Robert Müller, Shivam Garg, Matthieu Geist, Marlos C. Machado, Pablo Samuel Castro, and Nicolas Le Roux. A functional mirror ascent view of policy gradient methods with function approximation. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 151, Valencia, Spain, 2022. PMLR.
- von Werra et al. [2020] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
- Wang et al. [2025] Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122, 2025.
- Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
- Wu et al. [2025] Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of SFT: A reinforcement learning perspective with reward rectification. arXiv preprint arXiv:2508.05629, 2025.
- Yan et al. [2025] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.
- Yao et al. [2025] Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient RL framework secretly brings you off-policy RL training. https://fengyao.notion.site/off-policy-rl, 2025.
- Yu et al. [2025] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, and Mingxuan Wang. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- Zhang et al. [2025a] Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinforcement learning for LLMs: A survey, 2025a.
- Zhang et al. [2025b] Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025b.
- Zhang et al. [2025c] Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. arXiv preprint arXiv:2508.11408, 2025c.
- Zhao et al. [2025] Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, and Furu Wei. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025.
- Zheng et al. [2025] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
Appendix A Extending Section 2.2 to multi-step RL
This section extends the off-policy interpretation proposed in Section 2.2 to multi-step RL settings. We start by introducing some notation. In multi-step RL, the initial prompt $x$ is also regarded as the initial state $s^{1}=x$ . A rollout trajectory, consisting of multiple turns of agent-environment interaction, is denoted by
$$
\mathcal{T}=(s^{1},a^{1},s^{2},a^{2},\dots)=(s^{\ell},a^{\ell})_{1\leq\ell\leq|\mathcal{T}|},
$$
where $s^{\ell}$ is the state and $a^{\ell}$ is the action, i.e., an LLM response (akin to $y$ in Section 2.2). Let $c^{\ell}$ denote the context up to step $\ell$ , so that $a^{\ell}\sim\pi(\cdot|c^{\ell})$ for some policy $\pi$ . Throughout this section, we consider trajectory-level rewards $r(x,\mathcal{T})$ . Let $\rho_{\bm{\theta}}(\cdot|x)$ denote the trajectory distribution induced by policy $\pi_{\bm{\theta}}$ at initial state $s^{1}=x$ .
The following analysis focuses on the $t$ -th iteration, updating the policy model from $\bm{\theta}_{t}$ to $\bm{\theta}_{t+1}$ .
Step 1: surrogate objective and consistency condition.
For the $t$ -th iteration of policy optimization, consider the following KL-regularized objective:
$$
\max_{\bm{\theta}}\quad J(\bm{\theta};\pi_{\bm{\theta}_{t}})\coloneqq{\mathbb{E}}_{x\sim D}\bigg[{\mathbb{E}}_{\mathcal{T}\sim\rho_{\bm{\theta}}(\cdot|x)}[r(x,\mathcal{T})]-\tau\cdot D_{\textsf{KL}}\big(\rho_{\bm{\theta}}(\cdot|x)\,\|\,\rho_{\bm{\theta}_{t}}(\cdot|x)\big)\bigg]. \tag{11}
$$
The optimal policy $\pi$ and the induced trajectory distribution $\rho$ satisfy the following: for any trajectory $\mathcal{T}$ ,
$$
\rho(\mathcal{T}|x)=\frac{\rho_{\bm{\theta}_{t}}(\mathcal{T}|x)\,e^{r(x,\mathcal{T})/\tau}}{Z(x,\rho_{\bm{\theta}_{t}})},\quad\text{where}\quad Z(x,\rho_{\bm{\theta}_{t}})\coloneqq\int\rho_{\bm{\theta}_{t}}(\mathcal{T}^{\prime}|x)\,e^{r(x,\mathcal{T}^{\prime})/\tau}\mathop{}\!\mathrm{d}\mathcal{T}^{\prime}={\mathbb{E}}_{\mathcal{T}^{\prime}\sim\rho_{\bm{\theta}_{t}}(\cdot|x)}\big[e^{r(x,\mathcal{T}^{\prime})/\tau}\big]. \tag{12}
$$
This is equivalent to the following: for any pair of trajectories $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ ,
$$
\frac{\rho(\mathcal{T}_{1}|x)}{\rho(\mathcal{T}_{2}|x)}=\frac{\rho_{\bm{\theta}_{t}}(\mathcal{T}_{1}|x)}{\rho_{\bm{\theta}_{t}}(\mathcal{T}_{2}|x)}\,e^{\big(r(x,\mathcal{T}_{1})-r(x,\mathcal{T}_{2})\big)/\tau}. \tag{13}
$$
Taking the logarithm of both sides and rearranging, we equivalently have
$$
r(x,\mathcal{T}_{1})-\tau\cdot\big(\log\rho(\mathcal{T}_{1}|x)-\log\rho_{\bm{\theta}_{t}}(\mathcal{T}_{1}|x)\big)=r(x,\mathcal{T}_{2})-\tau\cdot\big(\log\rho(\mathcal{T}_{2}|x)-\log\rho_{\bm{\theta}_{t}}(\mathcal{T}_{2}|x)\big). \tag{14}
$$
Note that for a trajectory $\mathcal{T}$ , we have
$$
\log\rho(\mathcal{T}|x)-\log\rho_{\bm{\theta}_{t}}(\mathcal{T}|x)=\sum_{\ell}\log\pi(a^{\ell}|c^{\ell})-\sum_{\ell}\log\pi_{\bm{\theta}_{t}}(a^{\ell}|c^{\ell})
$$
since the state-transition probability terms in $\log\rho(\mathcal{T}|x)$ and $\log\rho_{\bm{\theta}_{t}}(\mathcal{T}|x)$ cancel out.
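To make this cancellation explicit, one can factorize the trajectory log-probability into policy terms and environment-transition terms (writing $P$ for the transition kernel, a symbol introduced here only for illustration):
$$
\log\rho_{\bm{\theta}}(\mathcal{T}|x)=\sum_{\ell}\log\pi_{\bm{\theta}}(a^{\ell}|c^{\ell})+\sum_{\ell}\log P(s^{\ell+1}\,|\,s^{\ell},a^{\ell}).
$$
The second sum does not depend on $\bm{\theta}$, so it drops out of the difference $\log\rho(\mathcal{T}|x)-\log\rho_{\bm{\theta}_{t}}(\mathcal{T}|x)$.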
Step 2: surrogate loss with finite samples.
Given $K$ trajectories from the same initial state $s^{1}=x$ , we define the following mean-squared surrogate loss that enforces the consistency condition:
$$
\widehat{L}({\bm{\theta}};x,\pi_{\bm{\theta}_{t}})\coloneqq\frac{1}{K^{2}}\sum_{1\leq i<j\leq K}\frac{(a_{i}-a_{j})^{2}}{(1+\tau)^{2}},\quad\text{where}\quad a_{i}\coloneqq r(x,\mathcal{T}_{i})-\tau\Big(\sum_{\ell}\log\pi_{\bm{\theta}}(a^{\ell}_{i}|c^{\ell}_{i})-\sum_{\ell}\log\pi_{\bm{\theta}_{t}}(a^{\ell}_{i}|c^{\ell}_{i})\Big). \tag{15}
$$
With infinite samples and sufficient coverage of the action space, the minimizer of this surrogate loss would coincide with the optimal policy for the surrogate objective in Eq. (11).
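As a concrete illustration, the surrogate loss in Eq. (15) can be computed from trajectory-level rewards and summed log-probabilities. The following NumPy sketch (function and argument names are ours, not taken from the paper's released code) evaluates the loss for a single query:

```python
import numpy as np

def surrogate_loss(rewards, logp_theta, logp_theta_t, tau):
    """Finite-sample surrogate loss of Eq. (15), for one query x.

    rewards[i]      : trajectory-level reward r(x, T_i)
    logp_theta[i]   : sum_l log pi_theta(a_i^l | c_i^l) under the current policy
    logp_theta_t[i] : the same sum under the frozen policy pi_{theta_t}
    """
    a = rewards - tau * (logp_theta - logp_theta_t)
    K = len(a)
    diffs = a[:, None] - a[None, :]          # all pairwise a_i - a_j
    # Sum (a_i - a_j)^2 over i < j only, then apply the 1/K^2 and
    # 1/(1 + tau)^2 scaling of Eq. (15).
    return np.triu(diffs ** 2, k=1).sum() / (K ** 2 * (1 + tau) ** 2)
```

At $\bm{\theta}=\bm{\theta}_{t}$ the two log-probability arrays coincide, so $a_{i}=r(x,\mathcal{T}_{i})$ and the loss reduces to a scaled measure of reward dispersion within the group.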
Step 3: one gradient step of the surrogate loss.
By the same trick as in Section 2.2, we have
$$
\nabla_{\bm{\theta}}(a_{i}-a_{j})^{2}\big|_{\bm{\theta}_{t}}=-2\tau\,\big(r(x,\mathcal{T}_{i})-r(x,\mathcal{T}_{j})\big)\Big(\nabla_{\bm{\theta}}\sum_{\ell}\log\pi_{\bm{\theta}}(a^{\ell}_{i}|c^{\ell}_{i})\big|_{\bm{\theta}_{t}}-\nabla_{\bm{\theta}}\sum_{\ell}\log\pi_{\bm{\theta}}(a^{\ell}_{j}|c^{\ell}_{j})\big|_{\bm{\theta}_{t}}\Big),
$$
and
$$
\nabla_{\bm{\theta}}\sum_{1\leq i<j\leq K}\frac{(a_{i}-a_{j})^{2}}{(1+\tau)^{2}}\Big|_{\bm{\theta}_{t}}=\frac{-2\tau K}{(1+\tau)^{2}}\sum_{1\leq i\leq K}\big(r(x,\mathcal{T}_{i})-\overline{r}(x)\big)\,\nabla_{\bm{\theta}}\sum_{\ell}\log\pi_{\bm{\theta}}(a^{\ell}_{i}|c^{\ell}_{i})\Big|_{\bm{\theta}_{t}},
$$
where $\overline{r}(x)\coloneqq\sum_{1\leq j\leq K}r(x,\mathcal{T}_{j})/K$ denotes the group mean reward.
In sum, the gradient of the surrogate loss in Eq. (15) becomes:
$$
\nabla_{\bm{\theta}}\widehat{L}(\bm{\theta};x,\pi_{\bm{\theta}_{t}})\big|_{\bm{\theta}_{t}}=\frac{-2\tau}{(1+\tau)^{2}}\cdot\frac{1}{K}\sum_{1\leq i\leq K}\big(r(x,\mathcal{T}_{i})-\overline{r}(x)\big)\,\nabla_{\bm{\theta}}\sum_{\ell}\log\pi_{\bm{\theta}}(a^{\ell}_{i}|c^{\ell}_{i})\Big|_{\bm{\theta}_{t}}. \tag{16}
$$
This motivates the following policy update step:
$$
g\big(\bm{\theta};x,\{\mathcal{T}_{i},r_{i}\}_{1\leq i\leq K}\big)=\frac{2\tau}{(1+\tau)^{2}}\cdot\frac{1}{K}\sum_{1\leq i\leq K}\big(r(x,\mathcal{T}_{i})-\overline{r}(x)\big)\,\nabla_{\bm{\theta}}\sum_{1\leq\ell\leq|\mathcal{T}_{i}|}\log\pi_{\bm{\theta}}(a^{\ell}_{i}|c^{\ell}_{i}), \tag{17}
$$
which concludes our derivation of group-relative REINFORCE in multi-step RL settings.
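The pairwise-to-group-baseline reduction used in the derivation above, $\sum_{1\leq i<j\leq K}(r_{i}-r_{j})(g_{i}-g_{j})=K\sum_{1\leq i\leq K}(r_{i}-\overline{r})\,g_{i}$, holds for arbitrary vectors $g_{i}$. A quick numerical check, using random arrays as stand-ins for the per-trajectory gradients, is:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 8, 5
r = rng.normal(size=K)        # trajectory-level rewards r(x, T_i)
g = rng.normal(size=(K, d))   # stand-ins for grad_theta sum_l log pi(a_i^l | c_i^l)

# Pairwise form, as in the gradient of the surrogate loss.
lhs = sum((r[i] - r[j]) * (g[i] - g[j]) for i in range(K) for j in range(i + 1, K))

# Group-relative (advantage) form, as in Eq. (17) up to constant factors.
rhs = K * ((r - r.mean())[:, None] * g).sum(axis=0)

assert np.allclose(lhs, rhs)
```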
Appendix B Implementation details and additional experiments
We implement all algorithms with the Trinity-RFT framework [Pan et al., 2025], and run experiments on NVIDIA L20, H20, and A800 GPUs. See Tables 1 and 2 for detailed configurations of our experiments.
B.1 Dataset details
We provide additional descriptions of the datasets used in our experiments:
- GSM8k [Cobbe et al., 2021] is a widely used benchmark with 8.5k grade-school math word problems, designed to test arithmetic reasoning and step-by-step problem solving.
- MATH [Hendrycks et al., 2021] covers algebra, geometry, probability, and number theory, containing 12.5k examples in total (7.5k for training and 5k for testing); it demands advanced symbolic reasoning beyond GSM8k.
- Guru [Cheng et al., 2025] is a multi-domain reasoning dataset with 91.9k examples spanning math, code, science, logic, simulation, and tabular tasks; we use its math subset (around 54k samples), which introduces diverse problem formats for evaluating transfer of reasoning strategies.
- ToolACE [Liu et al., 2025a] is a multilingual benchmark with around 11k synthetic samples designed to evaluate LLMs’ ability to solve tasks by selecting and invoking external tools via strict JSON-formatted function calls; we use a 5k single-turn subset in our experiments.
B.2 Understanding the synchronization parameters
We parameterize rollout-training scheduling by two configuration parameters in Trinity-RFT: the synchronization interval (sync_interval) and synchronization offset (sync_offset). Their meanings are visualized in Figure 2 and explained in the following.
The parameter sync_interval specifies the number of generated rollout batches (which equals the number of gradient steps for training the policy model) between two consecutive executions of model weight synchronization. When sync_interval $=1$ , the rollout and policy models synchronize after each gradient step with one batch of samples, yielding a strictly on-policy process (if we ignore the issue of precision mismatch between rollout and training engines [Yao et al., 2025]). When sync_interval $>1$ , sync_interval rollout batches are generated with stale model weights before synchronization, which accelerates the overall RL process through pipeline parallelism but incurs off-policyness.
The parameter sync_offset specifies the lag between the generation and consumption of each batch. More specifically, sync_offset batches are generated and saved to the buffer before training is launched, which is also useful for reducing pipeline bubbles and improving hardware utilization [Noukhovitch et al., 2025]. In some of our experiments, we deliberately set sync_offset to a large value, in order to simulate a scenario where reward signals from the environment are lagged.
In general, with $(\texttt{sync\_interval},\texttt{sync\_offset})=(m,n)$ , the off-policyness of a consumed batch with zero-based index $l$ , defined as its temporal distance from the most recent synchronized policy, is $(l\bmod m)+n$ . For example, $(4,0)$ yields off-policyness $\{0,1,2,3\}$ within each interval, while $(1,4)$ yields a constant off-policyness of $4$ .
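The formula above can be expressed directly (a minimal sketch; the function name is ours):

```python
def off_policyness(batch_index, sync_interval, sync_offset):
    """Temporal distance (in gradient steps) between the policy that
    generated batch `batch_index` (zero-based) and the policy that
    consumes it, under (sync_interval, sync_offset) = (m, n)."""
    return batch_index % sync_interval + sync_offset

# (m, n) = (4, 0): off-policyness cycles through 0, 1, 2, 3.
assert [off_policyness(l, 4, 0) for l in range(4)] == [0, 1, 2, 3]
# (m, n) = (1, 4): constant off-policyness of 4.
assert all(off_policyness(l, 1, 4) == 4 for l in range(8))
```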
Table 1: Default hyperparameters. Deviations from defaults are noted in figure captions.
| | GSM8K (Qwen2.5-1.5B) | ToolACE (Llama-3.2-3B) | Guru (Qwen2.5-7B) | MATH (Llama-3.1-8B) |
| --- | --- | --- | --- | --- |
| Learning rate | $1\!\times\!10^{-6}$ | $1\!\times\!10^{-6}$ | $1\!\times\!10^{-6}$ | $5\!\times\!10^{-7}$ |
| Batch size | 96 | 96 | 64 | 64 |
| $K$ | 8 | 8 | 16 | 16 |
| Weight decay | 0.01 | 0.01 | 0.1 | 0.1 |
| Warmup steps | 0 | 0 | 80 | 40 |
| Eval temperature | 1.0 | N/A | N/A | 0.6 |
| Eval top-p | 1.0 | N/A | N/A | 1.0 |
| Figures | 3, 5, 9, 10, 11 | 4 | 6 | 7 |
Table 2: Other shared hyperparameters across all experiments.
| Parameter | Value |
| --- | --- |
| Optimizer | AdamW |
| $(\beta_{1},\beta_{2})$ | (0.9, 0.999) |
| Gradient clipping | 1.0 |
| Warmup style | constant |
| Weight-decay increment style | constant |
| Auxiliary LR decay style | exponential |
| Training inference temperature | 1.0 |
| Training inference top-p | 1.0 |
B.3 REC with different clipping mechanisms
In addition to the one-side clipping investigated in Section 4, here we compare additional clipping mechanisms for the REC series, to understand how the geometry of clipping — asymmetric vs. symmetric bounds and the presence of a zero-gradient band — affects the learning process.
REC-TwoSide-IS / NoIS.
We replace the mask $M_{i}^{t}$ in REC-OneSide-IS / NoIS in Eq. (8) with a two-side mask (it turns out that REC-TwoSide-NoIS resembles the sPPO algorithm proposed by Vaswani et al. [2022], though derived with different rationales):
$$
\widetilde{M}_{i}^{t}=\mathbbm{1}\Big(1-\epsilon_{\textsf{low}}\leq\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}\leq 1+\epsilon_{\textsf{high}}\Big). \tag{18}
$$
Two-side clipping imposes weaker regularization than one-side clipping does with the same clipping parameters $(\epsilon_{\textsf{low}},\epsilon_{\textsf{high}})$ . This can potentially improve training efficiency, but may also be risky when the probability ratio $\pi_{\bm{\theta}}/\pi_{\textsf{old}}$ drifts far from $1$ . To compensate for this, we design REC-Ring.
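As a sketch, the token-level mask in Eq. (18) amounts to a band check on the probability ratio (the function name and scalar ratios here are illustrative):

```python
def two_side_mask(ratio, eps_low=0.2, eps_high=0.2):
    """Eq. (18): the gradient is kept only when the token-level
    probability ratio pi_theta / pi_old lies inside the band
    [1 - eps_low, 1 + eps_high], regardless of the advantage sign."""
    return 1.0 if (1 - eps_low) <= ratio <= (1 + eps_high) else 0.0

assert two_side_mask(1.0) == 1.0   # on-policy ratio passes
assert two_side_mask(0.7) == 0.0   # below the band: masked out
assert two_side_mask(1.3) == 0.0   # above the band: masked out
```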
REC-Ring.
In addition to the inner band $(1-\epsilon_{\textsf{low}},1+\epsilon_{\textsf{high}})$ as in Eq. (18), we further specify outer safety margins $\epsilon_{\textsf{low}}^{\prime}\!≥\!\epsilon_{\textsf{low}}$ and $\epsilon_{\textsf{high}}^{\prime}\!≥\!\epsilon_{\textsf{high}}$ . The REC-Ring mask is:
$$
\begin{aligned}
\widehat{M}_{i}^{t} ={}& \mathbbm{1}\Big(1-\epsilon_{\textsf{low}}\leq\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}\leq 1+\epsilon_{\textsf{high}}\Big) \\
&+\mathbbm{1}\Big(A_{i}>0\text{ and }\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}\leq 1-\epsilon_{\textsf{low}}^{\prime}\Big) \\
&+\mathbbm{1}\Big(A_{i}<0\text{ and }\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}\geq 1+\epsilon_{\textsf{high}}^{\prime}\Big). \tag{19}
\end{aligned}
$$
A comparison of the clipping mechanisms is visualized in Figure 8. Note that REC-OneSide and REC-TwoSide can be regarded as special cases of REC-Ring.
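The REC-Ring mask in Eq. (19) can be sketched as follows (a minimal illustration with the clipping parameters used in our experiments; the function name is ours):

```python
def ring_mask(ratio, adv, eps=(0.2, 0.2), eps_outer=(0.6, 2.0)):
    """Eq. (19): the gradient is active inside the inner band, and is
    re-activated beyond the outer margins whenever it would pull the
    ratio back toward 1 (adv > 0 far below 1, or adv < 0 far above 1)."""
    eps_low, eps_high = eps
    eps_low_outer, eps_high_outer = eps_outer
    inner = (1 - eps_low) <= ratio <= (1 + eps_high)
    pull_up = adv > 0 and ratio <= 1 - eps_low_outer
    pull_down = adv < 0 and ratio >= 1 + eps_high_outer
    return 1.0 if (inner or pull_up or pull_down) else 0.0

assert ring_mask(1.0, adv=1.0) == 1.0   # inner band: active
assert ring_mask(0.3, adv=1.0) == 1.0   # far below 1, A > 0: active
assert ring_mask(0.5, adv=1.0) == 0.0   # in the zero-gradient ring
assert ring_mask(3.5, adv=-1.0) == 1.0  # far above 1, A < 0: active
```

Setting the outer margins sufficiently large recovers the REC-TwoSide active region, consistent with the special-case relationship noted above.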
Figure 8: A visualization of activated gradient for various REC algorithms. Here, $A$ represents the advantage of a specific token, and an arrow pointing to the right and annotated with “ $A>0$ ” means there is activated gradient that incentivizes increasing $\pi_{\bm{\theta}}$ when the token advantage is positive and the probability ratio $\pi_{\bm{\theta}}/\pi_{\textsf{old}}$ lies in the corresponding interval.
Experiments.
We compare the following algorithms: REINFORCE, GRPO, REC-TwoSide-IS, REC-TwoSide-NoIS, and REC-Ring-NoIS. Clipping parameters are set to $(\epsilon_{\textsf{low}},\epsilon_{\textsf{high}})=(0.2,0.2)$ , and for REC-Ring we additionally set $(\epsilon_{\textsf{low}}^{\prime},\epsilon_{\textsf{high}}^{\prime})=(0.6,2.0)$ .
Figure 9: Comparison of REC variants on GSM8K with Qwen2.5-1.5B-Instruct under different off-policy settings. Evaluation accuracy, training reward, KL divergence (with respect to the initial model) and clipping fraction are reported. Training reward curves are smoothed with a running-average window of size 3.
Figure 9 presents the empirical results. We observe that for REC-TwoSide, importance sampling is non-essential in all three settings, akin to the case of REC-OneSide. In addition, REC-TwoSide methods demonstrate fast policy improvement at the beginning but tend to collapse later on, whereas REC-Ring achieves a better balance of convergence speed and stability.
B.4 Ablation: the impact of learning rates
Recall that in Section 4.1, we demonstrated empirically the advantages of enlarging the clipping parameters $\epsilon_{\textsf{low}},\epsilon_{\textsf{high}}$ for REC-OneSide-NoIS. One might wonder whether the relatively weak performance of GRPO or REC-OneSide with the conventional $\epsilon_{\textsf{low}}=\epsilon_{\textsf{high}}=0.2$ is genuinely rooted in the clipping mechanism itself, or simply due to the choice of a small learning rate.
To answer this, we extend the experiment of Figure 3 by sweeping learning rates over $\{2\!\times\!10^{-6},5\!\times\!10^{-6},1\!\times\!10^{-5}\}$ . The results, illustrated in Figure 10, confirm that simply increasing the learning rate cannot bridge the performance gap between GRPO with $\epsilon_{\textsf{low}}=\epsilon_{\textsf{high}}=0.2$ and REC-OneSide-NoIS with $\epsilon_{\textsf{low}}=0.6,\epsilon_{\textsf{high}}=2.0$ . This shows that relaxing the clipping range acts as a genuine improvement of regularization, rather than merely mimicking a larger learning rate.
Figure 10: Comparison of GRPO and REC-OneSide-NoIS on GSM8K with Qwen2.5-1.5B-Instruct. Evaluation accuracy (left) and training reward (right) are reported for varying learning rates.
B.5 Experiments for OPMD and AsymRE
Figure 11 presents empirical results for OPMD and AsymRE in various off-policy settings. It is worth noting that, while the analysis and experiments in their original papers [Kimi-Team, 2025b, Arnal et al., 2025] focus on a setting that is effectively the same as our $\texttt{sync\_interval}>1$ setting, our analysis and experiments have also validated their efficacy in $\texttt{sync\_offset}>1$ scenarios.
Figure 11: Empirical results for OPMD and AsymRE (cf. Section 4.2) on GSM8K with Qwen2.5-1.5B-Instruct under various off-policy settings. The regularization coefficient for OPMD and the baseline shift for AsymRE are both $0.1$ . Training reward curves are smoothed with a running-average window of size 3.
Appendix C Summary: a unified view of various algorithms
For convenient reference, Table 3 summarizes the algorithms investigated in Section 4.
Table 3: A summary of algorithms investigated in Section 4.
| Augmentation | Algorithm | Gradient / Loss |
| --- | --- | --- |
| Regularize by clipping | GRPO | $\bm{g}=\frac{1}{K}\sum_{i}\sum_{t}\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})\cdot A_{i}\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}M_{i}^{t}$ |
| | REC-OneSide-IS | $\bm{g}=\frac{1}{K}\sum_{i}\sum_{t}\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})\cdot(r_{i}-\overline{r})\frac{\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}{\pi_{\textsf{old}}(y_{i}^{t}\,|\,x,y_{i}^{<t})}M_{i}^{t}$ |
| | REC-OneSide-NoIS | $\bm{g}=\frac{1}{K}\sum_{i}\sum_{t}\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}^{t}\,|\,x,y_{i}^{<t})\cdot(r_{i}-\overline{r})M_{i}^{t}$ |
| Add regularization loss | OPMD | $\widehat{L}=-\tfrac{1}{K}\!\sum_{i}(r_{i}-\overline{r})\log\pi_{\bm{\theta}}(y_{i}|x)+\tfrac{\tau}{2K}\!\sum_{i}(\log\pi_{\bm{\theta}}(y_{i}|x)-\log\pi_{\textsf{old}}(y_{i}|x))^{2}$ |
| | AsymRE | $\widehat{L}=-\tfrac{1}{K}\!\sum_{i}(r_{i}-\overline{r})\log\pi_{\bm{\theta}}(y_{i}|x)-\tfrac{\tau}{K}\!\sum_{i}\log\pi_{\bm{\theta}}(y_{i}|x)$ |
| Reweight data | Pairwise-weighted REINFORCE | $\bm{g}=\tfrac{1}{K}\!\sum_{i}\Big(\sum_{j}w_{i,j}\Big)\Big(r_{i}-\tfrac{\sum_{j}w_{i,j}r_{j}}{\sum_{j}w_{i,j}}\Big)\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x)$ |
| | RED-Drop | $\bm{g}=\tfrac{1}{|S|}\!\sum_{i\in S}(r_{i}-\overline{r}_{S})\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x)$ |
| | RED-Weight | $\bm{g}=\sum_{i}w_{i}(r_{i}-\overline{r})\,\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(y_{i}|x),\;\;w_{i}=\exp(A_{i}/\tau)$ |
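As an illustration of the data-reweighting rows in Table 3, the RED-Weight per-sample coefficients $w_{i}(r_{i}-\overline{r})$ can be computed as follows (a minimal sketch with illustrative rewards; for this sketch we take $A_{i}=r_{i}-\overline{r}$ as the advantage, which is an assumption on our part):

```python
import math

def red_weight_coefs(rewards, tau=0.1):
    """RED-Weight (Table 3): each sample's gradient term is scaled by
    w_i * (r_i - r_bar), with w_i = exp(A_i / tau).
    Assumption in this sketch: A_i = r_i - r_bar."""
    K = len(rewards)
    r_bar = sum(rewards) / K
    weights = [math.exp((r - r_bar) / tau) for r in rewards]
    return [w * (r - r_bar) for w, r in zip(weights, rewards)]

g_coefs = red_weight_coefs([1.0, 0.0, 0.0, 1.0], tau=1.0)
# Unlike plain group-relative REINFORCE, the exponential weighting
# up-weights high-advantage samples, so the coefficients no longer
# sum to zero: positive terms dominate the negative ones.
```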