# G2RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size language models (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs' inherent weaknesses. Through a comprehensive study of various guidance configurations, we find that naively adding guidance delivers limited gains. These insights motivate G 2 RPO-A, an adaptive algorithm that automatically adjusts guidance strength in response to the model's evolving training dynamics. Experiments on mathematical reasoning and code-generation benchmarks confirm that G 2 RPO-A substantially outperforms vanilla GRPO. Our code and models are available at https://github.com/T-Lab-CUHKSZ/G2RPO-A.
1 Introduction
Recent advancements in reasoning-centric large language models (LLMs), exemplified by DeepSeek-R1 Guo et al. (2025), OpenAI-o1 Jaech et al. (2024), and Qwen3 Yang et al. (2025a), have significantly expanded the performance boundaries of LLMs, showcasing the immense potential of reasoning-enhanced models. Building upon robust base models with comprehensive world knowledge, these reasoning-focused LLMs have achieved breakthrough progress in complex domains such as mathematics Guan et al. (2025), coding Souza et al. (2025); HUANG et al. (2025), and other grounding tasks (Li et al., 2025b; Wei et al., 2025). At the core of this success lies Reinforcement Learning with Verifiable Rewards (RLVR) (Shao et al., 2024; Chu et al., 2025; Liu et al., 2025). This innovative approach, which employs reinforcement learning techniques in LLMs using rule-based outcome rewards, has garnered significant attention in the AI community. RLVR has demonstrated remarkable improvement in generalization across a wide spectrum of downstream tasks (Jia et al., 2025; Wu et al., 2025), positioning it as a pivotal advancement in the field of artificial intelligence.
As the de-facto algorithm, Group Relative Policy Optimization (GRPO) (Shao et al., 2024) improves upon Proximal Policy Optimization (PPO) (Schulman et al., 2017) by removing the need for a critic model through inner-group response comparison, thereby speeding up the training.
<details>
<summary>x1.png Details</summary>

Line chart comparing the accuracy reward of "With simple guidance" (blue) and "Original GRPO" (red) over roughly 34 global training steps (y-axis: accuracy reward, 0.45 to 0.60). Both curves dip to roughly 0.46-0.53 during the early and middle steps and rise sharply toward the end, finishing at about 0.58 (simple guidance) and 0.59 (original GRPO), so the two methods end essentially indistinguishable.
</details>
Figure 1: Naive guidance does not help. Using Qwen2.5-Math-7B as the base model, we train it on the s1K-1.1 dataset for a single epoch with a simple, fixed-length guidance (naive guidance). The naive guidance method shows a temporary increase in the accuracy reward during the early training stages, but it quickly becomes indistinguishable from the vanilla GRPO curve.
<details>
<summary>x2.png Details</summary>

Screenshot of a math problem and an LLM response in three panels: the problem statement (a sequence `{a_n}` of positive numbers with partial sums `b_n` and partial products `c_n` satisfying `b_n + 2c_n = 1` for every positive integer `n`; the goal is to find the term of `{1/a_n}` closest to 2013), a "Guidance" panel that parses the problem step by step and notes that all `b_n` and `c_n` are positive, and the LLM's continuation, which restates the definitions and highlights the constraint `b_n + 2c_n = 1`. A question-mark icon marks the user's query and a robot icon marks the model's response.
</details>
Figure 2: Illustration of roll-outs with guidance. An example of using high-quality thinking trajectories to guide models.
The capacity of small-size LLMs limits the performance gains of GRPO.
Despite GRPO's success with large-scale LLMs, its effectiveness is significantly constrained when applied to smaller LLMs. Recent research (Ye et al., 2025; Muennighoff et al., 2025a) reveals that GRPO's performance gains depend heavily on the base model's capacity Bae et al. (2025); Xu et al. (2025); Zhuang et al. (2025). Consequently, small-scale LLMs (SLMs) show limited improvement under GRPO (Tables 3 and 8), exposing a critical scalability challenge in enhancing reasoning capabilities across diverse model sizes. To address this challenge, researchers have explored various approaches: distillation Guo et al. (2025), multi-stage training Xu et al. (2025) prior to RLVR, and selective sample filtering Xiong et al. (2025); Shi et al. (2025). However, these methods either precede RLVR or suffer performance degradation on complex problems (Table 7). Consequently, optimizing the RLVR process for efficient learning in SLMs remains an open challenge, representing a critical frontier in AI research.
Adaptive guidance as a solution.
We propose incorporating guidance into the roll-out process to facilitate the generation of high-quality, reward-worthy candidates (Figure 2). However, our initial findings revealed that applying simple fixed-length guidance to the prompts (naive guidance) failed to improve overall performance (Figure 1). Through a comprehensive analysis of the guidance mechanism, varying both the proportion of guided roll-outs within GRPO batches and the guidance length over training epochs, we obtained two key findings: (1) Code-generation tasks benefit from a higher guidance ratio than mathematical reasoning tasks, and smaller models likewise require more guidance than larger ones. (2) The optimal guidance length evolves throughout training and is highly context-dependent, rendering simple, predefined schedules ineffective. In response, we introduce the Guided Group Relative Policy Optimization with Adaptive Guidance (G 2 RPO-A) algorithm. This approach dynamically adjusts the guidance length based on the model's real-time learning state, offering a principled solution to the challenge of enhancing small-size LLMs' performance in RLVR processes. The key contributions of this paper are summarized as follows:
- We enhance GRPO for small-scale LLMs by injecting guidance into the rollout thinking trajectories and conduct a systematic analysis of the effects of key guidance configurations, specifically focusing on the guidance ratio and guidance length.
- Our study also examines the importance of hard training samples. We find that integrating these samples into the dataset using a curriculum learning approach, and aided by the guidance mechanism, significantly boosts the training efficiency of our method for SLMs.
- Drawing on these findings, we introduce G 2 RPO-A, an adaptive algorithm that automatically adjusts guidance length in response to the evolving training state. Our experimental results demonstrate the effectiveness of the proposed G 2 RPO-A algorithm.
- We evaluate our method on mathematical reasoning and coding tasks with several modelsâincluding the Qwen3 series, DeepSeek-Math-7B-Base, and DeepSeek-Coder-6.7B-Baseâand observe substantial performance gains over both vanilla GRPO and simple guided baselines.
2 Related Works
<details>
<summary>x3.png Details</summary>

Diagram of the roll-out pipeline: a query, a ground truth, and a thinking trajectory are fed to the SLM, which produces guided completions (g.1 ... g.k, blue) and unguided completions (o.1 ... o.n, purple). The rewards of both sets are then compared to decide the guidance strength for subsequent steps and to generate the current responses.
</details>
Figure 3: Overview of G 2 RPO-A. At each step, we split roll-outs into a guided set and an unguided set. We then compare the current rewards with those from previous steps; the resulting ratio determines the future guidance length.
The introduction of chain-of-thought (CoT) prompting has markedly improved LLM performance on complex reasoning tasks (Wei et al., 2022; Kojima et al., 2022). Complementing this advance, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful training paradigm for reasoning-centric language models Yue et al. (2025); Lee et al. (2024). The de-facto RLVR algorithm, Group Relative Policy Optimization (GRPO) (Guo et al., 2025), delivers strong gains on various benchmarks while remaining training-efficient because it dispenses with the need for a separate critic network Wen et al. (2025); Shao et al. (2024). Recent efforts to improve GRPO have explored several directions. Some approaches focus on refining the core GRPO objective, either by pruning candidate completions Lin et al. (2025) or by removing normalization biases Liu et al. (2025). Separately, other studies aim to enhance the training signal and stability. DAPO Yu et al. (2025), for instance, introduces dense, step-wise advantage signals and decouples the actor-critic training to mitigate reward sparsity.
However, adapting GRPO-style algorithms to small-scale LLMs remains challenging due to the sparse-reward problem Lu et al. (2024); Nguyen et al. (2024); Dang and Ngo (2025). Recent studies have therefore focused on improved reward estimation Cui et al. (2025). TreeRPO Yang et al. (2025b) uses a tree-structured sampling procedure to approximate step-wise expected rewards, and Hint-GRPO Huang et al. (2025) applies several reward-shaping techniques. Other lines of research investigate knowledge distillation (Guo et al., 2025), multi-stage pre-training before RLVR Xu et al. (2025), and selective sample filtering Xiong et al. (2025); Shi et al. (2025). In our experiments, however, these filtering or sampling strategies either operate only before RLVR or fail to improve performance on more complex tasks. In this paper, we introduce a guidance mechanism that injects ground-truth reasoning steps directly into the model's roll-out trajectories during RL training. Because the guidance is provided online, the proposed method can still learn effectively from difficult examples while mitigating the sparse-reward issue.
The role of guidance in GRPO-style training remains underexplored. Two concurrent studies have addressed related questions (Nath et al., 2025; Park et al., 2025), but both simply append guidance tokens to the input prompt, offering neither a systematic analysis of guidance configurations nor a mechanism that adapts to the changing training state of the model. We show that naive guidance often fails to improve performance because it yields low expected advantage. To remedy this, we provide a comprehensive examination of how guidance length, ratio, and scheduling affect learning, and we introduce G 2 RPO-A, an adaptive strategy that dynamically adjusts guidance strength throughout training.
3 Preliminary
Group Relative Policy Optimization (GRPO).
Given a prompt, GRPO Shao et al. (2024) samples $G$ completions and computes their rewards $\{r_{i}\}_{i=1}^{G}$. Denoting the $t^{\text{th}}$ token of the $i^{\text{th}}$ completion as $o_{i,t}$, GRPO assigns an advantage $\hat{A}_{i,t}$ to it. The optimization objective is defined as:
$$
\begin{split}\mathcal{L}_{\text{GRPO}}(\theta)=-\frac{1}{\sum_{i=1}^{G}|o_{i}|}\sum_{i=1}^{G}\sum_{t=1}^{|o_{i}|}\biggl{[}\min\biggl{(}w_{i,t}\hat{A}_{i,t},\text{clip}\left(w_{i,t},1-\epsilon,1+\epsilon\right)\hat{A}_{i,t}\biggr{)}-\beta\mathcal{D}_{\text{KL}}(\pi_{\theta}\|\pi_{\text{ref}})\biggr{]},\end{split} \tag{1}
$$
where the importance weight $w_{i,t}$ is given by
$$
w_{i,t}=\frac{\pi_{\theta}(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})}. \tag{2}
$$
The clipping threshold $\epsilon$ controls the update magnitude, $\pi_{\theta_{\text{old}}}$ denotes the policy from the previous optimization step, and $\beta$ weights the KL divergence $\mathcal{D}_{\text{KL}}$, whose detailed definition can be found in the Detailed Equations section of the Appendix.
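To make the objective concrete, the following sketch computes group-relative advantages and the clipped token-level surrogate of Equation 1 for a single group of completions. It is a minimal illustration only: the group-normalized advantage, the padding mask, and the omission of the KL term are our assumptions, not the released implementation.

```python
import torch

def grpo_group_loss(logp_new, logp_old, rewards, mask, eps=0.2):
    """Minimal GRPO surrogate for one group of G completions.

    logp_new, logp_old: (G, T) per-token log-probs under pi_theta / pi_theta_old
    rewards:            (G,)   scalar reward per completion
    mask:               (G, T) 1.0 for real completion tokens, 0.0 for padding
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    adv = adv[:, None].expand_as(logp_new)                       # broadcast to every token

    # Token-level importance weight w_{i,t} (Eq. 2).
    w = torch.exp(logp_new - logp_old)

    # Clipped surrogate of Eq. 1 (KL penalty omitted for brevity).
    unclipped = w * adv
    clipped = torch.clamp(w, 1 - eps, 1 + eps) * adv
    per_token = torch.min(unclipped, clipped)

    # Average over all completion tokens in the group and negate to obtain a loss.
    return -(per_token * mask).sum() / mask.sum()
```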
Limitations of GRPO for small-size LLMs.
Despite the success of GRPO in large language models (LLMs), small-size LLMs (SLMs) face significant challenges when confronted with complex problems requiring long chains of thought Zhang et al. (2025). Due to their inherently limited capacity, SLMs struggle to generate high-quality, reward-worthy candidates for such tasks Li et al. (2025a); Zheng et al. (2025). As shown in Figure 4(a), Qwen3-1.7B applied to a code task fails to generate correct answers for most queries. This limitation substantially reduces the probability of sampling high-reward candidates, causing the advantage signals to vanish (Figure 5(b)) and thereby constraining the potential performance gains achievable through GRPO in SLMs.
4 Methodology
To address the limitations of GRPO on SLMs, we propose incorporating guidance mechanisms into the thinking trajectories, thereby facilitating the sampling of high-quality candidates. We then conduct a comprehensive investigation into various design choices for guidance strategies. Finally, we introduce the G 2 RPO-A algorithm, which integrates our empirical observations and significantly reduces the need for extensive hyperparameter tuning.
4.1 Guided GRPO as a Solution
Guided GRPO can be formulated as:
$$
\begin{split}\mathcal{L}_{\text{guided}}(\theta)=\mathbb{E}_{(q,a)\sim\mathcal{D},\,\{g_{i}\}_{i=1}^{G}\sim\mathcal{G},\,\{o_{i}\}_{i=1}^{G}\sim\pi_{\text{ref}}(\cdot\mid q,g_{i})}\biggl[&-\frac{1}{\sum_{i=1}^{G}\bigl(|g_{i}|+|o_{i}|\bigr)}\sum_{i=1}^{G}\biggl(\sum_{t=1}^{|g_{i}|}\min\Bigl(w_{g,i,t}\hat{A}_{i,t},\text{clip}\left(w_{g,i,t},1-\epsilon,1+\epsilon\right)\hat{A}_{i,t}\Bigr)\\&+\sum_{t=1}^{|o_{i}|}\min\Bigl(w_{o,i,t}\hat{A}_{i,t},\text{clip}\left(w_{o,i,t},1-\epsilon,1+\epsilon\right)\hat{A}_{i,t}\Bigr)\biggr)-\beta\mathcal{D}_{\text{KL}}(\pi_{\theta}\|\pi_{\text{ref}})\biggr],\end{split}
$$
where $w_{g,i,t}$ and $w_{o,i,t}$ denote the token-level weighting coefficients of the guidance $g_{i}$ and the model outputs $o_{i}$, respectively. As shown in Figure 4(b), this guidance enables SLMs to generate higher-reward candidates, potentially overcoming their inherent limitations.
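The sketch below illustrates one way such guided roll-outs can be constructed: the guidance tokens are placed at the start of the thinking trajectory before sampling the continuation, and a mask records which positions are guidance tokens (weighted by $w_{g,i,t}$) versus generated tokens (weighted by $w_{o,i,t}$). The Hugging Face-style `generate` call and all names here are illustrative assumptions rather than the authors' code.

```python
import torch

def build_guided_rollout(policy, tokenizer, question, guidance, max_new_tokens=1024):
    """Sample one completion whose thinking trajectory starts with `guidance`.

    Returns the completion token ids plus a boolean mask that is True on
    guidance tokens (weight w_g) and False on generated tokens (weight w_o).
    """
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids[0]
    guidance_ids = tokenizer(guidance, add_special_tokens=False,
                             return_tensors="pt").input_ids[0]

    # The injected guidance is treated as the prefix of the model's own reasoning.
    prefix = torch.cat([prompt_ids, guidance_ids])
    generated = policy.generate(prefix.unsqueeze(0),
                                max_new_tokens=max_new_tokens,
                                do_sample=True)[0][len(prefix):]

    completion = torch.cat([guidance_ids, generated])
    guidance_mask = torch.cat([torch.ones_like(guidance_ids, dtype=torch.bool),
                               torch.zeros_like(generated, dtype=torch.bool)])
    return completion, guidance_mask
```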
Naive Guided GRPO fails to boost the final performance.
Despite increasing the expected rewards (Figure 4(b)), we found that simply adding guidance to the thinking trajectories of all candidates does not enhance final performance and suffers from low advantage. As shown in Figures 5(a) and 5(b), we train Qwen3-1.7B-Base on a math dataset sourced from the Math-220K dataset Wang et al. (2024) and find that: (1) Guided GRPO's accuracy-reward curve almost matches that of the original GRPO. (2) Guided GRPO suffers from a low advantage standard deviation, hindering the optimization of the model. As a result, further investigation is needed to leverage Guided GRPO's higher rewards while ensuring effective training, as the naive approach fails to realize its potential benefits.
<details>
<summary>x4.png Details</summary>

Heatmap of a 10x7 grid (rows A-J, columns A-G) of candidate rewards under vanilla GRPO; darker blue indicates larger values. Most cells are zero, and the non-zero rewards are scattered values between 0.25 and 0.88, i.e. the reward signal is sparse.
</details>
(a) GRPO
<details>
<summary>x5.png Details</summary>

Heatmap of the same 10x7 reward grid under Guided GRPO. Far fewer cells are zero: one row is entirely 1, several others are dominated by values of 0.5 or higher, and the remaining non-zero cells lie between 0.25 and 1, i.e. the reward signal is considerably denser than in the GRPO case.
</details>
(b) Guided GRPO
Figure 4: Reward of Guided GRPO. We fine-tuned Qwen3-1.7B on coding tasks, using 10 roll-outs and generating 280 candidates per batch. The candidates' rewards form a 20x14 matrix. We then applied 2x2 average pooling, reducing it to a 10x7 matrix for clearer visualization. The results demonstrate that when configured with an optimal guidance ratio, G 2 RPO-A enables the model to sample candidates that yield a significantly denser reward signal.
| MATH500 | $\alpha=\frac{5}{6}$ | $\alpha=\frac{3}{6}$ | $\alpha=\frac{1}{6}$ | $\alpha=1$ |
| --- | --- | --- | --- | --- |
| $\ell=50$ | 66.80 | 66.00 | 67.20 | 65.10 |
| $\ell=100$ | 65.20 | 63.00 | 66.20 | 64.70 |
| $\ell=200$ | 57.60 | 52.40 | 62.00 | 59.30 |
| $\ell=500$ | 57.80 | 62.00 | 68.20 | 55.80 |
| $\ell=0$ | 62.00 | | | |
Table 1: Empirical study on guidance length $\ell$ and guidance ratio $\alpha$ . We use the Qwen2.5-Math-7B as the backbone.
4.2 Optimizing Guided GRPO Design
In this section, we thoroughly examine optimal design choices for Guided GRPO, focusing on guidance ratio of GRPO candidate groups and adjusting guidance strength at different training stages. These investigations aim to maximize the effectiveness of the Guided GRPO and overcome the limitations observed in the naive implementation.
Inner-Group Varied Guidance Ratio.
The insufficiency of naive guidance suggests that a more nuanced approach is required. We begin by investigating the impact of the guidance ratio $\alpha$. In each GRPO group of size $G$, we steer only an $\alpha$-fraction of the candidates. Letting $g_{i}$ denote the guidance for the $i$-th candidate (ordered arbitrarily), we have:
$$
|g_{i}|=0\quad(i>\alpha G),\qquad|g_{i}|=\ell\quad(i\leq\alpha G). \tag{3}
$$
That is, the first $\alpha G$ candidates receive guidance, while the remaining $(1-\alpha)G$ candidates evolve freely. We conduct experiments on the Qwen2.5-Math-7B model Yang et al. (2024) with a roll-out number $n=6$, training for one epoch on the s1K-1.1 dataset Muennighoff et al. (2025b). We set $\alpha\in\{1/6,\dots,1\}$ and $\ell\in\{50,100,\dots,500\}$ tokens, with all accuracies reported on the MATH500 benchmark. The results in Table 1 show that:
- Partial inner-group guidance improves model performance. In most settings, Guided GRPO with guidance provided to only a subset of candidates outperforms the vanilla GRPO, confirming the usefulness of the guidance mechanism.
- For Qwen2.5-Math-7B on the MATH500 benchmark, the lowest guidance ratio $\alpha$ combined with the longest guidance window $\ell$ yields the best results. This suggests that Qwen2.5-Math-7B benefits from infrequent but heavyweight guidance.
In summary, selective guidance, i.e., providing long guidance to only a few candidates, strikes the best balance between exploration and control, thereby improving model performance. Moreover, the optimal guidance ratio varies with both the task domain and model capacity. As Tables 8 and 9 show, smaller models and coding tasks benefit from stronger intervention, whereas larger models and math tasks achieve better results with lighter guidance.
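As a concrete illustration of Equation 3, the following sketch assigns a guidance prefix of length $\ell$ to the first $\alpha G$ candidates of a group and leaves the remaining candidates unguided; truncating the ground-truth trace to $\ell$ tokens is our assumption about how the guidance length is enforced.

```python
def assign_group_guidance(full_guidance_ids, group_size, alpha, ell):
    """Build per-candidate guidance prefixes for one GRPO group (Eq. 3).

    full_guidance_ids: token ids of the complete ground-truth reasoning trace
    alpha:             fraction of candidates that receive guidance
    ell:               guidance length (in tokens) for guided candidates
    """
    # alpha * G is an integer in the settings above; round() guards against float error.
    num_guided = int(round(alpha * group_size))
    guided_prefix = list(full_guidance_ids[:ell])    # |g_i| = ell for the first alpha*G candidates
    return [guided_prefix if i < num_guided else []  # |g_i| = 0 for the rest
            for i in range(group_size)]
```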
<details>
<summary>x6.png Details</summary>

Line graph of accuracy reward (y-axis: 0.0 to 0.5) versus global step (x-axis: 0 to 140) for Vanilla GRPO (blue), Naive Guided GRPO (green), and G 2 RPO-A (purple). All three curves rise toward a peak near step 100 and decline afterward; Naive Guided GRPO reaches the highest peak but drops most sharply, while G 2 RPO-A follows the smoothest trajectory and retains the most accuracy after the peak.
</details>
(a) Accuracy Reward
<details>
<summary>x7.png Details</summary>

Line graph of the advantage standard deviation (y-axis: 0.1 to 0.8) versus global step (x-axis: 0 to 140) for the same three methods. Naive Guided GRPO stays below roughly 0.35 throughout, Vanilla GRPO fluctuates in the mid range with a trough around step 80, and G 2 RPO-A reaches the highest values (about 0.72) in the later steps.
</details>
(b) Advantage $\sigma$
Figure 5: Pitfalls of naive Guided GRPO. We trained Qwen3-1.7B-Base on a curriculum-ordered subset of Math-220K Wang et al. (2024): problems are presented from easy to hard. Because the curriculum continually increases task difficulty, the accuracy reward does not plateau at a high level, an expected outcome of the CL schedule. This training dynamic indicates that the advantage standard deviation is extremely low under the naive guidance condition, a situation that negatively impacts training efficiency for SLMs.
| $\ell=50$, $\alpha=0.8333$ | 63.80 | 58.40 | 57.60 |
| --- | --- | --- | --- |
| $\ell=100$, $\alpha=0.1667$ | 60.60 | 62.00 | 66.20 |
| $\ell=200$, $\alpha=0.8333$ | 66.60 | 54.20 | 58.40 |
| $\ell=200$, $\alpha=0.1667$ | 61.20 | 64.20 | 59.60 |
| $\ell=500$, $\alpha=0.1667$ | 59.60 | 69.80 | 62.40 |
| $\ell=0$ | 62.00 | | |
Table 2: Performance of Guided GRPO under different guidance-length adjustment policies. We train Qwen2.5-Math-7B and evaluate it on the MATH 500 benchmark. For each guidance-length schedule, we report the results obtained with the guidance ratio that achieves the highest score in Table 1.
| Base Model | $\alpha$ | Benchmark | Base | GRPO | SFT | G 2 RPO-A |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B-Base | 0.75 | MATH500 | 40.18 | 54.26 | 50.53 | 51.77 |
| | | Minerva | 11.43 | 9.57 | 10.40 | 12.29 |
| | | gpqa | 25.49 | 24.51 | 25.49 | 30.39 |
| Qwen3-1.7B-Base | 0.25 | MATH500 | 50.96 | 63.74 | 62.11 | 67.21 |
| | | Minerva | 13.84 | 16.19 | 18.89 | 15.10 |
| | | gpqa | 27.45 | 29.41 | 24.51 | 32.35 |
| Qwen3-8B-Base | 0.14 | MATH500 | 71.32 | 79.49 | 80.29 | 82.08 |
| | | Minerva | 33.24 | 37.51 | 36.60 | 36.42 |
| | | gpqa | 43.17 | 44.13 | 42.85 | 49.72 |
Table 3: Performance of G 2 RPO-A on Math Tasks. We report accuracy (%) on various benchmarks. Models are trained for 5 epochs, and guidance ratios are selected based on the best settings obtained from Table 9.
| Qwen3-0.6B | 0.75 | MATH500 | 76.20 | 85.37 | 87.15 |
| --- | --- | --- | --- | --- | --- |
| | | Minerva | 12.32 | 20.59 | 21.57 |
| | | gpqa | 24.51 | 25.45 | 26.43 |
| | | AIME24 | 10.00 | 6.67 | 10.00 |
| | | AIME25 | 13.33 | 20.00 | 23.33 |
| Qwen3-1.7B | 0.25 | MATH500 | 92.71 | 94.52 | 91.69 |
| | | Minerva | 33.16 | 35.38 | 38.26 |
| | | gpqa | 48.23 | 51.68 | 55.27 |
| | | AIME24 | 46.67 | 56.67 | 63.33 |
| | | AIME25 | 36.67 | 50.00 | 53.33 |
Table 4: Performance of G 2 RPO-A on Math Tasks. The experiment settings are the same as in Table 3; however, we additionally use the AIME24 and AIME25 benchmarks here because these models are stronger.
Time Varied Guidance Length.
Apart from the guidance ratio, Table 1 shows that performance also depends on the guidance length $\ell$. To investigate this further, we evaluate Guided GRPO while varying the guidance length during training under three strategies:
$$
\text{Concave decay:}\quad\ell_{t}=\ell_{0}\Bigl(1-\tfrac{t}{T}\Bigr)^{\beta},\qquad\text{Linear decay:}\quad\ell_{t}=\ell_{0}\Bigl(1-\tfrac{t}{T}\Bigr),\qquad\text{Stepwise decay:}\quad\ell_{t}=\ell_{0}\,\gamma^{\lfloor t/s\rfloor}, \tag{4}
$$
where $T$ is the total number of training steps and $\ell_{0}$ is the initial guidance length. The parameter $\beta\in(1,\infty)$ controls the concavity, $\gamma\in(0,1)$ sets the decay rate, and $s$ specifies the decay interval.
We use the same experiment setting as in Table 1, and choose the guidance ratio that performs the best. The results are reported in Table 2. The results indicate that (1) model quality is highly sensitive to the chosen guidance length $\ell_{t}$ , and (2) no single schedule consistently outperforms the others. This highlights the need for more effective methods of controlling guidance length.
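For reference, the three schedules of Equation 4 can be written directly as functions of the training step $t$; the default values of $\beta$, $\gamma$, and $s$ below are placeholders, not values used in the paper.

```python
import math

def concave_decay(t, T, ell0, beta=2.0):
    # ell_t = ell0 * (1 - t/T)^beta, with beta > 1 controlling the concavity
    return ell0 * (1.0 - t / T) ** beta

def linear_decay(t, T, ell0):
    # ell_t = ell0 * (1 - t/T)
    return ell0 * (1.0 - t / T)

def stepwise_decay(t, ell0, gamma=0.5, s=10):
    # ell_t = ell0 * gamma^floor(t/s), with gamma in (0, 1) and decay interval s
    return ell0 * gamma ** math.floor(t / s)
```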
4.3 G 2 RPO-A: Sampling Difficulty Motivated Adaptive Guidance
In this section, we propose an adaptive algorithm that automatically selects the guidance strength $\ell$ at every optimization step. Our approach is inspired by recent work on data filtering and sampling (Bae et al., 2025; Xiong et al., 2025; Shi et al., 2025), which removes examples that yield uniformly low or uniformly high rewards. Such "uninformative" samples, being either too easy or too hard, contribute little to learning and can even destabilize training. The pseudo-code can be found in the Appendix.
Guidance length adjustment.
Our key idea is to control the difficulty of training samples by dynamically adjusting the guidance length according to the ongoing training state. At each training step $k$, the guidance length $\ell_{k+1}$ is determined by the following equation:
$$
\ell_{k+1}=\ell_{k}\cdot\frac{\text{min}(\mathcal{T},k)r_{k}}{\sum_{\tau=1}^{\text{min}(\mathcal{T},k)}r_{k-\tau}}, \tag{5}
$$
where $r_{k}$ is the average reward of the $k$-th training step and $\mathcal{T}$ is a hyperparameter that controls the number of history steps we consider; we found that setting $\mathcal{T}=2$ is already sufficient to noticeably improve Guided GRPO performance (Tables 10 and 11).
Equation 5 implies the following dynamics:
- When recent rewards rise, the guidance length $\ell$ is reduced, making the next batch of examples harder.
- When recent rewards fall, the guidance length $\ell$ is increased, making the next batch easier.
Thus, the training difficulty is automatically and continuously adjusted to match the modelâs current competence.
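A minimal sketch of the guidance-length controller described by Equation 5: it keeps a short window of past average rewards and rescales $\ell$ by the ratio between the current reward and the mean of that window. Variable names are ours; the authors' pseudo-code is given in the Appendix.

```python
from collections import deque

class AdaptiveGuidance:
    """Guidance-length controller following Equation 5 with history size T."""

    def __init__(self, ell_init, history_size=2):
        self.ell = ell_init
        self.history = deque(maxlen=history_size)    # past average rewards r_{k-1}, r_{k-2}, ...

    def update(self, reward_k):
        """Update the guidance length after step k, given that step's average reward r_k."""
        if self.history:
            m = len(self.history)                    # number of past rewards available, capped at T
            # ell_{k+1} = ell_k * m * r_k / sum_{tau=1..m} r_{k-tau}   (Eq. 5)
            self.ell = self.ell * (m * reward_k) / (sum(self.history) + 1e-8)
        self.history.append(reward_k)
        return self.ell
```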
| Base Model | Guidance Ratio | Benchmark | Base Perf. | GRPO | SFT | G 2 RPO-A |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 0.75 | humaneval | 32.32 | 38.89 | 40.33 | 44.96 |
| | | Live Code Bench | 17.07 | 22.22 | 13.58 | 23.14 |
| Qwen3-1.7B | 1 | humaneval | 46.08 | 67.65 | 63.34 | 75.93 |
| | | Live Code Bench | 34.31 | 53.14 | 56.33 | 51.96 |
| Qwen3-8B | 0.57 | humaneval | 64.36 | 81.48 | 77.42 | 80.33 |
| | | Live Code Bench | 60.58 | 77.12 | 63.82 | 79.71 |
Table 5: Performance of G 2 RPO-A on Code Tasks. We report accuracy (%) on various benchmarks. Models are trained for 5 epochs, and guidance ratios are selected based on the best settings obtained from Table 8.
Curriculum learning for further improvements.
Equation 5 shows that the adaptive guidance-length controller updates $\ell$ by comparing the current reward with rewards from previous steps. When consecutive batches differ markedly in difficulty, these reward variations no longer reflect the model's true learning progress, which in turn degrades G 2 RPO-A's performance.
| | Random GRPO | Random G 2 RPO-A | CL GRPO | CL G 2 RPO-A |
| --- | --- | --- | --- | --- |
| Qwen3-1.7B-Base | | | | |
| MATH500 | 53.81 | 57.67 | 52.05 | 58.94 |
| Minerva | 12.41 | 15.12 | 14.98 | 16.69 |
| gpqa | 24.79 | 23.53 | 27.45 | 25.49 |
| Qwen3-0.6B-Base | | | | |
| MATH500 | 43.25 | 50.72 | 48.16 | 53.59 |
| Minerva | 11.04 | 11.21 | 9.66 | 10.08 |
| gpqa | 23.10 | 25.49 | 24.51 | 32.35 |
Table 6: Comparison of training with random order and curriculum learning (CL) order across different models and benchmarks.
| | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| --- | --- | --- | --- | --- | --- |
| Remove | 76.74 | 71.11 | 50.00 | 35.00 | 24.00 |
| Replace | 88.37 | 75.56 | 54.00 | 37.00 | 18.00 |
| Original | 86.05 | 77.77 | 60.00 | 43.00 | 28.00 |
Table 7: Performance of GRPO with different sample-filtering methods. We train the Qwen3-1.7B model using G 2 RPO-A with $\alpha=0.25$. In the Remove setting, all hard samples are excluded from the original dataset, whereas in the Replace setting each hard sample is substituted with a sample of moderate difficulty.
To eliminate this mismatch, we embed a curriculum-learning (CL) strategy Parashar et al. (2025); Shi et al. (2025); Zhou et al. (2025). Concretely, we sort the samples by difficulty. Using the math task as an example, we rank examples by source, yielding five ascending difficulty tiers: cn_contest, aops_forum, amc_aime, olympiads, and olympiads_ref. We also tested ADARFT Shi et al. (2025), which orders samples by success rate, but its buckets proved uninformative in our cases (most questions were either trivial or impossible; see Appendix Figure 6), so it failed to separate difficulty levels effectively. Table 6 shows that CL boosts the performance of both vanilla GRPO and G 2 RPO-A.
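A minimal sketch of this ordering, assuming each training sample carries a `source` field naming its origin; samples are sorted by the ascending difficulty rank of the tiers listed above.

```python
# Ascending difficulty tiers for the math corpus, as listed above.
TIER_ORDER = ["cn_contest", "aops_forum", "amc_aime", "olympiads", "olympiads_ref"]
TIER_RANK = {name: rank for rank, name in enumerate(TIER_ORDER)}

def curriculum_sort(samples):
    """Order training samples from easy to hard by their source tier.

    Each sample is assumed to be a dict with a `source` key naming its origin;
    samples from unknown sources are placed last.
    """
    return sorted(samples, key=lambda s: TIER_RANK.get(s.get("source"), len(TIER_ORDER)))
```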
Comparison of G 2 RPO-A with sample-filtering methods.
Earlier work argues that policy-gradient training benefits most from mid-level queries. Bae et al. (2025) keep only moderate-difficulty batches via an online filter, and Reinforce-Rej Xiong et al. (2025) discards both the easiest and hardest examples to preserve batch quality. Our experiments show that this exclusion is counter-productive: removing hard problems deprives the model of vital learning signals and lowers accuracy on challenging tasks. Table 7 confirms that either dropping hard items or substituting them with moderate ones reduces Level 4 and 5 test accuracy. G 2 RPO-A avoids this pitfall by retaining tough examples and attaching adaptive guidance to them, thus exploiting the full difficulty spectrum without sacrificing performance.
5 Experiments
| | $\alpha=0$ | $\alpha=\frac{1}{4}$ | $\alpha=\frac{2}{4}$ | $\alpha=\frac{3}{4}$ | $\alpha=1$ |
| --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | | | | | |
| humaneval | 68.52 | 59.88 | 64.81 | 72.22 | 70.81 |
| LCB | 28.43 | 19.61 | 23.53 | 30.39 | 35.72 |
| Qwen3-0.6B | | | | | |
| humaneval | 41.98 | 32.10 | 27.72 | 38.89 | 49.38 |
| LCB | 12.75 | 11.76 | 9.80 | 18.63 | 12.75 |
Table 8: Ablation studies on guidance ratio $\alpha$ for Code Tasks. The group size is set to 12. The initial guidance length for G 2 RPO-A is set to 3072. LCB denotes Live Code Bench.
| | $\alpha=0$ | $\alpha=\frac{1}{4}$ | $\alpha=\frac{2}{4}$ | $\alpha=\frac{3}{4}$ | $\alpha=1$ |
| --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B-Base | | | | | |
| MATH500 | 52.05 | 58.71 | 53.09 | 55.53 | 45.95 |
| Minerva | 14.98 | 16.69 | 16.25 | 18.21 | 16.11 |
| gpqa | 27.45 | 25.49 | 30.39 | 25.49 | 22.55 |
| Qwen3-0.6B-Base | | | | | |
| MATH500 | 48.16 | 49.59 | 50.94 | 53.50 | 38.42 |
| Minerva | 9.66 | 9.10 | 8.96 | 10.08 | 15.69 |
| gpqa | 24.51 | 19.61 | 31.37 | 32.35 | 25.49 |
Table 9: Ablation studies on guidance ratio $\alpha$ for Math Tasks. The group size is set to 12. The initial guidance length for G 2 RPO-A is set to 3072.
| | GRPO | Fixed Guidance (3072) | Fixed Guidance (2048) | Fixed Guidance (1024) | RDP | G 2 RPO-A |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B-Base | | | | | | |
| MATH500 | 52.05 | 51.28 | 60.52 | 46.78 | 51.02 | 58.71 |
| Minerva | 14.98 | 14.40 | 17.16 | 12.22 | 17.99 | 22.46 |
| gpqa | 27.45 | 25.00 | 24.51 | 23.53 | 22.13 | 25.49 |
| Qwen3-0.6B-Base | | | | | | |
| MATH500 | 48.16 | 55.80 | 54.17 | 52.69 | 55.97 | 53.50 |
| Minerva | 9.66 | 13.27 | 15.26 | 11.78 | 14.32 | 15.69 |
| gpqa | 24.51 | 24.00 | 21.57 | 22.55 | 26.00 | 32.35 |
Table 10: Guidance-length ablation on Math Tasks. Each run uses the optimal guidance ratio reported in Table 9. The initial guidance budget for G 2 RPO-A is fixed at 3,072 tokens. RDP refers to the rule-based decay policy.
5.1 Experiment Settings
In this section, we outline the experimental settings; further details on the dataset filtering methods and evaluations of additional models can be found in the Appendix.
Datasets and models.
We conduct experiments on math and code tasks. In detail,
- Mathematical reasoning tasks. We construct a clean subset of the Open-R1 math-220k corpus Wang et al. (2024). Problems are kept only if their solution trajectories are (i) complete, (ii) factually correct, and (iii) syntactically parsable.
- Code generation. For programming experiments we adopt the Verifiable-Coding-Problems-Python benchmark from Open-R1. For every problem we automatically generate a chain-of-thought with QWQ-32B-preview Team (2024). These traces are later used as guidance by our proposed G 2 RPO-A training procedure.
We use the Qwen3 series Yang et al. (2025a) for both tasks. Results for DeepSeek-Math-7B-Base Shao et al. (2024) on math and DeepSeek-Coder-6.7B-Base Guo et al. (2024) on code are also included in the Appendix. Unless specifically mentioned, CL is used in all experiments for fair comparison; the corresponding ablation is reported in Table 6.
Evaluation protocol.
We assess our models mainly on three public mathematical-reasoning benchmarks: Math500 Hendrycks et al. (2021), Minerva-Math Lewkowycz et al. (2022), and GPQA Rein et al. (2024). For the mathematical training of Qwen3-1.7B and Qwen3-0.6B, the AIME24 Li et al. (2024) and AIME25 benchmarks are also used. For code tasks, we use two benchmarks: humaneval Chen et al. (2021) and Live Code Bench Jain et al. (2024). Decoding hyper-parameters are fixed to: temperature $=0.6$, $\mathrm{top}\text{-}p=0.95$, and $\mathrm{top}\text{-}k=20$. Unless otherwise noted, we generate with a batch size of $128$ and permit a token budget between $1{,}024$ and $25{,}000$, based on each model's context window.
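For concreteness, these decoding settings correspond to a sampling configuration such as the following (shown with vLLM's `SamplingParams` purely as an example; the paper does not specify the inference engine, and the maximum token budget is adjusted per model).

```python
from vllm import SamplingParams

# Decoding hyper-parameters used for all evaluations; max_tokens is set per model
# within the 1,024 to 25,000 range described above.
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    max_tokens=25000,
)
```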
Training details.
Our G 2 RPO-A algorithm is implemented on top of the fully open-source Open-R1 framework Face (2025). We use the following hyper-parameters: (i) the number of roll-outs per sample is set to $12$ for the 0.6B and 1.7B backbones and $7$ for the 7B and 8B backbones; (ii) an initial learning rate of $1\times 10^{-6}$, decayed with a cosine schedule and a warm-up ratio of $0.1$; (iii) a training set of $1{,}000$ problems trained for $5$ epochs (only 1 epoch is used for ablation experiments); (iv) all models are trained on 8 A100 GPUs.
5.2 Numerical Results
Superior performance of G 2 RPO-A.
As reported in Tables 3, 4, and 5, (i) our proposed G 2 RPO-A markedly surpasses vanilla GRPO on nearly every benchmark, and (ii) all RL-based methods outperform both the frozen base checkpoints and their SFT variants, mirroring trends previously observed in the literature.
Effect of the guidance ratio $\boldsymbol{\alpha}$ .
Tables 8 and 9 show that (1) larger models benefit from weaker guidance: e.g., Qwen3-1.7B peaks at $\alpha{=}0.25/0.5$ on Math, whereas the smaller Qwen3-0.6B prefers $\alpha{=}0.75$; (2) code tasks consistently require a higher guidance ratio than math tasks.
Ablation on guidance-length schedules.
Tables 10 and 11 contrast our adaptive scheme (G 2 RPO-A) with (i) fixed guidance and (ii) a rule-based decay policy (RDP). (1) G 2 RPO-A achieves the best score on almost every model-benchmark pair, confirming the benefit of on-the-fly adjustment. (2) For fixed guidance, the optimal value varies across both tasks and model sizes, with no clear global pattern, underscoring the need for an adaptive mechanism such as G 2 RPO-A.
| | GRPO | Fixed Guidance (3072) | Fixed Guidance (2048) | Fixed Guidance (1024) | RDP | G 2 RPO-A |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | | | | | | |
| humaneval | 68.52 | 58.64 | 58.02 | 60.49 | 69.29 | 70.81 |
| LCB | 23.53 | 29.41 | 28.43 | 31.37 | 26.47 | 35.72 |
| Qwen3-0.6B | | | | | | |
| humaneval | 38.89 | 43.93 | 36.54 | 38.40 | 42.27 | 49.38 |
| LCB | 12.75 | 13.73 | 10.78 | 9.80 | 11.67 | 12.75 |
Table 11: Guidance-length ablation on Code Tasks. Each run uses the optimal guidance ratio reported in Table 8. The initial guidance budget for G 2 RPO-A is fixed at 3,072 tokens. RDP refers to the rule-based decay policy.
6 Conclusion and Future Work
We introduce a method that injects ground-truth guidance into the thinking trajectories produced during GRPO roll-outs, thereby improving the performance of small-scale LLMs. After an extensive study of guidance configurations, we observe that the guidance ratio plays a significant role in the guidance mechanism and that the optimal guidance length is context-dependent; based on these findings, we develop G 2 RPO-A, an auto-tuned approach. Experiments on mathematical reasoning and code generation demonstrate that G 2 RPO-A consistently boosts accuracy. In future work, we plan to evaluate G 2 RPO-A across a broader range of tasks and model architectures, which we believe will further benefit the community.
References
- Bae et al. [2025] Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning, 2025. URL https://arxiv.org/abs/2504.03380.
- Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374.
- Chu et al. [2025] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. In The Second Conference on Parsimony and Learning (Recent Spotlight Track), 2025. URL https://openreview.net/forum?id=d3E3LWmTar.
- Cui et al. [2025] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. CoRR, 2025.
- Dang and Ngo [2025] Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesnât, 2025. URL https://arxiv.org/abs/2503.16219.
- Face [2025] Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1.
- Guan et al. [2025] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small LLMs can master math reasoning with self-evolved deep thinking. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=5zwF1GizFa.
- Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming-the rise of code intelligence. CoRR, 2024.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.
- HUANG et al. [2025] Dong HUANG, Guangtao Zeng, Jianbo Dai, Meng Luo, Han Weng, Yuhao QING, Heming Cui, Zhijiang Guo, and Jie Zhang. Efficoder: Enhancing code generation in large language models through efficiency-aware fine-tuning. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=8bgaOg1TlZ.
- Huang et al. [2025] Qihan Huang, Weilong Dai, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, and Jie Song. Boosting mllm reasoning with text-debiased hint-grpo, 2025. URL https://arxiv.org/abs/2503.23905.
- Jaech et al. [2024] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. CoRR, 2024.
- Jain et al. [2024] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. CoRR, 2024.
- Jia et al. [2025] Ruipeng Jia, Yunyi Yang, Yongbo Gai, Kai Luo, Shihao Huang, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Writing-zero: Bridge the gap between non-verifiable tasks and verifiable rewards, 2025. URL https://arxiv.org/abs/2506.00103.
- Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199â22213, 2022.
- Lee et al. [2024] Jung Hyun Lee, June Yong Yang, Byeongho Heo, Dongyoon Han, and Kang Min Yoo. Token-supervised value models for enhancing mathematical reasoning capabilities of large language models. CoRR, 2024.
- Lewkowycz et al. [2022] Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=IFXTZERXdM7.
- Li et al. [2025a] Chen Li, Nazhou Liu, and Kai Yang. Adaptive group policy optimization: Towards stable training and token-efficient reasoning. arXiv preprint arXiv:2503.15952, 2025a.
- Li et al. [2024] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions, 2024.
- Li et al. [2025b] Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383, 2025b.
- Lin et al. [2025] Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. Cppo: Accelerating the training of group relative policy optimization-based reasoning models, 2025. URL https://arxiv.org/abs/2503.22342.
- Liu et al. [2025] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783.
- Lu et al. [2024] Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D Lane, and Mengwei Xu. Small language models: Survey, measurements, and insights. CoRR, 2024.
- Muennighoff et al. [2025a] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Workshop on Reasoning and Planning for Large Language Models, 2025a. URL https://openreview.net/forum?id=LdH0vrgAHm.
- Muennighoff et al. [2025b] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025b. URL https://arxiv.org/abs/2501.19393.
- Nath et al. [2025] Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, and Sean Hendryx. Adaptive guidance accelerates reinforcement learning of reasoning models, 2025. URL https://arxiv.org/abs/2506.13923.
- Nguyen et al. [2024] Chien Van Nguyen, Xuan Shen, Ryan Aponte, Yu Xia, Samyadeep Basu, Zhengmian Hu, Jian Chen, Mihir Parmar, Sasidhar Kunapuli, Joe Barrow, Junda Wu, Ashish Singh, Yu Wang, Jiuxiang Gu, Franck Dernoncourt, Nesreen K. Ahmed, Nedim Lipka, Ruiyi Zhang, Xiang Chen, Tong Yu, Sungchul Kim, Hanieh Deilamsalehy, Namyong Park, Mike Rimer, Zhehao Zhang, Huanrui Yang, Ryan A. Rossi, and Thien Huu Nguyen. A survey of small language models, 2024. URL https://arxiv.org/abs/2410.20011.
- Parashar et al. [2025] Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Curriculum reinforcement learning from easy to hard tasks improves llm reasoning, 2025. URL https://arxiv.org/abs/2506.06632.
- Park et al. [2025] Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyunwoo J. Kim. Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo, 2025. URL https://arxiv.org/abs/2506.07464.
- Rein et al. [2024] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.
- Shi et al. [2025] Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning, 2025. URL https://arxiv.org/abs/2504.05520.
- Souza et al. [2025] Débora Souza, Rohit Gheyi, Lucas Albuquerque, Gustavo Soares, and Márcio Ribeiro. Code generation with small language models: A deep evaluation on codeforces, 2025. URL https://arxiv.org/abs/2504.07343.
- Team [2024] Qwen Team. QwQ: Reflect deeply on the boundaries of the unknown. Hugging Face, 2024.
- Wang et al. [2024] Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M. Ni, Linyi Yang, Ying Wen, and Weinan Zhang. Openr: An open source framework for advanced reasoning with large language models, 2024. URL https://arxiv.org/abs/2410.09671.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
- Wei et al. [2025] Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449, 2025.
- Wen et al. [2025] Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms, 2025. URL https://arxiv.org/abs/2506.14245.
- Wu et al. [2025] Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long. Rlvr-world: Training world models with reinforcement learning, 2025. URL https://arxiv.org/abs/2505.13934.
- Xiong et al. [2025] Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, and Hanze Dong. A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025. URL https://arxiv.org/abs/2504.11343.
- Xu et al. [2025] Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, and Weizhu Chen. Phi-4-mini-reasoning: Exploring the limits of small reasoning language models in math, 2025. URL https://arxiv.org/abs/2504.21233.
- Yang et al. [2024] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. CoRR, 2024.
- Yang et al. [2025a] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.
- Yang et al. [2025b] Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, and Jing Tang. Treerpo: Tree relative policy optimization, 2025b. URL https://arxiv.org/abs/2506.05183.
- Ye et al. [2025] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025. URL https://arxiv.org/abs/2502.03387.
- Yu et al. [2025] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. CoRR, 2025.
- Yue et al. [2025] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837.
- Zhang et al. [2025] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. CoRR, 2025.
- Zheng et al. [2025] Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts, 2025. URL https://arxiv.org/abs/2506.02177.
- Zhou et al. [2025] Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, and Furong Huang. Disco balances the scales: Adaptive domain- and difficulty-aware reinforcement learning on imbalanced data, 2025. URL https://arxiv.org/abs/2505.15074.
- Zhuang et al. [2025] Xialie Zhuang, Peixian Ma, Zhikai Jia, Shiwei Liu, and Zheng Cao. A technical study into 0.5b reasoning language models, 2025. URL https://arxiv.org/abs/2506.13404.