## Technical Diagram: Comparison of PPO and GRPO Reinforcement Learning Architectures
### Overview
The image is a technical flowchart diagram comparing two reinforcement learning algorithm architectures: **PPO** (Proximal Policy Optimization) and **GRPO** (Group Relative Policy Optimization). The diagram is divided into two horizontal sections separated by a dashed line. The top section illustrates the PPO workflow, and the bottom section illustrates the GRPO workflow. A legend on the right side defines the color-coding for model types.
### Components/Axes
**Legend (Right Side):**
* **Trained Models:** Represented by yellow-filled boxes with black borders.
* **Frozen Models:** Represented by blue-filled boxes with black borders.
**PPO Section (Top Half):**
* **Input:** A single input labeled `q` (likely representing a query or state).
* **Core Models:**
* `Policy Model` (Trained, Yellow): Takes `q` as input, produces output `o`.
* `Reference Model` (Frozen, Blue): Receives `o`.
* `Reward Model` (Frozen, Blue): Receives `o`.
* `Value Model` (Trained, Yellow): Receives `o`.
* **Outputs & Computations:**
* `o`: Output from the Policy Model.
* `KL`: Kullback-Leibler divergence, computed between the Policy Model and the Reference Model.
* `⊕`: A combination node that merges the `KL` penalty with the Reward Model's output.
* `r`: The resulting scalar reward signal, i.e., the Reward Model's output after the `KL` term has been folded in.
* `v`: Value estimate, output from the Value Model.
* `GAE`: Generalized Advantage Estimation, a computation block that takes `r` and `v` as inputs.
* `A`: The final advantage estimate, output from the GAE block.
* **Flow:** The diagram shows a feedback loop where the advantage `A` is used to update the `Policy Model` and `Value Model`.
**GRPO Section (Bottom Half):**
* **Input:** A single input labeled `q`.
* **Core Models:**
* `Policy Model` (Trained, Yellow): Takes `q` as input, produces a *group* of outputs: `o₁`, `o₂`, ..., `o_G`.
* `Reference Model` (Frozen, Blue): Receives the group of outputs.
* `Reward Model` (Frozen, Blue): Receives the group of outputs.
* **Outputs & Computations:**
* `o₁, o₂, ..., o_G`: A group of `G` outputs from the Policy Model.
* `KL`: Kullback-Leibler divergence, computed between the Policy Model and the Reference Model. The arrow points directly to the Policy Model, suggesting a direct regularization term.
* `r₁, r₂, ..., r_G`: A group of scalar reward signals, each corresponding to an output `o_i`, from the Reward Model.
* `Group Computation`: A processing block that takes the group of rewards `r₁...r_G` as input.
* `A₁, A₂, ..., A_G`: A group of advantage estimates, output from the Group Computation block.
* **Flow:** The diagram shows a feedback loop where the group of advantages `A₁...A_G` is used to update the `Policy Model`.
### Detailed Analysis
**PPO Architecture Flow:**
1. A single query `q` is fed into the trained `Policy Model`.
2. The Policy Model generates a single output `o`.
3. This output `o` is simultaneously fed into three models: the frozen `Reference Model`, the frozen `Reward Model`, and the trained `Value Model`.
4. The `Reference Model` helps compute a `KL` divergence penalty against the current policy.
5. The `Reward Model` produces a reward signal `r`. The diagram indicates this `r` is combined with the `KL` term (via the `⊕` symbol).
6. The `Value Model` produces a value estimate `v`.
7. The combined reward `r` and value `v` are fed into the `GAE` (Generalized Advantage Estimation) module.
8. The GAE module computes the final advantage estimate `A`.
9. The advantage `A` is used in the loss function to update the parameters of the `Policy Model` and the `Value Model` (indicated by the curved feedback arrows).
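The GAE step in the PPO flow above can be sketched as follows. This is a minimal illustration, not the diagram's exact implementation: `gamma`, `lam`, and the trajectory layout are standard GAE conventions assumed here, and `rewards` is taken to be the KL-shaped reward `r` from step 5.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards (already KL-shaped, as with `r` in the diagram)
    values:  value estimates v(s_t) for each step plus one bootstrap value,
             so len(values) == len(rewards) + 1
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = sum of exponentially weighted (gamma * lam) future deltas
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `gamma = lam = 1` and zero values, the advantage at each step reduces to the plain sum of remaining rewards, which is a quick sanity check for the recursion.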
**GRPO Architecture Flow:**
1. A single query `q` is fed into the trained `Policy Model`.
2. The Policy Model generates a *group* of `G` different outputs: `o₁, o₂, ..., o_G`.
3. This entire group of outputs is fed into the frozen `Reference Model` and the frozen `Reward Model`.
4. The `Reference Model` computes a `KL` divergence penalty directly against the `Policy Model`.
5. The `Reward Model` produces a corresponding group of scalar rewards: `r₁, r₂, ..., r_G`.
6. This group of rewards is processed by a `Group Computation` block. This likely involves normalizing or comparing rewards within the group to compute relative advantages.
7. The Group Computation block outputs a group of advantages: `A₁, A₂, ..., A_G`.
8. These group advantages are used in the loss function to update the parameters of the `Policy Model` (indicated by the curved feedback arrow). Notably, there is no separate Value Model in the GRPO diagram.
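One plausible form of the `Group Computation` block is within-group reward normalization; the figure only labels the block, so the exact formula below is an assumption, not something the diagram specifies.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Hypothetical 'Group Computation': standardize each reward against
    the group's mean and standard deviation, so an output's advantage
    reflects how it compares to its sibling outputs for the same query.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Note that the resulting advantages always center on zero within the group: outputs beating the group average get positive advantage, the rest negative, with no value model needed.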
### Key Observations
1. **Output Granularity:** The most fundamental difference is that PPO processes a single output (`o`) per query, while GRPO processes a group of `G` outputs (`o₁...o_G`) per query.
2. **Advantage Calculation:** PPO uses the classic GAE method combining a reward signal and a value estimate. GRPO replaces this with a "Group Computation" step that operates on multiple rewards, suggesting it calculates advantages based on the relative performance within the generated group.
3. **Model Architecture:** PPO explicitly includes a trained `Value Model` to estimate state values (`v`). GRPO does not show a Value Model, implying its advantage estimation is derived differently, likely from the group rewards.
4. **KL Divergence Application:** In PPO, the KL term is combined with the reward `r`. In GRPO, the KL arrow points directly to the Policy Model, suggesting it may be applied as a direct regularization term in the policy loss.
5. **Color-Coding Consistency:** The legend is applied consistently. The `Policy Model` and `Value Model` (in PPO) are yellow (Trained). The `Reference Model` and `Reward Model` are blue (Frozen) in both architectures.
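The two KL placements in observation 4 can be contrasted in a short sketch. Both functions are illustrative assumptions: the diagram shows only where the KL arrow attaches, not the exact estimator, and the simple `log pi - log pi_ref` difference used here is one common per-sample KL estimate.

```python
def ppo_style_reward(reward, logp_policy, logp_ref, beta=0.1):
    """PPO placement (as drawn): fold the KL penalty into the reward
    before advantage estimation, r' = r - beta * (log pi - log pi_ref)."""
    return reward - beta * (logp_policy - logp_ref)

def grpo_style_loss(policy_term, logp_policy, logp_ref, beta=0.1):
    """GRPO placement (as drawn): leave the reward untouched and add the
    KL estimate directly to the policy loss as a regularizer."""
    kl_estimate = logp_policy - logp_ref  # simple estimator (assumption)
    return -policy_term + beta * kl_estimate
```

The practical difference is where the penalty enters the gradient: in the first case it flows through the advantage estimate, in the second it is a separate additive term in the objective.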
### Interpretation
This diagram contrasts two approaches to policy optimization in reinforcement learning, likely from a language model fine-tuning context (given the "query" `q` and "output" `o` terminology).
* **PPO** represents the standard, well-established approach. It relies on a separate value function (the Value Model) to estimate the "goodness" of a state, which is crucial for calculating low-variance advantage estimates via GAE. This is effective but requires training and maintaining an additional model.
* **GRPO** appears to be a proposed variant that eliminates the need for a separate value function. Instead of predicting absolute value, it generates multiple responses (`o₁...o_G`) for a single prompt and computes advantages *relative* to the group. The "Group Computation" likely normalizes the rewards (e.g., subtracting the mean and dividing by the standard deviation) to determine which outputs in the batch were better or worse than average. This relative advantage (`A_i`) is then used for the policy update.
The core innovation suggested by GRPO is shifting from absolute advantage estimation (requiring a value model) to relative advantage estimation within a generated batch. This could potentially simplify training (one less model to train) and might offer more stable learning signals by focusing on comparative performance. The direct application of the KL penalty in GRPO also hints at a potentially different formulation of the policy loss objective. The diagram effectively highlights this architectural shift from a single-sample, value-dependent process to a multi-sample, group-relative process.