## Diagram: Flow-GRPO System Architecture
### Overview
This image presents a system architecture diagram titled "Flow-GRPO," illustrating a process for training a policy model within a multi-turn agentic system. The diagram details the interaction between a policy model, a reference model, a reward model, and various inputs and computational steps, including multi-turn rollouts and group computation, with clear distinctions between trained and frozen model components.
### Components/Axes
The diagram is structured as a left-to-right flow with feedback loops.
**Input Components (Left):**
* **q:** A white rounded rectangle, representing an input.
* **M:** An orange rounded rectangle, representing an input.
* **K:** A light blue rounded rectangle, representing an input.
**Model Components:**
* **Policy Model:** An orange rounded rectangle, centrally located on the left side. It has a small red flame icon on its top-right, indicating it is actively being trained or updated.
* **Reference Model:** A light blue rounded rectangle, positioned below the Policy Model. It has a small light blue cube icon on its top-right, indicating it is a frozen or fixed component.
* **Reward Model:** A light blue vertically oriented rounded rectangle, positioned in the middle of the diagram. It also has a small light blue cube icon on its top-left, indicating it is a frozen or fixed component.
**Process Blocks:**
* **Multi-turn Agentic System Rollouts:** A large orange-bordered rounded rectangle, spanning the middle-left section of the diagram. This block contains multiple rows of action sequences and observations.
* Each row represents a sequence of actions `a_i^1, a_i^2, ..., a_i^{T_i}` (where `i` indexes the rollout within a group of `G`, and `T_i` is the number of turns in rollout `i`). These action sequences are enclosed in light grey rounded rectangles.
* Each action sequence leads to an observation `o_i` (e.g., `o_1, o_2, o_3, ..., o_G`), represented by white rounded rectangles.
* **Multi-turn Group Computation:** A white rounded rectangle, positioned on the far right, above the legend.
**Output/Intermediate Data Blocks:**
* A large light grey-bordered rounded rectangle, positioned in the middle-right section of the diagram. This block contains multiple rows of reward sequences.
* Each row represents a sequence of rewards `r_i^1, r_i^2, ..., r_i^{T_i}` (where `i` indexes the rollout within a group of `G`, and `T_i` is the number of turns in rollout `i`). These reward sequences are enclosed in light grey rounded rectangles.
**Legend (Bottom-right):**
* **Trained Models:** An orange rounded rectangle.
* **Frozen Models:** A light blue rounded rectangle.
### Detailed Analysis
**Flow from Inputs to Policy Model:**
* Input `q` feeds into the `Policy Model`.
* Input `M` feeds into the `Policy Model`.
* Input `K` feeds into the `Reference Model`.
* The `Reference Model` feeds into the `Policy Model` with a connection labeled `KL`, suggesting a Kullback-Leibler divergence constraint or regularization.
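The `KL` connection can be sketched numerically. A minimal illustration, assuming per-token log-probabilities and the "k3" KL estimator common in GRPO-style training (the diagram does not specify which estimator is used):

```python
import math

def kl_penalty(policy_logprobs, ref_logprobs):
    """Per-token KL estimate between the trained policy and the frozen
    reference model, using the unbiased 'k3' estimator:
    exp(d) - d - 1 with d = ref_logp - policy_logp.
    It is non-negative and zero exactly when the two models agree."""
    return [math.exp(r - p) - (r - p) - 1.0
            for p, r in zip(policy_logprobs, ref_logprobs)]

same = kl_penalty([-1.0, -2.0], [-1.0, -2.0])   # no divergence
drift = kl_penalty([-1.0], [-2.0])              # reference disagrees
```

The penalty grows as the policy drifts from the reference, which is exactly the regularization role the `KL` arrow plays in the diagram.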
**Multi-turn Agentic System Rollouts:**
* The `Policy Model` outputs to the `Multi-turn Agentic System Rollouts` block, specifically influencing the generation of actions `a_i^t`.
* The `Reference Model` also outputs to the `Multi-turn Agentic System Rollouts` block, specifically influencing the generation of actions `a_i^t`.
* Within the "Multi-turn Agentic System Rollouts" block:
* Row 1: `a_1^1, a_1^2, a_1^3` leads to `o_1`.
* Row 2: `a_2^1, a_2^2` leads to `o_2`.
* Row 3: `a_3^1, a_3^2, ..., a_3^{T_3}` leads to `o_3`.
* Ellipses indicate more rows.
* Last Row: `a_G^1, a_G^2, a_G^3, ..., a_G^{T_G}` leads to `o_G`.
* The `o_i` observations are then fed into the `Reward Model`.
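The variable-length rows of this block can be modeled with a simple container; a sketch whose field names are illustrative, not taken from the diagram:

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    """One row of the rollouts block: actions a_i^1 .. a_i^{T_i}
    followed by a single observation o_i."""
    actions: list
    observation: object = None

# A group of G rollouts; turn counts T_i differ per row, as in the diagram.
group = [
    Rollout(actions=["a^1", "a^2", "a^3"], observation="o_1"),
    Rollout(actions=["a^1", "a^2"], observation="o_2"),
]
turn_counts = [len(r.actions) for r in group]   # T_i per rollout
```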
**Reward Generation:**
* The `Reward Model` takes inputs from the `o_i` observations (e.g., an arrow from `o_3` points to the `Reward Model`).
* The `Reward Model` outputs to the light grey-bordered block containing reward sequences `r_i^t`.
* Row 1: `r_1^1, r_1^2, r_1^3`.
* Row 2: `r_2^1, r_2^2`.
* Row 3: `r_3^1, r_3^2, ..., r_3^{T_3}`.
* Ellipses indicate more rows.
* Last Row: `r_G^1, r_G^2, r_G^3, ..., r_G^{T_G}`.
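The diagram does not show how the per-turn rewards of one row are collapsed for the downstream group computation; a common choice is a (possibly discounted) sum, sketched here under that assumption:

```python
def rollout_return(turn_rewards, gamma=1.0):
    """Aggregate the per-turn rewards r_i^1 .. r_i^{T_i} of one row into
    a scalar return. The discount factor gamma is an assumption; the
    diagram does not specify how turn rewards are combined."""
    return sum(gamma ** t * r for t, r in enumerate(turn_rewards))

undiscounted = rollout_return([1.0, 0.5, 0.25])
discounted = rollout_return([1.0, 1.0], gamma=0.9)
```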
**Policy Update Loop:**
* The reward sequences (from the light grey-bordered block) feed into the `Multi-turn Group Computation` block.
* The `Multi-turn Group Computation` block has a feedback loop, with an arrow pointing back to the `Policy Model`. This indicates that the computation based on rewards is used to update the `Policy Model`.
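In GRPO-style methods, the "group computation" typically normalizes each rollout's return against the other rollouts in the same group; a minimal sketch, assuming mean/standard-deviation normalization (the diagram does not spell out the formula):

```python
import statistics

def group_advantages(group_returns):
    """Group-relative advantages: each rollout's return normalized by the
    group mean and standard deviation, so a rollout is credited only for
    outperforming its peers in the same group."""
    mean = statistics.fmean(group_returns)
    std = statistics.pstdev(group_returns) or 1.0  # guard: identical returns
    return [(g - mean) / std for g in group_returns]

adv = group_advantages([1.0, 0.0, 1.0, 0.0])   # mean 0.5, std 0.5
```

This within-group comparison is what lets the method dispense with a learned value baseline: the other `G - 1` rollouts serve as the baseline.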
**Model Status (from Legend):**
* **Trained Models (Orange):** `Policy Model`, `M`. The orange border of "Multi-turn Agentic System Rollouts" suggests this process is part of the trained system's operation.
* **Frozen Models (Light Blue):** `Reference Model`, `Reward Model`, `K`. The cube icons on `Reference Model` and `Reward Model` reinforce their frozen status.
### Key Observations
* The `Policy Model` is the primary component undergoing training, indicated by its orange color and flame icon.
* The `Reference Model` and `Reward Model` are fixed or pre-trained components, indicated by their light blue color and cube icons.
* The system involves multi-turn interactions (superscript `t`, up to `T_i` turns) across a group of `G` rollouts (subscript `i`).
* A `KL` divergence term regularizes the `Policy Model`'s updates with respect to the `Reference Model`, a common technique for stabilizing policy optimization (as in KL-penalized PPO variants and RLHF-style fine-tuning).
* The process is iterative, with rewards from rollouts feeding back to update the policy.
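The iterative update can be written as a per-token surrogate objective; a sketch with illustrative hyperparameters (`beta` and `eps` are assumptions, and the clipped-ratio form is borrowed from PPO/GRPO rather than read off the diagram):

```python
import math

def grpo_token_loss(logp, old_logp, advantage, kl, beta=0.04, eps=0.2):
    """PPO-style clipped surrogate weighted by the group-relative
    advantage, minus a KL penalty toward the frozen reference model.
    Minimizing this loss raises the probability of tokens from
    above-average rollouts while keeping the policy near the reference."""
    ratio = math.exp(logp - old_logp)                # pi_theta / pi_old
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # PPO clipping
    surrogate = min(ratio * advantage, clipped * advantage)
    return -(surrogate - beta * kl)

loss = grpo_token_loss(logp=0.0, old_logp=0.0, advantage=1.0, kl=0.0)
```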
### Interpretation
The Flow-GRPO diagram illustrates a reinforcement learning (RL) framework, likely for training a policy in a multi-agent, multi-turn environment. The core idea is to train a `Policy Model` (which generates actions `a_i^t`) by interacting with an environment (represented by the "Multi-turn Agentic System Rollouts" and subsequent reward generation).
1. **Policy Training:** The `Policy Model` is the trainable component, taking inputs `q` and `M`. It generates the action sequences `a_i^t` for each of the `G` rollouts in a group, over a variable number of turns `T_i`.
2. **Reference and Regularization:** A `Reference Model` (frozen) provides a baseline or constraint for the `Policy Model`'s updates, enforced by a `KL` divergence term. This prevents the policy from making drastic changes, promoting stable learning. Input `K` might be related to the reference model's parameters or state.
3. **Rollouts and Observations:** The generated actions `a_i^t` are executed in a simulated or real environment (the "Multi-turn Agentic System Rollouts"), leading to observations `o_i`. The orange border of this block suggests it's an active part of the training process, driven by the trained policy.
4. **Reward Evaluation:** The `Reward Model` (frozen) evaluates the observations `o_i` to produce rewards `r_i^t`. This implies the reward function is fixed and not learned during this process, providing a stable signal for policy improvement.
5. **Policy Update:** The `Multi-turn Group Computation` aggregates or processes these rewards to generate an update signal that is fed back to the `Policy Model`. This completes the RL loop, where the policy learns to maximize cumulative rewards.
In "Flow-GRPO," "GRPO" most plausibly stands for Group Relative Policy Optimization, a group-based policy optimization method that replaces a learned value baseline with within-group comparison of rollout returns; "Flow" may refer to the multi-turn agentic workflow being optimized, and the KL divergence term suggests a PPO-like objective. The distinction between "Trained Models" and "Frozen Models" is crucial: it highlights which components adapt during training and which stay fixed. Leveraging a fixed reference model and a fixed reward model gives the policy a stable, controlled learning signal.