## Diagram: Comparison of Pair Supervised Fine-Tuning and Online Reinforcement Learning Processes
### Overview
The image is a technical diagram comparing two distinct training methodologies for AI models, presented side-by-side. The left panel illustrates "Pair Supervised Fine-Tuning," and the right panel illustrates "Online Reinforcement Learning." Both processes start with a "Question" and involve sequences of "Solution" (S) and "Verification" (V) steps, but they differ fundamentally in how they generate training data and apply learning signals.
### Components/Axes
The diagram is divided into two main panels by a vertical line.
**Common Elements (Both Panels):**
* **Question Node:** A blue rectangular box labeled "Question" at the top of each process flow.
* **Solution Node (S):** A purple circle labeled "S". The legend defines this as "Solution".
* **Verification Node (V):** A yellow circle labeled "V". The legend defines this as "Verification".
* **Correct/Incorrect Indicators:** A green checkmark (✓) indicates a correct step. A red cross (✗) indicates an incorrect step.
* **EOT Indicator:** A dashed purple box around a node. The legend defines this as "EOT" (likely "End of Thought" or a terminal state).
**Left Panel: Pair Supervised Fine-Tuning**
* **Title:** "Pair Supervised Fine-Tuning" at the top.
* **Data Flow:** Shows multiple parallel branches originating from the "Question". Each branch is a sequence of S and V nodes with correct/incorrect markings.
* **Dataset Symbol:** An arrow points from the collection of branches to a box labeled "D_pair".
* **Learning Process:** A dashed arrow points from "D_pair" to a box labeled "SFT" (Supervised Fine-Tuning).
**Right Panel: Online Reinforcement Learning**
* **Title:** "Online Reinforcement Learning" at the top.
* **Reward Function:** Text in the top-right corner defines: `r(✓) = 1` and `r(✗) = 0`.
* **Data Flow:** Shows a single, deeper branching tree structure originating from the "Question". The tree has multiple levels of S and V nodes.
* **Dataset Symbol:** An arrow points from the tree structure to a box labeled "D_learn".
* **Learning Process:** A dashed arrow points from "D_learn" to a box labeled "RL" (Reinforcement Learning), with the text "PG Loss" (Policy Gradient Loss) written on the arrow.
**Legend (Bottom-Right Corner):**
* A box containing the key for all symbols:
* Green checkmark (✓): "Correct"
* Red cross (✗): "Incorrect"
* Purple circle (S): "Solution"
* Yellow circle (V): "Verification"
* Dashed purple box: "EOT"
### Detailed Analysis
**Pair Supervised Fine-Tuning (Left Panel):**
1. **Process:** A single "Question" generates multiple, independent solution-verification *pairs*. Each pair is a short sequence (e.g., Q -> S -> V).
2. **Data Generation:** The diagram shows four such pairs. The correctness of the S and V steps within each pair is mixed (e.g., one pair has S✗, V✓; another has S✓, V✗).
3. **Outcome:** All these paired trajectories are collected into a static dataset, `D_pair`. This dataset is then used for a standard Supervised Fine-Tuning (SFT) procedure. The learning is offline and based on pre-collected, paired examples.
**Online Reinforcement Learning (Right Panel):**
1. **Process:** A single "Question" generates one deep, branching *tree* of solution and verification steps, rather than independent pairs. This represents an interactive, step-by-step reasoning process.
2. **Data Generation:** The tree shows a path where initial incorrect steps (S✗) are followed by corrective steps, leading to a final correct verification (V✓). Other branches show failure paths (ending with V✗).
3. **Reward Signal:** Each verification step (V) receives a reward `r`: 1 for correct (✓), 0 for incorrect (✗). This reward signal is used to evaluate the entire trajectory.
4. **Outcome:** The experience from this interactive process (the tree of decisions and outcomes) is used to populate a dataset `D_learn`. This dataset is used for Reinforcement Learning (RL) with a Policy Gradient (PG) Loss, updating the model online based on the rewards received.
### Key Observations
1. **Structural Difference:** The SFT process uses flat, parallel pairs. The RL process uses a deep, sequential tree, indicating a more complex, multi-step reasoning chain.
2. **Learning Paradigm:** SFT learns from static, labeled pairs (correct/incorrect). RL learns from dynamic, reward-based feedback (1/0) on the outcome of a process.
3. **Error Correction:** The RL diagram explicitly shows a path where an initial incorrect solution (S✗) is followed by a correct one, suggesting the model can recover from errors during the reasoning process. The SFT diagram shows pairs as isolated units.
4. **Data Efficiency:** From a single question, the RL approach appears to extract richer, more diverse multi-step data (many root-to-leaf paths through one tree) than the SFT approach's handful of independent attempts.
### Interpretation
This diagram contrasts two fundamental approaches to training reasoning or verification capabilities in AI models.
* **Pair Supervised Fine-Tuning** represents a **static, imitation-based approach**. The model is trained to mimic correct solution-verification pairs from a fixed dataset (`D_pair`). It learns "what a good pair looks like" but may not learn the *process* of arriving at a correct answer through trial and error. The data is collected offline, possibly from human demonstrations or a stronger model.
* **Online Reinforcement Learning** represents a **dynamic, trial-and-error approach**. The model actively generates a reasoning trajectory, receives a reward based on the final verification outcome, and updates its policy to maximize future rewards. This teaches the model the *process* of reasoning, including how to correct mistakes, as it directly associates actions (solutions) with outcomes (rewards). The learning is online and interactive.
The core message is a shift from learning from **static examples** (SFT) to learning from **interactive experience and outcomes** (RL). The RL method, with its deeper tree and reward signal, is likely aimed at developing more robust, self-correcting, and process-oriented reasoning skills, whereas SFT provides a foundational capability based on curated examples. The "PG Loss" specification indicates the use of a policy gradient algorithm, a common RL technique for discrete action spaces like selecting the next solution step.
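A standard policy-gradient objective consistent with the "PG Loss" label (assumed; the diagram does not give the formula) is the REINFORCE gradient, with the trajectory return set by the terminal verification:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      G(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \right],
\qquad
G(\tau) = r(v_T),\quad r(\checkmark) = 1,\ r(\times) = 0.
```

With a binary return, trajectories ending in an incorrect verification contribute nothing to the gradient, and correct trajectories uniformly reinforce every step along the path.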