## Technical Diagram: Multi-Stage Training Pipeline for Reasoning Models
### Overview
The image is a technical flowchart illustrating a four-stage training pipeline for a language model designed to perform reasoning tasks, likely using Chain-of-Thought (CoT) methods. The pipeline progresses from left to right, showing data flow and transformations through Pretraining, Non-reflective Supervised Fine-Tuning (SFT), Reflective SFT, and Reinforcement Learning (RL) fine-tuning. The diagram uses color-coded boxes, arrows, and mathematical notation to represent processes, data structures, and model components.
### Components/Axes
The diagram is segmented into four primary, labeled stages:
1. **(I) Pretraining (Green Box, Top-Left):**
* **Input:** "Training Data" (depicted as a green cylinder) and "CoT examples" (arrow from top).
* **Process:** Shows sequences of tokens (Q, R₁, R₂, A, etc.) being processed. Red boxes highlight "context windows (randomly drawn)".
* **Output:** A policy model denoted by π.
2. **(II) Non-reflective SFT (Blue Box, Center):**
* **Input:** The policy model π from stage (I).
* **Process:** Depicts a sequence of "states" (Q, S₁, S₂, ...) mapping to "steps or answers" (R₁, R₂, A, ...). The transition is governed by the policy π.
* **Output:** A refined policy model π.
3. **(III) Reflective SFT (Gray Box, Bottom-Left):**
* **Input:** "CoT examples (data mixture)" and the policy model π from stage (II).
* **Process:** Involves "sampling CoTs through MTP" (Model-Thought-Process, inferred). Shows a matrix with columns Q, S₁, S₂, ... and rows R₁, R₂, R₃, ... leading to "ground-truth verification" (V₁, V₂, V₃, ...). An "Expert Verifier" block processes this.
* **Output:** A policy model denoted by π̃ (π-tilde).
4. **(IV) RL fine-tuning (Orange Box, Right):**
* **Input:** The policy model π̃ from stage (III).
* **Process:** Contains two sub-modules:
* **MTP:** Shows a loop: `Q -> [π] -> R_t -> A`. State update: `S_t = T(S_{t-1}, R_t)`.
* **RMTP:** Shows a loop: `Q -> [π] -> (R_t, V_t) -> A`. State update: `S_t = T(S_{t-1}, R_t, V_t)`.
* **Components:** Includes a "Reward Model" and "Policy Optimization" block, with feedback loops (orange arrows) connecting them to the MTP/RMTP processes.
**Flow Arrows:** Black arrows indicate the primary data/model flow from (I) -> (II) -> (III) -> (IV). An additional arrow feeds "CoT examples" from the start into stage (III).
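The stage ordering can be sketched as a simple chain, assuming hypothetical placeholder functions for each stage; only the flow (I) -> (II) -> (III) -> (IV) and the extra CoT-examples arrow into stage (III) come from the figure itself:

```python
# Hypothetical stage functions standing in for the diagram's boxes.
# Return values are string tags, not real models.
def pretrain(data):            return "pi"        # (I)  yields policy π
def non_reflective_sft(pi):    return "pi"        # (II) refines π
def reflective_sft(pi, cots):  return "pi_tilde"  # (III) yields π̃
def rl_finetune(pi_tilde):     return "pi_final"  # (IV) optimized policy

cot_examples = ["Q R1 R2 A"]
pi = pretrain(cot_examples)                  # (I)
pi = non_reflective_sft(pi)                  # (II)
pi_t = reflective_sft(pi, cot_examples)      # (III): CoT examples fed in again
final = rl_finetune(pi_t)                    # (IV)
```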
### Detailed Analysis
**Stage (I) Pretraining:**
* **Data Structure:** Sequences consist of a Question (Q), reasoning steps (R₁, R₂, ...), and an Answer (A). The diagram shows two example sequences: `Q, R₁, R₂, A` and `Q, R₁, A, Q, R₁, R₂, A, Q, R₁...`.
* **Key Mechanism:** "Context windows (randomly drawn)" are highlighted, suggesting the model is pretrained on variable-length segments of these reasoning chains.
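The "randomly drawn context windows" mechanism can be sketched as sampling variable-length contiguous slices from a token sequence; the function name and parameters below are hypothetical, chosen only to mirror the red boxes in stage (I):

```python
import random

def draw_context_windows(tokens, num_windows, max_len, seed=0):
    """Draw random variable-length contiguous windows from a token
    sequence, mirroring the 'context windows (randomly drawn)' boxes."""
    rng = random.Random(seed)
    windows = []
    for _ in range(num_windows):
        length = rng.randint(1, min(max_len, len(tokens)))
        start = rng.randint(0, len(tokens) - length)
        windows.append(tokens[start:start + length])
    return windows

# Example sequence from the diagram: Q, R1, R2, A
sequence = ["Q", "R1", "R2", "A"]
for w in draw_context_windows(sequence, num_windows=3, max_len=3):
    print(w)
```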
**Stage (II) Non-reflective SFT:**
* **Mapping:** Establishes a direct mapping from a sequence of states (starting with Q, then S₁, S₂...) to a sequence of steps/answers (R₁, R₂, A). This represents standard supervised fine-tuning on reasoning traces.
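The state-to-step mapping of stage (II) amounts to unrolling one reasoning trace into supervised (state, target) pairs, where each state extends the previous one with the step just emitted. A minimal sketch, with hypothetical names:

```python
def trace_to_sft_pairs(question, steps, answer):
    """Unroll a reasoning trace (Q, R1, ..., A) into (state, target)
    pairs as in stage (II): the state starts at Q and grows with each
    emitted step, so S_t contains Q plus R_1..R_t."""
    targets = steps + [answer]
    pairs = []
    state = [question]                  # S0 = Q
    for target in targets:
        pairs.append((tuple(state), target))
        state = state + [target]        # next state extends S_t with R_{t+1}
    return pairs

pairs = trace_to_sft_pairs("Q", ["R1", "R2"], "A")
# [(('Q',), 'R1'), (('Q', 'R1'), 'R2'), (('Q', 'R1', 'R2'), 'A')]
```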
**Stage (III) Reflective SFT:**
* **Verification Process:** Introduces a verification step. For a given question (Q) and state sequence (S₁, S₂...), multiple reasoning steps (R₁, R₂, R₃...) are generated and paired with verification scores or labels (V₁, V₂, V₃...). An "Expert Verifier" evaluates these against ground truth.
* **Model Update:** This process uses the policy π to sample CoTs and produces an updated policy π̃, incorporating reflective or verified reasoning.
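The verification step of stage (III) can be sketched as sampling several candidate chains per question and attaching a ground-truth label V_i to each; `sampler` and `verifier` below are hypothetical stand-ins for the policy π and the Expert Verifier block:

```python
def reflective_sft_data(question, sampler, verifier, k=3):
    """Sample k candidate reasoning chains for a question and attach a
    verification label V_i to each, as the Expert Verifier block suggests.
    The labeled set would then be used to train the updated policy π̃."""
    labeled = []
    for i in range(k):
        steps, answer = sampler(question, i)
        v = verifier(question, answer)      # V_i: ground-truth verification
        labeled.append({"steps": steps, "answer": answer, "verified": v})
    return labeled

# Toy stand-ins: the sampler alternates right/wrong answers,
# the verifier checks against the known ground truth "4".
sampler = lambda q, i: ([f"R1_{i}"], "4" if i % 2 == 0 else "5")
verifier = lambda q, a: a == "4"
data = reflective_sft_data("2+2?", sampler, verifier, k=3)
```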
**Stage (IV) RL fine-tuning:**
* **MTP (Model-Thought-Process):** A baseline process in which the policy generates each reasoning step (R_t) from the current state, the state is updated from only the previous state and the new step (`S_t = T(S_{t-1}, R_t)`), and the answer (A) is emitted at the end of the chain.
* **RMTP (Reflective/Reinforced MTP):** An enhanced process in which the policy generates each reasoning step together with an associated verification/value signal (R_t, V_t), and the state update incorporates both (`S_t = T(S_{t-1}, R_t, V_t)`).
* **Optimization Loop:** The Reward Model provides feedback, which drives Policy Optimization. This optimized policy is then used in the MTP/RMTP modules, creating a reinforcement learning loop.
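The two transition functions and a single RMTP rollout can be sketched as follows; every callable here (`policy`, `verifier`, `reward_model`) is a hypothetical stand-in, since the diagram specifies only the update equations and the feedback loop:

```python
def T_mtp(state, step):
    """MTP state update: S_t = T(S_{t-1}, R_t)."""
    return state + (step,)

def T_rmtp(state, step, verification):
    """RMTP state update: S_t = T(S_{t-1}, R_t, V_t)."""
    return state + ((step, verification),)

def rollout_rmtp(question, policy, verifier, reward_model, max_steps=4):
    """One RMTP rollout: the policy emits steps, each step receives a
    verification signal, and the reward model scores the trajectory,
    which would then drive policy optimization."""
    state = (question,)                       # S0 = Q
    for _ in range(max_steps):
        step = policy(state)
        v = verifier(step)
        state = T_rmtp(state, step, v)        # S_t = T(S_{t-1}, R_t, V_t)
        if step == "A":                       # terminal answer token
            break
    return reward_model(state), state

# Toy stand-ins: emit two reasoning steps, then the answer.
policy = lambda s: "A" if len(s) >= 3 else f"R{len(s)}"
verifier = lambda step: True
reward_model = lambda s: 1.0 if s[-1][0] == "A" else 0.0
reward, trajectory = rollout_rmtp("Q", policy, verifier, reward_model)
```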
### Key Observations
1. **Progressive Complexity:** The pipeline evolves from simple sequence modeling (I) to supervised step-by-step reasoning (II), then adds verification (III), and finally integrates reinforcement learning with value feedback (IV).
2. **Notation Consistency:** The policy model is consistently denoted by π, with a tilde (π̃) used after the reflective stage to indicate a modified version.
3. **State Representation:** The state `S_t` is explicitly defined as a function `T` of previous states and reasoning steps/values, formalizing the reasoning trajectory.
4. **Dual RL Paths:** The RL stage explicitly contrasts a basic MTP with an enhanced RMTP, highlighting the integration of verification signals (V_t) into the decision and state-update process.
### Interpretation
This diagram outlines a sophisticated methodology for training AI models to perform complex reasoning. The pipeline's core innovation lies in its multi-stage approach that moves beyond standard pretraining and SFT.
* **From Imitation to Reflection:** Stages (I) and (II) teach the model to mimic reasoning patterns from data. Stage (III) introduces a critical reflective component, where the model's outputs are verified, likely teaching it to distinguish between valid and invalid reasoning paths.
* **From Supervision to Optimization:** Stage (IV) transitions from learning from static examples (SFT) to learning from outcomes via RL. The inclusion of the RMTP module suggests that the verification signal (V_t) from stage (III) is not just used for filtering data but is integrated as a core component of the decision-making process during RL, potentially guiding the model towards more reliable and high-reward reasoning strategies.
* **Overall Goal:** The pipeline aims to produce a model (the final optimized policy) that doesn't just generate plausible-sounding reasoning chains but can engage in a verifiable, step-by-step thought process (`S_t = T(...)`) that is optimized for correctness, as determined by a reward model. This is a common framework for developing "reasoning" or "chain-of-thought" capabilities in large language models.