## Diagram: Reasoning Model Verification and Finetuning Pipeline
### Overview
The image is a technical flowchart illustrating a process for evaluating and curating reasoning chains generated by an AI model. The process involves sampling verification chains that judge each step of a proposed solution, comparing those chains against process labels, and filtering them to create high-quality finetuning data. The diagram uses a left-to-right flow with color-coded elements and symbolic icons (checkmarks, crosses) to indicate correctness.
### Components/Axes
The diagram is structured into four stages, connected by arrows indicating data flow.
**1. Input Stage (Leftmost, Pink Box):**
* **Container:** A large pink rectangle with rounded corners.
* **Labels:** Contains two white sub-boxes.
    * Left sub-box: Labeled "**Problem**" with a large question mark "?" inside.
    * Right sub-box: Labeled "**Solution**" with placeholder text: "Step 1: ...", "Step 2: ...", "Step 3: ...".
* **Function:** Represents the initial input: a problem statement and a proposed multi-step solution generated by a model.
**2. Processing Stage (Center, Purple Box):**
* **Container:** A purple rectangle with rounded corners, connected by an arrow from the Input Stage.
* **Label:** Labeled "**Reasoning Model**".
* **Function:** Represents the AI model that takes the problem and solution as input and generates verification chains judging each solution step (shown in the next stage).
**3. Verification & Filtering Stage (Right, Two Parallel Paths):**
This stage is split into two parallel processing chains, labeled at the top as "**1. Sample verification chains**".
* **Path A (Top Chain - Discarded):**
    * **Container:** A light gray box containing a `<think>` block.
    * **Content:** A reasoning chain with three steps.
        * `Step 1 accurately... and is \boxed{correct}` - Accompanied by a **green checkmark icon**.
        * `Step 2 omits... \boxed{incorrect}` - Accompanied by a **red 'X' icon**.
        * `Step 3 ... \boxed{incorrect}` - Accompanied by a **red 'X' icon**.
    * **Process Label (Right of Chain):** A green box labeled "**Step 1: Correct**", "**Step 2: Incorrect**", "**Step 3: Incorrect**".
    * **Action:** An arrow points from this chain to a large **red 'X'** and the text "**Discard!**". This path is labeled "**2. Compare against process labels**".
* **Path B (Bottom Chain - Kept):**
    * **Container:** A light gray box containing a `<think>` block.
    * **Content:** A reasoning chain with three steps.
        * `Step 1 calculates... Therefore is \boxed{correct}` - Accompanied by a **green checkmark icon**.
        * `Step 2 ... is \boxed{correct}` - Accompanied by a **green checkmark icon**.
        * `Step 3 is... \boxed{incorrect}` - Accompanied by a **red 'X' icon**.
    * **Process Label (Right of Chain):** A green box labeled "**Step 1: Correct**", "**Step 2: Correct**", "**Step 3: Incorrect**".
    * **Action:** An arrow points from this chain to a **green checkmark icon** and then to a yellow cylinder. This path is labeled "**3. Keep good chains**".
**4. Output Stage (Bottom Right):**
* **Container:** A yellow cylinder, a standard icon for a database or storage.
* **Label:** Labeled "**Finetuning data**".
* **Function:** Represents the curated dataset of high-quality reasoning chains (like the one from Path B) used to improve the model.
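The `\boxed{...}` verdicts shown in the two verification chains lend themselves to mechanical extraction. The following is a minimal Python sketch, assuming each chain is available as plain text with one `\boxed{correct}` or `\boxed{incorrect}` per step. The function name `extract_verdicts` and the sample string are hypothetical, and the elided step text is left exactly as it appears in the diagram.

```python
import re

# Matches the per-step verdicts seen in the chains, e.g. \boxed{correct}.
VERDICT_RE = re.compile(r"\\boxed\{(correct|incorrect)\}")


def extract_verdicts(chain_text: str) -> list[str]:
    """Return the verdict strings in order of appearance (hypothetical helper)."""
    return [match.group(1) for match in VERDICT_RE.finditer(chain_text)]


# Text transcribed from the kept chain (Path B); the "..." gaps are in the diagram.
chain_b_text = (
    "Step 1 calculates... Therefore is \\boxed{correct}\n"
    "Step 2 ... is \\boxed{correct}\n"
    "Step 3 is... \\boxed{incorrect}\n"
)

print(extract_verdicts(chain_b_text))  # ['correct', 'correct', 'incorrect']
```

Returning plain strings keeps the output directly comparable to the wording of the process labels.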
### Detailed Analysis
The diagram explicitly details the content of two sample verification chains to illustrate the filtering logic.
* **Chain A (Discarded):** This chain judges Step 1 correct and Steps 2 and 3 incorrect. The process label beside it reads Correct, Incorrect, Incorrect. The outcome is to discard the entire chain.
* **Chain B (Kept):** This chain judges Steps 1 and 2 correct and Step 3 incorrect. The process label beside it reads Correct, Correct, Incorrect. The chain is kept even though its final verdict is "incorrect". The diagram does not state the filtering rule. One possibility is a minimum threshold of correct verdicts (e.g., a majority of steps marked correct); another, suggested by the stage title "Compare against process labels", is that a chain is kept when its verdicts agree with the process labels.
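The majority-threshold reading can be made concrete with a short sketch. This is only one hypothesis about the rule, not something the diagram states; the function name `keep_by_majority` and the `threshold` parameter are assumptions introduced for illustration.

```python
def keep_by_majority(verdicts: list[str], threshold: float = 0.5) -> bool:
    """Keep a chain when more than `threshold` of its steps are marked correct.

    Hypothetical rule: the diagram does not specify the actual criterion.
    """
    if not verdicts:
        return False  # assumption: an empty chain is never kept
    correct = sum(v == "correct" for v in verdicts)
    return correct / len(verdicts) > threshold


chain_a = ["correct", "incorrect", "incorrect"]  # 1 of 3 steps marked correct
chain_b = ["correct", "correct", "incorrect"]    # 2 of 3 steps marked correct

print(keep_by_majority(chain_a))  # False -> discard
print(keep_by_majority(chain_b))  # True  -> keep
```

Under this assumed rule the two sample chains land on the same outcomes the diagram shows, which is consistent with the hypothesis but does not confirm it.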
### Key Observations
1. **Asymmetric Filtering:** A chain need not mark every step correct to be retained. Chain B, with two of three steps marked correct, is kept, while Chain A, with one of three, is discarded.
2. **Process Label Dependency:** Filtering does not rely solely on the model's own `\boxed{correct}` / `\boxed{incorrect}` verdicts. These verdicts are compared against external "**process labels**" (the green boxes), which serve as the ground truth for correctness.
3. **Visual Coding:** Correctness is consistently coded with **green checkmarks** and the word "correct". Incorrectness is coded with **red 'X' icons** and the word "incorrect". The final "Discard!" action is also marked with a large red 'X'.
4. **Spatial Flow:** The layout clearly separates the two outcomes (discard vs. keep) vertically, making the comparison and decision process easy to follow.
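The comparison described in observation 2 can be sketched as a step-by-step agreement check. This is a minimal illustration, assuming verdicts and process labels are both lists of the strings `"correct"` and `"incorrect"`; the function name `step_agreement` and the `candidate` chain are hypothetical and do not come from the diagram.

```python
def step_agreement(verdicts: list[str], process_labels: list[str]) -> list[bool]:
    """Return, per step, whether the chain's verdict equals the process label."""
    if len(verdicts) != len(process_labels):
        raise ValueError("chain and process labels must cover the same steps")
    return [v == p for v, p in zip(verdicts, process_labels)]


process_labels = ["correct", "correct", "incorrect"]

# Hypothetical chain that fails to flag the error in Step 3.
candidate = ["correct", "correct", "correct"]

flags = step_agreement(candidate, process_labels)
print(flags)       # [True, True, False]
print(all(flags))  # False: the chain disagrees with the labels on Step 3
```

Whether a pipeline would require agreement on every step or tolerate some disagreement is a design choice the diagram leaves open.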
### Interpretation
This diagram outlines a **data curation pipeline for improving AI reasoning models**. Its core purpose is to automatically generate training data that teaches the model to assess the *process* of reasoning step by step, not just the final answer.
* **What it demonstrates:** The system uses a "reasoning model" to generate verification chains that judge each step of a given solution. Each chain's step verdicts are then compared against a known standard (process labels), and the result of that comparison is used to filter the generated data.
* **How elements relate:** The "Problem/Solution" input feeds the "Reasoning Model," which produces the detailed chains. The verification stage acts as a quality gate. The "Finetuning data" cylinder is the valuable output, composed only of chains that meet a quality standard (e.g., containing significant correct reasoning).
* **Notable implication:** Chain B is kept even though its final verdict is "incorrect". As read here, this implies that the finetuning process values **partial correctness and the demonstration of correct reasoning methodology**, rather than only chains in which every step is marked correct. Learning from such near-miss examples may make the model more robust. The pipeline also automates the labor-intensive task of creating high-quality, process-oriented training data.
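The three numbered stages in the diagram (sample verification chains, compare against process labels, keep good chains) can be tied together in a short end-to-end sketch. Everything named here is an assumption for illustration: `curate`, `fake_reasoning_model`, and `exact_match` are hypothetical, the model is replaced by canned strings, and the keep rule is passed in as a parameter because the diagram does not specify it.

```python
import itertools
import re
from typing import Callable

VERDICT_RE = re.compile(r"\\boxed\{(correct|incorrect)\}")


def curate(
    problem: str,
    solution: str,
    process_labels: list[str],
    sample_chain: Callable[[str, str], str],
    keep: Callable[[list[str], list[str]], bool],
    n_samples: int = 4,
) -> list[dict]:
    """Sample chains, compare them to the labels, and collect the kept ones."""
    finetuning_data = []
    for _ in range(n_samples):
        chain = sample_chain(problem, solution)   # 1. sample a verification chain
        verdicts = VERDICT_RE.findall(chain)
        if keep(verdicts, process_labels):        # 2. compare against process labels
            finetuning_data.append(               # 3. keep good chains
                {"problem": problem, "solution": solution, "chain": chain}
            )
    return finetuning_data


# Stand-in for the reasoning model: alternates between two canned chains.
_canned = itertools.cycle([
    "Step 1 ... \\boxed{correct} Step 2 ... \\boxed{incorrect} Step 3 ... \\boxed{incorrect}",
    "Step 1 ... \\boxed{correct} Step 2 ... \\boxed{correct} Step 3 ... \\boxed{incorrect}",
])


def fake_reasoning_model(problem: str, solution: str) -> str:
    return next(_canned)


def exact_match(verdicts: list[str], labels: list[str]) -> bool:
    """One candidate keep rule (assumed): every verdict equals its label."""
    return verdicts == labels


data = curate(
    "?",
    "Step 1: ... Step 2: ... Step 3: ...",
    ["correct", "correct", "incorrect"],
    fake_reasoning_model,
    exact_match,
)
print(len(data))  # 2 of the 4 sampled chains are kept
```

Passing `keep` as a callable makes it straightforward to swap in a different rule, such as a correctness threshold, without changing the rest of the loop.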