## Diagram: R-PRM Training and Reward Generation Pipeline
### Overview
This image is a technical system architecture diagram illustrating a two-part machine learning pipeline. The left side depicts the data collection and training process for a Process Reward Model (R-PRM) using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The right side, separated by a vertical dashed line, demonstrates the inference phase where the trained model evaluates an input through sampling to calculate a final average reward score.
*Language Declaration:* All text in this image is in English. No other languages are present.
### Components
The diagram utilizes several distinct visual components to convey information:
* **Containers:** Rounded rectangles in various colors (light blue, light green, light red, purple) representing data inputs, outputs, and analytical steps.
* **Icons:**
* A basic white robot head (representing a base Large Language Model).
* A blue robot head with a red siren/light on top (representing the specialized R-PRM model).
* Stacked green coins/disks (representing the SFT Dataset).
* Stacked blue disks (representing the Preference Dataset).
* Green checkmarks ($\checkmark$) and red crosses ($\times$) indicating correctness.
* **Connectors:** Solid black arrows indicating process flow and data movement. Curly brackets grouping data elements.
* **Dividers:** A vertical dashed gray line separating the training phase (left) from the inference phase (right). A dashed rectangular box grouping numerical scores.
### Content Details
#### Left Section: Training Pipeline (Data Collection & Model Training)
**1. Input Generation (Top Left):**
* A large light blue box is labeled **"Evaluation Input"**.
* Inside this box are three smaller light blue boxes labeled: **"Problem"**, **"Previous Steps"**, and **"Now Step"**.
* To the right of the Evaluation Input box is a small purple box labeled **"Label: No"**.
* Below the Evaluation Input box, text reads: **"Collect response from LLM to construct seed data"**. A black arrow points downward from this text.
**2. Seed Data Construction (Middle Left):**
Below the arrow are three horizontal boxes representing generated analyses.
* **Box 1 (Light Green):**
* Text: "Previous Steps Analysis: This step starts by ..."
* Text: "......"
* Text: "Verification: Is the step correct (Yes/No)? No"
* Symbol: A large green checkmark ($\checkmark$) is positioned at the bottom right.
* **Box 2 (Light Green):**
* Text: "......"
* Text: "Now Step Analysis: Now Step checks if 23 is ..."
* Text: "Verification: Is the step correct (Yes/No)? No"
* Symbol: A large green checkmark ($\checkmark$) is positioned at the bottom right.
* **Box 3 (Light Red):**
* Text: "......"
* Text: "Calculation Analysis: The calculation in the..."
* Text: "Verification: Is the step correct (Yes/No)? Yes"
* Symbol: A large red 'X' ($\times$) is positioned at the bottom right.
**3. Dataset Creation and Model Training (Center Column):**
* A black curly bracket groups **Box 1** and **Box 2** (the green boxes). This bracket points to a green stacked-disk icon labeled **"SFT Dataset"**.
* A second black curly bracket groups **Box 2** (green) and **Box 3** (red). This bracket points to a blue stacked-disk icon labeled **"Preference Dataset"**.
* Above the datasets, a white robot icon has a downward arrow pointing to text **"R-PRM SFT"**.
* Below "R-PRM SFT" is a blue robot icon. A downward arrow points from it to text **"R-PRM DPO"**.
* Below "R-PRM DPO" is another instance of the blue robot icon, representing the final trained model.
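The grouping shown by the curly brackets can be sketched in Python. The field names, the parsing of each analysis into a `verdict`, and the all-pairs matching of correct and incorrect analyses are assumptions for illustration; the diagram itself only shows one pair (Box 2 with Box 3) and specifies no schema:

```python
# Sketch of the dataset-construction step. Each generated analysis is
# assumed to have been parsed into a dict with a "text" field and a
# predicted "verdict" ("Yes"/"No"); these names are illustrative.

def build_datasets(analyses, gold_label):
    """Split sampled analyses into SFT data and DPO preference pairs."""
    correct = [a for a in analyses if a["verdict"] == gold_label]    # green checkmark
    incorrect = [a for a in analyses if a["verdict"] != gold_label]  # red cross

    # SFT dataset: keep only analyses whose verdict matches the label.
    sft_dataset = list(correct)

    # Preference dataset: pair each correct (chosen) analysis with an
    # incorrect (rejected) one for DPO.
    preference_dataset = [
        {"chosen": c["text"], "rejected": r["text"]}
        for c in correct for r in incorrect
    ]
    return sft_dataset, preference_dataset
```

With the three boxes above (label "No", Boxes 1–2 answering "No", Box 3 answering "Yes"), this would place two analyses in the SFT dataset and produce chosen/rejected pairs against Box 3.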
#### Right Section: Inference Pipeline (Evaluation & Reward)
**1. Input and Model (Top Right):**
* A light blue box labeled **"Evaluation Input"** sits at the top.
* A downward arrow points to the blue robot icon (the trained R-PRM model).
**2. Sampling and Verification (Middle Right):**
* A downward arrow labeled **"Sampling"** points from the blue robot to a stack of three overlapping boxes.
* The front box is light red. The text inside reads:
* "......"
* "Verification: Is the step correct (Yes/No)? No"
* The two boxes behind it are light green. The letters "ct" (likely the end of the word "correct") are visible on the right edge of the middle box.
**3. Reward Calculation (Bottom Right):**
* Three downward arrows point from the stacked boxes into a dashed rectangular outline.
* Inside the dashed outline are three colored circles containing numbers:
* Red circle: **"0.4"**
* Green circle: **"0.9"**
* Green circle: **"0.8"**
* To the right of the dashed outline is the word **"Average"**.
* A downward arrow points from the dashed outline to a purple box labeled **"Reward: 0.7"**.
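The averaging step inside the dashed box is plain arithmetic; a minimal sketch using the scores shown in the diagram:

```python
# Average the three per-sample scores shown in the dashed outline.
scores = [0.4, 0.9, 0.8]
reward = sum(scores) / len(scores)
print(f"Reward: {reward:.1f}")  # matches the "Reward: 0.7" box
```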
### Key Observations
* **Color Coding:** Green is consistently used to denote correct verifications or positive outcomes (checkmarks, SFT dataset, high scores). Red is used to denote incorrect verifications or negative outcomes (X marks, low scores). Light blue denotes inputs, and purple denotes final labels/rewards.
* **Mathematical Consistency:** In the reward calculation phase, the three sampled scores are 0.4, 0.9, and 0.8. The sum of these is 2.1. Divided by 3 (to find the average), the result is exactly 0.7, which matches the final "Reward: 0.7" box.
* **Dataset Grouping:** The SFT (Supervised Fine-Tuning) dataset is built exclusively from "correct" (green) examples. The Preference Dataset for DPO (Direct Preference Optimization) is built by contrasting a "correct" (green) example with an "incorrect" (red) example.
### Interpretation
This diagram outlines a sophisticated methodology for training a Process Reward Model (R-PRM) designed to evaluate the step-by-step reasoning of Large Language Models (LLMs).
**The Training Phase (Left):**
The system begins by taking a mathematical or logical problem, its previous steps, and the current step being evaluated. A base LLM generates analyses of these steps.
Crucially, the system evaluates the LLM's own verification of each step against the ground-truth label. When the LLM correctly identifies a step as flawed (Green boxes: "Is the step correct? No" -> $\checkmark$), these correct analyses are packaged into an **SFT Dataset** to teach the model basic evaluative competence.
To make the model more robust, a **Preference Dataset** is created by pairing a correct verification (Green box) against an incorrect one (Red box: the model answers "Yes" for a step whose ground-truth label is "No", hence the $\times$). This paired data is used for **DPO (Direct Preference Optimization)**, which trains the model to actively prefer correct analytical reasoning over flawed reasoning. The evolution of the robot icon from white to blue-with-a-siren visually represents the model gaining this specialized "evaluator/police" capability.
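For reference, the standard DPO objective that the "R-PRM DPO" step presumably applies to these pairs can be sketched as follows. The diagram does not show the loss; the formulation below is the usual DPO loss, and the variable names and `beta` value are assumptions:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: the negative log-sigmoid
    of the beta-scaled log-ratio margin between policy and reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy raises the likelihood of the chosen (correct) analysis relative to the rejected (flawed) one, which is exactly the preference the green-vs-red pairing encodes.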
**The Inference Phase (Right):**
Once trained, the R-PRM is put to work. When given a new "Evaluation Input", it doesn't just generate one answer. It uses a **Sampling** technique, generating multiple verification paths (represented by the stacked boxes).
Each sampled path yields a confidence score (0.4, 0.9, 0.8). The red circle for 0.4 corresponds to the red box in the sample stack, indicating a lower-confidence or negative verification path, while the green circles (0.9, 0.8) represent higher-confidence, positive paths. By averaging these sampled scores, the system produces a final, stabilized **Reward** of 0.7.
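The inference loop described above can be sketched as follows. How each sampled verification is converted into a scalar is not shown in the image; a common choice, assumed here, is the model's probability of answering "Yes" at the verification position. The `sample_verification` method is a hypothetical API, not one the diagram names:

```python
def score_step(model, evaluation_input, num_samples=3):
    """Sample several verification rollouts and average their scores.

    `model.sample_verification` is a hypothetical API returning the
    generated analysis text plus a scalar score (assumed to be
    P("Yes") at the final verification token).
    """
    scores = []
    for _ in range(num_samples):
        _analysis, score = model.sample_verification(evaluation_input)
        scores.append(score)
    return sum(scores) / len(scores)
```

With the diagram's three samples scoring 0.4, 0.9, and 0.8, this loop reproduces the final reward of 0.7.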
*Reading between the lines:* This architecture suggests an attempt to mitigate LLM hallucinations in complex reasoning tasks (like math or coding). By training a model specifically to verify individual steps (Process Reward) rather than just the final answer (Outcome Reward), and by using multi-path sampling to smooth out anomalies, the resulting reward score is likely much more reliable for guiding reinforcement learning or search algorithms (like Monte Carlo Tree Search).