## Workflow Diagram: GenPRM Training and Test-Time Scaling
### Overview
The image presents a workflow diagram outlining the training and test-time scaling of GenPRM, a generative process reward model (PRM). It details the steps involved in solution generation, relative progress estimation, rationale synthesis, model training, and test-time scaling of both the policy model and GenPRM itself.
### Components/Axes
The diagram is divided into six main sections, each numbered:
1. Solution Generation & MC Reward Estimation
2. Relative Progress Estimation
3. Rationale Synthesis
4. GenPRM Training
5. Policy Model Test-Time Scaling
6. GenPRM Test-Time Scaling
**Legend (located at the bottom):**
* **a (blue square):** Intermediate Step
* **q (blue circle):** Correct Answer
* **q (pink circle):** Incorrect Answer
* **r (green square):** Single Reward Judgement
* **r (green circle):** Aggregated Reward Judgement
### Detailed Analysis
**1. Solution Generation & MC Reward Estimation (Top-Left)**
* Starts with a "Math Problem" (white rectangle).
* A tree-like structure follows, with nodes representing intermediate steps (blue squares labeled a11, a21, a31, a12, a22, a32, a12).
* Correct answers (blue circles labeled q1, q2, q3) and incorrect answers (pink circles labeled q1, q2, q3) are present.
* The final step shows three blue circles and one pink circle.
* A formula is shown: MC(s12, a12) = 2/3
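The diagram does not spell out how the MC value is computed. A minimal sketch, assuming the usual Monte Carlo estimate (the fraction of rollouts sampled from a step that end in a correct answer), is shown below. The function name `mc_estimate` and the example rollouts are hypothetical.

```python
from typing import Sequence


def mc_estimate(rollout_correct: Sequence[bool]) -> float:
    """Fraction of rollouts from a given step that reach a correct answer.

    Assumption: each entry is True if that rollout ended in a correct
    answer (blue circle) and False otherwise (pink circle).
    """
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


# Hypothetical example: two of three rollouts are correct, which would
# give a value of 2/3 like the one shown in the diagram.
value = mc_estimate([True, True, False])
print(f"MC estimate: {value:.3f}")
```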
**2. Relative Progress Estimation (Top-Center)**
* Formula: $P_t = \frac{MC(s_t, a_t)}{MC(s_t)} \geq \epsilon$
* A green checkmark indicates a positive reward: $\hat{r}_t = 1$
* A red "X" indicates a negative reward: $\hat{r}_t = 0$
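The thresholding rule above can be sketched as follows. This is an illustration under assumptions: `mc_after` stands for MC(st, at), `mc_before` for MC(st), the handling of a zero denominator is a choice made here, and the value of epsilon is not given in the diagram.

```python
def relative_progress_label(mc_after: float, mc_before: float, epsilon: float) -> int:
    """Return 1 if the relative progress ratio reaches epsilon, else 0.

    mc_after  : MC estimate after taking the step, MC(s_t, a_t)
    mc_before : MC estimate of the state before the step, MC(s_t)
    epsilon   : threshold (its value is not stated in the diagram)
    """
    if mc_before <= 0:
        # Assumption: with no chance of success before the step,
        # the ratio is undefined, so the step is labeled negative.
        return 0
    return 1 if mc_after / mc_before >= epsilon else 0


# Hypothetical values: the step keeps most of the success probability.
print(relative_progress_label(mc_after=2 / 3, mc_before=1.0, epsilon=0.5))
# Hypothetical values: the step loses most of the success probability.
print(relative_progress_label(mc_after=0.2, mc_before=0.8, epsilon=0.5))
```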
**3. Rationale Synthesis (Center-Left)**
* Input: "Problem" (dashed box) with steps a1 to aT.
* "CoT Analysis" (Chain-of-Thought analysis) box, containing the text:
```
<analyze>
Let's analyze the paragraph step by step: ...
</analyze>
```
* "Code Verification" box, containing the text:
```
<verify>
Let's use python code to find any potential error:
"python..."
</verify>
```
* "Execute" box with a Python logo.
* "Final Label" box, containing the text:
```
<output>
Judgement: Yes/No
</output>
```
* Arrows indicate flow: Problem -> CoT Analysis -> Code Verification -> Execute -> Final Label.
* "consistent" and "conflict" labels indicate the relationship between "Code Output" and "Final Label".
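The tagged rationale format above lends itself to simple parsing. The sketch below extracts the `<analyze>`, `<verify>`, and `<output>` sections and compares the final judgement with a verdict derived from code execution. The helper names and the boolean `code_says_correct` input are hypothetical; how the diagram's pipeline actually derives a verdict from the code output is not shown.

```python
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of <tag>...</tag>, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_judgement(rationale: str) -> Optional[bool]:
    """Map 'Judgement: Yes' to True, 'Judgement: No' to False, else None."""
    output = extract_tag(rationale, "output")
    if output is None:
        return None
    match = re.search(r"Judgement:\s*(Yes|No)", output)
    return None if match is None else match.group(1) == "Yes"


def is_consistent(rationale: str, code_says_correct: bool) -> bool:
    """True when the final label agrees with the code-based verdict.

    A missing or unparseable judgement counts as a conflict.
    """
    return parse_judgement(rationale) == code_says_correct


sample = """<analyze>
Let's analyze the paragraph step by step: ...
</analyze>
<verify>
Let's use python code to find any potential error:
</verify>
<output>
Judgement: Yes
</output>"""

print(parse_judgement(sample))
print(is_consistent(sample, code_says_correct=False))
```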
**4. GenPRM Training (Top-Center-Right)**
* "Data (23K)" box.
* "SFT" (Supervised Fine-Tuning) arrow leading to "GenPRM" box.
* "Consensus filtering" box.
**5. Policy Model Test-Time Scaling (Top-Right)**
* "GenPRM as a Verifier" section:
* Candidate solutions (pink and blue circles) are passed to GenPRM, which verifies them.
* The stage is marked "× N".
* "GenPRM as a Critic" section:
* A solution is passed to GenPRM, which generates a critique of it.
* The stage is marked "× N".
* A warning sign is present.
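One common way to use a verifier over N sampled solutions is Best-of-N selection: score every candidate and keep the highest-scoring one. The diagram only shows the "× N" marker, so reading it as Best-of-N is an assumption, and the `score` callable below is a hypothetical stand-in for a GenPRM-based scorer.

```python
from typing import Callable, Sequence


def best_of_n(solutions: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not solutions:
        raise ValueError("need at least one candidate solution")
    return max(solutions, key=score)


# Toy scorer for illustration only: a lookup table of made-up rewards.
toy_rewards = {"solution A": 0.2, "solution B": 0.9, "solution C": 0.5}
chosen = best_of_n(list(toy_rewards), score=lambda s: toy_rewards[s])
print(chosen)
```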
**6. GenPRM Test-Time Scaling (Bottom-Right)**
* Input: GenPRM.
* Multiple "analyze" and "verify" steps.
* Single Reward Judgements (r1, r2, r3, r4 - green squares).
* Aggregated Reward Judgement (r - green circle).
* The stage is marked "× N".
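The diagram does not state how the single judgements are combined into the aggregated one. A minimal sketch, assuming simple averaging followed by a threshold (equivalent to majority voting for binary rewards), is given below. The function names and the 0.5 threshold are assumptions.

```python
from typing import Sequence


def aggregate_rewards(single_rewards: Sequence[float]) -> float:
    """Average the single reward judgements into one aggregated reward."""
    if not single_rewards:
        raise ValueError("need at least one reward judgement")
    return sum(single_rewards) / len(single_rewards)


def aggregated_label(single_rewards: Sequence[float], threshold: float = 0.5) -> int:
    """Binary decision from the aggregated reward (assumed threshold)."""
    return 1 if aggregate_rewards(single_rewards) >= threshold else 0


# Four hypothetical single judgements, standing in for r1 to r4.
rewards = [1, 1, 0, 1]
print(aggregate_rewards(rewards), aggregated_label(rewards))
```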
### Key Observations
* The diagram covers the full pipeline: training data construction (sections 1 to 3), GenPRM training (section 4), and inference-time use (sections 5 and 6).
* It incorporates both chain-of-thought reasoning and code verification for rationale synthesis.
* The model is used both as a verifier and a critic during test-time scaling.
* The diagram highlights the iterative nature of the process, with multiple analysis and verification steps.
### Interpretation
The diagram describes a pipeline for building and using a generative process reward model for math reasoning. Sections 1 to 3 produce the training data: solutions are sampled for a math problem, each step receives an MC estimate, the estimate is converted to a binary label by the relative progress threshold, and a rationale combining CoT analysis with executed code verification is synthesized. Section 4 appears to apply consensus filtering to obtain the 23K training examples and fine-tunes GenPRM on them with SFT. Sections 5 and 6 cover test-time scaling, which the "× N" markers suggest means spending more inference compute per problem rather than handling more problems: GenPRM verifies or critiques the policy model's solutions, and GenPRM itself produces several single judgements that are combined into one aggregated reward. The diagram does not explain the warning sign in the "GenPRM as a Critic" section; it may flag a caveat specific to critique generation.