## Diagram: Comparison of Reward Models for Multimodal Math Reasoning
### Overview
The image is a technical diagram comparing three different reward model architectures for evaluating and improving multimodal mathematical reasoning. It is divided into three horizontal panels, labeled (a), (b), and (c), each illustrating a distinct model: the Outcome Reward Model (ORM), the Multimodal Process Reward Model (PRM), and the proposed GM-PRM (Ours). The diagram uses a flowchart style with boxes, arrows, and icons to depict the process flow, inputs, outputs, and feedback mechanisms for each model.
### Components/Axes
The diagram is structured into three distinct sections, each with its own title and flow.
**Panel (a): Outcome Reward Model**
* **Title:** `(a) Outcome Reward Model`
* **Input:** A box labeled `Question` containing a chalkboard icon with the equation `E=mc²`.
* **Process:** An arrow labeled `Output` points to a box labeled `Answer`.
* **Evaluation:** An arrow points from the `Answer` box to a blue robot icon labeled `ORM`.
* **Feedback:** A dashed orange arrow labeled `Reward` points from the `ORM` robot back to the `Answer` box.
* **Annotation:** Red text in the top-right corner states: `ONLY Reward Final Ans`.
**Panel (b): Multimodal Process Reward Model**
* **Title:** `(b) Multimodal Process Reward Model`
* **Input:** A box labeled `Multimodal Math Qns` containing icons of a math problem sheet and a pencil.
* **Process:** A sequence of boxes connected by arrows: `Step 1` -> `Step 2` -> `...` -> `Step T` -> `Answer`.
* **Evaluation:** An arrow points from the `Answer` box to a blue robot icon labeled `PRM`.
* **Feedback:** Multiple dashed orange arrows labeled `Reward` point from the `PRM` robot back to each intermediate step (`Step 1`, `Step 2`, `Step T`) and the final `Answer`.
* **Annotations:** Two gray boxes on the right list limitations: `Limited Explainability` and `No Correction Mechanism`.
**Panel (c): GM-PRM (Ours)**
* **Title:** `(c) GM-PRM (Ours)`
* **Input:** A box labeled `Multimodal Math Qns` (identical to panel b).
* **Process:** A sequence of boxes: `Step 1` -> `Step 2` -> `...` -> `Step T` -> `Answer`.
* **Key Modification:** `Step 2` is highlighted in purple and marked by a purple arrow labeled `1st incorrect step`. A second purple arrow, labeled `After Correction`, re-enters the process flow between `Step 2` and the `...` box, indicating where the corrected step rejoins the chain.
* **Evaluation:** An arrow points from the `Answer` box to a purple robot icon labeled `GM-PRM`. The arrow is labeled `Refined BoN`.
* **Feedback:** Dashed orange arrows labeled `Reward` point from the `GM-PRM` robot back to each step and the answer.
* **Analysis Module:** A large purple dashed box at the bottom is labeled `Analysis & Judgement`. Inside, three connected purple boxes are labeled: `Step Intent`, `Image Alignment`, and `Reasoning Logic`.
* **Annotation:** A gray box on the right states: `Refined & Corrected Version`.
### Detailed Analysis
The diagram presents a clear evolution of model sophistication:
1. **Outcome Reward Model (ORM):** This is the simplest model. It takes a question, generates an answer, and the ORM provides a reward signal based **only** on the final answer's correctness. There is no evaluation of the reasoning process.
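The outcome-only scoring that the diagram attributes to the ORM can be sketched as follows. This is an illustrative assumption, not the actual model: a real ORM is a learned scorer, and the exact-match comparison against a `reference` answer here merely stands in for it.

```python
def orm_reward(final_answer: str, reference: str) -> float:
    """Score ONLY the final answer; intermediate steps are ignored.

    Stand-in for a learned outcome reward model (illustrative assumption):
    here, exact string match against a reference answer.
    """
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

# Every step of the reasoning chain shares one terminal signal:
# a chain with flawed intermediate steps can still score 1.0.
reward = orm_reward("42", "42")  # -> 1.0, regardless of step quality
```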
2. **Multimodal Process Reward Model (PRM):** This model introduces step-wise evaluation. It breaks down the solution into discrete steps (`Step 1` to `Step T`). The PRM provides reward signals for each intermediate step and the final answer. However, the diagram notes two critical flaws: it offers **limited explainability** for its rewards and has **no mechanism to correct** an identified incorrect step.
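The step-wise evaluation of the PRM can be sketched in the same style. The per-step scorer is a hypothetical stand-in for the learned model; note that the output is a bare list of scalars, which mirrors the two flaws the diagram names: no explanation accompanies a score, and a low score triggers no correction.

```python
from typing import Callable, List

def prm_rewards(steps: List[str],
                score_step: Callable[[str], float]) -> List[float]:
    """Return one reward per element of the chain (Step 1 .. Step T, Answer).

    `score_step` stands in for a learned process reward model
    (illustrative assumption).
    """
    return [score_step(s) for s in steps]

# Trivial stand-in scorer: reward non-empty steps.
toy_scorer = lambda s: 1.0 if s else 0.0
rewards = prm_rewards(["Step 1", "Step 2", "Answer"], toy_scorer)
# -> one scalar per step, but no rationale and no repair of bad steps.
```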
3. **GM-PRM (Ours):** This is the proposed, advanced model. It builds upon the PRM framework but introduces two major enhancements:
* **Correction Mechanism:** It can identify the `1st incorrect step` (shown as `Step 2` in purple) and initiate a correction process (`After Correction` arrow), leading to a `Refined BoN` (Best-of-N) answer.
* **Analysis & Judgement Module:** A dedicated component performs deep analysis based on three criteria: `Step Intent` (understanding the goal of the step), `Image Alignment` (ensuring the step correctly uses visual information), and `Reasoning Logic` (validating the logical flow). This module directly informs the GM-PRM's reward and correction process, addressing the "limited explainability" issue of the standard PRM. The resulting model is described as a `Refined & Corrected Version`.
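The control flow the diagram assigns to GM-PRM can be sketched as two pieces: locating the `1st incorrect step` and selecting among candidate chains via Best-of-N. The threshold, the reward lists, and both function names are illustrative assumptions; the diagram does not specify how corrections are generated, so that part is omitted.

```python
from typing import List

def first_incorrect(rewards: List[float], threshold: float = 0.5) -> int:
    """Index of the first step scoring below threshold, or -1 if none.

    Mirrors the `1st incorrect step` marker: correction would restart
    the chain from this index (threshold is an illustrative assumption).
    """
    for i, r in enumerate(rewards):
        if r < threshold:
            return i
    return -1

def refined_best_of_n(candidates: List[List[float]]) -> int:
    """Pick the candidate chain with the highest mean step reward.

    A simple reading of `Refined BoN`: Best-of-N selection over
    (corrected) candidate solutions, scored step by step.
    """
    means = [sum(c) / len(c) for c in candidates]
    return means.index(max(means))

# Step rewards for three candidate chains (one list per candidate).
cands = [[0.9, 0.2, 0.8], [0.9, 0.9, 0.7], [0.4, 0.6, 0.5]]
bad_step = first_incorrect(cands[0])   # -> 1 (i.e. Step 2 is the 1st error)
winner = refined_best_of_n(cands)      # -> 1 (highest mean step reward)
```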
### Key Observations
* **Color Coding:** The diagram uses color consistently to denote the proposed model's components. Purple is used for the GM-PRM robot, the incorrect step, the correction flow, and the entire Analysis & Judgement module, visually distinguishing it from the blue ORM/PRM robots and orange reward arrows.
* **Flow Complexity:** The process flow increases in complexity from (a) to (c). Panel (a) is a simple loop, (b) adds parallel reward loops, and (c) adds a corrective branch and a parallel analysis subsystem.
* **Spatial Layout:** The three models are stacked vertically for direct comparison. The "Analysis & Judgement" module in (c) is placed at the bottom, acting as a foundational support for the GM-PRM process above it.
* **Iconography:** Simple icons (chalkboard, math sheet, pencil, robot) are used to represent concepts, making the diagram accessible. The robot's expression changes from a simple smile (ORM, PRM) to a more focused, determined look (GM-PRM), subtly implying greater capability.
### Interpretation
This diagram argues for a paradigm shift in reward modeling for multimodal reasoning tasks. It posits that evaluating only the final outcome (ORM) is insufficient. While evaluating each process step (PRM) is better, it remains a passive evaluator that cannot explain its judgments or intervene when errors occur.
The **GM-PRM** is presented as a solution that transforms the reward model from a passive judge into an active tutor. By integrating a structured `Analysis & Judgement` module that scrutinizes intent, visual grounding, and logic, it gains the explainability the PRM lacks. More importantly, by incorporating a `Correction Mechanism`, it can actively repair flawed reasoning chains. This suggests the GM-PRM is designed not just to score performance, but to **improve** the reasoning process itself, leading to more reliable and refined outputs. The diagram effectively communicates that the key innovation is the closed-loop system of analysis, judgment, reward, and correction.