Image 8bdc7ad276af...

EXPERT: gemini-2.0-flash VERSION 1

RUNTIME: nugit/gemini/gemini-2.0-flash
## Diagram: Comparison of Reward Models

### Overview
The image presents a comparative diagram illustrating three different reward models: Outcome Reward Model (ORM), Multimodal Process Reward Model (PRM), and GM-PRM (a proposed model). The diagram highlights the flow of information and the reward mechanisms associated with each model.

### Components/Axes

*   **Panels:** The image is divided into three panels, each representing a different reward model.
    *   (a) Outcome Reward Model
    *   (b) Multimodal Process Reward Model
    *   (c) GM-PRM (Ours)
*   **Input:** Each model starts with an input.
    *   ORM: "Question" (represented by a blue box containing the equation E=mc²)
    *   PRM and GM-PRM: "Multimodal Math Qns" (represented by a blue box containing math equations)
*   **Process:** The models process the input through a series of steps.
    *   ORM: A single "Output" arrow leading to "Answer" (green box).
    *   PRM and GM-PRM: A series of steps labeled "Step 1", "Step 2", ..., "Step T" (green boxes) leading to "Answer" (green box).
*   **Reward:** Each model includes a reward mechanism.
    *   ORM, PRM, and GM-PRM: A robot icon labeled "ORM", "PRM", and "GM-PRM" respectively, connected to the model output by an orange dashed "Reward" arrow.
*   **Annotations:** Additional text annotations provide context and highlight key differences.
    *   ORM: "ONLY Reward Final Ans" (red text)
    *   PRM: "Limited Explainability", "No Correction Mechanism" (red text)
    *   GM-PRM: "1st incorrect step" (purple text), "After Correction" (purple arrow), "Refined BoN" (green text), "Refined & Corrected Version" (purple text)
*   **Analysis & Judgement (GM-PRM):** A purple box at the bottom of the GM-PRM section contains the labels "Step Intent", "Image Alignment", and "Reasoning Logic".

### Detailed Analysis

**Outcome Reward Model (ORM):**

*   The model takes a "Question" as input.
*   The question is processed to produce an "Answer".
*   The "ORM" produces a "Reward" based only on the final answer.
*   The annotation "ONLY Reward Final Ans" emphasizes that the reward is solely based on the outcome.

**Multimodal Process Reward Model (PRM):**

*   The model takes "Multimodal Math Qns" as input.
*   The input is processed through a series of steps: "Step 1", "Step 2", ..., "Step T".
*   The final step leads to an "Answer".
*   The "PRM" produces a "Reward" based on the intermediate steps, not only on the final answer.
*   The annotations "Limited Explainability" and "No Correction Mechanism" highlight the limitations of this model.
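
The difference between outcome-level and step-level rewards described above can be sketched in a few lines. This is a minimal illustration, not the method shown in the figure: `score_answer`, `score_step`, the toy scorers, and the `min`/`prod` aggregation rules are all assumptions introduced here.

```python
from math import prod
from typing import Callable, List

def orm_reward(answer: str, score_answer: Callable[[str], float]) -> float:
    """Outcome reward: a single score for the final answer only."""
    return score_answer(answer)

def prm_rewards(steps: List[str],
                score_step: Callable[[List[str], str], float]) -> List[float]:
    """Process reward: one score per step, given the steps that precede it."""
    return [score_step(steps[:i], step) for i, step in enumerate(steps)]

def aggregate(step_scores: List[float], how: str = "min") -> float:
    """Collapse per-step scores into one solution-level score (assumed rules)."""
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")

# Hypothetical toy scorers standing in for learned reward models.
def toy_score_answer(answer: str) -> float:
    return 1.0 if answer == "42" else 0.0

def toy_score_step(prefix: List[str], step: str) -> float:
    return 0.2 if "wrong" in step else 0.9

steps = ["set up equation", "wrong simplification", "solve for x"]
step_scores = prm_rewards(steps, toy_score_step)
```

With the toy scorers, the outcome reward sees only the final answer, while the per-step scores expose which step is weak. Neither gives an explanation or a repaired step, which matches the two limitations annotated in panel (b).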

**GM-PRM (Ours):**

*   The model takes "Multimodal Math Qns" as input.
*   The input is processed through a series of steps: "Step 1", "Step 2" (purple), ..., "Step T".
*   The annotation "1st incorrect step" indicates a point where a correction mechanism is applied.
*   The "After Correction" arrow shows the flow after a correction.
*   The final step leads to an "Answer".
*   The "GM-PRM" produces a "Reward" based on the refined process.
*   The "Refined BoN" annotation most likely refers to a refined Best-of-N (BoN) selection, in which several candidate solutions are sampled and the one the reward model scores highest is kept.
*   The "Refined & Corrected Version" annotation emphasizes the improvements made in this model.
*   The "Analysis & Judgement" box indicates that the model analyzes and judges each step along three aspects: "Step Intent", "Image Alignment", and "Reasoning Logic".
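
The flow in panel (c) can be sketched as follows: find the first incorrect step, replace it with a corrected version, regenerate the rest, then choose among original and refined candidates. This is a hedged sketch under stated assumptions. It reads "BoN" as Best-of-N, and `judge`, `continue_from`, `score`, and the toy implementations below are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical judge: returns (is_correct, corrected_step) for one step.
Judge = Callable[[List[str], str], Tuple[bool, str]]

def first_incorrect_step(steps: List[str], judge: Judge) -> Optional[int]:
    """Index of the first step the judge rejects, or None if all pass."""
    for i, step in enumerate(steps):
        ok, _ = judge(steps[:i], step)
        if not ok:
            return i
    return None

def refine(steps: List[str], judge: Judge,
           continue_from: Callable[[List[str]], List[str]]) -> List[str]:
    """Replace the first incorrect step and regenerate everything after it."""
    idx = first_incorrect_step(steps, judge)
    if idx is None:
        return steps
    _, fixed = judge(steps[:idx], steps[idx])
    prefix = steps[:idx] + [fixed]
    return prefix + continue_from(prefix)

def refined_best_of_n(candidates: List[List[str]], judge: Judge,
                      continue_from: Callable[[List[str]], List[str]],
                      score: Callable[[List[str]], float]) -> List[str]:
    """Pick the best solution from the originals plus their refined versions."""
    pool: List[List[str]] = []
    for steps in candidates:
        pool.append(steps)
        refined = refine(steps, judge, continue_from)
        if refined != steps:
            pool.append(refined)
    return max(pool, key=score)

# Toy stand-ins (assumptions) for the generative judge, policy, and scorer.
def toy_judge(prefix: List[str], step: str) -> Tuple[bool, str]:
    return ("wrong" not in step, step.replace("wrong", "fixed"))

def toy_continue(prefix: List[str]) -> List[str]:
    return ["answer: 4"]

def toy_score(steps: List[str]) -> float:
    return min(0.2 if "wrong" in s else 0.9 for s in steps)

candidates = [["a", "wrong b", "answer: 5"], ["wrong c", "answer: 7"]]
best = refined_best_of_n(candidates, toy_judge, toy_continue, toy_score)
```

Keeping the original candidates in the pool is a design choice made for this sketch: a refinement that scores worse than its source is never forced on the final selection.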

### Key Observations

*   The ORM is the simplest model, focusing only on the final outcome.
*   The PRM considers the process but lacks explainability and a correction mechanism.
*   The GM-PRM builds upon the PRM by incorporating a correction mechanism and analysis/judgment capabilities.

### Interpretation

The diagram contrasts three ways of rewarding a solution, with a focus on multimodal math questions. The ORM scores only the final answer. The PRM scores intermediate steps but is annotated as having limited explainability and no correction mechanism. GM-PRM is presented as an improvement over both: it analyzes each step (intent, image alignment, reasoning logic), flags the first incorrect step, and produces a refined and corrected version. The "Refined BoN" annotation most likely refers to a refined Best-of-N selection, which suggests that the corrected solutions are included when choosing the final answer. Overall, the diagram argues for rewarding the reasoning process and adding a correction mechanism when reward models are applied to complex tasks.