# Technical Document Extraction: ORM Versus PRM Performance Analysis

## 1. Document Metadata
*   **Title:** ORM Versus PRM
*   **Language:** English
*   **Image Type:** Line graph with shaded confidence intervals/error bands
*   **Primary Subject:** Comparison of MATH Test Accuracy across different sampling methods and reward models

## 2. Component Isolation

### A. Header
*   **Main Title:** ORM Versus PRM

### B. Main Chart Area
*   **Y-Axis Label:** MATH Test Accuracy (%)
*   **Y-Axis Scale:** Linear, ranging from 10 to 40 with increments of 5
*   **X-Axis Label:** Number of Samples
*   **X-Axis Scale:** Logarithmic (Base 2), with labeled ticks at $2^1, 2^3, 2^5, 2^7, 2^9, 2^{11}$; the plotted curves begin at $2^0$ (1 sample)
*   **Legend Location:** Top-left quadrant
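
As a quick aid for reading the base-2 axis, the sketch below converts the labeled tick exponents into sample counts and shows where an arbitrary sample count falls on the axis. The names `tick_exponents` and `axis_position` are illustrative only and do not come from the chart.

```python
import math

# Exponents of the labeled x-axis ticks, as listed above.
tick_exponents = [1, 3, 5, 7, 9, 11]

# Sample counts corresponding to each labeled tick.
tick_values = [2 ** k for k in tick_exponents]

def axis_position(n_samples):
    """Position of a sample count on a base-2 log axis (in units of doublings)."""
    return math.log2(n_samples)

print(tick_values)
print(axis_position(1), axis_position(2048))
```

On this scale, equal horizontal distances correspond to equal numbers of doublings, so the span from 1 to 8 samples is as wide as the span from 256 to 2048.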

### C. Legend Data
| Color | Label |
| :--- | :--- |
| **Blue** | PRM best-of-N weighted |
| **Orange** | Base-LM Majority |
| **Green** | ORM best-of-N weighted |

---

## 3. Trend Verification and Data Extraction

### Series 1: PRM best-of-N weighted (Blue Line)
*   **Visual Trend:** This series shows the highest overall performance. It follows a steep upward trajectory from $2^0$ to $2^5$, after which the slope decreases but remains consistently positive, ending at the highest point on the graph.
*   **Estimated Data Points:**
    *   $2^0$ (1 sample): ~10.5%
    *   $2^3$ (8 samples): ~25.5%
    *   $2^5$ (32 samples): ~31.8%
    *   $2^7$ (128 samples): ~35.3%
    *   $2^{11}$ (2048 samples): ~39.8%

### Series 2: ORM best-of-N weighted (Green Line)
*   **Visual Trend:** Tracks the PRM (Blue) line closely at low sample counts ($2^0$ to $2^3$). After $2^3$, its growth slows markedly relative to PRM, and it plateaus between $2^9$ and $2^{11}$.
*   **Estimated Data Points:**
    *   $2^0$ (1 sample): ~10.5%
    *   $2^3$ (8 samples): ~25.0%
    *   $2^5$ (32 samples): ~30.5%
    *   $2^7$ (128 samples): ~33.0%
    *   $2^{11}$ (2048 samples): ~34.8%

### Series 3: Base-LM Majority (Orange Line)
*   **Visual Trend:** This is the baseline series, and it is the lowest of the three at every listed sample count above 1. Its early growth is much slower than that of the reward-model-guided series (about 8 points from $2^0$ to $2^3$, versus roughly 15), and it flattens after $2^7$.
*   **Estimated Data Points:**
    *   $2^0$ (1 sample): ~10.5%
    *   $2^3$ (8 samples): ~18.5%
    *   $2^5$ (32 samples): ~25.5%
    *   $2^7$ (128 samples): ~28.1%
    *   $2^{11}$ (2048 samples): ~29.6%
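
To make the scaling behavior easier to compare, the following sketch encodes the estimated data points listed in this section and computes the average accuracy gain per doubling of the sample count between consecutive checkpoints. The values are visual estimates read off the chart, not exact measurements, and the names `estimates` and `gain_per_doubling` are hypothetical helpers introduced here.

```python
import math

# Estimated accuracies (%) read off the chart; keys are sample counts N.
estimates = {
    "PRM best-of-N weighted": {1: 10.5, 8: 25.5, 32: 31.8, 128: 35.3, 2048: 39.8},
    "ORM best-of-N weighted": {1: 10.5, 8: 25.0, 32: 30.5, 128: 33.0, 2048: 34.8},
    "Base-LM Majority": {1: 10.5, 8: 18.5, 32: 25.5, 128: 28.1, 2048: 29.6},
}

def gain_per_doubling(points):
    """Average gain (percentage points) per doubling of N between consecutive checkpoints."""
    counts = sorted(points)
    gains = {}
    for low, high in zip(counts, counts[1:]):
        doublings = math.log2(high) - math.log2(low)
        gains[(low, high)] = (points[high] - points[low]) / doublings
    return gains

for label, points in estimates.items():
    formatted = {span: round(g, 2) for span, g in gain_per_doubling(points).items()}
    print(label, formatted)
```

Because the checkpoints are unevenly spaced (three doublings, then two, two, and four), normalizing by the number of doublings keeps the intervals comparable.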

---

## 4. Key Findings and Observations
1.  **Common Starting Point:** All three methods start at approximately the same accuracy (~10.5%) at 1 sample ($2^0$). This is expected: with a single sample there is nothing to rerank or vote over, so each method returns that one sample's answer.
2.  **PRM Superiority:** The Process Reward Model (PRM) weighted best-of-N matches or outperforms the Outcome Reward Model (ORM) and the Base-LM majority vote at every estimated point, and its lead widens as the number of samples increases: at $2^{11}$ samples it is ahead by roughly 5 points over ORM (~39.8% vs. ~34.8%) and roughly 10 points over majority voting (~29.6%).
3.  **Scaling Efficiency:** Both ORM and PRM provide a large boost over simple majority voting. However, PRM scales better with more samples: from $2^7$ to $2^{11}$ it gains about 4.5 points, whereas ORM gains about 1.8 and is close to a plateau.
4.  **Statistical Variance:** Shaded bands around each line indicate uncertainty (confidence intervals or error bands). The band width appears relatively stable across sample sizes, with no significant widening or narrowing as $N$ increases.
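
The chart does not define the selection rules behind its legend labels. As a rough illustration, the sketch below assumes that "Majority" means picking the most frequent final answer among the samples, and that "best-of-N weighted" means summing a reward model's scores over samples that share a final answer and picking the answer with the largest total. The function names, answers, and scores are all hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most frequent final answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]

def weighted_best_of_n(answers, scores):
    """Sum reward scores per distinct final answer; return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical final answers from five samples, with hypothetical reward scores.
answers = ["12", "15", "12", "15", "15"]
scores = [0.9, 0.2, 0.8, 0.1, 0.3]

print(majority_vote(answers))               # "15" has the most votes (3 of 5)
print(weighted_best_of_n(answers, scores))  # "12" has the largest score total (1.7 vs 0.6)
```

Under these assumptions the two rules can disagree when a reward model scores a minority answer highly, which is one way a reward-guided method could pull ahead of majority voting. With a single sample, both rules return the same answer, consistent with the shared starting point of the three curves.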