Image 9c62cdbda7c2...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

# Technical Data Extraction: Test-time Search Against a PRM Verifier

This document extracts the data and trends from an image containing two charts on test-time search: a line graph of MATH accuracy against generation budget and a bar chart comparing test-time compute with pretraining compute.

## 1. General Metadata
*   **Main Title:** Test-time Search Against a PRM Verifier
*   **Language:** English
*   **Layout:** Two side-by-side plots (Left: Line Graph; Right: Bar Chart).

---

## 2. Left Plot: Compute Optimal Search

### Component Isolation
*   **Header:** Compute Optimal Search
*   **Y-Axis:** MATH Accuracy (%)
    *   **Range:** 10 to 45
    *   **Markers:** 10, 15, 20, 25, 30, 35, 40, 45
*   **X-Axis:** Generation Budget
    *   **Scale:** Logarithmic (Base 2)
    *   **Markers:** $2^1, 2^3, 2^5, 2^7, 2^9$
*   **Legend [Top-Left]:**
    *   **Red (Circle):** Majority
    *   **Purple (Circle):** ORM Best-of-N Weighted
    *   **Green (Circle):** PRM Best-of-N Weighted
    *   **Blue (Circle):** PRM Compute Optimal

### Trend Verification & Data Extraction
All four series rise with Generation Budget and show diminishing returns: each doubling of the budget adds less accuracy at large budgets than at small ones.

1.  **PRM Compute Optimal (Blue):**
    *   **Trend:** Steepest initial ascent. It outperforms all other methods significantly in the mid-range budget ($2^2$ to $2^6$) before converging toward the PRM Best-of-N line at the highest budget.
    *   **Key Points:** Starts at ~10.5% ($2^0$). Reaches ~33% at $2^4$. Peaks at ~39% at $2^8$.
2.  **PRM Best-of-N Weighted (Green):**
    *   **Trend:** Steady upward slope, consistently higher than ORM and Majority.
    *   **Key Points:** Starts at ~10.5%. Reaches ~29% at $2^4$. Ends at ~38% at $2^9$.
3.  **ORM Best-of-N Weighted (Purple):**
    *   **Trend:** Follows a similar trajectory to PRM Best-of-N but falls behind as the budget grows, from about 1 percentage point lower at $2^4$ to about 4 lower at $2^9$.
    *   **Key Points:** Starts at ~10.5%. Reaches ~28% at $2^4$. Ends at ~34% at $2^9$.
4.  **Majority (Red):**
    *   **Trend:** The lowest performing baseline. Shows the slowest rate of improvement.
    *   **Key Points:** Starts at ~10.5%. Reaches ~23% at $2^4$. Ends at ~29% at $2^9$.
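
To make the diminishing-returns pattern concrete, the following Python sketch converts the approximate key points listed above into an average accuracy gain per doubling of the generation budget. The inputs are the visual estimates from this extraction, not exact chart values. All series are assumed to start at $2^0$, as stated for the Compute Optimal series, and the helper names are illustrative.

```python
# Approximate readings from the list above: (log2 of budget, MATH accuracy %).
# Assumption: every series starts at 2^0, as stated for PRM Compute Optimal.
KEY_POINTS = {
    "PRM Compute Optimal":    [(0, 10.5), (4, 33.0), (8, 39.0)],
    "PRM Best-of-N Weighted": [(0, 10.5), (4, 29.0), (9, 38.0)],
    "ORM Best-of-N Weighted": [(0, 10.5), (4, 28.0), (9, 34.0)],
    "Majority":               [(0, 10.5), (4, 23.0), (9, 29.0)],
}


def gain_per_doubling(p0, p1):
    """Average accuracy gain (percentage points) per doubling of the budget."""
    (log_b0, acc0), (log_b1, acc1) = p0, p1
    return (acc1 - acc0) / (log_b1 - log_b0)


def summarize(points):
    """Return (early, late) gains per doubling for one method."""
    start, mid, end = points
    return gain_per_doubling(start, mid), gain_per_doubling(mid, end)


SUMMARY = {name: summarize(pts) for name, pts in KEY_POINTS.items()}

for name, (early, late) in SUMMARY.items():
    print(f"{name:24s} early: {early:.2f} pts/doubling, late: {late:.2f} pts/doubling")
```

Under these readings, every method gains more than three points per doubling up to $2^4$ and fewer than two points per doubling afterwards.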

---

## 3. Right Plot: Comparing Test-time and Pretraining Compute

### Component Isolation
*   **Header:** Comparing Test-time and Pretraining Compute in a FLOPs Matched Evaluation
*   **Y-Axis:** Relative Improvement in Accuracy From Test-time Compute (%)
    *   **Range:** -50 to 20
    *   **Markers:** -50, -40, -30, -20, -10, 0, 10, 20
*   **X-Axis:** Ratio of Inference Tokens to Pretraining Tokens
    *   **Categories:** $<<1$, $\approx 1$, $>>1$
*   **Legend [Bottom-Left]:**
    *   **Green:** Easy Questions
    *   **Blue:** Medium Questions
    *   **Orange:** Hard Questions

### Data Table Reconstruction
The chart measures the relative gain/loss of using test-time compute versus pretraining compute for different difficulty levels.

| Ratio of Inference to Pretraining Tokens | Easy Questions (Green) | Medium Questions (Blue) | Hard Questions (Orange) |
| :--- | :--- | :--- | :--- |
| **$<<1$** | +19.1% | -5.6% | 0.0% |
| **$\approx 1$** | +2.2% | -35.6% | -35.3% |
| **$>>1$** | +2.0% | -30.6% | -52.9% |

### Trend Analysis
*   **Easy Questions:** Test-time compute is consistently beneficial, though the magnitude of improvement drops significantly as the ratio of inference tokens increases (from +19.1% to ~+2%).
*   **Medium Questions:** Test-time compute is less efficient than pretraining compute at every ratio. The loss is small at $<<1$ (-5.6%), bottoms out at -35.6% at $\approx 1$, and recovers slightly to -30.6% at $>>1$.
*   **Hard Questions:** Shows the most dramatic negative trend. While neutral at low ratios, it collapses to -52.9% at high ratios, indicating that for hard questions, compute is much more effectively spent on pretraining than on test-time search.
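
The trend analysis above amounts to a sign check on each table cell: a positive value favors test-time compute, a negative value favors pretraining compute. A minimal Python sketch of that reading, using the values from the table above (the function name `preferred_compute` is hypothetical):

```python
# Relative improvement (%) from test-time compute, as transcribed in the table above.
# Keys are the x-axis categories; tuple order is (Easy, Medium, Hard).
DIFFICULTIES = ("Easy", "Medium", "Hard")
RELATIVE_IMPROVEMENT = {
    "<<1": (19.1, -5.6, 0.0),
    "~1":  (2.2, -35.6, -35.3),
    ">>1": (2.0, -30.6, -52.9),
}


def preferred_compute(value):
    """Classify one cell by the sign of its relative improvement."""
    if value > 0:
        return "test-time"
    if value < 0:
        return "pretraining"
    return "tie"


PREFERENCE = {
    ratio: {name: preferred_compute(v) for name, v in zip(DIFFICULTIES, row)}
    for ratio, row in RELATIVE_IMPROVEMENT.items()
}

for ratio, row in PREFERENCE.items():
    print(ratio, row)
```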

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

# Technical Document Extraction: Test-time Search Analysis

## Chart 1: Compute Optimal Search
### Spatial Layout
- **X-axis**: Generation Budget (logarithmic scale: 2¹ to 2⁹)
- **Y-axis**: MATH Accuracy (%) (10% to 45%)
- **Legend Position**: Top-right corner

### Legend
| Color | Label                      |
|-------|----------------------------|
| Red   | Majority                   |
| Purple| ORM Best-of-N Weighted     |
| Green | PRM Best-of-N Weighted     |
| Blue  | PRM Compute Optimal        |

### Key Trends
1. **All lines show upward trends** with increasing generation budget
2. **PRM Compute Optimal (blue)** has the highest accuracy at every budget above 2¹, where all four methods coincide at 10.2%
3. **Majority (red)** has the lowest accuracy at every budget above 2¹
4. **Steepest growth** occurs between 2¹ and 2³ for every method, according to the data points below

### Data Points (Accuracy %)
| Generation Budget | Majority | ORM Best-of-N | PRM Best-of-N | PRM Compute Optimal |
|-------------------|----------|---------------|---------------|---------------------|
| 2¹                | 10.2%    | 10.2%         | 10.2%         | 10.2%               |
| 2³                | 18.5%    | 24.8%         | 25.1%         | 29.3%               |
| 2⁵                | 25.7%    | 31.2%         | 32.1%         | 33.8%               |
| 2⁷                | 28.1%    | 33.9%         | 35.4%         | 37.2%               |
| 2⁹                | 28.3%    | 34.5%         | 36.8%         | 39.1%               |
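
Trend statements about this chart can be checked against the tabulated values. The Python sketch below computes the accuracy gain over each interval between tabulated budgets and finds where each method grows fastest. It assumes the table values are accurate readings of the chart; the helper names are illustrative.

```python
# Accuracy (%) at budgets 2^1, 2^3, 2^5, 2^7, 2^9, copied from the table above.
LOG2_BUDGETS = [1, 3, 5, 7, 9]
ACCURACY = {
    "Majority":            [10.2, 18.5, 25.7, 28.1, 28.3],
    "ORM Best-of-N":       [10.2, 24.8, 31.2, 33.9, 34.5],
    "PRM Best-of-N":       [10.2, 25.1, 32.1, 35.4, 36.8],
    "PRM Compute Optimal": [10.2, 29.3, 33.8, 37.2, 39.1],
}


def interval_gains(values):
    """Gain in percentage points between consecutive tabulated budgets."""
    return [round(b - a, 1) for a, b in zip(values, values[1:])]


def steepest_interval(values):
    """Return the (log2 start, log2 end) interval with the largest gain."""
    gains = interval_gains(values)
    i = gains.index(max(gains))
    return LOG2_BUDGETS[i], LOG2_BUDGETS[i + 1]


for method, values in ACCURACY.items():
    start, end = steepest_interval(values)
    print(f"{method:20s} gains={interval_gains(values)} steepest=2^{start}..2^{end}")
```

With the tabulated values, the largest gain for every method falls between 2¹ and 2³, and each later interval adds less than the one before it.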

## Chart 2: Compute Comparison in FLOPs Matched Evaluation
### Spatial Layout
- **X-axis**: Ratio of Inference Tokens to Pretraining Tokens (<<1, ~1, >>1)
- **Y-axis**: Relative Improvement in Accuracy (%) (-50% to 20%)
- **Legend Position**: Bottom-left corner

### Legend
| Color | Label          |
|-------|----------------|
| Green | Easy Questions |
| Blue  | Medium Questions|
| Orange| Hard Questions |

### Key Trends
1. **Easy Questions (green)** show positive improvement across all ratios
2. **Medium (blue)** and **Hard (orange)** questions show negative relative improvement wherever a value is reported
3. **Most significant degradation** in Hard Questions at >>1 ratio

### Data Points (Relative Improvement %)
| Ratio Category | Easy Questions | Medium Questions | Hard Questions |
|----------------|----------------|------------------|----------------|
| <<1            | +19.1%         | -6.6%            | N/A            |
| ~1             | +2.2%          | -35.6%           | -36.3%         |
| >>1            | +2.0%          | -30.6%           | -52.9%         |
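
The table above does not agree in every cell with the table from the gemini-3-flash-free extraction earlier in this document. The Python sketch below lists the cells where the two transcriptions differ, so they can be re-checked against the original image. Neither set of values is confirmed here, and `None` stands for the cell reported as N/A.

```python
# Relative improvement (%) as transcribed by each extraction.
# Tuple order is (Easy, Medium, Hard); None marks a cell reported as N/A.
DIFFICULTIES = ("Easy", "Medium", "Hard")
GEMINI = {
    "<<1": (19.1, -5.6, 0.0),
    "~1":  (2.2, -35.6, -35.3),
    ">>1": (2.0, -30.6, -52.9),
}
NEMOTRON = {
    "<<1": (19.1, -6.6, None),
    "~1":  (2.2, -35.6, -36.3),
    ">>1": (2.0, -30.6, -52.9),
}


def find_mismatches(first, second):
    """Return (ratio, difficulty, first value, second value) for differing cells."""
    mismatches = []
    for ratio in first:
        for name, a, b in zip(DIFFICULTIES, first[ratio], second[ratio]):
            if a != b:
                mismatches.append((ratio, name, a, b))
    return mismatches


MISMATCHES = find_mismatches(GEMINI, NEMOTRON)
for ratio, name, a, b in MISMATCHES:
    print(f"{ratio:>3s} {name:6s} gemini={a} nemotron={b}")
```

Three cells differ: Medium at <<1, Hard at <<1, and Hard at ~1. The Easy column and the entire >>1 row agree.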

## Cross-Chart Analysis
1. **Compute Efficiency**: PRM Compute Optimal (blue line) reaches 39.1% accuracy at a 2⁹ budget, outperforming ORM Best-of-N (34.5%) by 4.6 percentage points
2. **Scaling Relationships**:
   - At ~1 token ratio, Hard Questions show -36.3% relative improvement vs Easy Questions' +2.2%
   - At >>1 ratio, Hard Questions show -52.9% vs Easy Questions' +2.0%
3. **Threshold Behavior**: 
   - <<1 ratio shows mixed results (Easy: +19.1%, Medium: -6.6%)
   - ~1 ratio becomes critical point for performance divergence
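
The 4.6 figure in item 1 is a difference in percentage points, which is not the same as a relative change. The two are easy to confuse because the second chart reports relative improvement. A short Python sketch of the arithmetic, using the two accuracies quoted above (the function names are illustrative):

```python
def point_gap(acc_a, acc_b):
    """Absolute difference between two accuracies, in percentage points."""
    return acc_a - acc_b


def relative_gap(acc_a, acc_b):
    """Difference expressed as a percentage of the baseline accuracy acc_b."""
    return (acc_a - acc_b) / acc_b * 100.0


PRM_COMPUTE_OPTIMAL = 39.1  # accuracy (%) at 2^9, from the first chart's table
ORM_BEST_OF_N = 34.5        # accuracy (%) at 2^9, from the first chart's table

POINTS = round(point_gap(PRM_COMPUTE_OPTIMAL, ORM_BEST_OF_N), 1)
RELATIVE = round(relative_gap(PRM_COMPUTE_OPTIMAL, ORM_BEST_OF_N), 1)

print(f"gap: {POINTS} percentage points, {RELATIVE}% relative to ORM Best-of-N")
```

The same 4.6-point gap corresponds to a relative improvement of roughly 13.3% over the ORM baseline.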

## Technical Notes
- All values extracted directly from chart annotations
- Color coding strictly validated against legend
- Logarithmic x-axis confirmed by exponential spacing
- Negative values are shown as bars extending below the zero line
