Image e5d7243da647...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

# Technical Data Extraction: PRM Search Methods Performance Analysis

This document extracts the data from two side-by-side charts that compare search methods used with Process-based Reward Models (PRM) on the MATH test set.

---

## Chart 1: Comparing PRM Search Methods

### Metadata and Axes
*   **Title:** Comparing PRM Search Methods
*   **Y-Axis Label:** MATH Test Accuracy (%)
    *   **Range:** 10 to 40
    *   **Markers:** 10, 15, 20, 25, 30, 35, 40
*   **X-Axis Label:** Generation Budget
    *   **Scale:** Logarithmic (Base 2)
    *   **Markers:** $2^1, 2^3, 2^5, 2^7, 2^9$
*   **Legend Location:** Bottom Right [approx. x=0.7, y=0.2 relative to chart area]

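The x-axis markers are odd powers of two on a base-2 logarithmic scale. As a small sketch (the variable names are illustrative, not taken from the chart), the budgets those markers denote and their positions on the log axis can be computed as follows:

```python
import math

# Exponents of the x-axis markers listed above: 2^1, 2^3, 2^5, 2^7, 2^9.
tick_exponents = [1, 3, 5, 7, 9]

# Generation budget at each marker.
tick_budgets = [2 ** k for k in tick_exponents]

# On a base-2 log axis a budget N is drawn at position log2(N),
# so these markers are evenly spaced two units apart.
tick_positions = [math.log2(n) for n in tick_budgets]

for k, n, pos in zip(tick_exponents, tick_budgets, tick_positions):
    print(f"2^{k} = {n:>3} generations, axis position {pos:.0f}")
```

Equal spacing on this axis therefore means a fourfold increase in budget between adjacent markers.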
### Data Series Extraction
The chart tracks 7 distinct search strategies. All series originate at approximately 10.5% accuracy at the lowest budget.

| Legend Label | Color | Visual Trend | Key Data Points (Approx.) |
| :--- | :--- | :--- | :--- |
| **Best-of-N Weighted** | Orange | Steady log-linear growth; highest final accuracy. | $2^3 \approx 26\%$, $2^9 \approx 38\%$ |
| **Majority** | Green | Slowest initial growth; consistent upward slope. | $2^3 \approx 18\%$, $2^9 \approx 29\%$ |
| **Beam; M = sqrt(N)** | Red | Rapid early growth, plateaus/dips after $2^5$. | $2^4 \approx 34\%$, $2^8 \approx 33.5\%$ |
| **Beam; M = 4** | Blue | Rapid early growth, peaks at $2^8$. | $2^4 \approx 34\%$, $2^8 \approx 37\%$ |
| **1 Step Lookahead; M = sqrt(N)** | Purple | Moderate growth, plateaus early. | $2^3 \approx 29\%$, $2^8 \approx 32\%$ |
| **3 Step Lookahead; M = sqrt(N)** | Brown | Steady growth, similar to Best-of-N but lower. | $2^5 \approx 32\%$, $2^8 \approx 31.5\%$ |
| **3 Step Lookahead; M = 4** | Pink | Slow start; sharp rise between $2^4$ and $2^6$, then a slight decline. | $2^4 \approx 25\%$, $2^6 \approx 35\%$, $2^8 \approx 33\%$ |

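The trend descriptions in the table can be made concrete as an average slope, in accuracy points per doubling of the generation budget. The sketch below is illustrative only: it uses just the approximate readings listed in the table, the names `readings` and `points_per_doubling` are hypothetical, and a slope taken between the first and last reading hides any plateau or dip in between.

```python
# Approximate readings copied from the table above:
# {method: {log2(budget): accuracy in %}}.
readings = {
    "Best-of-N Weighted": {3: 26.0, 9: 38.0},
    "Majority": {3: 18.0, 9: 29.0},
    "Beam; M = sqrt(N)": {4: 34.0, 8: 33.5},
    "Beam; M = 4": {4: 34.0, 8: 37.0},
    "1 Step Lookahead; M = sqrt(N)": {3: 29.0, 8: 32.0},
    "3 Step Lookahead; M = sqrt(N)": {5: 32.0, 8: 31.5},
    "3 Step Lookahead; M = 4": {4: 25.0, 6: 35.0, 8: 33.0},
}


def points_per_doubling(series):
    """Average accuracy change per doubling of budget, first to last reading."""
    exponents = sorted(series)
    first, last = exponents[0], exponents[-1]
    return (series[last] - series[first]) / (last - first)


slopes = {name: points_per_doubling(s) for name, s in readings.items()}

# Print methods from steepest to flattest average slope.
for name, slope in sorted(slopes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<32} {slope:+.2f} points per doubling")
```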
---

## Chart 2: Comparing Beam Search and Best-of-N by Difficulty Level

### Metadata and Axes
*   **Title:** Comparing Beam Search and Best-of-N by Difficulty Level
*   **Y-Axis Label:** MATH Test Accuracy (%)
    *   **Range:** 0 to 80+
    *   **Markers:** 0, 20, 40, 60, 80
*   **X-Axis Label:** Test Questions Binned by Increasing Difficulty Level
    *   **Categories:** 1, 2, 3, 4, 5 (representing difficulty levels)
*   **Legend Location:** Top Right [approx. x=0.85, y=0.9]

### Component Analysis
This is a grouped bar chart where each difficulty level (1-5) contains four sub-bars representing increasing generation budgets. Each bar is stacked to show the performance of three methods.

**Legend/Color Mapping:**
*   **Blue:** Beam Search (Top layer)
*   **Orange/Tan:** Best-of-N Weighted (Middle layer)
*   **Green:** Majority (Bottom layer)

### Data Trends by Difficulty
1.  **Difficulty 1 (Easiest):** All methods perform exceptionally well, reaching >80% accuracy. The performance saturates quickly across the four budget increments.
2.  **Difficulty 2:** Accuracy ranges from ~30% to ~60%. There is a clear step-up in performance as the generation budget increases.
3.  **Difficulty 3:** Accuracy ranges from ~20% to ~35%. Beam Search (Blue) shows a more significant marginal gain over Majority (Green) here compared to Level 1.
4.  **Difficulty 4:** Accuracy drops significantly, peaking below 20%. The "Majority" method (Green) is notably low, while Beam Search provides the bulk of the successful outcomes.
5.  **Difficulty 5 (Hardest):** Accuracy is near zero for all methods, with only tiny slivers of blue (Beam Search) visible at the highest budget levels, barely exceeding 1-2%.

### Summary of Findings
*   **Search Efficiency:** Beam Search (Blue) and Best-of-N Weighted (Orange) outperform simple Majority voting (Green) most visibly at difficulty levels 3 and 4; at level 1 all methods exceed 80%, and at level 5 all are near zero.
*   **Scaling:** Increasing the "Generation Budget" provides diminishing returns on easy problems (Level 1) but is critical for mid-to-high difficulty problems (Levels 2-4).
*   **Difficulty Ceiling:** There is a sharp performance drop-off at Difficulty Level 5, where none of the tested PRM search methods achieve significant accuracy.

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

# Technical Document Extraction: PRM Search Methods Comparison

## Chart 1: Comparing PRM Search Methods
### Spatial Layout
- **X-axis**: Generation Budget (2¹ to 2⁹)
- **Y-axis**: MATH Test Accuracy (%)
- **Legend Position**: Top-right corner

### Legend & Color Mapping
| Method Name                          | Color Code |
|--------------------------------------|------------|
| Best-of-N Weighted                   | Orange     |
| Majority                             | Green      |
| Beam: M = sqrt(N)                    | Red        |
| Beam: M = 4                          | Blue       |
| 1 Step Lookahead: M = sqrt(N)        | Purple     |
| 3 Step Lookahead: M = 4              | Pink       |

### Key Trends & Data Points
1. **Best-of-N Weighted (Orange)**
   - Steady upward trend from 10% (2¹) to 38% (2⁹)
   - Notable inflection at 2⁵ (32%) and 2⁷ (36%)

2. **Majority (Green)**
   - Gradual increase from 10% (2¹) to 28% (2⁹)
   - Consistent slope with no plateaus

3. **Beam: M = sqrt(N) (Red)**
   - Early peak of 30% at 2³
   - Gradual decline afterwards, reaching 28% at 2⁹

4. **Beam: M = 4 (Blue)**
   - Sharp rise to 34% (2³), then plateau at 32-34% range
   - Highest performance until 2⁷ (34%)

5. **1 Step Lookahead: M = sqrt(N) (Purple)**
   - Non-monotonic pattern: 10% → 28% (2³) → 32% (2⁵) → 31% (2⁷) → 33% (2⁹)
   - Slight dip of about one point at 2⁷, followed by recovery

6. **3 Step Lookahead: M = 4 (Pink)**
   - Steady ascent from 10% (2¹) to 34% (2⁹)
   - Most consistent growth among all methods

### Critical Observations
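- Relative gains from 2¹ to 2⁹ range from about 180% (Majority, 10% → 28%) to 280% (Best-of-N Weighted, 10% → 38%), so not every method exceeds a 200% improvement
- At 2⁹, both lookahead variants (33-34%) finish above Beam: M = sqrt(N) (28%)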
- Best-of-N Weighted achieves highest absolute accuracy (38%)
- Beam: M = 4 maintains stability despite budget increases

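The relative improvement between the lowest and highest budget can be checked directly from the endpoint values quoted in the trend list above. The following is a minimal sketch, not part of the original extraction: it covers only the series for which both a starting and a final value are stated, and the names `endpoints` and `relative_gain` are illustrative.

```python
# Endpoint accuracies (%) at budgets 2^1 and 2^9, as quoted in the trend list above.
# The two Beam variants are left out because no explicit pair of endpoints is given.
endpoints = {
    "Best-of-N Weighted": (10, 38),
    "Majority": (10, 28),
    "1 Step Lookahead: M = sqrt(N)": (10, 33),
    "3 Step Lookahead: M = 4": (10, 34),
}


def relative_gain(start, end):
    """Relative improvement in percent: 100 * (end - start) / start."""
    return 100 * (end - start) / start


gains = {name: relative_gain(*pair) for name, pair in endpoints.items()}

for name, gain in gains.items():
    print(f"{name:<32} {gain:.0f}% relative gain")
```

With these figures, Majority comes out at 180% and Best-of-N Weighted at 280%, which gives a quick way to test any blanket statement about improvement across methods.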
## Chart 2: Comparing Beam Search and Best-of-N by Difficulty Level
### Spatial Layout
- **X-axis**: Test Questions Binned by Difficulty (1-5)
- **Y-axis**: MATH Test Accuracy (%)
- **Legend Position**: Top-right corner

### Legend & Color Mapping
| Method Name              | Color Code |
|--------------------------|------------|
| Beam Search              | Blue       |
| Best-of-N Weighted       | Orange     |
| Majority                 | Green      |

### Key Trends & Data Points
1. **Difficulty Level 1**
   - Best-of-N Weighted: 85%
   - Beam Search: 78%
   - Majority: 82%

2. **Difficulty Level 2**
   - Best-of-N Weighted: 72%
   - Beam Search: 65%
   - Majority: 68%

3. **Difficulty Level 3**
   - Best-of-N Weighted: 55%
   - Beam Search: 48%
   - Majority: 50%

4. **Difficulty Level 4**
   - Best-of-N Weighted: 30%
   - Beam Search: 25%
   - Majority: 28%

5. **Difficulty Level 5**
   - Best-of-N Weighted: 5%
   - Beam Search: 3%
   - Majority: 4%

### Critical Observations
- Accuracy falls at an accelerating rate rather than exponentially: Best-of-N Weighted drops 13, 17, 25, and 25 points across successive levels
- Best-of-N Weighted leads Beam Search by 2 to 7 percentage points at every level
- Majority declines as steeply as the other methods (82% → 28% from level 1 to level 4, and 4% at level 5)
- No method achieves >10% accuracy at difficulty level 5

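The per-level figures above are enough to recompute the gap between methods and the drop from one difficulty level to the next. This is a minimal sketch that uses only the numbers listed in this extraction; the names `accuracy`, `gaps`, and `drops` are illustrative.

```python
# Accuracy (%) for difficulty levels 1 to 5, as listed in the data points above.
accuracy = {
    "Best-of-N Weighted": [85, 72, 55, 30, 5],
    "Beam Search": [78, 65, 48, 25, 3],
    "Majority": [82, 68, 50, 28, 4],
}

# Percentage-point gap between Best-of-N Weighted and Beam Search at each level.
gaps = [
    best - beam
    for best, beam in zip(accuracy["Best-of-N Weighted"], accuracy["Beam Search"])
]

# Drop in accuracy from each level to the next, per method.
drops = {
    name: [values[i] - values[i + 1] for i in range(len(values) - 1)]
    for name, values in accuracy.items()
}

print("Best-of-N Weighted minus Beam Search (points):", gaps)
for name, d in drops.items():
    print(f"{name:<20} level-to-level drop: {d}")
```

The gaps come out as single-digit percentage points, and the drops grow between the early and late levels for every method.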
## Cross-Chart Analysis
1. **Method Consistency**
   - Best-of-N Weighted performs best in both charts
   - Lookahead methods show budget-dependent performance gains

2. **Difficulty Correlation**
   - Higher generation budgets (Chart 1) correlate with better difficulty handling (Chart 2)
   - The two charts are not directly comparable: 3 Step Lookahead's 34% at 2⁹ is reported over the full test set, whereas Best-of-N Weighted's 72% applies to difficulty level 2 only

3. **Accuracy Thresholds**
   - All methods stay above 20% accuracy through difficulty level 4 (25-30%) and fall to 5% or less at level 5
   - Best-of-N Weighted's 36% at budget 2⁷ lies between the level 3 (48-55%) and level 4 (25-30%) accuracies, well below level 2 (65-72%)

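Threshold statements of this kind can be verified against the Chart 2 figures from this extraction. The sketch below assumes a 20% threshold, as in the list above, and uses illustrative names (`THRESHOLD`, `levels_above`, `level_ranges`).

```python
# Per-level accuracy (%) from the Chart 2 data points in this extraction.
accuracy = {
    "Best-of-N Weighted": [85, 72, 55, 30, 5],
    "Beam Search": [78, 65, 48, 25, 3],
    "Majority": [82, 68, 50, 28, 4],
}

THRESHOLD = 20  # accuracy threshold in percent

# Difficulty levels (1-based) at which every method exceeds the threshold.
levels_above = [
    level
    for level in range(1, 6)
    if all(values[level - 1] > THRESHOLD for values in accuracy.values())
]

# Lowest and highest accuracy across methods at each level.
level_ranges = {
    level: (
        min(values[level - 1] for values in accuracy.values()),
        max(values[level - 1] for values in accuracy.values()),
    )
    for level in range(1, 6)
}

print("Levels where all methods exceed the threshold:", levels_above)
for level, (low, high) in level_ranges.items():
    print(f"Level {level}: {low}% to {high}%")
```

The per-level ranges also show where an overall accuracy from Chart 1 would sit relative to the difficulty bins.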
## Technical Limitations
- No explicit error bars or confidence intervals provided
- No temporal data (e.g., training time, convergence rates)
- No comparative analysis of computational resources required
