Image 3a5d28db8292...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

# Technical Data Extraction: Comparing Test-time and Pretraining Compute

## 1. Document Overview
This image contains two line graphs comparing the effectiveness of increasing test-time compute versus pretraining compute across different problem difficulty levels on the MATH dataset.

*   **Main Title:** Comparing Test-time and Pretraining Compute
*   **Sub-charts:** 
    1.  **Revisions** (Left)
    2.  **PRM Search** (Right)
*   **Language:** English

---

## 2. Component Isolation

### A. Global Legend & Scales
*   **Y-Axis (Both):** MATH Difficulty Level Accuracy (%)
    *   **Range:** 0 to 100 (Left chart), 0 to 80+ (Right chart).
    *   **Markers:** 0, 20, 40, 60, 80, 100.
*   **X-Axis (Both):** Proportional to Inference FLOPs
    *   **Scale:** Logarithmic (Base 2).
    *   **Markers:** $2^0, 2^1, 2^2, 2^3, 2^4, 2^5, 2^6, 2^7, 2^8$.
*   **Color Gradient Legend (Right Side):** Difficulty Level
    *   **Scale:** 1 (Purple/Top) to 5 (Blue/Bottom).
    *   **Mapping:** 
        *   Level 1: Purple
        *   Level 2: Red/Pink
        *   Level 3: Green
        *   Level 4: Orange/Brown
        *   Level 5: Blue
*   **Bottom Legend (Symbols & Lines):**
    *   **Star ($\star$):** Pretraining Compute
    *   **Circle-Line ($\bullet$):** Test-time Compute
    *   **Blue Dashed Vertical Line:** $R \gg 1$
    *   **Orange Dashed Vertical Line:** $R \approx 1$
    *   **Green Dashed Vertical Line:** $R \ll 1$
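
Because the x-axis described above is logarithmic in base 2 and only "proportional to" inference FLOPs, each tick is a relative multiplier rather than an absolute FLOP count. The short sketch below lists those multipliers; the variable names are illustrative and not taken from the figure.

```python
# Tick exponents as described for the x-axis (2^0 through 2^8).
exponents = list(range(9))
ticks = [2 ** e for e in exponents]

# Each tick is a multiplier relative to the leftmost point (2^0 = 1x).
for e, t in zip(exponents, ticks):
    print(f"2^{e} = {t:>3}x the baseline inference compute")

# On a base-2 log axis, equally spaced ticks differ by a constant ratio of 2.
ratios = [ticks[i + 1] / ticks[i] for i in range(len(ticks) - 1)]
```

The rightmost tick therefore corresponds to 256 times the compute of the leftmost one, spread over eight doublings.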

---

## 3. Data Analysis: Revisions (Left Chart)

### Trend Verification
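All data series show a positive correlation between inference FLOPs and accuracy. The initial slope is steepest for Level 1 (Purple), which plateaus after $2^4$. Over the full range from $2^0$ to $2^8$, the largest absolute gains in the table below are for Level 2 (Red, about 40 points) and Level 3 (Green, about 37 points), while Level 5 (Blue) remains nearly flat near 0% accuracy.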

### Data Points (Approximate Values)
| Difficulty Level (Color) | $2^0$ Accuracy | $2^8$ Accuracy | Trend Description |
| :--- | :--- | :--- | :--- |
| **1 (Purple)** | ~60% | ~90% | Rapid ascent, plateaus after $2^4$. |
| **2 (Red)** | ~22% | ~62% | Steady upward slope. |
| **3 (Green)** | ~8% | ~45% | Steady upward slope. |
| **4 (Orange)** | ~2% | ~18% | Shallow upward slope. |
| **5 (Blue)** | ~1% | ~3% | Nearly flat; very low accuracy. |
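
As a quick arithmetic check on the table above, the following sketch computes the absolute gain and the average gain per doubling of inference compute for each level. The numbers are the approximate readings from the table, not exact data, and the helper name `gains` is hypothetical.

```python
# Approximate accuracies (%) from the Revisions table: level -> (at 2^0, at 2^8).
revisions = {1: (60, 90), 2: (22, 62), 3: (8, 45), 4: (2, 18), 5: (1, 3)}

DOUBLINGS = 8  # 2^0 -> 2^8 spans eight doublings of inference compute


def gains(table):
    """Return level -> (absolute gain in points, average gain per doubling)."""
    return {
        level: (end - start, (end - start) / DOUBLINGS)
        for level, (start, end) in table.items()
    }


revision_gains = gains(revisions)
for level, (total, per_doubling) in revision_gains.items():
    print(f"Level {level}: +{total} points, {per_doubling:.2f} points per doubling")
```

The per-doubling figure is only an average over the whole range; it hides the plateau noted for Level 1 after $2^4$.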

### Pretraining Compute ($\star$) Comparison
Stars are placed at $2^4$, $2^6$, and $2^8$ for various levels.
*   At $2^8$, the test-time compute (circles) generally matches or slightly exceeds the pretraining compute (stars) for levels 2, 3, and 4.

---

## 4. Data Analysis: PRM Search (Right Chart)

### Trend Verification
Compared with Revisions, the "PRM Search" Level 1 curve shows a much sharper "elbow": accuracy gains are very high between $2^0$ and $2^2$, then plateau. The harder levels (2 through 5) show more linear growth.

### Data Points (Approximate Values)
| Difficulty Level (Color) | $2^0$ Accuracy | $2^8$ Accuracy | Trend Description |
| :--- | :--- | :--- | :--- |
| **1 (Purple)** | ~38% | ~85% | Sharp jump to $2^2$, then plateaus. |
| **2 (Red)** | ~11% | ~60% | Consistent upward slope. |
| **3 (Green)** | ~4% | ~35% | Consistent upward slope. |
| **4 (Orange)** | ~1% | ~16% | Shallow upward slope. |
| **5 (Blue)** | ~0% | ~2% | Minimal gains. |
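
To put the two tables side by side, the sketch below subtracts the PRM Search readings from the Revisions readings at both endpoints. All values are the approximate figures quoted in this extraction, and the function name `compare` is an illustrative choice.

```python
# Approximate accuracies (%) from the two tables: level -> (at 2^0, at 2^8).
revisions = {1: (60, 90), 2: (22, 62), 3: (8, 45), 4: (2, 18), 5: (1, 3)}
prm_search = {1: (38, 85), 2: (11, 60), 3: (4, 35), 4: (1, 16), 5: (0, 2)}


def compare(a, b):
    """Return level -> (a - b at 2^0, a - b at 2^8) in percentage points."""
    return {
        level: (a[level][0] - b[level][0], a[level][1] - b[level][1])
        for level in a
    }


diff = compare(revisions, prm_search)
for level, (at_start, at_end) in diff.items():
    print(f"Level {level}: Revisions minus PRM Search = "
          f"{at_start:+d} at 2^0, {at_end:+d} at 2^8")
```

With these readings, Revisions sits at or above PRM Search at both endpoints for every level. The two charts may use different base models or y-axis ranges, so the comparison is indicative only.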

---

## 5. Key Observations & Vertical Markers
The vertical dashed lines represent the ratio ($R$) of compute efficiency.
1.  **$R \gg 1$ (Blue, $2^4$):** Indicates a regime where one form of compute is significantly more efficient.
2.  **$R \approx 1$ (Orange, $2^6$):** Indicates a "break-even" point where test-time and pretraining compute yield similar returns.
3.  **$R \ll 1$ (Green, $2^8$):** Indicates the opposite regime of the blue line.

**Summary of Findings:**
*   Increasing test-time compute (moving right on the x-axis) consistently improves accuracy across all difficulty levels, though the returns diminish as difficulty increases (Level 5 is hardest).
*   **PRM Search** reaches its plateau sooner for Level 1 problems (by about $2^2$, versus $2^4$ for **Revisions**), but Revisions starts from a higher baseline at $2^0$ FLOPs (~60% vs. ~38%) and ends slightly higher at $2^8$ (~90% vs. ~85%).
*   The stars ($\star$) indicate that for many difficulty levels, increasing test-time compute can achieve the same accuracy as models with significantly more pretraining compute.

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

# Technical Document Extraction: Comparing Test-time and Pretraining Compute

## Chart Overview
The image contains two comparative line charts titled **"Comparing Test-time and Pretraining Compute"**. Both charts analyze **MATH Difficulty Level Accuracy (%)** across different computational scales and training paradigms. The x-axis represents **Proportional to Inference FLOPs** (logarithmic scale: 2⁰ to 2⁸), while the y-axis represents **MATH Difficulty Level Accuracy (%)** (0–100%).

---

### **Left Chart: Revisions**
#### Key Components
- **Legend**: Located at the bottom center, with five entries:
  1. **Pretraining Compute** (black star)
  2. **Test-time Compute** (black circle)
  3. **R >> 1** (blue dashed line)
  4. **R = 1** (orange dashed line)
  5. **R << 1** (green dashed line)

#### Data Trends
1. **Pretraining Compute** (black stars):
   - Starts at **60%** (2⁰) and increases steadily to **90%** (2⁷), then slightly declines to **85%** (2⁸).
   - Spatial grounding: Black stars align with the legend's "Pretraining Compute" label.

2. **Test-time Compute** (black circles):
   - Begins at **2%** (2⁰) and rises gradually to **5%** (2⁸).
   - Spatial grounding: Black circles match the legend's "Test-time Compute" label.

3. **R >> 1** (blue dashed line):
   - Starts at **10%** (2⁰) and increases to **45%** (2⁸).
   - Spatial grounding: Blue dashed line corresponds to the legend's "R >> 1" label.

4. **R = 1** (orange dashed line):
   - Begins at **30%** (2⁰) and rises to **60%** (2⁸).
   - Spatial grounding: Orange dashed line matches the legend's "R = 1" label.

5. **R << 1** (green dashed line):
   - Starts at **5%** (2⁰) and increases to **40%** (2⁸).
   - Spatial grounding: Green dashed line aligns with the legend's "R << 1" label.

#### Critical Observations
- **Pretraining Compute** dominates accuracy across all scales, with a peak at 2⁷.
- **Test-time Compute** shows minimal improvement, suggesting limited impact of test-time resources.
- **R >> 1** (high compute) outperforms **R << 1** (low compute) by ~5 percentage points at 2⁸ (45% vs. 40%).

---

### **Right Chart: PRM Search**
#### Key Components
- **Legend**: Identical to the left chart (same colors and labels).
- **X-axis**: Same logarithmic scale (2⁰ to 2⁸).
- **Y-axis**: Same range (0–100%).

#### Data Trends
1. **Pretraining Compute** (black stars):
   - Starts at **35%** (2⁰), jumps to **80%** (2³), then increases to **85%** (2⁸).
   - Spatial grounding: Black stars match the legend's "Pretraining Compute" label.

2. **Test-time Compute** (black circles):
   - Begins at **1%** (2⁰) and rises to **5%** (2⁸).
   - Spatial grounding: Black circles align with the legend's "Test-time Compute" label.

3. **R >> 1** (blue dashed line):
   - Starts at **40%** (2⁰) and increases to **80%** (2⁸).
   - Spatial grounding: Blue dashed line corresponds to the legend's "R >> 1" label.

4. **R = 1** (orange dashed line):
   - Begins at **30%** (2⁰) and rises to **60%** (2⁸).
   - Spatial grounding: Orange dashed line matches the legend's "R = 1" label.

5. **R << 1** (green dashed line):
   - Starts at **5%** (2⁰) and increases to **35%** (2⁸).
   - Spatial grounding: Green dashed line aligns with the legend's "R << 1" label.

#### Critical Observations
- **Pretraining Compute** shows a sharp initial gain, plateauing at higher scales.
- **R >> 1** (high compute) outperforms **R << 1** (low compute) by ~45 percentage points at 2⁸ (80% vs. 35%).
- **Test-time Compute** has negligible impact compared to pretraining.

---

### Cross-Chart Comparison
| Metric                | Revisions (2⁸) | PRM Search (2⁸) |
|-----------------------|----------------|-----------------|
| **Pretraining Compute** | 85%            | 85%             |
| **Test-time Compute**   | 5%             | 5%              |
| **R >> 1**              | 45%            | 80%             |
| **R = 1**               | 60%            | 60%             |
| **R << 1**              | 40%            | 35%             |
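
The gaps between rows of the table above can be recomputed directly. This sketch uses only the values reported in that table, which were not re-measured from the image, and the dictionary layout is an assumption made for illustration.

```python
# Values (%) at 2^8 as reported in the cross-chart table above.
at_2_8 = {
    "Revisions": {"R >> 1": 45, "R = 1": 60, "R << 1": 40},
    "PRM Search": {"R >> 1": 80, "R = 1": 60, "R << 1": 35},
}

# Gap between the "R >> 1" and "R << 1" rows, in percentage points.
gaps = {
    chart: rows["R >> 1"] - rows["R << 1"]
    for chart, rows in at_2_8.items()
}

for chart, gap in gaps.items():
    print(f"{chart}: R >> 1 minus R << 1 = {gap} percentage points")
```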

---

### Language and Localization
- **Primary Language**: English (all labels, axis titles, and legends are in English).
- **No Additional Languages Detected**.

---

### Final Notes
- All data points were cross-referenced with the legend to ensure color-label alignment.
- Trends were verified visually before numerical extraction.
- Apart from the titles, axis labels, and legend entries, no embedded text or data tables were present in the image.

TECHNICAL ASSET FINGERPRINT

3a5d28db8292b81da48e7a2f
