# Technical Document Extraction: Performance Comparison of Search Strategies

## 1. Document Metadata
*   **Title:** Comparing Beam Search and Best-of-N with Unsupervised Difficulty Bins
*   **Chart Type:** Grouped Bar Chart (overlaid, semi-transparent bars)
*   **Language:** English (100%)

## 2. Component Isolation

### Header
*   **Main Title:** "Comparing Beam Search and Best-of-N with Unsupervised Difficulty Bins"

### Main Chart Area
*   **Y-Axis Label:** MATH Test Accuracy (%)
*   **Y-Axis Scale:** 0 to 80, with major gridlines every 10 units (0, 10, 20, 30, 40, 50, 60, 70, 80).
*   **X-Axis Label:** Test Questions Binned with Unsupervised Difficulty Bins
*   **X-Axis Categories:** 5 distinct difficulty bins labeled 1, 2, 3, 4, and 5.
*   **Sub-categories:** Within each difficulty bin, there are 4 bar positions, with the bars for the three methods overlaid at each position. The positions appear to represent different model configurations or iterations.

### Legend [Top Right Placement]
*   **Blue (Semi-transparent):** Beam Search
*   **Orange (Semi-transparent):** Best-of-N Weighted
*   **Green (Semi-transparent):** Majority

## 3. Trend Verification and Data Extraction

### Overall Visual Trends
1.  **Difficulty Correlation:** There is a sharp, consistent downward trend in accuracy across all methods as the difficulty bin increases from 1 to 5.
2.  **Method Performance:** In most bar positions, "Beam Search" (Blue) reaches the highest accuracy, followed by "Best-of-N Weighted" (Orange), with "Majority" (Green) generally lowest. By the estimated values, "Best-of-N Weighted" matches or overtakes "Beam Search" at the third or fourth bar in Bins 1 through 3.
3.  **Intra-Bin Trend:** Within each difficulty bin, accuracy generally increases across the four bars from left to right, suggesting a progression in model scale or compute.
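
The difficulty trend in item 1 can be checked numerically. The sketch below is a minimal illustration that uses only the fourth-bar estimates from the reconstructed table in this document; the names `FOURTH_BAR`, `is_strictly_decreasing`, and `drop_between_first_and_last` are hypothetical helpers introduced here, and the inputs are visual approximations rather than exact data.

```python
# Fourth-bar accuracy estimates (%) for difficulty bins 1..5, read from the
# reconstructed table. These are visual approximations, not exact data.
FOURTH_BAR = {
    "Beam Search": [78, 54, 26, 21, 7],
    "Best-of-N Weighted": [79, 56, 28, 17, 4],
    "Majority": [78, 42, 12, 9, 2],
}


def is_strictly_decreasing(values):
    """Return True if every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


def drop_between_first_and_last(values):
    """Return the accuracy lost, in percentage points, from bin 1 to bin 5."""
    return values[0] - values[-1]


if __name__ == "__main__":
    for method, values in FOURTH_BAR.items():
        print(
            method,
            "decreasing:", is_strictly_decreasing(values),
            "drop (points):", drop_between_first_and_last(values),
        )
```

With these estimates, every method decreases strictly from Bin 1 to Bin 5 at the fourth bar. The same helpers could be applied to the other bar positions, where ties between adjacent bins may appear.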

### Data Table Reconstruction (Estimated Values)
The following table represents the approximate percentage values extracted from the visual height of the bars. Each difficulty bin contains 4 bars (ordered left to right).

| Difficulty Bin | Method | Bar 1 (%) | Bar 2 (%) | Bar 3 (%) | Bar 4 (%) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **1 (Easiest)** | Beam Search (Blue) | ~71 | ~78 | ~76 | ~78 |
| | Best-of-N (Orange) | ~62 | ~75 | ~78 | ~79 |
| | Majority (Green) | ~44 | ~69 | ~76 | ~78 |
| **2** | Beam Search (Blue) | ~32 | ~50 | ~53 | ~54 |
| | Best-of-N (Orange) | ~27 | ~43 | ~53 | ~56 |
| | Majority (Green) | ~16 | ~29 | ~40 | ~42 |
| **3** | Beam Search (Blue) | ~20 | ~26 | ~26 | ~26 |
| | Best-of-N (Orange) | ~8 | ~15 | ~21 | ~28 |
| | Majority (Green) | ~5 | ~8 | ~11 | ~12 |
| **4** | Beam Search (Blue) | ~12 | ~12 | ~12 | ~21 |
| | Best-of-N (Orange) | ~5 | ~9 | ~13 | ~17 |
| | Majority (Green) | ~3 | ~5 | ~8 | ~9 |
| **5 (Hardest)** | Beam Search (Blue) | ~1 | ~4 | ~4 | ~7 |
| | Best-of-N (Orange) | ~1.5 | ~3 | ~3.5 | ~4 |
| | Majority (Green) | ~0.5 | ~1 | ~2 | ~2 |
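
To make the table easier to query, the estimates can be transcribed into a small data structure. The following is a sketch only: `TABLE`, `leaders`, and `count_sole_leads` are hypothetical names introduced for illustration, and the numbers are the approximate values above, so small differences between methods may fall within reading error.

```python
# Estimated accuracies (%) from the table above, keyed by difficulty bin.
# Each list holds bars 1-4, left to right. Values are visual approximations.
TABLE = {
    1: {"Beam Search": [71, 78, 76, 78],
        "Best-of-N Weighted": [62, 75, 78, 79],
        "Majority": [44, 69, 76, 78]},
    2: {"Beam Search": [32, 50, 53, 54],
        "Best-of-N Weighted": [27, 43, 53, 56],
        "Majority": [16, 29, 40, 42]},
    3: {"Beam Search": [20, 26, 26, 26],
        "Best-of-N Weighted": [8, 15, 21, 28],
        "Majority": [5, 8, 11, 12]},
    4: {"Beam Search": [12, 12, 12, 21],
        "Best-of-N Weighted": [5, 9, 13, 17],
        "Majority": [3, 5, 8, 9]},
    5: {"Beam Search": [1, 4, 4, 7],
        "Best-of-N Weighted": [1.5, 3, 3.5, 4],
        "Majority": [0.5, 1, 2, 2]},
}


def leaders(table):
    """Map (bin, bar) to the list of methods sharing the highest estimate."""
    result = {}
    for bin_id, methods in table.items():
        for bar in range(4):
            best = max(values[bar] for values in methods.values())
            result[(bin_id, bar + 1)] = [
                name for name, values in methods.items() if values[bar] == best
            ]
    return result


def count_sole_leads(table):
    """Count the (bin, bar) positions where one method alone is highest."""
    counts = {}
    for names in leaders(table).values():
        if len(names) == 1:
            counts[names[0]] = counts.get(names[0], 0) + 1
    return counts


if __name__ == "__main__":
    for position, names in sorted(leaders(TABLE).items()):
        print(position, names)
    print(count_sole_leads(TABLE))
```

Ties are reported as a list of method names rather than resolved arbitrarily, which seems appropriate given that the inputs are estimates read off bar heights.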

## 4. Detailed Observations
*   **Bin 1 Performance:** The models perform exceptionally well on the easiest questions, with all methods converging near 80% accuracy by the fourth bar.
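*   **Bin 5 Performance:** Accuracy is low for the hardest questions: even the best method (Beam Search) peaks at roughly 7% at the fourth bar, and Majority stays at or below about 2%.
*   **Method Overlap:** The bars are semi-transparent and overlaid. Where colors overlap (e.g., Green inside Orange inside Blue), the shorter bar marks a lower accuracy: "Majority" typically sits below "Best-of-N," which often sits below "Beam Search." The overlap compares accuracy values only; it does not by itself show that the questions solved by one method are a subset of those solved by another.
*   **Beam Search Advantage:** In the mid-to-high difficulty bins (Bins 2, 3, and 4), the blue bars (Beam Search) are the tallest at the first two bar positions, with the widest margin over Best-of-N in Bin 3 (~20% vs. ~8% at the first bar). At later positions the estimates show Best-of-N matching or slightly exceeding Beam Search in several cases (e.g., Bin 3, Bar 4: ~28% vs. ~26%), so the advantage is clearest at the left-hand bars.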