# Technical Document Extraction: Performance Comparison of Search Strategies
## 1. Document Metadata
* **Title:** Comparing Beam Search and Best-of-N with Unsupervised Difficulty Bins
* **Chart Type:** Grouped bar chart with overlaid, semi-transparent bars (bars are drawn on top of one another, not stacked cumulatively)
* **Language:** English (100%)
## 2. Component Isolation
### Header
* **Main Title:** "Comparing Beam Search and Best-of-N with Unsupervised Difficulty Bins"
### Main Chart Area
* **Y-Axis Label:** MATH Test Accuracy (%)
* **Y-Axis Scale:** 0 to 80, with major gridlines every 10 units (0, 10, 20, 30, 40, 50, 60, 70, 80).
* **X-Axis Label:** Test Questions Binned with Unsupervised Difficulty Bins
* **X-Axis Categories:** 5 distinct difficulty bins labeled 1, 2, 3, 4, and 5.
* **Sub-categories:** Within each difficulty bin, there are 4 bar positions, each holding three overlaid bars (one per method). The positions appear to represent different model configurations or compute settings.
### Legend [Top Right Placement]
* **Blue (Semi-transparent):** Beam Search
* **Orange (Semi-transparent):** Best-of-N Weighted
* **Green (Semi-transparent):** Majority
## 3. Trend Verification and Data Extraction
### Overall Visual Trends
1. **Difficulty Correlation:** There is a sharp, consistent downward trend in accuracy across all methods as the difficulty bin increases from 1 to 5.
2. **Method Performance:** "Beam Search" (Blue) reaches the highest accuracy at most bar positions, especially the first two bars of each bin. "Best-of-N Weighted" (Orange) matches or overtakes it at some later bars, notably in Bins 1 to 3, and "Majority" (Green) never exceeds the other two methods, forming the baseline.
3. **Intra-Bin Trend:** Within each difficulty bin, accuracy generally increases across the four bars from left to right, suggesting a progression in model scale or compute.
### Data Table Reconstruction (Estimated Values)
The following table gives approximate percentage values estimated from the visual height of the bars. Each difficulty bin contains 4 bar positions (ordered left to right), with one value per method at each position.
| Difficulty Bin | Method | Bar 1 (%) | Bar 2 (%) | Bar 3 (%) | Bar 4 (%) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **1 (Easiest)** | Beam Search (Blue) | ~71 | ~78 | ~76 | ~78 |
| | Best-of-N (Orange) | ~62 | ~75 | ~78 | ~79 |
| | Majority (Green) | ~44 | ~69 | ~76 | ~78 |
| **2** | Beam Search (Blue) | ~32 | ~50 | ~53 | ~54 |
| | Best-of-N (Orange) | ~27 | ~43 | ~53 | ~56 |
| | Majority (Green) | ~16 | ~29 | ~40 | ~42 |
| **3** | Beam Search (Blue) | ~20 | ~26 | ~26 | ~26 |
| | Best-of-N (Orange) | ~8 | ~15 | ~21 | ~28 |
| | Majority (Green) | ~5 | ~8 | ~11 | ~12 |
| **4** | Beam Search (Blue) | ~12 | ~12 | ~12 | ~21 |
| | Best-of-N (Orange) | ~5 | ~9 | ~13 | ~17 |
| | Majority (Green) | ~3 | ~5 | ~8 | ~9 |
| **5 (Hardest)** | Beam Search (Blue) | ~1 | ~4 | ~4 | ~7 |
| | Best-of-N (Orange) | ~1.5 | ~3 | ~3.5 | ~4 |
| | Majority (Green) | ~0.5 | ~1 | ~2 | ~2 |
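
The reconstructed table can be queried directly to check which method leads at each position. The following is a minimal Python sketch, assuming the estimated values above are taken at face value; the names `ACCURACY`, `leaders`, and `is_non_decreasing` are illustrative and not part of the source figure.

```python
# Approximate accuracies (%) read off the chart; values mirror the table above.
# Layout: ACCURACY[bin][method] -> four bar heights, ordered left to right.
ACCURACY = {
    1: {"Beam Search": [71, 78, 76, 78],
        "Best-of-N": [62, 75, 78, 79],
        "Majority": [44, 69, 76, 78]},
    2: {"Beam Search": [32, 50, 53, 54],
        "Best-of-N": [27, 43, 53, 56],
        "Majority": [16, 29, 40, 42]},
    3: {"Beam Search": [20, 26, 26, 26],
        "Best-of-N": [8, 15, 21, 28],
        "Majority": [5, 8, 11, 12]},
    4: {"Beam Search": [12, 12, 12, 21],
        "Best-of-N": [5, 9, 13, 17],
        "Majority": [3, 5, 8, 9]},
    5: {"Beam Search": [1, 4, 4, 7],
        "Best-of-N": [1.5, 3, 3.5, 4],
        "Majority": [0.5, 1, 2, 2]},
}


def leaders(difficulty_bin, bar):
    """Return the method(s) with the highest estimate at a bin and bar (bar is 1-indexed)."""
    row = {method: values[bar - 1]
           for method, values in ACCURACY[difficulty_bin].items()}
    best = max(row.values())
    # Ties are reported as multiple names, sorted alphabetically.
    return sorted(method for method, value in row.items() if value == best)


def is_non_decreasing(values):
    """True if each bar is at least as tall as the one to its left."""
    return all(a <= b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    for difficulty_bin in ACCURACY:
        print(difficulty_bin, [leaders(difficulty_bin, bar) for bar in range(1, 5)])
```

Because the inputs are visual estimates, small gaps (one or two points) between methods should be read as approximate ties rather than firm rankings.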
## 4. Detailed Observations
* **Bin 1 Performance:** The models perform exceptionally well on the easiest questions, with all methods converging near 80% accuracy by the fourth bar.
* **Bin 5 Performance:** Performance is near zero for the hardest questions; even the best method (Beam Search) reaches only about 7% accuracy at the fourth bar.
* **Method Overlap:** The bars are semi-transparent and drawn on top of one another. Where a shorter bar is visible inside a taller one (e.g., Green inside Orange inside Blue), the shorter bar's method has lower accuracy at that position. The overlap does not show whether the questions solved by one method are a subset of those solved by another.
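* **Beam Search Advantage:** In the mid-to-high difficulty bins (Bins 2, 3, and 4), the blue bars (Beam Search) are the tallest at the first two bar positions, sometimes by a wide margin (e.g., ~20% vs. ~8% at Bin 3, Bar 1). At the later bars the picture is mixed: Best-of-N ties or overtakes Beam Search at Bin 2 (Bars 3 and 4), Bin 3 (Bar 4), and Bin 4 (Bar 3), while Beam Search leads again at Bin 4, Bar 4 (~21% vs. ~17%). On these estimates, Beam Search's advantage on this MATH test set is clearest at the earlier bar positions rather than uniform.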
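
The Bin 1 and Bin 5 observations reduce to simple arithmetic on the fourth-bar estimates. The sketch below, in Python, uses only those fourth-bar values from the reconstructed table; `BAR4`, `spread`, and `drop` are hypothetical names chosen for illustration.

```python
# Fourth-bar estimates (%) for difficulty bins 1..5, copied from the table.
BAR4 = {
    "Beam Search": [78, 54, 26, 21, 7],
    "Best-of-N": [79, 56, 28, 17, 4],
    "Majority": [78, 42, 12, 9, 2],
}


def spread(bin_number):
    """Gap in percentage points between the best and worst method in a bin (1-indexed)."""
    values = [per_bin[bin_number - 1] for per_bin in BAR4.values()]
    return max(values) - min(values)


def drop(method):
    """Percentage-point drop from Bin 1 to Bin 5 for one method."""
    return BAR4[method][0] - BAR4[method][-1]


if __name__ == "__main__":
    for bin_number in range(1, 6):
        print(f"Bin {bin_number}: spread = {spread(bin_number)} points")
    for method in BAR4:
        print(f"{method}: drop = {drop(method)} points")
```

On these estimates the methods sit within about one point of each other in Bin 1, while the gaps in Bins 2 to 4 are in the double digits, which is where the choice of search strategy matters most.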