## Radar Charts: Multi-Benchmark Performance Comparison
### Overview
The image displays three radar charts (also known as spider charts) arranged horizontally. Each chart compares the performance of different methods or model variants across five common benchmarks: MathVista, WeMath, MMStar, MMVet, and MathVision. The charts are titled "Data selection comparison," "Test-time scaling comparison," and "Ablation study," respectively. The method labeled "DreamPRM" (cyan line) appears in all three charts, serving as the common reference against which the alternatives are compared.
### Components/Axes
- **Chart Type:** Radar Charts (Spider Plots)
- **Common Axes (Benchmarks):** Five axes radiate from the center, each representing a benchmark:
1. MathVista (Top)
2. WeMath (Top-Right)
3. MMStar (Bottom-Right)
4. MMVet (Bottom-Left)
5. MathVision (Top-Left)
- **Scale:** The concentric circles represent performance scores, increasing from the center (0) outward. The outermost ring appears to represent a score of approximately 70.
- **Legends:** Each chart has a legend positioned directly below it, mapping line colors to method names.
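The axis geometry described above can be sketched in plain Python. This is a minimal illustration, not code from the figure's source: the first axis points straight up and the remaining axes follow clockwise at equal angular steps, matching the Top, Top-Right, Bottom-Right, Bottom-Left, Top-Left layout. The helper name `polygon` and the angle convention are assumptions for illustration.

```python
import math

# Benchmark names as they appear on the five axes of each radar chart.
benchmarks = ["MathVista", "WeMath", "MMStar", "MMVet", "MathVision"]

# Common radar-chart convention (assumed here): first axis at pi/2 (straight
# up), remaining axes clockwise at equal 2*pi/5 steps.
n = len(benchmarks)
angles = [math.pi / 2 - 2 * math.pi * i / n for i in range(n)]

def polygon(scores, axis_angles):
    """Convert per-axis scores into closed (x, y) polygon vertices."""
    pts = [(s * math.cos(a), s * math.sin(a))
           for s, a in zip(scores, axis_angles)]
    return pts + pts[:1]  # repeat the first vertex to close the shape

# Approximate DreamPRM values read off Chart 1 (visual estimates).
dreamprm = [68.9, 57.1, 61.1, 60.1, 65.0]
dreamprm_poly = polygon(dreamprm, angles)
```

Plotting libraries such as matplotlib expose the same idea through a polar projection; the point here is only that each method's polygon is its score vector mapped onto these fixed axis angles.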
### Detailed Analysis
#### Chart 1: Data selection comparison
- **Legend (Bottom-Left):**
- Orange: No selection
- Purple: CaR selection
- Red: s1 selection
- Cyan: DreamPRM
- **Data Series & Approximate Values (Score on each benchmark):**
- **DreamPRM (Cyan):** Forms the outermost polygon. Values: MathVista ~68.9, WeMath ~57.1, MMStar ~61.1, MMVet ~60.1, MathVision ~65.0.
- **s1 selection (Red):** Forms an inner polygon. Values: MathVista ~65.8, WeMath ~52.7, MMStar ~50.1, MMVet ~50.1, MathVision ~60.0.
- **CaR selection (Purple):** Forms an inner polygon, generally inside the red line. Values: MathVista ~65.3, WeMath ~52.7, MMStar ~49.1, MMVet ~49.1, MathVision ~59.0.
- **No selection (Orange):** Forms the innermost polygon. Values: MathVista ~61.5, WeMath ~47.7, MMStar ~47.1, MMVet ~47.1, MathVision ~56.0.
- **Trend Verification:** The cyan line (DreamPRM) is consistently the outermost, indicating the highest performance across all five benchmarks. The red line (s1 selection) is generally next, followed by purple (CaR selection), with orange (No selection) being the innermost.
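The hierarchy described above can be checked mechanically. The values below are the approximate readings listed in this section (visual estimates, not exact published numbers); ranking the methods by mean score reproduces the stated ordering.

```python
# Approximate Chart 1 ("Data selection comparison") values, ordered
# MathVista, WeMath, MMStar, MMVet, MathVision (visual estimates).
chart1 = {
    "DreamPRM":      [68.9, 57.1, 61.1, 60.1, 65.0],
    "s1 selection":  [65.8, 52.7, 50.1, 50.1, 60.0],
    "CaR selection": [65.3, 52.7, 49.1, 49.1, 59.0],
    "No selection":  [61.5, 47.7, 47.1, 47.1, 56.0],
}

# Mean score per method, then methods sorted best-first.
means = {m: sum(v) / len(v) for m, v in chart1.items()}
ranking = sorted(means, key=means.get, reverse=True)
```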
#### Chart 2: Test-time scaling comparison
- **Legend (Bottom-Center):**
- Orange: Self-consistency
- Purple: ORM
- Red: Self-correction
- Cyan: DreamPRM
- **Data Series & Approximate Values:**
- **DreamPRM (Cyan):** Outermost polygon. Values: MathVista ~68.9, WeMath ~60.1, MMStar ~62.3, MMVet ~61.3, MathVision ~65.0.
- **Self-correction (Red):** Inner polygon. Values: MathVista ~63.8, WeMath ~54.9, MMStar ~50.1, MMVet ~57.4, MathVision ~59.0.
- **ORM (Purple):** Inner polygon. Values: MathVista ~65.3, WeMath ~54.9, MMStar ~50.1, MMVet ~55.9, MathVision ~59.0.
  - **Self-consistency (Orange):** Inner polygon, largely overlapping the other two alternatives. Values: MathVista ~67.1, WeMath ~54.9, MMStar ~50.1, MMVet ~57.4, MathVision ~59.0.
- **Trend Verification:** DreamPRM (cyan) again forms the outermost shape. The other three methods (Self-consistency, ORM, Self-correction) are clustered more closely together in the middle range, with Self-consistency (orange) showing a notably higher score on MathVista compared to its performance on other axes.
#### Chart 3: Ablation study
- **Legend (Bottom-Right):**
- Orange: w/o AFL
- Purple: w/o ST
- Red: w/o BLO
- Cyan: DreamPRM
- **Data Series & Approximate Values:**
- **DreamPRM (Cyan):** Outermost polygon. Values: MathVista ~68.9, WeMath ~55.3, MMStar ~61.3, MMVet ~61.2, MathVision ~65.0.
- **w/o BLO (Red):** Inner polygon. Values: MathVista ~66.1, WeMath ~55.0, MMStar ~59.6, MMVet ~59.6, MathVision ~60.4.
- **w/o ST (Purple):** Inner polygon. Values: MathVista ~66.4, WeMath ~55.0, MMStar ~59.6, MMVet ~59.6, MathVision ~60.4.
  - **w/o AFL (Orange):** Inner polygon, overlapping the other two ablations almost exactly. Values: MathVista ~66.1, WeMath ~55.0, MMStar ~59.6, MMVet ~59.6, MathVision ~60.4.
- **Trend Verification:** DreamPRM (cyan) is the outermost. The three ablated versions (w/o AFL, w/o ST, w/o BLO) form nearly identical, overlapping polygons, suggesting that removing any one of these components (AFL, ST, BLO) has a similar, detrimental effect on performance across all benchmarks.
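The "similar, detrimental effect" can be quantified from the approximate readings above: the mean drop of each ablated variant relative to full DreamPRM is nearly the same for all three. The values below are visual estimates repeated from this section.

```python
# Approximate Chart 3 ("Ablation study") values, ordered
# MathVista, WeMath, MMStar, MMVet, MathVision (visual estimates).
chart3 = {
    "DreamPRM": [68.9, 55.3, 61.3, 61.2, 65.0],
    "w/o BLO":  [66.1, 55.0, 59.6, 59.6, 60.4],
    "w/o ST":   [66.4, 55.0, 59.6, 59.6, 60.4],
    "w/o AFL":  [66.1, 55.0, 59.6, 59.6, 60.4],
}

full = sum(chart3["DreamPRM"]) / 5  # mean score of the full model

# Mean-score drop of each ablated variant relative to full DreamPRM.
drops = {
    name: full - sum(scores) / 5
    for name, scores in chart3.items()
    if name != "DreamPRM"
}
```

Each ablation costs roughly the same amount on average (about 2 points), which is what the overlapping polygons show visually.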
### Key Observations
1. **Consistent Superiority:** The "DreamPRM" method (cyan line) achieves the highest score on every single benchmark across all three comparison charts.
2. **Performance Hierarchy:** In the "Data selection comparison," a clear performance hierarchy is visible: DreamPRM > s1 selection > CaR selection > No selection.
3. **Clustering of Alternatives:** In the "Test-time scaling comparison," the alternative methods (Self-consistency, ORM, Self-correction) cluster together, performing well below DreamPRM. They also sit above the "No selection" baseline from the first chart, assuming the two charts share the same scale.
4. **Impact of Ablation:** The "Ablation study" shows that removing any of the three components (AFL, ST, BLO) from the DreamPRM framework results in a similar and substantial drop in performance, indicating all are critical to its effectiveness.
5. **Benchmark Difficulty:** The relative ordering of benchmarks by score is not perfectly consistent across methods, but MathVista generally yields the highest scores, while MMStar and MMVet often yield the lowest for the non-DreamPRM methods.
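Observation 1 above is the strongest claim and the easiest to verify: on every axis of every chart, DreamPRM should be the top scorer. The check below uses the approximate readings collected throughout this section (visual estimates only).

```python
# Approximate values for all three charts, each ordered
# MathVista, WeMath, MMStar, MMVet, MathVision (visual estimates).
charts = {
    "Data selection": {
        "DreamPRM":      [68.9, 57.1, 61.1, 60.1, 65.0],
        "s1 selection":  [65.8, 52.7, 50.1, 50.1, 60.0],
        "CaR selection": [65.3, 52.7, 49.1, 49.1, 59.0],
        "No selection":  [61.5, 47.7, 47.1, 47.1, 56.0],
    },
    "Test-time scaling": {
        "DreamPRM":         [68.9, 60.1, 62.3, 61.3, 65.0],
        "Self-correction":  [63.8, 54.9, 50.1, 57.4, 59.0],
        "ORM":              [65.3, 54.9, 50.1, 55.9, 59.0],
        "Self-consistency": [67.1, 54.9, 50.1, 57.4, 59.0],
    },
    "Ablation": {
        "DreamPRM": [68.9, 55.3, 61.3, 61.2, 65.0],
        "w/o BLO":  [66.1, 55.0, 59.6, 59.6, 60.4],
        "w/o ST":   [66.4, 55.0, 59.6, 59.6, 60.4],
        "w/o AFL":  [66.1, 55.0, 59.6, 59.6, 60.4],
    },
}

# For each chart, find the top-scoring method on each of the five axes.
winners = {
    chart: [max(methods, key=lambda m: methods[m][i]) for i in range(5)]
    for chart, methods in charts.items()
}
```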
### Interpretation
This set of charts presents a compelling technical narrative for the effectiveness of the "DreamPRM" method.
- **What the data suggests:** The data strongly suggests that DreamPRM is a superior approach for the tasks measured by these five benchmarks, which span mathematical reasoning (MathVista, WeMath, MathVision) and general multimodal evaluation (MMStar, MMVet). Its advantage is not marginal but substantial and consistent.
- **How elements relate:** The three charts build a logical argument:
1. **Chart 1** establishes that intelligent data selection (s1, CaR) helps, but DreamPRM's selection strategy is better.
2. **Chart 2** shows that even advanced test-time techniques (Self-consistency, ORM) are outperformed by DreamPRM's approach.
3. **Chart 3** deconstructs DreamPRM, revealing that its core components (AFL, ST, BLO) are all essential; removing any one degrades performance to a similar, lower level.
- **Notable Anomalies/Patterns:** The near-identical performance of the three ablated models in Chart 3 is striking. It suggests these components may be interdependent or contribute equally vital, non-redundant functionality. The high score of "Self-consistency" on MathVista in Chart 2, relative to its other scores, might indicate that this particular benchmark benefits more from simple ensemble methods than others do.
**In summary, the visual evidence positions DreamPRM as a state-of-the-art method whose performance gain stems from a synergistic combination of its core components, outperforming both simpler selection strategies and other sophisticated test-time scaling techniques.**