# Technical Document Analysis: Bar Chart of Model Performance Metrics

## Title
**Best-of-8 Mean Accuracy (%) vs ProcessBench Mean F1 (%)**

---

## Axes and Labels
- **X-Axis**: Model names (categories):
  - Math-Shepherd-PRM-7B
  - RLHFlow-PRM-Mistral-8B
  - RLHFlow-PRM-Deepseek-8B
  - Skywork-PRM-1.5B
  - Skywork-PRM-7B
  - EurusPRM-Stage1
  - EurusPRM-Stage2
  - Qwen2.5-Math-Shepherd
  - Qwen2.5-Math-PRM800K
  - Qwen2.5-Math-PRM-7B
  - Qwen2.5-Math-PRM-72B
- **Y-Axis (Left)**: **Best-of-8 Mean Accuracy (%)** (blue bars)
- **Y-Axis (Right)**: **ProcessBench Mean F1 (%)** (orange bars)
- **Dashed Lines**:
  - **pass@8 (74.7%)**: Horizontal dashed line at 74.7% (left Y-axis)
  - **maj@8 (66.2%)**: Horizontal dashed line at 66.2% (left Y-axis)

---

## Legend
- **Location**: Right side of the chart
- **Colors**:
  - **Blue**: Best-of-8 Mean Accuracy (%)
  - **Orange**: ProcessBench Mean F1 (%)

---

## Data Points and Trends
### Best-of-8 Mean Accuracy (%)
- **Range**: 61.6% (EurusPRM-Stage1) to 69.3% (Qwen2.5-Math-PRM-72B)
- **Key Observations**:
  - **Highest**: Qwen2.5-Math-PRM-72B (69.3%)
  - **Lowest**: EurusPRM-Stage1 (61.6%)
  - **Thresholds**:
    - **maj@8 (66.2%)**: Exceeded by 2 of the models whose values are reported here (Qwen2.5-Math-PRM-7B: 67.6%, Qwen2.5-Math-PRM-72B: 69.3%). The other reported values fall below it: Math-Shepherd-PRM-7B (64.2%), RLHFlow-PRM-Deepseek-8B (64.9%), Skywork-PRM-7B (64.8%), Qwen2.5-Math-Shepherd (64.3%), Qwen2.5-Math-PRM800K (64.9%).
    - **pass@8 (74.7%)**: No model's Best-of-8 accuracy reaches this threshold; the highest value (69.3%) is 5.4 points below it.
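
The threshold comparison above can be reproduced with a short check. This is a minimal sketch that uses only the Best-of-8 values quoted in this section (three of the 11 models have no value listed here and are left out); the dictionary layout and the `models_above` helper are illustrative assumptions, not part of the chart.

```python
# Best-of-8 mean accuracy (%) for the models whose values are quoted above.
# Models without a quoted value are intentionally omitted.
best_of_8 = {
    "Math-Shepherd-PRM-7B": 64.2,
    "RLHFlow-PRM-Deepseek-8B": 64.9,
    "Skywork-PRM-7B": 64.8,
    "EurusPRM-Stage1": 61.6,
    "Qwen2.5-Math-Shepherd": 64.3,
    "Qwen2.5-Math-PRM800K": 64.9,
    "Qwen2.5-Math-PRM-7B": 67.6,
    "Qwen2.5-Math-PRM-72B": 69.3,
}

# Dashed reference lines, both read against the left (accuracy) axis.
REFERENCE_LINES = {"maj@8": 66.2, "pass@8": 74.7}


def models_above(scores, threshold):
    """Return the names of models scoring strictly above the threshold, sorted."""
    return sorted(name for name, value in scores.items() if value > threshold)


above = {label: models_above(best_of_8, t) for label, t in REFERENCE_LINES.items()}

for label, names in above.items():
    print(f"{label}: {len(names)} model(s) above -> {names}")
```

Only the two Qwen2.5-Math-PRM models clear the maj@8 line, and the list for pass@8 is empty.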

### ProcessBench Mean F1 (%)
- **Range**: 26.6% (RLHFlow-PRM-Deepseek-8B) to 78.3% (Qwen2.5-Math-PRM-72B)
- **Key Observations**:
  - **Highest**: Qwen2.5-Math-PRM-72B (78.3%)
  - **Lowest**: RLHFlow-PRM-Deepseek-8B (26.6%)
  - **Thresholds** (both dashed lines are drawn against the left accuracy axis, so these comparisons with F1 are numeric only):
    - **pass@8 (74.7%)**: Exceeded by 1 model (Qwen2.5-Math-PRM-72B: 78.3%). Qwen2.5-Math-PRM-7B (73.5%) falls 1.2 points short.
    - **maj@8 (66.2%)**: Exceeded by 2 models (Qwen2.5-Math-PRM-7B: 73.5%, Qwen2.5-Math-PRM-72B: 78.3%). Qwen2.5-Math-Shepherd (56.5%) and Qwen2.5-Math-PRM800K (64.9%) fall below it.
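
To show how far each F1 score sits from the two reference values, the signed margins can be computed directly. This is a sketch under stated assumptions: it covers only the lowest score and the two starred scores, and the `margin` helper and variable names are hypothetical. Because the reference lines belong to the accuracy axis, the margins are a numeric comparison only.

```python
# ProcessBench mean F1 (%) for the lowest and the two starred models.
processbench_f1 = {
    "RLHFlow-PRM-Deepseek-8B": 26.6,
    "Qwen2.5-Math-PRM-7B": 73.5,
    "Qwen2.5-Math-PRM-72B": 78.3,
}

# Reference values taken from the dashed lines on the accuracy axis.
REFERENCE_LINES = {"maj@8": 66.2, "pass@8": 74.7}


def margin(value, threshold):
    """Signed distance in percentage points; positive means above the line."""
    return round(value - threshold, 1)


margins = {
    name: {label: margin(f1, t) for label, t in REFERENCE_LINES.items()}
    for name, f1 in processbench_f1.items()
}

for name, per_line in margins.items():
    print(name, per_line)
```

A negative margin marks a score below the line, which makes it easy to see that the 7B model clears 66.2 but not 74.7.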

---

## Additional Notes
- **Stars (★)**: Indicate top-performing models in **ProcessBench Mean F1 (%)**:
  - **★ Qwen2.5-Math-PRM-7B** (73.5%)
  - **★ Qwen2.5-Math-PRM-72B** (78.3%)
- **Color Consistency**:
  - Blue bars (Best-of-8) and orange bars (ProcessBench) align with the legend.
  - No mismatches detected between legend labels and bar colors.

---

## Summary
The chart compares **Best-of-8 Mean Accuracy (%)** and **ProcessBench Mean F1 (%)** across 11 models. No model's Best-of-8 accuracy reaches the **pass@8 (74.7%)** reference line, and of the values reported here only Qwen2.5-Math-PRM-7B (67.6%) and Qwen2.5-Math-PRM-72B (69.3%) exceed **maj@8 (66.2%)**. ProcessBench F1 scores vary far more widely, from 26.6% to 78.3%. The stars mark the two highest ProcessBench scores: Qwen2.5-Math-PRM-7B (73.5%) and Qwen2.5-Math-PRM-72B (78.3%).