Image 69cb7f88ad79...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
## Bar Chart: Accuracy of each LLM and PRM combination using Best-of-N on AMC

### Overview
The chart compares the accuracy of four large language models (LLMs) on the AMC benchmark, each paired with different process reward models (PRMs) under Best-of-N selection. It uses a grouped bar format with one group per LLM. Each group contains five bars: an Accuracy@1 baseline and one bar for each of the four PRMs.
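
The chart does not say how Best-of-N was implemented. As a minimal sketch of the general idea, assuming the PRM returns one score per reasoning step and that a candidate's overall score is the minimum of its step scores, selection could look like the following. The names `best_of_n`, `score_steps`, and `toy_score_steps` are hypothetical, and the toy scorer is only a stand-in for a real PRM.

```python
from typing import Callable, List, Sequence


def best_of_n(
    candidates: Sequence[str],
    score_steps: Callable[[str], List[float]],
    aggregate: Callable[[List[float]], float] = min,
) -> str:
    """Return the candidate whose aggregated step scores are highest.

    `score_steps` stands in for a PRM call (assumption: one score per step).
    `aggregate` collapses step scores into one number; `min` is one common
    choice, not necessarily the one used for the chart.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: aggregate(score_steps(c)))


def toy_score_steps(solution: str) -> List[float]:
    # Hypothetical scorer: penalize any step that contains a question mark.
    return [0.2 if "?" in step else 0.9 for step in solution.split("\n")]


samples = ["x = 2\nx + 1 = 3", "x = 2?\nx + 1 = 4"]
chosen = best_of_n(samples, toy_score_steps)
print(chosen)
```

Accuracy@1, by contrast, scores a single sampled answer with no reranking, which is why it serves as the baseline bar in each group.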

### Components/Axes
- **X-axis**: Four LLM categories:
  1. Qwen2.5-7B-Instruct
  2. Eurus-2-7B-PRIME
  3. Phi-4-14B
  4. Qwen2.5-7B-AIRL-S (Our LLM)
- **Y-axis**: Accuracy (%) ranging from 30% to 70% in 5% increments.
- **Legend** (right side):
  - Pink: Accuracy@1
  - Light orange: Math-Shepherd-Mistral-7B-PRM
  - Light green: EurusPRM-Stage2
  - Medium green: Llama3.1-8B-PRM-Deepseek-Data
  - Dark blue: Qwen2.5-AIRL-S-PRM (Ours PRM)

### Detailed Analysis
1. **Qwen2.5-7B-Instruct**:
   - Accuracy@1: 33.7% (pink)
   - Math-Shepherd-Mistral-7B-PRM: 53.0% (light orange)
   - EurusPRM-Stage2: 54.2% (light green)
   - Llama3.1-8B-PRM-Deepseek-Data: 55.4% (medium green)
   - Qwen2.5-AIRL-S-PRM: 56.6% (dark blue)

2. **Eurus-2-7B-PRIME**:
   - Accuracy@1: 56.6% (pink)
   - Math-Shepherd-Mistral-7B-PRM: 61.4% (light orange)
   - EurusPRM-Stage2: 63.9% (light green)
   - Llama3.1-8B-PRM-Deepseek-Data: 63.9% (medium green)
   - Qwen2.5-AIRL-S-PRM: 65.1% (dark blue)

3. **Phi-4-14B**:
   - Accuracy@1: 44.6% (pink)
   - Math-Shepherd-Mistral-7B-PRM: 60.2% (light orange)
   - EurusPRM-Stage2: 59.0% (light green)
   - Llama3.1-8B-PRM-Deepseek-Data: 61.4% (medium green)
   - Qwen2.5-AIRL-S-PRM: 62.6% (dark blue)

4. **Qwen2.5-7B-AIRL-S (Our LLM)**:
   - Accuracy@1: 59.0% (pink)
   - Math-Shepherd-Mistral-7B-PRM: 63.9% (light orange)
   - EurusPRM-Stage2: 65.1% (light green)
   - Llama3.1-8B-PRM-Deepseek-Data: 65.1% (medium green)
   - Qwen2.5-AIRL-S-PRM: 67.5% (dark blue)

### Key Observations
- **Highest Performance**: The Qwen2.5-7B-AIRL-S (Our LLM) with its PRM achieves the highest accuracy (67.5%), outperforming all other combinations.
- **Accuracy@1 Baseline**: The Accuracy@1 bars (pink) are the lowest in every group, so single-sample accuracy without PRM-based selection is always below the Best-of-N results.
- **PRM Impact**: Every PRM improves on Accuracy@1 for every LLM, and Qwen2.5-AIRL-S-PRM (dark blue) is the highest bar in all four groups.
- **Category-Specific Trends**:
  - **Qwen2.5-7B-Instruct**: Has the lowest Accuracy@1 (33.7%) and shows the largest improvement (+22.9 percentage points) from Accuracy@1 to Qwen2.5-AIRL-S-PRM.
  - **Phi-4-14B**: Has the second-lowest Accuracy@1 (44.6%) and reaches 62.6% with Qwen2.5-AIRL-S-PRM (+18.0 percentage points), the third-highest result for that PRM.
  - **Eurus-2-7B-PRIME**: Already performs well at Accuracy@1 (56.6%) but still benefits from Qwen2.5-AIRL-S-PRM (+8.5 percentage points).
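
The gains quoted above can be rechecked from the values listed under Detailed Analysis. The following sketch copies those values into a dictionary and computes, for each LLM, the best PRM and its gain over Accuracy@1 in percentage points. The helper names `gains` and `best_prm` are illustrative, not part of any source material.

```python
# Accuracy values (%) copied from the Detailed Analysis section.
BASELINE = "Accuracy@1"

results = {
    "Qwen2.5-7B-Instruct": {
        "Accuracy@1": 33.7,
        "Math-Shepherd-Mistral-7B-PRM": 53.0,
        "EurusPRM-Stage2": 54.2,
        "Llama3.1-8B-PRM-Deepseek-Data": 55.4,
        "Qwen2.5-AIRL-S-PRM": 56.6,
    },
    "Eurus-2-7B-PRIME": {
        "Accuracy@1": 56.6,
        "Math-Shepherd-Mistral-7B-PRM": 61.4,
        "EurusPRM-Stage2": 63.9,
        "Llama3.1-8B-PRM-Deepseek-Data": 63.9,
        "Qwen2.5-AIRL-S-PRM": 65.1,
    },
    "Phi-4-14B": {
        "Accuracy@1": 44.6,
        "Math-Shepherd-Mistral-7B-PRM": 60.2,
        "EurusPRM-Stage2": 59.0,
        "Llama3.1-8B-PRM-Deepseek-Data": 61.4,
        "Qwen2.5-AIRL-S-PRM": 62.6,
    },
    "Qwen2.5-7B-AIRL-S": {
        "Accuracy@1": 59.0,
        "Math-Shepherd-Mistral-7B-PRM": 63.9,
        "EurusPRM-Stage2": 65.1,
        "Llama3.1-8B-PRM-Deepseek-Data": 65.1,
        "Qwen2.5-AIRL-S-PRM": 67.5,
    },
}


def gains(row):
    """Gain of each PRM over the Accuracy@1 baseline, in percentage points."""
    base = row[BASELINE]
    return {name: round(acc - base, 1) for name, acc in row.items() if name != BASELINE}


def best_prm(row):
    """Name of the PRM with the highest accuracy in this group."""
    prm_scores = {name: acc for name, acc in row.items() if name != BASELINE}
    return max(prm_scores, key=prm_scores.get)


for llm, row in results.items():
    top = best_prm(row)
    print(f"{llm}: best PRM = {top}, gain = +{gains(row)[top]} points")
```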

### Interpretation
The data indicates that Best-of-N selection with a PRM raises accuracy on AMC over the Accuracy@1 baseline for every LLM shown, with gains ranging from 4.8 to 22.9 percentage points. Qwen2.5-AIRL-S-PRM gives the highest accuracy in all four groups, and its pairing with Qwen2.5-7B-AIRL-S (Our LLM) reaches the best result in the chart (67.5%). The gains are largest for the LLMs with the lowest baselines (Qwen2.5-7B-Instruct and Phi-4-14B) and smallest for the two LLMs that already exceed 56% at Accuracy@1. The ordering of the other PRMs is not uniform: for Phi-4-14B, Math-Shepherd-Mistral-7B-PRM (60.2%) outperforms EurusPRM-Stage2 (59.0%), while the reverse holds in the other three groups. PRM selection therefore matters for mathematical reasoning accuracy, although the chart alone does not show why a given pairing works better.