## Bar Chart: Mean accuracy and macro average (95% confidence intervals) after injection of internal error

### Overview
The chart compares nine AI models on three datasets (SCLI5, GSM8K-SC, PRM800K-SC) and on their macro average after injection of an internal error. Mean accuracy is plotted on a 0.0 to 1.0 scale, and error bars show 95% confidence intervals.

### Components/Axes
- **X-axis**: Models (DeepSeek-R1-0528, QwQ-32B, Qwen3-235B-A22B (thinking), Qwen3-30B-A3B (thinking), Qwen3-14B (thinking), gemma-3-27b-it, Qwen3-32B (thinking), gemma-3-12b-it, Phi-4-reasoning-plus)
- **Y-axis**: Accuracy (0.0–1.0 in 0.2 increments)
- **Legend**:
  - Blue: SCLI5
  - Orange: GSM8K-SC
  - Green: PRM800K-SC
  - Red: Macro Average
- **Error Bars**: Vertical lines representing 95% confidence intervals (values in parentheses)

### Detailed Analysis
1. **DeepSeek-R1-0528**
   - SCLI5: 0.998 (±0.012)
   - GSM8K-SC: 0.965 (±0.015)
   - PRM800K-SC: 0.772 (±0.021)
   - Macro Average: 0.908 (±0.018)

2. **QwQ-32B**
   - SCLI5: 0.978 (±0.014)
   - GSM8K-SC: 0.952 (±0.018)
   - PRM800K-SC: 0.770 (±0.020)
   - Macro Average: 0.894 (±0.017)

3. **Qwen3-235B-A22B (thinking)**
   - SCLI5: 0.954 (±0.016)
   - GSM8K-SC: 0.953 (±0.019)
   - PRM800K-SC: 0.758 (±0.023)
   - Macro Average: 0.876 (±0.019)

4. **Qwen3-30B-A3B (thinking)**
   - SCLI5: 0.843 (±0.020)
   - GSM8K-SC: 0.921 (±0.017)
   - PRM800K-SC: 0.775 (±0.022)
   - Macro Average: 0.845 (±0.018)

5. **Qwen3-14B (thinking)**
   - SCLI5: 0.856 (±0.019)
   - GSM8K-SC: 0.942 (±0.016)
   - PRM800K-SC: 0.741 (±0.024)
   - Macro Average: 0.843 (±0.019)

6. **gemma-3-27b-it**
   - SCLI5: 0.879 (±0.018)
   - GSM8K-SC: 0.778 (±0.021)
   - PRM800K-SC: 0.789 (±0.020)
   - Macro Average: 0.815 (±0.017)

7. **Qwen3-32B (thinking)**
   - SCLI5: 0.798 (±0.022)
   - GSM8K-SC: 0.913 (±0.019)
   - PRM800K-SC: 0.703 (±0.025)
   - Macro Average: 0.804 (±0.018)

8. **gemma-3-12b-it**
   - SCLI5: 0.763 (±0.023)
   - GSM8K-SC: 0.789 (±0.020)
   - PRM800K-SC: 0.762 (±0.024)
   - Macro Average: 0.763 (±0.019)

9. **Phi-4-reasoning-plus**
   - SCLI5: 0.731 (±0.025)
   - GSM8K-SC: 0.718 (±0.026)
   - PRM800K-SC: 0.667 (±0.027)
   - Macro Average: 0.707 (±0.020)
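
The Macro Average entries can be sanity-checked against the per-dataset values. The sketch below assumes the macro average is the unweighted mean of the three dataset accuracies, which is the usual definition but is not stated on the chart. The `ROWS` table and the `macro_gap` helper are illustrative names; the numbers are copied from the list above and keyed by item number.

```python
# Values copied from the Detailed Analysis list, keyed by item number (1-9).
# Columns: SCLI5, GSM8K-SC, PRM800K-SC, reported Macro Average.
ROWS = {
    1: (0.998, 0.965, 0.772, 0.908),
    2: (0.978, 0.952, 0.770, 0.894),
    3: (0.954, 0.953, 0.758, 0.876),
    4: (0.843, 0.921, 0.775, 0.845),
    5: (0.856, 0.942, 0.741, 0.843),
    6: (0.879, 0.778, 0.789, 0.815),
    7: (0.798, 0.913, 0.703, 0.804),
    8: (0.763, 0.789, 0.762, 0.763),
    9: (0.731, 0.718, 0.667, 0.707),
}


def macro_gap(row):
    """Return (recomputed mean, reported value, absolute gap) for one row.

    Assumption: the macro average is the unweighted mean of the three
    dataset accuracies.
    """
    *datasets, reported = row
    recomputed = sum(datasets) / len(datasets)
    return recomputed, reported, abs(recomputed - reported)


gaps = {item: macro_gap(row) for item, row in ROWS.items()}

for item, (recomputed, reported, gap) in gaps.items():
    print(f"item {item}: recomputed={recomputed:.3f} "
          f"reported={reported:.3f} gap={gap:.3f}")
```

Under this assumption the recomputed means stay within about 0.02 of the listed Macro Average values, and each listed value lies between that model's lowest and highest dataset accuracy.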

### Key Observations
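- **Highest Performance**: DeepSeek-R1-0528 has the highest SCLI5 (0.998), GSM8K-SC (0.965), and Macro Average (0.908) values. On PRM800K-SC, gemma-3-27b-it (0.789) and Qwen3-30B-A3B (0.775) score slightly above it (0.772), although the listed confidence intervals overlap.
- **Lowest Performance**: Phi-4-reasoning-plus has the lowest accuracy in every category (SCLI5: 0.731, GSM8K-SC: 0.718, PRM800K-SC: 0.667, Macro Average: 0.707).
- **Macro Average Consistency**: For every model, the Macro Average (red bars) lies between that model's lowest and highest per-dataset accuracy, as expected for an average over the three datasets.
- **Error Bar Variability**: The listed intervals are widest for the lowest-scoring model (Phi-4-reasoning-plus: ±0.025–0.027 per dataset) and narrowest for the highest-scoring entry (DeepSeek-R1-0528 on SCLI5: ±0.012).
- **Model-Specific Trends**:
  - Accuracy on SCLI5 (blue) and GSM8K-SC (orange) is generally higher than on PRM800K-SC (green). The exception is gemma-3-27b-it, whose GSM8K-SC accuracy (0.778) is slightly below its PRM800K-SC accuracy (0.789).
  - Qwen3-30B-A3B, Qwen3-14B, and Qwen3-32B score lower on SCLI5 (0.843, 0.856, 0.798) than on GSM8K-SC (0.921, 0.942, 0.913), the reverse of the ordering seen for the three top-ranked models.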
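
Claims about which model leads or trails in each category can be checked directly against the listed values. The following is a minimal sketch, not part of the original analysis: `ROWS`, `COLUMNS`, and `extremes` are illustrative names, and the keys are the item numbers (1 to 9) from the Detailed Analysis list.

```python
# Values copied from the Detailed Analysis list, keyed by item number (1-9).
COLUMNS = ("SCLI5", "GSM8K-SC", "PRM800K-SC", "Macro Average")
ROWS = {
    1: (0.998, 0.965, 0.772, 0.908),
    2: (0.978, 0.952, 0.770, 0.894),
    3: (0.954, 0.953, 0.758, 0.876),
    4: (0.843, 0.921, 0.775, 0.845),
    5: (0.856, 0.942, 0.741, 0.843),
    6: (0.879, 0.778, 0.789, 0.815),
    7: (0.798, 0.913, 0.703, 0.804),
    8: (0.763, 0.789, 0.762, 0.763),
    9: (0.731, 0.718, 0.667, 0.707),
}


def extremes(column_index):
    """Return (item with highest value, item with lowest value) for a column."""
    best = max(ROWS, key=lambda item: ROWS[item][column_index])
    worst = min(ROWS, key=lambda item: ROWS[item][column_index])
    return best, worst


leaders = {name: extremes(i) for i, name in enumerate(COLUMNS)}

for name, (best, worst) in leaders.items():
    print(f"{name}: highest = item {best}, lowest = item {worst}")
```

Comparing point estimates alone ignores the confidence intervals, so a narrow lead in one column should not be read as a statistically meaningful difference.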

### Interpretation
The chart reports accuracy after internal error injection only, so it does not show how much each model degrades relative to an uninjected baseline. What it does show is that post-injection accuracy varies widely by model and dataset: the macro average ranges from 0.908 (DeepSeek-R1-0528) to 0.707 (Phi-4-reasoning-plus). PRM800K-SC is the lowest-scoring dataset for most models, with accuracies between 0.667 and 0.789, whereas the strongest models exceed 0.95 on SCLI5 and GSM8K-SC. The wider intervals for weaker models such as Phi-4-reasoning-plus indicate less precise estimates. SCLI5, GSM8K-SC, and PRM800K-SC are evaluation datasets rather than models, so the lower PRM800K-SC scores suggest that this dataset is more difficult, not that an architecture is limited.