Image cc2c3245eba3...

EXPERT: gemini-2.0-flash VERSION 1

RUNTIME: nugit/gemini/gemini-2.0-flash

## Bar Chart: Mean Accuracy and Macro Average After Injection of Internal Error

### Overview
The image is a bar chart comparing the accuracy of nine models after injection of internal errors. For each model it shows mean accuracy on three datasets (SCLI5, GSM8K-SC, PRM800K-SC) and the macro average across them. Error bars indicate 95% confidence intervals.

### Components/Axes
*   **Title:** Mean accuracy and macro average (95% confidence intervals) after injection of internal error
*   **X-axis:** Models (DeepSeek-R1-0528, QwQ-32B, Qwen3-235B-A22B (thinking), Qwen3-30B-A3B (thinking), Qwen3-14B (thinking), gemma-3-27b-it, Qwen3-32B (thinking), gemma-3-12b-it, Phi-4-reasoning-plus)
*   **Y-axis:** Accuracy (scale from 0.0 to 1.0, incrementing by 0.2)
*   **Legend:** Located in the top-right corner.
    *   SCLI5 (light blue)
    *   GSM8K-SC (light orange)
    *   PRM800K-SC (light green)
    *   Macro Average (red)

### Detailed Analysis

**Model Performance Breakdown:**

1.  **DeepSeek-R1-0528:**
    *   SCLI5: ~0.98 with a small confidence interval.
    *   GSM8K-SC: ~0.94 with a small confidence interval.
    *   PRM800K-SC: ~0.78 with a moderate confidence interval.
    *   Macro Average: ~0.91.

2.  **QwQ-32B:**
    *   SCLI5: ~0.94 with a small confidence interval.
    *   GSM8K-SC: ~0.93 with a small confidence interval.
    *   PRM800K-SC: ~0.77 with a moderate confidence interval.
    *   Macro Average: ~0.91.

3.  **Qwen3-235B-A22B (thinking):**
    *   SCLI5: ~0.91 with a small confidence interval.
    *   GSM8K-SC: ~0.92 with a small confidence interval.
    *   PRM800K-SC: ~0.77 with a moderate confidence interval.
    *   Macro Average: ~0.89.

4.  **Qwen3-30B-A3B (thinking):**
    *   SCLI5: ~0.85 with a small confidence interval.
    *   GSM8K-SC: ~0.90 with a small confidence interval.
    *   PRM800K-SC: ~0.76 with a moderate confidence interval.
    *   Macro Average: ~0.88.

5.  **Qwen3-14B (thinking):**
    *   SCLI5: ~0.85 with a small confidence interval.
    *   GSM8K-SC: ~0.94 with a small confidence interval.
    *   PRM800K-SC: ~0.74 with a moderate confidence interval.
    *   Macro Average: ~0.84.

6.  **gemma-3-27b-it:**
    *   SCLI5: ~0.83 with a small confidence interval.
    *   GSM8K-SC: ~0.82 with a small confidence interval.
    *   PRM800K-SC: ~0.78 with a moderate confidence interval.
    *   Macro Average: ~0.82.

7.  **Qwen3-32B (thinking):**
    *   SCLI5: ~0.80 with a moderate confidence interval.
    *   GSM8K-SC: ~0.91 with a small confidence interval.
    *   PRM800K-SC: ~0.72 with a moderate confidence interval.
    *   Macro Average: ~0.81.

8.  **gemma-3-12b-it:**
    *   SCLI5: ~0.78 with a moderate confidence interval.
    *   GSM8K-SC: ~0.78 with a small confidence interval.
    *   PRM800K-SC: ~0.76 with a moderate confidence interval.
    *   Macro Average: ~0.77.

9.  **Phi-4-reasoning-plus:**
    *   SCLI5: ~0.75 with a moderate confidence interval.
    *   GSM8K-SC: ~0.73 with a small confidence interval.
    *   PRM800K-SC: ~0.71 with a moderate confidence interval.
    *   Macro Average: ~0.71.

### Key Observations

*   **SCLI5 accuracy is high for most models**: at or above ~0.80 for the first seven, dropping to ~0.78 for gemma-3-12b-it and ~0.75 for Phi-4-reasoning-plus.
*   **GSM8K-SC also exhibits high accuracy**, often comparable to or slightly higher than SCLI5.
*   **PRM800K-SC generally has lower accuracy** than the other two datasets, with more variability as indicated by the larger confidence intervals.
*   **Macro Average generally falls between PRM800K-SC and the higher-performing SCLI5 and GSM8K-SC.**
*   **DeepSeek-R1-0528 and QwQ-32B** have the highest macro averages (both ~0.91).
*   **gemma-3-12b-it and Phi-4-reasoning-plus** have the lowest macro averages (~0.77 and ~0.71).
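
The observation that PRM800K-SC is generally the weakest dataset can be checked directly against the approximate readings in the breakdown above. The following is a minimal sketch, assuming those visual estimates are accurate enough for a rank comparison; the `readings` dictionary and the `prm_is_lowest` helper are illustrative names, not part of the source.

```python
# Approximate bar heights transcribed from the model breakdown above, in the
# order (SCLI5, GSM8K-SC, PRM800K-SC). These are visual estimates, not exact data.
readings = {
    "DeepSeek-R1-0528": (0.98, 0.94, 0.78),
    "QwQ-32B": (0.94, 0.93, 0.77),
    "Qwen3-235B-A22B (thinking)": (0.91, 0.92, 0.77),
    "Qwen3-30B-A3B (thinking)": (0.85, 0.90, 0.76),
    "Qwen3-14B (thinking)": (0.85, 0.94, 0.74),
    "gemma-3-27b-it": (0.83, 0.82, 0.78),
    "Qwen3-32B (thinking)": (0.80, 0.91, 0.72),
    "gemma-3-12b-it": (0.78, 0.78, 0.76),
    "Phi-4-reasoning-plus": (0.75, 0.73, 0.71),
}


def prm_is_lowest(scores):
    """Return True if PRM800K-SC is the lowest (or tied-lowest) of the three."""
    scli5, gsm8k_sc, prm800k_sc = scores
    return prm800k_sc <= min(scli5, gsm8k_sc)


models_where_prm_lowest = [m for m, s in readings.items() if prm_is_lowest(s)]
print(f"PRM800K-SC is lowest for {len(models_where_prm_lowest)} of {len(readings)} models")
```

With these estimates, PRM800K-SC is the lowest-scoring dataset for every model listed, although the margin is small for the two gemma models and Phi-4-reasoning-plus.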

### Interpretation

The bar chart illustrates how different models perform after internal errors are injected. Models score consistently higher on SCLI5 and GSM8K-SC than on PRM800K-SC, suggesting that the errors injected into PRM800K-SC problems are harder to recover from. The Macro Average summarizes each model's performance across the three datasets. The confidence intervals indicate the precision of the accuracy estimates; wider intervals mean greater uncertainty. DeepSeek-R1-0528 and QwQ-32B appear to be the most resilient to internal errors, while gemma-3-12b-it and Phi-4-reasoning-plus are the least. This information is useful for selecting models that maintain high accuracy even when faced with internal inconsistencies or noise.

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

## Bar Chart: Mean accuracy and macro average (95% confidence intervals) after injection of internal error

### Overview
The chart compares the accuracy of nine AI models on three datasets (SCLI5, GSM8K-SC, PRM800K-SC), plus the Macro Average across them, after internal error injection. The accuracy axis runs from 0.0 to 1.0, with 95% confidence intervals represented by error bars.

### Components/Axes
- **X-axis**: Models (DeepSeek-R1-0528, QwQ-32B, Qwen3-235B-A22B (thinking), Qwen3-30B-A3B (thinking), Qwen3-14B (thinking), gemma-3-27b-it, Qwen3-32B (thinking), gemma-3-12b-it, Phi-4-reasoning-plus)
- **Y-axis**: Accuracy (0.0–1.0 in 0.2 increments)
- **Legend**:
  - Blue: SCLI5
  - Orange: GSM8K-SC
  - Green: PRM800K-SC
  - Red: Macro Average
- **Error Bars**: Vertical lines representing 95% confidence intervals (values in parentheses)
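
The description does not say how the 95% confidence intervals were computed. As a rough illustration only, the sketch below uses the normal-approximation (Wald) interval for a proportion, which is one common choice; the function name `wald_half_width` and the sample size `n_examples` are assumptions, not values taken from the chart.

```python
import math


def wald_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


n_examples = 1000  # hypothetical sample size, not taken from the chart
for acc in (0.95, 0.85, 0.70):
    half_width = wald_half_width(acc, n_examples)
    print(f"accuracy={acc:.2f}  half-width={half_width:.3f}")
```

Under this approximation, and for a fixed sample size, the interval widens as accuracy moves from 1.0 toward 0.5, so lower-scoring bars would be expected to carry somewhat longer error bars.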

### Detailed Analysis
1. **DeepSeek-R1-0528**
   - SCLI5: 0.998 (±0.012)
   - GSM8K-SC: 0.965 (±0.015)
   - PRM800K-SC: 0.772 (±0.021)
   - Macro Average: 0.908 (±0.018)

2. **QwQ-32B**
   - SCLI5: 0.978 (±0.014)
   - GSM8K-SC: 0.952 (±0.018)
   - PRM800K-SC: 0.770 (±0.020)
   - Macro Average: 0.894 (±0.017)

3. **Qwen3-235B-A22B (thinking)**
   - SCLI5: 0.954 (±0.016)
   - GSM8K-SC: 0.953 (±0.019)
   - PRM800K-SC: 0.758 (±0.023)
   - Macro Average: 0.876 (±0.019)

4. **Qwen3-30B-A3B (thinking)**
   - SCLI5: 0.843 (±0.020)
   - GSM8K-SC: 0.921 (±0.017)
   - PRM800K-SC: 0.775 (±0.022)
   - Macro Average: 0.845 (±0.018)

5. **Qwen3-14B (thinking)**
   - SCLI5: 0.856 (±0.019)
   - GSM8K-SC: 0.942 (±0.016)
   - PRM800K-SC: 0.741 (±0.024)
   - Macro Average: 0.843 (±0.019)

6. **gemma-3-27b-it**
   - SCLI5: 0.879 (±0.018)
   - GSM8K-SC: 0.778 (±0.021)
   - PRM800K-SC: 0.789 (±0.020)
   - Macro Average: 0.815 (±0.017)

7. **Qwen3-32B (thinking)**
   - SCLI5: 0.798 (±0.022)
   - GSM8K-SC: 0.913 (±0.019)
   - PRM800K-SC: 0.703 (±0.025)
   - Macro Average: 0.804 (±0.018)

8. **gemma-3-12b-it**
   - SCLI5: 0.763 (±0.023)
   - GSM8K-SC: 0.789 (±0.020)
   - PRM800K-SC: 0.762 (±0.024)
   - Macro Average: 0.763 (±0.019)

9. **Phi-4-reasoning-plus**
   - SCLI5: 0.731 (±0.025)
   - GSM8K-SC: 0.718 (±0.026)
   - PRM800K-SC: 0.667 (±0.027)
   - Macro Average: 0.707 (±0.020)

### Key Observations
- **Highest Performance**: DeepSeek-R1-0528 has the highest SCLI5 (0.998), GSM8K-SC (0.965), and Macro Average (0.908) values; gemma-3-27b-it has the highest PRM800K-SC value (0.789).
- **Lowest Performance**: Phi-4-reasoning-plus has the lowest accuracy on every dataset (SCLI5: 0.731, Macro Average: 0.707).
- **Macro Average Consistency**: The Macro Average (red bars) lies between each model's lowest and highest per-dataset accuracy, as expected of an average; for most models it is pulled down by PRM800K-SC.
- **Error Bar Variability**: Larger error bars (e.g., Phi-4-reasoning-plus: ±0.025–0.027) suggest greater uncertainty in lower-performing models.
- **Model-Specific Trends**:
  - SCLI5 (blue) and GSM8K-SC (orange) generally outperform PRM800K-SC (green).
  - Qwen3-30B-A3B (0.843) and Qwen3-14B (0.856) score roughly 0.10 or more below the three top-ranked models on SCLI5 (0.954–0.998).
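
A macro average is conventionally the unweighted mean of the per-dataset scores, so it should fall between a model's lowest and highest dataset accuracy. The sketch below recomputes it from the values listed under Detailed Analysis, assuming that definition applies to this chart; the `reported` dictionary and the tolerance are illustrative choices.

```python
from statistics import mean

# (SCLI5, GSM8K-SC, PRM800K-SC, reported Macro Average), copied from the
# Detailed Analysis list above.
reported = {
    "DeepSeek-R1-0528": (0.998, 0.965, 0.772, 0.908),
    "QwQ-32B": (0.978, 0.952, 0.770, 0.894),
    "Qwen3-235B-A22B (thinking)": (0.954, 0.953, 0.758, 0.876),
    "Qwen3-30B-A3B (thinking)": (0.843, 0.921, 0.775, 0.845),
    "Qwen3-14B (thinking)": (0.856, 0.942, 0.741, 0.843),
    "gemma-3-27b-it": (0.879, 0.778, 0.789, 0.815),
    "Qwen3-32B (thinking)": (0.798, 0.913, 0.703, 0.804),
    "gemma-3-12b-it": (0.763, 0.789, 0.762, 0.763),
    "Phi-4-reasoning-plus": (0.731, 0.718, 0.667, 0.707),
}

results = {}
for model, scores in reported.items():
    per_dataset, macro = scores[:3], scores[3]
    recomputed = mean(per_dataset)  # unweighted mean over the three datasets
    in_range = min(per_dataset) <= macro <= max(per_dataset)
    results[model] = (recomputed, macro, in_range)
    print(f"{model}: recomputed={recomputed:.3f} reported={macro:.3f} in_range={in_range}")
```

For every model the reported Macro Average sits inside the range of its three dataset scores and within about 0.015 of the recomputed mean, which is consistent with an unweighted average read off a chart.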

### Interpretation
The data suggest that the ability to recover from an injected internal error varies by model and by dataset. Models generally score higher on SCLI5 and GSM8K-SC, while PRM800K-SC is the lowest-scoring dataset for eight of the nine models. The Macro Average summarizes the three datasets in a single number per model and sits between each model's best and worst dataset. Larger error bars for weaker models (e.g., Phi-4-reasoning-plus) indicate less precise estimates. Because PRM800K-SC is a dataset rather than a model, its consistently lower scores point to harder items rather than to an architectural limitation.

TECHNICAL ASSET FINGERPRINT

cc2c3245eba3bf63d12a7434
