EXPERT: gemini-2.0-flash VERSION 1

## Bar Chart: Mean Accuracy and Macro Average After Injection of Internal Error

### Overview
The image is a grouped bar chart comparing nine models after the injection of internal errors. For each model it shows the mean accuracy on three datasets (SCL15, GSM8K-SC, and PRM800K-SC) together with the macro average across them. Error bars indicate 95% confidence intervals.

### Components/Axes
*   **Title:** Mean accuracy and macro average (95% confidence intervals) after injection of internal error
*   **X-axis:** Models (DeepSeek-R1-0528, QwQ-32B, Qwen3-235B-A22B (thinking), Qwen3-30B-A3B (thinking), Qwen3-14B (thinking), gemma-3-27b-it, Qwen3-32B (thinking), gemma-3-12b-it, Phi-4-reasoning-plus)
*   **Y-axis:** Accuracy (scale from 0.0 to 1.0, incrementing by 0.2)
*   **Legend:** Located in the top-right corner.
    *   SCL15 (light blue)
    *   GSM8K-SC (light orange)
    *   PRM800K-SC (light green)
    *   Macro Average (red)

### Detailed Analysis

**Model Performance Breakdown:**

1.  **DeepSeek-R1-0528:**
    *   SCL15: ~0.98 with a small confidence interval.
    *   GSM8K-SC: ~0.94 with a small confidence interval.
    *   PRM800K-SC: ~0.78 with a moderate confidence interval.
    *   Macro Average: ~0.91.

2.  **QwQ-32B:**
    *   SCL15: ~0.94 with a small confidence interval.
    *   GSM8K-SC: ~0.93 with a small confidence interval.
    *   PRM800K-SC: ~0.77 with a moderate confidence interval.
    *   Macro Average: ~0.91.

3.  **Qwen3-235B-A22B (thinking):**
    *   SCL15: ~0.91 with a small confidence interval.
    *   GSM8K-SC: ~0.92 with a small confidence interval.
    *   PRM800K-SC: ~0.77 with a moderate confidence interval.
    *   Macro Average: ~0.89.

4.  **Qwen3-30B-A3B (thinking):**
    *   SCL15: ~0.85 with a small confidence interval.
    *   GSM8K-SC: ~0.90 with a small confidence interval.
    *   PRM800K-SC: ~0.76 with a moderate confidence interval.
    *   Macro Average: ~0.88.

5.  **Qwen3-14B (thinking):**
    *   SCL15: ~0.85 with a small confidence interval.
    *   GSM8K-SC: ~0.94 with a small confidence interval.
    *   PRM800K-SC: ~0.74 with a moderate confidence interval.
    *   Macro Average: ~0.84.

6.  **gemma-3-27b-it:**
    *   SCL15: ~0.83 with a small confidence interval.
    *   GSM8K-SC: ~0.82 with a small confidence interval.
    *   PRM800K-SC: ~0.78 with a moderate confidence interval.
    *   Macro Average: ~0.82.

7.  **Qwen3-32B (thinking):**
    *   SCL15: ~0.80 with a moderate confidence interval.
    *   GSM8K-SC: ~0.91 with a small confidence interval.
    *   PRM800K-SC: ~0.72 with a moderate confidence interval.
    *   Macro Average: ~0.81.

8.  **gemma-3-12b-it:**
    *   SCL15: ~0.78 with a moderate confidence interval.
    *   GSM8K-SC: ~0.78 with a small confidence interval.
    *   PRM800K-SC: ~0.76 with a moderate confidence interval.
    *   Macro Average: ~0.77.

9.  **Phi-4-reasoning-plus:**
    *   SCL15: ~0.75 with a moderate confidence interval.
    *   GSM8K-SC: ~0.73 with a small confidence interval.
    *   PRM800K-SC: ~0.71 with a moderate confidence interval.
    *   Macro Average: ~0.71.
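
The listed macro averages can be cross-checked against the per-dataset readings. The sketch below assumes the macro average is the unweighted mean of the three dataset accuracies; the chart itself does not state how it was computed, so this is an assumption. All numbers are the approximate readings listed above, and the names `readings`, `listed_macro`, and `macro_average` are illustrative only.

```python
from statistics import mean

# Approximate (SCL15, GSM8K-SC, PRM800K-SC) accuracies as read from the chart.
readings = {
    "DeepSeek-R1-0528": (0.98, 0.94, 0.78),
    "QwQ-32B": (0.94, 0.93, 0.77),
    "Qwen3-235B-A22B (thinking)": (0.91, 0.92, 0.77),
    "Qwen3-30B-A3B (thinking)": (0.85, 0.90, 0.76),
    "Qwen3-14B (thinking)": (0.85, 0.94, 0.74),
    "gemma-3-27b-it": (0.83, 0.82, 0.78),
    "Qwen3-32B (thinking)": (0.80, 0.91, 0.72),
    "gemma-3-12b-it": (0.78, 0.78, 0.76),
    "Phi-4-reasoning-plus": (0.75, 0.73, 0.71),
}

# Macro averages as listed in the breakdown above (also approximate readings).
listed_macro = {
    "DeepSeek-R1-0528": 0.91,
    "QwQ-32B": 0.91,
    "Qwen3-235B-A22B (thinking)": 0.89,
    "Qwen3-30B-A3B (thinking)": 0.88,
    "Qwen3-14B (thinking)": 0.84,
    "gemma-3-27b-it": 0.82,
    "Qwen3-32B (thinking)": 0.81,
    "gemma-3-12b-it": 0.77,
    "Phi-4-reasoning-plus": 0.71,
}


def macro_average(scores):
    """Unweighted mean of per-dataset accuracies (assumed definition)."""
    return mean(scores)


for model, scores in readings.items():
    recomputed = macro_average(scores)
    gap = listed_macro[model] - recomputed
    print(f"{model:28s} recomputed={recomputed:.2f} "
          f"listed={listed_macro[model]:.2f} gap={gap:+.2f}")
```

Because every bar height is read off visually, gaps of a few hundredths between the recomputed and listed values are to be expected and do not by themselves indicate a different averaging scheme.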

### Key Observations

*   **SCL15 accuracy is at or above ~0.80 for the first seven models** and drops to ~0.78 and ~0.75 for the last two (gemma-3-12b-it and Phi-4-reasoning-plus).
*   **GSM8K-SC accuracy is similarly high**, ranging from ~0.73 to ~0.94. It exceeds SCL15 for the four Qwen3 models and is equal to or slightly below SCL15 for the other five.
*   **PRM800K-SC has the lowest accuracy of the three datasets for every model** (~0.71 to ~0.78), and its wider confidence intervals indicate more uncertainty in those estimates.
*   **Macro Average falls between PRM800K-SC and the higher of SCL15 and GSM8K-SC** for every model; for Phi-4-reasoning-plus it coincides with PRM800K-SC (~0.71).
*   The models **DeepSeek-R1-0528 and QwQ-32B** have the highest macro averages (~0.91 each).
*   The models **gemma-3-12b-it and Phi-4-reasoning-plus** have the lowest macro averages (~0.77 and ~0.71).

### Interpretation

The bar chart shows how accurately each model performs after internal errors are injected. Every model scores higher on SCL15 and GSM8K-SC than on PRM800K-SC, which suggests that recovering from the injected errors is hardest on PRM800K-SC. The Macro Average summarizes each model's performance across the datasets. The confidence intervals indicate how precisely each accuracy is estimated; wider intervals mean greater uncertainty in the estimate, not necessarily less stable model behavior. By macro average, DeepSeek-R1-0528 and QwQ-32B appear to be the most resilient to internal errors, while gemma-3-12b-it and Phi-4-reasoning-plus are the least. This comparison is useful when selecting models that must maintain high accuracy despite internal inconsistencies or noise.
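
To illustrate why interval width reflects estimation uncertainty, the sketch below computes a 95% confidence interval for an accuracy using the normal (Wald) approximation. The chart does not say which interval method or sample sizes were used, so both the method and the counts here are assumptions, and `accuracy_ci` is a hypothetical helper name.

```python
import math


def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because accuracy cannot leave that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical sample sizes: the same 78% accuracy measured on 100 vs 1000 items.
for n in (100, 1000):
    low, high = accuracy_ci(round(0.78 * n), n)
    print(f"n={n:5d}  accuracy=0.78  95% CI=({low:.3f}, {high:.3f})")
```

With the accuracy held fixed, the interval narrows as the number of test items grows, so a wider error bar on one dataset may simply reflect fewer evaluated items.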