Image b42a25d729d2...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
## Bar Chart: Blind Spot summary across datasets - 95% Confidence Intervals

### Overview
The chart compares the self-correction blind spot of nine AI models (e.g., DeepSeek-R1-0328, QwQ-32B, Gemma-3-27b-it) across five series drawn from three datasets: SCL15, plus GSM8K-SC and PRM800K-SC, each measured before and after the commit answer. Values represent deviations from perfect self-correction, with error bars showing 95% confidence intervals. Positive values indicate overestimation of blind spots, while negative values suggest underestimation.

### Components/Axes
- **X-axis**: Models (DeepSeek-R1-0328, QwQ-32B, Qwen-3-235B-A22B, Qwen-3-30B-A3B, Qwen-3-14B, Gemma-3-27b-it, Qwen-3-32B, Gemma-3-12b-it, Phi-4-reasoning-plus)
- **Y-axis**: Self-Correction Blind Spot (range: -0.4 to 0.4)
- **Legend**: 
  - Blue: SCL15
  - Orange: GSM8K-SC (Before commit answer)
  - Pink: GSM8K-SC (After commit answer)
  - Green: PRM800K-SC (Before commit answer)
  - Light Green: PRM800K-SC (After commit answer)

### Detailed Analysis
1. **DeepSeek-R1-0328**:
   - SCL15: -0.05 (±0.08)
   - GSM8K-SC (Before): -0.02 (±0.05)
   - GSM8K-SC (After): -0.01 (±0.04)
   - PRM800K-SC (Before): 0.06 (±0.07)
   - PRM800K-SC (After): -0.03 (±0.06)

2. **QwQ-32B**:
   - SCL15: -0.01 (±0.04)
   - GSM8K-SC (Before): -0.03 (±0.06)
   - GSM8K-SC (After): -0.02 (±0.05)
   - PRM800K-SC (Before): -0.20 (±0.12)
   - PRM800K-SC (After): -0.10 (±0.10)

3. **Qwen-3-235B-A22B (thinking)**:
   - SCL15: -0.01 (±0.03)
   - GSM8K-SC (Before): -0.02 (±0.04)
   - GSM8K-SC (After): -0.01 (±0.03)
   - PRM800K-SC (Before): -0.15 (±0.09)
   - PRM800K-SC (After): -0.05 (±0.07)

4. **Qwen-3-30B-A3B (thinking)**:
   - SCL15: 0.13 (±0.09)
   - GSM8K-SC (Before): -0.03 (±0.05)
   - GSM8K-SC (After): -0.02 (±0.04)
   - PRM800K-SC (Before): -0.05 (±0.06)
   - PRM800K-SC (After): -0.02 (±0.05)

5. **Qwen-3-14B (thinking)**:
   - SCL15: 0.12 (±0.08)
   - GSM8K-SC (Before): -0.02 (±0.04)
   - GSM8K-SC (After): -0.01 (±0.03)
   - PRM800K-SC (Before): -0.03 (±0.05)
   - PRM800K-SC (After): -0.01 (±0.04)

6. **Gemma-3-27b-it**:
   - SCL15: -0.05 (±0.07)
   - GSM8K-SC (Before): 0.18 (±0.09)
   - GSM8K-SC (After): 0.24 (±0.10)
   - PRM800K-SC (Before): 0.06 (±0.07)
   - PRM800K-SC (After): 0.13 (±0.08)

7. **Qwen-3-32B (thinking)**:
   - SCL15: 0.19 (±0.08)
   - GSM8K-SC (Before): 0.03 (±0.05)
   - GSM8K-SC (After): 0.14 (±0.07)
   - PRM800K-SC (Before): -0.02 (±0.04)
   - PRM800K-SC (After): -0.01 (±0.03)

8. **Gemma-3-12b-it**:
   - SCL15: 0.15 (±0.07)
   - GSM8K-SC (Before): 0.16 (±0.08)
   - GSM8K-SC (After): 0.22 (±0.09)
   - PRM800K-SC (Before): 0.07 (±0.06)
   - PRM800K-SC (After): 0.12 (±0.07)

9. **Phi-4-reasoning-plus**:
   - SCL15: 0.04 (±0.06)
   - GSM8K-SC (Before): -0.08 (±0.07)
   - GSM8K-SC (After): 0.15 (±0.08)
   - PRM800K-SC (Before): -0.15 (±0.09)
   - PRM800K-SC (After): -0.20 (±0.10)
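The point estimates listed above can be checked mechanically. The following is a minimal sketch, assuming the values are transcribed exactly as listed (they are read off a chart, so they are approximate); the names `SERIES`, `VALUES`, and `extremes` are illustrative and not part of the source.

```python
# Point estimates transcribed from the Detailed Analysis list (approximate,
# since they were read off a bar chart). Column order follows SERIES.
SERIES = [
    "SCL15",
    "GSM8K-SC before",
    "GSM8K-SC after",
    "PRM800K-SC before",
    "PRM800K-SC after",
]

VALUES = {
    "DeepSeek-R1-0328":     [-0.05, -0.02, -0.01,  0.06, -0.03],
    "QwQ-32B":              [-0.01, -0.03, -0.02, -0.20, -0.10],
    "Qwen-3-235B-A22B":     [-0.01, -0.02, -0.01, -0.15, -0.05],
    "Qwen-3-30B-A3B":       [ 0.13, -0.03, -0.02, -0.05, -0.02],
    "Qwen-3-14B":           [ 0.12, -0.02, -0.01, -0.03, -0.01],
    "Gemma-3-27b-it":       [-0.05,  0.18,  0.24,  0.06,  0.13],
    "Qwen-3-32B":           [ 0.19,  0.03,  0.14, -0.02, -0.01],
    "Gemma-3-12b-it":       [ 0.15,  0.16,  0.22,  0.07,  0.12],
    "Phi-4-reasoning-plus": [ 0.04, -0.08,  0.15, -0.15, -0.20],
}


def extremes(values, series):
    """Map each series to ((max_model, max_value), (min_model, min_value)).

    Ties are resolved in favor of the model listed first.
    """
    result = {}
    for i, name in enumerate(series):
        column = [(model, row[i]) for model, row in values.items()]
        highest = max(column, key=lambda pair: pair[1])
        lowest = min(column, key=lambda pair: pair[1])
        result[name] = (highest, lowest)
    return result


if __name__ == "__main__":
    for name, (highest, lowest) in extremes(VALUES, SERIES).items():
        print(f"{name}: max {highest}, min {lowest}")
```

Keeping the values in one table makes it easy to confirm which model and series hold the highest and lowest bars before drawing conclusions from them.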

### Key Observations
- **Positive Trends**: 
  - GSM8K-SC (After commit answer) is higher than GSM8K-SC (Before commit answer) for all nine models in the listed values, although the gap is only 0.01 for the first five models.
  - SCL15 is positive for five of the nine models and peaks at 0.19 (Qwen-3-32B), suggesting overestimation of blind spots. The largest value in the chart overall is 0.24, for Gemma-3-27b-it on GSM8K-SC (After commit answer).
- **Negative Trends**:
  - PRM800K-SC (Before commit answer) is negative for six of the nine models (e.g., -0.20 for QwQ-32B and -0.15 for Phi-4-reasoning-plus), indicating underestimation of blind spots.
  - Qwen-3-14B (thinking) has the narrowest confidence intervals on average (±0.03–0.08), suggesting higher measurement precision.
- **Anomalies**:
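  - Phi-4-reasoning-plus exhibits the largest Before-to-After increase in GSM8K-SC (-0.08 to 0.15, a change of +0.23). Gemma-3-27b-it rises by only +0.06 (0.18 to 0.24) but reaches the highest value.
  - Phi-4-reasoning-plus has the most negative PRM800K-SC (After commit answer) value (-0.20). It and DeepSeek-R1-0328 (0.06 to -0.03) are the only models whose PRM800K-SC value falls from Before to After.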
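The Before-to-After comparisons in these observations reduce to simple subtraction. A minimal sketch follows, assuming the point estimates from the Detailed Analysis list are accurate readings of the chart; `PAIRS` and `shifts` are hypothetical names introduced for illustration.

```python
# model: (GSM8K-SC before, GSM8K-SC after, PRM800K-SC before, PRM800K-SC after)
# Values are transcribed from the Detailed Analysis list and are approximate.
PAIRS = {
    "DeepSeek-R1-0328":     (-0.02, -0.01,  0.06, -0.03),
    "QwQ-32B":              (-0.03, -0.02, -0.20, -0.10),
    "Qwen-3-235B-A22B":     (-0.02, -0.01, -0.15, -0.05),
    "Qwen-3-30B-A3B":       (-0.03, -0.02, -0.05, -0.02),
    "Qwen-3-14B":           (-0.02, -0.01, -0.03, -0.01),
    "Gemma-3-27b-it":       ( 0.18,  0.24,  0.06,  0.13),
    "Qwen-3-32B":           ( 0.03,  0.14, -0.02, -0.01),
    "Gemma-3-12b-it":       ( 0.16,  0.22,  0.07,  0.12),
    "Phi-4-reasoning-plus": (-0.08,  0.15, -0.15, -0.20),
}


def shifts(pairs):
    """Return {model: (gsm8k_shift, prm800k_shift)}, each computed as after - before.

    Results are rounded to two decimals to hide floating-point noise.
    """
    return {
        model: (round(g_after - g_before, 2), round(p_after - p_before, 2))
        for model, (g_before, g_after, p_before, p_after) in pairs.items()
    }


if __name__ == "__main__":
    for model, (gsm, prm) in shifts(PAIRS).items():
        print(f"{model}: GSM8K-SC {gsm:+.2f}, PRM800K-SC {prm:+.2f}")
```

Sorting the result by either shift shows at a glance which models move most between the two conditions and in which direction.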

### Interpretation
The listed values suggest that the "After commit answer" series generally sit higher than their "Before commit answer" counterparts: for all nine models on GSM8K-SC and for seven of nine on PRM800K-SC. SCL15 shows positive deviations most often (five of nine models). The negative values for PRM800K-SC (Before commit answer) in six models imply systematic underestimation of blind spots in this configuration. Confidence interval widths vary considerably, with wider intervals (e.g., ±0.12 for QwQ-32B on PRM800K-SC Before) indicating less reliable estimates. Whether model size (e.g., Qwen-3-30B-A3B vs. Qwen-3-14B) relates to these trends warrants further investigation.
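One way to read the confidence intervals is to ask whether an interval excludes zero. The sketch below assumes the ± figures in the Detailed Analysis list are symmetric half-widths of the 95% intervals, which is how they are transcribed here but is not confirmed by the chart itself; `ci_bounds` and `excludes_zero` are illustrative helper names.

```python
def ci_bounds(value, half_width):
    """Return (lower, upper) for a symmetric interval value ± half_width."""
    return (value - half_width, value + half_width)


def excludes_zero(value, half_width):
    """True when the whole interval lies strictly above or strictly below zero."""
    lower, upper = ci_bounds(value, half_width)
    return lower > 0 or upper < 0


if __name__ == "__main__":
    # A few entries quoted from the Detailed Analysis list (approximate readings).
    examples = {
        "QwQ-32B / PRM800K-SC (Before)": (-0.20, 0.12),
        "Gemma-3-27b-it / GSM8K-SC (After)": (0.24, 0.10),
        "DeepSeek-R1-0328 / SCL15": (-0.05, 0.08),
    }
    for label, (value, half_width) in examples.items():
        verdict = "excludes zero" if excludes_zero(value, half_width) else "includes zero"
        print(f"{label}: {value:+.2f} ± {half_width:.2f} -> {verdict}")
```

Under this reading, a wide interval such as ±0.12 can still exclude zero when the point estimate is far enough from it, while a small estimate like -0.05 (±0.08) cannot be distinguished from zero.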