## Bar Chart: Blind Spot summary across datasets - 95% Confidence Intervals
### Overview
The chart compares the self-correction blind spot of multiple AI models (e.g., DeepSeek-R1-0328, QwQ-32B, Gemma-3-27b-it) across five dataset configurations: SCL15, plus GSM8K-SC and PRM800K-SC, each measured before and after the commit answer. Values represent deviations from perfect self-correction, with error bars showing 95% confidence intervals. Positive values indicate overestimation of blind spots, while negative values suggest underestimation.
### Components/Axes
- **X-axis**: Models (DeepSeek-R1-0328, QwQ-32B, Qwen-3-235B-A22B, Qwen-3-30B-A3B, Qwen-3-14B, Gemma-3-27b-it, Qwen-3-32B, Gemma-3-12b-it, Phi-4-reasoning-plus)
- **Y-axis**: Self-Correction Blind Spot (range: -0.4 to 0.4)
- **Legend**:
- Blue: SCL15
- Orange: GSM8K-SC (Before commit answer)
- Pink: GSM8K-SC (After commit answer)
- Green: PRM800K-SC (Before commit answer)
- Light Green: PRM800K-SC (After commit answer)
### Detailed Analysis
1. **DeepSeek-R1-0328**:
- SCL15: -0.05 (±0.08)
- GSM8K-SC (Before): -0.02 (±0.05)
- GSM8K-SC (After): -0.01 (±0.04)
- PRM800K-SC (Before): 0.06 (±0.07)
- PRM800K-SC (After): -0.03 (±0.06)
2. **QwQ-32B**:
- SCL15: -0.01 (±0.04)
- GSM8K-SC (Before): -0.03 (±0.06)
- GSM8K-SC (After): -0.02 (±0.05)
- PRM800K-SC (Before): -0.20 (±0.12)
- PRM800K-SC (After): -0.10 (±0.10)
3. **Qwen-3-235B-A22B (thinking)**:
- SCL15: -0.01 (±0.03)
- GSM8K-SC (Before): -0.02 (±0.04)
- GSM8K-SC (After): -0.01 (±0.03)
- PRM800K-SC (Before): -0.15 (±0.09)
- PRM800K-SC (After): -0.05 (±0.07)
4. **Qwen-3-30B-A3B (thinking)**:
- SCL15: 0.13 (±0.09)
- GSM8K-SC (Before): -0.03 (±0.05)
- GSM8K-SC (After): -0.02 (±0.04)
- PRM800K-SC (Before): -0.05 (±0.06)
- PRM800K-SC (After): -0.02 (±0.05)
5. **Qwen-3-14B (thinking)**:
- SCL15: 0.12 (±0.08)
- GSM8K-SC (Before): -0.02 (±0.04)
- GSM8K-SC (After): -0.01 (±0.03)
- PRM800K-SC (Before): -0.03 (±0.05)
- PRM800K-SC (After): -0.01 (±0.04)
6. **Gemma-3-27b-it**:
- SCL15: -0.05 (±0.07)
- GSM8K-SC (Before): 0.18 (±0.09)
- GSM8K-SC (After): 0.24 (±0.10)
- PRM800K-SC (Before): 0.06 (±0.07)
- PRM800K-SC (After): 0.13 (±0.08)
7. **Qwen-3-32B (thinking)**:
- SCL15: 0.19 (±0.08)
- GSM8K-SC (Before): 0.03 (±0.05)
- GSM8K-SC (After): 0.14 (±0.07)
- PRM800K-SC (Before): -0.02 (±0.04)
- PRM800K-SC (After): -0.01 (±0.03)
8. **Gemma-3-12b-it**:
- SCL15: 0.15 (±0.07)
- GSM8K-SC (Before): 0.16 (±0.08)
- GSM8K-SC (After): 0.22 (±0.09)
- PRM800K-SC (Before): 0.07 (±0.06)
- PRM800K-SC (After): 0.12 (±0.07)
9. **Phi-4-reasoning-plus**:
- SCL15: 0.04 (±0.06)
- GSM8K-SC (Before): -0.08 (±0.07)
- GSM8K-SC (After): 0.15 (±0.08)
- PRM800K-SC (Before): -0.15 (±0.09)
- PRM800K-SC (After): -0.20 (±0.10)
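Each reading in the list above pairs a bar height with the half-width of its 95% confidence interval. The following is a minimal Python sketch of how such a reading translates into interval bounds, assuming the error bars are symmetric, as the ± notation implies. The helper names `ci_bounds` and `excludes_zero` are illustrative and do not come from the chart's source.

```python
def ci_bounds(value, half_width):
    """Return (lower, upper) bounds of a symmetric interval, rounded to 2 decimals."""
    return (round(value - half_width, 2), round(value + half_width, 2))


def excludes_zero(value, half_width):
    """True when the whole interval lies on one side of zero."""
    lower, upper = ci_bounds(value, half_width)
    return lower > 0 or upper < 0


# Three readings copied from the list above: (bar height, CI half-width).
examples = {
    "QwQ-32B / PRM800K-SC (Before)": (-0.20, 0.12),
    "DeepSeek-R1-0328 / SCL15": (-0.05, 0.08),
    "Gemma-3-27b-it / GSM8K-SC (After)": (0.24, 0.10),
}

for label, (value, half_width) in examples.items():
    print(label, ci_bounds(value, half_width), excludes_zero(value, half_width))
```

Under these assumptions, the QwQ-32B and Gemma-3-27b-it intervals exclude zero, while the DeepSeek-R1-0328 SCL15 interval spans it.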
### Key Observations
- **Positive Trends**:
- GSM8K-SC (After commit answer) is higher than the "Before commit answer" value for all nine models, by 0.01 for the five models with values near zero and by as much as 0.23 for Phi-4-reasoning-plus.
- SCL15 reaches 0.19 for Qwen-3-32B, its largest value, suggesting significant overestimation of blind spots; the largest value on the chart overall is 0.24 (GSM8K-SC After commit answer, Gemma-3-27b-it).
- **Negative Trends**:
- PRM800K-SC (Before commit answer) is negative for six of the nine models (e.g., -0.20 for QwQ-32B and -0.15 for Phi-4-reasoning-plus), indicating underestimation of blind spots.
- Qwen-3-14B (thinking) has the narrowest confidence intervals on average (±0.03–0.08), suggesting higher measurement precision.
- **Anomalies**:
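- Phi-4-reasoning-plus exhibits the largest increase in GSM8K-SC from Before to After commit answer (-0.08 to 0.15, a shift of +0.23); Gemma-3-27b-it has the highest After value (0.24) but a smaller shift (+0.06).
- Phi-4-reasoning-plus has the most negative PRM800K-SC (After commit answer) value (-0.20), while DeepSeek-R1-0328 shows the largest negative shift (0.06 to -0.03). These are the only two models whose PRM800K-SC value drops after the commit answer.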
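The Before/After comparisons above can be recomputed directly from the listed readings. The Python sketch below copies the GSM8K-SC and PRM800K-SC bar heights from the Detailed Analysis and computes the After minus Before shift per model. The names `READINGS` and `shifts` and the dictionary layout are assumptions made for illustration, and the confidence intervals are ignored.

```python
# (before, after) bar heights copied from the Detailed Analysis list.
READINGS = {
    "DeepSeek-R1-0328": {"GSM8K-SC": (-0.02, -0.01), "PRM800K-SC": (0.06, -0.03)},
    "QwQ-32B": {"GSM8K-SC": (-0.03, -0.02), "PRM800K-SC": (-0.20, -0.10)},
    "Qwen-3-235B-A22B": {"GSM8K-SC": (-0.02, -0.01), "PRM800K-SC": (-0.15, -0.05)},
    "Qwen-3-30B-A3B": {"GSM8K-SC": (-0.03, -0.02), "PRM800K-SC": (-0.05, -0.02)},
    "Qwen-3-14B": {"GSM8K-SC": (-0.02, -0.01), "PRM800K-SC": (-0.03, -0.01)},
    "Gemma-3-27b-it": {"GSM8K-SC": (0.18, 0.24), "PRM800K-SC": (0.06, 0.13)},
    "Qwen-3-32B": {"GSM8K-SC": (0.03, 0.14), "PRM800K-SC": (-0.02, -0.01)},
    "Gemma-3-12b-it": {"GSM8K-SC": (0.16, 0.22), "PRM800K-SC": (0.07, 0.12)},
    "Phi-4-reasoning-plus": {"GSM8K-SC": (-0.08, 0.15), "PRM800K-SC": (-0.15, -0.20)},
}


def shifts(dataset):
    """Map each model to its After minus Before shift for one dataset."""
    result = {}
    for model, series in READINGS.items():
        before, after = series[dataset]
        # Round to 2 decimals to hide floating-point noise in the difference.
        result[model] = round(after - before, 2)
    return result


gsm8k = shifts("GSM8K-SC")
prm800k = shifts("PRM800K-SC")

# Model with the largest upward shift on GSM8K-SC.
largest_gsm8k_rise = max(gsm8k, key=gsm8k.get)
# Model with the largest downward shift on PRM800K-SC.
largest_prm800k_drop = min(prm800k, key=prm800k.get)

print(largest_gsm8k_rise, gsm8k[largest_gsm8k_rise])
print(largest_prm800k_drop, prm800k[largest_prm800k_drop])
```

With the readings as listed, this reports Phi-4-reasoning-plus as the largest GSM8K-SC rise (0.23) and DeepSeek-R1-0328 as the largest PRM800K-SC drop (-0.09).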
### Interpretation
The data suggests that the "After commit answer" variants generally yield higher values than their "Before commit answer" counterparts: GSM8K-SC rises for all nine models and PRM800K-SC for seven, with DeepSeek-R1-0328 and Phi-4-reasoning-plus as the exceptions. SCL15 shows the most persistent overestimation, with positive values for five of the nine models. The negative values for PRM800K-SC (Before commit answer) in six of the nine models imply systematic underestimation of blind spots in this configuration. Confidence interval widths vary considerably, with wider intervals (e.g., ±0.12 for QwQ-32B PRM800K-SC Before) indicating lower reliability in those estimates. Any relationship between model size (e.g., Qwen-3-30B-A3B vs. Qwen-3-14B) and these trends warrants further investigation.
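The point about interval width can be made concrete by averaging the 95% CI half-widths listed for each model. This is a rough Python sketch, assuming a simple unweighted mean over the five series is an acceptable summary of precision; the name `HALF_WIDTHS` and the ranking approach are illustrative choices, not part of the chart.

```python
from statistics import mean

# 95% CI half-widths copied from the Detailed Analysis list, in legend order:
# SCL15, GSM8K-SC Before, GSM8K-SC After, PRM800K-SC Before, PRM800K-SC After.
HALF_WIDTHS = {
    "DeepSeek-R1-0328": [0.08, 0.05, 0.04, 0.07, 0.06],
    "QwQ-32B": [0.04, 0.06, 0.05, 0.12, 0.10],
    "Qwen-3-235B-A22B": [0.03, 0.04, 0.03, 0.09, 0.07],
    "Qwen-3-30B-A3B": [0.09, 0.05, 0.04, 0.06, 0.05],
    "Qwen-3-14B": [0.08, 0.04, 0.03, 0.05, 0.04],
    "Gemma-3-27b-it": [0.07, 0.09, 0.10, 0.07, 0.08],
    "Qwen-3-32B": [0.08, 0.05, 0.07, 0.04, 0.03],
    "Gemma-3-12b-it": [0.07, 0.08, 0.09, 0.06, 0.07],
    "Phi-4-reasoning-plus": [0.06, 0.07, 0.08, 0.09, 0.10],
}

# Unweighted mean half-width per model, rounded to 3 decimals.
mean_half_width = {
    model: round(mean(widths), 3) for model, widths in HALF_WIDTHS.items()
}

# Models ordered from narrowest to widest average interval.
ranked = sorted(mean_half_width, key=mean_half_width.get)

for model in ranked:
    print(f"{model}: ±{mean_half_width[model]:.3f}")
```

On the listed readings, Qwen-3-14B comes out narrowest and Gemma-3-27b-it widest on average. A single wide interval, such as ±0.12 for QwQ-32B, is diluted by this kind of averaging, so the per-series values remain the better guide to individual estimates.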