## Bar Chart: Mean Accuracy and Macro Average (95% Confidence Intervals) after Injection of Internal Error
### Overview
This bar chart displays the mean accuracy on three data series and their macro average, with 95% confidence intervals, for nine models after the injection of internal error. The x-axis lists the models, and the y-axis shows accuracy. Four bars are drawn per model: SCLIS (light blue), GSM8K-SC (light green), PRM800K-SC (light orange), and Macro Average (red). Error bars indicate the 95% confidence intervals.
### Components/Axes
* **Title:** "Mean accuracy and macro average (95% confidence intervals) after injection of internal error" (positioned at the top-center)
* **X-axis Label:** "Models" (positioned at the bottom-center)
* **Models (Categories):** Deepseek-rl-7.5B, QWQ-32B, Owen3-33B-v2.28 (thinking), Owen3-30B-v3.3B (thinking), Owen3-34B (thinking), gemma-3.2B-it, gemma-3.2B (thinking), gemma-3.12B-it, Phi-3-reasoning-plus
* **Y-axis Label:** "Accuracy" (positioned on the left-center)
* **Y-axis Scale:** 0.0 to 1.0, with increments of 0.2.
* **Legend:** Located at the top-right corner.
    * **SCLIS:** Light Blue
    * **GSM8K-SC:** Light Green
    * **PRM800K-SC:** Light Orange
    * **Macro Average:** Red
### Detailed Analysis
The chart consists of nine models, each with four bars representing the accuracy of SCLIS, GSM8K-SC, PRM800K-SC, and the Macro Average. The error bars represent the 95% confidence interval for each measurement.
* **Deepseek-rl-7.5B:**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.908, with an error bar ranging from approximately 0.88 to 0.93.
    * PRM800K-SC: Approximately 0.894, with an error bar ranging from approximately 0.86 to 0.92.
    * Macro Average: Approximately 0.933, with an error bar ranging from approximately 0.90 to 0.96.
* **QWQ-32B:**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.884, with an error bar ranging from approximately 0.85 to 0.91.
    * PRM800K-SC: Approximately 0.864, with an error bar ranging from approximately 0.83 to 0.89.
    * Macro Average: Approximately 0.914, with an error bar ranging from approximately 0.88 to 0.94.
* **Owen3-33B-v2.28 (thinking):**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.876, with an error bar ranging from approximately 0.84 to 0.90.
    * PRM800K-SC: Approximately 0.845, with an error bar ranging from approximately 0.81 to 0.87.
    * Macro Average: Approximately 0.906, with an error bar ranging from approximately 0.87 to 0.93.
* **Owen3-30B-v3.3B (thinking):**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.815, with an error bar ranging from approximately 0.78 to 0.84.
    * PRM800K-SC: Approximately 0.784, with an error bar ranging from approximately 0.75 to 0.81.
    * Macro Average: Approximately 0.851, with an error bar ranging from approximately 0.81 to 0.88.
* **Owen3-34B (thinking):**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.843, with an error bar ranging from approximately 0.81 to 0.87.
    * PRM800K-SC: Approximately 0.804, with an error bar ranging from approximately 0.77 to 0.83.
    * Macro Average: Approximately 0.882, with an error bar ranging from approximately 0.84 to 0.91.
* **gemma-3.2B-it:**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.815, with an error bar ranging from approximately 0.78 to 0.84.
    * PRM800K-SC: Approximately 0.763, with an error bar ranging from approximately 0.73 to 0.79.
    * Macro Average: Approximately 0.858, with an error bar ranging from approximately 0.82 to 0.89.
* **gemma-3.2B (thinking):**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.804, with an error bar ranging from approximately 0.77 to 0.83.
    * PRM800K-SC: Approximately 0.763, with an error bar ranging from approximately 0.73 to 0.79.
    * Macro Average: Approximately 0.858, with an error bar ranging from approximately 0.82 to 0.89.
* **gemma-3.12B-it:**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.804, with an error bar ranging from approximately 0.77 to 0.83.
    * PRM800K-SC: Approximately 0.707, with an error bar ranging from approximately 0.67 to 0.74.
    * Macro Average: Approximately 0.839, with an error bar ranging from approximately 0.80 to 0.87.
* **Phi-3-reasoning-plus:**
    * SCLIS: Approximately 0.998, with a very small error bar.
    * GSM8K-SC: Approximately 0.67, with an error bar ranging from approximately 0.64 to 0.70.
    * PRM800K-SC: Approximately 0.643, with an error bar ranging from approximately 0.61 to 0.67.
    * Macro Average: Approximately 0.757, with an error bar ranging from approximately 0.72 to 0.79.
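
The chart does not state how the Macro Average is computed. The sketch below assumes it is the unweighted mean of the three per-series accuracies (SCLIS, GSM8K-SC, PRM800K-SC) and compares that recomputed value with the reported bar height for each model. The function names `macro_average` and `macro_gaps` are illustrative, and the inputs are the approximate readings listed above, not exact values.

```python
from statistics import mean

# Approximate bar heights read from the chart, per model:
# (SCLIS, GSM8K-SC, PRM800K-SC, reported Macro Average)
readings = {
    "Deepseek-rl-7.5B": (0.998, 0.908, 0.894, 0.933),
    "QWQ-32B": (0.998, 0.884, 0.864, 0.914),
    "Owen3-33B-v2.28 (thinking)": (0.998, 0.876, 0.845, 0.906),
    "Owen3-30B-v3.3B (thinking)": (0.998, 0.815, 0.784, 0.851),
    "Owen3-34B (thinking)": (0.998, 0.843, 0.804, 0.882),
    "gemma-3.2B-it": (0.998, 0.815, 0.763, 0.858),
    "gemma-3.2B (thinking)": (0.998, 0.804, 0.763, 0.858),
    "gemma-3.12B-it": (0.998, 0.804, 0.707, 0.839),
    "Phi-3-reasoning-plus": (0.998, 0.67, 0.643, 0.757),
}


def macro_average(per_series):
    """Unweighted mean over series (assumed definition of the macro average)."""
    return mean(per_series)


def macro_gaps(data):
    """Recomputed macro average minus the reported one, for each model."""
    return {
        model: macro_average(values[:3]) - values[3]
        for model, values in data.items()
    }


for model, gap in macro_gaps(readings).items():
    print(f"{model}: recomputed - reported = {gap:+.3f}")
```

Under this assumption the recomputed values stay within about 0.02 of the reported bars for all nine models, which is consistent with an unweighted mean given that every number here is an approximate reading.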
### Key Observations
* SCLIS consistently exhibits the highest accuracy across all models, nearly reaching 1.0.
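* The Macro Average lies between the SCLIS accuracy and the GSM8K-SC accuracy for every model, so it is always higher than both GSM8K-SC and PRM800K-SC.
* The accuracy of GSM8K-SC and PRM800K-SC broadly decreases from left to right, from Deepseek-rl-7.5B to Phi-3-reasoning-plus. The decrease is not strictly monotonic: Owen3-34B (thinking) scores higher than Owen3-30B-v3.3B (thinking) on both series.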
* Phi-3-reasoning-plus has the lowest accuracy among all models for GSM8K-SC and PRM800K-SC.
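
The left-to-right trend in GSM8K-SC and PRM800K-SC can be checked directly against the approximate readings. The sketch below lists the positions where accuracy rises from one model to the next in x-axis order; the helper name `rises` is illustrative, and the values are the approximate bar heights from the Detailed Analysis.

```python
# Approximate accuracies in x-axis order, from Deepseek-rl-7.5B (index 0)
# to Phi-3-reasoning-plus (index 8).
gsm8k_sc = [0.908, 0.884, 0.876, 0.815, 0.843, 0.815, 0.804, 0.804, 0.67]
prm800k_sc = [0.894, 0.864, 0.845, 0.784, 0.804, 0.763, 0.763, 0.707, 0.643]


def rises(values):
    """Return index pairs (i, i + 1) where the value increases between neighbors."""
    return [
        (i, i + 1)
        for i in range(len(values) - 1)
        if values[i + 1] > values[i]
    ]


print("GSM8K-SC rises:", rises(gsm8k_sc))
print("PRM800K-SC rises:", rises(prm800k_sc))
```

An empty list would mean the series never increases from left to right. With these readings, each series has a single rise, between Owen3-30B-v3.3B (thinking) and Owen3-34B (thinking).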
### Interpretation
The data suggest that all nine models maintain near-perfect accuracy on SCLIS after the injection of internal error, so this series does little to separate them. Accuracy on GSM8K-SC and PRM800K-SC declines across the x-axis ordering, which indicates that the models toward the right are more susceptible to internal errors on these two series. The Macro Average provides a single summary of performance, but it is pulled upward by the uniformly high SCLIS accuracy and therefore understates the differences between models. The drop for Phi-3-reasoning-plus, to approximately 0.67 on GSM8K-SC and 0.643 on PRM800K-SC, may point to a limitation in this model's architecture or training data when dealing with internal errors, although the chart alone cannot establish the cause. The uniformly high SCLIS results could reflect properties of that series rather than of any particular model. The error bars give the 95% confidence intervals, and they should be considered when comparing models: several neighboring models have overlapping intervals, so small differences between them may not be meaningful.
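
One rough way to take the error bars into account is to check whether two 95% confidence intervals overlap. The sketch below does this for a few GSM8K-SC intervals, using the approximate endpoints read from the chart; `intervals_overlap` is an illustrative helper. Interval overlap is only a visual heuristic, not a formal significance test: non-overlapping intervals suggest a real difference, while overlapping ones do not prove that no difference exists.

```python
def intervals_overlap(a, b):
    """Return True if closed intervals a = (low, high) and b = (low, high) intersect."""
    (a_low, a_high), (b_low, b_high) = a, b
    return a_low <= b_high and b_low <= a_high


# Approximate 95% confidence intervals for GSM8K-SC, read from the chart.
deepseek_gsm8k = (0.88, 0.93)
qwq_gsm8k = (0.85, 0.91)
phi_gsm8k = (0.64, 0.70)

print(intervals_overlap(deepseek_gsm8k, qwq_gsm8k))  # neighboring models
print(intervals_overlap(deepseek_gsm8k, phi_gsm8k))  # first vs. last model
```

With these readings, the intervals for Deepseek-rl-7.5B and QWQ-32B overlap, while those for Deepseek-rl-7.5B and Phi-3-reasoning-plus do not.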