## Bar Chart: Model Accuracy Comparison
### Overview
This bar chart compares the accuracy of an "Initial Model", a "Trained Model", and a model with "G.T. Supervision" across different tasks: "ARC-C (step-level)", "ARC-C (instance-level)", "MATH (step-level)", "MATH (instance-level)", and "MATH (step-level w/ G.T.)". Accuracy is measured on the y-axis, ranging from 0 to 100.
### Components/Axes
* **X-axis:** Task Name (ARC-C (step-level), ARC-C (instance-level), MATH (step-level), MATH (instance-level), MATH (step-level w/ G.T.))
* **Y-axis:** Accuracy (ranging from 0 to 100)
* **Legend** (positioned in the top-right corner of the chart):
  * "Initial Model": orange
  * "Trained Model": blue
  * "G.T. Supervision": green
### Detailed Analysis
The chart contains five groups of bars, one group per task. Each group has three bars, one per series in the legend.
**1. ARC-C (step-level):**
* Initial Model: Approximately 60.6% accuracy.
* Trained Model: Approximately 76.4% accuracy. The Trained Model bar is about 15.8 points higher than the Initial Model bar.
* G.T. Supervision: Approximately 94.4% accuracy. The G.T. Supervision bar is the highest in this group.
**2. ARC-C (instance-level):**
* Initial Model: Approximately 60.6% accuracy.
* Trained Model: Approximately 75.3% accuracy. The Trained Model bar is about 14.7 points higher than the Initial Model bar.
* G.T. Supervision: Approximately 89.2% accuracy. The G.T. Supervision bar is the highest in this group.
**3. MATH (step-level):**
* Initial Model: Approximately 29.0% accuracy.
* Trained Model: Approximately 32.2% accuracy. The Trained Model bar is only about 3.2 points higher than the Initial Model bar.
* G.T. Supervision: Approximately 81.2% accuracy. The G.T. Supervision bar is the highest in this group.
**4. MATH (instance-level):**
* Initial Model: Approximately 29.0% accuracy.
* Trained Model: Approximately 32.9% accuracy. The Trained Model bar is only about 3.9 points higher than the Initial Model bar.
* G.T. Supervision: Approximately 87.9% accuracy. The G.T. Supervision bar is the highest in this group.
**5. MATH (step-level w/ G.T.):**
* Initial Model: Approximately 29.0% accuracy.
* Trained Model: Approximately 35.4% accuracy. The Trained Model bar is about 6.4 points higher than the Initial Model bar.
* G.T. Supervision: Approximately 97.9% accuracy. The G.T. Supervision bar is the highest in this group.
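
The grouped layout described above can be approximated with a short plotting sketch. This is only an illustration built from the approximate values listed in this section, not the code that produced the original figure. The bar width, figure size, output file name, and the helper names `bar_positions` and `plot_accuracy` are all assumptions.

```python
# Approximate accuracies (%) as listed in the Detailed Analysis above.
TASKS = [
    "ARC-C (step-level)",
    "ARC-C (instance-level)",
    "MATH (step-level)",
    "MATH (instance-level)",
    "MATH (step-level w/ G.T.)",
]
SERIES = {
    "Initial Model": [60.6, 60.6, 29.0, 29.0, 29.0],
    "Trained Model": [76.4, 75.3, 32.2, 32.9, 35.4],
    "G.T. Supervision": [94.4, 89.2, 81.2, 87.9, 97.9],
}
COLORS = {
    "Initial Model": "orange",
    "Trained Model": "blue",
    "G.T. Supervision": "green",
}
BAR_WIDTH = 0.25  # assumed; the width used in the original chart is unknown


def bar_positions(n_groups, n_series, width=BAR_WIDTH):
    """Return, for each series, the x position of its bar in every group."""
    # Centre each group of bars on the integer ticks 0, 1, 2, ...
    offsets = [(i - (n_series - 1) / 2) * width for i in range(n_series)]
    return [[g + off for g in range(n_groups)] for off in offsets]


def plot_accuracy(path="accuracy_comparison.png"):
    """Draw the grouped bar chart and save it to `path` (hypothetical name)."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 4))
    positions = bar_positions(len(TASKS), len(SERIES))
    for xs, (name, values) in zip(positions, SERIES.items()):
        ax.bar(xs, values, width=BAR_WIDTH, label=name, color=COLORS[name])
    ax.set_xticks(range(len(TASKS)))
    ax.set_xticklabels(TASKS, rotation=15, ha="right")
    ax.set_ylabel("Accuracy")
    ax.set_ylim(0, 100)
    ax.legend(loc="upper right")
    fig.tight_layout()
    fig.savefig(path)
```

Calling `plot_accuracy()` requires matplotlib and writes a PNG file. `bar_positions` is kept separate so the layout arithmetic can be checked without any plotting library.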
### Key Observations
* "G.T. Supervision" consistently achieves the highest accuracy across all tasks.
* The "Trained Model" consistently outperforms the "Initial Model" across all tasks.
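* The "Initial Model" and "Trained Model" score much higher on ARC-C tasks (about 60.6% to 76.4%) than on MATH tasks (about 29.0% to 35.4%). With "G.T. Supervision" the two benchmarks are much closer (about 81.2% to 97.9% overall).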
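* The two gains follow different patterns. The "Trained Model" gains most over the "Initial Model" on ARC-C (about 15.8 and 14.7 points, versus 3.2 to 6.4 points on MATH), while "G.T. Supervision" gains most on MATH (about 52.2 to 68.9 points, versus 33.8 and 28.6 points on ARC-C).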
* The highest value on the chart is the "G.T. Supervision" bar for "MATH (step-level w/ G.T.)", at approximately 97.9%.
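
The comparisons in this list come down to simple subtraction, which can be checked with a minimal sketch. The values are the approximate readings from the Detailed Analysis, and the names `ACCURACY`, `gains`, and `GAINS` are introduced here purely for illustration.

```python
# Approximate accuracies (%) per task: (Initial Model, Trained Model, G.T. Supervision).
ACCURACY = {
    "ARC-C (step-level)": (60.6, 76.4, 94.4),
    "ARC-C (instance-level)": (60.6, 75.3, 89.2),
    "MATH (step-level)": (29.0, 32.2, 81.2),
    "MATH (instance-level)": (29.0, 32.9, 87.9),
    "MATH (step-level w/ G.T.)": (29.0, 35.4, 97.9),
}


def gains(accuracy):
    """Percentage-point gain over the Initial Model for each task.

    Returns {task: (trained_gain, gt_gain)}, rounded to one decimal place
    to match the precision of the chart readings.
    """
    return {
        task: (round(trained - initial, 1), round(gt - initial, 1))
        for task, (initial, trained, gt) in accuracy.items()
    }


GAINS = gains(ACCURACY)
for task, (trained_gain, gt_gain) in GAINS.items():
    print(f"{task:<28} trained +{trained_gain:>4.1f}   G.T. +{gt_gain:>4.1f}")
```

With these readings, the training gain is 15.8 and 14.7 points on the two ARC-C tasks and between 3.2 and 6.4 points on the MATH tasks, while the gain from G.T. supervision ranges from 28.6 points (ARC-C, instance-level) to 68.9 points (MATH, step-level w/ G.T.).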
### Interpretation
The data suggest that both training and, especially, Ground Truth (G.T.) supervision improve accuracy. The large gap between the "G.T. Supervision" bars and the other two bars in every group suggests that the model benefits greatly from access to correct labels. The "Trained Model" improves on the "Initial Model" in every group, which points to the value of the training process itself, but the size of that improvement differs sharply: about 15 points on ARC-C versus about 3 to 6 points on MATH. This gap, together with the lower absolute MATH scores for both models, could indicate that the model finds ARC-C inherently easier or that the training signal for ARC-C is more effective. The highest accuracy, in "MATH (step-level w/ G.T.)", suggests that combining step-level analysis with G.T. supervision is particularly effective for this task. Overall, the chart indicates that G.T. supervision accounts for most of the difference between moderate and high accuracy on these tasks.