# Technical Document Extraction: Gemma-3-12B CoT Performance Analysis
This document contains a detailed extraction of data from two line charts analyzing the performance of the **Gemma-3-12B CoT** model across varying task lengths and error rates.
---
## 1. Global Legend and Metadata
The following legend applies to both charts and is located at the bottom of the image.
| Color | Label | Description |
| :--- | :--- | :--- |
| **Blue** | Original Run | Baseline performance without injected errors. |
| **Red** | 100% Error Rate | Performance with a 100% error injection rate. |
| **Orange** | 75% Error Rate | Performance with a 75% error injection rate. |
| **Yellow/Gold** | 50% Error Rate | Performance with a 50% error injection rate. |
| **Light Green** | 25% Error Rate | Performance with a 25% error injection rate. |
| **Dark Green** | 0% Error Rate | Performance with a 0% error injection rate (control). |
| **Purple** | (Unlabeled) | Appears in both charts but has no legend entry; the condition it represents cannot be determined from the image. |
---
## 2. Left Chart: Turn Accuracy vs. Task Length
**Title:** Gemma-3-12B CoT
### Axis Definitions
* **Y-Axis:** Turn Accuracy (Scale: 0 to 0.4, increments of 0.05)
* **X-Axis:** Task Length (Scale: 0 to 80, increments of 20)
### Trend Analysis and Data Extraction
Accuracy trends downward as task length increases for every series described below. No readings were extracted for the 25% and 50% Error Rate series in this chart.
* **Original Run (Blue):**
    * **Trend:** Starts highest (~0.28) and maintains the highest accuracy throughout, though it steadily declines.
    * **Key Points:** Drops to ~0.20 at length 20, ~0.15 at length 40, and falls sharply toward 0 after length 60.
* **75% Error Rate (Orange):**
    * **Trend:** Starts at ~0.21. Follows the blue line's downward trajectory but at a lower offset.
    * **Key Points:** Drops to ~0.10 at length 20 and fluctuates around 0.05-0.10 until length 55.
* **100% Error Rate (Red):**
    * **Trend:** Starts at ~0.19. Rapid initial decline.
    * **Key Points:** Drops below 0.05 by length 15, with minor spikes around length 45 before hitting 0.
* **0% Error Rate (Dark Green):**
    * **Trend:** Starts at ~0.20. Sharp decline.
    * **Key Points:** Hits near-zero accuracy by length 15, with a small late-stage bump around length 55-60.
* **Purple Series:**
    * **Trend:** Lowest starting accuracy (~0.13).
    * **Key Points:** Drops to near 0 almost immediately (by length 10).
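
The approximate readings above can be turned into a small lookup for estimating accuracy between the listed points. The following is a minimal sketch, not part of the original analysis: it uses only the three Original Run readings with stated values, assumes the "start" value sits at task length 0, and uses hypothetical helper names (`interpolate`, `relative_drop`). Linear interpolation between readings is itself an assumption about the curve's shape.

```python
from bisect import bisect_right

# Approximate readings for the Original Run (blue) series, taken from the
# description above. Placing the "start" value at task length 0 is an assumption.
ORIGINAL_RUN = [(0, 0.28), (20, 0.20), (40, 0.15)]


def interpolate(points, x):
    """Linearly interpolate y at x between the listed (x, y) readings."""
    xs = [p[0] for p in points]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the range covered by the readings")
    i = bisect_right(xs, x)
    if i == len(xs):
        # x equals the last listed task length.
        return points[-1][1]
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def relative_drop(points, x):
    """Fraction of the starting accuracy lost by task length x."""
    start = points[0][1]
    return (start - interpolate(points, x)) / start
```

For example, `relative_drop(ORIGINAL_RUN, 40)` compares the ~0.15 reading at length 40 with the ~0.28 starting value, which amounts to a loss of a little under half of the initial accuracy.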
---
## 3. Right Chart: Format Failure Fraction vs. Task Length
### Axis Definitions
* **Y-Axis:** Format Failure Fraction (Scale: 0 to 1, increments of 0.2)
* **X-Axis:** Task Length (Scale: 0 to 1000, increments of 200)
### Trend Analysis and Data Extraction
This chart plots the fraction of outputs with format failures against task length. The series that fail show "step function" behavior: they stay at 0 and then jump abruptly to 1, meaning the output format breaks down completely. The remaining series stay at 0 across the visible range.
* **50% Error Rate (Yellow/Gold):**
    * **Trend:** Stable at 0 until a critical threshold.
    * **Failure Point:** Sharp vertical jump to 1.0 at **Task Length ≈ 520**.
* **100% Error Rate (Red):**
    * **Trend:** Stable at 0 until a critical threshold.
    * **Failure Point:** Sharp vertical jump to 1.0 at **Task Length ≈ 610**.
* **Purple Series:**
    * **Trend:** Stable at 0 until a critical threshold.
    * **Failure Point:** Sharp vertical jump to 1.0 at **Task Length ≈ 720**.
* **Series with no observed failure (Blue/Green/Orange):**
    * These lines remain at 0 for the duration of the visible X-axis (up to 1000), indicating no format failure within this range.
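
The step behavior described above can be written down as a simple threshold function. This is a sketch under stated assumptions: the thresholds are the approximate chart readings, the function name `format_failure_fraction` is hypothetical, and treating the threshold length itself as already failed is an arbitrary choice. The green series is omitted because the description does not say which green line is meant.

```python
# Approximate task lengths at which each series jumps from 0 to 1, as read
# from the right chart. None means no jump within the visible axis (up to 1000).
FAILURE_THRESHOLD = {
    "50% Error Rate": 520,
    "100% Error Rate": 610,
    "Purple (unlabeled)": 720,
    "Original Run": None,
    "75% Error Rate": None,
}


def format_failure_fraction(series, task_length):
    """Return 0.0 before the series' threshold and 1.0 from the threshold on."""
    threshold = FAILURE_THRESHOLD[series]
    if threshold is None:
        # No format failure was observed for this series in the visible range.
        return 0.0
    # Assumption: the jump is treated as occurring exactly at the threshold.
    return 1.0 if task_length >= threshold else 0.0
```

Under this model, `format_failure_fraction("50% Error Rate", 519)` is `0.0` and the same call with `520` is `1.0`, which mirrors the vertical jump in the chart.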
---
## 4. Summary of Observations
1. **Accuracy Degradation:** The model's accuracy (Left Chart) is highly sensitive to task length, even in the "Original Run." Accuracy effectively hits zero for all configurations once task length exceeds 70.
2. **Format Robustness:** While accuracy drops early, the model maintains correct formatting (Right Chart) for much longer. For the three series that fail (50% Error Rate, 100% Error Rate, and Purple), the failure fraction jumps from 0 to 1 at a threshold task length of roughly 520, 610, and 720 respectively; the other series show no format failure up to length 1000.
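3. **Effect of Error Rate:** The Original Run has the highest initial accuracy (~0.28), and the labeled injected-error series start lower (~0.19 to ~0.21). The extracted readings do not show a monotonic ordering by error rate, however: the 0%, 75%, and 100% series start at ~0.20, ~0.21, and ~0.19, and the 50% series fails on format (≈520) before the 100% series (≈610), while the 75% series shows no format failure within the visible range.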
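
One way to check the third observation against the extracted readings is to test whether each metric strictly decreases as the error rate rises. The sketch below uses only values stated in this document; the helper name `is_strictly_decreasing` is hypothetical, and series without a stated reading (for example, the 25% series) are simply left out.

```python
# Starting accuracy by injected error rate (percent), from the left chart.
INITIAL_ACCURACY = {0: 0.20, 75: 0.21, 100: 0.19}

# Task length at which format failure jumps to 1, from the right chart.
# Only series with an observed jump and a labeled error rate are included.
FAILURE_LENGTH = {50: 520, 100: 610}


def is_strictly_decreasing(by_rate):
    """True if the value drops at every step as the error rate increases."""
    values = [by_rate[rate] for rate in sorted(by_rate)]
    return all(a > b for a, b in zip(values, values[1:]))


accuracy_falls_with_rate = is_strictly_decreasing(INITIAL_ACCURACY)
failure_earlier_with_rate = is_strictly_decreasing(FAILURE_LENGTH)
```

With these readings both checks come out `False`: the 75% series starts slightly above the 0% series, and the 50% series fails before the 100% series. Given that every value is an approximate visual reading, this is a consistency check on the extraction and not a statistical result.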