# Technical Data Extraction: Model Performance Comparison
This document contains a detailed extraction of data from two line charts comparing the performance of **Gemma3** and **Qwen3** model families across varying task lengths.
## 1. Metadata and Global Legend
The image consists of two side-by-side line graphs sharing a common legend located at the bottom of the image.
**Legend (Spatial Placement: Bottom Center)**
The legend identifies seven distinct models categorized by color family:
* **Gemma3 Family (Red/Orange Tones):**
    * **Gemma3-4B**: Light Peach/Orange
    * **Gemma3-12B**: Bright Orange-Red
    * **Gemma3-27B**: Dark Maroon/Red
* **Qwen3 Family (Blue Tones):**
    * **Qwen3-4B**: Very Light Blue
    * **Qwen3-8B**: Medium Light Blue
    * **Qwen3-14B**: Medium Blue
    * **Qwen3-32B**: Dark Navy Blue
---
## 2. Left Chart: Step Accuracy vs. Task Length
### Axis Information
* **Y-Axis:** "Step Accuracy" (Scale: 0.0 to 1.0, increments of 0.2)
* **X-Axis:** "Task Length" (Scale: 0 to 100, markers at 0, 25, 50, 75, 100)
### Component Analysis & Trends
This chart displays raw data points (faint dots) and a smoothed trend line for each model.
| Model | Color | Visual Trend Description | Performance Summary |
| :--- | :--- | :--- | :--- |
| **Qwen3-32B** | Dark Navy | High stability; slight downward slope. | Starts ~0.95, ends ~0.85. |
| **Gemma3-27B** | Dark Red | High stability; fluctuates around a horizontal mean. | Maintains ~0.85 to 0.90 throughout. |
| **Qwen3-14B** | Medium Blue | Moderate decline. | Starts ~0.85, drops to ~0.65 by length 100. |
| **Gemma3-12B** | Orange-Red | Significant steady decline. | Starts ~0.80, drops to ~0.30 by length 100. |
| **Qwen3-8B** | Light Blue | Sharp initial drop, then steady decline. | Starts ~0.80, drops to ~0.10 by length 100. |
| **Qwen3-4B** | Very Light Blue | Very sharp drop; stabilizes near zero. | Drops below 0.2 by length 25; ends near 0.05. |
| **Gemma3-4B** | Peach | Immediate collapse to baseline. | Drops to ~0.05 by length 10; remains near 0. |
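
The chart overlays a smoothed trend line on noisy raw points, but the smoothing method is not stated. As a rough illustration only, the sketch below applies a centered moving average, which is one common choice and an assumption here. The function name `moving_average`, the `window` parameter, and the sample values are hypothetical and were not read from the chart.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the edges of the series."""
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical noisy step accuracies, one per task length (illustrative only).
raw = [0.9, 0.7, 0.8, 0.6, 0.7, 0.5]
trend = moving_average(raw, window=3)
print([round(v, 2) for v in trend])
```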
---
## 3. Right Chart: Task Accuracy vs. Task Length
### Axis Information
* **Y-Axis:** "Task Accuracy" (Scale: 0.0 to 1.0, increments of 0.2)
* **X-Axis:** "Task Length" (Scale: 0 to 50+, markers at 0, 20, 40)
### Component Analysis & Trends
This chart measures success on the entire task rather than on individual steps. Accuracy for all models begins at 1.0 at length 0 and decays approximately exponentially as task length increases.
| Model | Color | Visual Trend Description | Performance Summary |
| :--- | :--- | :--- | :--- |
| **Qwen3-32B** | Dark Navy | Slowest decay; most resilient. | Hits 0.2 accuracy at length ~30; reaches ~0.02 at length 50. |
| **Gemma3-27B** | Dark Red | Moderate decay; second best. | Hits 0.2 accuracy at length ~15; reaches ~0.01 at length 40. |
| **Qwen3-14B** | Medium Blue | Moderate decay; follows Gemma3-27B closely. | Hits 0.2 accuracy at length ~15; reaches 0 at length 35. |
| **Gemma3-12B** | Orange-Red | Rapid decay. | Hits 0.2 accuracy at length ~8; reaches 0 at length 25. |
| **Qwen3-8B** | Light Blue | Very rapid decay. | Hits 0.2 accuracy at length ~5; reaches 0 at length 15. |
| **Qwen3-4B** | Very Light Blue | Near-instant decay. | Hits 0.2 accuracy at length ~2; reaches 0 at length 10. |
| **Gemma3-4B** | Peach | Instant decay. | Hits 0 accuracy by length ~3. |
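
If the decay is treated as purely geometric, so that task accuracy at length n equals p ** n for a constant per-step accuracy p, the 0.2 crossing points in the table imply a value of p for each model. This is a simplifying assumption, not something the chart states, and the helper name `implied_step_accuracy` is hypothetical. Under it, crossings at lengths ~30 and ~15 correspond to per-step accuracies of roughly 0.95 and 0.90.

```python
# Approximate task length at which task accuracy crosses 0.2, taken from the
# table above. Gemma3-4B is omitted because no 0.2 crossing is listed for it.
CROSSING_LENGTH = {
    "Qwen3-32B": 30,
    "Gemma3-27B": 15,
    "Qwen3-14B": 15,
    "Gemma3-12B": 8,
    "Qwen3-8B": 5,
    "Qwen3-4B": 2,
}


def implied_step_accuracy(crossing_length, threshold=0.2):
    """Solve p ** crossing_length == threshold for p (assumes constant p)."""
    if crossing_length <= 0:
        raise ValueError("crossing_length must be positive")
    return threshold ** (1.0 / crossing_length)


for model, length in CROSSING_LENGTH.items():
    p = implied_step_accuracy(length)
    print(f"{model}: implied per-step accuracy ~ {p:.2f}")
```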
---
## 4. Key Observations and Data Synthesis
1. **Scaling Law Correlation:** In both charts, performance tracks parameter count (in billions): within each family, every larger model outperforms the smaller ones. The largest models (Qwen3-32B, Gemma3-27B) maintain significantly higher accuracy as task length increases.
2. **Step vs. Task Accuracy:** While the largest models maintain high *Step Accuracy* (above 0.8) even at length 100, their *Task Accuracy* (the probability of completing every step correctly) drops toward zero much earlier, by roughly length 40 to 50. Because a task succeeds only if every step succeeds, even small per-step error rates compound as the number of steps grows.
3. **Cross-Family Comparison:** **Qwen3-32B** (Dark Navy) is the top performer on both metrics, followed by **Gemma3-27B** (Dark Red), although the two end at a similar Step Accuracy (~0.85) at length 100. **Qwen3-14B** (Medium Blue) performs comparably to **Gemma3-27B** in Task Accuracy despite having fewer parameters.
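
The compounding effect in observation 2 can be sketched with a simple model: if each step succeeds independently with a constant probability p, the chance of completing all n steps is p ** n. Independence and a constant p are assumptions made for illustration (the left chart shows step accuracy drifting with length), and the function names below are hypothetical.

```python
import math


def task_accuracy(step_accuracy, task_length):
    """Probability that all steps succeed, assuming independent steps."""
    return step_accuracy ** task_length


def length_at_threshold(step_accuracy, threshold=0.2):
    """Task length at which step_accuracy ** length falls to the threshold."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    return math.log(threshold) / math.log(step_accuracy)


# Step accuracies roughly matching the two largest models at short lengths.
for p in (0.95, 0.90):
    n = length_at_threshold(p)
    print(f"p={p:.2f}: task accuracy falls to 0.2 near length {n:.1f}")
```

With p = 0.95 the model crosses 0.2 near length 31, and with p = 0.90 near length 15, which is close to the crossings reported for Qwen3-32B (~30) and Gemma3-27B (~15). The match is only approximate, since real step accuracy is not constant across task lengths.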