# Technical Data Extraction: Model Performance Analysis
This document contains a detailed extraction of data from four technical charts (labeled a, b, c, and d) comparing the performance of two model families: **Qwen3** (represented by blue tones) and **Gemma3** (represented by red tones).
---
## Global Legend (Footer)
The following models are identified across the subplots, categorized by color and parameter count:
| Model Family | Color Code | Specific Model Variants |
| :--- | :--- | :--- |
| **Gemma3** | Red/Orange Tones | Gemma3-4B (Light), Gemma3-12B (Medium), Gemma3-27B (Dark) |
| **Qwen3** | Blue Tones | Qwen3-4B (Lightest), Qwen3-8B (Light), Qwen3-14B (Medium), Qwen3-32B (Darkest) |
---
## Chart (a): Scaling of Initial Accuracy
**Type:** Line graph with markers and error bars.
**Spatial Grounding:** Legend located at bottom-left [x: low, y: low].
### Axis Labels
* **Y-Axis:** Turn 1 Accuracy (Scale: 0.5 to 1.0)
* **X-Axis:** Model Size (Billion Parameters) (Scale: 0 to 30+)
### Data Trends and Points
* **Qwen3 (Blue Line):** Shows a rapid upward slope from 4B to 8B, then plateaus at near-perfect accuracy.
    * ~4B: ~0.85 accuracy
    * ~8B: ~1.0 accuracy
    * ~14B: ~1.0 accuracy
    * ~32B: ~1.0 accuracy
* **Gemma3 (Red Line):** Shows a steep upward slope from 4B to 12B, then plateaus slightly below Qwen3.
    * ~4B: ~0.72 accuracy
    * ~12B: ~0.98 accuracy
    * ~27B: ~0.99 accuracy
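The readings above can be compared with a little arithmetic. The following is a minimal sketch that uses the approximate values listed for chart (a); the names `TURN1_ACCURACY` and `total_gain` are illustrative and do not come from the source figure.

```python
# Approximate Turn 1 accuracy read off chart (a); keys are sizes in billions of parameters.
TURN1_ACCURACY = {
    "Qwen3": {4: 0.85, 8: 1.0, 14: 1.0, 32: 1.0},
    "Gemma3": {4: 0.72, 12: 0.98, 27: 0.99},
}


def total_gain(series):
    """Accuracy at the largest size minus accuracy at the smallest size."""
    sizes = sorted(series)
    return series[sizes[-1]] - series[sizes[0]]


# Round to two decimals because the inputs are only approximate readings.
gains = {family: round(total_gain(s), 2) for family, s in TURN1_ACCURACY.items()}
print(gains)
```

Under these readings, Gemma3 gains more from scaling (about 0.27) than Qwen3 (about 0.15), mainly because it starts lower at 4B.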
---
## Chart (b): Scaling of Horizon Length
**Type:** Line graph with markers.
**Spatial Grounding:** Legend located at top-left [x: low, y: high].
### Axis Labels
* **Y-Axis:** Horizon Length (Scale: 0 to 12)
* **X-Axis:** Model Size (Billion Parameters) (Scale: 0 to 30+)
### Data Trends and Points
In both families, horizon length rises with model size, with the steepest increase between the two largest models of each family.
* **Qwen3 (Blue Line):** Sits at or above the Gemma3 line across the size range; the two families coincide at 4B (~3.0).
    * 4B: ~3.0
    * 8B: ~4.0
    * 14B: ~5.0
    * 32B: ~12.0
* **Gemma3 (Red Line):**
    * 4B: ~3.0
    * 12B: ~4.0
    * 27B: ~9.0
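To make the shape of these curves concrete, the sketch below computes the gain in horizon length per billion parameters between adjacent points. It assumes the approximate readings listed above; `HORIZON` and `segment_slopes` are hypothetical names introduced only for this illustration.

```python
# Approximate values read off chart (b): (size in billions of parameters, horizon length).
HORIZON = {
    "Qwen3": [(4, 3.0), (8, 4.0), (14, 5.0), (32, 12.0)],
    "Gemma3": [(4, 3.0), (12, 4.0), (27, 9.0)],
}


def segment_slopes(points):
    """Horizon-length gain per billion parameters for each adjacent pair of points."""
    return [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


for family, points in HORIZON.items():
    print(family, [round(s, 3) for s in segment_slopes(points)])
```

With these readings, the final segment is the steepest in both families (roughly 0.39 for Qwen3 and 0.33 for Gemma3), although the Qwen3 slope dips slightly between 8B and 14B before rising.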
---
## Chart (c): Task Accuracy vs. Task Length
**Type:** Decay curves with shaded confidence intervals.
**Spatial Grounding:** Uses the global footer legend.
### Axis Labels
* **Y-Axis:** Task Accuracy (Scale: 0 to 1.0)
* **X-Axis:** Task Length (Scale: 0 to 50)
### Data Trends
All models exhibit performance decay as task length increases.
* **Top Performer:** **Gemma3-27B** (Dark Red dashed line) shows the most resilience, maintaining ~0.6 accuracy at length 50.
* **Qwen3 Series:** The darkest blue (Qwen3-32B) performs best within its family but drops to near 0 accuracy by length 50.
* **Small Models:** Gemma3-4B (lightest orange) and Qwen3-4B (lightest blue) decay the fastest, hitting near 0 accuracy before length 10.
---
## Chart (d): Turn Accuracy vs. Task Length
**Type:** Noisy line plots with shaded variance.
**Spatial Grounding:** Uses the global footer legend.
### Axis Labels
* **Y-Axis:** Turn Accuracy (Scale: 0 to 1.0)
* **X-Axis:** Task Length (Scale: 0 to 100)
### Data Trends
* **High Stability Group:** **Gemma3-27B** (Dark Red) and **Qwen3-32B** (Dark Blue) maintain high accuracy (~0.8 to 0.9) even out to 100 turns.
* **Mid-Tier Decay:** **Qwen3-14B** (Medium Blue) and **Gemma3-12B** (Medium Red/Orange) show a gradual decline. Gemma3-12B starts at ~0.8 and drops to ~0.3 by turn 100. Qwen3-14B starts at ~0.9 and drops to ~0.7 over the same range.
* **Low-Tier:** Smaller models (4B variants) start with lower initial accuracy and show significant volatility/noise, trending toward 0.1-0.2 accuracy at long task lengths.
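The mid-tier decline can be expressed as an average drop per unit of task length. This is a rough sketch based on the approximate start and end values quoted above. It assumes the decline spans about 100 units of task length and treats it as linear, which the noisy curves may not be. `MID_TIER`, `SPAN`, and `average_drop_per_unit` are illustrative names.

```python
# Approximate (start, end) turn accuracy for the mid-tier models in chart (d).
MID_TIER = {
    "Gemma3-12B": (0.8, 0.3),
    "Qwen3-14B": (0.9, 0.7),
}
SPAN = 100  # assumed task-length span over which the decline is measured


def average_drop_per_unit(start, end, span=SPAN):
    """Average loss in turn accuracy per unit of task length (linear approximation)."""
    return (start - end) / span


for model, (start, end) in MID_TIER.items():
    print(model, round(average_drop_per_unit(start, end), 4))
```

Under these assumptions, Gemma3-12B loses about 0.005 accuracy per unit of task length, roughly 2.5 times the rate of Qwen3-14B (about 0.002).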