# Technical Document Extraction: Multi-Model Performance Analysis
## Chart 1: Qwen3 8B [Thinking]
### Spatial Grounding
- **Legend Position**: Bottom center
- **Legend Colors**:
  - Green: With Self-Verification
  - Blue: Original Run
### Axes
- **X-Axis**: Task Length (0–250)
- **Y-Axis**: Turn Accuracy (0.0–1.0)
### Key Trends
1. **With Self-Verification (Green)**:
   - Initial sharp decline from ~0.98 to ~0.6 at Task Length 50
   - Volatile fluctuations between 0.5–0.75 until Task Length 200
   - Final plateau at ~0.6
2. **Original Run (Blue)**:
   - Gradual decline from ~0.95 to ~0.55 at Task Length 100
   - Recovery to ~0.7 at Task Length 150
   - Final drop to ~0.6 at Task Length 250
### Data Points
- **With Self-Verification**:
  - Task Length 0: 0.98
  - Task Length 50: 0.6
  - Task Length 100: 0.65
  - Task Length 200: 0.6
  - Task Length 250: 0.6
- **Original Run**:
  - Task Length 0: 0.95
  - Task Length 50: 0.85
  - Task Length 100: 0.55
  - Task Length 150: 0.7
  - Task Length 250: 0.6
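
The two Chart 1 series can be compared point by point. The following is a minimal sketch, assuming the approximate values listed above; the variable and function names are illustrative and not part of the source. Only task lengths present in both series are compared, so 150 and 200 are skipped.

```python
# Approximate values read from Chart 1 (Qwen3 8B [Thinking]).
with_self_verification = {0: 0.98, 50: 0.6, 100: 0.65, 200: 0.6, 250: 0.6}
original_run = {0: 0.95, 50: 0.85, 100: 0.55, 150: 0.7, 250: 0.6}


def accuracy_gap(verified, original):
    """Return {task_length: verified - original} for lengths found in both series."""
    shared = sorted(set(verified) & set(original))
    return {length: round(verified[length] - original[length], 2) for length in shared}


gap = accuracy_gap(with_self_verification, original_run)
for length, delta in gap.items():
    # A positive delta means the self-verification run scored higher.
    print(f"Task Length {length}: {delta:+.2f}")
```

On these values the gap changes sign more than once and closes to zero at Task Length 250, which matches the description of both lines ending near 0.6.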
---
## Chart 2: Gemma3 12B [CoT]
### Spatial Grounding
- **Legend Position**: Bottom center
- **Legend Colors**:
  - Green: With Self-Verification
  - Blue: Original Run
### Axes
- **X-Axis**: Task Length (0–250)
- **Y-Axis**: Turn Accuracy (0.0–1.0)
### Key Trends
1. **With Self-Verification (Green)**:
   - Starts at 0.98, drops to 0.8 at Task Length 50
   - Holds between 0.75 and 0.8 through Task Length 150
   - Sharp decline to 0.2 at Task Length 200
   - Final collapse to 0.0 at Task Length 250
2. **Original Run (Blue)**:
   - Steady rise from 0.4 to 0.85 at Task Length 150
   - Plateaus at ~0.85 until Task Length 250
### Data Points
- **With Self-Verification**:
  - Task Length 0: 0.98
  - Task Length 50: 0.8
  - Task Length 100: 0.75
  - Task Length 150: 0.8
  - Task Length 200: 0.2
  - Task Length 250: 0.0
- **Original Run**:
  - Task Length 0: 0.4
  - Task Length 50: 0.5
  - Task Length 100: 0.7
  - Task Length 150: 0.85
  - Task Length 200: 0.85
  - Task Length 250: 0.85
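
The Chart 2 values imply that the Original Run overtakes the self-verification run somewhere between the listed points. The sketch below estimates where, assuming straight-line segments between the approximate data points above; the helper name `first_crossover` is hypothetical, and the interpolated position is only as reliable as the values read off the chart.

```python
# Approximate values read from Chart 2 (Gemma3 12B [CoT]).
task_lengths = [0, 50, 100, 150, 200, 250]
with_self_verification = [0.98, 0.8, 0.75, 0.8, 0.2, 0.0]
original_run = [0.4, 0.5, 0.7, 0.85, 0.85, 0.85]


def first_crossover(xs, a, b):
    """Return the x where series a first falls to or below series b, else None.

    Assumes the series are linear between consecutive listed points.
    """
    for i in range(1, len(xs)):
        prev_gap = a[i - 1] - b[i - 1]
        gap = a[i] - b[i]
        if prev_gap > 0 and gap <= 0:
            # Fraction of the segment at which the gap reaches zero.
            fraction = prev_gap / (prev_gap - gap)
            return xs[i - 1] + fraction * (xs[i] - xs[i - 1])
    return None


crossover = first_crossover(task_lengths, with_self_verification, original_run)
print(f"Estimated crossover near Task Length {crossover:.0f}")
```

Under these assumptions the crossover falls between Task Length 100 and 150, before the sharp decline at Task Length 200.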
---
## Chart 3: Gemma3 12B [CoT] (Token Generation)
### Spatial Grounding
- **Legend Position**: Bottom center
- **Legend Colors**:
  - Green: With Self-Verification
  - Blue: Original Run
### Axes
- **X-Axis**: Task Length (0–250)
- **Y-Axis**: Avg. Tokens Generated per Step (0–1400)
### Key Trends
1. **With Self-Verification (Green)**:
   - Starts at 600 tokens at Task Length 0
   - Rises steadily to a peak of 1200 tokens at Task Length 150
   - Catastrophic drop to 0 at Task Length 200, remaining at 0 through Task Length 250
2. **Original Run (Blue)**:
   - Stable at ~400 tokens until Task Length 200
   - Minor fluctuations (380–420) post-200
### Data Points
- **With Self-Verification**:
  - Task Length 0: 600
  - Task Length 50: 800
  - Task Length 100: 1000
  - Task Length 150: 1200
  - Task Length 200: 0
  - Task Length 250: 0
- **Original Run**:
  - Task Length 0: 400
  - Task Length 50: 400
  - Task Length 100: 400
  - Task Length 150: 400
  - Task Length 200: 400
  - Task Length 250: 400
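
The shape of the green series before its collapse can be checked directly from the listed values. This is a small sketch, assuming the four approximate points from Task Length 0 to 150: constant differences between consecutive points indicate linear growth, while constant ratios would indicate exponential growth. All names are illustrative.

```python
# Approximate Chart 3 values for the self-verification run, before the drop to 0.
task_lengths = [0, 50, 100, 150]
tokens = [600, 800, 1000, 1200]

# Step-to-step differences and ratios over equally spaced task lengths.
differences = [later - earlier for earlier, later in zip(tokens, tokens[1:])]
ratios = [later / earlier for earlier, later in zip(tokens, tokens[1:])]

is_linear = len(set(differences)) == 1          # same increment every step
is_geometric = max(ratios) - min(ratios) < 1e-9  # same multiplier every step

print("differences:", differences)
print("ratios:", [round(r, 3) for r in ratios])
print("linear:", is_linear, "| geometric:", is_geometric)
```

On these values the increments are equal while the ratios shrink, so the listed points are consistent with linear rather than exponential growth.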
---
## Cross-Chart Observations
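1. **Self-Verification Impact**:
   - Qwen3 8B: Accuracy settles at ~0.6 by Task Length 250, matching the Original Run
   - Gemma3 12B: Accuracy falls to 0.2 at Task Length 200 and 0.0 at Task Length 250, while the Original Run holds at ~0.85
2. **Token Efficiency**:
   - With Self-Verification, tokens per step grow roughly linearly (about +200 per 50 units of Task Length, from 600 to 1200) before dropping to 0 at Task Length 200 (Chart 3); the Original Run stays near 400
3. **Model Scaling**:
   - Gemma3 12B shows more pronounced instability with self-verification than Qwen3 8B; because the two models differ in family and reasoning mode ([CoT] vs. [Thinking]) as well as size, this comparison alone does not isolate the effect of scale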
## Critical Anomalies
- **Chart 2 Green Line**: Abrupt 0.8→0.2 drop at Task Length 200 (potential data truncation?)
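- **Chart 3 Green Line**: 1200→0 token collapse at Task Length 200 (systemic failure point); Chart 2 still reports 0.2 accuracy at the same Task Length, so the two extracted values do not fully agree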
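
Both anomalies can be located mechanically by finding the steepest fall between adjacent data points. The sketch below assumes the approximate green-series values extracted from Charts 2 and 3; `largest_drop` is a hypothetical helper, not something defined in the source.

```python
# Approximate green-series (With Self-Verification) values for Gemma3 12B [CoT].
task_lengths = [0, 50, 100, 150, 200, 250]
accuracy = [0.98, 0.8, 0.75, 0.8, 0.2, 0.0]   # Chart 2
tokens = [600, 800, 1000, 1200, 0, 0]          # Chart 3


def largest_drop(xs, ys):
    """Return (x_start, x_end, drop) for the steepest fall between adjacent points."""
    steps = [(xs[i], xs[i + 1], ys[i] - ys[i + 1]) for i in range(len(xs) - 1)]
    return max(steps, key=lambda step: step[2])


accuracy_drop = largest_drop(task_lengths, accuracy)
token_drop = largest_drop(task_lengths, tokens)
print("accuracy:", accuracy_drop[:2], round(accuracy_drop[2], 2))
print("tokens:  ", token_drop)
```

With these values both series show their largest drop over the same interval, Task Length 150 to 200, which is consistent with a single shared failure point.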
## Language Note
All text is in English. No non-English content detected.