## Line Graphs: DeepSeek-R1-Zero Training Metrics
### Overview
The image contains two line graphs comparing training performance metrics for the DeepSeek-R1-Zero model. The left graph tracks AIME accuracy over training steps, while the right graph shows average response length per response. Both graphs span 10,000 training steps.
### Components/Axes
**Left Graph (Accuracy):**
- **X-axis**: "Steps" (0 to 10,000)
- **Y-axis**: "Accuracy" (0.2 to 0.9)
- **Legend**:
- Blue line: "r1-zero-pass@1"
- Red line: "r1-zero-cons@16"
- Green dashed line: "human participants"
- **Legend Position**: Top-left corner
**Right Graph (Response Length):**
- **X-axis**: "Steps" (0 to 10,000)
- **Y-axis**: "Average length per response" (0 to 20,000)
- **Line**: Blue with shaded variability region
- **No explicit legend**, but shaded area indicates response length uncertainty
### Detailed Analysis
**Left Graph Trends:**
1. **r1-zero-pass@1 (Blue)**:
- Starts at ~0.15 accuracy at 0 steps
- Gradual increase to ~0.75 by 10,000 steps
- Notable plateau between 6,000-8,000 steps (~0.72-0.74)
- Approximate values:
- 2,000 steps: ~0.45
- 4,000 steps: ~0.60
- 6,000 steps: ~0.70
2. **r1-zero-cons@16 (Red)**:
- Starts at ~0.25 accuracy at 0 steps
- Rapid initial growth to ~0.8 by 2,000 steps
- Fluctuates between 0.75-0.85 until 10,000 steps
- Approximate values:
- 2,000 steps: ~0.60
- 4,000 steps: ~0.75
- 6,000 steps: ~0.80
3. **Human Participants (Green)**:
- Flat line at ~0.38 accuracy throughout
- No improvement over training steps
**Right Graph Trends:**
- **Response Length**:
- Starts at ~0 at 0 steps
- Steady linear increase with periodic spikes
- Reaches ~15,000 by 10,000 steps
- Shaded area shows variability (e.g., ±1,000 at 5,000 steps)
- Approximate values:
- 2,000 steps: ~2,500
- 4,000 steps: ~4,000
- 6,000 steps: ~6,000
### Key Observations
1. **Accuracy Divergence**:
- r1-zero-cons@16 consistently outperforms r1-zero-pass@1 by ~10-15% across all steps
- Human baseline remains static at 0.38, far below model performance
2. **Response Length Growth**:
- Linear correlation between training steps and response length (R² ~0.95)
- Variability increases with steps (standard deviation grows from ~200 to ~1,500)
3. **Training Dynamics**:
- Both models show diminishing returns after ~6,000 steps
- r1-zero-cons@16 maintains higher stability (lower variance in accuracy)
### Interpretation
The data demonstrates that:
1. **Model Architecture Impact**: The "cons@16" configuration achieves superior accuracy, suggesting ensemble methods or multi-stage processing improve performance.
2. **Training Efficiency**: Both models show rapid initial improvement (first 2,000 steps) followed by slower gains, indicating potential optimization opportunities.
3. **Response Complexity**: The linear increase in response length suggests the model develops more comprehensive reasoning capabilities over time, though with increasing variability.
4. **Human Baseline**: The static human performance highlights the models' ability to surpass human-level accuracy in this task, despite similar response lengths.
The shaded region in the response length graph indicates growing uncertainty in model outputs as training progresses, potentially reflecting increased complexity in generated answers.