## Line Graphs: Lichess Puzzle Accuracy vs Training Steps for Different Models
### Overview
Three line graphs compare the performance of three language models (Qwen2.5-3B, Qwen2.5-7B, Llama3.1-8B) under three training strategies: dense reward (solid blue), sparse reward (solid gray), and action SFT (dashed red line). The y-axis measures "Lichess Puzzle Accuracy" (0-0.35), and the x-axis tracks "Training Step" (0-150). The trends differ markedly across the three models.
### Components/Axes
- **X-axis**: Training Step (0-150, linear scale)
- **Y-axis**: Lichess Puzzle Accuracy (0-0.35, linear scale)
- **Legend**:
  - Dense reward (solid blue)
  - Sparse reward (solid gray)
  - Action SFT (dashed red)
- **Graph Titles**:
  - Qwen2.5-3B
  - Qwen2.5-7B
  - Llama3.1-8B
### Detailed Analysis
#### Qwen2.5-3B
- **Dense reward**: Starts at ~0.005, rises sharply to ~0.24 by step 150 (slope: ~0.0016 per step).
- **Sparse reward**: Remains flat at ~0.005 throughout.
- **Action SFT**: Flat line at ~0.16.
#### Qwen2.5-7B
- **Dense reward**: Starts at ~0.005, rises to ~0.28 by step 150 (slope: ~0.0018 per step).
- **Sparse reward**: Starts at ~0.005, rises to ~0.29 by step 150 (slope: ~0.0019 per step).
- **Action SFT**: Flat line at ~0.19.
#### Llama3.1-8B
- **Dense reward**: Spikes to ~0.30 by step 30, then fluctuates between ~0.28-0.32.
- **Sparse reward**: Flat line at ~0.005.
- **Action SFT**: Flat line at ~0.17.
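The per-step slopes quoted above are simple endpoint estimates. A minimal sketch of that arithmetic, using the approximate start/end accuracies read off the plots (not exact data):

```python
# Hypothetical sketch: average per-step slope of an accuracy curve,
# estimated from its endpoints. Values are approximate readings from
# the plots, not the underlying training logs.

def per_step_slope(start_acc, end_acc, start_step=0, end_step=150):
    """Average accuracy gain per training step between two points."""
    return (end_acc - start_acc) / (end_step - start_step)

curves = {
    "Qwen2.5-3B dense":  (0.005, 0.24),   # -> ~0.0016 per step
    "Qwen2.5-7B dense":  (0.005, 0.28),   # -> ~0.0018 per step
    "Qwen2.5-7B sparse": (0.005, 0.29),   # -> ~0.0019 per step
}

for name, (start, end) in curves.items():
    print(f"{name}: ~{per_step_slope(start, end):.4f} accuracy/step")
```

This is a crude average over the whole run; the actual curves are not linear (the Llama3.1-8B dense curve, for example, rises almost entirely before step 30).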
### Key Observations
1. **Dense reward** clearly outperforms sparse reward in Qwen2.5-3B; in Qwen2.5-7B the two converge, jointly reaching the highest Qwen accuracies (~0.28-0.29).
2. **Sparse reward** underperforms in Qwen2.5-3B but matches dense reward in Qwen2.5-7B.
3. **Action SFT** acts as a static benchmark (~0.16-0.19) across all models.
4. Llama3.1-8B's dense reward shows volatility after step 30, suggesting potential overfitting or instability.
### Interpretation
- **Training Efficiency**: Dense reward scales with model size (Qwen2.5-7B's final accuracy of ~0.28 exceeds Qwen2.5-3B's ~0.24 by ~16.7% in relative terms).
- **Reward Design**: Sparse reward's effectiveness depends on model capacity (matches dense reward in 7B but not 3B).
- **Action SFT Limitation**: The flat SFT baselines (~0.16-0.19) sit well below the best reward-trained accuracies (~0.24-0.32), suggesting SFT alone may not capture complex puzzle-solving dynamics.
- **Llama3.1 Anomaly**: The post-step-30 volatility in dense reward could indicate reward hacking or insufficient regularization.
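The ~16.7% figure above is a relative (not absolute) gain. A short sketch of the calculation, using the approximate final accuracies from the plots:

```python
# Hypothetical sketch of the relative-improvement figure quoted above:
# Qwen2.5-7B's final dense-reward accuracy (~0.28) vs Qwen2.5-3B's (~0.24).

def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline

gain = relative_improvement(0.24, 0.28)
print(f"7B over 3B: {gain:.1%}")  # ~16.7% relative (0.04 absolute)
```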
The data suggests that dense reward is the most reliable strategy across model sizes, while sparse reward only becomes effective once the model has sufficient capacity and otherwise stagnates near zero. Action SFT serves as a useful baseline but likely needs augmentation with reward-based training for optimal results.