## Line Charts: Lichess Puzzle Accuracy vs. Training Step for Different Models
### Overview
The image presents three line charts comparing the performance of different language models (Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B) on a Lichess puzzle accuracy task. The charts depict the accuracy achieved by each model over training steps, using different reward strategies: dense reward, sparse reward, and Action SFT (Supervised Fine-Tuning).
### Components/Axes
* **Titles (Top of each chart):**
* Left Chart: Qwen2.5-3B
* Middle Chart: Qwen2.5-7B
* Right Chart: Llama3.1-8B
* **Y-axis (Vertical):** Lichess Puzzle Acc, ranging from 0.00 to 0.25 (left), 0.00 to 0.30 (middle), and 0.00 to 0.30 (right).
* **X-axis (Horizontal):** Training Step, ranging from 0 to 150 in increments of 30.
* **Legend (Bottom-left of the first chart):**
* Blue line: Dense reward
* Gray line: Sparse reward
* Red dotted line: Action SFT
### Detailed Analysis
**Chart 1: Qwen2.5-3B**
* **Dense reward (Blue):** The line starts near 0.00, stays low through training step 60 (approximately 0.055), then rises sharply between steps 60 and 120 to approximately 0.225, ending at approximately 0.24 at step 150.
* (0, 0.005), (30, 0.015), (60, 0.055), (90, 0.16), (120, 0.225), (150, 0.24)
* **Sparse reward (Gray):** The line remains relatively flat near 0.00 throughout the training steps.
* (0, 0.005), (150, 0.005)
* **Action SFT (Red dotted line):** The line is horizontal at approximately 0.17.
* (0, 0.17), (150, 0.17)
**Chart 2: Qwen2.5-7B**
* **Dense reward (Blue):** The line starts at approximately 0.00 and increases sharply until around training step 60, reaching an accuracy of approximately 0.24. It then climbs more gradually, reaching a final accuracy of approximately 0.29 at training step 150.
* (0, 0.005), (30, 0.12), (60, 0.24), (90, 0.275), (120, 0.285), (150, 0.29)
* **Sparse reward (Gray):** The line starts at approximately 0.01 and increases sharply until around training step 60, reaching an accuracy of approximately 0.29. It then continues to rise slowly, reaching a final accuracy of approximately 0.315 at training step 150.
* (0, 0.01), (30, 0.20), (60, 0.29), (90, 0.30), (120, 0.31), (150, 0.315)
* **Action SFT (Red dotted line):** The line is horizontal at approximately 0.19.
* (0, 0.19), (150, 0.19)
**Chart 3: Llama3.1-8B**
* **Dense reward (Blue):** The line starts at approximately 0.00 and increases sharply until around training step 60, reaching an accuracy of approximately 0.30. It then fluctuates between 0.28 and 0.32, reaching a final accuracy of approximately 0.32 at training step 150.
* (0, 0.005), (30, 0.11), (60, 0.30), (90, 0.28), (120, 0.31), (150, 0.32)
* **Sparse reward (Gray):** The line remains relatively flat near 0.01 throughout the training steps.
* (0, 0.01), (150, 0.01)
* **Action SFT (Red dotted line):** The line is horizontal at approximately 0.18.
* (0, 0.18), (150, 0.18)
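The approximate coordinates read off the three panels can be collected into a small data structure and re-plotted. The sketch below is a minimal matplotlib reconstruction; the values are visual estimates from the figure, not exact measurements, and the styling (colors, figure size, output filename) is assumed rather than taken from the original plot.

```python
# Reconstruct the three-panel figure from the estimated (step, accuracy) pairs.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Estimated data points as listed in the chart description above.
data = {
    "Qwen2.5-3B": {
        "Dense reward":  [(0, 0.005), (30, 0.015), (60, 0.055), (90, 0.16), (120, 0.225), (150, 0.24)],
        "Sparse reward": [(0, 0.005), (150, 0.005)],
        "Action SFT":    [(0, 0.17), (150, 0.17)],
    },
    "Qwen2.5-7B": {
        "Dense reward":  [(0, 0.005), (30, 0.12), (60, 0.24), (90, 0.275), (120, 0.285), (150, 0.29)],
        "Sparse reward": [(0, 0.01), (30, 0.20), (60, 0.29), (90, 0.30), (120, 0.31), (150, 0.315)],
        "Action SFT":    [(0, 0.19), (150, 0.19)],
    },
    "Llama3.1-8B": {
        "Dense reward":  [(0, 0.005), (30, 0.11), (60, 0.30), (90, 0.28), (120, 0.31), (150, 0.32)],
        "Sparse reward": [(0, 0.01), (150, 0.01)],
        "Action SFT":    [(0, 0.18), (150, 0.18)],
    },
}

# Line styles matching the legend: blue solid, gray solid, red dotted.
styles = {
    "Dense reward":  dict(color="tab:blue", linestyle="-"),
    "Sparse reward": dict(color="gray", linestyle="-"),
    "Action SFT":    dict(color="tab:red", linestyle=":"),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharex=True)
for ax, (model, series) in zip(axes, data.items()):
    for name, points in series.items():
        steps, accs = zip(*points)
        ax.plot(steps, accs, label=name, **styles[name])
    ax.set_title(model)
    ax.set_xlabel("Training Step")
    ax.set_xticks(range(0, 151, 30))
axes[0].set_ylabel("Lichess Puzzle Acc")
axes[0].legend(loc="upper left", fontsize="small")
fig.tight_layout()
fig.savefig("lichess_accuracy.png")
```

This also makes the key comparisons easy to check programmatically, e.g. the final sparse-reward accuracy for Qwen2.5-7B (0.315) exceeds its final dense-reward accuracy (0.29).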
### Key Observations
* For Qwen2.5-3B and Llama3.1-8B, the dense reward strategy (blue line) produces a large increase in Lichess puzzle accuracy while the sparse reward strategy (gray line) remains essentially flat near 0.00-0.01.
* Qwen2.5-7B is the only model that learns effectively under the sparse reward strategy.
* The Action SFT (red dotted line) provides a baseline accuracy level, which the dense reward strategy surpasses for all models.
* The Qwen2.5-7B model shows the highest accuracy with the sparse reward strategy, slightly exceeding its own dense reward performance (approximately 0.315 vs. 0.29 at step 150).
* The Llama3.1-8B model achieves the highest overall accuracy with the dense reward strategy, reaching approximately 0.32.
### Interpretation
The data suggests that a dense reward strategy is generally more effective for training language models on Lichess puzzles than a sparse reward strategy: the denser feedback signal during training leads to faster and larger accuracy gains for Qwen2.5-3B and Llama3.1-8B, whose sparse-reward curves stay flat. The Action SFT baseline indicates the performance level achievable through supervised fine-tuning alone, which the dense reward strategy surpasses for all three models. Qwen2.5-7B is the notable exception on the sparse side: it learns well under both reward structures and even slightly exceeds its dense-reward accuracy with the sparse reward, suggesting that sensitivity to reward density varies by model. The Llama3.1-8B model's dense-reward curve reaches the highest overall accuracy (approximately 0.32) in this comparison.