## Line Chart: Training Dynamics of Simple Prompt Guidance
### Overview
This image presents a line chart illustrating the training dynamics of two methods: "With simple guidance" and "Original GRPO". The chart plots "Accuracy reward" against "Global step", showing how the performance of each method changes over the course of training.
### Components/Axes
* **Title:** "The training dynamics of simple prompt guidance" (positioned at the top-center)
* **X-axis:** "Global step" (ranging from -2 to 34, with markers at every 2 steps)
* **Y-axis:** "Accuracy reward" (ranging from 0.45 to 0.60, with markers at every 0.05)
* **Legend:** Located in the bottom-right corner, identifying the two data series:
    * "With simple guidance" - represented by a light blue line.
    * "Original GRPO" - represented by a red line.
### Detailed Analysis
**"With simple guidance" (Light Blue Line):**
The light blue line initially slopes downward from approximately 0.55 at Global step 0 to a minimum of approximately 0.48 at Global step 10. It then fluctuates, increasing to around 0.51 at Global step 22, decreasing to approximately 0.48 at Global step 26, and finally rising sharply to approximately 0.59 at Global step 34.
* Global step 0: Accuracy reward ≈ 0.55
* Global step 2: Accuracy reward ≈ 0.54
* Global step 4: Accuracy reward ≈ 0.52
* Global step 6: Accuracy reward ≈ 0.50
* Global step 8: Accuracy reward ≈ 0.49
* Global step 10: Accuracy reward ≈ 0.48
* Global step 12: Accuracy reward ≈ 0.49
* Global step 14: Accuracy reward ≈ 0.49
* Global step 16: Accuracy reward ≈ 0.48
* Global step 18: Accuracy reward ≈ 0.48
* Global step 20: Accuracy reward ≈ 0.48
* Global step 22: Accuracy reward ≈ 0.51
* Global step 24: Accuracy reward ≈ 0.49
* Global step 26: Accuracy reward ≈ 0.48
* Global step 28: Accuracy reward ≈ 0.52
* Global step 30: Accuracy reward ≈ 0.56
* Global step 32: Accuracy reward ≈ 0.58
* Global step 34: Accuracy reward ≈ 0.59
**"Original GRPO" (Red Line):**
The red line starts at approximately 0.57 at Global step 0, drops to approximately 0.53 at Global step 2, and continues down to a minimum of approximately 0.47 at Global step 10. It then recovers to approximately 0.49 at Global step 14, falls back to approximately 0.47 at Global steps 18-20, climbs to approximately 0.49 at Global step 26, and rises sharply to approximately 0.58 at Global step 30, where it stays through Global step 34.
* Global step 0: Accuracy reward ≈ 0.57
* Global step 2: Accuracy reward ≈ 0.53
* Global step 4: Accuracy reward ≈ 0.51
* Global step 6: Accuracy reward ≈ 0.49
* Global step 8: Accuracy reward ≈ 0.48
* Global step 10: Accuracy reward ≈ 0.47
* Global step 12: Accuracy reward ≈ 0.48
* Global step 14: Accuracy reward ≈ 0.49
* Global step 16: Accuracy reward ≈ 0.48
* Global step 18: Accuracy reward ≈ 0.47
* Global step 20: Accuracy reward ≈ 0.47
* Global step 22: Accuracy reward ≈ 0.48
* Global step 24: Accuracy reward ≈ 0.48
* Global step 26: Accuracy reward ≈ 0.49
* Global step 28: Accuracy reward ≈ 0.52
* Global step 30: Accuracy reward ≈ 0.58
* Global step 32: Accuracy reward ≈ 0.58
* Global step 34: Accuracy reward ≈ 0.58
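
The readings listed above can be collected into a small data structure for further analysis. The following is a minimal sketch, assuming the values are approximate readings taken every 2 global steps; the names `STEPS`, `SERIES`, and `summarize` are illustrative and do not come from the chart's source.

```python
# Approximate values transcribed from the bullet lists above.
STEPS = list(range(0, 36, 2))  # global steps 0, 2, ..., 34

SERIES = {
    "With simple guidance": [
        0.55, 0.54, 0.52, 0.50, 0.49, 0.48, 0.49, 0.49, 0.48,
        0.48, 0.48, 0.51, 0.49, 0.48, 0.52, 0.56, 0.58, 0.59,
    ],
    "Original GRPO": [
        0.57, 0.53, 0.51, 0.49, 0.48, 0.47, 0.48, 0.49, 0.48,
        0.47, 0.47, 0.48, 0.48, 0.49, 0.52, 0.58, 0.58, 0.58,
    ],
}


def summarize(values, steps=STEPS):
    """Return the start value, the minimum with the first step at which
    it occurs, and the final value of one series."""
    low = min(values)
    return {
        "start": values[0],
        "min": low,
        "min_step": steps[values.index(low)],
        "final": values[-1],
    }


for name, values in SERIES.items():
    print(name, summarize(values))
```

Because several steps share the same minimum reading, `min_step` reports only the first step at which the minimum appears.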
### Key Observations
* Both methods exhibit a decrease in accuracy reward during the initial training phase (Global steps 0-10).
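* From Global step 26 to Global step 34, "With simple guidance" rises from ≈ 0.48 to ≈ 0.59 (+0.11), while "Original GRPO" rises from ≈ 0.49 to ≈ 0.58 (+0.09). "Original GRPO" plateaus at ≈ 0.58 from Global step 30 onward, whereas "With simple guidance" is still increasing at Global step 34.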
* "Original GRPO" starts with a higher accuracy reward than "With simple guidance" but converges to a similar level at the end of the training.
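* "With simple guidance" shows one larger mid-training excursion, rising to ≈ 0.51 at Global step 22 and returning to ≈ 0.48 at Global step 26, whereas "Original GRPO" stays within ≈ 0.47-0.49 over Global steps 10-26. Outside that excursion, the two curves fluctuate by similar amounts.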
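
The observations above can be checked numerically against the listed readings. This is a minimal sketch, assuming the approximate values from the Detailed Analysis section; the helper names (`at`, `window`, `gain`, `spread`, `mean_abs_change`) and the choice of steps 10-26 as the "mid-training" window are illustrative assumptions, not part of the original chart.

```python
STRIDE = 2  # the bullet lists report one value every 2 global steps

# Approximate readings for global steps 0, 2, ..., 34.
GUIDED = [0.55, 0.54, 0.52, 0.50, 0.49, 0.48, 0.49, 0.49, 0.48,
          0.48, 0.48, 0.51, 0.49, 0.48, 0.52, 0.56, 0.58, 0.59]
GRPO = [0.57, 0.53, 0.51, 0.49, 0.48, 0.47, 0.48, 0.49, 0.48,
        0.47, 0.47, 0.48, 0.48, 0.49, 0.52, 0.58, 0.58, 0.58]


def at(values, step):
    """Value recorded at a given global step."""
    return values[step // STRIDE]


def window(values, first_step, last_step):
    """Values from first_step through last_step, inclusive."""
    return values[first_step // STRIDE : last_step // STRIDE + 1]


def gain(values, first_step, last_step):
    """Change in accuracy reward between two global steps."""
    return at(values, last_step) - at(values, first_step)


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def mean_abs_change(values):
    """Average absolute change between consecutive readings."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


for name, values in (("With simple guidance", GUIDED), ("Original GRPO", GRPO)):
    print(
        f"{name}: "
        f"change 0->10 = {gain(values, 0, 10):+.2f}, "
        f"change 26->34 = {gain(values, 26, 34):+.2f}, "
        f"spread 10-26 = {spread(window(values, 10, 26)):.2f}, "
        f"mean |step change| = {mean_abs_change(values):.3f}"
    )
```

On these readings, the mean step-to-step change is about 0.015 for both runs, so any extra variability in the guided run comes mainly from its mid-training spread (≈ 0.03 versus ≈ 0.02 over Global steps 10-26).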
### Interpretation
The chart compares two training runs: one with simple prompt guidance and the original GRPO baseline. Both runs lose accuracy reward over the first 10 global steps, which may reflect the model adjusting early in training, although the chart alone does not establish the cause. "Original GRPO" starts slightly higher (≈ 0.57 versus ≈ 0.55), but "With simple guidance" ends at a comparable level (≈ 0.59 versus ≈ 0.58). A gap of about 0.01 is within the precision of reading values off the chart, so the data support parity rather than a clear advantage; whether the guided run would pull ahead with continued training cannot be determined from these 34 steps. The guided run's temporary rise around Global step 22 could indicate a somewhat more sensitive training process, but it is a single excursion. Both curves rise sharply between Global steps 26 and 30, which suggests a learning event shared by both runs rather than an effect specific to the guidance.