## Comparison of Reinforcement Learning Algorithms: Evaluation Accuracy and Training Reward over Steps
### Overview
The image displays two line charts side-by-side, sharing a common legend. Both charts plot the performance of several reinforcement learning algorithms over 160 training steps, with a fixed `sync_interval = 20`. The left chart measures "Evaluation Accuracy," and the right chart measures "Training Reward." The charts compare variants of GRPO, REC-OneSide-NoIS, and REINFORCE algorithms with different learning rates (lr).
### Components/Axes
* **Common Title (Top Center):** `sync_interval = 20` (appears above both charts).
* **Left Chart:**
  * **Y-Axis Label:** `Evaluation Accuracy`
  * **Y-Axis Scale:** Linear, from 0 to 0.8, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8.
  * **X-Axis Label:** `Training Steps`
  * **X-Axis Scale:** Linear, from 0 to 160, with major ticks every 20 steps (0, 20, 40, ..., 160).
* **Right Chart:**
  * **Y-Axis Label:** `Training Reward`
  * **Y-Axis Scale:** Linear, from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
  * **X-Axis Label:** `Training Steps`
  * **X-Axis Scale:** Identical to the left chart.
* **Legend (Bottom Center, spanning both charts):** Contains 8 entries, each with a unique color, marker, and label.
  1. **Color:** Light teal, **Marker:** Circle, **Label:** `GRPO (0.2) (lr = 1e-5)`
  2. **Color:** Medium teal, **Marker:** Square, **Label:** `GRPO (0.2) (lr = 2e-6)`
  3. **Color:** Dark teal, **Marker:** Diamond, **Label:** `GRPO (0.2) (lr = 5e-6)`
  4. **Color:** Light purple, **Marker:** Circle, **Label:** `REC-OneSide-NoIS (0.2) (lr = 1e-5)`
  5. **Color:** Medium purple, **Marker:** Square, **Label:** `REC-OneSide-NoIS (0.2) (lr = 2e-6)`
  6. **Color:** Dark purple, **Marker:** Diamond, **Label:** `REC-OneSide-NoIS (0.2) (lr = 5e-6)`
  7. **Color:** Deep violet, **Marker:** Circle, **Label:** `REC-OneSide-NoIS (0.6, 2.0) (lr = 1e-6)`
  8. **Color:** Light blue, **Marker:** Circle, **Label:** `REINFORCE (lr = 2e-6)`
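The layout described above can be sketched with matplotlib. The series values below are illustrative placeholders eyeballed from this description (and only four of the eight legend entries are included), not the underlying experimental data; colors are rough guesses at the palette.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

steps = np.arange(0, 161, 20)  # ticks every 20 steps, 0 through 160

# Placeholder accuracy trajectories (label -> (color, marker, values)),
# eyeballed from the chart description, NOT the real experimental data.
series = {
    "GRPO (0.2) (lr = 1e-5)": ("#7fd4cc", "o",
        [0.35, 0.45, 0.50, 0.54, 0.57, 0.59, 0.60, 0.61, 0.62]),
    "GRPO (0.2) (lr = 5e-6)": ("#1f6f6a", "D",
        [0.35, 0.47, 0.54, 0.58, 0.62, 0.64, 0.66, 0.67, 0.68]),
    "REC-OneSide-NoIS (0.6, 2.0) (lr = 1e-6)": ("#4b0082", "o",
        [0.35, 0.48, 0.58, 0.65, 0.70, 0.73, 0.75, 0.77, 0.78]),
    "REINFORCE (lr = 2e-6)": ("#87ceeb", "o",
        [0.35, 0.45, 0.05, 0.02, 0.03, 0.30, 0.55, 0.20, 0.02]),
}

fig, (ax_acc, ax_rew) = plt.subplots(1, 2, figsize=(10, 4))
fig.suptitle("sync_interval = 20")  # common title above both panels

for label, (color, marker, acc) in series.items():
    ax_acc.plot(steps, acc, color=color, marker=marker, label=label)

ax_acc.set_xlabel("Training Steps"); ax_acc.set_ylabel("Evaluation Accuracy")
ax_rew.set_xlabel("Training Steps"); ax_rew.set_ylabel("Training Reward")
ax_acc.set_ylim(0, 0.8); ax_rew.set_ylim(0.0, 1.0)

# Single legend centred below both panels, as in the figure.
fig.legend(loc="lower center", ncol=2, fontsize=7)
fig.tight_layout(rect=(0, 0.18, 1, 0.95))  # leave room for legend and title
fig.savefig("comparison.png")
```

The right panel would be populated the same way with reward trajectories; only the accuracy panel is filled in here to keep the sketch short.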
### Detailed Analysis
**Left Chart: Evaluation Accuracy**
* **Trend Verification:** Most lines trend upward, indicating accuracy improves with training. The REINFORCE line (light blue) is the major exception, showing high volatility and a steep drop.
* **Data Series & Approximate Values:**
  * **REC-OneSide-NoIS (0.6, 2.0) (lr = 1e-6) [Deep violet, circle]:** The top performer. Starts at ~0.35 and rises steadily to ~0.78 by step 160.
  * **REC-OneSide-NoIS (0.2) variants [Purples]:** Cluster in the middle. The `lr=5e-6` (dark purple, diamond) variant performs best among them, reaching ~0.70; the `lr=1e-5` (light purple, circle) variant is the lowest of this group, ending near 0.65.
  * **GRPO (0.2) variants [Teals]:** Also cluster in the middle, slightly below the REC-OneSide-NoIS (0.2) group. The `lr=5e-6` (dark teal, diamond) variant is the best of the GRPO runs, ending near 0.68; the `lr=1e-5` (light teal, circle) variant ends near 0.62.
  * **REINFORCE (lr = 2e-6) [Light blue, circle]:** Highly anomalous. Starts at ~0.35, rises to ~0.45 by step 20, then plummets to near 0.0 by step 40. It shows a partial, volatile recovery between steps 100 and 140 (peaking at ~0.55) before dropping back to near 0.0 at step 160.
**Right Chart: Training Reward**
* **Trend Verification:** As with accuracy, most reward lines trend upward. The REINFORCE line again shows a catastrophic drop and a poor recovery.
* **Data Series & Approximate Values:**
  * **REC-OneSide-NoIS (0.6, 2.0) (lr = 1e-6) [Deep violet, circle]:** Clearly dominant. Starts at ~0.45, climbs rapidly to ~0.8 by step 40, and continues to a near-perfect reward of ~0.98 by step 160.
  * **REC-OneSide-NoIS (0.2) variants [Purples]:** Form a middle cluster. The `lr=5e-6` (dark purple, diamond) variant leads this subgroup, reaching ~0.90; the `lr=1e-5` (light purple, circle) variant is the lowest, ending near 0.70.
  * **GRPO (0.2) variants [Teals]:** Cluster below the REC (0.2) group. The `lr=5e-6` (dark teal, diamond) variant is the best of the GRPO runs, ending near 0.80; the `lr=1e-5` (light teal, circle) variant ends near 0.65.
  * **REINFORCE (lr = 2e-6) [Light blue, circle]:** Shows a severe failure mode. Starts at ~0.45, rises briefly to ~0.65 by step 20, then crashes to 0.0 by step 40. It remains near 0.0 until step 120, after which it makes a weak, noisy recovery to only ~0.40 by step 160.
### Key Observations
1. **Clear Performance Hierarchy:** The `REC-OneSide-NoIS (0.6, 2.0)` algorithm with a low learning rate (`1e-6`) is the unequivocal best performer on both metrics, achieving near-maximum reward and highest accuracy.
2. **Learning Rate Sensitivity:** For both GRPO and REC-OneSide-NoIS (0.2), the intermediate learning rate (`5e-6`) consistently outperforms the higher (`1e-5`) and lower (`2e-6`) rates tested, suggesting an optimal LR exists in that range for these configurations.
3. **Catastrophic Failure of REINFORCE:** The REINFORCE algorithm exhibits a complete collapse in performance early in training (around step 40) on both metrics. Its recovery is minimal and unstable, indicating severe instability with the given hyperparameters (`lr=2e-6`, `sync_interval=20`).
4. **Correlation Between Metrics:** There is a strong positive correlation between Evaluation Accuracy and Training Reward for all stable algorithms. The line shapes in both charts are very similar for each corresponding algorithm/color.
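Observation 4 can be checked numerically: for paired per-step series from a stable run, a Pearson coefficient close to 1 confirms the shared curve shape. The arrays below are illustrative values read off the chart description (the top REC-OneSide-NoIS run), not the real data.

```python
import numpy as np

# Synthetic accuracy/reward trajectories approximating the described
# REC-OneSide-NoIS (0.6, 2.0) run; values are illustrative, not measured.
accuracy = np.array([0.35, 0.48, 0.58, 0.65, 0.70, 0.73, 0.75, 0.77, 0.78])
reward   = np.array([0.45, 0.65, 0.80, 0.86, 0.90, 0.93, 0.95, 0.97, 0.98])

# Pearson correlation read from the off-diagonal of the 2x2 correlation matrix.
r = np.corrcoef(accuracy, reward)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to 1 for two monotonically co-rising curves
```

For the REINFORCE run the same computation would also yield a high r, since accuracy and reward collapse and recover roughly together; correlation between the metrics does not by itself imply stable training.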
### Interpretation
The figure presents a controlled experiment comparing policy-gradient methods in reinforcement learning. The key finding is the clear superiority of the `REC-OneSide-NoIS` algorithm, particularly with the `(0.6, 2.0)` parameterization and a very low learning rate (`1e-6`); this configuration achieves robust, high-performance learning.
The results highlight two critical factors for successful training in this context:
1. **Algorithm Choice:** The `REC-OneSide-NoIS` approach appears more stable and sample-efficient than both GRPO and the classic REINFORCE under these conditions.
2. **Hyperparameter Tuning:** Learning rate is a crucial hyperparameter. The performance gap between LR variants of the same algorithm is substantial, and the wrong choice (as seen with REINFORCE) can lead to complete training failure.
The `sync_interval=20` parameter is held constant, so its effect is not evaluated here. The catastrophic drop in the REINFORCE line suggests a potential issue with high variance in gradient estimates, which the other algorithms seem to mitigate more effectively. This visualization would be used to justify the selection of `REC-OneSide-NoIS (0.6, 2.0) (lr=1e-6)` for further experiments or deployment, and to caution against using REINFORCE without significant modification or different hyperparameter tuning.
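The gradient-variance explanation can be illustrated with a toy estimate. Vanilla REINFORCE weights each sample's log-probability gradient by the raw return; group-normalized methods in the GRPO family instead subtract a per-group mean baseline, which lowers the variance of that weight. Everything below (group size, binary rewards, success probability) is a hypothetical setup for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: many groups of G rollouts with binary rewards,
# each reward a Bernoulli(0.5) draw. The "weight" is the scalar that
# multiplies each sample's log-prob gradient in the policy update.
G = 8
trials = 10_000
rewards = rng.binomial(1, 0.5, size=(trials, G)).astype(float)

reinforce_w = rewards                                        # raw return
centered_w = rewards - rewards.mean(axis=1, keepdims=True)   # group-mean baseline

# Subtracting the group mean shrinks the variance of the update weight
# (by a factor of roughly 1 - 1/G for i.i.d. rewards).
print("var(REINFORCE weights):      ", reinforce_w.var())
print("var(group-centered weights): ", centered_w.var())
```

The reduction here is modest because the baseline is only a mean; in practice the gap between raw-return and baselined estimators grows with reward scale and group heterogeneity, which is consistent with REINFORCE being the only method in the figure that collapses.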