## Line Charts: Performance Metrics with/without LC Reward
### Overview
The image displays three horizontally arranged line charts comparing a system trained with an "LC Reward" signal ("w/ LC Reward", blue line with circle markers) against one trained without it ("w/o LC Reward", green line with circle markers) over 5000 training steps. The charts track three distinct metrics: "LC Reward", "LiveCodeBench Pass@1", and "AIME Accuracy".
### Components/Axes
* **Chart 1 (Left): LC Reward**
* **Title:** "LC Reward"
  * **Y-axis:** Unlabeled; the scale represents the "LC Reward" value. Range: 0.86 to 1.00. Major ticks at 0.86, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98, 1.00.
* **X-axis:** Label: "Steps". Range: 0 to 5000. Major ticks at 0, 1000, 2000, 3000, 4000, 5000.
* **Legend:** Located in the bottom-left corner. Blue line with circle marker: "w/ LC Reward". Green line with circle marker: "w/o LC Reward".
* **Chart 2 (Center): LiveCodeBench Pass@1**
* **Title:** "LiveCodeBench Pass@1"
  * **Y-axis:** Unlabeled; the scale represents the "Pass@1" score. Range: 0.38 to 0.50. Major ticks at 0.38, 0.40, 0.42, 0.44, 0.46, 0.48, 0.50.
* **X-axis:** Label: "Steps". Range: 0 to 5000. Major ticks at 0, 1000, 2000, 3000, 4000, 5000.
* **Legend:** Located in the top-left corner. Blue line with circle marker: "w/ LC Reward". Green line with circle marker: "w/o LC Reward".
* **Chart 3 (Right): AIME Accuracy**
* **Title:** "AIME Accuracy"
  * **Y-axis:** Unlabeled; the scale represents the "Accuracy" score. Range: 0.450 to 0.625. Major ticks at 0.450, 0.475, 0.500, 0.525, 0.550, 0.575, 0.600, 0.625.
* **X-axis:** Label: "Steps". Range: 0 to 5000. Major ticks at 0, 1000, 2000, 3000, 4000, 5000.
* **Legend:** Located in the top-left corner. Blue line with circle marker: "w/ LC Reward". Green line with circle marker: "w/o LC Reward".
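The three-panel layout described above can be sketched with matplotlib. The titles, axis ranges, legend placements, and marker styles follow the description; all y-values below are invented placeholders purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch renders without a display
import matplotlib.pyplot as plt
import numpy as np

steps = np.arange(0, 5001, 250)  # x-grid: steps 0..5000

# (title, y-limits, legend location) per panel, as described above.
panels = [
    ("LC Reward", (0.86, 1.00), "lower left"),
    ("LiveCodeBench Pass@1", (0.38, 0.50), "upper left"),
    ("AIME Accuracy", (0.450, 0.625), "upper left"),
]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.2))
for ax, (title, (lo, hi), legend_loc) in zip(axes, panels):
    # Placeholder curves: invented ramps spanning each panel's y-range.
    y_blue = np.linspace(lo, hi, steps.size)
    # Chart 1's green line declines; in the other panels it sits slightly above blue.
    y_green = y_blue[::-1] if title == "LC Reward" else np.minimum(y_blue + 0.005, hi)
    ax.plot(steps, y_blue, "o-", color="tab:blue", markersize=3, label="w/ LC Reward")
    ax.plot(steps, y_green, "o-", color="tab:green", markersize=3, label="w/o LC Reward")
    ax.set_title(title)
    ax.set_xlabel("Steps")
    ax.set_ylim(lo, hi)
    ax.legend(loc=legend_loc)
fig.tight_layout()
```

Substituting the real logged curves for the placeholder ramps would reproduce the figure.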
### Detailed Analysis
**Chart 1: LC Reward**
* **Trend Verification:** The blue line ("w/ LC Reward") shows a generally high and stable trend with moderate fluctuations. The green line ("w/o LC Reward") shows a clear downward trend with high volatility, especially after step 3000.
* **Data Points (Approximate):**
* **w/ LC Reward (Blue):** Starts ~0.97 at step 0. Fluctuates between ~0.96 and ~0.99 throughout. Ends near ~0.99 at step 5000.
* **w/o LC Reward (Green):** Starts ~0.96 at step 0. Shows a gradual decline with significant dips. Notable low points: ~0.91 at step ~1800, ~0.90 at step ~2800, and a sharp drop to ~0.87 at step ~3800. Ends near ~0.87 at step 5000.
**Chart 2: LiveCodeBench Pass@1**
* **Trend Verification:** Both lines show a strong upward trend from step 0 to step 5000. The green line ("w/o LC Reward") appears to overtake and consistently stay above the blue line ("w/ LC Reward") after approximately step 1000.
* **Data Points (Approximate):**
* **w/ LC Reward (Blue):** Starts ~0.38 at step 0. Rises steadily to ~0.46 by step 2000. Continues rising to end near ~0.48 at step 5000.
* **w/o LC Reward (Green):** Starts ~0.38 at step 0. Rises more steeply, reaching ~0.47 by step 2000. Maintains a lead, ending near ~0.50 at step 5000.
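Pass@1, the metric in the center chart, is the fraction of problems solved by a single sampled solution. The standard unbiased pass@k estimator generalizes this when n samples are drawn per problem; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n total samples of which c are correct,
    is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # for k=1 this reduces to c/n, i.e. 0.3
```

Averaging `pass_at_k(n, c, 1)` over all benchmark problems yields the Pass@1 score plotted on the y-axis.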
**Chart 3: AIME Accuracy**
* **Trend Verification:** Both lines show a strong upward trend from step 0 to step 5000. The lines are closely intertwined, with the blue line ("w/ LC Reward") showing slightly higher peaks in the later stages (after step 3000).
* **Data Points (Approximate):**
* **w/ LC Reward (Blue):** Starts ~0.45 at step 0. Rises to ~0.55 by step 2000. Shows high volatility in the later half, with peaks reaching ~0.62 near step 4500. Ends near ~0.58 at step 5000.
* **w/o LC Reward (Green):** Starts ~0.45 at step 0. Rises to ~0.55 by step 2000. Follows a similar volatile path but with slightly lower peaks, ending near ~0.57 at step 5000.
### Key Observations
1. **Divergent Impact:** The "LC Reward" signal affects the metrics in opposite directions: training with it keeps the "LC Reward" value high (Chart 1), but appears to slightly hinder "LiveCodeBench Pass@1" (Chart 2) relative to training without it.
2. **Volatility:** All metrics show significant step-to-step volatility, particularly in the later stages of training (after step 2000-3000).
3. **Convergence in AIME:** For "AIME Accuracy" (Chart 3), the two conditions perform very similarly, with no clear, sustained advantage for either, though "w/ LC Reward" hits higher maximum values.
4. **Stability vs. Performance:** The "w/o LC Reward" condition leads to a degradation of the "LC Reward" metric itself (Chart 1) but correlates with improved performance on the "LiveCodeBench" coding benchmark (Chart 2).
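The step-to-step volatility noted in observation 2 is usually damped before comparing conditions, for instance with an exponential moving average. A minimal sketch (function name, smoothing factor, and data are illustrative, not from the figure):

```python
def ema(values, alpha: float = 0.3):
    """Exponential moving average: each point blends the new value with the
    running average, damping step-to-step noise while preserving the trend."""
    smoothed = []
    avg = values[0]  # seed with the first observation
    for v in values:
        avg = alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

noisy = [0.45, 0.55, 0.48, 0.60, 0.52, 0.62]  # invented volatile curve
print(ema(noisy))
```

A smaller `alpha` smooths more aggressively at the cost of lagging behind the true trend.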
### Interpretation
The data suggests a trade-off or a nuanced relationship between optimizing for the internal "LC Reward" signal and performance on external benchmarks. The "LC Reward" appears to be a stable, high-value objective when explicitly trained with it (Chart 1, blue line). However, removing this explicit reward signal ("w/o LC Reward") does not cause catastrophic failure; instead, it leads to a decline in that specific reward value but coincides with improved performance on the LiveCodeBench coding task (Chart 2). This could indicate that the "LC Reward" metric and the "LiveCodeBench Pass@1" metric are not perfectly aligned, or that optimizing directly for the former may lead to some degree of overfitting or a suboptimal policy for the latter.
For accuracy on AIME (the American Invitational Mathematics Examination, a mathematical reasoning benchmark), the impact is negligible, suggesting that this capability develops similarly regardless of the presence of the LC Reward signal. The high volatility across all charts is typical of reinforcement learning or iterative training processes, reflecting exploration and policy updates. The key takeaway is that the design of the reward function ("LC Reward") significantly influences which capabilities are prioritized and stabilized during training, with potential trade-offs between different performance metrics.
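The alignment question raised above (whether the "LC Reward" signal tracks LiveCodeBench Pass@1) could be checked numerically, for example with a Pearson correlation between the two logged curves. A sketch with invented data shaped like the green ("w/o LC Reward") condition:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented curves: LC Reward declining while Pass@1 rises, as Charts 1-2 show
# for the "w/o LC Reward" condition.
lc_reward = [0.96, 0.94, 0.92, 0.90, 0.87]
pass_at_1 = [0.38, 0.42, 0.45, 0.48, 0.50]
print(pearson(lc_reward, pass_at_1))  # strongly negative
```

A strongly negative coefficient on the real data would support the misalignment reading; a value near zero would suggest the two metrics simply evolve independently.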