## Line Chart: Final Result on OlympiadBench
### Overview
The image displays a line chart titled "Final Result on OlympiadBench." It plots the performance of two accuracy metrics over the course of reinforcement learning training steps. The chart shows a general upward trend for both metrics, indicating improvement with increased training.
### Components/Axes
* **Chart Title:** "Final Result on OlympiadBench" (centered at the top).
* **X-Axis:** Labeled "Training Steps of Reinforcement Learning." The axis has major tick marks at intervals of 50, labeled: 0, 50, 100, 150, 200.
* **Y-Axis:** Labeled "Average Benchmark Accuracy (%)". The axis has major tick marks at intervals of 1, labeled from 35 to 44.
* **Legend:** Located in the top-left corner of the chart area. It contains two entries:
    * A green line labeled "Sum. M Accuracy"
    * A blue line labeled "Final Accuracy"
* **Data Series:** Two lines plotted on the chart:
1. A **green line** representing "Sum. M Accuracy."
2. A **blue line** representing "Final Accuracy."
### Detailed Analysis
**Data Series 1: Sum. M Accuracy (Green Line)**
* **Trend:** The green line trends upward with one visible dip at step 125. It starts at the lowest point on the chart (~35.5%) and ends about 5 percentage points higher (~40.5%).
* **Approximate Data Points:**
    * Step 0: ~35.5%
    * Step 25: ~35.8%
    * Step 50: ~36.5%
    * Step 75: ~37.5%
    * Step 100: ~39.0% (local peak)
    * Step 125: ~38.0% (dip)
    * Step 150: ~39.5%
    * Step 175: ~39.8%
    * Step 200: ~40.5%
**Data Series 2: Final Accuracy (Blue Line)**
* **Trend:** The blue line also trends upward, gaining about 3.5 percentage points overall, but with sharper swings than the green line: it dips at steps 25 and 125. It remains above the green line at every plotted step.
* **Approximate Data Points:**
    * Step 0: ~39.5%
    * Step 25: ~39.0% (dip)
    * Step 50: ~40.5%
    * Step 75: ~41.0%
    * Step 100: ~42.0% (significant peak)
    * Step 125: ~40.0% (sharp dip)
    * Step 150: ~41.5%
    * Step 175: ~42.0%
    * Step 200: ~43.0% (highest point)
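
The approximate readings above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the values are visual estimates taken from the chart, and the helper names `net_gain` and `largest_drop` are hypothetical, not part of any source code behind the figure.

```python
# Approximate values read off the chart; visual estimates, not exact data.
steps = [0, 25, 50, 75, 100, 125, 150, 175, 200]
sum_m_acc = [35.5, 35.8, 36.5, 37.5, 39.0, 38.0, 39.5, 39.8, 40.5]
final_acc = [39.5, 39.0, 40.5, 41.0, 42.0, 40.0, 41.5, 42.0, 43.0]


def net_gain(values):
    """Change from the first to the last reading, in percentage points."""
    return round(values[-1] - values[0], 1)


def largest_drop(steps, values):
    """Return (step, change) for the most negative step-to-step change."""
    changes = [
        (steps[i], round(values[i] - values[i - 1], 1))
        for i in range(1, len(values))
    ]
    return min(changes, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, series in [("Sum. M Accuracy", sum_m_acc),
                         ("Final Accuracy", final_acc)]:
        step, change = largest_drop(steps, series)
        print(f"{name}: net gain {net_gain(series):+.1f} points, "
              f"largest drop {change:+.1f} points at step {step}")
```

With these estimates, "Sum. M Accuracy" gains about 5.0 points and "Final Accuracy" about 3.5 points, and the largest single drop for both series lands at step 125.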
### Key Observations
1. **Consistent Performance Gap:** The "Final Accuracy" (blue) is higher than the "Sum. M Accuracy" (green) at every measured training step. The gap ranges from roughly 2 to 4 percentage points and narrows over training, from about 4 points at step 0 to about 2.5 points at step 200.
2. **Correlated Movements:** Both lines often move in tandem. For example, both show a local peak at step 100 and a subsequent dip at step 125, suggesting a common factor affecting both metrics at those training stages.
3. **Peak Performance:** Both metrics achieve their highest values at the final recorded step (200), with "Final Accuracy" reaching ~43% and "Sum. M Accuracy" reaching ~40.5%.
4. **Volatility:** The "Final Accuracy" line exhibits sharper peaks and valleys (e.g., the pronounced peak at step 100 and dip at step 125) compared to the somewhat smoother progression of the "Sum. M Accuracy" line.
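
The observations about the gap, the tandem movement, and the volatility can be quantified from the approximate readings. This is a minimal sketch under the assumption that the estimated values are close enough to the plotted ones; the function names are hypothetical, and mean absolute step-to-step change is used here as one simple stand-in for "volatility," not a measure taken from the source.

```python
# Approximate values read off the chart; visual estimates, not exact data.
steps = [0, 25, 50, 75, 100, 125, 150, 175, 200]
sum_m_acc = [35.5, 35.8, 36.5, 37.5, 39.0, 38.0, 39.5, 39.8, 40.5]
final_acc = [39.5, 39.0, 40.5, 41.0, 42.0, 40.0, 41.5, 42.0, 43.0]


def gaps(upper, lower):
    """Per-step difference between two series, in percentage points."""
    return [round(u - l, 1) for u, l in zip(upper, lower)]


def mean_abs_change(values):
    """Average size of the step-to-step change, ignoring direction."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


def same_direction_count(a, b):
    """Count intervals where both series rise together or fall together."""
    count = 0
    for i in range(1, len(a)):
        if (a[i] - a[i - 1]) * (b[i] - b[i - 1]) > 0:
            count += 1
    return count


if __name__ == "__main__":
    g = gaps(final_acc, sum_m_acc)
    print("gap per step:", dict(zip(steps, g)))
    print("gap range:", min(g), "to", max(g))
    print("mean |change|, Sum. M:", round(mean_abs_change(sum_m_acc), 3))
    print("mean |change|, Final:", round(mean_abs_change(final_acc), 3))
    print("intervals moving together:",
          same_direction_count(sum_m_acc, final_acc), "of", len(steps) - 1)
```

On these estimates the two series move in the same direction in 7 of the 8 intervals (they diverge only between steps 0 and 25), and the blue line's average step-to-step change is larger than the green line's.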
### Interpretation
The chart indicates that reinforcement learning training improves performance on the OlympiadBench benchmark. Both lines trend upward, so accuracy generally rises as the model undergoes more training steps, although the improvement is not monotonic.
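
One way to summarize the upward trajectory is an ordinary least-squares slope over the approximate readings. The sketch below is an illustration, not something the chart itself reports: the data are visual estimates, `ls_slope` is a hypothetical helper, and a straight line is only a rough summary of curves that visibly dip around step 125.

```python
# Approximate values read off the chart; visual estimates, not exact data.
steps = [0, 25, 50, 75, 100, 125, 150, 175, 200]
sum_m_acc = [35.5, 35.8, 36.5, 37.5, 39.0, 38.0, 39.5, 39.8, 40.5]
final_acc = [39.5, 39.0, 40.5, 41.0, 42.0, 40.0, 41.5, 42.0, 43.0]


def ls_slope(x, y):
    """Ordinary least-squares slope of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


if __name__ == "__main__":
    for name, series in [("Sum. M Accuracy", sum_m_acc),
                         ("Final Accuracy", final_acc)]:
        per_100 = 100 * ls_slope(steps, series)
        print(f"{name}: about {per_100:.2f} points per 100 steps")
```

Both slopes come out positive on these estimates, which is consistent with the upward trend described above.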
The persistent gap between "Final Accuracy" and "Sum. M Accuracy" suggests these are measuring different aspects of performance. "Final Accuracy" likely represents the model's ultimate answer accuracy, while "Sum. M Accuracy" might be a component score or a metric from an intermediate step (e.g., summarization or multiple-choice accuracy). The fact that the final answer accuracy is higher implies the model may be effectively synthesizing or correcting intermediate outputs to arrive at better final answers.
The correlated dip after step 100 is a notable anomaly. This could indicate a period of instability in training, such as the model encountering a particularly challenging subset of data, a change in the learning rate, or a temporary overfitting phenomenon before recovering and continuing to improve. The overall trend, however, is positive, showing that extended training (up to 200 steps) yields better results on this benchmark.