## Line Chart: Pass Rate vs. SWE-Agent SFT Tokens
### Overview
This image is a line chart showing the performance (measured as "Pass Rate (%)") of four machine learning models or training methodologies across varying amounts of training data (measured in "# SWE-Agent SFT tokens"). Each method is evaluated with three metrics (Pass@1, Pass@2, and Pass@3), yielding 12 distinct data series.
### Components/Axes
**1. Y-Axis (Left):**
* **Label:** `Pass Rate (%)`
* **Scale:** Linear, ranging from 0 to 65.
* **Markers/Ticks:** Major ticks are marked at 0, 10, 20, 30, 40, 50, and 60. Faint, solid light-gray horizontal gridlines extend from these ticks across the chart area.
**2. X-Axis (Bottom):**
* **Label:** `# SWE-Agent SFT tokens`
* **Scale:** Categorical/Non-linear progression of token counts.
* **Markers/Ticks:** 0, $2^{21}$, $2^{23}$, $2^{24}$, $1.1 \times 2^{25}$, $1.1 \times 2^{26}$, $1.1 \times 2^{27}$, $1.5 \times 2^{28}$. Vertical dashed gray gridlines extend upward from each tick mark.
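The non-linear spacing is easier to see in raw counts; a quick sketch converting the tick labels (as written on the axis) to millions of tokens:

```python
# Convert the x-axis tick labels to approximate raw token counts.
ticks = {
    "0": 0,
    "2^21": 2**21,
    "2^23": 2**23,
    "2^24": 2**24,
    "1.1 * 2^25": 1.1 * 2**25,
    "1.1 * 2^26": 1.1 * 2**26,
    "1.1 * 2^27": 1.1 * 2**27,
    "1.5 * 2^28": 1.5 * 2**28,
}
for label, count in ticks.items():
    print(f"{label:>12} -> {count / 1e6:5.1f}M tokens")
```

This makes clear that each tick roughly doubles the token budget, with a larger jump (about 2.7x) at the final tick.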
**3. Legend (Right):**
Positioned outside the main chart area on the right side, enclosed in a bounding box. It maps colors to methodologies and shapes to metrics.
* **Colors (Methodologies):**
* Red: RL (Reinforcement Learning)
* Orange: SFT (Supervised Fine-Tuning)
* Purple: MT (Multi-Task)
* Blue: Base
* **Shapes (Metrics):**
* Circle: Pass@1
* Square: Pass@2
* Triangle: Pass@3
* **Exact Legend Entries (Top to Bottom):**
* Red Circle: `RL Pass@1`
* Red Square: `RL Pass@2`
* Red Triangle: `RL Pass@3`
* Orange Circle: `SFT Pass@1`
* Orange Square: `SFT Pass@2`
* Orange Triangle: `SFT Pass@3`
* Purple Circle: `MT Pass@1`
* Purple Square: `MT Pass@2`
* Purple Triangle: `MT Pass@3`
* Blue Circle: `Base Pass@1`
* Blue Square: `Base Pass@2`
* Blue Triangle: `Base Pass@3`
### Detailed Analysis
**Visual Encoding & Trend Verification:**
The chart utilizes two types of lines to convey information:
1. **Solid Lines (Intra-token scaling):** At every X-axis tick, for every color, a solid line connects the Circle (Pass@1) to the Square (Pass@2) to the Triangle (Pass@3). *Trend:* These solid lines slope upward (or lie flat where all three values sit near zero, as for Base and MT at the leftmost tick), confirming that for any given model at any given training stage, Pass@3 ≥ Pass@2 ≥ Pass@1.
2. **Dashed Lines (Inter-token scaling):** Dashed lines connect identical shapes of the same color across different X-axis ticks (e.g., connecting all Red Circles). *Trend:* The general trend for all dashed lines is upward from left to right, indicating that increasing SFT tokens generally improves the pass rate across all methods and metrics.
**Data Extraction Table:**
*Note: Values are visual approximations derived from the Y-axis scale (±1%).*
| X-Axis Tick | Method (Color) | Pass@1 (Circle) | Pass@2 (Square) | Pass@3 (Triangle) |
| :--- | :--- | :--- | :--- | :--- |
| **0** | Base (Blue) | ~0% | ~0% | ~0% |
| | MT (Purple) | ~1% | ~1% | ~1% |
| | RL (Red) | ~4% | ~9% | ~12% |
| | SFT (Orange) | ~8% | ~13% | ~16% |
| **$2^{21}$** | Base (Blue) | ~1% | ~2% | ~3% |
| | MT (Purple) | ~5% | ~6% | ~7% |
| | SFT (Orange) | ~20% | ~33% | ~38% |
| | RL (Red) | ~23% | ~33% | ~39% |
| **$2^{23}$** | Base (Blue) | ~16% | ~24% | ~28% |
| | MT (Purple) | ~27% | ~36% | ~44% |
| | SFT (Orange) | ~27% | ~35% | ~41% |
| | RL (Red) | ~33% | ~43% | ~48% |
| **$2^{24}$** | Base (Blue) | ~13% | ~22% | ~28% |
| | SFT (Orange) | ~20% | ~31% | ~36% |
| | MT (Purple) | ~29% | ~41% | ~47% |
| | RL (Red) | ~34% | ~42% | ~47% |
| **$1.1 \times 2^{25}$** | Base (Blue) | ~12% | ~27% | ~36% |
| | MT (Purple) | ~31% | ~46% | ~52% |
| | RL (Red) | ~34% | ~45% | ~50% |
| | SFT (Orange) | ~35% | ~45% | ~51% |
| **$1.1 \times 2^{26}$** | Base (Blue) | ~22% | ~38% | ~45% |
| | MT (Purple) | *No Data Plotted* | *No Data Plotted* | *No Data Plotted* |
| | SFT (Orange) | ~37% | ~49% | ~55% |
| | RL (Red) | ~38% | ~51% | ~58% |
| **$1.1 \times 2^{27}$** | Base (Blue) | ~33% | ~48% | ~52% |
| | SFT (Orange) | ~44% | ~55% | ~59% |
| | RL (Red) | ~44% | ~56% | ~60% |
| | MT (Purple) | ~45% | ~55% | ~60% |
| **$1.5 \times 2^{28}$** | Base (Blue) | ~36% | ~48% | ~54% |
| | MT (Purple) | ~46% | ~55% | ~60% |
| | SFT (Orange) | ~48% | ~58% | ~62% |
| | RL (Red) | ~49% | ~58% | ~64% |
### Key Observations
1. **Missing Data:** The MT (Purple) series has a distinct gap; there are no data points plotted at the $1.1 \times 2^{26}$ token mark. The dashed lines bridge directly from $1.1 \times 2^{25}$ to $1.1 \times 2^{27}$.
2. **Performance Hierarchy:** Throughout almost the entire chart, RL (Red) and SFT (Orange) are the top-performing methods, often overlapping or tracking very closely together. MT (Purple) generally sits in the middle, while the Base model (Blue) consistently yields the lowest pass rates.
3. **Anomalous Dips:** At the $2^{24}$ token mark, there is a noticeable regression in performance for the Base (Blue) and SFT (Orange) models compared to their performance at $2^{23}$. The Base Pass@1 drops from ~16% to ~13%, and SFT Pass@1 drops significantly from ~27% to ~20%.
4. **Convergence at Scale:** As the token count reaches the maximum ($1.5 \times 2^{28}$), the performance gap between the methods begins to narrow, particularly between RL, SFT, and MT, which all cluster tightly between 55% and 64% for Pass@2 and Pass@3.
### Interpretation
This chart demonstrates the efficacy of different training interventions on a language model's ability to successfully complete software engineering tasks (implied by "SWE-Agent").
* **The Value of Multiple Attempts:** The upward-sloping solid lines at nearly every interval show that allowing the agent multiple attempts (Pass@3 vs. Pass@1) substantially improves the likelihood of success, regardless of the underlying model or training stage.
* **Training Efficacy:** The data shows that fine-tuning (SFT) and Reinforcement Learning (RL) provide large early advantages over the Base model. For instance, at $2^{21}$ tokens, SFT and RL already achieve nearly 40% Pass@3, while the Base model is barely above 0%.
* **Scaling Laws:** The overall upward trajectory of the dashed lines is consistent with a standard scaling trend: exposing the model to more SFT tokens generally increases its pass rate. However, the dips at $2^{24}$ show that improvement is not strictly monotonic; training may experience instability or require learning-rate adjustments at certain phases.
* **Diminishing Returns:** While performance is still climbing at the far right of the chart ($1.5 \times 2^{28}$), the slope of the dashed lines is beginning to flatten slightly compared to the explosive growth seen between $2^{21}$ and $2^{23}$. This suggests that while more data helps, the marginal utility of each additional token is decreasing, and the models may be approaching an asymptotic performance limit for this specific benchmark.
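The Pass@k metric on the y-axis is commonly computed with the unbiased estimator from n sampled attempts with c successes (a sketch; whether this chart uses that estimator or a simple best-of-k count is not stated in the image):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    attempts, drawn without replacement from n attempts containing
    c successes, is a success."""
    if n - c < k:
        return 1.0  # too few failures: every k-subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 attempts and 3 successes, pass@1 reduces to the raw success rate:
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Under this reading, the Pass@2 and Pass@3 curves necessarily dominate Pass@1 for the same underlying samples, which is consistent with the solid-line trend described earlier.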