## Line Chart: Scaling Behavior of SWE-Agent Training Methods
### Overview
This is a multi-series line chart illustrating the performance scaling of four different training methods (RL, SFT, MT, Base) for an AI agent called "SWE-Agent." The chart plots the "Pass Rate (%)" against the number of "SWE-Agent SFT tokens" used for training. Each method is evaluated using three metrics: Pass@1, Pass@2, and Pass@3, resulting in 12 distinct data series. The overall trend shows that performance improves for all methods as the training token count increases, though the rate of improvement and starting points vary significantly.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "# SWE-Agent SFT tokens". The scale is logarithmic (base 2); since `0` cannot appear on a true log scale, the leftmost tick implies an axis break or symlog-style offset at the origin. Tick marks appear at the following approximate values: `0`, `2^21` (~2.1 million), `2^23` (~8.4 million), `2^24` (~16.8 million), `1.1 × 2^25` (~36.9 million), `1.1 × 2^26` (~73.8 million), `1.1 × 2^27` (~147.6 million), and `1.5 × 2^28` (~402.7 million).
* **Y-Axis (Vertical):** Labeled "Pass Rate (%)". The scale is linear, ranging from 0 to 60 with major gridlines every 10 units.
* **Legend:** Positioned on the right side of the chart, outside the plot area. It defines the color and marker shape for each of the 12 data series:
* **RL (Red):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **SFT (Orange):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **MT (Purple):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **Base (Blue):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **Grid:** A light gray dashed grid is present for both axes.
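The power-of-two tick labels above can be converted to absolute token counts with a few lines of arithmetic. A minimal sketch (the helper names are illustrative, not from the chart):

```python
# Convert the chart's tick positions, written as coefficient × 2^exponent,
# into absolute token counts and human-readable labels.
def tokens(coeff: float, exp: int) -> int:
    """Token count for a tick label of the form `coeff × 2^exp`."""
    return round(coeff * 2 ** exp)

def fmt_millions(n: int) -> str:
    """Render a token count in millions, e.g. 36909875 -> '~36.9M'."""
    return f"~{n / 1e6:.1f}M"

# The non-zero ticks from the x-axis description.
TICKS = [(1, 21), (1, 23), (1, 24), (1.1, 25), (1.1, 26), (1.1, 27), (1.5, 28)]

for coeff, exp in TICKS:
    print(f"{coeff} x 2^{exp} = {fmt_millions(tokens(coeff, exp))}")
```

Running this reproduces the approximate values quoted on the axis (e.g. `1.5 × 2^28` is ~402.7 million tokens).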
### Detailed Analysis
The following table reconstructs the approximate data points for each series at the given token counts. Values are estimated from the chart's visual positioning.
| Token Count (Approx.) | RL Pass@1 | RL Pass@2 | RL Pass@3 | SFT Pass@1 | SFT Pass@2 | SFT Pass@3 | MT Pass@1 | MT Pass@2 | MT Pass@3 | Base Pass@1 | Base Pass@2 | Base Pass@3 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **0** | ~4% | ~9% | ~12% | ~8% | ~13% | ~16% | ~0.5% | ~0.5% | ~0.5% | ~0% | ~0% | ~0% |
| **2^21** | ~23% | ~33% | ~39% | ~20% | ~33% | ~38% | ~5% | ~6% | ~7% | ~1% | ~2% | ~3% |
| **2^23** | ~33% | ~43% | ~48% | ~27% | ~35% | ~41% | ~27% | ~36% | ~44% | ~16% | ~24% | ~28% |
| **2^24** | ~34% | ~42% | ~47% | ~20% | ~31% | ~36% | ~29% | ~41% | ~47% | ~13% | ~22% | ~28% |
| **1.1 × 2^25** | ~34% | ~45% | ~50% | ~35% | ~46% | ~51% | ~31% | ~46% | ~52% | ~12% | ~28% | ~36% |
| **1.1 × 2^26** | ~38% | ~51% | ~58% | ~37% | ~49% | ~55% | ~37% | ~51% | ~58% | ~22% | ~38% | ~45% |
| **1.1 × 2^27** | ~44% | ~56% | ~60% | ~44% | ~55% | ~59% | ~45% | ~55% | ~60% | ~33% | ~48% | ~52% |
| **1.5 × 2^28** | ~49% | ~58% | ~64% | ~48% | ~58% | ~62% | ~46% | ~55% | ~60% | ~36% | ~48% | ~54% |
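For programmatic sanity checks, the table can be transcribed into a small data structure and the expected Pass@k ordering verified. A sketch using the approximate values above (estimates read off the chart, not exact measurements):

```python
# Approximate pass rates (%) per method and k, one value per token-count
# column of the table, in order: 0, 2^21, 2^23, 2^24, 1.1*2^25, 1.1*2^26,
# 1.1*2^27, 1.5*2^28.
DATA = {
    "RL":   {1: [4, 23, 33, 34, 34, 38, 44, 49],
             2: [9, 33, 43, 42, 45, 51, 56, 58],
             3: [12, 39, 48, 47, 50, 58, 60, 64]},
    "SFT":  {1: [8, 20, 27, 20, 35, 37, 44, 48],
             2: [13, 33, 35, 31, 46, 49, 55, 58],
             3: [16, 38, 41, 36, 51, 55, 59, 62]},
    "MT":   {1: [0.5, 5, 27, 29, 31, 37, 45, 46],
             2: [0.5, 6, 36, 41, 46, 51, 55, 55],
             3: [0.5, 7, 44, 47, 52, 58, 60, 60]},
    "Base": {1: [0, 1, 16, 13, 12, 22, 33, 36],
             2: [0, 2, 24, 22, 28, 38, 48, 48],
             3: [0, 3, 28, 28, 36, 45, 52, 54]},
}

def passk_ordered(data) -> bool:
    """True if Pass@3 >= Pass@2 >= Pass@1 at every point for every method."""
    return all(
        series[3][i] >= series[2][i] >= series[1][i]
        for series in data.values()
        for i in range(len(series[1]))
    )
```

This confirms the hierarchy discussed below holds at every transcribed point, and that the chart maximum (~64%, RL Pass@3 at the final tick) matches the table.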
**Trend Verification per Method:**
* **RL (Red Lines):** All three lines show a strong, consistent upward trend. The slope is steep initially (0 to 2^23 tokens) and continues to rise steadily, with RL Pass@3 achieving the highest overall pass rate on the chart.
* **SFT (Orange Lines):** Also shows a strong upward trend. Notably, all three SFT lines dip at 2^24 tokens before recovering, with the drop most pronounced for SFT Pass@1.
* **MT (Purple Lines):** Starts very low (near 0%) but demonstrates the most dramatic scaling. The lines have a very steep slope between 2^21 and 2^23 tokens, eventually converging with and sometimes surpassing the SFT lines at higher token counts.
* **Base (Blue Lines):** Starts at or near 0% and shows the slowest initial growth. Base Pass@1 and Pass@2 even dip slightly around 2^24 tokens before climbing steadily; from 1.1 × 2^25 onward all three lines trend consistently upward, though Base remains the lowest-performing group at every data point.
### Key Observations
1. **Performance Hierarchy:** At every token count, the ordering within each method is Pass@3 ≥ Pass@2 ≥ Pass@1 (with ties only near 0% for MT and Base at 0 tokens). This is expected by construction: Pass@k is monotonically non-decreasing in k, since giving the agent more attempts can only increase the chance that at least one succeeds.
2. **Method Comparison:** RL and SFT methods start with a significant performance advantage over MT and Base at low token counts (0 to 2^21). MT shows a "catch-up" phenomenon, scaling rapidly to match SFT performance at higher token volumes. Base is consistently the lowest-performing method.
3. **Scaling Efficiency:** The steepest gains for RL and SFT occur between 0 and `2^23` tokens, while MT and Base improve most dramatically between `2^21` and `1.1 × 2^25` tokens. After `1.1 × 2^26` tokens, the rate of improvement begins to plateau slightly for most series.
4. **Notable Anomaly:** The SFT Pass@1 series shows a clear performance drop at `2^24` tokens (from ~27% to ~20%) before recovering. This is the most pronounced deviation from the general upward trend in the chart.
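The chart does not state how Pass@k is computed, but the standard unbiased estimator from the code-generation evaluation literature (Chen et al., 2021) is a reasonable assumption. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: the probability that at least one of k
    attempts drawn (without replacement) from n generated samples, of
    which c are correct, solves the task.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer incorrect samples than attempts: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 3 samples of which c = 1 is correct, `pass_at_k(3, 1, 1)` gives 1/3, while `pass_at_k(3, 1, 3)` gives 1.0, matching the intuition that more attempts per task raise the measured pass rate.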
### Interpretation
This chart demonstrates the **scaling laws** for different training paradigms applied to the SWE-Agent. The data suggests that:
* **Data Quantity is Critical:** All methods benefit from more training data (tokens), confirming that scale is a primary driver of performance for this agent.
* **Training Method Matters:** The choice of training method (RL, SFT, MT, Base) has a profound impact on data efficiency. RL and SFT are highly data-efficient, achieving decent performance with relatively few tokens. MT is less efficient initially but scales very effectively. The Base method is the least data-efficient, requiring orders of magnitude more data to achieve comparable results.
* **Metric Sensitivity:** The consistent gap between Pass@1, Pass@2, and Pass@3 highlights that the agent's "first-attempt" success rate is substantially lower than its success rate when given multiple chances. This is a crucial consideration for real-world deployment where the cost of multiple attempts may be high.
* **Practical Implication:** For resource-constrained scenarios (limited training data/compute), RL or SFT would be preferable. If massive data is available, MT becomes a competitive option. The Base method appears to be a weak baseline, likely representing a model without specialized training for the SWE-Agent task. The dip in SFT Pass@1 at `2^24` tokens could indicate a point of instability or overfitting in that specific training run, warranting further investigation.
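As a back-of-the-envelope check on the data-efficiency claim, the table values suggest Base needs roughly 17× more tokens than RL to reach the same ~33% Pass@1. This is illustrative arithmetic from two eyeballed table cells, not a fitted scaling law:

```python
# RL Pass@1 reaches ~33% at 2^23 tokens; Base Pass@1 reaches ~33% at
# 1.1 * 2^27 tokens (both values read off the table above).
RL_TOKENS_AT_33 = 2 ** 23
BASE_TOKENS_AT_33 = 1.1 * 2 ** 27

gap = BASE_TOKENS_AT_33 / RL_TOKENS_AT_33
print(f"Base needs ~{gap:.1f}x more tokens than RL for ~33% Pass@1")
```

A ~17.6× gap is about one order of magnitude, consistent with the qualitative gap visible in the chart.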