## Line Chart: Pass Rate vs. SWE-Agent SFT Tokens
### Overview
This image is a line chart showing the performance (measured as "Pass Rate (%)") of four machine learning models or training methodologies across varying amounts of training data (measured in "# SWE-Agent SFT tokens"). Each method is evaluated with three metrics (Pass@1, Pass@2, and Pass@3), yielding 12 distinct data series.
### Components/Axes
**1. Y-Axis (Left):**
* **Label:** `Pass Rate (%)`
* **Scale:** Linear, ranging from 0 to 65.
* **Markers/Ticks:** Major ticks are marked at 0, 10, 20, 30, 40, 50, and 60. Faint, solid light-gray horizontal gridlines extend from these ticks across the chart area.
**2. X-Axis (Bottom):**
* **Label:** `# SWE-Agent SFT tokens`
* **Scale:** Categorical/Non-linear progression of token counts.
* **Markers/Ticks:** 0, $2^{21}$, $2^{23}$, $2^{24}$, $1.1 \times 2^{25}$, $1.1 \times 2^{26}$, $1.1 \times 2^{27}$, $1.5 \times 2^{28}$. Vertical dashed gray gridlines extend upward from each tick mark.
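The non-linear spacing is easier to see in raw counts; a quick sketch converting the tick labels (as written on the axis) to millions of tokens:

```python
# Convert the x-axis tick labels to approximate raw token counts.
ticks = {
    "0": 0,
    "2^21": 2**21,
    "2^23": 2**23,
    "2^24": 2**24,
    "1.1 * 2^25": 1.1 * 2**25,
    "1.1 * 2^26": 1.1 * 2**26,
    "1.1 * 2^27": 1.1 * 2**27,
    "1.5 * 2^28": 1.5 * 2**28,
}
for label, count in ticks.items():
    print(f"{label:>12} -> {count / 1e6:5.1f}M tokens")
```

This makes clear that each tick roughly doubles the token budget, with a larger jump (about 2.7x) at the final tick.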
**3. Legend (Right):**
Positioned outside the main chart area on the right side, enclosed in a bounding box. It maps colors to methodologies and shapes to metrics.
* **Colors (Methodologies):**
* Red: RL (Reinforcement Learning)
* Orange: SFT (Supervised Fine-Tuning)
* Purple: MT (Multi-Task)
* Blue: Base
* **Shapes (Metrics):**
* Circle: Pass@1
* Square: Pass@2
* Triangle: Pass@3
* **Exact Legend Entries (Top to Bottom):**
* Red Circle: `RL Pass@1`
* Red Square: `RL Pass@2`
* Red Triangle: `RL Pass@3`
* Orange Circle: `SFT Pass@1`
* Orange Square: `SFT Pass@2`
* Orange Triangle: `SFT Pass@3`
* Purple Circle: `MT Pass@1`
* Purple Square: `MT Pass@2`
* Purple Triangle: `MT Pass@3`
* Blue Circle: `Base Pass@1`
* Blue Square: `Base Pass@2`
* Blue Triangle: `Base Pass@3`
### Detailed Analysis
**Visual Encoding & Trend Verification:**
The chart utilizes two types of lines to convey information:
1. **Solid Lines (Intra-token scaling):** At every X-axis tick, for every color, a solid line connects the Circle (Pass@1) to the Square (Pass@2) to the Triangle (Pass@3). *Trend:* These solid lines slope upward (or lie flat where all three values sit near zero, as for Base and MT at the leftmost tick), confirming that for any given model at any given training stage, Pass@3 ≥ Pass@2 ≥ Pass@1.
2. **Dashed Lines (Inter-token scaling):** Dashed lines connect identical shapes of the same color across different X-axis ticks (e.g., connecting all Red Circles). *Trend:* The general trend for all dashed lines is upward from left to right, indicating that increasing SFT tokens generally improves the pass rate across all methods and metrics.
**Data Extraction Table:**
*Note: Values are visual approximations derived from the Y-axis scale (±1%).*
| X-Axis Tick | Method (Color) | Pass@1 (Circle) | Pass@2 (Square) | Pass@3 (Triangle) |
| :--- | :--- | :--- | :--- | :--- |
| **0** | Base (Blue) | ~0% | ~0% | ~0% |
| | MT (Purple) | ~1% | ~1% | ~1% |
| | RL (Red) | ~4% | ~9% | ~12% |
| | SFT (Orange) | ~8% | ~13% | ~16% |
| **$2^{21}$** | Base (Blue) | ~1% | ~2% | ~3% |
| | MT (Purple) | ~5% | ~6% | ~7% |
| | SFT (Orange) | ~20% | ~33% | ~38% |
| | RL (Red) | ~23% | ~33% | ~39% |
| **$2^{23}$** | Base (Blue) | ~16% | ~24% | ~28% |
| | MT (Purple) | ~27% | ~36% | ~44% |
| | SFT (Orange) | ~27% | ~35% | ~41% |
| | RL (Red) | ~33% | ~43% | ~48% |
| **$2^{24}$** | Base (Blue) | ~13% | ~22% | ~28% |
| | SFT (Orange) | ~20% | ~31% | ~36% |
| | MT (Purple) | ~29% | ~41% | ~47% |
| | RL (Red) | ~34% | ~42% | ~47% |
| **$1.1 \times 2^{25}$** | Base (Blue) | ~12% | ~27% | ~36% |
| | MT (Purple) | ~31% | ~46% | ~52% |
| | RL (Red) | ~34% | ~45% | ~50% |
| | SFT (Orange) | ~35% | ~45% | ~51% |
| **$1.1 \times 2^{26}$** | Base (Blue) | ~22% | ~38% | ~45% |
| | MT (Purple) | *No Data Plotted* | *No Data Plotted* | *No Data Plotted* |
| | SFT (Orange) | ~37% | ~49% | ~55% |
| | RL (Red) | ~38% | ~51% | ~58% |
| **$1.1 \times 2^{27}$** | Base (Blue) | ~33% | ~48% | ~52% |
| | SFT (Orange) | ~44% | ~55% | ~59% |
| | RL (Red) | ~44% | ~56% | ~60% |
| | MT (Purple) | ~45% | ~55% | ~60% |
| **$1.5 \times 2^{28}$** | Base (Blue) | ~36% | ~48% | ~54% |
| | MT (Purple) | ~46% | ~55% | ~60% |
| | SFT (Orange) | ~48% | ~58% | ~62% |
| | RL (Red) | ~49% | ~58% | ~64% |
### Key Observations
1. **Missing Data:** The MT (Purple) series has a distinct gap; there are no data points plotted at the $1.1 \times 2^{26}$ token mark. The dashed lines bridge directly from $1.1 \times 2^{25}$ to $1.1 \times 2^{27}$.
2. **Performance Hierarchy:** Throughout almost the entire chart, RL (Red) and SFT (Orange) are the top-performing methods, often overlapping or tracking very closely together. MT (Purple) generally sits in the middle, while the Base model (Blue) consistently yields the lowest pass rates.
3. **Anomalous Dips:** At the $2^{24}$ token mark, there is a noticeable regression in performance for the Base (Blue) and SFT (Orange) models compared to their performance at $2^{23}$. The Base Pass@1 drops from ~16% to ~13%, and SFT Pass@1 drops significantly from ~27% to ~20%.
4. **Convergence at Scale:** As the token count reaches the maximum ($1.5 \times 2^{28}$), the performance gap between the methods begins to narrow, particularly between RL, SFT, and MT, which all cluster tightly between 55% and 64% for Pass@2 and Pass@3.
### Interpretation
This chart demonstrates the efficacy of different training interventions on a language model's ability to successfully complete software engineering tasks (implied by "SWE-Agent").
* **The Value of Multiple Attempts:** The upward-sloping solid lines at nearly every interval show that allowing the agent multiple attempts (Pass@3 vs. Pass@1) substantially improves the likelihood of success, regardless of the underlying model or training stage.
* **Training Efficacy:** The data shows that fine-tuning (SFT) and Reinforcement Learning (RL) provide large early advantages over the Base model. For instance, at $2^{21}$ tokens, SFT and RL already achieve nearly 40% Pass@3, while the Base model is barely above 0%.
* **Scaling Laws:** The overall upward trajectory of the dashed lines is consistent with a standard scaling trend: exposing the model to more SFT tokens generally increases its pass rate. However, the dips at $2^{24}$ show that improvement is not strictly monotonic; training may experience instability or require learning-rate adjustments at certain phases.
* **Diminishing Returns:** While performance is still climbing at the far right of the chart ($1.5 \times 2^{28}$), the slope of the dashed lines is beginning to flatten slightly compared to the explosive growth seen between $2^{21}$ and $2^{23}$. This suggests that while more data helps, the marginal utility of each additional token is decreasing, and the models may be approaching an asymptotic performance limit for this specific benchmark.
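The Pass@k metric on the y-axis is commonly computed with the unbiased estimator from n sampled attempts with c successes (a sketch; whether this chart uses that estimator or a simple best-of-k count is not stated in the image):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    attempts, drawn without replacement from n attempts containing
    c successes, is a success."""
    if n - c < k:
        return 1.0  # too few failures: every k-subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 attempts and 3 successes, pass@1 reduces to the raw success rate:
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Under this reading, the Pass@2 and Pass@3 curves necessarily dominate Pass@1 for the same underlying samples, which is consistent with the solid-line trend described earlier.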