# Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
## Abstract
Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code editing, and self-reflection, that enable efficient and effective SWE-Agent adaptation. In this work, we first curate an Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow-based approaches. With additional SFT adaptation on 5k publicly available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with Claude 3.5 Sonnet (the 20241022 version). These results show that structured skill priors from Agentless training can bridge workflow-based and agentic frameworks for transferable coding agents.
## 1 Introduction
Recent years have witnessed the rapid development of Large Language Models (LLMs) automating Software Engineering (SWE) tasks (jimenez2023swe; yang2024swe; xia2024agentless; anthropic_claude_3.5_sonnet_20241022; pan2024training; wang2024openhands; wei2025swe; yang2025qwen3; team2025kimi_k2; openai_gpt5_system_card_2025). Among the benchmarks that track the progress of LLM coding agents in SWE scenarios, SWE-bench (jimenez2023swe) stands out as one of the most representative: given an issue that reports a bug in a real-world GitHub repository, a model must produce a patch that fixes the bug, whose correctness is judged by whether the corresponding unit tests pass after its application. The difficulty of the task (as of the date the benchmark was proposed), the availability of an outcome reward via the provided auto-eval harness, and the real-world economic value it reflects have made SWE-bench a focal point of the field.
Two lines of solutions have emerged for the SWE-bench task. Agent-based solutions like SWE-Agent (yang2024swe) and OpenHands (wang2024openhands) take an interactionist approach: instructed with the necessary task description, a predefined set of available tools, and the specific problem statement, the agent interacts with an executable environment over multiple turns, makes changes to the source code, and determines autonomously when to stop. In contrast, workflow-based solutions like Agentless (xia2024agentless) pre-define the solving process as a pipeline consisting of steps such as localization, bug repair, and test composition. Such task decomposition transforms the agentic task into generating correct responses for a chain of single-turn problems with verifiable rewards (guo2025deepseek; wei2025swe; SWESwiss2025).
The two paradigms have been widely viewed as mutually exclusive. On the one hand, SWE-Agents offer a higher performance ceiling and better adaptability, thanks to the greater freedom of multi-turn interaction without fixed routines. However, they have also proved more difficult to train due to their end-to-end nature (deepswe2025; cao2025skyrl). On the other hand, Agentless methods offer better modularity and ease of training with Reinforcement Learning with Verifiable Rewards (RLVR) techniques, but suffer from a more limited exploration space and flexibility, and from difficulty in behavior monitoring, as erroneous patterns appear only within single-turn long reasoning content (pan2024training). We challenge this dichotomy from the perspective of the training recipe: we argue that Agentless training should not be viewed as the ultimate deliverable, but rather as a way to induce skill priors, atomic capabilities such as localizing buggy implementations, updating erroneous code snippets, self-reflection, and verification, all of which scaffold the efficient adaptation of more capable and generalizable SWE-Agents.
Guided by this perspective, we introduce Kimi-Dev, an open-source code LLM for SWE tasks. Specifically, we first develop an Agentless training recipe comprising mid-training, cold start, reinforcement learning, and test-time self-play. This yields 60.4% accuracy on SWE-bench Verified, the SoTA performance among workflow-based solutions. Building on this, we show that Agentless training induces skill priors: a minimal SFT cold start from Kimi-Dev with 5k publicly available trajectories enables efficient SWE-Agent adaptation and reaches a 48.6% pass@1 score, similar to that of Claude 3.5 Sonnet (the 20241022 version, anthropic_claude_3.5_sonnet_20241022). We demonstrate that these induced skills transfer from non-agentic workflows to agentic frameworks, and that the self-reflection in long Chains-of-Thought instilled by Agentless training further enables the agentic model to leverage more turns and succeed over a longer horizon. Finally, we show that the skills from Agentless training generalize beyond SWE-bench Verified to broader benchmarks such as SWE-bench-live (zhang2025swe) and SWE-bench Multilingual (yang2025swesmith). Together, these results reframe the relationship between Agentless and agentic frameworks: not as mutually exclusive, but as complementary stages in building transferable coding LLMs. This shift offers a principled view that training with structured skill priors can scaffold autonomous agentic interaction.
The remainder of this paper is organized as follows. Section 2 reviews the background of the framework dichotomy and outlines the challenges of training SWE-Agents. Section 3 presents our Agentless training recipe and the experimental results. Section 4 demonstrates how these Agentless-induced skill priors enable efficient SWE-Agent adaptation, and evaluates the skill transfer and generalization beyond SWE-bench Verified.
## 2 Background
In this section, we first review the two dominant frameworks for SWE tasks and their dichotomy in Section 2.1. We then summarize the progress and challenges of training SWE-Agents in Section 2.2. The background introduction sets the stage for reinterpreting Agentless training as skill priors for SWE-Agents, a central theme developed throughout the later sections.
### 2.1 Framework Dichotomy
Two paradigms currently dominate the solutions for automating software engineering tasks. Agentless approaches decompose SWE tasks into modular workflows (xia2024agentless; wei2025swe; ma2024lingma; ma2025alibaba; swe-fixer). Typical workflows consist of bug localization, bug repair, and test generation. This design provides modularity and stability: each step can be optimized separately as a single-turn problem with verifiable rewards (wei2025swe; SWESwiss2025). However, such rigidity comes at the cost of flexibility: when scenarios require multiple rounds of incremental updates, Agentless approaches struggle to adapt.
By contrast, SWE-agents adopt an end-to-end, multi-turn reasoning paradigm (yang2024swe; wang2024openhands). Rather than following a fixed workflow, they iteratively plan, act, and reflect, resembling how human developers debug complex issues. This design enables greater adaptability, but introduces significant difficulties: trajectories often extend over tens or even hundreds of steps, context windows of the LLMs must span over the entire interaction history, and the model must handle exploration, reasoning, and tool use simultaneously.
The dichotomy between fixed workflows (e.g., Agentless) and agentic frameworks (e.g., SWE-Agent) has shaped much of the community's perspective. The two paradigms are often regarded as mutually exclusive: one trades off flexibility and performance ceiling for modularity and stability, whereas the other makes the reverse compromise. Our work challenges this dichotomy, as we demonstrate that Agentless training induces skill priors that make further SWE-agent training both more stable and more efficient.
### 2.2 Training SWE-agents
Training SWE-agents relies on acquiring high-quality trajectories through interactions with executable environments. Constructing such large-scale environments and collecting reliable trajectories, however, requires substantial human labor as well as costly calls to frontier models, making data collection slow and resource-demanding (pan2024training; badertdinov2024sweextra). Recent studies also attempt to scale environment construction by synthesizing bugs for the reverse construction of executable runtime (jain2025r2e; yang2025swesmith).
However, credit assignment across long horizons remains challenging, as outcome rewards are sparse and often available only when a final patch passes its tests. Reinforcement learning techniques have been proposed, but they frequently suffer from instability or collapse when trajectories exceed dozens of steps (deepswe2025; cao2025skyrl). SWE-agent training is also highly sensitive to initialization: starting from a generic pre-trained model often leads to brittle behaviors, such as failing to use tools effectively or getting stuck in infinite loops of specific action patterns (pan2024training; yang2025swesmith).
These limitations motivate our central hypothesis: instead of training SWE-agents entirely from scratch, one can first induce skill priors through agentless training, enhancing the atomic capabilities like localization, repair, test composition, and self-reflection. These priors lay a foundation that makes subsequent agentic training both more efficient and more generalizable.
## 3 Agentless Training Recipe
Instead of training SWE-agents from scratch, we leverage Agentless training to induce skill priors. Skill priors enhanced by Agentless training include, but are not limited to, bug localization, patch generation, self-reflection, and verification, which lay the foundation for end-to-end agentic interaction. In this section, we elaborate our Agentless training recipe: the duo framework design of BugFixer and TestWriter, mid-training and cold start, reinforcement learning, and test-time self-play. Sections 3.1 to 3.4 detail these ingredients, and Section 3.5 presents the experimental results for each of them. This training recipe results in Kimi-Dev, an open-source 72B model that achieves 60.4% on SWE-bench Verified, the SoTA performance among workflow-based solutions.
<details>
<summary>x3.png Details</summary>

Visual description: the BugFixer module (left) and the TestWriter module (right), each equipped with File Localization and Code Edit capabilities, are coordinated by a central LLM. A top arrow labeled "Generate Test Case" flows from BugFixer to TestWriter, and a bottom arrow labeled "Fix Bugs" flows back, forming a closed loop.
</details>
Figure 1: Agentless framework for Kimi-Dev: the duo of BugFixer and TestWriter.
### 3.1 Framework: the Duo of BugFixer and TestWriter
In GitHub issue resolution, we conceptualize the process as the collaboration between two important roles: the BugFixer, who produces patches that correctly address software bugs, and the TestWriter, who creates reproducible unit tests that capture the reported bug. A resolution is considered successful when the BugFixerβs patch passes the tests provided for the issue, while a high-quality test from the TestWriter should fail on the pre-fix version of the code and pass once the fix is applied.
Each role relies on two core skills: (i) file localization, the ability to identify the specific files relevant to the bug or test, and (ii) code edit, the ability to implement the necessary modifications. For BugFixer, effective code edits repair the defective program logic, whereas for TestWriter, they update precise unit test functions that reproduce the issue into the test files. As illustrated in Figure 1, these two skills constitute the fundamental abilities underlying GitHub issue resolution. Thus, we enhance these skills through the following training recipes, including mid-training, cold-start, and RL.
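The role-and-skill decomposition above can be sketched minimally in code. This is an illustration, not the actual Kimi-Dev pipeline: the `run_role` helper, its prompts, and the `RoleOutput` container are our own hypothetical names; the key point is that each role reduces to two single-turn steps, file localization and code edit.

```python
from dataclasses import dataclass

@dataclass
class RoleOutput:
    files: list  # result of the file-localization step
    patch: str   # result of the code-edit step

def run_role(llm, role: str, issue: str, repo_map: str) -> RoleOutput:
    """Run one role (BugFixer or TestWriter) as two single-turn subproblems."""
    assert role in ("BugFixer", "TestWriter")
    # Step 1: file localization -- ask the model which files are relevant.
    files = llm(f"[{role}] Locate the files relevant to this issue:\n{issue}\n{repo_map}")
    # Step 2: code edit -- ask the model for edits restricted to those files.
    patch = llm(f"[{role}] Edit the files {files} to resolve:\n{issue}")
    return RoleOutput(files=files, patch=patch)
```

Because each step is a single-turn generation with a checkable output, both can be trained independently with verifiable rewards, which is what the following recipe exploits.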
### 3.2 Mid-Training & Cold Start
To enhance the model's prior as both a BugFixer and a TestWriter, we perform mid-training with $\sim$150B tokens of high-quality, real-world data. With the Qwen 2.5-72B-Base (qwen2025qwen25technicalreport) model as a starting point, we collect millions of GitHub issues and PR commits to form the mid-training dataset, which consists of (i) $\sim$50B tokens in the Agentless format derived from natural diff patches, (ii) $\sim$20B tokens of curated PR commit packs, and (iii) $\sim$20B tokens of synthetic data with reasoning and agentic interaction patterns (upsampled by a factor of 4 during training). The data recipe is carefully constructed to enable the model to learn how human developers reason about GitHub issues, implement code fixes, and develop unit tests. We also performed strict data decontamination to exclude any repository in the SWE-bench Verified test set. Mid-training substantially enhances the model's knowledge of practical bug fixes and unit tests, making it a better starting point for later stages. The details of the recipe are covered in Appendix A.
To activate the model's long Chain-of-Thought (CoT) capability, we also construct a cold-start dataset of reasoning trajectories based on the SWE-Gym (pan2024training) and SWE-bench-extra (badertdinov2024scaling) datasets, generated by the DeepSeek R1 model (guo2025deepseek, the 20250120 version). In this setup, R1 plays the roles of BugFixer and TestWriter, producing outputs such as file localizations and code edits. Through supervised finetuning on this dataset as a cold start, the model acquires essential reasoning skills, including problem analysis, method sketching, self-refinement, and exploration of alternative solutions.
### 3.3 Reinforcement Learning
After mid-training and cold-start, the model demonstrates strong performance in localization. Therefore, reinforcement learning (RL) focuses solely on the code edit stage. We construct a training set specifically for this stage, where each prompt is equipped with an executable environment. We further employ multiple localization rollouts from the initial model to generate varied file location predictions, which diversifies the prompts used in code-edit RL.
For the RL algorithm, we adopt the policy optimization method proposed by Kimi k1.5 (team2025kimi_k15), which has shown promising results on reasoning tasks in both math and coding. It is a simpler policy-gradient approach based on the REINFORCE algorithm (williams1992simple). Similarly to GRPO (shao2024deepseekmath), we use the average reward of multiple rollouts as the baseline to normalize the returns. When adapting the algorithm to our SWE-bench setting, we highlight the following three key desiderata:
1. Outcome-based reward only: We rely solely on the final execution outcome from the environment as the raw reward (0 or 1), without incorporating any format- or process-based signals. For BugFixer, a positive reward is given if the generated patch passes all ground-truth unit tests. For TestWriter, a positive reward is assigned when (i) the predicted test raises a failure in the repo without the ground-truth bugfix patch applied, AND (ii) the failure is resolved once the ground-truth bugfix patch is applied.
1. Adaptive prompt selection: Prompts with pass@16 = 0 are initially discarded, as they do not contribute to the batch loss. This yields an initial prompt set of 1,200 problems and enlarges the effective batch size. A curriculum learning scheme is then applied: once the success rate on the current set exceeds a threshold, 500 previously excluded prompts (those with initial pass@16 = 0, which may become solvable as RL improves the model) are reintroduced every 100 RL steps to gradually raise task difficulty.
1. Positive example reinforcement: As performance improvements begin to plateau in later stages of training, we incorporate the positive samples from the recent RL iterations into the training batch of the current iteration. This approach reinforces the modelβs reliance on successful patterns, thereby accelerating convergence in the final phase.
Robust sandbox infrastructure. We construct the Docker environments with Kubernetes (kubernetes), which provides a secure, scalable sandbox infrastructure for efficient training and rollouts. The infrastructure supports over 10,000 concurrent instances with robust performance, making it ideal for competitive programming and software engineering tasks (see Appendix D for details).
### 3.4 Test-Time Self-Play
After RL, the model masters the roles of both a BugFixer and a TestWriter. During test time, it adopts a self-play mechanism to coordinate its bug-fixing and test-writing abilities.
Following Agentless (xia2024agentless), we use the model to generate 40 candidate patches and 40 tests for each instance. Each patch generation involves an independent run of localization and code edit by BugFixer, where the first run uses greedy decoding (temperature 0) and the remaining 39 use temperature 1 to ensure diversity. Similarly, 40 tests are generated independently by TestWriter. To guarantee the validity of the test patch candidates, we first filter out those that fail to raise a failure in the original repo without any BugFixer patch applied.
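The sampling scheme above (one greedy run plus temperature-1 samples) can be sketched briefly; `generate_patch` is a hypothetical stand-in for one full localization-plus-edit run, not an actual Kimi-Dev API.

```python
def generate_candidates(generate_patch, n: int = 40):
    """Generate n candidates: the first greedy (temperature 0),
    the remaining n-1 sampled at temperature 1 for diversity."""
    temperatures = [0.0] + [1.0] * (n - 1)
    return [generate_patch(temperature=t) for t in temperatures]
```

The same scheme is applied independently to BugFixer patches and TestWriter tests before the filtering and scoring steps.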
Denote the remaining TestWriter patches as set $T$, and the BugFixer patches as set $B$. For each $b_i\in B$ and $t_j\in T$, we execute the test suite over the test file modified by $t_j$ twice: first without $b_i$, and then with $b_i$ applied. From the execution log of the first run, we obtain the counts of failed and passed tests from $t_j$, denoted ${\rm F}(j)$ and ${\rm P}(j)$. Comparing the execution logs of the two runs, we obtain the counts of fail-to-pass and pass-to-pass tests, denoted ${\rm FP}(i,j)$ and ${\rm PP}(i,j)$, respectively. We then calculate the score for each $b_i$ as
$$
S_i=\frac{\sum_j{\rm FP}(i,j)}{\sum_j{\rm F}(j)}+\frac{\sum_j{\rm PP}(i,j)}{\sum_j{\rm P}(j)}, \tag{1}
$$
where the first part reflects the performance of $b_i$ under reproduction tests, and the second part could be viewed as the characterization of $b_i$ under regression tests (xia2024agentless). We select the BugFixer patch $b_i$ with the highest $S_i$ score as the ultimate answer.
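The selection rule of Eq. (1) can be sketched as follows. This is a minimal illustration: `FP`, `PP`, `F`, and `P` are assumed to be dictionaries of counts already parsed from execution logs, and the function names are ours.

```python
def score_patches(B, T, FP, PP, F, P):
    """Return the index of the BugFixer patch b_i maximizing S_i (Eq. 1)."""
    scores = []
    for i in range(len(B)):
        fp = sum(FP[(i, j)] for j in range(len(T)))  # fail-to-pass counts
        pp = sum(PP[(i, j)] for j in range(len(T)))  # pass-to-pass counts
        f = sum(F[j] for j in range(len(T)))         # tests failing without b_i
        p = sum(P[j] for j in range(len(T)))         # tests passing without b_i
        # First term: reproduction-test performance; second: regression behavior.
        s = (fp / f if f else 0.0) + (pp / p if p else 0.0)
        scores.append(s)
    return max(range(len(B)), key=scores.__getitem__)
```

An ideal patch converts every pre-fix failure into a pass while breaking no previously passing test, attaining the maximum score of 2.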
Table 1: Performance comparison of models on SWE-bench Verified under Agentless-like frameworks. All results are obtained under the standard 40-patch, 40-test setting (xia2024agentless), except Llama3-SWE-RL, which uses 500 patches and 30 tests.
### 3.5 Experiments
#### 3.5.1 Main Results
<details>
<summary>figs/sec3_mid_training/mid-train_perf.png Details</summary>

Visual description: a bar chart of pass rate (%) vs. mid-training tokens. Pass rate rises monotonically with data volume: 28.6% at 50B tokens, 32.6% at 100B, and 36.6% at 150B, a consistent gain of 4.0 percentage points per 50B tokens with no sign of saturation in this range.
</details>
Figure 2: The performance on SWE-bench Verified after mid-training with different training token budgets.
Table 1 shows the performance of Kimi-Dev on SWE-bench Verified (jimenez2023swe). Instead of the text-similarity rewards used in SWE-RL (wei2025swe), we adopt execution-based signals for more reliable fix quality. Our two-stage TestWriter also improves over prior Agentless systems (xia2024agentless; guo2025deepseek; SWESwiss2025), which rely on a single root-level test, by better capturing repository context and mirroring real developer workflows (OpenAI-Codex-2025). Kimi-Dev attains state-of-the-art performance among open-source models, resolving 60.4% of issues.
#### 3.5.2 Mid-Training
In this section, we evaluate the relationship between the amount of mid-training data and model performance. Specifically, we finetuned Qwen 2.5-72B-Base on subsets of the mid-training data containing 50B, 100B, and approximately 150B tokens, and then lightly activated each mid-trained model via an SFT cold start on the same set of 2,000 BugFixer input-output pairs. We report only BugFixer pass@1 here for simplicity of evaluation. Figure 2 shows that increasing the number of mid-training tokens consistently improves model performance, highlighting the effectiveness of this stage.
#### 3.5.3 Reinforcement Learning
<details>
<summary>figs/sec3_rl_scaling/quick_plot_twin_bf_final.png Details</summary>

Visual description: a dual-axis line chart of average response token length (left axis, blue) and pass rate (right axis, red) against RL training steps (0 to 500). Both metrics trend upward: token length grows from roughly 3,900 to about 7,800 tokens, and pass rate climbs from about 34% to about 46%, with the pass rate noticeably more volatile than the length curve and both peaking in the final 100 steps.
The chart suggests that with more training, the model learns to produce longer and, on average, more correct responses. However, the path to improvement is not linear, especially for correctness. The high volatility in Pass Rate implies that the training process involves periods of exploration where performance can degrade temporarily before improving. The strong correlation in the final steps suggests the training may be converging toward an optimal policy that balances length and correctness. The divergence at step 475 is a critical point for investigation, as it shows a temporary decoupling of length and success rate.
</details>
(a) 72B Joint RL, BugFixer
<details>
<summary>figs/sec3_rl_scaling/quick_plot_twin_tw_final.png Details</summary>

### Visual Description
## Dual-Axis Line Chart: RL Training Progression
### Overview
This is a dual-axis line chart plotting two metrics, **Token Length** and **Reproduced Rate (%)**, against **RL Training Steps**. The chart visualizes the progression of these two variables over the course of a reinforcement learning (RL) training process, spanning from step 0 to step 500. The data suggests a relationship between the length of generated tokens and the model's reproduction accuracy as training advances.
### Components/Axes
* **X-Axis (Bottom):** Labeled **"RL Training Steps"**. Linear scale from 0 to 500, with major tick marks every 50 steps.
* **Primary Y-Axis (Left):** Labeled **"Token Length"** in blue text. Linear scale from 3000 to 6500, with major tick marks every 500 units.
* **Secondary Y-Axis (Right):** Labeled **"Reproduced Rate (%)"** in red text. Linear scale from 20.0 to 35.0, with major tick marks every 2.5 percentage points.
* **Legend:** Located in the top-left corner of the plot area.
* Blue line with square markers: **"Token Length"**
* Red line with circle markers: **"Reproduced Rate (%)"**
* **Grid:** A light gray, dashed grid is present for both axes, aiding in value estimation.
### Detailed Analysis
**1. Token Length (Blue Line, Left Axis):**
* **Trend:** Shows a strong, generally consistent upward trend throughout training, with minor local fluctuations.
* **Key Data Points (Approximate):**
* Step 0: ~3050
* Step 50: ~3000 (local minimum)
* Step 100: ~3500
* Step 150: ~3650
* Step 200: ~4050
* Step 250: ~4600
* Step 300: ~4900
* Step 350: ~5100
* Step 400: ~6100
* Step 450: ~6200
* Step 500: ~6400 (peak)
* **Observation:** The growth is relatively smooth from step 150 onward, with a notable acceleration between steps 350 and 400.
**2. Reproduced Rate (%) (Red Line, Right Axis):**
* **Trend:** Exhibits a volatile but overall upward trend, characterized by sharp peaks and troughs.
* **Key Data Points (Approximate):**
* Step 0: ~20.0% (minimum)
* Step 50: ~22.5%
* Step 100: ~27.5% (local peak)
* Step 150: ~27.5%
* Step 200: ~27.5%
* Step 250: ~32.5% (local peak)
* Step 300: ~32.5%
* Step 350: ~32.5%
* Step 400: ~35.0% (global peak)
* Step 450: ~32.5%
* Step 500: ~32.0%
* **Observation:** The rate is highly unstable. Major dips occur around steps 75, 175, 275, and 425. The highest reproduction rate (~35%) is achieved near step 400, coinciding with a steep rise in token length.
### Key Observations
1. **Correlation with Volatility:** While both metrics trend upward, the **Reproduced Rate** is far more volatile than the steadily increasing **Token Length**. This suggests that while the model learns to generate longer sequences, its ability to accurately reproduce target content is less stable and may be sensitive to specific training phases.
2. **Peak Performance Window:** The highest reproduction rate (~35%) occurs in the step 380-420 window, where token length is also rapidly increasing (from ~5500 to ~6100). This could indicate an optimal training phase.
3. **Late-Stage Divergence:** After step 400, token length continues to climb to its maximum (~6400), but the reproduced rate declines from its peak and becomes erratic. This divergence might signal the onset of overfitting, where the model generates longer but less accurate outputs.
4. **Initial Phase:** The first 50 steps show minimal growth in token length and a low, fluctuating reproduction rate, typical of early training exploration.
### Interpretation
The chart demonstrates a common dynamic in RL training for generative models: **increased output complexity (longer tokens) does not guarantee improved performance (higher reproduction rate)**. The data suggests:
* **Learning Progress:** The model successfully learns to generate longer sequences as training progresses, indicating it is capturing more complex patterns or adhering to longer-form generation objectives.
* **Performance Instability:** The high volatility in the reproduction rate implies the training process is unstable. The model's accuracy is not improving monotonically; it experiences significant setbacks, which could be due to factors like reward function sparsity, policy updates causing catastrophic forgetting, or exploration-exploitation trade-offs.
* **Potential Overfitting or Objective Misalignment:** The final phase (steps 400-500) is critical. The continued increase in token length coupled with a declining and unstable reproduction rate suggests the model may be optimizing for length at the expense of accuracy, or that the training objective is not perfectly aligned with the desired outcome of faithful reproduction.
* **Actionable Insight:** A practitioner analyzing this chart might consider adjusting the training hyperparameters (e.g., learning rate, reward scaling) after step 400 to stabilize the reproduced rate, or investigate why performance peaks and then degrades despite longer generations. The optimal checkpoint for deployment might be around step 400, where reproduction rate is maximized.
</details>
(b) 72B Joint RL, TestWriter
Figure 3: Joint code-edit RL experiments on the model after mid-training and cold-start. The pass rate for BugFixer and the reproduced rate for TestWriter are reported as pass@1 with temperature=1.0. The performance improves consistently as the output becomes increasingly longer.
Experimental setup
We set the number of training steps per RL iteration to 5 and sample 10 rollouts for each of the 1,024 problems from the union of SWE-gym (pan2024training) and SWE-bench-extra (badertdinov2024sweextra). We dynamically adjust the prompt set every 20 iterations to gradually increase task difficulty. We fix the maximum training context length at 64k tokens, since the prompt input contains the contents of the entire files localized by the initial model in advance.
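The schedule above can be sketched as a toy simulation. This is a sketch under stated assumptions, not the actual training code: `select_prompts` and its difficulty heuristic are hypothetical stand-ins, and the policy update is stubbed out.

```python
# Hypothetical sketch of the RL schedule described above.
STEPS_PER_ITERATION = 5
ROLLOUTS_PER_PROBLEM = 10
REFRESH_EVERY = 20  # iterations between prompt-set refreshes

def select_prompts(pool, difficulty_level, k):
    # Placeholder curriculum: slice the pool at an offset that grows with
    # the difficulty level (assumes harder problems appear later in the pool).
    start = min(difficulty_level * k, max(len(pool) - k, 0))
    return pool[start:start + k]

def run_rl(pool, num_iterations, k):
    refreshes = 0
    prompt_set = select_prompts(pool, 0, k)
    for it in range(num_iterations):
        if it > 0 and it % REFRESH_EVERY == 0:
            # Every 20 iterations, refresh the prompt set at a higher difficulty.
            refreshes += 1
            prompt_set = select_prompts(pool, it // REFRESH_EVERY, k)
        # Each problem gets ROLLOUTS_PER_PROBLEM sampled solutions, followed
        # by STEPS_PER_ITERATION policy-gradient steps (stubbed out here).
        rollout_budget = {p: ROLLOUTS_PER_PROBLEM for p in prompt_set}
        for _ in range(STEPS_PER_ITERATION):
            pass  # policy.update(rollout_budget) in a real trainer
    return refreshes

# 100 iterations refresh the prompt set at iterations 20, 40, 60, and 80.
print(run_rl(list(range(5000)), num_iterations=100, k=1024))
```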
Results
Figure 3 shows the performance and response-length curves on the test set during RL training. The pass rate and the reproduced rate are calculated as pass@1 with temperature=1.0. Specifically, we observe that both model performance and response length steadily increase, reflecting the expected benefits of RL scaling. Similar RL scaling curves are also observed in our ablation experiments on Qwen2.5-14B-Instruct models, demonstrating the effectiveness of the RL training recipe across models of different sizes. The experimental details, as well as the ablation studies on positive example reinforcement in Section 3.3, are listed in Appendix C.2. The lengthy outputs consist of in-depth problem analysis and self-reflection patterns, similar to those in math and code reasoning tasks (team2025kimi_k15; guo2025deepseek). We have also observed that for TestWriter, occasional false-positive examples occur during RL training due to the lack of reproduction coverage. We leave the case studies in Appendix E and further improvement for future work.
<details>
<summary>figs/sec3_sp_scaling/selfplay_figure_v2.png Details</summary>

### Visual Description
## [Line Charts]: Pass Rate vs. Number of Patches (BF x TW)
### Overview
The image contains two side-by-side line charts comparing the performance (Pass Rate %) of different methods as the "Number of patches: BF x TW" increases. The left chart compares "Self-play" and "Majority Voting." The right chart compares "Self-play" and "Pass@N." Both charts share the same x-axis categories and the same "Self-play" data series, but have different y-axis scales and a different second data series.
### Components/Axes
**Common Elements:**
* **X-Axis (Both Charts):** Labeled "Number of patches: BF x TW". The categorical markers are: `1×1`, `3×3`, `5×5`, `10×10`, `20×20`, `40×40`.
* **Y-Axis (Both Charts):** Labeled "Pass Rate (%)".
* **Grid:** Both charts have a light gray, dashed grid.
**Left Chart Specifics:**
* **Y-Axis Scale:** Ranges from 45.0 to 62.5, with major ticks every 2.5 units.
* **Legend:** Located in the top-left corner.
* `Self-play`: Blue line with hollow circle markers.
* `Majority Voting`: Green line with hollow triangle markers.
**Right Chart Specifics:**
* **Y-Axis Scale:** Ranges from 45 to 75, with major ticks every 5 units.
* **Legend:** Located in the top-left corner.
* `Self-play`: Blue line with hollow circle markers (identical to left chart).
* `Pass@N`: Orange line with hollow diamond markers.
### Detailed Analysis
**Data Series: Self-play (Blue Line, Circle Markers)**
* **Trend:** Increases steadily from `1×1` to `20×20`, then plateaus.
* **Data Points:**
* `1×1`: 48.0%
* `3×3`: 52.6%
* `5×5`: 55.4%
* `10×10`: 58.8%
* `20×20`: 60.4%
* `40×40`: 60.4%
**Data Series: Majority Voting (Green Line, Triangle Markers) - Left Chart Only**
* **Trend:** Shows a modest, gradual increase, peaking at `20×20` before a slight decline.
* **Data Points:**
* `1×1`: 48.0%
* `3×3`: 48.8%
* `5×5`: 50.0%
* `10×10`: 51.0%
* `20×20`: 51.4%
* `40×40`: 51.2%
**Data Series: Pass@N (Orange Line, Diamond Markers) - Right Chart Only**
* **Trend:** Shows a strong, consistent upward trend across all patch numbers, with no sign of plateauing.
* **Data Points:**
* `1×1`: 48.0%
* `3×3`: 60.4%
* `5×5`: 64.0%
* `10×10`: 67.4%
* `20×20`: 71.6%
* `40×40`: 74.8%
### Key Observations
1. **Common Baseline:** All three methods start at the same performance (48.0%) for the `1×1` patch configuration.
2. **Diverging Performance:** As the number of patches increases, the performance of the three methods diverges significantly.
3. **Plateau vs. Growth:** "Self-play" performance plateaus after `20×20` patches. "Majority Voting" shows minimal gains overall. In contrast, "Pass@N" demonstrates continuous, strong improvement.
4. **Relative Performance:** At the highest patch count (`40×40`), "Pass@N" (74.8%) significantly outperforms "Self-play" (60.4%), which in turn outperforms "Majority Voting" (51.2%).
### Interpretation
The data demonstrates the impact of scaling the "Number of patches: BF x TW" on the pass rate for different evaluation or sampling strategies.
* **Self-play** benefits from increased patch count up to a point (`20×20`), after which its performance saturates. This suggests a limit to the effectiveness of self-play alone as the problem space (represented by patches) expands.
* **Majority Voting** provides only marginal improvements over the baseline, indicating it is not a highly effective strategy for leveraging increased patch counts in this context.
* **Pass@N** shows a strong, positive correlation between patch count and performance. This suggests that the Pass@N method is highly effective at utilizing the additional information or diversity provided by a larger number of patches, leading to substantially higher pass rates. The lack of a plateau within the tested range implies potential for further gains with even larger patch counts.
**Conclusion:** For the task measured by "Pass Rate," the Pass@N strategy scales most effectively with an increasing number of patches (BF x TW), followed by Self-play, while Majority Voting offers limited benefit. The choice of strategy becomes increasingly critical as the patch configuration grows larger.
</details>
Figure 4: Test-time self-play on SWE-bench Verified. Performance improves with more generated patches and tests. Left: Execution-based self-play consistently surpasses BugFixer majority voting. Right: Self-play performance remains below pass@N, where the ground-truth test patch is used, suggesting that room remains for TestWriter to improve.
#### 3.5.4 Test-time Self-Play
Following Section 3.4, we evaluate how the final performance on SWE-bench Verified scales with the number of patches and tests generated. The temperature is fixed at 0 for the initial rollout, and set to 1.0 for the subsequent 39 rollouts. As shown on the left of Figure 4, the final performance improves from 48.0% to 60.4% as the number of patch-test pairs increases from 1×1 to 40×40, and consistently surpasses the results obtained from the majority vote of the BugFixer patches only.
Specifically, the self-play result obtained from 3 patches and 3 tests per instance already surpasses the performance of majority voting over 40 BugFixer patches. This demonstrates the value of the additional information provided by test-time execution. Room for improvement of TestWriter nevertheless remains for more powerful self-play: as shown in Figure 4, self-play performance stays below pass@N, where ground-truth test cases serve as the criterion for issue resolution. This finding aligns with anthropic_claude_3.5_sonnet_20241022, which introduced a final edge-case checking phase to generate a more diverse set of test cases, thereby strengthening the role of the "TestWriter" in their SWE-Agent framework. We also report preliminary observations of a potential parallel scaling phenomenon, which requires no additional training and may enable scalable performance improvements. The details of the phenomenon and analyses are covered in Appendix F.
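Execution-based selection of this kind can be illustrated with a minimal sketch: cross-execute the N generated patches against the M generated tests and keep the patch that passes the most tests. This is a sketch of the general idea only; `passes` is an abstract stand-in for sandboxed execution, and the tie-breaking rule is our illustrative assumption, not necessarily the exact selection rule used here.

```python
from collections import Counter

def select_patch(patches, tests, passes):
    """Return the patch that passes the largest number of generated tests."""
    scores = Counter()
    for p in patches:
        # Cross-execution: score each patch by how many tests it passes.
        scores[p] = sum(passes(p, t) for t in tests)
    # Highest score wins; among ties, prefer the most frequently sampled patch
    # (an illustrative tie-break, akin to majority voting).
    freq = Counter(patches)
    return max(set(patches), key=lambda p: (scores[p], freq[p]))

# Toy oracle: patch "b" passes 2 of 3 tests, patch "a" passes only 1.
toy = {("a", 0): True, ("a", 1): False, ("a", 2): False,
       ("b", 0): True, ("b", 1): True, ("b", 2): False}
print(select_patch(["a", "b"], [0, 1, 2], lambda p, t: toy[(p, t)]))
# prints: b
```

The contrast with plain majority voting is visible in the scoring: majority voting uses only `freq` (patch agreement), while self-play adds the execution signal in `scores`.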
## 4 Initializing SWE-Agents from Agentless Training
End-to-end multi-turn frameworks, such as SWE-Agent (yang2024swe; anthropic_claude_3.5_sonnet_20241022) and OpenHands (wang2024openhands), enable agents to leverage tools and interact with environments. Specifically, the system prompt employed in the SWE-Agent framework (anthropic_claude_3.5_sonnet_20241022) outlines a five-stage workflow: (i) repo exploration, (ii) error reproduction via a test script, (iii) code edit for bug repair, (iv) test re-execution for validation, and (v) edge-case generation and checks. Unlike Agentless, the SWE-Agent framework does not enforce a strict stage-wise workflow; the agent can reflect, transition, and redo freely until it deems the task complete and submits.
The performance potential is therefore higher without a fixed routine; however, training for SWE-Agent is more challenging because of the sparsity of the outcome reward for long-horizon credit assignment. Meanwhile, our Kimi-Dev model has undergone Agentless training, with its localization and code-edit skills for BugFixer and TestWriter deliberately strengthened. In this section, we investigate whether it can serve as an effective prior for multi-turn SWE-Agent scenarios.
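The free-form loop can be contrasted with a fixed pipeline in a minimal sketch: each turn the policy picks any of the five actions above (plus submitting), in any order, until it submits. The action names and the `model_choose_action` stand-in are illustrative, not the framework's actual interface.

```python
# Illustrative sketch of a free-form SWE-Agent loop (no enforced stage order).
ACTIONS = ["explore_repo", "reproduce_error", "edit_code",
           "rerun_tests", "check_edge_cases", "submit"]

def agent_loop(model_choose_action, max_turns=100):
    """Run until the agent submits or the turn budget is exhausted."""
    history = []
    for _ in range(max_turns):
        action = model_choose_action(history)  # free choice, any order
        assert action in ACTIONS
        history.append(action)
        if action == "submit":
            break
    return history

# A toy policy that "reflects" once: it re-edits after a failed test rerun.
script = iter(["explore_repo", "reproduce_error", "edit_code",
               "rerun_tests", "edit_code", "rerun_tests", "submit"])
trace = agent_loop(lambda history: next(script))
print(len(trace))  # 7 turns, ending with "submit"
```

A fixed Agentless workflow would instead iterate over the stages in a hard-coded order, with a single verifiable step per stage.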
Table 2: Single-attempt performance of different models on SWE-bench Verified under end-to-end agentic frameworks, categorized by proprietary or open-weight models, and size over or under 100B (as of 2025.09). "Internal" denotes results achieved with their in-house agentic frameworks.
### 4.1 Performance after SWE-Agent Fine-tuning
<details>
<summary>figs/sec4_main/v-sweeping-new-FINAL.png Details</summary>

### Visual Description
## Line Chart: Scaling Behavior of SWE-Agent Training Methods
### Overview
This is a multi-series line chart illustrating the performance scaling of four different training methods (RL, SFT, MT, Base) for an AI agent called "SWE-Agent." The chart plots the "Pass Rate (%)" against the number of "SWE-Agent SFT tokens" used for training. Each method is evaluated using three metrics: Pass@1, Pass@2, and Pass@3, resulting in 12 distinct data series. The overall trend shows that performance improves for all methods as the training token count increases, though the rate of improvement and starting points vary significantly.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "# SWE-Agent SFT tokens". The scale is logarithmic, with tick marks at the following approximate values: `0`, `2^21` (~2.1 million), `2^23` (~8.4 million), `2^24` (~16.8 million), `1.1 × 2^25` (~36.9 million), `1.1 × 2^26` (~73.8 million), `1.1 × 2^27` (~147.6 million), and `1.5 × 2^28` (~402.7 million).
* **Y-Axis (Vertical):** Labeled "Pass Rate (%)". The scale is linear, ranging from 0 to 60 with major gridlines every 10 units.
* **Legend:** Positioned on the right side of the chart, outside the plot area. It defines the color and marker shape for each of the 12 data series:
* **RL (Red):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **SFT (Orange):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **MT (Purple):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **Base (Blue):** Pass@1 (Circle), Pass@2 (Square), Pass@3 (Triangle)
* **Grid:** A light gray dashed grid is present for both axes.
### Detailed Analysis
The following table reconstructs the approximate data points for each series at the given token counts. Values are estimated from the chart's visual positioning.
| Token Count (Approx.) | RL Pass@1 | RL Pass@2 | RL Pass@3 | SFT Pass@1 | SFT Pass@2 | SFT Pass@3 | MT Pass@1 | MT Pass@2 | MT Pass@3 | Base Pass@1 | Base Pass@2 | Base Pass@3 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **0** | ~4% | ~9% | ~12% | ~8% | ~13% | ~16% | ~0.5% | ~0.5% | ~0.5% | ~0% | ~0% | ~0% |
| **2^21** | ~23% | ~33% | ~39% | ~20% | ~33% | ~38% | ~5% | ~6% | ~7% | ~1% | ~2% | ~3% |
| **2^23** | ~33% | ~43% | ~48% | ~27% | ~35% | ~41% | ~27% | ~36% | ~44% | ~16% | ~24% | ~28% |
| **2^24** | ~34% | ~42% | ~47% | ~20% | ~31% | ~36% | ~29% | ~41% | ~47% | ~13% | ~22% | ~28% |
| **1.1 × 2^25** | ~34% | ~45% | ~50% | ~35% | ~46% | ~51% | ~31% | ~46% | ~52% | ~12% | ~28% | ~36% |
| **1.1 × 2^26** | ~38% | ~51% | ~58% | ~37% | ~49% | ~55% | ~37% | ~51% | ~58% | ~22% | ~38% | ~45% |
| **1.1 × 2^27** | ~44% | ~56% | ~60% | ~44% | ~55% | ~59% | ~45% | ~55% | ~60% | ~33% | ~48% | ~52% |
| **1.5 × 2^28** | ~49% | ~58% | ~64% | ~48% | ~58% | ~62% | ~46% | ~55% | ~60% | ~36% | ~48% | ~54% |
**Trend Verification per Method:**
* **RL (Red Lines):** All three lines show a strong, consistent upward trend. The slope is steep initially (0 to 2^23 tokens) and continues to rise steadily, with RL Pass@3 achieving the highest overall pass rate on the chart.
* **SFT (Orange Lines):** Also shows a strong upward trend. Notably, SFT Pass@1 exhibits a significant dip at 2^24 tokens before recovering and continuing its ascent.
* **MT (Purple Lines):** Starts very low (near 0%) but demonstrates the most dramatic scaling. The lines have a very steep slope between 2^21 and 2^23 tokens, eventually converging with and sometimes surpassing the SFT lines at higher token counts.
* **Base (Blue Lines):** Starts at or near 0% and shows the slowest initial growth. However, it exhibits a strong, consistent upward trend from 2^24 tokens onward, though it remains the lowest-performing group at every data point.
### Key Observations
1. **Performance Hierarchy:** At every token count, the performance order within each method is consistently Pass@3 > Pass@2 > Pass@1. This indicates that allowing the agent more attempts (k in Pass@k) reliably increases the success rate.
2. **Method Comparison:** RL and SFT methods start with a significant performance advantage over MT and Base at low token counts (0 to 2^21). MT shows a "catch-up" phenomenon, scaling rapidly to match SFT performance at higher token volumes. Base is consistently the lowest-performing method.
3. **Scaling Efficiency:** The most dramatic performance gains for all methods occur in the range between `2^21` and `1.1 × 2^25` tokens. After `1.1 × 2^26` tokens, the rate of improvement begins to plateau slightly for most series.
4. **Notable Anomaly:** The SFT Pass@1 series shows a clear performance drop at `2^24` tokens (from ~27% to ~20%) before recovering. This is the most pronounced deviation from the general upward trend in the chart.
### Interpretation
This chart demonstrates the **scaling laws** for different training paradigms applied to the SWE-Agent. The data suggests that:
* **Data Quantity is Critical:** All methods benefit from more training data (tokens), confirming that scale is a primary driver of performance for this agent.
* **Training Method Matters:** The choice of training method (RL, SFT, MT, Base) has a profound impact on data efficiency. RL and SFT are highly data-efficient, achieving decent performance with relatively few tokens. MT is less efficient initially but scales very effectively. The Base method is the least data-efficient, requiring orders of magnitude more data to achieve comparable results.
* **Metric Sensitivity:** The consistent gap between Pass@1, Pass@2, and Pass@3 highlights that the agent's "first-attempt" success rate is substantially lower than its success rate when given multiple chances. This is a crucial consideration for real-world deployment where the cost of multiple attempts may be high.
* **Practical Implication:** For resource-constrained scenarios (limited training data/compute), RL or SFT would be preferable. If massive data is available, MT becomes a competitive option. The Base method appears to be a weak baseline, likely representing a model without specialized training for the SWE-Agent task. The dip in SFT Pass@1 at `2^24` tokens could indicate a point of instability or overfitting in that specific training run, warranting further investigation.
</details>
Figure 5: Comparing the quality of the raw Base, the Agentless mid-trained (MT), the Agentless mid-trained with reasoning-intensive cold-start (SFT), and the Kimi-Dev model after RL as the prior for SWE-Agent adaptation. The tokens of the SWE-Agent SFT trajectories are swept over different scales, and the SWE-Agent performances are reported up to pass@3 on SWE-bench Verified.
We use the publicly available SWE-Agent trajectories to finetune Kimi-Dev. The finetuning dataset we used is released by SWE-smith (yang2025swe), consisting of 5,016 SWE-Agent trajectories collected with Claude 3.7 Sonnet (Anthropic-Claude3.7Sonnet-2025) in the synthetic environments. We perform supervised fine-tuning over Kimi-Dev, setting the maximum context length as 64K tokens during training, and allowing up to 128K tokens and 100 turns during inference.
As shown in Table 2, without collecting more trajectory data from realistic environments or conducting additional multi-turn agentic RL, our finetuned model achieves a pass@1 score of 48.6% on SWE-bench Verified under the agentic framework setup, without additional test-time scaling. Using the same SFT data, our finetuned Kimi-Dev model outperforms SWE-agent-LM (yang2025swesmith), with performance comparable to that of Claude 3.5 Sonnet (49% by the 241022 version). The pass@10 of our SWE-Agent-adapted model is 74.0%, surpassing the pass@30 of our model under Agentless (73.8%) and demonstrating the higher potential of the SWE-Agent framework.
### 4.2 Skill Transfer and Generalization
The results shown in Section 4.1 demonstrate that Kimi-Dev, a model with extensive Agentless training, could be adapted to end-to-end SWE-Agents with lightweight supervised finetuning. As the Agentless training recipe consists of mid-training, cold-start (SFT) and RL, we explore the contribution of each part in the recipe to the SWE-Agent capability after adaptation.
To figure this out, we perform SWE-Agent SFT on the original Qwen2.5-72B (Base), the mid-trained model (MT), the model then activated with Agentless-formatted long CoT data (SFT), and the (Kimi-Dev) model after finishing RL training (RL). We treat these four models as priors for SWE-Agents (slightly abusing the term "prior" to refer to a model to be finetuned with SWE-Agent trajectories in the following analysis). Since a good prior demonstrates the ability of fast adaptation with a few shots (finn2017model; brown2020language), we also sweep the amount of SWE-Agent SFT data to measure the efficiency of each prior in SWE-Agent adaptation.
Specifically, we randomly shuffle the 5,016 SWE-Agent trajectories and construct nested subsets of sizes 100, 200, 500, 1,000, and 2,000, where each smaller subset is contained within the larger ones. In addition, we prepend two extreme baselines: (i) zero-shot, where the prior model is directly evaluated under the SWE-Agent framework without finetuning, and (ii) one-step gradient descent, where the model is updated with a single gradient step using the 100-trajectory subset. This yields a range of SFT token budgets spanning $\{0, 2^{21}, 2^{23}, 2^{24}, 1.1\times 2^{25}, 1.1\times 2^{26}, 1.1\times 2^{27}, 1.5\times 2^{28}\}$. After these lightweight SFT experiments, we evaluate performance in terms of pass@{1,2,3} under the SWE-Agent framework, with evaluations for pass@1 conducted at temperature 0, and those for pass@2 and pass@3 at temperature 1.0.
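For reference, pass@k is conventionally estimated with the unbiased formula of Chen et al. (2021): given $n$ samples of which $c$ are correct, pass@k $= 1 - \binom{n-c}{k}/\binom{n}{k}$. The sketch below implements that standard estimator; how the temperature-0 and temperature-1 runs here are aggregated into pass@{1,2,3} may differ in detail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k drawn samples is correct),
    given n total samples of which c are correct."""
    if n - c < k:
        # Fewer incorrect samples than draws: at least one draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Two draws from 3 samples with 1 correct: 1 - C(2,2)/C(3,2) = 2/3.
print(pass_at_k(n=3, c=1, k=2))
```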
Figure 5 presents the SWE-Agent performances of each prior (Base, MT, SFT, RL) after being fine-tuned with different amounts of agentic trajectories. We have the following observations:
1. The RL prior outperforms all the other models in nearly all SWE-Agent SFT settings. This demonstrates that the Agentless training recipe indeed strengthens the prior for SWE-Agent adaptation. For example, to achieve the top pass@1 performance of the Base prior, the RL prior needs only $2^{23}$ SWE-Agent SFT tokens, whereas the Base prior consumes $1.5\times 2^{28}$ tokens.
2. The MT prior lags behind the SFT and RL ones in extremely data-scarce settings (zero-shot ($0$) and one-step gradient descent ($2^{21}$)), but quickly becomes on par with them once 200 trajectories ($2^{24}$) are available for finetuning. This indicates that adaptation efficiency remains comparable after the prior is strengthened through Agentless mid-training.
3. The performance of the SFT prior is mostly similar to that of the RL one, except in two cases: (i) the SFT prior outperforms the RL one under the zero-shot setting. This is reasonable, as the RL prior might overfit to the Agentless input-output format, while the SFT prior suffers less from this. (ii) The SFT prior exhibits significant degradation with 200 SWE-Agent trajectories ($2^{24}$). A potential reason is that the 200 trajectories collapse onto a single data mode, leading the SFT prior to overfit through memorization (chu2025sft); the RL prior instead embeds stronger transferable skills and thus generalizes better.
<details>
<summary>figs/sec4_long_cot_to_multi_turn/hist_steps_6x4.png Details</summary>

### Visual Description
## Step Histogram: Number of Instances Resolved (Per Bin of Turns)
### Overview
This image is a step histogram (or step plot) comparing the performance of four different models or methods (RL, SFT, MT, and Base) on a task. The chart displays how many problem instances were successfully resolved, grouped by the number of conversational turns required for resolution. The data suggests a performance comparison across different interaction lengths.
### Components/Axes
* **Chart Title:** "Number of instances resolved (per bin of turns)"
* **Y-Axis:**
* **Label:** "#Instances resolved"
* **Scale:** Linear, from 0 to 160.
* **Major Tick Marks:** 0, 40, 80, 120, 160.
* **X-Axis:**
* **Label:** "#Turns"
* **Scale:** Linear, binned in increments of 10.
* **Bins (Tick Marks):** 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
* **Legend:** Located in the top-right corner of the plot area.
* **RL:** Solid red line.
* **SFT:** Dash-dot orange line.
* **MT:** Dotted purple line.
* **Base:** Dashed blue line.
### Detailed Analysis
The chart shows the count of resolved instances for each model within specific turn-count bins (e.g., 0-10 turns, 10-20 turns). Values are approximate based on visual inspection of the step heights.
**Trend Verification:** All four data series follow a similar visual trend: a sharp peak in the 10-20 turn bin, followed by a general decline as the number of turns increases. The RL series consistently shows the highest or near-highest resolved count in most bins.
**Data Points by Bin (Approximate Values):**
* **Bin 0-10 Turns:**
* MT: ~55 instances (highest)
* SFT: ~38
* RL: ~38
* Base: ~25 (lowest)
* **Bin 10-20 Turns (Peak for all models):**
* RL: ~155 instances (highest peak)
* SFT: ~145
* Base: ~140
* MT: ~140
* **Bin 20-30 Turns:**
* SFT: ~70 instances (highest)
* RL: ~55
* Base: ~55
* MT: ~50
* **Bin 30-40 Turns:**
* RL: ~30 instances
* Base: ~28
* MT: ~25
* SFT: ~22
* **Bin 40-50 Turns:**
* RL: ~20 instances
* SFT: ~15
* Base: ~12
* MT: ~8
* **Bin 50-60 Turns:**
* SFT: ~12 instances
* RL: ~8
* Base: ~5
* MT: ~5
* **Bin 60-70 Turns:**
* SFT: ~8 instances
* RL: ~5
* Base: ~2
* MT: ~2
* **Bin 70-80 Turns:**
* RL: ~8 instances (notable small rise)
* SFT: ~5
* Base: ~2
* MT: ~2
* **Bin 80-90 Turns:**
* SFT: ~5 instances
* RL: ~2
* Base: ~2
* MT: ~2
* **Bin 90-100 Turns:**
* RL: ~8 instances (another small rise)
* SFT: ~5
* Base: ~2
* MT: ~2
### Key Observations
1. **Universal Peak:** All models achieve their highest number of resolved instances in the 10-20 turn bin, indicating this is the most common length for successful resolution.
2. **RL Dominance at Peak:** The RL model has the highest peak performance (~155 instances) in the 10-20 turn range.
3. **Performance Decline:** For all models, the number of resolved instances drops significantly as the required number of turns increases beyond 20.
4. **MT's Early Strength:** The MT model performs best relative to others in the shortest bin (0-10 turns).
5. **SFT's Mid-Range Strength:** The SFT model shows the highest resolved count in the 20-30 turn bin.
6. **Long-Tail Performance:** In the higher turn bins (70+), the resolved counts are very low for all models, though RL shows minor, isolated increases in the 70-80 and 90-100 bins.
### Interpretation
This chart likely evaluates AI models on a conversational or multi-step task (e.g., dialogue systems, problem-solving agents). The "turns" represent interaction steps, and "instances resolved" are successful task completions.
* **What the data suggests:** The task is most frequently solvable within 10-20 interactions. Resolution requiring more than 30 turns is progressively rarer, suggesting either increased difficulty or a dataset skewed towards shorter solutions.
* **Model Comparison:** RL appears most effective for the most common case (10-20 turns). MT may be better for very quick resolutions, while SFT holds an edge for slightly longer interactions (20-30 turns). The "Base" model generally underperforms the specialized methods (RL, SFT, MT).
* **Anomalies:** The small bumps for RL in the 70-80 and 90-100 turn bins are interesting. They could indicate a subset of very difficult problems that the RL model is uniquely capable of solving, or they could be statistical noise given the low counts.
* **Underlying Message:** The visualization argues for the effectiveness of trained models (RL, SFT, MT) over a base model, with RL showing particular strength for the most common resolution path. It also highlights the inherent challenge of the task as interaction length grows.
</details>
<details>
<summary>figs/skill_analysis_figure.png Details</summary>

### Visual Description
## Stacked Bar Chart: Model Performance in Resolving Cases
### Overview
The image displays a stacked bar chart comparing the performance of four different models (Base, MT, SFT, RL) in terms of the "Number of Resolved Cases." Each bar is divided into two segments: a solid-colored base representing the "Bugfixer cutoff" and a hatched top section representing "Reflection." The chart demonstrates a clear upward trend in total resolved cases across the models, with each subsequent model showing improvement.
### Components/Axes
* **Chart Type:** Stacked Bar Chart.
* **Y-Axis:**
* **Label:** "Number of Resolved Cases"
* **Scale:** Linear, ranging from 0 to 800, with major tick marks every 100 units.
* **X-Axis:**
* **Label:** "Models"
* **Categories (from left to right):** "Base", "MT", "SFT", "RL".
* **Legend:**
* **Position:** Top-left corner of the chart area.
* **Item 1:** A solid blue rectangle labeled "Bugfixer cutoff".
* **Item 2:** A blue rectangle with diagonal hatching labeled "Reflection".
* **Data Series & Colors:**
* **Base Model:** Solid blue base, blue hatched top.
* **MT Model:** Solid purple base, purple hatched top.
* **SFT Model:** Solid orange base, orange hatched top.
* **RL Model:** Solid red base, red hatched top.
### Detailed Analysis
The chart presents the following data for each model, broken down by component:
1. **Base Model:**
* **Bugfixer cutoff (Solid Blue):** 484 cases.
* **Reflection (Hatched Blue):** 94 cases.
* **Total Resolved Cases:** 578 (annotated as "578(+94)").
2. **MT Model:**
* **Bugfixer cutoff (Solid Purple):** 542 cases.
* **Reflection (Hatched Purple):** 100 cases.
* **Total Resolved Cases:** 642 (annotated as "642(+100)").
3. **SFT Model:**
* **Bugfixer cutoff (Solid Orange):** 584 cases.
* **Reflection (Hatched Orange):** 109 cases.
* **Total Resolved Cases:** 693 (annotated as "693(+109)").
4. **RL Model:**
* **Bugfixer cutoff (Solid Red):** 605 cases.
* **Reflection (Hatched Red):** 113 cases.
* **Total Resolved Cases:** 718 (annotated as "718(+113)").
**Trend Verification:**
* The **"Bugfixer cutoff"** component shows a steady upward trend: 484 → 542 → 584 → 605.
* The **"Reflection"** component also shows a steady upward trend: 94 → 100 → 109 → 113.
* The **Total Resolved Cases** consequently show a consistent upward trend: 578 → 642 → 693 → 718.
### Key Observations
* **Consistent Improvement:** Each model (Base → MT → SFT → RL) outperforms the previous one in both the "Bugfixer cutoff" and "Reflection" components, leading to a higher total.
* **Dominant Component:** The "Bugfixer cutoff" constitutes the majority of resolved cases for all models, ranging from approximately 83.7% (Base) to 84.3% (RL) of the total.
* **Growth of "Reflection":** The contribution from "Reflection" increases in absolute terms (from 94 to 113), though its share of the total slightly decreases (from ~16.3% to ~15.7%) because the "Bugfixer cutoff" base grows faster.
* **Largest Gains:** The most significant total improvement occurs between the "Base" and "MT" models (+64 cases). The incremental gain from "SFT" to "RL" is the smallest (+25 cases), suggesting potential diminishing returns.
### Interpretation
This chart likely illustrates the results of an iterative model development or training process in a technical domain, such as automated bug fixing or problem resolution. The "Bugfixer cutoff" may represent a baseline or initial resolution capability, while "Reflection" could signify an additional, perhaps more sophisticated, reasoning or self-correction step that yields further resolutions.
The data suggests that sequential training or refinement techniques (represented by MT, SFT, RL) are effective. The "RL" (likely Reinforcement Learning) model achieves the highest performance, indicating that this training paradigm is the most successful among those tested for this task. The consistent, additive contribution of the "Reflection" component across all models implies it is a valuable and complementary module to the core "Bugfixer" system. The narrowing gap between later models (SFT to RL) might indicate that the problem space is approaching a performance ceiling with the current methodology, or that further gains require more substantial architectural changes.
</details>
Figure 6: Left: Performance of the four priors under turn limits after SWE-Agent adaptation. Right: Characterization of the BugFixer and reflection skills for each prior, counting the resolved cases of the 3 runs at the Stage-3 cutoff moment and comparing them with the final success cases.
From long CoT to extended multi-turn interactions.
We hypothesize that reflective behaviors cultivated through long chain-of-thought reasoning may transfer to settings requiring extended multi-turn interactions. To examine this, we evaluate the four priors (Base, MT, SFT, and RL) by finetuning on the 5,016 trajectories and testing on SWE-bench Verified under varying turn limits, with pass@3 as the metric (Figure 6, left). The distinct interaction-length profiles provide supporting evidence: the RL prior, after finetuning, continues to make progress beyond 70 turns, while the SFT, mid-trained, and raw models show diminishing returns around 70, 60, and 50 turns, respectively.
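The turn-limit analysis above amounts to bucketing each resolved instance by the number of turns its trajectory used. A minimal sketch, assuming only a list of per-instance turn counts (bin width of 10 matching the histogram granularity):

```python
# Sketch: bucket resolved instances into turn-count bins of a fixed width.
from collections import Counter

def bin_by_turns(resolved_turn_counts, width=10):
    """Map each resolved instance's turn count to the lower edge of its bin."""
    return Counter((t // width) * width for t in resolved_turn_counts)
```

For example, `bin_by_turns([5, 12, 18, 25])` places one instance in the 0-10 bin, two in 10-20, and one in 20-30.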
We further evaluate the efficacy of the Agentless skill priors (BugFixer and reflection) in the SWE-Agent adapted model. For BugFixer, given that the SWE-Agent may autonomously reflect between the five stages, we examine the moment in each trajectory when the bug fix of the third stage is first completed and the test rerun of the fourth stage has not yet begun. Heuristically, at this point the SWE-Agent has not yet obtained execution feedback from the fourth stage, and thus has neither reflected on the execution information nor refined the bug fix. We therefore calculate the success rate of direct submission at this cutoff moment, which reflects the capability of the BugFixer skill. Regarding reflection, we compare the performance at the cutoff point with the performance after full completion for each problem; the increment in the number of successfully solved problems reflects the capability of the reflection skill.
We use kimi-k2-0711-preview (team2025kimi_k2) to annotate the SWE-Agent trajectories, identifying the stage to which each turn belongs. Figure 6 (right) demonstrates that both skills are strengthened through each stage of the Agentless training recipe: For the BugFixer skill, the cutoff performance at Stage-3 within the SWE-Agent interaction trajectories of the four adapted models shows consistent improvement, ranging from 484 cases resolved by the Base prior to 605 cases by the RL prior, as measured by the number of successful resolutions within three passes. For the reflection skill, examining the performance gains from Stage-3 to the end of the trajectories reveals a similar trend, with improvements increasing from +94 under the Base prior to +113 under the RL prior. Taken together, the adapted model from the RL prior achieves the strongest overall performance across both skills. It should be noted that our analysis of the reflection skill remains coarse-grained, since the measured performance gains between the two checkpoints capture not only agentic reflection and redo behaviors, but also the intermediate test-writing process performed by the SWE-Agent. A more fine-grained evaluation that isolates the TestWriter skill prior is left for future work. The prompt for SWE-Agent stage annotation, extended qualitative studies, as well as additional discussions for skill transfer and generalization, are covered in Appendix G.
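The cutoff-based skill characterization can be sketched as follows. The `stages`, `patches`, and `judge` interfaces are illustrative assumptions (per-turn stage annotations, the candidate patch after each turn, and a unit-test oracle), not the paper's actual tooling:

```python
def stage3_cutoff_index(stages):
    """Index of the first Stage-4 turn: everything before it is the moment
    the Stage-3 bug fix is complete but test reruns have not yet started."""
    for i, stage in enumerate(stages):
        if stage >= 4:
            return i
    return len(stages)

def characterize(trajectories, judge):
    """Count (BugFixer successes at the cutoff, extra successes by the end)."""
    cutoff_resolved = final_resolved = 0
    for traj in trajectories:
        k = stage3_cutoff_index(traj["stages"])
        patch = traj["patches"][k - 1] if k > 0 else None
        if patch is not None and judge(patch):
            cutoff_resolved += 1  # direct submission at the cutoff succeeds
        if judge(traj["patches"][-1]):
            final_resolved += 1  # resolved after the full trajectory
    return cutoff_resolved, final_resolved - cutoff_resolved
```

The first component approximates the BugFixer skill; the second (the increment from cutoff to completion) approximates the reflection skill.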
<details>
<summary>figs/sec4_swe_agent_rl/rebuttal_cmp_prior_pass1.png Details</summary>

### Visual Description
## Line Chart: Performance Progression of SFT vs. RL Priors
### Overview
The image is a line chart comparing the performance of two different model training approaches, an SFT (Supervised Fine-Tuning) prior and an RL (Reinforcement Learning) prior, over the course of 300 training steps. Performance is measured by a "Pass Rate (%)" metric. Each approach is represented by an average line (Avg@5) and a shaded region indicating the range between the minimum and maximum values over 5 runs (Min@5-Max@5).
### Components/Axes
* **Chart Type:** Line chart with shaded confidence/range bands.
* **X-Axis:**
* **Label:** "Steps"
* **Scale:** Linear, from 0 to 300.
* **Major Tick Marks:** 0, 50, 100, 150, 200, 250, 300.
* **Y-Axis:**
* **Label:** "Pass Rate (%)"
* **Scale:** Linear, from 25.0 to 42.5.
* **Major Tick Marks:** 25.0, 27.5, 30.0, 32.5, 35.0, 37.5, 40.0, 42.5.
* **Legend (Positioned in the bottom-right quadrant of the plot area):**
1. **Red line with circular markers:** "SFT prior, Avg@5"
2. **Light pink shaded area:** "SFT prior, Min@5-Max@5"
3. **Blue line with circular markers:** "RL prior, Avg@5"
4. **Light blue shaded area:** "RL prior, Min@5-Max@5"
### Detailed Analysis
**1. SFT prior, Avg@5 (Red Line):**
* **Trend:** Starts at the lowest point (~25.8% at step 0). Shows a rapid initial increase, followed by a generally upward but highly volatile trend with frequent peaks and troughs. The growth rate slows after approximately step 150.
* **Key Data Points (Approximate):**
* Step 0: ~25.8%
* Step 50: ~33.5%
* Step 100: ~36.0%
* Step 150: ~34.5% (a local trough)
* Step 200: ~37.0%
* Step 250: ~37.0%
* Step 300: ~36.2%
* **Range (Pink Shaded Area):** The min-max range is substantial throughout, often spanning 3-5 percentage points. The range appears widest around steps 150-200 and 250-300, indicating high variance in performance across different runs at those stages.
**2. RL prior, Avg@5 (Blue Line):**
* **Trend:** Starts higher than SFT (~27.8% at step 0). Also shows a rapid initial increase. Its upward trend appears slightly more consistent and less volatile than the SFT line, especially after step 100. It maintains a performance lead over the SFT average for nearly the entire duration.
* **Key Data Points (Approximate):**
* Step 0: ~27.8%
* Step 50: ~37.5% (a sharp peak)
* Step 100: ~36.5%
* Step 150: ~40.0%
* Step 200: ~39.0%
* Step 250: ~41.0%
* Step 300: ~38.8%
* **Range (Blue Shaded Area):** The min-max range is also significant but appears slightly narrower on average than the SFT range, particularly in the later stages (steps 200-300). This suggests the RL prior may yield more consistent results across runs.
### Key Observations
1. **Performance Gap:** The RL prior (blue) consistently outperforms the SFT prior (red) on average after the initial steps. The gap is most pronounced around steps 150 and 250.
2. **Volatility:** Both methods exhibit high volatility, as seen in the jagged average lines and wide shaded ranges. However, the SFT prior's average line appears more erratic.
3. **Convergence and Divergence:** The two average lines converge briefly around step 100 and step 175 but otherwise maintain a separation. The shaded ranges overlap significantly throughout, indicating that while the averages differ, individual runs from either method can achieve similar performance levels.
4. **Peak Performance:** The highest observed average pass rate is achieved by the RL prior, reaching approximately 41% around step 250. The SFT prior's average peaks lower, at around 38-39%.
### Interpretation
This chart demonstrates a comparative analysis of two training methodologies for a task measured by a pass rate. The data suggests that, on average, the **RL prior approach leads to better final performance and a more stable learning trajectory** than the SFT prior approach over 300 steps.
* **Effectiveness:** The RL prior's higher starting point and sustained lead imply it may provide a better initialization or learning signal for this specific task.
* **Stability:** The slightly narrower range for the RL prior suggests it is less sensitive to random seeds or initial conditions, making it a more reliable method.
* **Underlying Dynamics:** The high volatility in both curves indicates the training process is noisy or the evaluation metric is sensitive. The fact that the ranges overlap heavily means that while RL is better *on average*, a well-tuned SFT run could still match a poorly-tuned RL run.
* **Practical Implication:** If the goal is to maximize the expected pass rate with reliable results, the RL prior appears to be the superior choice based on this data. However, the significant variance warns that multiple runs are necessary to gauge true performance for either method.
</details>
<details>
<summary>figs/sec4_swe_agent_rl/rebuttal_cmp_prior_pass3.png Details</summary>

### Visual Description
## Line Chart: Pass Rate (%) vs. Training Steps for Two Prior Methods
### Overview
The image is a line chart comparing the performance of two different prior methods, SFT (Supervised Fine-Tuning) and RL (Reinforcement Learning), over the course of training steps. The performance metric is the "Pass Rate (%)" for a "Pass@3" evaluation. The chart shows the progression of this metric for both methods from step 0 to step 300.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Steps"
* **Scale:** Linear, from 0 to 300.
* **Major Ticks:** 0, 50, 100, 150, 200, 250, 300.
* **Y-Axis:**
* **Label:** "Pass Rate (%)"
* **Scale:** Linear, from 42 to 56.
* **Major Ticks:** 42, 44, 46, 48, 50, 52, 54, 56.
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Series 1:** Red line with circular markers, labeled "SFT prior, Pass@3".
* **Series 2:** Blue line with circular markers, labeled "RL prior, Pass@3".
* **Grid:** A light gray grid is present for both major x and y ticks.
### Detailed Analysis
**Data Series 1: SFT prior, Pass@3 (Red Line)**
* **Trend:** Shows a general upward trend with significant volatility. It starts at the lowest point on the chart and ends at a high level, but with notable dips.
* **Key Data Points (Approximate):**
* Step 0: ~42.0% (lowest point for this series)
* Step ~180: ~54.2% (peak for this series)
* Step 300: ~52.0%
* Notable dip at Step ~220: ~46.8%
**Data Series 2: RL prior, Pass@3 (Blue Line)**
* **Trend:** Also shows a general upward trend with high volatility. It starts higher than the SFT series and achieves the highest overall value on the chart.
* **Key Data Points (Approximate):**
* Step 0: ~45.5%
* Step ~240: ~56.0% (highest point on the entire chart)
* Step 300: ~51.5%
* Notable dip at Step ~170: ~48.2%
### Key Observations
1. **Initial Performance Gap:** At step 0, the RL prior method starts with a pass rate approximately 3.5 percentage points higher than the SFT prior method.
2. **Overall Improvement:** Both methods demonstrate a clear improvement in pass rate over the 300 training steps.
3. **Volatility:** Both data series are highly volatile, with frequent sharp increases and decreases between consecutive measured steps.
4. **Peak Performance:** The RL prior method achieves the highest recorded pass rate (~56%) at approximately step 240. The SFT prior method's peak (~54.2%) occurs earlier, around step 180.
5. **Convergence and Divergence:** The two lines frequently cross each other, indicating periods where one method outperforms the other. However, the RL prior line is generally above the SFT prior line for the majority of the training steps, especially in the first half.
6. **Final Values:** By step 300, the performance of both methods is relatively close, with SFT prior at ~52% and RL prior at ~51.5%.
### Interpretation
This chart likely visualizes the training progress of two different AI model initialization or training strategies ("priors") on a task where success is measured by a "Pass@3" rate (e.g., solving a problem correctly in at least one of 3 attempts).
* **What the data suggests:** The Reinforcement Learning (RL) prior appears to provide a stronger starting point and leads to a higher maximum performance ceiling compared to the Supervised Fine-Tuning (SFT) prior. However, both methods are effective at improving the model's capability over time.
* **Relationship between elements:** The "Steps" axis represents training iterations. The upward trend in "Pass Rate" for both lines confirms that training is effective. The volatility suggests the training process is noisy or the evaluation metric is sensitive to small model changes.
* **Notable patterns/anomalies:** The significant dips (e.g., SFT at step ~220, RL at step ~170) could indicate periods of training instability, catastrophic forgetting, or challenging batches of training data. The fact that both methods recover from these dips shows robustness. The RL prior's ability to reach a higher peak suggests it may be better at escaping local optima or finding a more optimal policy for the task.
</details>
<details>
<summary>figs/sec4_swe_agent_rl/rebuttal_cmp_prior_pass5.png Details</summary>

### Visual Description
## Line Chart: Pass Rate (%) vs. Steps for SFT and RL Priors
### Overview
The image is a line chart comparing the performance of two different prior methods, SFT (Supervised Fine-Tuning) and RL (Reinforcement Learning), over the course of 300 training steps. The performance metric is "Pass Rate (%)", measured using a "Pass@5" evaluation. The chart shows that both methods improve over time, but the RL prior generally achieves a higher pass rate with greater volatility.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Steps"
* **Scale:** Linear, from 0 to 300.
* **Major Ticks:** 0, 50, 100, 150, 200, 250, 300.
* **Y-Axis:**
* **Label:** "Pass Rate (%)"
* **Scale:** Linear, from 48 to 60.
* **Major Ticks:** 48, 50, 52, 54, 56, 58, 60.
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Entry 1:** Red line with circular markers, labeled "SFT prior, Pass@5".
* **Entry 2:** Blue line with circular markers, labeled "RL prior, Pass@5".
* **Grid:** Light gray grid lines are present for both major x and y ticks.
### Detailed Analysis
**1. SFT prior, Pass@5 (Red Line):**
* **Trend:** Shows a general upward trend from step 0 to step 300, but with significant high-frequency volatility (sharp peaks and troughs).
* **Key Data Points (Approximate):**
* Starts at ~47.5% at step 0.
* Initial rapid rise to ~51.5% by step ~25.
* Experiences a notable dip to ~50.2% around step 50.
* Reaches a local peak of ~56.2% near step 125.
* Hits its highest point of ~59.0% around step 175.
* Shows a significant drop to ~51.8% near step 225.
* Ends at approximately 54.8% at step 300.
* **Volatility:** The line frequently changes direction, with swings of 2-4 percentage points between consecutive data points.
**2. RL prior, Pass@5 (Blue Line):**
* **Trend:** Also shows a clear upward trend, starting higher than the SFT line and generally maintaining a performance advantage. It exhibits even greater volatility, especially in the later stages.
* **Key Data Points (Approximate):**
* Starts at ~50.8% at step 0.
* Rises sharply to ~55.2% by step ~25.
* Reaches a high of ~56.6% around step 50.
* Shows a deep trough at ~53.2% near step 75.
* Surges to a peak of ~59.2% around step 125.
* Experiences a sharp drop to ~53.6% near step 175.
* Achieves its maximum value of ~60.2% at approximately step 290.
* Ends at ~58.0% at step 300.
* **Volatility:** The blue line's fluctuations are often larger in magnitude than the red line's, particularly after step 150.
### Key Observations
1. **Performance Gap:** The RL prior (blue) consistently outperforms the SFT prior (red) for the majority of the training steps shown. The gap is smallest at the very beginning and around step 175, where the red line briefly peaks.
2. **Volatility Comparison:** Both methods are highly volatile, but the RL prior's performance swings are more extreme, suggesting less stable but potentially higher-ceiling optimization.
3. **Late-Stage Performance:** After step 250, the RL prior shows a strong upward surge, reaching the highest recorded value on the chart (~60.2%), while the SFT prior's performance stagnates and then declines.
4. **Correlation of Dips:** There are points where both lines dip simultaneously (e.g., around step 75 and step 175), which might indicate challenging phases in the training process or evaluation batches.
### Interpretation
This chart demonstrates the comparative learning dynamics of two model training approaches. The **RL prior** appears to be a more powerful but less stable method for improving the pass rate on the given task. Its ability to reach higher peaks suggests it can find better solutions, but its greater volatility implies the training process is more sensitive or explores a wider, riskier space of parameters.
The **SFT prior** provides a more stable, though generally lower, performance trajectory. Its significant dip around step 225, from which it doesn't fully recover, could indicate overfitting or a failure mode in the supervised fine-tuning process at that stage.
The simultaneous dips in both curves are particularly interesting, as they suggest the underlying evaluation data or task difficulty may have inherent variations that affect both models similarly, regardless of their training prior. The final 50 steps highlight a key divergence: the RL method capitalizes on late training to achieve a new high, while the SFT method falters. This could imply that RL-based optimization benefits from longer training horizons for this specific problem.
</details>
Figure 7: Comparison between the SFT Prior and the RL Prior when further applied with end-to-end SWE-Agent RL. Left: Pass@1 averaged from 5 runs. Middle: Pass@3. Right: Pass@5. The two priors are activated with the same $2^{21}$ SWE-Agent SFT tokens (the second column in Figure 5). After end-to-end RL, the RL prior slightly outperforms the SFT prior in all the Pass@1, Pass@3, and Pass@5 settings, which agrees with their SWE-Agent SFT performance comparison in Figure 5.
End-to-end SWE-Agent RL for prior comparison. To further validate the effectiveness of the priors baked in by the Agentless training recipes, we employ end-to-end SWE-Agent RL (deepswe2025) with the cold-started priors as the initial models. To maximally alleviate the effect of the patterns of proprietary models within the SWE-Smith trajectories, we leverage the setting with $2^{21}$ SWE-Agent SFT tokens (the second column in Figure 5), where a single step of gradient descent takes place on top of each prior. Under this minimal cold-start setup, end-to-end RL reveals the potential of each prior beyond taking the shortcut of imitation (gudibande2024the; chu2025sft).
To run the end-to-end RL training for prior comparison, we use the SWE-Gym (pan2024training) and the SWE-bench-extra (badertdinov2024scaling) subsets as the training set. Similar to the Agentless RL recipe, we first use each initial model to filter out the problems with Pass@8 = 0. For the model with the MT prior, 260 out of 6,202 problems remain; for the models with the SFT prior and the RL prior, a total of 2,062 of the 6,202 problems are kept. In all end-to-end RL runs, we use the outcome reward only and the same policy gradient algorithm as in Sec. 3.3 without KL or entropy regularization, with a batch size of 256. The results are as follows:
For the model with the MT prior, the pass@1 performance quickly deteriorates to less than 2% after 10 end-to-end RL steps. A potential reason is the scarcity of trainable problems, reflecting the weakness of this prior. For the models with the SFT prior and the RL prior, the RL runs last for 300 steps, and we plot the performance comparison in Figure 7. The model with the RL prior demonstrates slightly higher Pass@1, Pass@3, and Pass@5 scores than the model with the SFT prior. While this agrees with the performance comparison under SWE-Agent SFT shown in Figure 5, we observe that the patterns in the interaction trajectories incentivized by end-to-end SWE-Agent RL differ significantly from the patterns of the proprietary models (detailed in Appendix G.3). These results reveal that the Agentless training recipe curates strong priors for end-to-end learning under SWE-Agent frameworks with minimal supervision from proprietary end-to-end trajectories. We leave the exploration of more advanced agentic RL techniques for further improvement as future work.
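The pre-RL problem filtering can be sketched as below; `rollout` is a hypothetical single-attempt interface judged by the unit tests, not an API from the paper:

```python
def filter_trainable(problems, rollout, k=8):
    """Keep only problems with Pass@k > 0 under the initial model.

    rollout(problem) -> bool performs one attempt and reports whether the
    produced patch passes the unit tests. A real run would execute all k
    attempts; `any` here short-circuits on the first success.
    """
    return [p for p in problems if any(rollout(p) for _ in range(k))]
```

Problems the initial model never solves in eight attempts contribute no positive reward signal, so discarding them concentrates the RL batches on learnable instances.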
## 5 Conclusion and Future Work
In this work, we reframed Agentless and agentic paradigms for automated software engineering as complementary rather than competing. By introducing Kimi-Dev, we demonstrated that structured Agentless training can induce transferable skill priors, including bug localization, code repair, and self-reflection. As a result, Kimi-Dev not only achieves SoTA results on SWE-bench Verified among the workflow-based approaches, but enables efficient SWE-Agent adaptation as well. These findings establish a novel path toward building more generalizable coding agents through staged training.
## Acknowledgements
We thank Yuzhi Wang, Xinyu Zhou, Guokun Lai, Yulun Du, Fang Li, Hao Ding, Dehao Zhang, Enming Yuan, Dikang Du, and Jiacheng You for their valuable suggestions. We also appreciate the members of the infrastructure team at Moonshot AI for their timely support during the project.
## Ethics and Reproducibility Statements
This work adheres to the Code of Ethics required by the ICLR conference. The study does not involve human subjects or animal experimentation. Personally identifiable information is excluded from the raw data for privacy considerations (see the mid-training data recipe detailed in Appendix A). Beyond the scope of this work, we strongly advocate for the community to advance systematic research on agent safety, thereby ensuring responsible progress in this area.
For all of the experiments, we have covered the detailed setups and discussions in the appendices: mid-training for Agentless in Appendix A, details of the used dockers in Appendix B, Agentless RL in Appendix C, agent infrastructure in Appendix D, case studies under Agentless in Appendix E, preliminary findings about emergent test-time parallel scaling in Appendix F, and extended analysis for SWE-Agents in Appendix G.
## Appendix
## Appendix A Details of Mid-training
We curate a mid-training data recipe with a focus on enhancing SWE capabilities. Central to this effort is the collection of pull request (PR) data from GitHub, which provides extensive coverage of real-world bug fixes, feature requests, and code enhancements. To ensure data quality, we apply two filters: (i) we only retain repositories that have accumulated at least five GitHub stars, thereby excluding sparsely maintained projects with limited community engagement; and (ii) we remove any repositories overlapping with the SWE-bench benchmark (jimenez2023swe) to prevent potential data leakage. For each candidate repository, we query the GitHub API for all PRs with the state MERGED, while discarding those abandoned, superseded, or left under review. To preserve more context information, we also snapshot the entire codebase at the base commit before the first code change in the PR.
After data crawling, we incorporate two complementary forms of natural code change data: (i) natural diff patches and (ii) PR commit packs. A natural diff patch consolidates all commits in a PR into the final code difference, typically expressed as SEARCH/REPLACE blocks. This format aligns with the Agentless paradigm, in which the model must directly output the final patch. In contrast, a commit pack captures the sequence of human-authored commits within a PR, where each commit message (textual reasoning) is paired with the corresponding code modification (action). This structure closely parallels the SWE-Agent setting, where intermediate reasoning steps are interleaved with actions. However, the distinction between the utilities of the two data types is not absolute: commit messages in a PR commit pack can still inform the model's knowledge and indirectly strengthen its reasoning ability in the Agentless setting.
Natural diff patches. The natural diff patches used in the mid-training data recipe are processed with the following rules:
- Incorporate the Agentless prompt templates (see Listings 1, 2, 3, and 4; these four prompt templates are also used in the later stages, including cold-start, RL, and test-time self-play), and apply a loss mask to the prompt part. For the localization prompt, the response is set as the files modified in the ground-truth diff patch.
- If a related issue to the PR exists, use the issue content; otherwise, use the PR title as a surrogate for the issue content.
- If a related issue to the PR exists, prepend the issue discussion at the beginning of the output in the code edit response. We aim to strengthen the model's code-edit reasoning capability by leveraging the discussion contents.
- Discard PRs that include modifications to files other than {.py, .md, .rst}.
- For PRs containing {.md, .rst} file modifications, retain only the Python diffs and rewrite them into SEARCH/REPLACE blocks.
- Remove PRs involving file additions or deletions.
- For the code edits with only line insertions or deletions, preserve the original Git diff hunks as the SEARCH content in the SEARCH/REPLACE blocks.
- Ensure that no more than three Python files are modified per PR.
- Apply a filtering script to exclude PRs with non-{.py, .md, .rst} modifications, or PRs modifying more than three Python files.
- Further exclude PRs containing more than five SEARCH/REPLACE blocks.
A total of ~50B tokens of natural diff patches is obtained after applying these filtering rules.
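The bullet rules above can be condensed into a single filtering predicate. The PR record fields below (`files` with `path`/`status`, and a precomputed `num_search_replace_blocks`) are illustrative assumptions, not the actual pipeline schema:

```python
import os

ALLOWED_EXTS = {".py", ".md", ".rst"}

def keep_pr(pr):
    """Return True if a PR survives the diff-patch filtering rules sketched here."""
    paths = [f["path"] for f in pr["files"]]
    # Discard PRs touching files other than .py/.md/.rst.
    if any(os.path.splitext(p)[1] not in ALLOWED_EXTS for p in paths):
        return False
    # Remove PRs involving file additions or deletions.
    if any(f["status"] in ("added", "deleted") for f in pr["files"]):
        return False
    # Ensure no more than three Python files are modified.
    if sum(p.endswith(".py") for p in paths) > 3:
        return False
    # Exclude PRs with more than five SEARCH/REPLACE blocks after rewriting.
    return pr["num_search_replace_blocks"] <= 5
```

A PR modifying one `.py` file and a `README.md` with two rewritten blocks passes; a PR touching a `.js` file is rejected at the first check.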
````
Please look through the following GitHub problem description and Repository structure and provide a list of files that one would need to edit to fix the problem.
### GitHub Problem Description ###
{related issue / PR title content}
###
### Repository Structure ###
{file structure induced by the repo snapshot}
###
Please only provide the full path and return at most 5 files.
The returned files should be separated by new lines ordered by most to least important and wrapped with ```
For example:
```
file1.py
file2.py
```
````
Listing 1: Agentless prompt template: Localization for BugFixer.
````
Please look through the following GitHub problem description and Repository structure and provide a list of test files that should be run after applying the patch to fix the issue.
### GitHub Problem Description ###
{related issue / PR title content}
###
### Repository Structure ###
{file structure induced by the repo snapshot}
###
Please only provide the full path and return at most 5 files.
The returned files should be separated by new lines ordered by most to least important and wrapped with ```
For example:
```
file1.py
file2.py
```
````
Listing 2: Agentless prompt template: Localization for TestWriter.
β¬
We are currently solving the following issue within our repository. Here is the issue text:
--- BEGIN ISSUE ---
{related issue / PR title content}
--- END ISSUE ---
Below are some code segments, each from a relevant file. One or more of these files may contain bugs.
--- BEGIN FILE ---
```
### {filename1}
{content of filename1}
### {filename2}
{content of filename2}
{...}
```
--- END FILE ---
Please first localize the bug based on the issue statement, and then generate *SEARCH/REPLACE* edits to fix the issue.
Every *SEARCH/REPLACE* edit must use this format:
1. The file path
2. The start of search block: <<<<<<< SEARCH
3. A contiguous chunk of lines to search for in the existing source code
4. The dividing line: =======
5. The lines to replace into the source code
6. The end of the replace block: >>>>>>> REPLACE
Here is an example:
```python
### mathweb/flask/app.py
<<<<<<< SEARCH
from flask import Flask
=======
import math
from flask import Flask
>>>>>>> REPLACE
```
Please note that the *SEARCH/REPLACE* edit REQUIRES PROPER INDENTATION. If you would like to add the line "        print(x)", you must fully write that out, with all those spaces before the code!
Wrap the *SEARCH/REPLACE* edit in blocks ```python...```.
Listing 3: Agentless prompt template: Code edit for BugFixer.
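Applying a SEARCH/REPLACE edit amounts to a verbatim substring substitution. A minimal sketch, assuming the SEARCH block must match the file exactly (including indentation) and that only the first occurrence is replaced:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one SEARCH/REPLACE edit to a file's contents.
    The SEARCH block must appear verbatim; only its first occurrence is edited."""
    if search not in source:
        raise ValueError("SEARCH block not found; indentation must match exactly")
    return source.replace(search, replace, 1)

# Mirroring the example in the listing above:
original = "from flask import Flask\n"
patched = apply_search_replace(
    original,
    "from flask import Flask",
    "import math\nfrom flask import Flask",
)
# patched == "import math\nfrom flask import Flask\n"
```

Rejecting non-matching SEARCH blocks (rather than silently doing nothing) gives the model a clear failure signal when its edit does not apply.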
We are currently solving the following issue within our repository. Here is the issue text:
--- BEGIN ISSUE ---
{related issue / PR title content}
--- END ISSUE ---
Below are some code segments, each from a relevant test file. One or more of these files may be added some new tests which can reproduce the issue.
--- BEGIN FILE ---
```
### {filename1}
{content of filename1}
### {filename2}
{content of filename2}
{...}
```
--- END FILE ---
Please first localize some possible locations in those test files within the repo, and then generate *SEARCH/REPLACE* edit updates to the **test** files in the repo, so that the erroneous scenario described in the problem is reproduced.
Every *SEARCH/REPLACE* edit must use this format:
1. The file path
2. The start of search block: <<<<<<< SEARCH
3. A contiguous chunk of lines to search for in the existing source code
4. The dividing line: =======
5. The lines to replace into the source code
6. The end of the replace block: >>>>>>> REPLACE
Here is an example:
```python
### mathweb/flask/app.py
<<<<<<< SEARCH
from flask import Flask
=======
import math
from flask import Flask
def test__rules__std_L060_raised() -> None:
    try:
        sql = "SELECT IFNULL(NULL, 100), NVL(NULL, 100);"
        result = lint(sql, rules=["L060"])
        assert len(result) == 2
    except:
        print("Other issues")
        return
    try:
        assert result[0]["description"] == "Use 'COALESCE' instead of 'IFNULL'."
        assert result[1]["description"] == "Use 'COALESCE' instead of 'NVL'."
        print("Issue resolved")
    except AssertionError:
        print("Issue reproduced")
        return
    return
>>>>>>> REPLACE
```
Please note that the *SEARCH/REPLACE* edit REQUIRES PROPER INDENTATION. If you would like to add the line "        print(x)", you must fully write that out, with all those spaces before the code!
Wrap the *SEARCH/REPLACE* edit in blocks ```python...```.
Listing 4: Agentless prompt template: Code edit for TestWriter.
PR commit packs. The PR commit packs used in the mid-training data recipe are processed with the following rules:
- Discard PRs that include modifications to files other than {.py, .md, .rst}.
- For {.md, .rst} file modifications, retain the "diff --git" signature but remove the actual content changes.
- Ensure that each PR modifies at most five Python files (with at least one required). PRs exceeding this limit are discarded.
- Apply a filtering script to exclude PRs containing non-{.py, .md, .rst} file modifications or those modifying more than five Python files.
- Filter out all of the developer signatures and GitHub IDs for ethics considerations.
A total of $\sim$ 20B tokens of PR commit packs is obtained after applying these filtering rules.
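The rule of keeping the "diff --git" signature for doc files while dropping their content changes could look like the following sketch. It assumes a simplified unified-diff layout and an invented function name; the paper's actual processing script is not released.

```python
def strip_doc_hunks(diff_text: str) -> str:
    """Keep every file's 'diff --git' header line, but drop the hunk
    content for .md/.rst files, per the rule above."""
    out, skipping = [], False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            # Treat the file as documentation if the header path ends in .md/.rst.
            skipping = line.rstrip().endswith((".md", ".rst"))
            out.append(line)  # the signature itself is always retained
        elif not skipping:
            out.append(line)
    return "\n".join(out)
```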
In addition, we incorporate synthetic data to further enhance both the reasoning and agentic capabilities of the model. A key observation is that the ground-truth reward for the localization stage in the Agentless setting can be derived directly from the diff patch, since the set of files requiring modification is explicitly indicated.
Synthetic reasoning data. To improve reasoning quality, we perform a lightweight SFT of the Qwen-2.5-72B-Instruct model on 2,000 R1 trajectories. The resulting model is then used to generate large-scale rollouts for the localization stage of both BugFixer and TestWriter. We retain only the rollouts that achieve exactly correct file localizations. This procedure yields approximately 10B tokens of reasoning-intensive data dedicated to Agentless localization in the mid-training recipe.
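The observation that the localization reward follows directly from the ground-truth diff can be sketched as below; `gold_files` and `localization_reward` are illustrative names, not the paper's code, and the exact-match criterion mirrors the "exactly correct file localizations" filter.

```python
def gold_files(diff_text: str) -> set:
    """Extract the set of modified file paths from a unified diff;
    these paths are the ground-truth localization label."""
    files = set()
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            files.add(line[len("+++ b/"):])
    return files

def localization_reward(predicted: list, diff_text: str) -> float:
    """Exact-match reward: 1.0 only when the predicted file set equals
    the set of files the ground-truth patch modifies."""
    return 1.0 if set(predicted) == gold_files(diff_text) else 0.0
```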
Synthetic agentic interactions. To strengthen agentic capabilities, we simulate agent-environment interactions with a custom tool set designed to mimic file-system operations without execution. This design is motivated by practical constraints: while repository snapshots from GitHub are available, not all snapshots are equipped with an executable Docker environment. As a result, shell commands are disabled. Instead, we introduce synthetic tools that allow the agent to view file contents and perform keyword-based search for localization, which effectively reproduces the first stage of Agentless but in an agentic manner. The specification of this tool set is covered in the system prompt, which is then used to elicit agentic interaction rollouts from the Qwen-2.5-72B-Instruct model. The complete system prompt is provided in Listing 5. We apply a loss mask only to the system prompt, enabling the model to simultaneously learn both actions and observations along the trajectory, inspired by yang2024react. This approach integrates both policy and world modeling into mid-training.
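The system-prompt-only loss mask can be sketched as a token-level mask (the function name and representation are illustrative): system-prompt positions contribute no loss, while every subsequent position, covering both actions and observations, is trained on.

```python
def build_loss_mask(system_len: int, total_len: int) -> list:
    """Return a per-token loss mask for one trajectory:
    0 = masked (system prompt), 1 = trained (actions and observations)."""
    assert 0 <= system_len <= total_len
    return [0] * system_len + [1] * (total_len - system_len)

# For a 5-token sequence whose first 3 tokens are the system prompt:
# build_loss_mask(3, 5) -> [0, 0, 0, 1, 1]
```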
Your job is to look through the given GitHub problem description and Repository structure, and edit updates to the files in the repo to resolve the problem.
The job is divided into two stages:
+ In Stage 1, you should localize the files that you would need to edit to fix the problem.
+ In Stage 2, you should edit the updates to the repo.
Let's begin from Stage 1 to localize the bugs:
In Stage 1, besides reading the provided Repository structure, you can use the following skills for exploration. The skills are to be called in an environment wrapped by <execute> and </execute>, listed in the form of python functions as below:
open_file(path: str, is_all: bool | None = False, line_number: int | None = 1, context_lines: int | None = 100) -> None:
Opens the file at the given path in the editor for exploration.
By default, only the first 100 lines of the file are displayed. To open the entire file, set `is_all` to `True`.
The `context_lines` parameter determines the maximum number of lines to be displayed, with a cap of 100 lines. Use `scroll_up` and `scroll_down` to view more content up or down.
If a `line_number` is provided, the window will be moved to include that line.
Note: When `is_all` is set to `True`, the `line_number` and `context_lines` parameters will not take effect, as the entire file will be opened and displayed without any line-specific focus or context limitation.
Args:
path: str: The path to the file to open. The full path of the filename should be provided.
is_all: bool | None = False: If set to `True`, the entire file will be opened. Defaults to `False`.
line_number: int | None = 1: The line number to move to. Defaults to 1.
context_lines: int | None = 100: Only shows this number of lines in the context window (usually from line 1), with line_number as the center (if possible). Defaults to 100.
goto_line(line_number: int) -> None:
Moves the window to show the specified line number.
Args:
line_number: int: The line number to move to.
goto_class_or_func(class_or_func_name: str) -> None:
Moves the window to show the specified class or function in the current open file.
Args:
class_or_func_name: str: The name of the given class, function, or method in a class to move to.
scroll_down() -> None:
Moves the window down by 100 lines.
Args:
None
scroll_up() -> None:
Moves the window up by 100 lines.
Args:
None
search_dir(search_term: str, dir_path: str | None) -> None:
Searches for search_term in all files in dir. If dir is not provided, searches in the entire repository. Filename, fine-grained line number, and the relative class or function it is located in (if applicable) will be shown for each found position.
Args:
search_term: str: The term to search for.
dir_path: str: The path to the directory to search. Should be a full path.
search_file(search_term: str, file_path: str | None = None) -> None:
Searches for search_term in file. If file is not provided, searches in the current open file. Filename, fine-grained line number, and the relative class or function it is located in (if applicable) will be shown for each found position.
Args:
search_term: str: The term to search for.
file_path: str | None: The path to the file to search. Should be a full path filename if provided.
find_file(file_name: str, dir_path: str | None) -> None:
Finds all files with the given name in the specified directory. If dir is not provided, finds in the entire repository.
Args:
file_name: str: The name of the file to find.
dir_path: str: The path to the directory to search.
str_replace(path: str, old_str, new_str)
old_str = [the old content to be replaced]
new_str = [the new content after replacement]
-> None:
Replace the old content (old_str) in the file at the given path with the new content (new_str). This is the skill that you will be using to edit the updates.
Args:
path: str: The path to the file to be updated. The full path of the filename should be provided.
old_str: str: The old content to be replaced. Note that this argument should be written in a new line starting with "old_str =", and the string content should not be quoted.
new_str: str: The new content after replacement. Note that this argument should be written in a new line starting with "new_str =", and the string content should not be quoted.
Example:
Assuming a call is shown as follows:
```
str_replace("filename.py", old_str, new_str)
old_str = a
new_str = b
c
```
Then it will function as replacing the "a\n" string with the "b\nc" string in the "filename.py" file.
insert(path: str, insert_line: int, new_str)
new_str = [the new content to be inserted]
-> None:
Insert the new content (new_str) in the file at the given path. When you want to add an entirely new class/function to the file, it would be better to use this method.
Args:
path: str: The path to the file to be updated. The full path of the filename should be provided.
insert_line: int: The line number below which the new content is to be added. This line number should be within the range of lines of the file: [0, Lines_of_the_File]. Specifically, when insert_line = 0, the added content starts from the top of the file.
new_str: str: The new content to be inserted. Note that this argument should be written in a new line starting with "new_str =", and the string content should not be quoted.
Example:
Assuming a call is shown as follows:
```
insert("test_filename.py", 5, new_str)
new_str = def test_add():
    assert add(1, 2) == 3
```
Then it will function as inserting the string "def test_add():\n    assert add(1, 2) == 3" below line 5 of the "test_filename.py" file.
stop() -> None:
Terminate the editing process.
Args:
None
NOTE:
Responses should be concise.
When exploring, you should attempt fewer things at a time: Include ONLY ONE <execute> per response, and use a SINGLE skill listed above within the <execute> environment. DO NOT use other python functions, as the environment does not support them.
You should first reason in the verbal form, then use a skill with <execute> and </execute>.
You should avoid apologies and thanks in the responses.
When you finish exploring and analyzing with the provided skills, please return at most 3 files with the full path only. Each full path should be placed in a single line, INSTEAD OF BROKEN WITH MULTIPLE LINES.
The returned files should be separated by new lines ordered by most to least important, wrapped with ``` and NOTHING ELSE.
An example for a full output:
```
full_path_to_file1.py
full_path_to_file2.py
```
Now let's start!
### GitHub Problem Description ###
{issue content}
### Repository Structure ###
{file structure}
###
Listing 5: A non-execution set of tools empowering the simulation of agentic interaction trajectories.
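As a toy illustration of how such non-execution tools can be backed by plain string operations, a keyword search in the spirit of the listing's `search_file` skill might look like the following. This is not the paper's implementation; it simply shows that localization-oriented tools need no code execution, only text access to the snapshot.

```python
def keyword_search(search_term: str, file_text: str) -> list:
    """Return (line_number, line) pairs whose line contains search_term.
    Line numbers are 1-based, matching typical editor conventions."""
    return [
        (i, line)
        for i, line in enumerate(file_text.splitlines(), start=1)
        if search_term in line
    ]
```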
After completing the initial localization stage, the agent is guided into the code-editing phase through a follow-up instruction: "Now let's move on to Stage 2 and edit the updates. Remember, you can still decide at any point whether a file actually requires modification." We retain partial rollouts from Stage 1, provided that the localization results include at least one correct file.
In Stage 2, we first simulate the agent's interaction by allowing it to open incorrectly localized files, and we artificially inject agentic reasoning patterns such as "I realize that I do not need to modify this file" after inspecting the file content. This procedure is designed to strengthen the self-reflection ability of the agent by exposing it to false-positive contexts regarding the issue to be solved.
Subsequently, we transcribe the ground-truth PR commit pack into trajectory form: each commit message is treated as the agent's reasoning step, and each code update is represented as the corresponding action, expressed through the `str_replace` or `insert` tools. These interactions are appended to the trajectory, followed by a terminating `stop` call. Due to storage constraints on repository snapshots, this trajectory simulation is applied to only a subset of PRs. Overall, this process contributes approximately 10B tokens of agentic interaction data to the mid-training recipe. A future direction for scaling this component of the data recipe is to leverage the idea of environment scaling (yang2025swesmith).
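Transcribing one ground-truth code update into a `str_replace` action in the Listing 5 call syntax can be sketched as follows. This is a simplified, hypothetical transcription (`hunk_to_str_replace` is an invented helper); real hunks additionally require context handling to make the old content unique.

```python
def hunk_to_str_replace(old_lines: list, new_lines: list, path: str) -> str:
    """Render a (old, new) line pair from a commit hunk as a str_replace
    action string in the tool-call format of Listing 5."""
    call = f'str_replace("{path}", old_str, new_str)\n'
    call += "old_str = " + "\n".join(old_lines) + "\n"
    call += "new_str = " + "\n".join(new_lines)
    return call

# Reproduces the shape of the str_replace example in Listing 5:
action = hunk_to_str_replace(["a"], ["b", "c"], "filename.py")
```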
Training. We perform mid-training with a standard next-token prediction objective, initialized from the Qwen2.5-72B-Base (qwen2025qwen25technicalreport) model. We upsample the synthetic part of the data by a factor of 4 during mid-training, inspired by the practice in grattafiori2024llama; qwen2025qwen25technicalreport; gu2025data. A global batch size of 256 with a maximum sequence length of 32K tokens is used, optimizing for the long-context capabilities necessary for real-world software engineering tasks. The learning rate is set to 2e-5, with a cosine decay schedule and a minimum learning rate of 2e-6. The warm-up phase covers approximately 3 billion tokens, followed by learning-rate decay until approximately 150 billion tokens are processed.
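Under the stated hyperparameters, the schedule can be sketched as below. The exact warm-up shape is an assumption (linear warm-up here); the peak/minimum learning rates and token budgets come from the paragraph above.

```python
import math

def lr_at(tokens_seen: float, peak_lr: float = 2e-5, min_lr: float = 2e-6,
          warmup_tokens: float = 3e9, total_tokens: float = 150e9) -> float:
    """Learning rate after tokens_seen tokens: linear warm-up over ~3B tokens
    to 2e-5, then cosine decay to 2e-6 by ~150B tokens."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    progress = min(progress, 1.0)  # clamp past the end of the decay horizon
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```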
## Appendix B Docker environments
Table 3: The sources of the docker environments used in the development of Kimi-Dev.
| Source | Link | # Instances |
| --- | --- | --- |
| SWE-Gym (pan2024training) | https://huggingface.co/datasets/SWE-Gym/SWE-Gym/ | 2,356 |
| SWE-bench-extra (badertdinov2024scaling) | https://huggingface.co/datasets/nebius/SWE-bench-extra/ | 3,846 |
| R2E-Gym-Lite (jain2025r2e) | https://huggingface.co/datasets/R2E-Gym/R2E-Gym-Lite | 3,671 |
Docker environment construction. To validate non-ground-truth patches generated by model rollouts and expand our dataset, we required executable Docker environments. We combined publicly available datasets with custom-configured Docker environments (see Table 3). Among them, SWE-Gym and R2E-Gym-Lite open-source their Docker images, which we can use directly. For datasets lacking Docker support (SWE-bench-extra), we implemented an automated configuration method:
1. Initialize a Docker environment with fixed dependencies.
1. Select Python version based on commit year.
1. Install dependencies via requirements.txt and `pip install -e .`.
1. Resolve `ModuleNotFoundError` issues during test execution.
1. Validate success if a FAIL_TO_PASS test transitions from failing (without gt_patch) to passing (with gt_patch).
Out of 6.38k SWE-bench-extra instances, 3,846 environments are successfully constructed and subsequently used for cold-start and RL training.
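The fail-to-pass criterion in step 5 reduces to a simple predicate over the two test outcomes (running the FAIL_TO_PASS test once without and once with the ground-truth patch; the function name is illustrative):

```python
def is_fail_to_pass(passes_without_patch: bool, passes_with_patch: bool) -> bool:
    """An environment is validated only when the FAIL_TO_PASS test fails
    without gt_patch and passes with it applied."""
    return (not passes_without_patch) and passes_with_patch

# Only the fail -> pass transition counts as a valid environment.
```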
## Appendix C More Details of RL training
### C.1 Prompt set selection
<details>
<summary>figs/sec3_rl_scaling/RL_bugfix_ablation_figure.png Details</summary>



### Visual Description
Line chart of Pass Rate (%) (y-axis, 34-46%) against RL Training Steps (x-axis, 0-500), comparing "Bugfixer RL" (red, circular markers) with the ablation "w/o Positive Reinforce" (blue, circular markers, starting around step 350). The full model climbs, with volatility, from ~34% to a peak of ~46% near steps 450-500, while the ablated variant fluctuates between ~40% and ~45% with no clear trend. The gap is widest at step 450, where the full model peaks at ~46% and the ablation dips to ~40.5%, suggesting that positive example reinforcement supports more robust late-stage learning.
</details>
Figure 8: Ablation of positive example reinforcement during 72B Bugfixer RL.
In the main text, we introduce the adaptive prompt selection method for RL training. Specifically, we construct an initial prompt set of 1,200 problems by selecting those with pass@16 $>$ 0 from SWE-Gym (pan2024training), SWE-bench-extra (badertdinov2025swerebenchautomatedpipelinetask), and R2E-Gym (jain2025r2e). Then, every 100 training steps, we expand the prompt set by adding 500 new problems. These additional problems are randomly sampled from the pool of problems on which the current model has pass@16 = 0, thereby progressively increasing the difficulty and forming a natural curriculum.
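A sketch of this adaptive expansion (function and variable names are illustrative; the paper's selection code is not released):

```python
import random

def expand_prompt_set(active: list, pool: list,
                      pass_at_16: dict, k: int = 500) -> list:
    """Every expansion round, append up to k problems that the current model
    cannot yet solve (pass@16 == 0) and that are not already in training."""
    unsolved = [p for p in pool
                if p not in active and pass_at_16.get(p, 0) == 0]
    return active + random.sample(unsolved, min(k, len(unsolved)))
```

Re-scoring pass@16 with the current policy before each expansion is what makes the curriculum adaptive: problems graduate into the training set exactly when they sit at the model's capability frontier.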
### C.2 RL experiment ablation
Figure 9 shows the performance of the Qwen2.5-14B model in RL experiments, where both the BugFixer and the TestWriter exhibit clear scaling law behavior.
<details>
<summary>figs/sec3_rl_scaling/quick_plot_M3_bf.png Details</summary>



### Visual Description
Dual-axis line chart over 200 RL training steps: Token Length (left axis, blue, square markers) rises from ~6,500 to ~8,400, and Pass Rate (right axis, red, circular markers) rises from ~24.5% to ~32.5%. The two curves are volatile but strongly correlated, moving largely in unison; the largest gains occur after step 150, with peaks near step 180 (~8,550 tokens, ~34%), and a brief divergence at step 100 where Pass Rate peaks while Token Length dips.
</details>
(a) 14B BugFixer
<details>
<summary>figs/sec3_rl_scaling/quick_plot_M3_tw.png Details</summary>



### Visual Description
Dual-axis line chart over 200 RL training steps: Token Length (left axis, blue, square markers) grows near-linearly from ~3,100 to ~5,600, while Reproduced Rate (right axis, red, circular markers) climbs from ~18% to ~26.5% with pronounced volatility (notable dips at steps 30, 50, 70, 120, and 150). Both metrics reach their maxima at step 200, indicating that longer responses accompany improved reproduction performance.
</details>
(b) 14B TestWriter
Figure 9: RL scaling experiments on Qwen2.5-14B model.
Furthermore, Figure 8 illustrates the effect of our proposed positive example reinforcement. We incorporated this improvement in the later stage of training and found that BugFixer RL achieved better performance than without this technique. The main reason is that positive examples reinforce correct reasoning patterns, enabling the model to converge more quickly, especially as its exploration ability diminishes in the later stages. Notably, a similar technique was also applied in the development of the Seed1.5-thinking model (seed2025seed1).
## Appendix D Agent Infrastructure
We present a unified sandbox architecture designed for hybrid cloud cluster environments, leveraging Kubernetes orchestration to provide researchers with a flexible, secure, and highly scalable code execution platform. The architecture is accessible through a Python SDK and implements an ephemeral, use-and-destroy paradigm that ensures resource efficiency and security isolation. The system demonstrates exceptional scalability, supporting over 10,000 concurrent sandbox instances while maintaining performance stability. This infrastructure serves as a foundational platform for diverse computational scenarios, including automated code evaluation and RL experiments.
The system employs a sidecar container pattern to efficiently orchestrate and manage a heterogeneous collection of over 25,000 Docker images sourced from multiple datasets. This architectural choice enables optimal resource utilization while maintaining isolation between different execution environments. Additionally, we have developed a proprietary Software Engineering (SWE) image-building pipeline that continuously expands the repository of supported images, thereby enhancing the system's adaptability to diverse computational workloads and research requirements.
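To make the ephemeral, use-and-destroy pattern concrete, the sketch below emulates the sandbox lifecycle locally with a context manager. The `Sandbox` class and its `run` method are hypothetical stand-ins; the actual SDK and its Kubernetes backend are proprietary, so this is only a minimal local analogue of the lifecycle guarantees.

```python
import shutil
import subprocess
import tempfile

class Sandbox:
    """Minimal local stand-in for an ephemeral sandbox instance (hypothetical API)."""

    def __init__(self, image="python:3.11"):
        self.image = image  # in production this would select one of the 25k+ images
        self.workdir = None

    def __enter__(self):
        # Provision an isolated working directory on entry.
        self.workdir = tempfile.mkdtemp(prefix="sbx-")
        return self

    def run(self, command):
        # A real implementation would exec inside an isolated container;
        # here we just run locally inside the sandbox workdir.
        return subprocess.run(command, cwd=self.workdir, shell=True,
                              capture_output=True, text=True)

    def __exit__(self, *exc):
        # Use-and-destroy: teardown is unconditional, even on error.
        shutil.rmtree(self.workdir, ignore_errors=True)

with Sandbox() as sbx:
    result = sbx.run("echo hello")
```

The context-manager shape is the point: resource reclamation is tied to scope exit, which is what makes tens of thousands of short-lived instances tractable.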
## Appendix E Case Studies under Agentless
### E.1 BugFixer Traces
We provide a BugFixer response trace from the code repair stage (see Listing 6). In it, the initially proposed code was `return self[i, j - pos - other.cols]`, but the model quickly recognized the mistake and corrected it to `return self[i, j - other.cols]`. Although some repetitive self-reflection followed, the increased response length did improve answer accuracy and robustness in our experiments.
$\lhd$ think $\rhd$
Okay, let's tackle this problem. The user reported an issue where the Matrix.col_insert() method isn't working correctly. The example given shows that after inserting a matrix V into M at position 3, the resulting matrix has the inserted columns shifted incorrectly. The expected result would have the inserted columns at positions 3 and 4, but the actual output shows them starting at position 3, but the original matrix's columns after the insert are shifted upwards instead of remaining in their original positions.
First, I need to understand how col_insert works. Looking at the code provided, especially in the MatrixShaping class in sympy/matrices/common.py, the _eval_col_insert method is responsible for inserting columns. The current implementation shifts the existing columns to the right by the number of columns inserted. However, the bug might be in how the original matrix's columns are being shifted.
Let me look at the existing code for _eval_col_insert:
    def _eval_col_insert(self, pos, other):
        cols = self.cols
        def entry(i, j):
            if j < pos:
                return self[i, j]
            elif pos <= j < pos + other.cols:
                return other[i, j - pos]
            return self[i, j - pos - other.cols]
        return self._new(self.rows, self.cols + other.cols,
                         lambda i, j: entry(i, j))
Wait, the problem here is in the last case. When j is greater than or equal to pos + other.cols, the code subtracts (pos + other.cols) from j. But that's incorrect because the original columns after the inserted ones should start at j = pos + other.cols. So for the original matrix's columns that come after the inserted ones, their new positions should be j + other.cols. But in the current code, when j is in the original matrix's columns (after the inserted ones), the code is subtracting (pos + other.cols), which would incorrectly shift them.
Wait, let's think again. The original code for the entry function:
if j < pos:
    return self[i, j]  # original columns before pos
elif pos <= j < pos + other.cols:
    return other[i, j - pos]  # inserted columns
return self[i, j - pos - other.cols]  # original columns after the inserted ones
Wait, no. Let's say the original matrix has cols = 6. We insert 2 columns at pos = 3. The new total cols is 6 + 2 = 8. The original columns after pos 3 (columns 3, 4, 5) should now be at positions 5, 6, 7. But according to the code, when j is >= 3 + 2 = 5 (since pos + other.cols = 3 + 2 = 5), then the code returns self[i, j - 5 - 2] = self[i, j - 7]. Wait, that's not right. For example, if j = 5 (the new column 5), then j - pos - other.cols = 5 - 3 - 2 = 0. So the original column 0 would be mapped to the new column 5. That's definitely wrong.
Ah, here's the mistake. The code is subtracting (pos + other.cols) from j, but that's incorrect. The original columns after the inserted ones should be shifted to the right by other.cols. So, the correct calculation should be j - other.cols. For example, original column 3 becomes 3 + 2 = 5, original column 4 becomes 6, etc.
So the line in the code:
return self[i, j - pos - other.cols]
should be:
return self[i, j - other.cols]
Because the inserted columns have taken up `other.cols` positions starting at `pos`, so the original columns after `pos` are shifted by `other.cols` to the right. Therefore, to get the correct original column index, we subtract `other.cols` from the current j.
So the fix is to change the subtraction from (pos + other.cols) to other.cols. Let me check:
Original code:
return self[i, j - pos - other.cols]
Corrected code:
return self[i, j - other.cols]
Wait, no. Wait, the original columns after the inserted block (which starts at pos and has other.cols columns) would start at j = pos + other.cols. So for those j values, the original column index would be j - (pos + other.cols). Wait, no. Let's think with numbers.
Original matrix has 6 columns. Insert 2 columns at position 3. The new matrix has 8 columns. The original columns 0, 1, 2 remain in positions 0, 1, 2. The inserted 2 columns are at positions 3 and 4. The original columns 3, 4, 5 now become 5, 6, 7.
So for j in 5, 6, 7 (original columns 3, 4, 5), the mapping is:
5 $\rightarrow$ 3
6 $\rightarrow$ 4
7 $\rightarrow$ 5
So the formula should be original_column = j - inserted_columns. In this case, inserted_columns is other.cols. So the code should subtract other.cols from j when accessing the original matrix's columns after the inserted block.
Therefore, the line:
return self[i, j - pos - other.cols]
should be:
return self[i, j - other.cols]
Because pos is where the insertion starts, but the original columns after the inserted block are shifted by other.cols. Let's verify with the example:
pos = 3, other.cols = 2. For j = 5 (new column 5), original column index is 5 - 2 = 3, which is correct. For j = 6, 6 - 2 = 4; j = 7, 7 - 2 = 5. That's correct.
So the error in the original code was subtracting (pos + other.cols) instead of other.cols. The pos is the starting position of the insertion, but after inserting other.cols columns, the original columns after the inserted block are shifted by other.cols. Therefore, the correct subtraction is other.cols.
Thus, the fix is to change the line in the _eval_col_insert method from:
return self[i, j - pos - other.cols]
to:
return self[i, j - other.cols]
Now, looking back at the code in the _eval_col_insert function:
The code for the else clause is:
return self[i, j - pos - other.cols]
Which is incorrect. Changing that to j - other.cols should fix the problem.
So the SEARCH block would look for the lines:
    def _eval_col_insert(self, pos, other):
        cols = self.cols
        def entry(i, j):
            if j < pos:
                return self[i, j]
            elif pos <= j < pos + other.cols:
                return other[i, j - pos]
            return self[i, j - pos - other.cols]
        return self._new(self.rows, self.cols + other.cols,
                         lambda i, j: entry(i, j))
And the REPLACE block would change the last part to:
elif pos <= j < pos + other.cols:
    return other[i, j - pos]
return self[i, j - other.cols]
So the correction is in the return statement of the entry function. The original code subtracted (pos + other.cols) from j, but it should subtract other.cols instead.
Therefore, the correct SEARCH/REPLACE block would be:
### sympy/matrices/common.py
<<<<<<< SEARCH
    def _eval_col_insert(self, pos, other):
        cols = self.cols
        def entry(i, j):
            if j < pos:
                return self[i, j]
            elif pos <= j < pos + other.cols:
                return other[i, j - pos]
            return self[i, j - pos - other.cols]
        return self._new(self.rows, self.cols + other.cols,
                         lambda i, j: entry(i, j))
=======
    def _eval_col_insert(self, pos, other):
        def entry(i, j):
            if j < pos:
                return self[i, j]
            elif pos <= j < pos + other.cols:
                return other[i, j - pos]
            return self[i, j - other.cols]
        return self._new(self.rows, self.cols + other.cols,
                         lambda i, j: entry(i, j))
>>>>>>> REPLACE
This change corrects the column-shifting calculation for the original matrix's columns after the inserted block. Instead of subtracting (pos + other.cols), we subtract other.cols, which properly shifts the original columns to the right by the number of inserted columns.
$\lhd$ /think $\rhd$
```python
### sympy/matrices/common.py
<<<<<<< SEARCH
    def _eval_col_insert(self, pos, other):
        cols = self.cols
        def entry(i, j):
            if j < pos:
                return self[i, j]
            elif pos <= j < pos + other.cols:
                return other[i, j - pos]
            return self[i, j - pos - other.cols]
        return self._new(self.rows, self.cols + other.cols,
                         lambda i, j: entry(i, j))
=======
    def _eval_col_insert(self, pos, other):
        def entry(i, j):
            if j < pos:
                return self[i, j]
            elif pos <= j < pos + other.cols:
                return other[i, j - pos]
            return self[i, j - other.cols]
        return self._new(self.rows, self.cols + other.cols,
                         lambda i, j: entry(i, j))
>>>>>>> REPLACE
```
Listing 6: Reasoning trajectory example for BugFixer.
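The indexing error analyzed in the trace can be reproduced in isolation. The sketch below is a minimal stand-alone mapping function, not SymPy's actual implementation: for a new column index `j`, it returns which source matrix and column the entry is read from, under both the buggy offset `(pos + other_cols)` and the corrected offset `other_cols`.

```python
def col_insert_mapping(j, pos, other_cols, buggy=False):
    """Return ('self', col) or ('other', col) for new column index j."""
    if j < pos:
        return ("self", j)                      # columns before the insertion point
    if pos <= j < pos + other_cols:
        return ("other", j - pos)               # the inserted columns
    offset = pos + other_cols if buggy else other_cols
    return ("self", j - offset)                 # columns after the inserted block

# Insert 2 columns at pos=3 into a 6-column matrix (new width 8).
# Original columns 3, 4, 5 should land at new positions 5, 6, 7.
print([col_insert_mapping(j, 3, 2) for j in (5, 6, 7)])
# -> [('self', 3), ('self', 4), ('self', 5)]   (corrected behavior)
print([col_insert_mapping(j, 3, 2, buggy=True) for j in (5, 6, 7)])
# -> [('self', 0), ('self', 1), ('self', 2)]   (buggy: trailing columns wrap to the front)
```

This reproduces exactly the symptom in the issue: with the buggy offset, the trailing columns of the original matrix are replaced by its leading columns.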
### E.2 TestWriter False Positive Cases
During the RL training process, we observed that the TestWriter component exhibits some false positives: the generated tests can contain coverage gaps, causing model-generated test patches to be mistakenly judged as correct. Consequently, the measured performance improvement of TestWriter does not accurately reflect its actual performance. In this section, we provide two intuitive examples to illustrate this phenomenon, and leave further enhancement of TestWriter as future work.
Listing 7 shows the ground-truth test patch for instance django__django-16877, with Listing 8 showing the test patch generated by the TestWriter. A detailed comparison reveals that the model-generated test cases are not comprehensive: two locations should have been tested, but only one was checked, and the effect of mark_safe was not considered. In practice, this allowed the model-predicted patch to pass the TestWriter reward check, resulting in a false positive.
The second example instance is sympy__sympy-13974, with Listings 9 and 10 showing the ground-truth and model-generated test patches. In this case, the coverage of the generated test is also insufficient. The generated test does not account for the need to satisfy the associativity property, e.g., `assert tensor_product_simp(TP(A, B)**x) == TP(A**x, B**x)`. As a consequence, it fails to serve as a reliable judge for BugFixer.
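Before the listings, the false-positive mechanism can be distilled into a toy sketch (deliberately unrelated to the actual Django or SymPy code): an under-specified generated test accepts a patch that handles only one of two required cases, while a ground-truth-style test with fuller coverage rejects it.

```python
def patch_incomplete(xs):
    # "Model patch": escapes & but ignores <, mirroring a missed case.
    return [x.replace("&", "&amp;") for x in xs]

def weak_test(fn):
    # Generated test: checks only the & case, so the gap goes unnoticed.
    return fn(["a&b"]) == ["a&amp;b"]

def strong_test(fn):
    # Ground-truth-style test: also checks the < case.
    return fn(["a&b", "<p>"]) == ["a&amp;b", "&lt;p&gt;"]

print(weak_test(patch_incomplete))    # True  -> reward granted (false positive)
print(strong_test(patch_incomplete))  # False -> the patch is actually wrong
```

When the weak test is used as the RL reward signal, the incomplete patch is reinforced, which is exactly the failure mode described above.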
diff --git a/tests/template_tests/filter_tests/test_escapeseq.py b/tests/template_tests/filter_tests/test_escapeseq.py
new file mode 100644
--- /dev/null
+++ b/tests/template_tests/filter_tests/test_escapeseq.py
@@ -0,0 +1,59 @@
+from django.test import SimpleTestCase
+from django.utils.safestring import mark_safe
+
+from ..utils import setup
+
+
+class EscapeseqTests(SimpleTestCase):
+    """
+    The "escapeseq" filter works the same whether autoescape is on or off,
+    and has no effect on strings already marked as safe.
+    """
+
+    @setup(
+        {
+            "escapeseq_basic": (
+                '{{ a|escapeseq|join:", " }} -- {{ b|escapeseq|join:", " }}'
+            ),
+        }
+    )
+    def test_basic(self):
+        output = self.engine.render_to_string(
+            "escapeseq_basic",
+            {"a": ["x&y", "<p>"], "b": [mark_safe("x&y"), mark_safe("<p>")]},
+        )
+        self.assertEqual(output, "x&amp;y, &lt;p&gt; -- x&y, <p>")
+
+    @setup(
+        {
+            "escapeseq_autoescape_off": (
+                '{% autoescape off %}{{ a|escapeseq|join:", " }}'
+                " -- "
+                '{{ b|escapeseq|join:", "}}{% endautoescape %}'
+            )
+        }
+    )
+    def test_autoescape_off(self):
+        output = self.engine.render_to_string(
+            "escapeseq_autoescape_off",
+            {"a": ["x&y", "<p>"], "b": [mark_safe("x&y"), mark_safe("<p>")]},
+        )
+        self.assertEqual(output, "x&amp;y, &lt;p&gt; -- x&y, <p>")
+
+    @setup({"escapeseq_join": '{{ a|escapeseq|join:"<br/>" }}'})
+    def test_chain_join(self):
+        output = self.engine.render_to_string("escapeseq_join", {"a": ["x&y", "<p>"]})
+        self.assertEqual(output, "x&amp;y<br/>&lt;p&gt;")
+
+    @setup(
+        {
+            "escapeseq_join_autoescape_off": (
+                '{% autoescape off %}{{ a|escapeseq|join:"<br/>" }}{% endautoescape %}'
+            ),
+        }
+    )
+    def test_chain_join_autoescape_off(self):
+        output = self.engine.render_to_string(
+            "escapeseq_join_autoescape_off", {"a": ["x&y", "<p>"]}
+        )
+        self.assertEqual(output, "x&amp;y<br/>&lt;p&gt;")
Listing 7: GT test patch of django__django-16877.
diff --git a/tests/template_tests/filter_tests/test_safeseq.py b/tests/template_tests/filter_tests/test_safeseq.py
--- a/tests/template_tests/filter_tests/test_safeseq.py
+++ b/tests/template_tests/filter_tests/test_safeseq.py
@@ -20,3 +20,20 @@
     def test_safeseq02(self):
         output = self.engine.render_to_string("safeseq02", {"a": ["&", "<"]})
         self.assertEqual(output, "&, < -- &, <")
+
+    # New tests for escapeseq
+    @setup({"escapeseq01": '{{ a|escapeseq|join:", " }}'})
+    def test_escapeseq01(self):
+        output = self.engine.render_to_string("escapeseq01", {"a": ["&", "<"]})
+        self.assertEqual(output, "&amp;, &lt;")
+
+    @setup(
+        {
+            "escapeseq02": (
+                '{% autoescape off %}{{ a|escapeseq|join:", " }}{% endautoescape %}'
+            )
+        }
+    )
+    def test_escapeseq02(self):
+        output = self.engine.render_to_string("escapeseq02", {"a": ["&", "<"]})
+        self.assertEqual(output, "&amp;, &lt;")
Listing 8: Model predicted test patch of django__django-16877.
diff --git a/sympy/physics/quantum/tests/test_tensorproduct.py b/sympy/physics/quantum/tests/test_tensorproduct.py
--- a/sympy/physics/quantum/tests/test_tensorproduct.py
+++ b/sympy/physics/quantum/tests/test_tensorproduct.py
@@ -10,7 +10,7 @@
 from sympy.physics.quantum.density import Density
 from sympy.core.trace import Tr
 
-A, B, C = symbols('A,B,C', commutative=False)
+A, B, C, D = symbols('A,B,C,D', commutative=False)
 x = symbols('x')
 
 mat1 = Matrix([[1, 2*I], [1 + I, 3]])
@@ -47,6 +47,11 @@ def test_tensor_product_commutator():
 
 def test_tensor_product_simp():
     assert tensor_product_simp(TP(A, B)*TP(B, C)) == TP(A*B, B*C)
+    # tests for Pow-expressions
+    assert tensor_product_simp(TP(A, B)**x) == TP(A**x, B**x)
+    assert tensor_product_simp(x*TP(A, B)**2) == x*TP(A**2, B**2)
+    assert tensor_product_simp(x*(TP(A, B)**2)*TP(C, D)) == x*TP(A**2*C, B**2*D)
+    assert tensor_product_simp(TP(A, B)-TP(C, D)**x) == TP(A, B)-TP(C**x, D**x)
 
 def test_issue_5923():
Listing 9: GT test patch for sympy__sympy-13974.
diff --git a/sympy/physics/quantum/tests/test_tensorproduct.py b/sympy/physics/quantum/tests/test_tensorproduct.py
--- a/sympy/physics/quantum/tests/test_tensorproduct.py
+++ b/sympy/physics/quantum/tests/test_tensorproduct.py
@@ -47,6 +47,27 @@
 def test_tensor_product_simp():
     assert tensor_product_simp(TP(A, B)*TP(B, C)) == TP(A*B, B*C)
+
+    # Test tensor product powers
+    a = symbols('a', commutative=False)
+    t1 = TP(1, 1)*TP(1, 1)
+    t1_simp = tensor_product_simp(t1)
+    assert t1_simp == TP(1, 1), f"Expected 1x1, got {t1_simp}"
+
+    # Test with Pauli matrices
+    from sympy.physics.paulialgebra import Pauli
+    t2 = TP(1, Pauli(3))*TP(1, Pauli(3))
+    t2_simp = tensor_product_simp(t2)
+    assert t2_simp == TP(1, 1), f"Expected 1x1, got {t2_simp}"
+
+    # Test substitution cases from issue
+    expr1 = TP(1, 1)*TP(1, a)
+    result1 = expr1.subs(a, 1)
+    assert tensor_product_simp(result1) == TP(1, 1), f"Substitution 1 failed: {result1}"
+
+    expr2 = TP(1, Pauli(3))*TP(1, a)
+    result2 = expr2.subs(a, Pauli(3))
+    assert tensor_product_simp(result2) == TP(1, 1), f"Substitution 2 failed: {result2}"
 
 def test_issue_5923():
Listing 10: Model predicted test patch of sympy__sympy-13974.
## Appendix F Emergent Parallel Scaling in Kimi-Dev
<details>
<summary>figs/sec3_sp_scaling/parellel_scaling_figure.png Details</summary>

### Visual Description
## Line Chart: Pass Rate vs. Number of Patches
### Overview
The image displays a line chart illustrating the relationship between the "Number of Patches" (x-axis) and the "Pass Rate (%)" (y-axis). The chart shows a single, positively sloping trend line, indicating that as the number of patches increases, the pass rate also increases. The data points are explicitly labeled with their exact percentage values.
### Components/Axes
* **Chart Type:** Line chart with square data point markers.
* **X-Axis (Horizontal):**
* **Label:** "Number of Patches"
* **Scale:** Non-linear, categorical. The marked tick values are: 1, 5, 10, 20, 40.
* **Y-Axis (Vertical):**
* **Label:** "Pass Rate (%)"
* **Scale:** Linear, ranging from 47 to 53. Major gridlines are at intervals of 1% (47, 48, 49, 50, 51, 52, 53).
* **Data Series:**
* A single series represented by a solid green line connecting square markers.
* **Legend:** Not present. The single series is self-evident.
* **Grid:** A light gray dashed grid is present for both major x and y axis ticks.
### Detailed Analysis
The chart plots five distinct data points. The trend is consistently upward, with the slope of the line gradually decreasing as the number of patches increases.
**Data Points (Number of Patches, Pass Rate %):**
1. (1, 48.0)
2. (5, 49.2)
3. (10, 50.2)
4. (20, 51.2)
5. (40, 51.6)
**Trend Verification:**
The green line slopes upward from left to right across all data points, confirming a positive correlation. The increase between points is as follows:
* From 1 to 5 patches: +1.2%
* From 5 to 10 patches: +1.0%
* From 10 to 20 patches: +1.0%
* From 20 to 40 patches: +0.4%
### Key Observations
1. **Positive Correlation:** There is a clear, monotonic increase in pass rate as the number of patches grows.
2. **Diminishing Returns:** The rate of improvement slows significantly at higher patch counts. The gain from 20 to 40 patches (+0.4%) is less than half the gain from 10 to 20 patches (+1.0%).
3. **Non-Linear X-Axis:** The x-axis uses a non-linear scale (1, 5, 10, 20, 40), which visually compresses the space between higher values. This emphasizes the performance gains at lower patch counts.
4. **Explicit Labeling:** Each data point is precisely labeled, removing ambiguity in reading the y-axis values.
### Interpretation
The data suggests that increasing the "Number of Patches" is an effective strategy for improving the "Pass Rate." This could relate to a technical process where applying more patches (e.g., code fixes, data samples, or model adjustments) leads to higher success or completion rates.
The most significant gains are achieved with the initial increases in patches (from 1 to 10). The curve begins to plateau after 20 patches, indicating a point of diminishing returns where adding more patches yields progressively smaller improvements. This pattern is common in optimization scenarios, suggesting that while more patches are beneficial, there is an optimal range (likely between 10 and 20 patches) where the cost/benefit ratio is most favorable. The chart does not provide context on what "patches" or "pass rate" specifically refer to, but the quantitative relationship is clearly demonstrated.
</details>
Figure 10: Emergent parallel scaling results on SWE-bench Verified.
Compared to self-play methods that leverage execution at test time, parallel scaling (chen2025parallel; pan2025learning) is a recently introduced test-time scaling method that aggregates multiple independent rollouts for additional improvement. While previous works incentivize the ability of parallel scaling through training, or train a separate reward model (pan2024training; jain2025r2e) for patch selection, we observe that the parallel scaling effect emerges in our model without any specific training. Specifically, patch candidates from parallel rollouts of BugFixer are used to prompt the model to summarize their differences and compose a new bugfix patch. Listing 11 is the prompt template for parallel scaling. In the template, `problem_statement` is the GitHub issue, and `trajs_content` represents the content of multiple patch candidates.
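A minimal sketch of how such an aggregation prompt can be assembled is given below. The abbreviated template and the helper function are illustrative assumptions, not our production code; the full template appears in Listing 11.

```python
# Abbreviated stand-in for the Listing 11 template ("..." elides the
# SEARCH/REPLACE formatting instructions).
TEMPLATE = """\
We are currently solving the following issue within our repository. ...
--- BEGIN ISSUE ---
{problem_statement}
--- END ISSUE ---
...
--- BEGIN FILE ---
{trajs_content}
--- END FILE ---
..."""

def build_aggregation_prompt(problem_statement, candidate_patches):
    # Concatenate the independent rollouts into trajs_content, labeled
    # "Answer 1:", "Answer 2:", ... as in the paper's Listing 12 example.
    trajs_content = "\n".join(
        f"Answer {i}:\n{patch}" for i, patch in enumerate(candidate_patches, 1)
    )
    return TEMPLATE.format(problem_statement=problem_statement,
                           trajs_content=trajs_content)

prompt = build_aggregation_prompt("col_insert shifts columns", ["patch A", "patch B"])
```

The resulting prompt is then answered in a single additional rollout, whose output is the aggregated patch.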
We are currently solving the following issue within our repository. Here is the issue text:
--- BEGIN ISSUE ---
{problem_statement}
--- END ISSUE ---
Below are some thinking trajectories, each from llm reasoning model. Only one trajetory is right.
--- BEGIN FILE ---
```
{trajs_content}
```
--- END FILE ---
Please first summary and analyze the key differences between the trajectories, and then generate *SEARCH/REPLACE* edits to fix the issue.
Every *SEARCH/REPLACE* edit must use this format:
1. The file path
2. The start of search block: <<<<<<< SEARCH
3. A contiguous chunk of lines to search for in the existing source code
4. The dividing line: =======
5. The lines to replace into the source code
6. The end of the replace block: >>>>>>> REPLACE
Here is an example:
```python
### mathweb/flask/app.py
<<<<<<< SEARCH
from flask import Flask
=======
import math
from flask import Flask
>>>>>>> REPLACE
```
Please note that the *SEARCH/REPLACE* edit REQUIRES PROPER INDENTATION. If you would like to add the line 'print(x)', you must fully write that out, with all those spaces before the code!
Wrap the *SEARCH/REPLACE* edit in blocks ```python...```.
The summary of the key differences between the trajectories should be in the thinking part.
Listing 11: The prompt template for parallel scaling.
The results in Figure 10 show that the performance of parallel aggregation improves as the number of patch candidates in the prompt increases. The advantage of this scaling paradigm over majority voting lies in its ability to leverage the model's own capacity to analyze multiple candidate patches, thereby surpassing the simplistic approach of weighting answers merely by their frequency of occurrence. Listing 12 gives a full prompt example with 14 different input patch candidates. As Listing 12 is long, readers may skip ahead to Listing 13 to directly observe the model behavior.
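For contrast, the majority-voting baseline mentioned above can be sketched in a few lines: it counts normalized patch strings and returns the most frequent one, using no model reasoning at all. The whitespace normalization is an illustrative choice, not part of our pipeline.

```python
from collections import Counter

def normalize(patch):
    # Crude normalization so pure whitespace differences don't split votes.
    return "\n".join(line.strip() for line in patch.strip().splitlines())

def majority_vote(patches):
    # Pick the most frequent normalized patch among the parallel rollouts.
    counts = Counter(normalize(p) for p in patches)
    return counts.most_common(1)[0][0]

patches = ["fix_a()", "fix_a()  ", "fix_b()"]
print(majority_vote(patches))  # "fix_a()" wins 2-1
```

Because this baseline can only repeat one of the candidates, it cannot, unlike prompt-based aggregation, compose a new patch that combines the strengths of several candidates.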
We are currently solving the following issue within our repository. Here is the issue text:
--- BEGIN ISSUE ---
Data <@filename isn't relative to the YAML file
The [docs say](https://gabbi.readthedocs.io/en/latest/format.html#data):
> If the value is a string that begins with <@ then the rest of the string is treated as the name of a file to be loaded from the same directory as the YAML file.
But I haven't found it works like this unless I cd into the directory containing the yaml file.
`_load_data_file` says:
```python
path = os.path.join(self.test_directory, os.path.basename(filename))
```
this does a few things:
- makes the path begin with `self.test_directory` (defaults to `.`, the current working directory)
- disguards any folders in the <@ `filename` path
- appends `filename`
This means, if I'm in `/`, and I have a test that says `data: <@cake.jpg`, I cannot run my tests as advised in the docs:
`gabbi-run -- /my/test.yaml /my/other.yaml`
`FileNotFoundError: [Errno 2] No such file or directory: './cake.jpg'`
So in our test running script, we have to start with `cd tests/`.
My preference, would be to make the path relative to the location of the yaml file. That way, I can run my tests from whatever directory.
Obviously this isn't possible when redirecting with `<` as the filename is never seen by gabbi. So I'm happy is discuss other ideas.
---
Furthermore I can't keep my test data in a child directory, unless I cd into that directory like:
```shell
cd yaml_tests/test_data
gabbi-run -- ../test_things.yaml
```
So for that reason, I'd like to allow directories like `<@test_data/filename.txt` to be included in the path.
---
Happy to write the patch, if we come up with an approach here.
--- END ISSUE ---
Below are some thinking trajectories, each from llm reasoning model.
--- BEGIN FILE ---
βββ
Answer 1:
```python
### gabbi/runner.py
<<<<<<< SEARCH
import argparse
from importlib import import_module
import sys
import unittest
=======
import argparse
from importlib import import_module
import os
import sys
import unittest
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
    result = ConciseTestRunner(
        verbosity=2, failfast=failfast).run(test_suite)
    return result.wasSuccessful()
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
    result = ConciseTestRunner(
        verbosity=2, failfast=failfast).run(test_suite)
    return result.wasSuccessful()
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    else:
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
    else:
        for input_file in input_files:
            test_directory = os.path.dirname(input_file) or '.'
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_directory)
>>>>>>> REPLACE
```
Answer 2:
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None, prefix=prefix,
        handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
                test_directory = os.path.dirname(input_file) or '.'
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast, test_directory=test_directory)
>>>>>>> REPLACE
```
Answer 3:
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
    result = ConciseTestRunner(
        verbosity=2, failfast=failfast).run(test_suite)
    return result.wasSuccessful()
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
    result = ConciseTestRunner(
        verbosity=2, failfast=failfast).run(test_suite)
    return result.wasSuccessful()
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
            if not failure:  # once failed, this is considered immutable
                failure = not success
            if failure and failfast:
                break
=======
        for input_file in input_files:
            test_directory = os.path.dirname(input_file)
            with open(input_file, 'r') as fh:
                success = run_suite(
                    fh, handler_objects, host, port, prefix, force_ssl,
                    failfast, test_directory=test_directory)
            if not failure:  # once failed, this is considered immutable
                failure = not success
            if failure and failfast:
                break
>>>>>>> REPLACE
```
Answer 4:
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
        for input_file in input_files:
            # Determine test directory based on input file's location
            test_directory = os.path.dirname(os.path.abspath(input_file)) or '.'
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_directory)
>>>>>>> REPLACE
```
Answer 5:
```python
### gabbi/runner.py
<<<<<<< SEARCH
import sys
import unittest
=======
import os
import sys
import unittest
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
        for input_file in input_files:
            test_directory = os.path.dirname(input_file) or '.'
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_directory)
>>>>>>> REPLACE
```
Answer 6:
```python
### gabbi/driver.py
<<<<<<< SEARCH
        file_suite = suitemaker.test_suite_from_dict(
            loader, test_base_name, suite_dict, path, host, port,
            fixture_module, intercept, prefix=prefix,
            test_loader_name=test_loader_name, handlers=handler_objects,
            inner_fixtures=inner_fixtures)
=======
        test_file_dir = os.path.dirname(test_file)
        file_suite = suitemaker.test_suite_from_dict(
            loader, test_base_name, suite_dict, test_file_dir, host, port,
            fixture_module, intercept, prefix=prefix,
            test_loader_name=test_loader_name, handlers=handler_objects,
            inner_fixtures=inner_fixtures)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
                test_directory = os.path.dirname(input_file)
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast, test_directory)
>>>>>>> REPLACE
```
Answer 7:
```python
### gabbi/runner.py
<<<<<<< SEARCH
import argparse
from importlib import import_module
import sys
import unittest
=======
import argparse
from importlib import import_module
import os
import sys
import unittest
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
                if not failure:  # once failed, this is considered immutable
                    failure = not success
                if failure and failfast:
                    break
=======
            test_dir = os.path.dirname(os.path.abspath(input_file))
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast, test_directory=test_dir)
                if not failure:  # once failed, this is considered immutable
                    failure = not success
                if failure and failfast:
                    break
>>>>>>> REPLACE
```
Answer 8:
```python
### gabbi/runner.py
<<<<<<< SEARCH
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    # Determine test directory based on input handle
    if hasattr(handle, 'name'):
        test_directory = os.path.dirname(os.path.abspath(handle.name))
    else:
        test_directory = '.'
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
Answer 9:
```python
### gabbi/driver.py
<<<<<<< SEARCH
    for test_file in glob.iglob('%s/*.yaml' % path):
        if '_' in os.path.basename(test_file):
            warnings.warn(exception.GabbiSyntaxWarning(
                "'_' in test filename %s. This can break suite grouping."
                % test_file))
        if intercept:
            host = str(uuid.uuid4())
        suite_dict = utils.load_yaml(yaml_file=test_file)
        test_base_name = os.path.splitext(os.path.basename(test_file))[0]
        if all_test_base_name:
            test_base_name = '%s_%s' % (all_test_base_name, test_base_name)
        if require_ssl:
            if 'defaults' in suite_dict:
                suite_dict['defaults']['ssl'] = True
            else:
                suite_dict['defaults'] = {'ssl': True}
        file_suite = suitemaker.test_suite_from_dict(
            loader, test_base_name, suite_dict, path, host, port,
            fixture_module, intercept, prefix=prefix,
            test_loader_name=test_loader_name, handlers=handler_objects,
            inner_fixtures=inner_fixtures)
=======
    for test_file in glob.iglob('%s/*.yaml' % path):
        if '_' in os.path.basename(test_file):
            warnings.warn(exception.GabbiSyntaxWarning(
                "'_' in test filename %s. This can break suite grouping."
                % test_file))
        if intercept:
            host = str(uuid.uuid4())
        suite_dict = utils.load_yaml(yaml_file=test_file)
        test_base_name = os.path.splitext(os.path.basename(test_file))[0]
        if all_test_base_name:
            test_base_name = '%s_%s' % (all_test_base_name, test_base_name)
        if require_ssl:
            if 'defaults' in suite_dict:
                suite_dict['defaults']['ssl'] = True
            else:
                suite_dict['defaults'] = {'ssl': True}
        # Use absolute path of the YAML file's directory
        test_file_dir = os.path.dirname(os.path.abspath(test_file))
        file_suite = suitemaker.test_suite_from_dict(
            loader, test_base_name, suite_dict, test_file_dir, host, port,
            fixture_module, intercept, prefix=prefix,
            test_loader_name=test_loader_name, handlers=handler_objects,
            inner_fixtures=inner_fixtures)
>>>>>>> REPLACE
```
Answer 10:
```python
### gabbi/runner.py
<<<<<<< SEARCH
import argparse
from importlib import import_module
import sys
import unittest
=======
import argparse
from importlib import import_module
import os
import sys
import unittest
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        force_ssl=force_ssl, failfast=failfast,
        handlers=handler_objects)
=======
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, force_ssl=force_ssl, failfast=failfast,
        handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
                test_dir = os.path.dirname(input_file)
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_dir)
>>>>>>> REPLACE
```
```python
### gabbi/case.py
<<<<<<< SEARCH
        path = os.path.join(self.test_directory, os.path.basename(filename))
=======
        path = os.path.join(self.test_directory, filename)
>>>>>>> REPLACE
```
Answer 11:
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port,
        None, None, prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
                test_dir = os.path.dirname(input_file)
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_dir)
>>>>>>> REPLACE
```
Answer 12:
```python
### gabbi/runner.py
<<<<<<< SEARCH
import argparse
from importlib import import_module
import sys
import unittest
=======
import argparse
from importlib import import_module
import os
import sys
import unittest
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    else:
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
    else:
        for input_file in input_files:
            test_dir = os.path.dirname(os.path.abspath(input_file))
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_dir)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
Answer 13:
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
        for input_file in input_files:
            test_dir = os.path.dirname(os.path.abspath(input_file))
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_dir)
>>>>>>> REPLACE
```
Answer 14:
```python
### gabbi/runner.py
<<<<<<< SEARCH
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(input_file, fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_directory = '.'
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
def run_suite(input_file, handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
    """Run the tests from the YAML in handle."""
    data = utils.load_yaml(handle)
    if force_ssl:
        if 'defaults' in data:
            data['defaults']['ssl'] = True
        else:
            data['defaults'] = {'ssl': True}
    loader = unittest.defaultTestLoader
    test_directory = os.path.dirname(input_file) if input_file else '.'
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None, prefix=prefix,
        handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    if not input_files:
        success = run_suite(sys.stdin, handler_objects, host, port,
                            prefix, force_ssl, failfast)
=======
    if not input_files:
        success = run_suite(None, sys.stdin, handler_objects, host, port,
                            prefix, force_ssl, failfast)
>>>>>>> REPLACE
```
```
--- END FILE ---
Please first summary and analyze the key differences between the trajectories, and then generate *SEARCH/REPLACE* edits to fix the issue.
Every *SEARCH/REPLACE* edit must use this format:
1. The file path
2. The start of search block: <<<<<<< SEARCH
3. A contiguous chunk of lines to search for in the existing source code
4. The dividing line: =======
5. The lines to replace into the source code
6. The end of the replace block: >>>>>>> REPLACE
Here is an example:
```python
### mathweb/flask/app.py
<<<<<<< SEARCH
from flask import Flask
=======
import math
from flask import Flask
>>>>>>> REPLACE
```
Please note that the *SEARCH/REPLACE* edit REQUIRES PROPER INDENTATION. If you would like to add the line "        print(x)", you must fully write that out, with all those spaces before the code!
Wrap the *SEARCH/REPLACE* edit in blocks ```python...```.
The summary of the key differences between the trajectories should be in the thinking part.
"
Listing 12: Example of emergent parallel scaling (cdent__gabbi-191 in SWE-bench-extra), the prompt input part.
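The SEARCH/REPLACE format specified in the prompt above can be applied mechanically. The following is a minimal illustrative sketch of such an applier (not part of the Kimi-Dev harness; the function name and error handling are our own): an edit is accepted only when the search chunk occurs verbatim in the file, indentation included.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a single SEARCH/REPLACE edit to a file's contents.

    The search chunk must occur verbatim (including indentation);
    only the first occurrence is replaced, mirroring the prompt's
    requirement that the chunk locate the edit site unambiguously.
    """
    if search not in source:
        raise ValueError("SEARCH block not found; edit rejected")
    return source.replace(search, replace, 1)


# The first example edit from the listing: inserting an import.
original = "import sys\nimport unittest\n"
patched = apply_search_replace(
    original,
    "import sys\n",
    "import os\nimport sys\n",
)
print(patched)  # the module now imports os before sys
```

A rejected (non-matching) SEARCH block simply raises, which is why the prompt stresses proper indentation: a chunk copied with the wrong leading whitespace never matches.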
Listing 13 shows how the model successfully recognizes the correct answer candidate in Listing 12, and generates a novel patch with additional reasoning to solve the issue:
<think>
Okay, let's tackle this problem. The user is having an issue where when they specify a data file in their YAML test using the <@ syntax, the path isn't resolved relative to the YAML file's location. Instead, it's using the current working directory. They want the path to be relative to the YAML file so they can run tests from any directory.
First, I need to understand how the current code is handling the data file paths. The user mentioned that in _load_data_file, the code uses os.path.join(self.test_directory, os.path.basename(filename)). This takes the test_directory (which defaults to '.'), discards any directories in the filename, and appends the basename. That's why if the YAML is in /my/tests/test.yaml and the data is <@data.txt, it looks for ./data.txt instead of /my/tests/data.txt.
The key here is to make sure that the test_directory is set to the directory where the YAML file is located. Then, when joining with the filename (without basename), it would correctly resolve the relative path.
Looking at the provided code snippets, the main changes are in runner.py where the test_directory is determined. The user's proposed solutions involve modifying the run_suite function to accept a test_directory parameter, which is derived from the input file's directory.
In the original code, when running a suite from an input file, the test_directory is hardcoded to '.'. The various answers modify this by passing the directory of the input file as test_directory. For example, in Answer 10, they add the test_directory parameter to run_suite and set it to os.path.dirname(input_file). Then, in case.py, they change the path joining to use the filename directly instead of basename.
Wait, the critical part is in case.py's _load_data_file method. The original code uses os.path.basename(filename), which strips any directories from the filename. So if the YAML has <@test_data/file.txt, the basename is file.txt, and it's joined with test_directory. But if test_directory is the YAML's directory, then it should be YAML_DIR/test_data/file.txt. However, the current code would do test_directory + file.txt, which would be YAML_DIR/file.txt. That's incorrect. So the problem is that the code is taking the basename, thus ignoring any subdirectories in the filename.
Ah, so the fix here is to not take the basename. Instead, the filename after <@ should be treated as a relative path from the YAML's directory. So in case.py's _load_data_file, instead of using os.path.basename(filename), we should use the filename as-is. Then, os.path.join(self.test_directory, filename) would correctly resolve the path relative to the YAML's directory.
But wait, the filename could be an absolute path. For example, if the YAML has <@/absolute/path/file.txt, then os.path.join would just use the absolute path. However, the documentation says that the filename is relative to the YAML file. So absolute paths might not be intended. But the code should handle that.
So combining two changes: in runner.py, set test_directory to the YAML file's directory, and in case.py, stop using basename and use the full filename.
Looking at Answer 10, there's a change in case.py from os.path.basename(filename) to filename. That's exactly the fix needed. Then, in runner.py, when running the suite, pass the test_directory as the directory of the input file.
So the key files to modify are:
1. gabbi/runner.py: When calling run_suite, pass test_directory as the directory of the input file. This involves modifying the run_suite function to accept test_directory as a parameter and then passing it when the input file is processed.
2. gabbi/case.py: Change the path construction from using os.path.basename(filename) to using the filename directly. This allows subdirectories in the filename to be included in the path.
Looking through the provided code snippets, the one that does both is Answer 10. Let me check:
In Answer 10, the runner.py changes add the test_directory parameter and set it to os.path.dirname(input_file). Then, in case.py, the path is changed to os.path.join(self.test_directory, filename). That would correctly resolve the path relative to the YAML's directory, including any subdirectories in the filename.
But wait, the original code in case.py uses os.path.basename(filename). So changing that to filename means that if the input is 'test_data/file.txt', it's joined as test_directory/test_data/file.txt. If test_directory is the YAML's directory, that's correct.
... [intermediate reasoning omitted for display brevity] ...
But perhaps better to use os.path.abspath and then dirname to handle relative paths correctly. But maybe the existing code already works. For example, if the input_file is 'tests/test.yaml', then os.path.dirname would return 'tests', which when joined with os.getcwd() would be correct. Wait, no. os.path.dirname just gives the directory part of the path. So if input_file is a relative path, like 'tests/test.yaml', then test_directory would be 'tests', and when os.path.join is called, it's relative to the current working directory. But the code in suitemaker.test_suite_from_dict might expect an absolute path. However, the original code uses '.' as the test_directory, which is relative. So perhaps using the directory of the input file as test_directory is sufficient.
Another thing to consider: when the YAML is loaded from stdin (like via redirect), the test_directory remains '.'. That's acceptable, as the documentation mentions that when using <, the filename isn't seen by Gabbi, so it's not possible to resolve the path relative to the YAML's original location. So in that case, the test_directory defaults to '.'.
Now, in case.py:
Original line:
path = os.path.join(self.test_directory, os.path.basename(filename))
Modified line:
path = os.path.join(self.test_directory, filename)
So the SEARCH/REPLACE block for case.py:
### gabbi/case.py
<<<<<<< SEARCH
        path = os.path.join(self.test_directory, os.path.basename(filename))
=======
        path = os.path.join(self.test_directory, filename)
>>>>>>> REPLACE
Putting all these together, the final edits are:
1. Add import os in runner.py.
2. Modify run_suite to accept test_directory and pass it to test_suite_from_dict.
3. In the loop over input_files, compute test_directory as the directory of input_file and pass it to run_suite.
4. Modify case.py's path construction.
So the final answer would include these four SEARCH/REPLACE blocks.
</think>
```python
### gabbi/runner.py
<<<<<<< SEARCH
import argparse
from importlib import import_module
import sys
import unittest
=======
import argparse
from importlib import import_module
import os
import sys
import unittest
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False):
=======
def run_suite(handle, handler_objects, host, port, prefix, force_ssl=False,
              failfast=False, test_directory='.'):
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, '.', host, port, None, None, prefix=prefix,
        handlers=handler_objects)
=======
    test_suite = suitemaker.test_suite_from_dict(
        loader, 'input', data, test_directory, host, port, None, None,
        prefix=prefix, handlers=handler_objects)
>>>>>>> REPLACE
```
```python
### gabbi/runner.py
<<<<<<< SEARCH
    else:
        for input_file in input_files:
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast)
=======
    else:
        for input_file in input_files:
            test_directory = os.path.dirname(input_file) or '.'
            with open(input_file, 'r') as fh:
                success = run_suite(fh, handler_objects, host, port,
                                    prefix, force_ssl, failfast,
                                    test_directory=test_directory)
>>>>>>> REPLACE
```
```python
### gabbi/case.py
<<<<<<< SEARCH
        path = os.path.join(self.test_directory, os.path.basename(filename))
=======
        path = os.path.join(self.test_directory, filename)
>>>>>>> REPLACE
```
These changes:
1. Add the 'os' import needed for path handling
2. Modify run_suite to accept a test_directory parameter
3. Use the input file's directory as the test_directory when processing files
4. Update data file path resolution to preserve directory structure from YAML relative paths
5. Remove basename handling that was stripping directory information from data file paths
The net effect is that data files specified with '<@filename' in YAML tests will now be resolved relative to their YAML file's directory, supporting both flat and nested directory structures for test data. When tests are run from stdin (using '<'), the default directory remains the current working directory as before.
Listing 13: Example of emergent parallel scaling (cdent__gabbi-191 in SWE-bench-extra), the model response part.
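The core of the patch in Listing 13 is the change in `_load_data_file`: joining the full relative filename, rather than its basename, onto the YAML file's directory. A small standalone sketch of the two behaviors (the helper names and example paths are ours, for illustration only; output comments assume a POSIX path separator):

```python
import os


def resolve_old(test_directory: str, filename: str) -> str:
    # Pre-fix behavior: basename() discards any subdirectories in filename.
    return os.path.join(test_directory, os.path.basename(filename))


def resolve_new(test_directory: str, filename: str) -> str:
    # Post-fix behavior: keep the relative path intact.
    return os.path.join(test_directory, filename)


yaml_dir = "/my/tests"            # directory containing the YAML test file
data_ref = "test_data/file.txt"   # as written after "<@" in the YAML

print(resolve_old(yaml_dir, data_ref))  # /my/tests/file.txt (wrong location)
print(resolve_new(yaml_dir, data_ref))  # /my/tests/test_data/file.txt
```

With a flat data layout the two agree, which is why the bug only surfaces once data files live in subdirectories next to the YAML.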
## Appendix G Analysis for SWE-Agents
### G.1 Stage Annotation for SWE-Agent Trajectories
In this section, we describe how we use a frontier LLM to annotate the SWE-Agent stage to which each interaction turn within a trajectory rollout belongs. While we briefly introduced the five stages suggested in the SWE-Agent prompt in Section 4, we attach the excerpt in Listing 14 for greater clarity:
```
...
Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
3. Edit the source code of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well
...
```
Listing 14: The excerpt of the five-stage declaration in the SWE-Agent prompt.
It should be noted that the agent can transition flexibly across the five stages during its working process. For example, after Stage 4, when the agent reruns the test script, error messages may remain, at which point the agent goes back to Stage 3 to refine its code repair with reflection. Similar backtracking can be observed from Stage 5 to Stage 3, where the initial code repair proves correct under the initial test script the agent composes in Stage 2, but fails an edge case the agent proposes in Stage 5.
To further analyze the BugFixer and reflection skill priors, we need to identify the stage to which each turn along the SWE-Agent trajectory belongs. As no strict boundaries or special prompt markers separate consecutive stages, we leverage an LLM for annotation. The annotation system prompt we use with kimi-k2-0711-preview is shown in Listing 15:
```
You are a professional inspector that can analyze the provided agentic interaction trajectory.
The trajectory you are going to analyze is made by an agent that interacts with a computer to solve tasks. This agent has access to the following functions:

---- BEGIN FUNCTION #1: bash ----
Description: Execute a bash command in the terminal.
Parameters:
(1) command (string, required): The bash command to execute. Can be empty to view additional logs when previous exit code is `-1`. Can be `ctrl+c` to interrupt the currently running process.
---- END FUNCTION #1 ----

---- BEGIN FUNCTION #2: submit ----
Description: Finish the interaction when the task is complete OR if the assistant cannot proceed further with the task.
No parameters are required for this function.
---- END FUNCTION #2 ----

---- BEGIN FUNCTION #3: str_replace_editor ----
Description: Custom editing tool for viewing, creating and editing files
* State is persistent across command calls and discussions with the user
* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep
* The `create` command cannot be used if the specified `path` already exists as a file
* If a `command` generates a long output, it will be truncated and marked with `<response clipped>`
* The `undo_edit` command will revert the last edit made to the file at `path`
Notes for using the `str_replace` command:
* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!
* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique
* The `new_str` parameter should contain the edited lines that should replace the `old_str`
Parameters:
(1) command (string, required): The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.
Allowed values: [`view`, `create`, `str_replace`, `insert`, `undo_edit`]
(2) path (string, required): Absolute path to file or directory, e.g. `/repo/file.py` or `/repo`.
(3) file_text (string, optional): Required parameter of `create` command, with the content of the file to be created.
(4) old_str (string, optional): Required parameter of `str_replace` command containing the string in `path` to replace.
(5) new_str (string, optional): Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.
(6) insert_line (integer, optional): Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.
(7) view_range (array, optional): Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.
---- END FUNCTION #3 ----

The agent was instructed with the following:
* A python code repository has been uploaded in the directory /testbed.
* Implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met.
* All changes to any of the test files described in the <pr_description> have already been taken care of. This means no need to modify the testing logic or any of the tests in any way.
* Make the minimal changes to non-tests files in the /testbed directory to ensure the <pr_description> is satisfied.
The agent was suggested to follow the following steps to resolve the issue:
1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
3. Edit the source code of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well
The agent was encouraged to think thoroughly, and it's fine if it's very long.
You are going to inspect this agent's interaction trajectory with a computer to solve the given task in the <pr_description>. One turn of interaction contains a pair of OBSERVATION and ACTION, where the OBSERVATION comes from the computer, and the ACTION is taken by the agent.
For each turn of interaction, determine which step (of the aforementioned five) this turn belongs to. Output a single number (1~5) ONLY in a separate line as your classification (DO NOT OUTPUT ANY OTHER WORDS THAN THE DIGIT).
You can think before make the inspection. When thinking, wrap your thought with <think> and </think>. Don't forget to output your final inspection after thinking.
```
Listing 15: The annotation prompt for SWE-Agent stages.
To provide a clearer understanding of the trajectory, we incorporate most of the tool descriptions and instructions from the SWE-Agent system prompt into the annotation system prompt. The annotation is conducted in a multi-round manner, leveraging the agent's previous actions and observations, as well as the stage classifications of earlier turns, to better exploit contextual information. At the $i$-th round of annotation, the observation-action pair from turn $i$ of the SWE-Agent trajectory is appended as input, and the annotator is expected to output the corresponding stage classification.
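The multi-round protocol described above can be sketched as follows. The `chat_fn` callable is a hypothetical stand-in for the actual kimi-k2-0711-preview API client, and the reply parsing assumes the annotator ends its response with a lone stage digit, as the prompt instructs:

```python
def classify_trajectory(turns, system_prompt, chat_fn):
    """Annotate each (observation, action) turn with a stage label in 1-5.

    turns:   list of (observation, action) string pairs
    chat_fn: callable(messages) -> reply string; hypothetical stand-in
             for the LLM API client.
    Earlier turns and their stage labels stay in the running message
    list, so round i can exploit the context of rounds 1..i-1.
    """
    messages = [{"role": "system", "content": system_prompt}]
    stages = []
    for observation, action in turns:
        messages.append({
            "role": "user",
            "content": f"OBSERVATION:\n{observation}\n\nACTION:\n{action}",
        })
        reply = chat_fn(messages)
        # The prompt asks for a single digit 1-5 on its own line;
        # take the last standalone digit token in the reply.
        digits = [tok for tok in reply.split() if tok in {"1", "2", "3", "4", "5"}]
        stage = int(digits[-1]) if digits else 0  # 0 marks an unparseable reply
        stages.append(stage)
        # Feed the classification back as context for the next round.
        messages.append({"role": "assistant", "content": str(stage)})
    return stages
```

The feedback of prior classifications into the context mirrors the paper's choice to annotate sequentially rather than classify each turn in isolation.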
### G.2 Comparative Study
Based on the automatic stage annotation described above, we present a comparative study by inspecting performance on sympy__sympy-20590 for Kimi-Dev under Agentless, and for each of the Base, MT, SFT, and RL priors with SWE-Agent adaptation.
The problem statement of sympy__sympy-20590 is listed in Listing 16:
````
Symbol instances have __dict__ since 1.7?
In version 1.6.2 Symbol instances had no `__dict__` attribute

```python
>>> sympy.Symbol('s').__dict__
------------------------------------------------------------------------
AttributeError                       Traceback (most recent call last)
<ipython-input-3-e2060d5eec73> in <module>
----> 1 sympy.Symbol('s').__dict__
AttributeError: 'Symbol' object has no attribute '__dict__'
>>> sympy.Symbol('s').__slots__
('name',)
```

This changes in 1.7 where `sympy.Symbol('s').__dict__` now exists (and returns an empty dict)
I may misinterpret this, but given the purpose of `__slots__`, I assume this is a bug, introduced because some parent class accidentally stopped defining `__slots__`.
````
Listing 16: The problem statement of sympy__sympy-20590.
It is observed that the main difficulty in resolving the issue lies in identifying the "some parent class" referenced in the problem. In fact, the hints text of this problem, which reflects the discussion of the developers under the original issue, reveals a much more in-depth investigation into the issue (Listing 17):
```
It seems that Basic now inherits `DefaultPrinting` which I guess doesn't have slots. I'm not sure if it's a good idea to add `__slots__` to that class as it would then affect all subclasses.
...
Using slots can break multiple inheritance but only if the slots are non-empty I guess. Maybe this means that any mixin should always declare empty slots or it won't work properly with subclasses that have slots...
I see that `EvalfMixin` has `__slots__ = ()`.
I guess we should add empty slots to DefaultPrinting then.
```
Listing 17: The excerpted hints text of sympy__sympy-20590.
According to the discussion, it is clear that the code repair would be to "add empty slots to DefaultPrinting", which naturally leads to the navigation towards the file related to the implementation of the printer (sympy/core/_print_helpers.py, which is also the file updated by the ground-truth patch). However, the hints_text information in the test set is not allowed to be used in the problem-solving process, which challenges the reasoner or the agent to figure out "the parent class that stopped defining `__slots__`" autonomously.
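The underlying Python behavior can be reproduced in a few lines, independent of SymPy: if any class in the MRO omits `__slots__`, instances regain a `__dict__`, and declaring empty slots on the mixin restores the expected behavior. The class names below are illustrative, not from the SymPy codebase.

```python
class LeakyMixin:
    """A mixin that forgot to declare __slots__ (like sympy's Printable)."""
    pass

class TidyMixin:
    """The fixed variant: empty slots keep subclasses dict-free."""
    __slots__ = ()

class Broken(LeakyMixin):
    __slots__ = ('name',)

class Fixed(TidyMixin):
    __slots__ = ('name',)

# One slot-less ancestor is enough to give instances a __dict__ again.
print(hasattr(Broken(), '__dict__'))  # True
print(hasattr(Fixed(), '__dict__'))   # False
```

This is exactly why the one-line patch of adding `__slots__ = ()` to a mixin is sufficient to remove the spurious `__dict__`.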
We first examine Kimi-Dev under Agentless. None of the 40 runs succeeded in producing the correct file localization. In most cases, the updates are made to sympy/core/symbol.py, which is a plausible choice since the reported problem is triggered by sympy.Symbol('s'), and symbol.py should contain the definition of the Symbol class. This line of reasoning, however, diverges from the correct solution. A small number of runs attempt alternative localizations, such as sympy/core/{basic.py, expr.py}, but these are also incorrect.
The limitation of the Agentless framework becomes more obvious when the focus is shifted to the inspection of the SWE-Agent trajectories. For all of the Base, MT, SFT, and RL priors with SWE-Agent adaptation, the resulting trajectories always contain a test-reproducing phase (marked as Stage "2" by the annotator), with a generated script that reads like:
```python
from sympy import Symbol

# Try to access __dict__ on a Symbol instance
s = Symbol('s')
print("Symbol.__slots__:", s.__slots__)
try:
    print("Symbol.__dict__:", s.__dict__)
except AttributeError as e:
    print("Error:", e)
```
Listing 18: An example reproducing script in Stage 2 within the SWE-Agent trajectories in sympy__sympy-20590.
It is trivial for an agent to generate a reproducing script of this form, as it directly transcribes the case described in the problem statement. Executing the script provides an unambiguous signal of whether the issue is resolved: the attribute s.__dict__ should no longer exist, and the expected Error: message should appear in the script output. Indeed, observing the initial completion of Stage 3 across all trajectories reveals that none of the agents succeeded on their first attempt. However, the incomplete status is consistently exposed in Stage 4, when the script is rerun and s.__dict__ still persists. This progression highlights the advantage of the SWE-Agent framework over the Agentless paradigm, as the agent is afforded additional opportunities for iterative refinement rather than being limited to a single attempt.
We continue to analyze the behavior of the agents adapted from the Base, MT, and SFT priors. Due to the difficulty of the issue, it is non-trivial to correctly localize the error through their agentic reasoning: the interactions show that the agents were switching among several file candidates, but failed to reach the correct one. The only exception comes from the agent adapted from the RL prior, which marks its unique success on this instance through accurate agentic reasoning. We attach the intermediate process from the initial Stage-3 moment to the Stage-3 revisit (refining the initial code repair) within the successful trajectory, tagged by the stage annotator:
```
...
OBSERVATION:
Symbol.__slots__: ('name',)
Symbol.__dict__: {}

ACTION:
Hmm, it seems like the issue is still there. Let's check if there are any other classes in the hierarchy that might be missing `__slots__`:
<function=bash>
<parameter=command>python -c "from sympy import Symbol; s = Symbol('s'); for cls in type(s).__mro__: print(cls.__name__, 'has __slots__:', hasattr(cls, '__slots__'))"</parameter>
</function>

OBSERVATION:
  File "<string>", line 1
    from sympy import Symbol; s = Symbol('s'); for cls in type(s).__mro__: print(cls.__name__, 'has __slots__:', hasattr(cls, '__slots__'))
                                               ^
SyntaxError: invalid syntax

ACTION:
Let's try a different approach:
<function=str_replace_editor>
<parameter=command>create</parameter>
<parameter=path>/testbed/check_slots.py</parameter>
<parameter=file_text>
from sympy import Symbol

s = Symbol('s')
for cls in type(s).__mro__:
    print(f"{cls.__name__:15} has __slots__: {hasattr(cls, '__slots__')}")
    if hasattr(cls, '__slots__'):
        print(f"{' '*16} __slots__ value: {cls.__slots__}")
</parameter>
</function>

OBSERVATION:
File created successfully at: /testbed/check_slots.py

ACTION:
<function=bash>
<parameter=command>python /testbed/check_slots.py</parameter>
</function>

OBSERVATION:
Symbol          has __slots__: True
                 __slots__ value: ('name',)
AtomicExpr      has __slots__: True
                 __slots__ value: ()
Atom            has __slots__: True
                 __slots__ value: ()
Expr            has __slots__: True
                 __slots__ value: ()
Boolean         has __slots__: True
                 __slots__ value: ()
Basic           has __slots__: True
                 __slots__ value: ('_mhash', '_args', '_assumptions')
Printable       has __slots__: False
EvalfMixin      has __slots__: True
                 __slots__ value: ()
object          has __slots__: False

ACTION:
Now I see the issue! `Printable` doesn't have `__slots__` defined. Let's check the `Printable` class:
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/sympy/core/_print_helpers.py</parameter>
<parameter=view_range>[1, 30]</parameter>
</function>

OBSERVATION:
Here's the result of running `cat -n` on /testbed/sympy/core/_print_helpers.py:
     1  """
     2  Base class to provide str and repr hooks that `init_printing` can overwrite.
     3
     4  This is exposed publicly in the `printing.defaults` module,
     5  but cannot be defined there without causing circular imports.
     6  """
     7
     8  class Printable:
     9      """
    10      The default implementation of printing for SymPy classes.
    11
    12      This implements a hack that allows us to print elements of built-in
    13      Python containers in a readable way. Natively Python uses ``repr()``
    14      even if ``str()`` was explicitly requested. Mix in this trait into
    15      a class to get proper default printing.
    16
    17      This also adds support for LaTeX printing in jupyter notebooks.
    18      """
    19
    20      # Note, we always use the default ordering (lex) in __str__ and __repr__,
    21      # regardless of the global setting. See issue 5487.
    22      def __str__(self):
    23          from sympy.printing.str import sstr
    24          return sstr(self, order=None)
    25
    26      __repr__ = __str__
    27
    28      def _repr_disabled(self):
    29          """
    30          No-op repr function used to disable jupyter display hooks.

ACTION:
We need to add `__slots__ = ()` to the `Printable` class as well:
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/testbed/sympy/core/_print_helpers.py</parameter>
<parameter=old_str>
class Printable:
    """
    The default implementation of printing for SymPy classes.

    This implements a hack that allows us to print elements of built-in
    Python containers in a readable way. Natively Python uses ``repr()``
    even if ``str()`` was explicitly requested. Mix in this trait into
    a class to get proper default printing.

    This also adds support for LaTeX printing in jupyter notebooks.
    """
</parameter>
<parameter=new_str>
class Printable:
    """
    The default implementation of printing for SymPy classes.

    This implements a hack that allows us to print elements of built-in
    Python containers in a readable way. Natively Python uses ``repr()``
    even if ``str()`` was explicitly requested. Mix in this trait into
    a class to get proper default printing.

    This also adds support for LaTeX printing in jupyter notebooks.
    """

    __slots__ = ()
</parameter>
</function>
...
```
Listing 19: The intermediate reflection process between the two Stage-3 moments of the successful trajectory in sympy__sympy-20590.
Listing 19 shows the advanced reflection capability through slot checking along the entire class hierarchy, which successfully leads to the discovery of the erroneous Printable class. The agent's subsequent behavior includes the routine test rerun and the observation that the reproducing script (Listing 18) now passes. This chain of actions recovers and resonates with the developer discussions in hints_text (Listing 17), showing the superior reflection skill of the RL prior over the other priors.
### G.3 Pattern Comparison between SFT with Expert Data and End-to-End RL
In this section, we compare the interaction-trajectory patterns of the RL prior that undergoes (i) $1.5\times 2^{28}$ SWE-Agent expert SFT tokens (referred to below as "heavy SFT") and (ii) $2^{21}$ SWE-Agent expert SFT tokens + end-to-end SWE-Agent RL (referred to below as "light SFT + end-to-end RL") as a qualitative study for Figure 7.
We select an instance solved by both methods, matplotlib__matplotlib-13989, as a case study. We attach the actions along the two interaction trajectories for comparison.
The action patterns of the heavy SFT model read as Listing 20:
````
ACTION 0:
I'll help you implement the necessary changes to fix the issue where `hist()` doesn't respect `range=...` when `density=True`. Let's follow the steps:
1. First, let's find and read the relevant code:
<function=bash>
<parameter=command>find /testbed -type f -name "*.py" | grep -v "test_"</parameter>
</function>

ACTION 1:
Let's look for the histogram implementation since that's what we need to fix. The PR mentions that the bug is in the `hist()` function when `density=True` and `range=...` is specified.
<function=bash>
<parameter=command>grep -r "def hist" --include="*.py" /testbed | grep -v "test_"</parameter>
</function>

ACTION 2:
The main histogram implementation appears to be in `/testbed/lib/matplotlib/axes/_axes.py`. Let's look at that file:
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
</function>

ACTION 3:
Let me look at the histogram implementation specifically:
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6365, 6899]</parameter>
</function>

ACTION 4:
Let me also look at the `histogram_bin_edges` function since it's used in the code:
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[40, 60]</parameter>
</function>

ACTION 5:
2. Let's create a script to reproduce the error:
<function=str_replace_editor>
<parameter=command>create</parameter>
<parameter=path>/testbed/reproduce.py</parameter>
...

ACTION 6:
Let's run it to confirm the issue:
<function=bash>
<parameter=command>python /testbed/reproduce.py</parameter>
</function>

ACTION 7:
The issue is confirmed: when `density=True` is specified, the `range=(0, 1)` parameter is not respected. The bins should start at 0 and end at 1, but they don't.
3. Let's fix the issue. Looking at the code, the problem is in the `hist` method and specifically around line 6689-6690:
```python
if density and not stacked:
    hist_kwargs = dict(density=density)
```
The issue is that when `density=True`, the `range` parameter gets lost because we're not including it in the `hist_kwargs`. Let's modify the code to fix this:
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
...

ACTION 8:
4. Let's run the reproduce script again to confirm the fix:
<function=bash>
<parameter=command>python /testbed/reproduce.py</parameter>
</function>

ACTION 9:
Great! The fix works. Now the bins start at 0 and end at 1 as expected.
5. Let's think about edge cases:
1. When `range=None` - This should use the min/max of the data, which is already handled correctly by the existing code
2. When `range` is specified but `density=False` - The range should be respected, which it already is
3. When `range` is specified and `density=True` - The range should be respected, which our fix now ensures
4. When `bins` is a sequence - The range should be ignored, which is already handled by the code
5. When `range` includes values outside the data range - The histogram should include empty bins for the range outside the data, which is already handled by numpy's histogram function
Let me test some of these edge cases to be sure:
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/testbed/reproduce.py</parameter>
...

ACTION 10:
<function=bash>
<parameter=command>python /testbed/reproduce.py</parameter>
</function>

ACTION 11:
All test cases look good:
1. When `range=(0, 1)` and `density=True`, the bins start at 0 and end at 1
2. When `range=None` and `density=True`, the bins use the min/max of the data
3. When `range=(0, 1)` and `density=False`, the bins start at 0 and end at 1
4. When `bins` is a sequence, the range parameter is ignored and the sequence is used directly
5. When `range=(-0.5, 1.5)`, the bins start at -0.5 and end at 1.5
The fix looks good and handles all edge cases correctly. Let me submit the changes:
<function=submit>
</function>
````
Listing 20: The interaction made by the RL prior with $1.5\times 2^{28}$ SWE-Agent SFT tokens on matplotlib__matplotlib-13989.
The action patterns of the light SFT + end-to-end RL model read as Listing 21:
```
ACTION 0:
Understood. Let's tackle this problem step by step.
First, I need to find the relevant code in the /testbed directory. The user mentioned that the code is in /testbed, so I'll start by listing the files there.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed</parameter>
</function>

ACTION 1:
Hmm, the /testbed directory has a lot of files, but I need to focus on the code related to the hist function. The hist function is part of matplotlib.pyplot, so I should look for the pyplot module in the /testbed/lib/matplotlib directory.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/pyplot.py</parameter>
</function>

ACTION 2:
Okay, I'll look at the hist function in pyplot.py.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/pyplot.py</parameter>
<parameter=view_range>[2610, 2625]</parameter>
</function>

ACTION 3:
The hist function in pyplot.py simply calls the hist method of the current axes. So the actual implementation is in the Axes class.
I'll look for the Axes.hist method in the /testbed/lib/matplotlib/axes directory.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes</parameter>
</function>

ACTION 4:
The Axes.hist method is likely in _axes.py, which is where most of the plotting methods are defined.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
</function>

ACTION 5:
I'll look at the hist method in _axes.py.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6365, 6400]</parameter>
</function>

ACTION 6:
I'll look at more of the hist method to see where the range is handled.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6400, 6450]</parameter>
</function>

ACTION 7:
I'll look at the code where the bins are calculated.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6450, 6500]</parameter>
</function>

ACTION 8:
I'll look at the code where the bins are calculated.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6500, 6550]</parameter>
</function>

ACTION 9:
I'll look at the code where the bins are calculated.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6550, 6600]</parameter>
</function>

ACTION 10:
I'll look at the code where the bins are calculated.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6600, 6650]</parameter>
</function>

ACTION 11:
I'll look at the code where the bins are calculated.
<function=str_replace_editor>
<parameter=command>view</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
<parameter=view_range>[6650, 6700]</parameter>
</function>

ACTION 12:
Ah, here's the key part. The code calculates the bins using np.histogram, and if the range is specified, it's passed to histogram_bin_edges. But when density is True, the code may be overriding the range.
Wait, in the code:

    if not input_empty and len(x) > 1:
        if weights is not None:
            _w = np.concatenate(w)
        else:
            _w = None
        bins = histogram_bin_edges(np.concatenate(x),
                                   bins, bin_range, _w)
    else:
        hist_kwargs['range'] = bin_range

But when density is True, the code may be using the default range of the data, not the specified range.
Wait, no. The code for calculating the bins is:

    if bins is None:
        bins = rcParams['hist.bins']
    if bin_range is not None:
        bin_range = self.convert_xunits(bin_range)
    if bin_range is None:
        xmin = np.inf
        xmax = -np.inf
        for xi in x:
            if len(xi):
                xmin = min(xmin, np.nanmin(xi))
                xmax = max(xmax, np.nanmax(xi))
        if not np.isnan([xmin, xmax]).any() and not (xmin > xmax):
            bin_range = (xmin, xmax)
    # If bins are not specified either explicitly or via range,
    # we need to figure out the range required for all datasets,
    # and supply that to np.histogram.
    if not input_empty and len(x) > 1:
        if weights is not None:
            _w = np.concatenate(w)
        else:
            _w = None
        bins = histogram_bin_edges(np.concatenate(x),
                                   bins, bin_range, _w)
    else:
        hist_kwargs['range'] = bin_range

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs = dict(density=density)

    # List to store all the top coordinates of the histograms
    tops = []
    mlast = None
    # Loop through datasets
    for i in range(nx):
        # this will automatically overwrite bins,
        # so that each histogram uses the same bins
        m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
        m = m.astype(float)  # causes problems later if it's an int
        if mlast is None:

But when density is True, the code passes density=True to np.histogram. However, when density is True, np.histogram normalizes the counts, but the bin edges are still determined by the range.
Wait, but the problem is that when density=True, the range is not respected. The user's code is:

    _, bins, _ = plt.hist(np.random.rand(10), "auto", range=(0, 1), density=True)
    print(bins)

and the output is:

    [0.00331535 0.18930174 0.37528813 0.56127453 0.74726092 0.93324731]

which is not from 0 to 1. But when density=False, the range is respected.
This suggests that when density=True, the code is not using the specified range for the bins.
But why?
Let me look at the code again.
When the code calculates the bins:

    if not input_empty and len(x) > 1:
        if weights is not None:
            _w = np.concatenate(w)
        else:
            _w = None
        bins = histogram_bin_edges(np.concatenate(x),
                                   bins, bin_range, _w)
    else:
        hist_kwargs['range'] = bin_range

But when len(x) is 1 (as in the user's code), the code enters the else clause and sets hist_kwargs['range'] = bin_range.
Then, in the loop:

    for i in range(nx):
        m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)

But wait, if bins is None, then np.histogram will use the range. But if bins is a string (like "auto"), then np.histogram will calculate the bins based on the data and the range.
Wait, in the user's code, bins is "auto" and range is (0, 1). When density=True, the code should calculate the bins based on the range (0, 1).
But in the code, when len(x) is 1, the code sets hist_kwargs['range'] = bin_range. Then, in the loop, the code calls np.histogram(x[i], bins, weights=w[i], **hist_kwargs).
But if bins is a string (like "auto"), then np.histogram will calculate the bins based on the data and the range. So the range should be respected.
But in the user's code, the bins are not from 0 to 1. This suggests that the code is not passing the range to np.histogram when density=True.
But why?
This is puzzling. Let me try to reproduce the error.
I'll create a script in the /testbed directory.
<function=str_replace_editor>
<parameter=command>create</parameter>
<parameter=path>/testbed/reproduce_error.py</parameter>
...

ACTION 13:
I'll run the script.
<function=bash>
<parameter=command>python /testbed/reproduce_error.py</parameter>
</function>

ACTION 14:
Hmm, the output is not from 0 to 1. This confirms the error.
Now, I need to find where in the code the range is not being passed when density=True.
Let me look at the code again.
When len(x) is 1, the code sets hist_kwargs['range'] = bin_range. Then, in the loop:

    for i in range(nx):
        m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)

But if bins is a string (like "auto"), then np.histogram will calculate the bins based on the data and the range. So the range should be respected.
But in the user's code, the bins are not from 0 to 1. This suggests that the code is not passing the range to np.histogram when density=True.
But why?
This is confusing. Let me look at the code for np.histogram.
The np.histogram function has a range parameter. If range is specified, the bin edges are calculated based on that range. If the data is outside the range, it is ignored.
In the user's code, the data is between 0 and 1, so the range (0, 1) should be respected.
But when density=True, the code is not respecting the range. This suggests that the code is not passing the range to np.histogram when density=True.
But why?
This is puzzling. Let me try to modify the code in _axes.py.
I'll look for the line where hist_kwargs is set.
In the code:

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs = dict(density=density)

This line is overwriting hist_kwargs, which may include the range.
Ah, here's the problem. If density is True and stacked is False, the code sets hist_kwargs to dict(density=density), which overwrites any previous hist_kwargs, including the range.
This is the root of the problem. The code is overwriting hist_kwargs, which includes the range, when density is True and stacked is False.
So, in the code:

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs = dict(density=density)

This line is overwriting hist_kwargs, which may include the range. So the range is not passed to np.histogram.
This is the cause of the problem. The solution is to add the range to hist_kwargs in this case.
So, the code should be:

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs.setdefault('density', density)

But wait, hist_kwargs may already have the range. So, the code should not overwrite hist_kwargs, but add to it.
The code is:

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs = dict(density=density)

This line is overwriting hist_kwargs, which may include the range. So the range is lost.
The solution is to modify this code to add to hist_kwargs, not overwrite it.
So, the code should be:

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs.setdefault('density', density)

But wait, hist_kwargs may not exist. Let me look at the code.
The code for hist_kwargs is:

    hist_kwargs = dict()

    # if the bin_range is not given, compute without nan numpy
    # does not do this for us when guessing the range (but will
    # happily ignore nans when computing the histogram).
    if bin_range is None:
        xmin = np.inf
        xmax = -np.inf
        for xi in x:
            if len(xi):
                xmin = min(xmin, np.nanmin(xi))
                xmax = max(xmax, np.nanmax(xi))
        # make sure we have seen at least one non-nan and finite
        # value before we reset the bin range
        if not np.isnan([xmin, xmax]).any() and not (xmin > xmax):
            bin_range = (xmin, xmax)

    # If bins are not specified either explicitly or via range,
    # we need to figure out the range required for all datasets,
    # and supply that to np.histogram.
    if not input_empty and len(x) > 1:
        if weights is not None:
            _w = np.concatenate(w)
        else:
            _w = None
        bins = histogram_bin_edges(np.concatenate(x),
                                   bins, bin_range, _w)
    else:
        hist_kwargs['range'] = bin_range

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs = dict(density=density)

    # List to store all the top coordinates of the histograms
    tops = []
    mlast = None
    # Loop through datasets
    for i in range(nx):
        # this will automatically overwrite bins,
        # so that each histogram uses the same bins
        m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
        m = m.astype(float)  # causes problems later if it's an int
        if mlast is None:

Ah, here's the problem. The code sets hist_kwargs = dict(density=density), which overwrites any previous hist_kwargs, including the range.
So, if density is True and stacked is False, the code overwrites hist_kwargs with dict(density=density), which does not include the range. So the range is not passed to np.histogram.
This is the cause of the problem. The solution is to modify this code to add to hist_kwargs, not overwrite it.
So, the code should be:

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs.setdefault('density', density)

But hist_kwargs may not exist. Wait, hist_kwargs is initialized as:

    hist_kwargs = dict()

So, the code can be:

    density = bool(density) or bool(normed)
    if density and not stacked:
        hist_kwargs['density'] = density

This way, hist_kwargs includes the density and the range.
This would fix the problem. Let me test this.
I'll modify the code in _axes.py.
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/testbed/lib/matplotlib/axes/_axes.py</parameter>
...

ACTION 15:
I'll run the reproduce_error.py script again.
<function=bash>
<parameter=command>python /testbed/reproduce_error.py</parameter>
</function>

ACTION 16:
Great, the fix works. The bins are now from 0 to 1.
Now, I can submit the changes.
<function=submit>
</function>
```
Listing 21: The interaction made by the RL prior with $2^{21}$ SWE-Agent SFT tokens and end-to-end SWE-Agent RL on matplotlib__matplotlib-13989.
It can be seen that the two types of patterns are significantly different from each other:
- The patterns of the heavy SFT model appear to be more regulated, with explicitly labeled steps 1 to 5 along the intermediate reasoning process. The reasoning content is generally short as well. This reflects the behavior of the model shaped by the heavy SFT data from the proprietary models.
- The patterns of the light SFT + end-to-end RL model, in contrast, are far less regulated. More turns of interaction are spent on repo exploration, and significantly more tokens are generated at the code-edit steps before an action is taken. This reflects the behavior shaped by end-to-end RL (and less influenced by the expert data), as the sign of reasoning-intensive skill priors is retained in the agentic scenario.
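The root cause both models converge on in the matplotlib case study is a generic Python pitfall: rebinding a keyword dict discards previously collected entries, while assigning a key preserves them. A minimal sketch (variable names mirror the trajectory; no matplotlib involved, so this is an illustration rather than the actual patch):

```python
def build_kwargs_buggy(bin_range, density, stacked):
    hist_kwargs = {'range': bin_range}
    if density and not stacked:
        # Rebinding the name silently drops the 'range' entry.
        hist_kwargs = dict(density=density)
    return hist_kwargs

def build_kwargs_fixed(bin_range, density, stacked):
    hist_kwargs = {'range': bin_range}
    if density and not stacked:
        # Assigning the key keeps 'range' alongside 'density'.
        hist_kwargs['density'] = density
    return hist_kwargs

print(build_kwargs_buggy((0, 1), True, False))   # {'density': True}
print(build_kwargs_fixed((0, 1), True, False))   # {'range': (0, 1), 'density': True}
```

With the buggy variant, `np.histogram` never receives `range`, which is exactly why the bin edges in the issue do not span (0, 1).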
### G.4 Generalization to Other Tasks
In this section, we study whether the skills incentivized by Agentless training and SWE-Agent adaptation through SWE-Smith trajectories could transfer to other SWE tasks. We use SWE-bench-live and SWE-bench Multilingual as our testbeds.
SWE-bench-live (zhang2025swe) is a benchmark for real-world issue resolution, evaluating AI systems on software engineering tasks. Using an automated curation pipeline, it is updated monthly to ensure fresh tasks and rigorous, contamination-free evaluation. For our experiments, we selected the default set of 300 tasks, with data collected between October 2024 and March 2025. Compared to SWE-bench Verified, SWE-bench-live exhibits a higher degree of distributional shift.
SWE-bench Multilingual (yang2025swesmith) introduces 300 curated tasks from 42 GitHub repositories across 9 programming languages, including Rust, Java, PHP, Ruby, JavaScript/TypeScript, Go, and C/C++, covering domains such as web frameworks, data tools, core utilities, and libraries. Compared to SWE-bench Verified, which focuses exclusively on Python, SWE-bench Multilingual exhibits greater linguistic and domain diversity, posing additional challenges in cross-language generalization and transferability of software engineering capabilities.
Similar to previous experiments, we evaluated four model stages as the priors: the original Qwen2.5-72B (Base), the mid-trained model (MT), the model activated with reasoning data through supervised finetuning (SFT), and the model after RL training (RL). We still use the open-source SWE-smith trajectories to activate the agentic capabilities of each prior.
Figures 11 and 12 show the performance of the four priors on SWE-bench-Live and SWE-bench Multilingual under varied amounts of agentic trajectories for adaptation ($2^{21}$ tokens as one-step gradient descent; $2^{23}$, $1.1\times 2^{27}$, and $1.5\times 2^{28}$ tokens as 100, 2,000, and 5,016 training trajectories). Each SWE-Agent adaptation experiment is conducted through lightweight supervised finetuning, whose training time ranges from several minutes to two hours at most.
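Assuming the #SFT figures above denote training-token budgets (an interpretation, not stated explicitly), a quick sanity check shows the budget-to-trajectory mapping is roughly consistent, at about 75-85k tokens per SWE-Smith trajectory:

```python
# Sanity check: divide each (assumed) token budget by its trajectory count.
# The per-trajectory cost comes out roughly constant (~75-85k tokens).
budgets = {
    100: 2**23,          # ~8.4M tokens
    2_000: 1.1 * 2**27,  # ~147.6M tokens
    5_016: 1.5 * 2**28,  # ~402.7M tokens
}
for n_traj, tokens in budgets.items():
    print(f"{n_traj:>5} trajectories -> {tokens / n_traj:,.0f} tokens/trajectory")
```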
Compared to the Base prior, the priors specifically enhanced with Agentless skills (SFT and RL) demonstrate stronger task generalization, especially in data-scarce settings. However, as more SWE-Smith trajectories are used for adaptation, the performance of the Base and MT priors approaches that of the SFT and RL priors. We attribute this to the distributional gaps between the different SWE tasks. Exploring recipes that enable stronger out-of-distribution and task-agnostic generalization is left for future work.
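All panels below report Pass@k. Assuming the standard unbiased estimator of Chen et al. (2021) (the paper does not spell out its estimator, so this is a sketch, not the authors' exact code), Pass@k from `n` sampled rollouts with `c` passing can be computed as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total rollouts (c of them correct),
    resolves the task. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(4, 1, 1)` gives 0.25, and with `k = n` it returns 1.0 whenever at least one rollout is correct.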
<details>
<summary>figs/sec4_generalization_plots/l-100-s1.png Details</summary>

### Visual Description
## Line Chart: Pass@k Performance Comparison
### Overview
This image is a line chart comparing the performance of four different methods (RL, SFT, MT, Base) on a metric called "Pass@k (%)" as the parameter `k` increases from 1 to 3. The chart illustrates how the success rate (Pass@k) changes for each method with an increasing number of attempts or samples (`k`).
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** `k`
* **Scale:** Discrete values at 1, 2, and 3.
* **Y-Axis:**
* **Label:** `Pass@k (%)`
* **Scale:** Linear scale from 0.0 to 12.5, with major gridlines at intervals of 2.5 (0.0, 2.5, 5.0, 7.5, 10.0, 12.5).
* **Legend:** Located in the top-left corner of the plot area. It contains four entries, each with a colored line and marker:
* **RL:** Red line with circular markers.
* **SFT:** Orange line with circular markers.
* **MT:** Purple line with circular markers.
* **Base:** Blue line with circular markers.
* **Data Series:** Four distinct lines, each connecting data points at `k=1`, `k=2`, and `k=3`.
### Detailed Analysis
The following table reconstructs the approximate data points for each method, extracted by cross-referencing the line color and marker position with the legend and axis scales.
| Method (Color) | k=1 (Approx. %) | k=2 (Approx. %) | k=3 (Approx. %) | Visual Trend |
| :--- | :--- | :--- | :--- | :--- |
| **RL (Red)** | ~4.5 | ~7.0 | ~9.2 | Steep, consistent upward slope. |
| **SFT (Orange)** | ~4.0 | ~8.0 | ~11.0 | Steepest upward slope, surpassing RL after k=1. |
| **MT (Purple)** | ~1.5 | ~2.0 | ~2.3 | Very shallow upward slope, nearly flat. |
| **Base (Blue)** | ~1.0 | ~1.8 | ~2.6 | Shallow upward slope, starting lowest but slightly surpassing MT at k=3. |
**Spatial Grounding & Verification:**
* At `k=1`, the red (RL) and orange (SFT) markers are clustered near the 5.0 gridline, with RL slightly higher. The purple (MT) and blue (Base) markers are clustered near the bottom, with MT slightly higher than Base.
* At `k=2`, the orange (SFT) marker is clearly above the red (RL) marker. The blue (Base) marker has risen to be slightly below the purple (MT) marker.
* At `k=3`, the orange (SFT) marker is the highest point on the chart, above the 10.0 gridline. The red (RL) marker is below it. The blue (Base) marker is now slightly above the purple (MT) marker.
### Key Observations
1. **Performance Hierarchy:** For all values of `k`, the SFT and RL methods significantly outperform the MT and Base methods. The gap between the top two (SFT, RL) and bottom two (MT, Base) methods widens as `k` increases.
2. **Growth Rate:** SFT and RL show strong, positive growth in Pass@k as `k` increases. SFT exhibits the highest growth rate, starting slightly below RL at `k=1` but ending well above it at `k=3`.
3. **Low-Performance Cluster:** The MT and Base methods show minimal improvement with increasing `k`. Their performance lines are relatively flat and close together, with Base showing a slightly better rate of improvement, eventually overtaking MT at `k=3`.
4. **No Outliers:** All data series follow smooth, monotonic trends without unexpected dips or spikes.
### Interpretation
The chart demonstrates the effectiveness of different training or prompting strategies (RL, SFT, MT) compared to a Base model on a task measured by the Pass@k metric. Pass@k typically measures the probability that at least one of `k` generated samples is correct.
* **What the data suggests:** The Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) methods are highly effective, as their success rates increase substantially when allowed more attempts (`k`). This indicates these methods produce a higher density of correct solutions within their top `k` outputs. The Base model and the MT method (the mid-trained model) show poor scalability with `k`, suggesting their output distributions are less likely to contain correct answers even with multiple samples.
* **Relationship between elements:** The steep slopes for SFT and RL directly correlate with the `k` parameter, showing a strong positive relationship between the number of attempts and the likelihood of success. The shallow slopes for MT and Base indicate a weak relationship.
* **Notable implication:** The crossover where SFT surpasses RL and Base surpasses MT highlights that the relative advantage of one method over another can depend on the operational constraint (`k`). If only one attempt is allowed (`k=1`), RL and SFT are comparable. If multiple attempts are feasible (`k>1`), SFT becomes the clear leader. Similarly, Base is the worst performer at `k=1` but marginally better than MT at `k=3`.
</details>
(a) #SFT $=2^{21}$.
<details>
<summary>figs/sec4_generalization_plots/l-100.png Details</summary>

### Visual Description
## Line Chart: Pass@k Performance Comparison
### Overview
This is a line chart comparing the performance of four different methods (RL, SFT, MT, Base) on a metric called "Pass@k (%)". The chart plots this metric against increasing values of "k" (1, 2, and 3). All four methods show an upward trend as k increases.
### Components/Axes
* **X-Axis:** Labeled "k". It has three discrete, evenly spaced tick marks at values 1, 2, and 3.
* **Y-Axis:** Labeled "Pass@k (%)". The scale runs from 0.0 to 12.5, with major grid lines at intervals of 2.5 (0.0, 2.5, 5.0, 7.5, 10.0, 12.5).
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries, each with a colored line and marker symbol:
* **RL:** Red line with a circle marker.
* **SFT:** Orange line with a circle marker.
* **MT:** Purple line with a circle marker.
* **Base:** Blue line with a circle marker.
* **Data Markers:** Each data series uses an 'x' marker at k=1 and a circle marker at k=2 and k=3.
### Detailed Analysis
The chart displays the following approximate data points for each method. The trend for all series is a positive slope, indicating Pass@k increases with k.
**1. RL (Red Line):**
* **Trend:** Slopes upward steadily.
* **Data Points:**
* k=1: ~9.0% (marked with a red 'x')
* k=2: ~10.5% (red circle)
* k=3: ~12.2% (red circle)
**2. SFT (Orange Line):**
* **Trend:** Slopes upward steadily, with a slope less steep than RL and MT.
* **Data Points:**
* k=1: ~5.5% (orange 'x')
* k=2: ~7.6% (orange circle)
* k=3: ~9.6% (orange circle)
**3. MT (Purple Line):**
* **Trend:** Slopes upward with the steepest incline of all series. It starts below RL but surpasses it by k=3.
* **Data Points:**
* k=1: ~6.0% (purple 'x')
* k=2: ~10.5% (purple circle) - intersects with RL at this point.
* k=3: ~12.7% (purple circle) - the highest value on the chart.
**4. Base (Blue Line):**
* **Trend:** Slopes upward steadily, maintaining the lowest performance across all k values.
* **Data Points:**
* k=1: ~2.7% (blue 'x')
* k=2: ~5.0% (blue circle)
* k=3: ~7.9% (blue circle)
### Key Observations
1. **Universal Improvement:** All four methods show improved Pass@k scores as k increases from 1 to 3.
2. **MT's Steep Ascent:** The MT method exhibits the most dramatic improvement, nearly doubling its score from k=1 to k=3 and achieving the highest overall score at k=3.
3. **RL vs. MT Crossover:** RL starts as the top performer at k=1. MT catches up to RL at k=2 (both ~10.5%) and then surpasses it at k=3.
4. **Consistent Hierarchy at k=1:** At the starting point (k=1), the performance order from highest to lowest is clearly RL > MT > SFT > Base.
5. **Base as Lower Bound:** The Base method consistently performs the worst but shows a similar rate of improvement to SFT.
### Interpretation
The chart demonstrates the effectiveness of different training or prompting strategies (RL, SFT, MT) compared to a Base model on a task where performance is measured by the probability of getting at least one correct answer in `k` attempts (Pass@k).
* **The value of `k`:** Increasing `k` (allowing more attempts) universally improves the chance of success for all methods, which is an expected outcome.
* **Method Efficacy:** All advanced methods (RL, SFT, MT) significantly outperform the Base model at every `k` value, indicating their added value.
* **Strategic Implications:** The choice of optimal method may depend on the operational constraint for `k`. If only one attempt is allowed (`k=1`), RL is the best choice. However, if multiple attempts are feasible (`k=3`), MT becomes the most effective strategy, suggesting it may be better at generating diverse or high-quality candidate solutions that pay off when more chances are given. The steep slope of MT implies its outputs have higher variance or a better "top-k" distribution, making it more likely to contain a correct answer when more samples are drawn.
* **SFT's Position:** SFT provides a solid improvement over Base but is consistently outperformed by RL and MT, suggesting that reinforcement learning (RL) or the specific technique used in MT offers advantages beyond supervised fine-tuning alone for this metric.
</details>
(b) #SFT $=2^{23}$.
<details>
<summary>figs/sec4_generalization_plots/l-2000.png Details</summary>

### Visual Description
## Line Chart: Pass@k Performance Comparison
### Overview
This image is a line chart comparing the performance of four different methods (RL, SFT, MT, Base) on a metric called "Pass@k (%)" across three values of k (1, 2, and 3). The chart demonstrates how the success rate of each method changes as the parameter k increases.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** `k`
* **Scale:** Discrete values at 1, 2, and 3.
* **Y-Axis:**
* **Label:** `Pass@k (%)`
* **Scale:** Linear scale from 5.0 to 17.5, with major ticks at 5.0, 7.5, 10.0, 12.5, 15.0, and 17.5.
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries, each with a colored line and marker:
* **RL:** Red line with circular markers.
* **SFT:** Orange line with circular markers.
* **MT:** Purple line with circular markers.
* **Base:** Blue line with circular markers.
* **Data Point Markers:** The data points at `k=1` are marked with an 'x' symbol for all series. The data points at `k=2` and `k=3` are marked with solid circles.
### Detailed Analysis
The chart plots the Pass@k percentage for each method at k=1, 2, and 3. All values are approximate based on visual inspection of the chart.
**Trend Verification:** All four data series show a clear upward trend, with Pass@k increasing as k increases from 1 to 3.
**Data Series Breakdown:**
1. **RL (Red Line):**
* **Trend:** Steepest upward slope among all series.
* **Data Points:**
* k=1: ~10.5% (marked with red 'x')
* k=2: ~13.5%
* k=3: 15.0% (appears to be exactly on the grid line)
2. **SFT (Orange Line):**
* **Trend:** Steady upward slope, less steep than RL.
* **Data Points:**
* k=1: ~11.0% (marked with orange 'x') - This is the highest starting point.
* k=2: ~12.5%
* k=3: ~14.5%
3. **MT (Purple Line):**
* **Trend:** Steady upward slope, similar to SFT.
* **Data Points:**
* k=1: ~10.5% (marked with purple 'x') - Very close to RL's starting point.
* k=2: ~12.0%
* k=3: ~14.5% - Ends at approximately the same point as SFT.
4. **Base (Blue Line):**
* **Trend:** Steepest relative increase from its starting point.
* **Data Points:**
* k=1: ~6.5% (marked with blue 'x') - The lowest starting point by a significant margin.
* k=2: ~11.5%
* k=3: ~14.0%
### Key Observations
* **Performance Hierarchy at k=1:** SFT > RL ≈ MT > Base.
* **Performance Hierarchy at k=3:** RL > SFT ≈ MT > Base.
* **Convergence:** The performance gap between the methods narrows significantly as k increases. At k=1, the spread is ~4.5 percentage points (from ~6.5% to ~11.0%). At k=3, the spread is only ~1.0 percentage point (from ~14.0% to 15.0%).
* **RL's Overtake:** The RL method starts slightly below SFT at k=1 but surpasses it by k=2 and achieves the highest score at k=3.
* **Base Model's Improvement:** The Base model shows the most dramatic relative improvement, nearly catching up to the other methods by k=3 despite starting far behind.
### Interpretation
The chart illustrates a common evaluation scenario in fields like machine learning or code generation, where "Pass@k" measures the probability that at least one of k generated samples is correct. The data suggests several insights:
1. **Benefit of Multiple Attempts (k):** For all methods, allowing more attempts (increasing k) leads to a higher success rate. This is an expected and fundamental property of the Pass@k metric.
2. **Method Efficacy:** The RL (Reinforcement Learning) method demonstrates the strongest scaling behavior with k, ultimately yielding the best performance. SFT (Supervised Fine-Tuning) and MT (the mid-trained model) show solid, comparable performance. The Base model, while starting poorly, benefits greatly from increased k, indicating its underlying capability is unlocked with more sampling.
3. **Diminishing Returns:** The convergence of lines at higher k suggests that the advantage of advanced training methods (RL, SFT, MT) over the Base model is most pronounced when only a single attempt (k=1) is allowed. When multiple attempts are permitted, the raw sampling capability of the base model can partially compensate for its lower initial quality.
4. **Practical Implication:** If the application allows for multiple attempts (high k), the choice of method may be less critical than if only a single attempt (k=1) is feasible. For single-attempt scenarios, investing in methods like SFT or RL provides a clear performance benefit.
</details>
(c) #SFT $=1.1\times 2^{27}$.
<details>
<summary>figs/sec4_generalization_plots/l-5000.png Details</summary>

### Visual Description
## Line Chart: Pass@k Performance Comparison
### Overview
This image is a line chart comparing the performance of four different methods (RL, SFT, MT, Base) on a metric called "Pass@k" across three discrete values of k (1, 2, and 3). The chart shows how the success rate (in percentage) changes as the number of attempts (k) increases for each method.
### Components/Axes
* **X-axis:** Labeled "k". It has three discrete, evenly spaced markers: **1**, **2**, and **3**.
* **Y-axis:** Labeled "Pass@k (%)". It is a linear scale ranging from **5.0** to **17.5**, with major tick marks at 5.0, 7.5, 10.0, 12.5, 15.0, and 17.5.
* **Legend:** Located in the **bottom-right corner** of the plot area. It contains four entries, each with a colored line and marker:
* **RL:** Red line with circular markers.
* **SFT:** Orange line with circular markers.
* **MT:** Purple line with circular markers.
* **Base:** Blue line with circular markers.
* **Data Points:** For each method, data points are plotted at k=1, 2, and 3. The points at **k=1 are marked with an 'x'**, while the points at **k=2 and k=3 are marked with solid circles**.
### Detailed Analysis
The following table reconstructs the approximate data points for each method. Values are estimated from the chart's grid lines and carry inherent visual uncertainty.
| Method (Color) | k=1 (Pass@k %) | k=2 (Pass@k %) | k=3 (Pass@k %) | Visual Trend |
| :--- | :--- | :--- | :--- | :--- |
| **Base (Blue)** | ~8.5 | ~12.0 | ~12.2 | Sharp increase from k=1 to k=2, then nearly flat to k=3. |
| **RL (Red)** | ~11.5 | ~13.5 | ~14.5 | Steady, moderate upward slope across all k. |
| **SFT (Orange)** | ~12.0 | ~14.0 | ~15.5 | Steady, moderate upward slope, consistently above RL. |
| **MT (Purple)** | ~12.5 | ~15.5 | ~16.5 | Steep upward slope, consistently the highest-performing method. |
**Trend Verification:**
* **Base (Blue):** Slopes upward sharply from k=1 to k=2, then plateaus.
* **RL (Red):** Slopes upward steadily.
* **SFT (Orange):** Slopes upward steadily, parallel to but above RL.
* **MT (Purple):** Slopes upward steadily, with the steepest initial slope and highest final value.
### Key Observations
1. **Performance Hierarchy:** A clear and consistent performance hierarchy is maintained across all values of k: **MT > SFT > RL > Base**.
2. **Universal Improvement:** All four methods show an increase in Pass@k as k increases from 1 to 3, indicating that allowing more attempts improves the chance of success.
3. **Diminishing Returns for Base:** The "Base" method shows the most significant diminishing returns; its performance gain from k=2 to k=3 is negligible compared to the other methods.
4. **Consistent Gaps:** The performance gaps between the methods remain relatively constant. For example, MT maintains a lead of approximately 1-2 percentage points over SFT at each k value.
### Interpretation
The chart demonstrates the effectiveness of different training or prompting techniques (RL, SFT, MT) compared to a baseline ("Base") on a code generation or problem-solving task, as measured by the Pass@k metric.
* **What the data suggests:** The "MT" method is the most effective, followed by "SFT" and then "RL," with all three significantly outperforming the "Base" model. The fact that all methods improve with higher k is expected, as more samples increase the probability of generating a correct solution.
* **How elements relate:** The consistent ordering of the lines suggests that the advanced methods (MT, SFT, RL) provide a robust and scalable improvement over the base model. The plateau of the Base line might indicate a fundamental limitation in its capability that cannot be overcome simply by generating more samples.
* **Notable patterns:** The most notable pattern is the **parallel nature of the MT, SFT, and RL lines**. This suggests these methods might improve the model's underlying capability in a similar way, shifting the entire performance curve upward, whereas the Base model's curve has a different shape (sharper initial rise, quicker plateau). This could imply that the advanced methods not only improve the best-case performance (k=1) but also make the model more reliably correct as more attempts are allowed.
</details>
(d) #SFT $=1.5\times 2^{28}$.
Figure 11: Generalization analysis on SWE-bench-Live.
<details>
<summary>figs/sec4_generalization_plots/m-100-s1.png Details</summary>

### Visual Description
## Line Chart: Pass@k (%) Performance Comparison
### Overview
This is a line chart comparing the performance of four different models or methods (RL, SFT, MT, Base) across three values of `k` (1, 2, 3). The performance metric is "Pass@k (%)", which likely represents the percentage of problems solved correctly when given `k` attempts or samples. The chart shows that performance generally increases with `k` for most methods, but the rate of improvement varies significantly.
### Components/Axes
* **X-Axis:** Labeled "k". It has three discrete, evenly spaced tick marks at values 1, 2, and 3.
* **Y-Axis:** Labeled "Pass@k (%)". The scale runs from 0 to just above 20, with major tick marks at 0, 5, 10, 15, and 20.
* **Legend:** Located in the center-right portion of the chart area. It contains four entries, each with a colored line and marker:
* **RL:** Red line with circular markers.
* **SFT:** Orange line with circular markers.
* **MT:** Purple line with circular markers.
* **Base:** Blue line with circular markers.
* **Grid:** A light gray, dashed grid is present in the background, aligned with the major y-axis ticks.
### Detailed Analysis
The chart plots four data series. Below is the extracted data for each series at k=1, 2, and 3. Values are approximate based on visual alignment with the y-axis.
**1. SFT (Orange Line)**
* **Trend:** Shows a strong, steady upward slope from k=1 to k=3.
* **Data Points:**
* k=1: ~12.0%
* k=2: ~17.5%
* k=3: ~21.5%
**2. RL (Red Line)**
* **Trend:** Shows a strong upward slope, similar to SFT but starting from a lower point. The slope appears slightly steeper between k=1 and k=2 than between k=2 and k=3.
* **Data Points:**
* k=1: ~9.0%
* k=2: ~16.5%
* k=3: ~19.5%
**3. Base (Blue Line)**
* **Trend:** Shows a gentle upward slope. It starts very low and increases modestly with `k`.
* **Data Points:**
* k=1: ~1.0%
* k=2: ~1.8%
* k=3: ~4.0%
**4. MT (Purple Line)**
* **Trend:** Nearly flat. Performance shows almost no change as `k` increases from 1 to 3.
* **Data Points:**
* k=1: ~2.0%
* k=2: ~2.2%
* k=3: ~2.2%
### Key Observations
* **Performance Hierarchy:** At all values of `k`, the performance order from highest to lowest is consistently: SFT > RL > MT/Base (with Base surpassing MT at k=3).
* **Greatest Improvement:** The RL method shows the most dramatic relative improvement, more than doubling its Pass@1 score by k=3.
* **Stagnation:** The MT method's performance is effectively stagnant, showing negligible gain from increasing `k`.
* **Crossover:** The Base method, while starting the lowest, overtakes the MT method between k=2 and k=3.
* **Convergence Gap:** The gap between the top two methods (SFT, RL) and the bottom two (MT, Base) is substantial and widens as `k` increases.
### Interpretation
This chart demonstrates the effectiveness of different training or sampling strategies (likely for a code generation or problem-solving task) when evaluated with the Pass@k metric.
* **SFT (Supervised Fine-Tuning) is the most effective strategy** shown, consistently achieving the highest pass rates. Its strong performance suggests that fine-tuning on high-quality demonstrations is highly beneficial.
* **RL (Reinforcement Learning) is also highly effective**, particularly as `k` increases. Its steep improvement curve indicates that RL-trained models benefit greatly from having multiple attempts, possibly because they can explore a more diverse solution space.
* **The Base model performs poorly at k=1 but shows some capacity to improve with more samples**, suggesting its initial generations are low quality but it has some latent capability that can be unlocked with repeated sampling.
* **The MT (mid-trained) model shows a critical failure mode**: its performance does not scale with `k`. This implies the model is either generating very similar, incorrect solutions each time or has a fundamental limitation that prevents it from benefiting from additional attempts.
The data strongly suggests that for tasks measured by Pass@k, investing in SFT or RL training yields significantly better returns than the Base or MT approaches, especially when the evaluation allows for multiple attempts (k > 1). The widening gap at higher `k` values highlights that advanced training methods not only improve single-attempt accuracy but also dramatically improve the model's ability to self-correct or find correct solutions within a limited budget of attempts.
</details>
(a) #SFT $=2^{21}$.
<details>
<summary>figs/sec4_generalization_plots/m-100.png Details</summary>

### Visual Description
## Line Chart: Pass@k (%) Performance Comparison of Four Methods
### Overview
The image is a line chart comparing the performance of four different methods (RL, SFT, MT, Base) on a metric called "Pass@k (%)". The chart plots this metric against three discrete values of `k` (1, 2, and 3). All four methods show a positive, linear trend, with performance improving as `k` increases.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** `k`
* **Scale/Ticks:** Discrete values at 1, 2, and 3.
* **Y-Axis:**
* **Label:** `Pass@k (%)`
* **Scale/Ticks:** Linear scale from 0 to 20, with major ticks at 0, 5, 10, 15, and 20.
* **Legend:**
* **Position:** Top-left corner of the plot area.
* **Entries (from top to bottom):**
1. **RL** - Red line with circular markers.
2. **SFT** - Orange line with circular markers.
3. **MT** - Purple line with circular markers.
4. **Base** - Blue line with circular markers.
* **Data Series:** Four distinct lines, each corresponding to a legend entry. Each line connects three data points (at k=1, 2, 3). The data points at k=1 are marked with an 'x' symbol, while points at k=2 and k=3 are marked with filled circles.
### Detailed Analysis
**Trend Verification:** All four lines (RL, SFT, MT, Base) slope upward from left to right, indicating that the Pass@k (%) score increases for all methods as `k` increases from 1 to 3.
**Data Point Extraction (Approximate Values):**
| Method (Legend Color) | k=1 (Pass@k %) | k=2 (Pass@k %) | k=3 (Pass@k %) |
| :--- | :--- | :--- | :--- |
| **RL (Red)** | ~8.5 | ~16.0 | ~20.5 |
| **MT (Purple)** | ~7.0 | ~12.5 | ~17.5 |
| **SFT (Orange)** | ~5.5 | ~9.5 | ~13.0 |
| **Base (Blue)** | ~3.0 | ~8.0 | ~12.0 |
**Component Isolation & Spatial Grounding:**
* **Header/Title:** No explicit chart title is present.
* **Main Plot Area:** Contains the four data lines and gridlines.
* **Axes:** X-axis at the bottom, Y-axis on the left.
* **Legend:** Positioned in the upper-left quadrant, overlapping slightly with the gridlines but not obscuring data points. The order in the legend (RL, SFT, MT, Base) corresponds to the vertical order of the lines at k=3 (RL highest, Base lowest).
### Key Observations
1. **Consistent Hierarchy:** The performance ranking of the methods is consistent across all values of `k`. From highest to lowest Pass@k (%): **RL > MT > SFT > Base**.
2. **Linear Improvement:** The relationship between `k` and Pass@k (%) appears approximately linear for all methods within the range shown (k=1 to 3).
3. **Performance Gap:** The absolute performance gap between the top method (RL) and the bottom method (Base) widens as `k` increases. At k=1, the gap is ~5.5 percentage points; at k=3, it is ~8.5 percentage points.
4. **Marker Distinction:** The use of an 'x' marker for the k=1 data point for all series is a notable visual distinction from the circular markers used for k=2 and k=3.
### Interpretation
This chart demonstrates the comparative effectiveness of four different training stages (RL training, supervised fine-tuning, mid-training, and the base model) on a task measured by the Pass@k metric. Pass@k is a common metric in code generation and problem-solving tasks, representing the probability that at least one of `k` generated samples is correct.
The data suggests that the **RL method is the most effective** of the four, consistently achieving the highest Pass@k scores. The **Base model performs the worst**, indicating that any of the other training methods (SFT, MT, RL) provide a significant improvement over the baseline.
The positive slope for all lines indicates that allowing the model to generate more samples (increasing `k`) increases the chance of obtaining a correct solution, which is the expected behavior for the Pass@k metric. The fact that the RL line has the steepest slope suggests its performance benefits the most from an increased sample budget (`k`), or that its top-1 performance (k=1) is particularly strong relative to its top-k performance.
The consistent ranking implies a clear hierarchy in the efficacy of these methods for the specific task and evaluation setup used to generate this chart. The MT method occupies a middle ground, outperforming SFT but not reaching the level of RL.
</details>
(b) #SFT $=2^{23}$.
<details>
<summary>figs/sec4_generalization_plots/m-2000.png Details</summary>

### Visual Description
## Line Chart: Pass@k Performance Comparison of Four Methods
### Overview
This image is a line chart comparing the performance of four different methodsβRL, SFT, MT, and Baseβusing the Pass@k metric (expressed as a percentage). The chart plots performance across three values of k (1, 2, and 3). All methods show a positive trend, with Pass@k increasing as k increases.
### Components/Axes
- **X-axis**: Labeled "k". Discrete markers at values 1, 2, and 3.
- **Y-axis**: Labeled "Pass@k (%)". Linear scale from 5 to 35, with major gridlines at intervals of 5 (5, 10, 15, 20, 25, 30, 35).
- **Legend**: Located in the bottom-right corner of the plot area. It contains four entries, each with a colored line and circular marker:
- **RL**: Red line with red circular markers.
- **SFT**: Orange line with orange circular markers.
- **MT**: Purple line with purple circular markers.
- **Base**: Blue line with blue circular markers.
- **Plot Area**: Contains a light gray dashed grid for both axes. The four data series are plotted as solid lines connecting circular data points.
### Detailed Analysis
**Trend Verification**: All four lines slope upward from left to right, indicating that Pass@k performance improves for all methods as k increases from 1 to 3.
**Data Points (Approximate Values)**:
| k | SFT (Orange) | RL (Red) | MT (Purple) | Base (Blue) |
|-----|--------------|----------|-------------|-------------|
| 1 | ~13.5% | ~12.5% | ~11.5% | ~9.0% |
| 2 | ~26.5% | ~24.0% | ~22.0% | ~21.0% |
| 3 | ~30.5% | ~28.5% | ~27.0% | ~29.0% |
**Component Isolation & Cross-Reference**:
- The legend order (RL, SFT, MT, Base) does not correspond to the final performance order at k=3.
- At k=1 and k=2, the performance order from highest to lowest is consistently: SFT > RL > MT > Base.
- At k=3, the order changes. The Base method (blue) shows the steepest improvement between k=2 and k=3, surpassing both RL (red) and MT (purple) to claim the second-highest position. The final order at k=3 is: SFT > Base > RL > MT.
### Key Observations
1. **Consistent Leader**: The SFT method (orange line) is the top performer at every measured value of k.
2. **Convergence at Higher k**: The performance gap between the methods narrows as k increases. The spread at k=1 is approximately 4.5 percentage points (from ~9% to ~13.5%), while at k=3 it is approximately 3.5 percentage points (from ~27% to ~30.5%).
3. **Strong Late Improvement by Base**: The Base method (blue line) exhibits the most significant relative gain, improving by approximately 20 percentage points from k=1 to k=3. Its slope between k=2 and k=3 is notably steeper than the other methods.
4. **RL and MT Plateauing**: The RL (red) and MT (purple) methods show a slightly less pronounced increase in performance between k=2 and k=3 compared to their initial jump from k=1 to k=2.
### Interpretation
This chart likely evaluates different training or prompting strategies for a generative model (e.g., for code synthesis or problem-solving), where **Pass@k** measures the probability that at least one of *k* independently generated samples is correct.
- **What the data suggests**: Supervised Fine-Tuning (SFT) provides the most robust performance boost across all levels of sampling (k). The base model, while starting with the lowest single-attempt success rate (Pass@1), benefits disproportionately from being allowed more attempts (higher k), nearly catching up to the reinforcement learning (RL) method at k=3.
- **Relationship between elements**: The upward trend for all lines confirms the fundamental principle that generating more samples increases the chance of success. The differing slopes suggest that the methods affect not just the base performance (Pass@1) but also the rate of improvement with additional samples.
- **Notable anomaly/insight**: The crossover where the Base method surpasses RL and MT at k=3 is the most significant finding. It implies that while fine-tuning (SFT, RL, MT) improves the model's "first-shot" capability, the underlying base model may have a higher capacity for generating diverse solutions that, given enough attempts, can match or exceed the solutions produced by some fine-tuned variants. This could indicate that certain fine-tuning methods might reduce output diversity in favor of higher initial accuracy.
</details>
(c) #SFT $=1.1\times 2^{27}$.
<details>
<summary>figs/sec4_generalization_plots/m-5000.png Details</summary>

### Visual Description
## Line Chart: Pass@k Performance Comparison
### Overview
This image displays a line chart comparing the performance of four different models or methods (RL, SFT, MT, Base) on a metric called "Pass@k" across three discrete values of k (1, 2, and 3). The chart shows a clear upward trend for all methods as k increases.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** `k`
* **Scale:** Discrete, linear scale with major ticks at `1`, `2`, and `3`.
* **Y-Axis:**
* **Label:** `Pass@k (%)`
* **Scale:** Linear scale ranging from 5 to 35, with major ticks every 5 units (5, 10, 15, 20, 25, 30, 35).
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Entries (from top to bottom):**
1. **RL:** Red line with circular markers.
2. **SFT:** Orange line with circular markers.
3. **MT:** Purple line with circular markers.
4. **Base:** Blue line with circular markers (note: at k=1, the marker is an 'x' instead of a circle).
* **Grid:** Light gray dashed grid lines are present for both axes.
### Detailed Analysis
**Trend Verification:** All four data series exhibit a positive, upward-sloping trend from k=1 to k=3. The slope appears steepest between k=1 and k=2 for all series.
**Data Points (Approximate Values):**
| Method (Color) | k=1 | k=2 | k=3 |
| :--- | :--- | :--- | :--- |
| **RL (Red)** | ~17% | ~27.5% | ~31% |
| **SFT (Orange)** | ~16.5% | ~26.5% | ~34.5% |
| **MT (Purple)** | ~15.5% | ~25.5% | ~29.5% |
| **Base (Blue)** | ~18.5% | ~26% | ~30.5% |
**Spatial Grounding & Component Isolation:**
* **Header/Title:** No chart title is present.
* **Main Chart Area:** Contains the four plotted lines and the grid.
* **Footer/Axes:** The x-axis label "k" is centered below the axis. The y-axis label "Pass@k (%)" is rotated 90 degrees and placed to the left of the axis.
* **Legend:** Located in the bottom-right quadrant, overlapping slightly with the grid lines but not obscuring data points.
### Key Observations
1. **Performance Hierarchy at k=1:** The `Base` model (blue) starts with the highest Pass@1 score (~18.5%), followed by `RL` (~17%), `SFT` (~16.5%), and `MT` (~15.5%).
2. **Performance Hierarchy at k=3:** The order changes significantly. `SFT` (orange) achieves the highest Pass@3 score (~34.5%), followed by `RL` (~31%), `Base` (~30.5%), and `MT` (~29.5%).
3. **Rate of Improvement:** The `SFT` method shows the most dramatic improvement, increasing by approximately 18 percentage points from k=1 to k=3. Between k=2 and k=3 its slope is notably steeper than that of the other three methods.
4. **Crossover Point:** Between k=1 and k=2, the `RL` (red) line crosses above the `Base` (blue) line. The `SFT` (orange) line also crosses above the `Base` line in this interval.
5. **Marker Anomaly:** The `Base` series uses a distinct 'x' marker at k=1, while all other data points across all series use circular markers.
### Interpretation
This chart likely evaluates the effectiveness of different training or decoding strategies (Reinforcement Learning (RL), Supervised Fine-Tuning (SFT), and perhaps Multi-Task training (MT)) against a baseline model (Base) on a code generation or problem-solving task, where "Pass@k" measures the probability that at least one of k generated samples is correct.
The data suggests that while the `Base` model is the strongest for single-attempt generation (k=1), the specialized training methods (`SFT` and `RL`) scale better with increased sampling (higher k). `SFT` demonstrates the most significant benefit from additional attempts, ultimately outperforming all other methods at k=3. This implies that the `SFT` method produces a more diverse set of high-quality candidate solutions, increasing the likelihood of finding a correct one when given multiple chances. The `MT` method, while improving, consistently underperforms the other approaches across all k values shown. The crossover between `RL`/`SFT` and `Base` highlights a key trade-off: the baseline may be better for efficiency (single try), but the fine-tuned methods are superior when computational resources allow for multiple sampling attempts.
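The Pass@k values plotted here are presumably computed with the standard unbiased estimator: draw k samples without replacement from n generations of which c are correct, and report the probability that at least one passes. A minimal sketch (the function name and signature are illustrative, not taken from the paper's code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of those samples that pass the tests
    k: budget of attempts being scored

    Returns 1 - C(n-c, k) / C(n, k), the probability that a
    random size-k subset of the n samples contains a correct one.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every
        # size-k subset must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=3 generations of which c=1 is correct, `pass_at_k(3, 1, 1)` is 1/3, while `pass_at_k(3, 1, 3)` is 1.0; averaging this quantity over all benchmark problems yields each plotted point.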
</details>
(d) #SFT $=1.5\times 2^{28}$.
Figure 12: Generalization analysis on SWE-bench Multilingual.
## Appendix H Use of Large Language Models
The initial draft of this paper was written entirely by the authors. A large language model (gpt-5) was used only to aid with polishing the language (e.g., grammar and clarity). All conceptual contributions, experimental designs, analyses, and conclusions are the work of the authors.