# Immersion in the GitHub Universe: Scaling Coding Agents to Mastery
**Authors**: Jiale Zhao, Guoxin Chen, Fanzhe Meng, Minghao Li, Jie Chen, Hui Xu, Yongshuai Sun, Wayne Xin Zhao, Ruihua Song, Yuan Zhang, Peng Wang, Cheng Chen, Ji-Rong Wen, Kai Jia
{marshmallowzjl, gx.chen.chn, mengfanzhe16, batmanfly}@gmail.com, songruihua_bloon@outlook.com, jiakai@bytedance.com
## Abstract
Achieving mastery in real-world software engineering tasks is fundamentally bottlenecked by the scarcity of large-scale, high-quality training data. Scaling such data has been limited by the complexity of environment setup, unit-test generation, and problem statement curation. In this paper, we propose Scale-SWE, an automated, sandboxed multi-agent workflow designed to construct high-quality SWE data at scale. The system coordinates three specialized agents, responsible for environment setup, test creation, and problem description synthesis, to process 6 million pull requests across 5.2k repositories, producing Scale-SWE-Data: 100k verified SWE instances, the largest such dataset to date. It substantially surpasses existing real-world datasets in repository diversity and reflects realistic task complexity. We further demonstrate the dataset's utility for training by distilling 71,498 high-quality trajectories and fine-tuning Qwen3-30B-A3B-Instruct to produce Scale-SWE-Agent. Our agent achieves a 64% resolve rate on SWE-Bench-Verified, a nearly three-fold improvement over the base model. Scale-SWE provides a scalable, reproducible approach for data construction to advance LLM-based software engineering.
## 1 Introduction
Figure 1: Resolved rate vs. activated model size on SWE-bench Verified. The vertical axis denotes the percentage of resolved issues on the SWE-bench Verified benchmark. The horizontal axis represents the number of activated parameters in billions (B).
Recently, LLM-based code agents have garnered significant attention for their demonstrated potential in tackling complex software engineering (SWE) tasks [Anthropic, 2025a, Google, 2025, OpenAI, 2025], as reflected in benchmarks like SWE-bench [Jimenez et al., 2023] and its successors [Zhang et al., 2025]. Yet the advancement of these agents is fundamentally constrained by the scarcity of high-quality training data. Unlike conventional code generation, SWE tasks necessitate operation within executable environments, requiring agents to navigate existing codebases, manage dependencies, and satisfy test suites. These inherent complexities render the systematic curation and validation of appropriate data a significant challenge.
Current methodologies for constructing SWE-style datasets predominantly rely on labor-intensive manual curation [Pan et al., 2024] or on simplistic LLM-based [Badertdinov et al., 2025] and rule-based synthesis [Guo et al., 2025b]. Consequently, existing datasets are often limited in scale, diversity, and difficulty, or lack the executable environments and comprehensive test suites. This limitation persists despite the wealth of available real-world software artifacts, including code repositories, issue trackers, and commit histories, which remain largely untapped for building a scalable, realistic dataset. The absence of systematic, automated mining techniques has thus produced a clear disconnect between these raw software resources and the creation of robust, large-scale training data. This compelling need drives our work to develop an automated and reproducible approach for dataset construction.
However, the automatic construction of SWE datasets presents unique and significant challenges. First, environment configuration becomes a major hurdle as repository diversity increases, leading to highly heterogeneous and often fragile build processes [Froger et al., 2025], which can be challenging even for experienced developers to set up correctly. Second, real-world repositories frequently lack sufficient, well-defined unit tests. While incorporating these repositories is essential to reduce data bias and achieve scale, generating comprehensive unit tests itself is a complex problem that demands interactive agentic execution and self-correction. Finally, a substantial portion of real-world pull requests either lack informative descriptions or are inherently unsuitable for SWE tasks. Therefore, generating high-quality, self-contained problem descriptions is challenging and requires agents to infer task intent by acquiring deep repository context through iterative sandbox exploration.
To overcome these challenges, in this paper, we introduce Scale-SWE, an automated, sandboxed multi-agent workflow designed for scalable, high-quality software engineering dataset construction. Our system coordinates three specialized agents: an environment builder agent that sets up isolated Docker environments, a unit-test creator agent that generates robust Pass-to-Pass (P2P) and Fail-to-Pass (F2P) test cases, and a problem statement agent that crafts self-contained task descriptions grounded in pull request content. By processing 6 million pull requests across 5,200 repositories, this workflow produces 100,000 verified instances, yielding the largest SWE dataset to date, which we call Scale-SWE-Data. This dataset surpasses prior real-world datasets in repository diversity and reflects realistic software engineering complexity, both in the number of file modifications required and the robustness of its unit tests.
To further demonstrate the utility of Scale-SWE-Data for model training, we distill 71,498 high-quality trajectories from a subset of 25,000 instances using DeepSeek-V3.2. Fine-tuning Qwen3-30B-A3B-Instruct on this distilled data yields our Scale-SWE-Agent, which achieves a substantial performance boost on SWE-Bench-Verified, increasing the resolve rate from 22% to 64%. This result underscores the effectiveness of our dataset for training and enhancing LLM-based code agents.
To summarize, our contributions are as follows:
- We introduce Scale-SWE, an automated, sandboxed multi-agent workflow for scalable, high-quality software engineering dataset construction. It systematically coordinates three specialized agents for environment setup, unit-test generation, and problem description synthesis.
- We construct Scale-SWE-Data, the largest verified SWE dataset to date, comprising 100,000 real-world instances. It surpasses prior benchmarks in repository diversity and task complexity, supporting both evaluation and training for LLM-based code agents.
- We distill 71,498 high-quality trajectories from Scale-SWE-Data and fine-tune Qwen3-30B-A3B-Instruct to create Scale-SWE-Agent. The agent substantially boosts performance on SWE-Bench-Verified, achieving a 64% resolve rate, a nearly three-fold improvement over the original backbone.
## 2 Scale-SWE: Software Task Scaling
Figure 2: The sandboxed multi-agent system for Scale-SWE dataset construction. Starting from millions of raw GitHub pull requests, the pipeline employs a series of autonomous agents to transform high-quality PRs into executable software engineering tasks. The framework automates environment setup, unit test generation (Fail-to-Pass/Pass-to-Pass), and formal problem statement synthesis, ensuring the scalability and reproducibility of the distilled trajectories.
The core philosophy of Scale-SWE is to use a sandboxed multi-agent system to autonomously explore codebases and carry out SWE data construction. Each SWE data instance encapsulates the necessary components: a Docker image, a problem statement, and the validation unit tests (consisting of F2P and P2P unit tests). Our sandboxed multi-agent system offers significantly greater flexibility than prior rule-based construction methods. By shifting the construction burden to the agent, Scale-SWE enables the scaling of interactive environments while minimizing the heuristic bias inherent in rigid filtering pipelines. In what follows, we first introduce our sandboxed multi-agent system (Section 2.1) and then detail the data processing pipeline (Section 2.2).
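To make the instance format above concrete, the following minimal sketch models the listed components as a Python dataclass; the field names are illustrative assumptions and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SWEInstance:
    """Illustrative container for one Scale-SWE task instance (field names are hypothetical)."""
    instance_id: str          # e.g. "<repo>__<pr-number>"
    repo: str                 # GitHub repository the pull request belongs to
    base_commit: str          # parent commit the solver starts from
    docker_image: str         # reproducible environment built by the EBA
    problem_statement: str    # self-contained task description written by the PSWA
    golden_patch: str         # reference diff from the merged pull request
    fail_to_pass: list[str] = field(default_factory=list)  # F2P test identifiers
    pass_to_pass: list[str] = field(default_factory=list)  # P2P test identifiers

# A resolution attempt is judged by re-running both test groups after applying a candidate patch.
```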
### 2.1 Sandboxed Multi-agent System
To scale the automated construction of software engineering tasks, we develop a sandboxed multi-agent system. This framework leverages collaborative agents to produce a large volume of SWE data with fully integrated, executable environments.
#### 2.1.1 The Overall Workflow
As illustrated in Figure 2, our sandboxed multi-agent system automates the construction of software engineering tasks through the coordinated execution of three specialized agents: the Environment Builder Agent (EBA), the Unit-test Creator Agent (UCA), and the Problem Statement Writer Agent (PSWA). Within this framework, the EBA generates reproducible, Docker-based execution environments from target repositories, providing the isolated and consistent runtime needed for scalable task validation. The UCA synthesizes comprehensive, executable test suites, including both Fail-to-Pass (F2P) and Pass-to-Pass (P2P) test cases, from pull requests and repository context, ensuring robust evaluation criteria. Finally, the PSWA produces high-quality, self-contained problem descriptions grounded in the executable test suites, guaranteeing semantic alignment between the problem statement and the validation requirements. The system is built upon SWE-agent [Yang et al., 2024a] and powered by the DeepSeek language model family [Liu et al., 2025] and Gemini3-Pro [Google, 2025]: task instances are generated with either DeepSeek v3.1 or DeepSeek v3.2, with the exception of the PSWA, which uses Gemini3-Pro.
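As a rough illustration of how the three agents compose, the sketch below chains hypothetical `eba`, `uca`, and `pswa` callables in the order described above; the function names and signatures are assumptions made purely for exposition.

```python
def build_instance(pr_meta: dict, repo: str, base_image: str, eba, uca, pswa) -> dict:
    """Hypothetical composition of the three Scale-SWE agents for one pull request.

    eba, uca, and pswa stand in for the Environment Builder, Unit-test Creator,
    and Problem Statement Writer agents, respectively.
    """
    docker_image = eba(repo, base_image)                   # reproducible execution environment
    tests = uca(pr_meta, repo, docker_image)               # F2P and P2P test suite
    statement = pswa(pr_meta, tests, repo, docker_image)   # self-contained task description
    return {"docker_image": docker_image, "tests": tests, "problem_statement": statement}
```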
#### 2.1.2 Environment builder agent
The EBA is designed to automatically generate a reproducible, Docker-based execution environment from a target source code repository. By providing an isolated and executable sandbox, the EBA enables the scalable construction and validation of test samples for automated software engineering workflows.
Role Formulation. We define the core function of the EBA as the transformation of an initial, generic Docker environment into a specialized runtime container tailored to a specific repository. This process can be expressed as:
$$
D_{\text{final}} = \mathrm{EBA}(R, D_{\text{init}}), \tag{1}
$$
where $R$ denotes the input repository that the agent analyzes to infer dependencies and configuration logic, $D_{\text{init}}$ is the base Docker image, and $D_{\text{final}}$ is the resulting functional, ready-to-use container image.
Construction Process. In implementation, the EBA is initialized within a base Docker container containing the cloned repository. It begins by autonomously exploring the repository's structure and analyzing configuration files, such as setup.py, pyproject.toml, and README.md, to infer project dependencies. Since Python projects lack a universal setup protocol, a standardized installation approach is insufficient. The agent must therefore interpret project-specific documentation and interactively resolve dependency conflicts by parsing terminal feedback. This feedback-driven, autonomous process enables flexible environment configuration, circumventing the limitations of static, rule-based methods (e.g., predefined pip install commands). Finally, we extract all executed commands from the agent's trajectory and use an LLM to synthesize them into a reproducible Dockerfile.
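The following is a minimal sketch of this repeat-fix-and-check loop, assuming a simple `run` helper for in-sandbox shell execution and an LLM-backed `propose_fix` step; both are illustrative stand-ins rather than the released implementation.

```python
import subprocess

def run(cmd: str, cwd: str = "/repo") -> tuple[int, str]:
    """Execute a shell command inside the sandbox and capture its output (illustrative helper)."""
    proc = subprocess.run(cmd, shell=True, cwd=cwd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def propose_fix(error_log: str) -> str:
    """Placeholder for the agent's LLM-driven diagnosis of a failing install/test log."""
    raise NotImplementedError

def build_environment(install_cmds: list[str], test_cmd: str = "python -m pytest -x",
                      max_rounds: int = 10) -> list[str]:
    """Sketch of the EBA loop: install, run tests, patch the environment based on terminal
    feedback, and keep every executed command as the trajectory used to write a Dockerfile."""
    trajectory = []
    for cmd in install_cmds:          # initial installs inferred from setup.py / pyproject.toml
        trajectory.append(cmd)
        run(cmd)
    for _ in range(max_rounds):
        code, log = run(test_cmd)
        if code == 0:                 # all unit tests pass: the environment is ready
            return trajectory
        fix_cmd = propose_fix(log)    # e.g. 'pip install "SQLAlchemy<2.0"'
        trajectory.append(fix_cmd)
        run(fix_cmd)
    raise RuntimeError("environment could not be stabilized within the budget")
```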
Efficiently Scaling Environmental Diversity. While constructing a dedicated environment per pull request (PR) is straightforward, it is computationally prohibitive for large-scale evaluation. To scale across a diverse set of repositories, we sample at most ten PRs per repository for full environment construction via the EBA. This sampling strategy enables broader repository coverage while controlling resource costs. Note that ten PRs do not yield only ten task samples; multiple PRs can share the same environment if they originate from a similar runtime state, resulting in significantly more test samples than uniquely built environments. Specifically, for the remaining PRs, we execute tests in the "nearest" available Docker environment, determined by proximity in PR ID (as a proxy for the repository timeline). If a PR fails the unit tests within its nearest available environment, it is discarded from the final dataset. Otherwise, the PR and its associated pre-established environment are retained as a valid test instance. On average, each repository contributes 19 test instances (see Table 2), dramatically improving environment reuse and overall dataset diversity.
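A small sketch of this nearest-environment assignment, under the assumption that pre-built environments are keyed by the PR IDs they were constructed from:

```python
def nearest_environment(pr_id: int, built_envs: dict[int, str]) -> str:
    """Pick the pre-built Docker image whose anchor PR is closest in PR ID to the target PR.

    built_envs maps the PR IDs used for full environment construction (at most ten per
    repository) to their Docker image tags; PR ID distance serves as a proxy for
    repository-timeline proximity.
    """
    anchor = min(built_envs, key=lambda built_pr: abs(built_pr - pr_id))
    return built_envs[anchor]

# Example: environments were built for PRs 120, 480, and 975 of a repository.
envs = {120: "repo:env-120", 480: "repo:env-480", 975: "repo:env-975"}
assert nearest_environment(510, envs) == "repo:env-480"
# If the PR's unit tests fail inside this nearest environment, the PR is discarded.
```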
#### 2.1.3 Unit-test creator agent
The unit-test creator agent (UCA) is designed to automatically generate comprehensive and executable test suites from target source code repositories and their associated pull requests. By synthesizing semantic PR information with dynamic repository context, the UCA enables the scalable construction of validated Fail-to-Pass (F2P) and Pass-to-Pass (P2P) test cases, which are essential for robust software engineering task evaluation.
Role Formulation. The core function of the UCA is to transform the provided pull request metadata and its associated code context into a comprehensive suite of executable unit tests. This process can be expressed as:
$$
U = \mathrm{UCA}(M, R, D_{\text{final}}), \tag{2}
$$
where $M$ denotes the input pull request metadata (including the title, description, and diff patches) that provides the semantic specification for test generation, $R$ is the associated source repository that the agent analyzes to understand code structure and logic, $D_{\text{final}}$ is the functional Docker environment built by the EBA, and $U$ is the resulting set of executable test cases. In this work, we adopt the SWE-bench protocol for unit tests, categorizing them into two types: Fail-to-Pass (F2P) tests, which initially fail on the original codebase and must pass after the bug-fixing patch to verify correctness; and Pass-to-Pass (P2P) tests, which are pre-existing passing tests that must continue to pass to ensure no regression is introduced.
Construction Process. A large proportion of GitHub repositories lack comprehensive unit tests, making the automated construction of both Fail-to-Pass (F2P) and Pass-to-Pass (P2P) test cases essential for creating valid Software Engineering (SWE) task instances. The UCA begins by analyzing structured metadata from a target PR, including its title, description, and diff patches. This input provides the necessary semantic grounding for the agent to comprehend the intent and scope of the proposed code changes. Building on this foundation, the agent performs an autonomous traversal of the associated repository to map its directory structure, identify key modules, and infer the underlying program logic and dependencies. However, generating effective unit tests presents a complex, stateful challenge that requires not only static code understanding but also dynamic behavioral validation. The agent must reason about cross-file interactions, data flow, exception handling, and edge cases, tasks that are difficult to accomplish through static analysis alone. To address this, the UCA is deployed within a secure, sandboxed execution environment (specifically, the Docker container $D_{\text{final}}$ produced by the EBA). This sandbox grants the agent direct, real-time code execution privileges, enabling an interactive execute-analyze-refine loop. The agent can thus dynamically run its proposed tests, observe their outcomes, and iteratively revise the test logic, assertions, and fixtures. This closed-loop, feedback-driven methodology allows the UCA to produce robust, executable test suites.
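The sketch below captures this execute-analyze-refine loop in schematic form; `write_tests` and `run_tests` are hypothetical stand-ins for the agent's LLM-driven test authoring and the in-sandbox test execution, and the acceptance rule mirrors the F2P/P2P semantics defined above.

```python
def create_tests(pr_meta: dict, write_tests, run_tests, max_rounds: int = 5):
    """Sketch of the UCA's execute-analyze-refine loop (illustrative, not the released code).

    A candidate suite is accepted only if its F2P tests fail on the original codebase and
    every test passes once the golden patch is applied.
    """
    feedback = None
    for _ in range(max_rounds):
        suite = write_tests(pr_meta, feedback)        # draft or revise F2P/P2P tests
        pre = run_tests(suite, patched=False)         # outcomes on the original codebase
        post = run_tests(suite, patched=True)         # outcomes with the golden patch applied
        f2p_ok = all(not pre[t] and post[t] for t in suite["fail_to_pass"])
        p2p_ok = all(pre[t] and post[t] for t in suite["pass_to_pass"])
        if f2p_ok and p2p_ok:
            return suite
        feedback = {"pre": pre, "post": post}         # feed observed outcomes back to the agent
    return None                                        # give up: the PR is dropped
```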
#### 2.1.4 Problem Statement Writer Agent
The problem statement writer agent (PSWA) is designed to automatically synthesize high-quality, self-contained task descriptions from raw pull requests and their associated test suites. By grounding the narrative in executable unit tests, the PSWA ensures semantic alignment between the problem specification and unit tests, thereby producing well-posed and tractable software engineering tasks.
Role Formulation. The core function of the PSWA is to generate a formal problem statement, free of solution leakage, from the pull request metadata and its corresponding test suite. This process can be expressed as:
$$
S = \mathrm{PSWA}(M, U, R, D_{\text{final}}), \tag{3}
$$
where $M$ denotes the input PR metadata (including its title, description, and diff patches) that provides initial contextual grounding; $U$ is the executable unit-test suite generated by the UCA, which ensures the problem statement aligns with the actual validation requirements; $R$ is the associated source repository; $D_{\text{final}}$ is the functional Docker environment built by the EBA; and $S$ is the resulting formal problem description, which articulates the issue without exposing implementation details while remaining consistent with the unit tests.
Construction Process. Relying solely on raw PR descriptions as problem statements is fundamentally flawed due to their retrospective nature: they are often written after the fix is implemented and may leak solution details or reference internal artifacts. Moreover, a significant portion of high-quality PRs lack descriptions or are disconnected from original issue reports. To overcome these limitations, the PSWA is initialized with both the PR metadata and the executable unit tests produced by the UCA. Integrating the test suite into the prompt is critical because F2P tests may invoke functions or classes that do not exist in the original codebase; the generated problem statement must explicitly articulate these requirements to make the task tractable. Without this alignment, descriptions derived only from PR metadata often diverge from the actual failures captured by the tests. The agent thus synthesizes a coherent, self-contained narrative that specifies the expected behavior, any new interfaces required by the tests, and the context necessary for an external solver, all while deliberately omitting hints about the implementation. This process ensures that each synthesized task instance is both semantically precise and evaluation-ready, scaling the creation of SWE-bench-style problems from raw repository data.
For PSWA, we employ Gemini3-Pro, as our experiments indicate that it generates more consistent and rigorous problem statements while significantly minimizing information leakage.
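For illustration only, the sketch below assembles the kind of input the PSWA consumes (PR metadata, the F2P tests, and repository context); the actual prompt wording is not specified in the paper, so everything here is an assumption.

```python
def build_pswa_prompt(pr_meta: dict, f2p_tests: list[str], repo_notes: str) -> str:
    """Sketch of the input handed to the problem statement writer; mirrors only the
    information sources named in Eq. (3), not the released prompt."""
    return "\n\n".join([
        "You are writing a self-contained issue report for a software repository.",
        f"Pull request title: {pr_meta['title']}",
        f"Pull request description: {pr_meta.get('description', '(none)')}",
        "Fail-to-Pass tests the fix must satisfy:\n" + "\n".join(f2p_tests),
        "Relevant repository context:\n" + repo_notes,
        "Describe the expected behavior and any new interfaces the tests require, "
        "without revealing how the fix is implemented.",
    ])
```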
Table 1: Detailed statistics of the Scale-SWE dataset. We report the mean and percentiles (P50, P75, P95) for code modification metrics and test case distributions.
| Metric | Mean | P50 | P75 | P95 |
| --- | --- | --- | --- | --- |
| Modified Files | 6.4 | 3.0 | 6.0 | 18.0 |
| Deleted Lines | 54.9 | 1.0 | 10.0 | 119.0 |
| Added Lines | 220.8 | 43.0 | 120.0 | 595.0 |
| Edited Lines | 37.0 | 6.0 | 20.0 | 108.0 |
| Total Changes | 312.7 | 63.0 | 167.0 | 867.0 |
| Fail-to-Pass | 5.7 | 2.0 | 5.0 | 15.0 |
| Pass-to-Pass | 209.0 | 68.0 | 178.0 | 793.0 |
| Total tests | 214.7 | 72.0 | 185.0 | 801.7 |
### 2.2 Data Processing
Ensuring high data quality is the foremost priority in building a reliable dataset. This section details our curation methodology to achieve this goal.
Repository Selection. To assemble a comprehensive and high-quality corpus, we sourced repositories from two primary channels, illustrated in Figure 5. First, we identified the repositories corresponding to the top 15,000 most-downloaded packages on the Python Package Index (PyPI), using data from Top PyPI Packages. Second, to capture notable projects beyond PyPI, we queried the SEART search engine with the following criteria: primary language as Python, a minimum of 5 contributors, at least 500 stars, and a creation date between January 1, 2015, and October 29, 2025. This query returned 9,062 repositories. The initial union of these two sources yielded approximately 23,000 candidate repositories. We then applied a multi-stage filtering pipeline. First, to prevent data leakage from the evaluation benchmark, we excluded all PRs originating from repositories listed in SWE-Bench Verified. Second, we filtered to include only repositories with permissive open-source licenses. Finally, recognizing that a subset of repositories might be GPU-dependent, tutorial-based, or contain minimal source code, we performed a content-based filter. We extracted the README files from all remaining candidates and employed an LLM-as-a-judge approach to automatically exclude repositories deemed unsuitable for training.
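A simplified sketch of the per-repository filter described above, assuming a metadata record per repository and a hypothetical LLM-as-a-judge callable; the field names and license list are illustrative.

```python
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}

def keep_repository(repo: dict, swe_bench_repos: set[str], judge) -> bool:
    """Sketch of the multi-stage repository filter (illustrative field names).

    judge is a hypothetical LLM-as-a-judge callable that reads a README and returns True
    if the project is a suitable training source (not GPU-bound, not a tutorial, and
    containing real source code).
    """
    if repo["full_name"] in swe_bench_repos:        # prevent leakage into SWE-Bench Verified
        return False
    if repo.get("license") not in PERMISSIVE_LICENSES:  # keep permissively licensed projects only
        return False
    return judge(repo["readme"])                    # content-based suitability check
```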
Pull Request Filtering. We next extracted all pull requests merged into the "main" or "master" branches of the filtered repositories, resulting in a preliminary set of 6 million entries. To ensure quality, we utilized an LLM-as-a-judge to filter low-quality instances based on the available metadata: the git diff, the pull request description, and the merge commit message.
Cheating Prevention. During the evaluation phase, it is imperative to prevent the model from exploiting git commands (e.g., git log --all) to access ground truth solutions [Xiao et al., 2026]. To address this, we execute a sanitization script immediately after initializing the task environment. This process systematically removes metadata that follows the task's parent commit. It performs a hard reset, deletes all remote references and tags, and purges internal git files (e.g., logs/, packed-refs, and various HEAD files), thereby eliminating any trace of future solution history.
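A minimal sketch of such a sanitization step, assuming the repository has already been cloned into `repo_dir`; the released script may differ, and this only mirrors the operations listed above (hard reset, removal of remotes and tags, purging of internal git files).

```python
import shutil
import subprocess
from pathlib import Path

def sanitize_repo(repo_dir: str, parent_commit: str) -> None:
    """Illustrative sanitization: strip everything postdating the task's parent commit so the
    agent cannot recover the ground-truth fix via git commands."""
    def git(*args: str) -> str:
        out = subprocess.run(["git", *args], cwd=repo_dir, check=True,
                             capture_output=True, text=True)
        return out.stdout

    git("checkout", parent_commit)               # move HEAD to the task's starting point
    git("reset", "--hard", parent_commit)        # hard reset of the working tree
    for remote in git("remote").split():
        git("remote", "remove", remote)          # drop all remote references
    for tag in git("tag", "-l").split():
        git("tag", "-d", tag)                    # drop all tags
    git_dir = Path(repo_dir) / ".git"
    shutil.rmtree(git_dir / "logs", ignore_errors=True)          # purge reflogs
    for name in ("packed-refs", "ORIG_HEAD", "FETCH_HEAD"):
        (git_dir / name).unlink(missing_ok=True)                 # purge leftover ref files
```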
Figure 3: Distribution of bug categories across different datasets. The bar chart compares the percentage of ten bug types within SWE-bench Verified, SWE-Gym, SWE-smith, and Scale-SWE. The categories are defined as: API Mismatch (incompatible signatures or parameter errors); Logic Error (flawed conditionals or control flow); Input/Boundary (edge case mishandling or validation failures); Constructor (object initialization errors); Import Error (missing modules or undefined symbols); State Sync (inconsistent internal state); Mutability (unintended side effects); Spec Violation (non-compliance with protocols); I/O Resource (file system or stream errors); and Security (improper scoping or access control).
### 2.3 Dataset Quality Assurance
Rule-based Check. We implemented a rigorous filtering protocol based on test outcomes. We retained only those instances where: (1) all P2P tests pass and F2P tests fail on the original buggy codebase; and (2) all P2P and F2P tests pass upon application of the golden patch.
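A compact sketch of this rule, assuming per-test pass/fail outcomes have been collected before and after applying the golden patch:

```python
def instance_is_valid(pre_patch: dict[str, bool], post_patch: dict[str, bool],
                      f2p: list[str], p2p: list[str]) -> bool:
    """Rule-based check sketch: pre_patch/post_patch map test identifiers to pass (True) /
    fail (False) outcomes on the original codebase and after the golden patch, respectively."""
    on_buggy_code = all(not pre_patch[t] for t in f2p) and all(pre_patch[t] for t in p2p)
    on_fixed_code = all(post_patch[t] for t in f2p + p2p)
    return on_buggy_code and on_fixed_code
```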
Expert Check. We engaged four senior Ph.D. students to manually audit 100 randomly sampled instances employing a cross-validation protocol. The audit confirmed that 94 instances were valid, featuring correct environments, unit tests, and problem statements. We attribute this high quality to three key factors: (1) We leveraged state-of-the-art models (e.g., Gemini 3 Pro) to synthesize accurate problem statements; (2) We employed an LLM-as-a-Judge approach to preemptively discard low-quality repositories and pull requests; and (3) We enforced a fixed execution order for P2P and F2P tests to prevent test pollution, where varying execution orders could otherwise lead to inconsistent results due to environment contamination.
### 2.4 Dataset Statistics and Trajectory Distillation
Ultimately, our agent-based construction pipeline yielded 100k successfully verified instances, the largest verified SWE benchmark to date. Leveraging our extensive collection of GitHub repositories, the resulting dataset achieves substantially greater repository diversity than existing SWE benchmarks, as detailed in Table 2. Unlike previous datasets that typically rely on synthetic generation or limited real-world sources, Scale-SWE is constructed entirely from 5.2k real repositories. This represents a 50% increase in repository count over the next largest real-world benchmark (SWE-rebench with 3.5k repositories). This scale and diversity ensure that our benchmark better reflects the complexity and variety of real-world software engineering tasks.
The statistical characteristics of Scale-SWE are presented in Table 1. The dataset presents non-trivial software engineering challenges. The median instance requires modifications to at least 3 files and the addition of 43 lines of code, reflecting the task complexity. For evaluation robustness, instances contain an average of over 200 Pass-to-Pass (P2P) tests (median: 68), providing strong protection against regression, while an average of 5.69 Fail-to-Pass (F2P) tests per instance validates whether the LLM successfully resolves the specific issue.
To construct the training corpus, we distilled agent trajectories from a subset of 25k instances using the high-performance expert model DeepSeek-V3.2. For each instance, we conducted five independent sampling trials with a temperature of 0.95 and a maximum budget of 100 interaction turns. A trajectory was considered valid only if it culminated in a submission that passed all unit tests. This pipeline produced 71k high-quality trajectories totaling approximately 3.5 billion tokens. To ensure a fair comparison, we applied the identical pipeline to collect trajectories for SWE-Smith and SWE-Gym.
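A schematic of this distillation loop, with `rollout` and `passes_all_tests` as placeholders for the expert-agent harness and the instance's test runner; the sampling parameters below follow the values stated in the text.

```python
def distill_trajectories(instances, rollout, passes_all_tests,
                         trials: int = 5, temperature: float = 0.95, max_turns: int = 100):
    """Sketch of trajectory distillation: rollout runs the expert agent on one instance and
    returns its interaction trajectory plus the final patch; passes_all_tests re-runs the
    instance's F2P/P2P suite on that patch. Both callables are illustrative placeholders."""
    kept = []
    for inst in instances:
        for _ in range(trials):                     # five independent samples per instance
            traj, patch = rollout(inst, temperature=temperature, max_turns=max_turns)
            if passes_all_tests(inst, patch):       # keep only fully verified trajectories
                kept.append(traj)
    return kept
```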
Figure 3 highlights the superior diversity of Scale-SWE. Synthetic datasets like SWE-smith exhibit a strong bias towards Logic Errors, failing to capture the intricacies of API Mismatches or State Synchronization issues common in large codebases. Similarly, SWE-Gym, despite being real-world sourced, suffers from low variety due to its restriction to only 11 repositories.
Conversely, Scale-SWE achieves a highly balanced distribution across all ten bug categories. By leveraging 5.2k real-world repositories and an automated environment-building pipeline, Scale-SWE effectively captures a wide array of defect typesâfrom Constructor errors to Security flaws. This validates that scaling the source repositories and automating the execution pipeline are critical for synthesizing training data that faithfully represents real-world software evolution.
We adopt the bug taxonomy proposed in BugPilot [Sonwane et al., 2025] and employ DeepSeek v3.2 to automatically annotate the training instances.
Table 2: Comparison of Scale-SWE with existing software engineering (SWE) benchmarks. We contrast datasets across key dimensions: the number of executable instancesâdefined as those equipped with a Dockerized environment and validated via Fail-to-Pass (F2P) and Pass-to-Pass (P2P) testsâprimary data source, repository diversity, and trajectory count.
| Dataset | Exec. instances | Primary Source | Repo. | Traj. |
| --- | --- | --- | --- | --- |
| R2E-Gym [Jain et al., 2025] | 4.6k | Synthetic | 10 | 3.3k |
| SWE-Gym [Pan et al., 2024] | 2.4k | Real | 11 | 491 |
| SWE-smith [Yang et al., 2025a] | 50k | Synthetic | 128 | 5k |
| SWE-Mirror [Wang et al., 2025a] | 60k | Synthetic | 40 | 12k |
| SWE-rebench [Badertdinov et al., 2025] | 7.5k | Real | 3.5k | N/A |
| Scale-SWE | 100k | Real | 5.2k | 71k |
Table 3: Performance comparison on SWE-bench Verified. We categorize models into proprietary systems, open-source methods, and size-matched baselines.
| Models | Base Model | SWE-bench (V) |
| --- | --- | --- |
| Proprietary Models | | |
| GPT-5.2 Thinking [OpenAI, 2025] | - | 80.0 |
| Claude Sonnet 4.5 [Anthropic, 2025b] | - | 77.2 |
| Gemini 3 Pro [Google, 2025] | - | 76.2 |
| MiniMax-M2.1 [MiniMax AI, 2025] | - | 74.0 |
| GLM-4.7 [Zhipu AI, 2025] | - | 73.8 |
| DeepSeek-V3.2 [Liu et al., 2025] | - | 73.1 |
| Kimi K2 Thinking [Team et al., 2025a] | - | 71.3 |
| Open Source Methods | | |
| SWE-Gym-32B [Pan et al., 2024] | Qwen-2.5 coder | 20.6 |
| SWE-Fixer-72B [Xie et al., 2025] | Qwen2.5-72B | 32.8 |
| R2E-Gym-32B [Jain et al., 2025] | Qwen-2.5-Coder | 34.4 |
| SWE-rebench-72B [Golubev et al., 2025] | Qwen2.5-72B-Instruct | 39.0 |
| SWE-smith-32B [Yang et al., 2025a] | Qwen2.5-32B | 40.2 |
| SWE-RL [Wei et al., 2025] | Llama3-70B | 41.0 |
| Skywork-SWE-32B [Zeng et al., 2025] | Qwen2.5-Coder-32B-Instruct | 47.9 |
| SWE-Mirror-LM-32B [Wang et al., 2025a] | Qwen2.5-Coder-32B-Instruct | 52.2 |
| SWE-Lego-32B [Tao et al., 2026] | Qwen3-32B | 52.6 |
| KAT-Dev-32B [Zhan et al., 2025] | - | 62.4 |
| Models of the same size | | |
| Qwen3-30B-A3B-Instruct [Team, 2025] | - | 22.0 |
| Qwen3-Coder-30B-A3B-Instruct [Team, 2025] | - | 51.6 |
| GLM-4.7-Flash-30A3B [Zhipu AI, 2026] | - | 59.2 |
| Our Model | | |
| Scale-SWE-Agent | Qwen3-30B-A3B-Instruct | 64.0 |
## 3 Experiments
### 3.1 Experiment Setup
Agent Scaffolding. We employed OpenHands [Wang et al., 2025b], an open-source, event-driven platform, as the unified agent framework for all experiments. OpenHands facilitates LLM agents to iteratively edit files, execute shell commands, and browse the web within sandboxed containers. We selected this framework due to its proven ability to establish robust and reproducible baselines on benchmarks such as SWE-Bench.
Agent Post-training. We perform post-training on the Qwen3-30B-A3B-Instruct [Team, 2025] base model. The training process is configured with a learning rate of 1e-5, a batch size of 128, and a warmup ratio of 0.05, supporting a maximum context length of 131,072. Also, we apply loss masking to restrict loss computation solely to assistant turns that result in well-formed actions [Team et al., 2025b, Chen et al., 2025].
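A minimal sketch of this loss-masking rule, assuming a per-message tokenizer and a hypothetical action parser; it only illustrates which token positions receive supervision, not the released training code.

```python
def is_well_formed_action(text: str) -> bool:
    """Placeholder: return True if the assistant turn parses into a valid tool call or edit."""
    raise NotImplementedError

def build_loss_mask(messages: list[dict], tokenize) -> list[int]:
    """Mark token positions used in the SFT loss: only assistant turns containing a
    well-formed action are supervised; user and tool messages are masked out."""
    mask = []
    for msg in messages:
        tokens = tokenize(msg)
        supervised = msg["role"] == "assistant" and is_well_formed_action(msg["content"])
        mask.extend([1 if supervised else 0] * len(tokens))
    return mask
```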
Evaluation Benchmarks and Metrics. Our evaluation is conducted on SWE-bench Verified [Chowdhury et al., 2024], a benchmark comprising 500 high-quality, human-curated Python software issues. We report the Resolved Rate (%), representing the proportion of instances for which the model generates a correct solution. Notably, although the models were trained with a sequence length of 131,072, we extended the context limit to 262,144 during inference to handle larger inputs.
### 3.2 Experiment Results
As shown in Table 3, the Scale-SWE Agent demonstrates superior performance on SWE-bench Verified. First, regarding the impact of our scaling strategy, Scale-SWE Agent achieves a remarkable 42.0% absolute improvement over its base model, Qwen3-30B-A3B-Instruct, boosting the resolved rate from 22.0% to 64.0%. Second, in comparison to models of the same size, our method significantly outperforms strong competitors, including Qwen3-Coder (51.6%) and GLM-4.7-Flash (59.2%). Furthermore, Scale-SWE Agent exhibits exceptional efficiency, surpassing models with significantly larger parameter counts, such as SWE-RL (Llama3-70B) and SWE-Fixer-72B. Notably, it also exceeds the previous state-of-the-art open-source method, KAT-Dev-32B (62.4%), and outperforms recent specialist models like SWE-Mirror and SWE-Lego by a margin of over 11%. These results validate the effectiveness of scaling up SWE-style data for enhancing software engineering capabilities.
Figure 4: Comparison of distillation data statistics across different datasets. We show the probability density functions for (top) total token count and (bottom) the number of tool-call turns.
To evaluate the efficacy of our data compared to existing alternatives, we conducted controlled experiments by performing distillation and SFT on SWE-Gym and SWE-smith using an identical pipeline. As shown in Table 4, Scale-SWE significantly outperforms both baselines. Notably, despite SWE-smith possessing a considerably larger volume of instances than SWE-Gym, it yields slightly inferior performance. This performance gap suggests a diminishing return on purely synthetic data and underscores a critical insight: high-fidelity, real-world data is inherently more effective than massive-scale synthetic alternatives. This finding reinforces the importance of our large-scale "real-data" construction approach.
The distributions of interaction turns and token counts are presented in Figure 4. As illustrated in these figures, Scale-SWE tasks necessitate a greater number of turns for repository exploration and iterative debugging. This observation underscores the high complexity and difficulty inherent in the Scale-SWE dataset.
Table 4: SFT performance comparison on SWE-bench Verified. All models are fine-tuned using the same distillation pipeline to ensure a fair comparison.
| Dataset Name | SWE-bench Verified |
| --- | --- |
| SWE-Gym | 54.8 |
| SWE-smith | 54.6 |
| Scale-SWE | 64.0 |
## 4 Related Work
SWE Benchmark. Since the introduction of the prevailing software engineering benchmark, SWE-bench [Jimenez et al., 2023] and SWE-bench-Verified [Chowdhury et al., 2024], many other benchmarks have emerged to assess multi-modal [Yang et al., 2024b], multi-language [Zan et al., 2025b, Rashid et al., 2025, Guo et al., 2025a], and long-horizon capabilities [Deng et al., 2025]. These new benchmarks also evaluate whole repository generation [Ding et al., 2025], scientific domain knowledge [Duston et al., 2025], and other specialized abilities [Ma et al., 2025, Shetty et al., 2025]. Collectively, these benchmarks constitute a comprehensive evaluation ecosystem, establishing rigorous standards that assess the multifaceted capabilities required for autonomous software engineering.
SWE Datasets. High-quality data is pivotal for enhancing the programming capabilities of Large Language Models (LLMs). Recently, there has been a surge in repository-level software engineering datasets aimed at addressing complex coding tasks. Efforts to scale up SWE task instances generally fall into two categories. One line of work, including R2E-Gym [Jain et al., 2025], SWE-smith [Yang et al., 2025a], and SWE-Mirror [Wang et al., 2025a], attempts to scale up training data through synthetic generation. Conversely, other works focus on mining real-world issues; for instance, SWE-Gym [Pan et al., 2024] constructed 2,400 executable instances restricted to 11 repositories, while SWE-rebench [Badertdinov et al., 2025] further expanded this collection to 7,500 executable instances.
SWE Models and Agents. Recent advancements have introduced powerful models specialized for SWE tasks, including SWE-RL [Wei et al., 2025], SWE-Swiss [He et al., 2025], Kimi-Dev [Yang et al., 2025b], and KAT-Coder [Zhan et al., 2025]. In parallel, frameworks such as SWE-agent [Yang et al., 2024a], Mini-SWE-Agent [Yang et al., 2024a], OpenHands [Wang et al., 2025b], and MOpenHands [Zan et al., 2025a] serve as effective scaffolds to streamline interactions with development environments.
## 5 Conclusion
In this work, we introduced Scale-SWE, a sandboxed multi-agent framework that automates the construction of large-scale, high-quality software engineering data along with executable environments. By orchestrating specialized agents for environment setup, unit-test generation, and task description synthesis, we processed six million real-world pull requests to produce Scale-SWE-Data, a dataset of 100,000 instances that surpasses existing datasets in both scale and repository diversity. We further demonstrated the practical value of the dataset by distilling high-quality trajectories and fine-tuning Qwen3-30B-A3B-Instruct to create Scale-SWE-Agent, which achieves substantial improvements on SWE-Bench Verified, increasing the resolve rate from 22% to 64%. This advancement underscores the quality and utility of our data for training more capable code agents. We believe Scale-SWE opens new avenues for scalable SWE-style dataset construction and provides a rich, open-access resource to support the development of more capable LLM-based software engineering agents. In future work, we aim to not only increase the volume of training data to fully leverage the potential of Scale-SWE-Data but also significantly broaden its linguistic scope. Specifically, we plan to extend our pipeline to support other major programming languages, such as Java, C, C++, and Rust, thereby fostering the development of truly language-agnostic software engineering agents.
## References
- Anthropic [2025a] Anthropic. Introducing Claude Sonnet 4.5, Sept. 2025a. URL https://www.anthropic.com/news/claude-sonnet-4-5.
- Anthropic [2025b] Anthropic. Announcing Claude Sonnet 4.5, 2025b. URL https://www.anthropic.com/news/claude-sonnet-4-5. Accessed: 2025-09-30.
- Badertdinov et al. [2025] I. Badertdinov, A. Golubev, M. Nekrashevich, A. Shevtsov, S. Karasik, A. Andriushchenko, M. Trofimova, D. Litvintseva, and B. Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025.
- Chen et al. [2025] G. Chen, Z. Qiao, X. Chen, D. Yu, H. Xu, W. X. Zhao, R. Song, W. Yin, H. Yin, L. Zhang, et al. Iterresearch: Rethinking long-horizon agents via markovian state reconstruction. arXiv preprint arXiv:2511.07327, 2025.
- Chowdhury et al. [2024] N. Chowdhury, J. Aung, C. J. Shern, O. Jaffe, D. Sherburn, G. Starace, E. Mays, R. Dias, M. Aljubeh, M. Glaese, C. E. Jimenez, J. Yang, L. Ho, T. Patwardhan, K. Liu, and A. Madry. Introducing SWE-bench verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/. OpenAI Blog Post.
- Deng et al. [2025] X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025.
- Ding et al. [2025] J. Ding, S. Long, C. Pu, H. Zhou, H. Gao, X. Gao, C. He, Y. Hou, F. Hu, Z. Li, et al. Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents. arXiv preprint arXiv:2512.12730, 2025.
- Duston et al. [2025] T. Duston, S. Xin, Y. Sun, D. Zan, A. Li, S. Xin, K. Shen, Y. Chen, Q. Sun, G. Zhang, et al. Ainsteinbench: Benchmarking coding agents on scientific repositories. arXiv preprint arXiv:2512.21373, 2025.
- Froger et al. [2025] R. Froger, P. Andrews, M. Bettini, A. Budhiraja, R. S. Cabral, V. Do, E. Garreau, J.-B. Gaya, H. Laurençon, M. Lecanu, et al. Are: Scaling up agent environments and evaluations. arXiv preprint arXiv:2509.17158, 2025.
- Golubev et al. [2025] A. Golubev, M. Trofimova, S. Polezhaev, I. Badertdinov, M. Nekrashevich, A. Shevtsov, S. Karasik, S. Abramov, A. Andriushchenko, F. Fisin, et al. Training long-context, multi-turn software engineering agents with reinforcement learning. arXiv preprint arXiv:2508.03501, 2025.
- Google [2025] Google. Gemini 3 pro, 2025. URL https://deepmind.google/models/gemini/pro/.
- Guo et al. [2025a] L. Guo, W. Tao, R. Jiang, Y. Wang, J. Chen, X. Liu, Y. Ma, M. Mao, H. Zhang, and Z. Zheng. Omnigirl: A multilingual and multimodal benchmark for github issue resolution. Proceedings of the ACM on Software Engineering, 2(ISSTA):24–46, 2025a.
- Guo et al. [2025b] L. Guo, Y. Wang, C. Li, P. Yang, J. Chen, W. Tao, Y. Zou, D. Tang, and Z. Zheng. Swe-factory: Your automated factory for issue resolution training data and evaluation benchmarks. arXiv preprint arXiv:2506.10954, 2025b. URL https://arxiv.org/abs/2506.10954.
- He et al. [2025] Z. He, Q. Yang, W. Sheng, X. Zhong, K. Zhang, C. An, W. Shi, T. Cai, D. He, J. Chen, and J. Xu. Swe-swiss: A multi-task fine-tuning and rl recipe for high-performance issue resolution, 2025. Notion Blog.
- Jain et al. [2025] N. Jain, J. Singh, M. Shetty, L. Zheng, K. Sen, and I. Stoica. R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents. arXiv preprint arXiv:2504.07164, 2025.
- Jimenez et al. [2023] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
- Liu et al. [2025] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- Ma et al. [2025] J. J. Ma, M. Hashemi, A. Yazdanbakhsh, K. Swersky, O. Press, E. Li, V. J. Reddi, and P. Ranganathan. Swe-fficiency: Can language models optimize real-world repositories on real workloads? arXiv preprint arXiv:2511.06090, 2025.
- MiniMax AI [2025] MiniMax AI. Multilingual and Multi-Task Coding with Strong Generalization: A Look at MiniMax-M2.1. Hugging Face Blog, 2025. URL https://huggingface.co/blog/MiniMaxAI/multilingual-and-multi-task-coding-with-strong-gen. Accessed: 2025-12-23.
- OpenAI [2025] OpenAI. Introducing GPT-5.2, Dec. 2025. URL https://openai.com/index/introducing-gpt-5-2/.
- Pan et al. [2024] J. Pan, X. Wang, G. Neubig, N. Jaitly, H. Ji, A. Suhr, and Y. Zhang. Training software engineering agents and verifiers with swe-gym. arXiv preprint arXiv:2412.21139, 2024.
- Rashid et al. [2025] M. S. Rashid, C. Bock, Y. Zhuang, A. Buchholz, T. Esler, S. Valentin, L. Franceschi, M. Wistuba, P. T. Sivaprasad, W. J. Kim, et al. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents. arXiv preprint arXiv:2504.08703, 2025.
- Shetty et al. [2025] M. Shetty, N. Jain, J. Liu, V. Kethanaboyina, K. Sen, and I. Stoica. Gso: Challenging software optimization tasks for evaluating swe-agents. arXiv preprint arXiv:2505.23671, 2025.
- Sonwane et al. [2025] A. Sonwane, I. White, H. Lee, M. Pereira, L. Caccia, M. Kim, Z. Shi, C. Singh, A. Sordoni, M.-A. Côté, et al. Bugpilot: Complex bug generation for efficient learning of swe skills. arXiv preprint arXiv:2510.19898, 2025.
- Tao et al. [2026] C. Tao, J. Chen, Y. Jiang, K. Kou, S. Wang, R. Wang, X. Li, S. Yang, Y. Du, J. Dai, et al. Swe-lego: Pushing the limits of supervised fine-tuning for software issue resolving. arXiv preprint arXiv:2601.01426, 2026.
- Team et al. [2025a] K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025a.
- Team [2025] Q. Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
- Team et al. [2025b] T. D. Team, B. Li, B. Zhang, D. Zhang, F. Huang, G. Li, G. Chen, H. Yin, J. Wu, J. Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025b.
- Wang et al. [2025a] J. Wang, D. Zan, S. Xin, S. Liu, Y. Wu, and K. Shen. Swe-mirror: Scaling issue-resolving datasets by mirroring issues across repositories. arXiv preprint arXiv:2509.08724, 2025a.
- Wang et al. [2025b] X. Wang, S. Rosenberg, J. Michelini, C. Smith, H. Tran, E. Nyst, R. Malhotra, X. Zhou, V. Chen, R. Brennan, et al. The openhands software agent sdk: A composable and extensible foundation for production agents. arXiv preprint arXiv:2511.03690, 2025b.
- Wei et al. [2025] Y. Wei, O. Duchenne, J. Copet, Q. Carbonneaux, L. Zhang, D. Fried, G. Synnaeve, R. Singh, and S. I. Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449, 2025.
- Xiao et al. [2026] B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780, 2026.
- Xie et al. [2025] C. Xie, B. Li, C. Gao, H. Du, W. Lam, D. Zou, and K. Chen. Swe-fixer: Training open-source llms for effective and efficient github issue resolution. arXiv preprint arXiv:2501.05040, 2025.
- Yang et al. [2024a] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024a.
- Yang et al. [2024b] J. Yang, C. E. Jimenez, A. L. Zhang, K. Lieret, J. Yang, X. Wu, O. Press, N. Muennighoff, G. Synnaeve, K. R. Narasimhan, et al. Swe-bench multimodal: Do ai systems generalize to visual software domains? arXiv preprint arXiv:2410.03859, 2024b.
- Yang et al. [2025a] J. Yang, K. Lieret, C. E. Jimenez, A. Wettig, K. Khandpur, Y. Zhang, B. Hui, O. Press, L. Schmidt, and D. Yang. Swe-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798, 2025a.
- Yang et al. [2025b] Z. Yang, S. Wang, K. Fu, W. He, W. Xiong, Y. Liu, Y. Miao, B. Gao, Y. Wang, Y. Ma, et al. Kimi-dev: Agentless training as skill prior for swe-agents. arXiv preprint arXiv:2509.23045, 2025b.
- Zan et al. [2025a] D. Zan, Z. Huang, W. Liu, H. Chen, L. Zhang, S. Xin, L. Chen, Q. Liu, X. Zhong, A. Li, S. Liu, Y. Xiao, L. Chen, Y. Zhang, J. Su, T. Liu, R. Long, K. Shen, and L. Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving, 2025a. URL https://arxiv.org/abs/2504.02605.
- Zan et al. [2025b] D. Zan, Z. Huang, W. Liu, H. Chen, L. Zhang, S. Xin, L. Chen, Q. Liu, X. Zhong, A. Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving. arXiv preprint arXiv:2504.02605, 2025b.
- Zeng et al. [2025] L. Zeng, Y. Li, Y. Xiao, C. Li, C. Y. Liu, R. Yan, T. Wei, J. He, X. Song, Y. Liu, et al. Skywork-swe: Unveiling data scaling laws for software engineering in llms. arXiv preprint arXiv:2506.19290, 2025.
- Zhan et al. [2025] Z. Zhan, K. Deng, J. Wang, X. Zhang, H. Tang, M. Zhang, Z. Lai, H. Huang, W. Xiang, K. Wu, et al. Kat-coder technical report. arXiv preprint arXiv:2510.18779, 2025.
- Zhang et al. [2025] L. Zhang, S. He, C. Zhang, Y. Kang, B. Li, C. Xie, J. Wang, M. Wang, Y. Huang, S. Fu, et al. Swe-bench goes live! arXiv preprint arXiv:2505.23419, 2025.
- Zhipu AI [2025] Zhipu AI. GLM-4.7: Technical Blog. Zhipu AI Blog, 2025. URL https://z.ai/blog/glm-4.7. Accessed: 2025-12-22.
- Zhipu AI [2026] Zhipu AI. GLM-4.7-Flash: Model Repository. Hugging Face, 2026. URL https://huggingface.co/zai-org/GLM-4.7-Flash. Accessed: 2026-01-19.
## Appendix A Scale-SWE Workflow Details
### A.1 Scale-SWE Workflow Overview
<details>
<summary>figures/DataEngineWorkflow.drawio.png Details</summary>

### Visual Description
## Diagram: Data Pipeline for Software Engineering Task Instance Generation
### Overview
This image is a technical flowchart diagram illustrating a multi-stage data pipeline designed to collect, filter, and process software repository data into standardized "SWE Task Instances." The process begins with a broad scraping strategy, applies multiple filtering steps using Large Language Models (LLMs), and culminates in a sandboxed multi-agent system that constructs the final task instances.
### Components/Axes
The diagram is organized into three main visual regions:
1. **Top-Left (Scrape Strategy):** A dashed-line box containing two green rounded rectangles.
2. **Main Pipeline (Center/Right):** A series of blue rounded rectangles connected by arrows, representing data flow and transformation stages.
3. **Bottom (Sandboxed multi-agent system):** A light gray rounded rectangle containing three orange/yellow rounded rectangles, which feed into a final pink rectangle.
**Labels and Text Elements:**
* **Scrape Strategy** (Title, top-left)
* **SEART (#star, #PRs, ...)** (Green box, top-left)
* **Top PyPI Packages** (Green box, bottom-left)
* **23,000 Repos** (Blue box, center-left)
* **LLM as judge** (Text on arrow, with a small multi-colored star logo labeled "Gemini")
* **Filtered repos** (Blue box, center)
* **GitHub API** (Text on arrow)
* **6M pull requests** (Blue box, top-right)
* **LLM as judge** (Text on arrow, with a small multi-colored star logo labeled "Gemini")
* **1M pull requests** (Blue box, bottom-right)
* **Sandboxed multi-agent system** (Title, bottom-center)
* **Environment Builder Agent** (Orange box, rightmost in the sandbox)
* **Unit-test Creator Agent** (Orange box, center in the sandbox)
* **Problem Statement Writer Agent** (Orange box, leftmost in the sandbox)
* **SWE Task Instance** (Pink box, bottom-left)
### Detailed Analysis
The pipeline flow is as follows:
1. **Data Source Identification (Scrape Strategy):**
* Two primary sources are targeted: repositories identified via **SEART** (using metrics like star count and pull request count) and **Top PyPI Packages**.
2. **Initial Repository Collection:**
* These sources yield an initial set of **23,000 Repos**.
3. **First Filtering Stage:**
* The 23,000 repositories are processed by an **"LLM as judge"** (specifically identified as **Gemini** by the logo).
* This results in a set of **Filtered repos**.
4. **Pull Request Extraction:**
* Using the **GitHub API**, the system extracts pull requests from the filtered repositories.
* This yields a dataset of **6M (6 million) pull requests**.
5. **Second Filtering Stage:**
* The 6 million pull requests undergo another round of filtering by an **"LLM as judge"** (again, **Gemini**).
* This significantly reduces the dataset to **1M (1 million) pull requests**.
6. **Task Instance Construction (Sandboxed multi-agent system):**
* The 1 million filtered pull requests are input into a **Sandboxed multi-agent system**.
* This system consists of three specialized agents operating in sequence (right-to-left flow):
* **Environment Builder Agent:** Likely sets up the code environment for the task.
* **Unit-test Creator Agent:** Generates or validates unit tests related to the pull request.
* **Problem Statement Writer Agent:** Formulates a clear problem description based on the code change.
* The final output of this multi-agent system is a **SWE Task Instance**.
### Key Observations
* **Funnel Effect:** The pipeline demonstrates a massive data reduction funnel: from 23,000 repos to 6M PRs, then filtered down to 1M PRs for final processing.
* **LLM-Centric Filtering:** The core filtering mechanism at two critical stages is an LLM (Gemini) acting as a judge, suggesting automated quality or relevance assessment.
* **Modular Agent Design:** The final construction phase uses a specialized, multi-agent architecture where each agent has a distinct responsibility (environment, tests, description).
* **Spatial Flow:** The diagram uses a clear left-to-right flow for the data processing pipeline, which then feeds into a right-to-left flow within the sandboxed system, creating a logical loop that ends at the final output on the left.
### Interpretation
This diagram outlines a sophisticated, automated pipeline for creating a large-scale benchmark or training dataset for software engineering AI agents. The process is designed to curate high-quality, real-world coding tasks from open-source repositories.
* **Purpose:** The system aims to solve the problem of obtaining realistic, well-defined software engineering tasks at scale. Manually creating such tasks is prohibitively expensive.
* **Methodology:** It leverages existing, popular code repositories (via SEART and PyPI) as a source of authentic code changes (pull requests). The dual-stage LLM filtering is crucial for ensuring the selected tasks are suitable, likely filtering for clarity, self-contained nature, and educational value.
* **Significance:** The final "SWE Task Instance" is the key product. Each instance likely includes a codebase state, a problem description, and a test suite, providing a complete environment for an AI to practice or be evaluated on software engineering skills. The scale (1M processed PRs) suggests an ambition to create a very comprehensive dataset.
* **Notable Design Choice:** The use of a "sandboxed" multi-agent system for the final step implies that constructing a valid task instance is complex and requires isolated, controlled steps to avoid interference and ensure reliability.
</details>
Figure 5: Schematic workflow for automated Scale-SWE task synthesis. From an initial pool of 23k repositories and 6M pull requests, the pipeline utilizes LLM-as-a-judge to filter for quality and relevance. The selected 1M pull requests are then transformed into formal software engineering task instances via a sandboxed orchestration of specialized agents responsible for environment building, test creation, and statement writing.
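As a rough illustration of the two LLM-as-a-judge filtering stages in Figure 5, the sketch below shows how pull requests might be screened. The `judge_pr` prompt, the `llm_judge` callable, and the field names are hypothetical placeholders rather than the actual pipeline code.

```python
from typing import Iterable

def judge_pr(pr: dict, llm_judge) -> bool:
    """Ask an LLM judge whether a pull request is a suitable SWE task seed.

    `llm_judge` is a placeholder callable that takes a prompt string and
    returns "yes" or "no"; the selection criteria below are illustrative.
    """
    prompt = (
        "Decide whether this pull request fixes a concrete, testable bug.\n"
        f"Title: {pr['title']}\n"
        f"Linked issue: {pr.get('issue_body', 'N/A')}\n"
        f"Diff summary: {pr['diff_stat']}\n"
        "Answer yes or no."
    )
    return llm_judge(prompt).strip().lower().startswith("yes")

def filter_pull_requests(prs: Iterable[dict], llm_judge) -> list[dict]:
    """Second filtering stage: scraped PRs are reduced to candidate PRs."""
    return [pr for pr in prs if judge_pr(pr, llm_judge)]
```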
## Appendix B Scale-SWE Task Instance Structure
The structure of a Scale-SWE task instance closely adheres to the SWE-bench standard, with one necessary adaptation for instances whose tests are synthetically generated. Specifically, since original developer-written "fail-to-pass" (F2P) tests are not available for all mined instances, we include a field for F2P scripts generated by our unit-test creator agent. A Scale-SWE task instance consists of the following fields (an illustrative instance is sketched after the list):
- instance_id: A unique identifier formatted as {user}_{repo}_pr{id}.
- user: The owner of the GitHub repository.
- repo: The name of the GitHub repository.
- language: The programming language of the codebase (currently Python).
- workdir: The working directory path within the environment.
- image_url: The URL of the pre-built Docker image for the task.
- patch: The ground-truth patch (Golden Patch) from the corresponding pull request.
- pr_commit: The commit hash of the pull request.
- parent_commit: The commit hash of the parent commit (base state).
- problem_statement: The issue description conveying the bug, provided to the model as input.
- f2p_patch: The developer-written test patch containing tests that fail before the fix (if available).
- f2p_script: The synthetic reproduction script generated by our unit-test creator agent (used when f2p_patch is absent).
- FAIL_TO_PASS: A list of unit tests that fail when applied to the buggy version but pass after the fix.
- PASS_TO_PASS: A list of unit tests that pass in both the buggy and fixed versions (regression tests).
- github_url: The URL of the original GitHub repository.
- pre_commands: Commands executed immediately upon entering the container to strip future commit information from the Git history and prevent data leakage.
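For concreteness, a minimal instance following the fields above might look as follows. All values are fabricated placeholders for illustration and do not correspond to a real entry in Scale-SWE-Data.

```python
example_instance = {
    "instance_id": "example-org_example-repo_pr123",  # placeholder, not a real instance
    "user": "example-org",
    "repo": "example-repo",
    "language": "python",
    "workdir": "/workspace/example-repo",
    "image_url": "registry.example.com/scale-swe/example-repo:pr123",
    "patch": "diff --git a/pkg/core.py b/pkg/core.py\n...",
    "pr_commit": "<pr commit hash>",
    "parent_commit": "<parent commit hash>",
    "problem_statement": "Calling parse() on empty input raises IndexError ...",
    "f2p_patch": None,                      # developer test patch unavailable here
    "f2p_script": "python tests/repro.py",  # synthetic reproduction script used instead
    "FAIL_TO_PASS": ["tests/test_core.py::test_parse_empty_input"],
    "PASS_TO_PASS": ["tests/test_core.py::test_parse_basic"],
    "github_url": "https://github.com/example-org/example-repo",
    "pre_commands": ["bash /opt/sanitize_git_history.sh"],  # hypothetical sanitization hook
}
```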
## Appendix C Implementation Details
The hyperparameters for SFT are detailed in Table 5.
Table 5: Key hyperparameters in the SFT phase.
| Hyperparameter | Value |
| --- | --- |
| Learning Rate | 1e-5 |
| Base model | Qwen3-30B-A3B |
| Batch size | 128 |
| Maximum Context Length | 131,072 |
| Warmup ratio | 0.05 |
| LR scheduler type | Cosine |
| Epoch | 3 |
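The table translates directly into a training configuration. The sketch below mirrors it as a plain Python dictionary for a generic SFT launcher; the key names are chosen for readability and do not correspond to any specific framework's API.

```python
# Illustrative SFT configuration mirroring Table 5 (key names are assumptions).
sft_config = {
    "base_model": "Qwen3-30B-A3B",     # fine-tuned into Scale-SWE-Agent
    "learning_rate": 1e-5,
    "batch_size": 128,                 # global batch size
    "max_context_length": 131_072,     # tokens per training sample
    "warmup_ratio": 0.05,
    "lr_scheduler_type": "cosine",
    "num_epochs": 3,
}
```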
## Appendix D Anti-hack Strategy
To ensure the integrity of the SWE-bench evaluation, we implement an anti-leak strategy to prevent the LLM from accessing ground truth solutions via Git history (e.g., using commands like git log --all). The following sanitization script is executed immediately after initializing the environment container:
```bash
# Reset the working tree to the parent commit while preserving build/venv artifacts
git clean -fd -e '*.egg-info' -e '.tox' -e '.venv' && git checkout {parent_commit}
NEW_BRANCH="swe_bench_clean_main"
CURRENT_HEAD=$(git rev-parse HEAD)
git stash -a
git clean -fd
git reset --hard $CURRENT_HEAD
git stash pop || echo "No stash to apply or conflict occurred"

# Commit the current state under a neutral identity
git config user.email "pre-agent@swalm.local" && git config user.name "Pre-Agent" && git add . && git commit -m "pre-agent commit"

CURRENT_3_PRIME=$(git rev-parse HEAD)

# Delete all remote-tracking refs and tags
git for-each-ref --format="%(refname)" refs/remotes/ | xargs -I {} git update-ref -d {}
git tag -l | xargs -r git tag -d

# Remove packed refs, auxiliary refs, and reflogs that could expose future commits
rm -f .git/packed-refs
rm -f .git/ORIG_HEAD .git/FETCH_HEAD .git/MERGE_HEAD .git/CHERRY_PICK_HEAD .git/refs/stash
rm -rf .git/logs/

# Point a fresh branch at the sanitized HEAD and make it the current branch
git update-ref refs/heads/${NEW_BRANCH} $CURRENT_3_PRIME
git symbolic-ref HEAD refs/heads/${NEW_BRANCH}

# Delete every other local branch
git for-each-ref --format="%(refname)" refs/heads/ | grep -v "refs/heads/${NEW_BRANCH}" | xargs -I {} git update-ref -d {}

# Garbage-collect unreachable objects so the ground-truth fix cannot be recovered
git gc --prune=now --aggressive
```
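In practice, this sanitization can be wired in through the pre_commands field described in Appendix B. The following is a minimal sketch, assuming the script above is baked into the image at a hypothetical path and the container name is a placeholder; the actual Scale-SWE orchestration may invoke the commands differently.

```python
import subprocess

def sanitize_container(container: str,
                       script_path: str = "/opt/sanitize_git_history.sh") -> None:
    """Run the Git-history sanitization script inside a task container.

    `container` and `script_path` are illustrative placeholders.
    """
    subprocess.run(
        ["docker", "exec", container, "bash", script_path],
        check=True,
    )

# Example (hypothetical container name):
# sanitize_container("scale-swe-task-0001")
```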
## Appendix E Prompts in the Scale-SWE Workflow
The prompts used throughout the Scale-SWE workflow are:
- Repository filtering prompt
- Pull request filtering prompt
- Environment builder agent prompt
- Unit-test creator agent prompt
- Problem statement writer agent prompt
- Bug type categorization prompt