2602.02361
# SWE-Universe: Scale Real-World Verifiable Environments to Millions
**Authors**:
- Junyang Lin, Binyuan Hui (Qwen Team, Alibaba Group, Zhejiang University)
> Work done during an internship at Alibaba Qwen.
## Abstract
We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and prohibitive cost, our framework utilizes a building agent powered by an efficient custom-trained model. This agent employs iterative self-verification and in-loop hacking detection to ensure the reliable generation of high-fidelity, verifiable tasks. Using this method, we scale the number of real-world multilingual SWE environments to near-million scale (807,693 instances). We demonstrate the value of our environments through large-scale agentic mid-training and reinforcement learning. Finally, applying this technique to Qwen3-Max-Thinking achieves a score of 75.3% on SWE-Bench Verified. Our work provides both a critical resource and a robust methodology to advance the next generation of coding agents.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Bar Chart: Number of Real-World Verifiable SWE Instances
### Overview
This is a vertical bar chart comparing the number of verifiable Software Engineering (SWE) instances across different benchmarks or datasets. The chart highlights two categories of instances: "Python-only" and "Multilingual." The primary takeaway is the significantly larger scale of the "SWE-Universe (Ours)" dataset compared to all others listed.
### Components/Axes
* **Chart Title:** "Number of Real-World Verifiable SWE Instances"
* **Legend:** Located in the top-left corner.
* **Blue Square:** "Python-only"
* **Orange Square:** "Multilingual"
* **X-Axis (Categories):** Lists eight different benchmarks/datasets. From left to right:
1. SWE-Bench
2. SWE-Gym
3. Multi-SWE-RL
4. SWE-rebench
5. DeepSeek-V3.2
6. CWM
7. MiMo-V2-Flash
8. SWE-Universe (Ours)
* **Y-Axis:** Represents the count of instances. The axis line is present, but no numerical labels or title are visible. Values are provided directly above each bar.
* **Data Labels:** Exact numerical values are printed above each bar.
### Detailed Analysis
The chart presents the following data points for each category:
1. **SWE-Bench:**
* **Bar:** Single blue bar (Python-only).
* **Value:** 2,294.
* **Trend:** Baseline value, the smallest on the chart.
2. **SWE-Gym:**
* **Bar:** Single blue bar (Python-only).
* **Value:** 2,438.
* **Trend:** Slightly higher than SWE-Bench.
3. **Multi-SWE-RL:**
* **Bar:** Single orange bar (Multilingual).
* **Value:** 4,723.
* **Trend:** First multilingual entry, roughly double the preceding Python-only values.
4. **SWE-rebench:**
* **Bar:** Single blue bar (Python-only).
* **Value:** 21,000.
* **Trend:** Significant jump in scale compared to previous entries.
5. **DeepSeek-V3.2:**
* **Bar:** Single orange bar (Multilingual).
* **Value:** 24,667.
* **Trend:** Comparable in scale to SWE-rebench, but multilingual.
6. **CWM:**
* **Bar:** Single blue bar (Python-only).
* **Value:** 35,000.
* **Trend:** The largest Python-only dataset shown.
7. **MiMo-V2-Flash:**
* **Bar:** Single orange bar (Multilingual).
* **Value:** 90,000.
* **Trend:** A major increase, more than double the previous highest value (CWM).
8. **SWE-Universe (Ours):**
* **Bar:** Single orange bar (Multilingual).
* **Value:** 807,693.
* **Trend:** An order-of-magnitude increase over all other datasets. This bar dominates the chart visually.
### Key Observations
* **Scale Disparity:** The "SWE-Universe (Ours)" dataset contains approximately **9 times** more instances than the next largest dataset (MiMo-V2-Flash) and over **350 times** more than the smallest (SWE-Bench).
* **Category Distribution:** Of the eight datasets listed, four are categorized as "Python-only" (blue) and four as "Multilingual" (orange). The two largest datasets by a wide margin are both multilingual.
* **Visual Trend:** Dataset size increases monotonically from left to right, culminating in the massive final bar.
### Interpretation
This chart is likely from a research paper or technical report introducing the "SWE-Universe" dataset. Its primary purpose is to **demonstrate the unprecedented scale** of this new resource compared to existing benchmarks in the software engineering domain.
* **What the data suggests:** The field has previously relied on relatively small, often language-specific (Python) datasets for training and evaluating AI models on real-world software tasks. "SWE-Universe" represents a massive leap in available verifiable data, specifically emphasizing multilingual support.
* **How elements relate:** The x-axis orders the datasets, likely in a combination of chronological release and increasing scale, to build a narrative of progression that peaks with the authors' contribution. The color coding (blue vs. orange) immediately draws a distinction between language-specific and broader multilingual resources.
* **Notable implications:** The sheer size of "SWE-Universe" implies it could enable the training of more robust and generalizable AI software engineering agents. The emphasis on "verifiable" instances suggests a focus on high-quality, ground-truth data where the correctness of solutions can be automatically checked, which is crucial for reliable benchmarking. The chart makes a compelling visual argument for the significance of the authors' work.
</details>
Figure 1: A comparison of the number of instances in real-world SWE instances. Our multilingual SWE-Universe is significantly larger than other recent multilingual efforts like MiMo-V2-Flash (Xiao et al., 2026), DeepSeek-V3.2 (DeepSeek-AI, 2025), and Multi-SWE-RL (Zan et al., 2025), as well as prominent Python-only datasets including SWE-rebench (Badertdinov et al., 2025), SWE-Gym (Pan et al., 2024), CWM (Copet et al., 2025), and SWE-Bench (Jimenez et al., 2024).
## 1 Introduction
Training large language models (LLMs) (Yang et al., 2025a; Liu et al., 2025) as coding agents has gained substantial attention in software engineering (SWE). Progress in this direction is critically dependent on large-scale, high-quality environments with reliable verification signals. An ideal source for such data lies within the vast ecosystem of real-world software development, specifically public GitHub repositories where pull requests (PRs) are linked to corresponding issues. Each such PR defines a self-contained training environment: the issue serves as a clear problem statement, the code patch offers an expert reference solution, and the accompanying tests can be repurposed into a verifier for agent proposed fixes. This formulation effectively turns each PR into a “gym” for training and evaluating coding agents. Benchmarks such as SWE-bench (Jimenez et al., 2024) operationalized this pipeline by curating real-world issues and standardizing the evaluation, serving as a common testbed for subsequent work.
However, scaling the construction of verifiable real-world SWE environments while maintaining diversity and reproducibility remains a major challenge. A vast body of work, including the original SWE-bench and its many derivatives and extensions (Pan et al., 2024; Badertdinov et al., 2025; Zhang et al., 2025c; Guo et al., 2025; Zeng et al., 2026; Wang et al., 2025; Aleithan et al., 2024; Copet et al., 2025), has focused primarily on Python. While this line of work benefits from the low barrier to entry of Python environment configuration, it restricts the development of agents with true cross-lingual generalization capabilities. Efforts to create multilingual datasets, such as Multi-SWE-bench (Zan et al., 2025) and SWE-PolyBench (Rashid et al., 2025), rely on labor-intensive manual environment setup and remain limited in scale. Although recent industrial LLM providers have scaled SWE instances to the $10^{4}$-$10^{5}$ range (Xiao et al., 2026; DeepSeek-AI, 2025), their technical details remain undisclosed.
We argue that scaling the generation of such verifiable environments to massive scale (e.g., $10^{6}$) hinges on overcoming three fundamental challenges:
- Low Production Yield: Real-world repositories are intricate and heterogeneous, with complex dependencies, platform-specific configurations, and bespoke build toolchains, yielding a low conversion rate from repositories to runnable instances. This results in significant computational waste and makes large-scale generation impractical.
- Weak Verifier: Issues, PR patches, and test suites exhibit substantial variance in quality. A naive extraction pipeline can produce low-fidelity instances, and can also admit verifier vulnerabilities that allow solutions to pass via shallow heuristics (e.g., string matching with grep) rather than compiling and executing the intended code. Such failure modes create spurious training signals and distort evaluation.
- Prohibitive Cost and Inefficiency: Many existing pipelines rely on large, expensive LLMs to perform repository-specific reasoning for dependency resolution and build configuration. The resulting cost and latency per instance make massive-scale generation economically and operationally impractical.
To systematically address these challenges, we introduce SWE-Universe, a scalable, reliable, and efficient framework for automatically constructing million-scale, real-world agentic software engineering environments. At its core is an autonomous building agent that, for each PR, synthesizes a self-contained executable environment together with an executable verifier. To mitigate low production yield, the agent performs self-verification in an iterative validation loop: it repeatedly tests the generated verifier against both the buggy and the fixed repository states, diagnoses failure modes, and revises the build procedure accordingly. This process improves the build success rate from 82.6% to 94% on a held-out set. To address weak verifiers, we integrate an in-loop hacking detector that immediately flags and rejects superficial verifiers during generation, forcing the agent toward solutions that genuinely execute the code. To achieve high efficiency and low cost, we custom-train an efficient Qwen-Next-80B-A3B model, which combines a Mixture-of-Experts (MoE) architecture with hybrid attention. On our custom multilingual building benchmark, this model achieves a 78.44% success rate, surpassing top-tier proprietary models such as Claude-Opus-4.5, while its efficient architecture dramatically reduces the latency and cost per build.
Leveraging the SWE-Universe framework, we construct 807,693 multilingual, verifiable training instances sourced from over 52,000 unique GitHub repositories. To our knowledge, this is currently the largest and most diverse collection of real-world software engineering tasks with executable verification. We further validate the value of this dataset through extensive experiments. First, in a mid-training phase, we show that continued training on our dataset significantly improves performance on standard benchmarks such as SWE-Bench Verified (Jimenez et al., 2024), demonstrating strong generalization. Second, we show that the verifiers generated by our pipeline provide a stable and effective reward signal for Reinforcement Learning (RL). Applying this recipe to our flagship model, Qwen3-Max-Thinking (Qwen Team, 2026), we achieve a score of 75.3% on SWE-Bench Verified. Together, these results validate SWE-Universe as a robust framework for creating large-scale, high-quality agentic training data, paving the way for more capable and versatile coding agents in real-world applications.
## 2 Methodology: Scalability and Reliability
<details>
<summary>figures/env-build.png Details</summary>

### Visual Description
## Diagram: Automated Software Patch Evaluation and Validation Pipeline
### Overview
This image is a technical flowchart illustrating an automated system for evaluating software patches submitted via pull requests. The system uses AI agents to build, test, and verify patches, with mechanisms to detect malicious code ("hacking") and generate new software engineering tasks upon successful validation. The process is cyclical and includes retry logic.
### Components/Elements
The diagram is organized into several interconnected functional blocks, each with specific icons, labels, and text.
1. **Pull Request (Top-Left)**
* **Icon:** GitHub logo.
* **Title:** "Pull Request"
* **Sub-element:** A box labeled "Real-world PR" showing a diff summary: `+20 -12` with green and red bars.
* **File List:**
* `sklearn/` (folder icon)
* `gradient_boosting.py` (file icon with a green plus)
* `helper.py` (file icon with an orange square)
* `utils` (folder icon with a red minus)
* **Output Arrow:** Labeled "oracle patch", pointing to the "Building & Verifier Agent".
2. **Building & Verifier Agent (Top-Center)**
* **Icon:** A robot head.
* **Title:** "Building & Verifier Agent"
* **Sub-element:** A dashed box labeled "tools" containing three blue buttons:
* `bash`
* `switch-to-bug`
* `switch-to-resolved`
* **Output Arrow:** A solid black arrow labeled "submit evaluation.sh" pointing to "Verifier Validation".
3. **Verifier Validation (Top-Right)**
* **Icon:** A gear with a checkmark.
* **Title:** "Verifier Validation"
* **Process Flow:**
* **Path 1 (Top):** `switch-to-bug` + `TEST` icon → dashed arrow → `non-zero exitcode`
* **Path 2 (Bottom):** `switch-to-resolved` + `TEST` icon → dashed arrow → `0 exitcode`
* **Outcome Arrows:**
* A green dashed arrow labeled "pass" with a green checkmark points down to "New Software Engineering Task".
* A red dashed arrow labeled "failed" with a red 'X' points left, leading to a "retry" loop back to the "Building & Verifier Agent".
4. **Hacking Detector (Bottom-Left)**
* **Icon:** A detective/spy.
* **Title:** "Hacking Detector"
* **Sub-elements:** Two code blocks monitoring `evaluation.sh`:
* **Block 1 (Pass):** `# evaluation.sh` with a green checkmark. Code: `pytest test_production.py`
* **Block 2 (Fail):** `# evaluation.sh` with a red 'X'. Code: `grep -q "keyword" bug.java`
* **Connection:** A dashed arrow labeled "observation" points from the "Building & Verifier Agent" to this block. Another dashed arrow labeled "evaluation.sh" with a `TEST` icon points from this block to the "Repository Container".
5. **Repository Container (Bottom-Center)**
* **Icon:** A Docker container.
* **Title:** "Repository Container"
* **Sub-element:** A blue box labeled "Codebase" with a GitHub logo, listing:
* `sklearn/` (folder)
* `examples/` (folder)
* `README.rst` (file)
* `reqs.txt` (file)
* `setup.cfg` (file)
* `setup.py` (file)
* **Connection:** A dashed arrow labeled "action" points from the "Building & Verifier Agent" to this block.
6. **New Software Engineering Task (Bottom-Right)**
* **Icon:** A smiling face with hearts.
* **Title:** "New Software Engineering Task"
* **Output Elements:**
* A document icon labeled "TXT" with the text "problem statement".
* A Docker container icon labeled "docker image".
* A `TEST` script icon labeled "eval script".
7. **Central Agentic Loop**
* **Label:** "agentic loop"
* **Visual:** A circular flow of three curved arrows (orange, yellow, light blue) connecting the "Building & Verifier Agent", "Hacking Detector", and "Repository Container".
### Detailed Analysis
The diagram details a closed-loop, automated workflow:
1. **Input:** A real-world pull request (PR) containing code changes (e.g., to `gradient_boosting.py`) is the starting input.
2. **Agent Processing:** The "Building & Verifier Agent" receives the PR patch. It has tools (`bash`, `switch-to-bug`, `switch-to-resolved`) to manipulate the codebase state. It submits an `evaluation.sh` script for validation.
3. **Validation:** The "Verifier Validation" block runs the script in two states:
* **`switch-to-bug`:** Restores the buggy code. A non-zero exit code here confirms the bug is present (the test fails as expected).
* **`switch-to-resolved`:** Applies the patch. A zero exit code here confirms the patch fixes the bug.
4. **Decision Point:**
* **Pass:** If validation succeeds (green path), the system generates a "New Software Engineering Task" comprising a problem statement, a docker image, and an evaluation script.
* **Fail:** If validation fails (red path), a "retry" signal is sent back to the "Building & Verifier Agent" to attempt a new solution.
5. **Security Monitoring:** The "Hacking Detector" observes the agent's actions and the `evaluation.sh` script. It checks for malicious patterns (e.g., `grep -q "keyword" bug.java`), which would cause a failure (red X). A legitimate test command (`pytest test_production.py`) passes (green check).
6. **Environment:** The "Repository Container" provides the isolated codebase environment (`sklearn/`, `examples/`, etc.) where the agent performs its actions.
### Key Observations
* **Dual Validation Logic:** The system doesn't just check if a patch works; it first verifies the bug is present (`switch-to-bug` + test should fail) and then verifies the patch fixes it (`switch-to-resolved` + test should pass). This is a robust testing methodology.
* **Security as a First-Class Check:** The "Hacking Detector" is a dedicated component that runs in parallel, scrutinizing the agent's scripts for signs of malicious intent, not just functional correctness.
* **Retry Mechanism:** The workflow is designed to be iterative. An agent that produces an invalid or malicious patch receives a "failed" signal and can try again.
* **Output as a New Task:** The successful output isn't just a "merged PR." It's a packaged "New Software Engineering Task," suggesting this system might be used to generate training data or benchmarks for other AI agents.
### Interpretation
This diagram represents a sophisticated **AI-driven software engineering agent evaluation framework**. Its core purpose is to autonomously assess whether an AI agent can correctly fix a real-world bug in a codebase without introducing security vulnerabilities or "hacking" the test suite.
The validation protocol moves beyond a simple "pass/fail" on a test suite to enforce several checks:
1. **Bug Existence Verification:** Ensuring the test is meaningful.
2. **Patch Correctness Verification:** Ensuring the solution works.
3. **Security/Intent Verification:** Ensuring the agent isn't cheating the verifier.
4. **Generative Output:** Packaging a new, validated task from the successful interaction.
The **"agentic loop"** is central, indicating the agent (Building & Verifier) continuously interacts with its environment (Repository Container) and receives feedback (from Hacking Detector and Verifier Validation) to refine its actions. This is a classic sense-think-act cycle for autonomous agents.
The ultimate implication is that this framework can be used to **benchmark or train AI coding agents** in a safe, automated, and highly controlled manner, producing validated datasets of solved problems for further research. The presence of `sklearn` in the example PR suggests a focus on machine learning libraries.
</details>
Figure 2: Our SWE-Universe framework for scalable and reliable environment building. The pipeline is built around a building agent that proposes a verifier (evaluation.sh). Two key components ensure quality and yield: an in-loop Hacking Detector that preemptively rejects superficial scripts, and an Iterative Validation loop where the agent self-corrects based on feedback from testing its verifier against both buggy and resolved code states.
Our primary goal is to develop a method for automated verifiable SWE environment construction that is both scalable and reliable. Scalability implies compatibility across diverse programming languages and build toolchains, as well as maximizing the build success rate to minimize computational waste. Reliability requires that the generated verifier (i.e., an evaluation.sh script) accurately provides a reward signal for a given patch, passing only when the SWE requirements are met. To this end, we design an autonomous building agent based on the popular mini-SWE-agent scaffold.
Our key insight is that while directly verifying the success of a complex build is difficult, assessing the success of a test script is comparatively straightforward: one can simply run it to see whether it distinguishes between the pre-patch (buggy) and post-patch (fixed) states, as adopted by DeepSeek-V3.2 (Liu et al., 2025). However, we argue that this condition alone is insufficient. A naive agent might generate a script that “hacks” the verification by using simple string matching to confirm the patch’s application, rather than actually executing the code. Such a script would correctly distinguish states but would fail to validate the environment or the behavioral correctness of the fix. Therefore, we establish a more robust acceptance criterion for a successfully built task: the verifier must not only correctly distinguish between states but must do so by genuinely executing the code under test. Our methodology, detailed below and summarized in Figure 2, is designed to meet this twofold objective.
### 2.1 Key Designs for SWE-Universe
#### PR Crawling and Patch Separation.
The pipeline begins by sourcing a large corpus of issue-linked pull requests (PRs) from public GitHub repositories. To prevent data contamination and ensure fair evaluation, we meticulously filter out any PRs that overlap with known downstream benchmarks. For each valid PR, we employ a language model to analyze the code modifications and partition them into a test patch (containing test-related changes) and a fix patch (containing the source code fix). PRs lacking a discernible test component are discarded.
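The separation step above relies on a language model. Purely as an illustration of the partition it produces, the sketch below routes each file's section of a unified diff by its path; the function name `split_patch` and the path patterns are assumptions for this sketch, not the paper's actual classifier.

```python
import re

# Illustrative stand-in for the LLM-based patch-separation step: route each
# file's section of the unified diff by its path. Test-like paths go into the
# test patch; everything else into the fix patch.
TEST_PATH = re.compile(r"(^|/)(tests?|spec)(/|_|\.)|_test\.|\.test\.")

def split_patch(diff_text: str):
    """Partition a unified diff into (test_patch, fix_patch) by file path."""
    test_parts, fix_parts = [], []
    # A file's section starts with its "diff --git a/... b/..." header.
    for section in re.split(r"(?=^diff --git )", diff_text, flags=re.M):
        if not section.strip():
            continue
        match = re.match(r"diff --git a/(\S+) b/", section)
        path = match.group(1) if match else ""
        (test_parts if TEST_PATH.search(path) else fix_parts).append(section)
    return "".join(test_parts), "".join(fix_parts)
```

Under this sketch, a PR whose diff yields an empty test patch would be discarded, mirroring the filtering rule above.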
#### Agent-based Environment Building.
Following patch separation, our autonomous agent, equipped with a set of powerful tools, initiates the construction of the environment and its corresponding verifier. The process commences by applying the test patch to the repository. The agent is then tasked with generating a verifier script, designated as evaluation.sh, whose objective is to reliably distinguish between the repository’s states based solely on its return code. Guided by the information within the test patch, the agent has the flexibility to adopt one of two strategies: it can either directly invoke the unit tests described in the patch, or, in scenarios where those tests lack a straightforward execution entry point, it can author a new custom test from scratch. This standardized approach, centered on a universal bash interface and its integer return code, is a deliberate design choice. It decouples the verification logic from language-specific conventions (e.g., hardcoding a pytest or cargo run workflow), thereby maximizing the method’s generalizability and scalability across disparate projects and ecosystems.
#### Toolset.
We equip the agent with three tools: bash, switch-to-resolved, and switch-to-bug.
- bash: A general-purpose shell for file manipulation, dependency installation, and script generation.
- switch-to-resolved and switch-to-bug: A pair of tools that allow the agent to atomically apply or revert the fix patch, toggling the repository between its fixed and buggy states.
These state-switching tools are fundamental to our approach, as they empower the agent to perform self-verification. By testing its own actions in a closed loop, the agent can diagnose and recover from failures, which is key to improving the overall build success rate and achieving scalability.
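To make the state-toggling contract concrete, here is a minimal simulation. The real tools presumably apply or revert the fix patch in place inside the container; this sketch instead restores a pristine snapshot of the requested state, which is enough to exercise the self-verification loop. The class name and snapshot mechanism are hypothetical.

```python
import shutil
from pathlib import Path

class RepoStateSwitcher:
    """Toy stand-in for the switch-to-bug / switch-to-resolved tools.

    Assumption for this sketch: instead of applying/reverting the fix patch,
    we keep pristine copies of the buggy and resolved worktrees and copy the
    requested one into the working directory.
    """

    def __init__(self, workdir: Path, snapshots: dict):
        self.workdir = Path(workdir)
        self.snapshots = snapshots  # e.g. {"bug": Path(...), "resolved": Path(...)}

    def switch(self, state: str) -> None:
        if state not in self.snapshots:
            raise ValueError(f"unknown state: {state}")
        if self.workdir.exists():
            shutil.rmtree(self.workdir)  # discard the previous state
        shutil.copytree(self.snapshots[state], self.workdir)
```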
#### Iterative Validation.
To ensure the verifier is reliable, the agent engages in an iterative validation loop. Once the agent submits a candidate evaluation.sh script, we execute it under both repository states using the state-switching tools. A script is considered functionally correct only if it fails (exits with a non-zero status) in the buggy state and succeeds (exits with a zero status) in the fixed state. If the script fails this validation, the agent receives this negative feedback, discards the faulty script, and is prompted to generate a new, revised version until it reaches a predefined maximum number of turns (e.g., 100). This iterative process substantially improves reliability: on a set of held-out PRs, the environment-building success rate increases from 82.6% to 94%.
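The acceptance check and retry loop above can be sketched as follows. `run_eval` (executes a candidate evaluation.sh in a given repository state and returns its exit code) and `propose` (the building agent producing a revised script from feedback) are hypothetical stand-ins for the real components.

```python
from typing import Callable, Optional

def verifier_is_valid(run_eval: Callable[[str, str], int], script: str) -> bool:
    """Accept a verifier only if it fails (non-zero exit) on the buggy state
    and passes (zero exit) on the resolved state."""
    return run_eval(script, "bug") != 0 and run_eval(script, "resolved") == 0

def build_with_retries(propose: Callable[[str], str],
                       run_eval: Callable[[str, str], int],
                       max_turns: int = 100) -> Optional[str]:
    """Drive the agent until it produces an accepted verifier or exhausts its turns."""
    feedback = "initial attempt"
    for _ in range(max_turns):
        script = propose(feedback)
        if verifier_is_valid(run_eval, script):
            return script  # accepted evaluation.sh
        feedback = "script did not separate the buggy and resolved states"
    return None  # turn budget exhausted
```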
#### In-loop Hacking Detection.
To meet our second criterion—that the verifier must genuinely execute the code—we integrate a Hacking Detector directly into the agent’s work cycle. This module uses an LLM to inspect the generated evaluation.sh script for “hacking” patterns, such as using grep or other string-matching utilities to check the contents of source files instead of running a build or test command. Critically, this check is performed within the agent’s loop, not as a post-processing step. If the detector flags a script as a “hack”, this attempt is immediately considered a failure. This provides timely feedback to the agent, forcing it to abandon superficial strategies and guiding it toward a valid solution that involves actual code execution. This in-loop design improves both the reliability of the final artifact and the efficiency of the agent’s search process.
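The detector itself is LLM-based; purely as an illustration of the pattern it screens for, the heuristic below flags scripts that grep source files while showing no sign of invoking a build or test command. The regexes are assumptions for this sketch and would be far weaker than the LLM inspector.

```python
import re

# Illustrative heuristic only: the paper's hacking detector is an LLM that
# inspects evaluation.sh. Here we merely flag scripts that pattern-match
# source files without any apparent code execution.
STATIC_CHECK = re.compile(r"\b(grep|rg|ack)\b[^\n]*\.(py|java|scala|js|ts|go|rs|c|cpp)\b")
EXECUTION_HINT = re.compile(
    r"\b(pytest|unittest|jest|mocha|cargo (test|run)|go test|mvn|gradle|make|npm test|python3?\b)"
)

def looks_like_hack(script: str) -> bool:
    """Flag verifiers that rely on static string matching instead of execution."""
    return bool(STATIC_CHECK.search(script)) and not EXECUTION_HINT.search(script)
```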
<details>
<summary>x2.png Details</summary>

### Visual Description
## Code Snippet Comparison: Testing Methodologies
### Overview
The image displays three side-by-side terminal or code editor windows, each containing a different script related to software testing and verification. The panels are labeled at the bottom: (1) Executing Human-written unit test, (2) Executing LLM-written tests, and (3) Hacking: using static checking. The content is presented with syntax highlighting on a dark background.
### Components/Axes
The image is segmented into three distinct vertical panels:
* **Left Panel (1):** A Bash shell script.
* **Middle Panel (2):** A Python script.
* **Right Panel (3):** A Bash shell script.
Each panel contains code with comments (prefixed by `#` or `"""`), commands, and logic. The text is monospaced.
### Detailed Analysis
#### Panel (1): Executing Human-written unit test
This is a Bash script for setting up and running a JavaScript/Node.js test suite.
* **Shebang:** `#!/bin/bash`
* **Key Steps:**
1. Navigate to a directory: `cd /testbed`
2. Install npm dependencies if `node_modules` doesn't exist.
3. Create a CommonJS version of a Jest setup file (`jest.setup.js`), mocking a server.
4. Run Jest tests with a specific config, logging output to `test-output.log`.
5. Check the log for a specific error pattern (`"Property 'persistUserToken' does not exist"`). If found, it reports a bug; otherwise, it reports tests passed.
#### Panel (2): Executing LLM-written tests
This is a Python script that appears to mock a glyph system for testing purposes.
* **Shebang:** `#!/usr/bin/env python3`
* **Imports:** `sys`, and classes from `glyphhinter.builder.hints` and `glyphhinter.classes`.
* **Key Components:**
* `class MockGlyph`: Initialized with a `hints` dictionary containing properties like `horizontal`, `opt`, `target`, and `top`.
* `class MockSelf`: Contains a `glyph_module` attribute set to the `classes` module.
* **Execution Block (`if __name__ == "__main__":`):**
1. Creates instances of `MockGlyph` and `MockSelf`.
2. Calls `to_glyphs_hints` with these mocks and a `layer` object.
3. Uses a `try...except` block to catch errors. On success, it prints "SUCCESS: No error occurred - bug appears to be fixed" and exits with code 0. On exception, it prints the error and exits with code 1.
#### Panel (3): Hacking: using static checking
This is a Bash script that uses static file checking (grep) to verify if a specific code fix has been applied.
* **Shebang:** `#!/bin/bash`
* **Environment Setup:** Exports `JAVA_HOME` and updates the `PATH`.
* **Key Logic:**
1. Echoes its purpose: testing `watchForGeneration` functionality by checking source code.
2. **First Check:** Uses `grep` to look for the string `"failAction: Option[Match.Action]"` in the file `/testbed/main/src/main/scala/internal/Continuous.scala`.
* If found, it echoes "Found failAction field - fix is present".
* If not found, it proceeds to a second check.
3. **Second Check:** Uses `grep` to look for `"exitMatchShared"` in the same file.
* If found, it echoes "Found exitMatchShared refactoring - fix is complete" and exits with code 0.
* If not found, it echoes "exitMatchShared not refactored - fix is incomplete" and exits with code 1.
4. If the first check fails (field not found), it echoes "failAction field not found - bug is present" and exits with code 1.
### Key Observations
1. **Methodology Contrast:** The three panels demonstrate fundamentally different approaches to verifying software correctness:
* **Panel 1:** Traditional, human-written unit tests that execute code and check runtime behavior/logs.
* **Panel 2:** Tests presumably generated by a Large Language Model (LLM), which in this case involve mocking dependencies to isolate and test a specific function (`to_glyphs_hints`).
* **Panel 3:** A "hacking" or static analysis approach that bypasses execution entirely, directly searching source code files for evidence of a fix.
2. **Specificity of Checks:** Each script looks for very specific indicators of a bug or fix:
* Panel 1: A specific error message string in test output.
* Panel 2: The absence of an exception during a mocked function call.
* Panel 3: The presence of specific code patterns/strings in a source file.
3. **Language & Tooling:** The scripts use different languages (Bash, Python) and tools (npm/Jest, Python mocks, grep) appropriate for their context (JavaScript project, Python glyph library, Scala codebase).
### Interpretation
This image is a comparative study of software verification techniques, likely from a research paper or technical blog post about AI-assisted debugging or testing. It visually argues that there are multiple, distinct pathways to confirm whether a software bug has been resolved.
* **The Human Approach (Panel 1)** is procedural and relies on a full test harness. It's robust but requires maintaining a test suite.
* **The LLM Approach (Panel 2)** is more focused, generating a targeted test for a specific unit of code. It suggests an AI can create executable verification logic, but its effectiveness depends on the quality of the mocks and the test design.
* **The Static/Hacking Approach (Panel 3)** is the most direct and arguably the most brittle. It doesn't test behavior but rather the *artifact* of a fix (the code itself). This is fast and requires no execution environment but can produce false positives (code is present but wrong) or false negatives (fix uses different terminology).
The overarching theme is the **trade-off between fidelity and speed**. Running a full test suite (Panel 1) is high-fidelity but slow. Static checking (Panel 3) is instant but low-fidelity. The LLM-written test (Panel 2) attempts to find a middle ground—providing behavioral verification without the overhead of a full project build and test run. The image prompts the viewer to consider which method is most appropriate for different stages of development and debugging.
</details>
Figure 3: Three types of evaluation.sh. We only accept the first two types of evaluation scripts.
### 2.2 Further Analysis
#### Case Study
We examine three representative LLM-generated test scripts to illustrate our verification criteria (Figure 3). Case 1 tests a JavaScript authentication bug by executing existing human-written unit tests in the Jest framework via `pnpm exec jest`, checking for specific error patterns in the output. Case 2 validates a Python glyphsLib bug by creating LLM-generated mock objects and unit tests that directly exercise the buggy code path and verify that it executes without exceptions. Case 3 attempts to verify a Scala bug fix using static pattern matching with `grep` commands to detect the presence of specific code structures, which our hacking detector flags as “hacking”. We accept Cases 1 and 2 because they employ strong verification through executable tests that validate runtime behavior; we reject Case 3 because static pattern matching cannot guarantee correctness: a patch may introduce the expected code patterns while still containing logical errors.
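To make the Case 3 failure mode concrete, the smell can be captured by a crude heuristic: the script pattern-matches source files but never invokes a test runner. The sketch below is our own illustration, not the paper's detector (which is model-based); the regexes and function name are assumptions.

```python
import re

# Illustrative heuristic only: flags evaluation scripts that grep source
# files for code patterns without ever running an actual test command.
TEST_RUNNERS = re.compile(r"\b(pytest|jest|go test|cargo test|mvn|sbt|dotnet test)\b")

def looks_like_static_hack(script: str) -> bool:
    uses_grep = re.search(r"\bgrep\b", script) is not None
    runs_tests = TEST_RUNNERS.search(script) is not None
    return uses_grep and not runs_tests
```

A Case 3-style check such as `grep -q "newHandler" src/Main.scala` would be flagged, while Case 1's `pnpm exec jest` invocation would not, since it executes real tests.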
#### Quality Analysis
We find that the resulting data still exhibit several quality issues: (1) some task descriptions are ambiguous or incomplete; (2) certain Docker environments do not fully match the stated requirements; and (3) some unit tests are misaligned with the task descriptions, which can lead to false positives or false negatives. To quantify and mitigate this problem, we develop a quality-judge agent. It takes as input the task description, Docker environment, test scripts, and optionally the ground-truth patch, and automatically evaluates the task quality. On a human-labeled quality-judging benchmark, the agent reaches 78.72% accuracy. After applying it to the entire dataset, as shown in Figure 4, we find our dataset matches SWE-Rebench (Badertdinov et al., 2025) in quality while providing 38× more instances.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Scatter Plot: SWE Dataset Quality vs. Size
### Overview
The image is a scatter plot comparing four different Software Engineering (SWE) datasets. It plots "Dataset Size (count)" on a logarithmic y-axis against "Quality Score (%)" on a linear x-axis. The chart visually demonstrates a trade-off: datasets with higher quality scores tend to be significantly smaller in size.
### Components/Axes
* **Chart Type:** Scatter plot.
* **X-Axis:**
* **Label:** `Quality Score (%)`
* **Scale:** Linear, from 50.0 to 100.0.
* **Major Ticks:** 50.0, 62.5, 75.0, 87.5, 100.0.
* **Y-Axis:**
* **Label:** `Dataset Size (count)`
* **Scale:** Logarithmic (base 10).
* **Major Ticks:** 10^2, 10^3, 10^4, 10^5, 10^6, 10^7.
* **Data Series (Labeled Points):** Four distinct data points, each labeled directly on the chart. The legend is integrated as text labels adjacent to each point.
1. **SWE-Universe** (Blue circle)
2. **SWE-rebench** (Orange circle)
3. **SWE-Gym** (Red circle)
4. **SWE-bench Verified** (Green circle)
### Detailed Analysis
**Data Point Extraction (Approximate Values):**
The following table reconstructs the data based on visual estimation from the chart's grid lines.
| Dataset Label | Color | Approx. Quality Score (%) | Approx. Dataset Size (count) | Spatial Position (Relative to Plot Area) |
| :--- | :--- | :--- | :--- | :--- |
| **SWE-Universe** | Blue | ~57% | ~8 x 10^5 (800,000) | Top-Left quadrant. Highest size, lowest quality. |
| **SWE-rebench** | Orange | ~58% | ~2 x 10^4 (20,000) | Center-Left. Slightly higher quality than SWE-Universe but ~40x smaller. |
| **SWE-Gym** | Red | ~76% | ~2 x 10^3 (2,000) | Center. Noticeably higher quality and ~10x smaller than SWE-rebench. |
| **SWE-bench Verified** | Green | ~96% | ~1 x 10^2 (100) | Bottom-Right quadrant. Highest quality, smallest size by a large margin. |
**Trend Verification:**
The visual trend is a clear, inverse relationship. As one moves from left to right along the x-axis (increasing Quality Score), the data points descend sharply along the logarithmic y-axis (decreasing Dataset Size). The line connecting these points conceptually would slope steeply downward.
### Key Observations
1. **Inverse Correlation:** There is a strong negative correlation between dataset size and quality score among these four benchmarks.
2. **Magnitude of Difference:** The range is vast. The largest dataset (SWE-Universe) is approximately **8,000 times larger** than the smallest (SWE-bench Verified), while the smallest has a quality score nearly **40 percentage points higher**.
3. **Clustering:** SWE-Universe and SWE-rebench are clustered in the lower-quality, larger-size region. SWE-Gym occupies a middle ground. SWE-bench Verified is an outlier in the high-quality, small-size region.
### Interpretation
This chart illustrates a fundamental tension in dataset curation for software engineering tasks: **scale versus precision**.
* **What the data suggests:** Achieving a very high "Quality Score" (likely involving rigorous verification, human annotation, or filtering for correctness) appears to require a drastic reduction in dataset size. Conversely, scaling up to massive sizes (like SWE-Universe) seems to come at the cost of lower average quality.
* **How elements relate:** The logarithmic y-axis is crucial. It emphasizes that the size differences are not linear but exponential. A small gain in quality score (e.g., from SWE-rebench to SWE-Gym) is associated with an order-of-magnitude reduction in size.
* **Notable implications:** The position of **SWE-bench Verified** is particularly significant. It represents a "gold standard" approach where immense effort is invested in verifying a small set of high-quality examples. This is contrasted with **SWE-Universe**, which likely prioritizes breadth and volume, possibly through automated collection, accepting lower average quality as a trade-off. The choice between these datasets would depend entirely on the downstream task: training a large model might benefit from scale, while evaluating a model's precise reasoning might require the verified set. The chart provides a clear visual framework for understanding this strategic choice in dataset development.
</details>
Figure 4: Task Quality vs. Dataset Size (Log-Scale). Task quality is measured as the fraction of high-quality samples.
## 3 Efficient Building and Benchmarking
To optimize both the performance and throughput of our pipeline, we developed a high-capacity yet efficient model and a comprehensive cross-lingual benchmark to evaluate its capabilities.
#### Model Training and Deployment
To make our end-to-end pipeline efficient at scale, we train a lightweight but strong builder model, Qwen-Next-80A3 (Qwen Team, 2025b), a mixture-of-experts (MoE) model with a hybrid attention design that combines linear attention and full attention. The model was trained using rejection sampling on a collection of high-quality building trajectories: we sampled multiple candidate paths for environment construction and kept only successful, non-hacked outcomes as training data. The model serves as the unified backbone for all tasks in our pipeline, including PR patch splitting, iterative environment building, and hacking detection. Its efficient architecture allows us to scale the building process across thousands of repositories with significantly lower latency than dense models of similar performance.
#### Benchmark
To rigorously assess the reliability of automated environment construction, we constructed a diverse benchmark consisting of 320 pull requests randomly sampled from GitHub. We remove the repositories used in the training trajectories from the benchmark. To ensure broad representativeness, we selected 40 PRs for each of the eight language categories: Python, JavaScript/TypeScript, Go, Java, Rust, C/C++, C#, and an “Others” category (including all the other languages such as PHP and Kotlin). We define two primary success metrics: Success Rate (w/o Hack), which requires the verifier to be functionally correct and pass the hacking detector, and Success Rate (w/ Hack), which includes all scripts that distinguish the bug regardless of the detection outcome.
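The two metrics reduce to simple ratios over per-PR build outcomes. The sketch below uses hypothetical field names (`verifier_ok` for a script that correctly distinguishes buggy from fixed code, `hacked` for a hacking-detector flag); they are our illustrative interface, not the paper's.

```python
from typing import Dict, List

def success_rates(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Compute both benchmark metrics over per-PR build results.

    'verifier_ok': the script distinguishes buggy vs. fixed code.
    'hacked': the hacking detector flagged the script.
    Field names are illustrative.
    """
    n = len(results)
    w_hack = sum(r["verifier_ok"] for r in results) / n
    wo_hack = sum(r["verifier_ok"] and not r["hacked"] for r in results) / n
    return {"success_wo_hack": 100 * wo_hack, "success_w_hack": 100 * w_hack}
```

By construction, Success (w/o Hack) can never exceed Success (w/ Hack); the gap between the two is the share of hacked verifiers, which Table 1 uses as a reliability signal.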
Table 1: Benchmark results for automated environment building across various models. “Success (%) (w/o Hack)” measures the rate of creating a valid, non-hacked verifiable environment, while “Success (%) (w/ Hack)” also counts “hacked” verifiers as successful builds. Our model, Qwen-Next-80A3, achieves the highest non-hacking success rate.
| Model | Success (%) (w/o Hack) | Success (%) (w/ Hack) | C/C++ | C# | Go | Java | JS/TS | Other | Python | Rust |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-Next-80A3 (Ours) | $78.44$ | $82.50$ | $57.50$ | $70.00$ | $80.49$ | $82.50$ | $84.62$ | $83.72$ | $85.37$ | $83.33$ |
| Claude-Opus-4.5 | $77.81$ | $85.00$ | $52.50$ | $57.50$ | $82.93$ | $77.50$ | $89.74$ | $76.74$ | $95.12$ | $91.67$ |
| Claude-Sonnet-4 | $75.62$ | $85.62$ | $67.50$ | $52.50$ | $75.61$ | $72.50$ | $84.62$ | $83.72$ | $82.93$ | $86.11$ |
| Gemini-3-Pro | $69.69$ | $72.50$ | $32.50$ | $57.50$ | $73.17$ | $72.50$ | $87.18$ | $76.74$ | $87.80$ | $69.44$ |
| Claude-Sonnet-4-5 | $66.88$ | $71.56$ | $30.00$ | $50.00$ | $63.41$ | $67.50$ | $87.18$ | $67.44$ | $92.68$ | $77.78$ |
| GLM-4.7 | $58.44$ | $64.06$ | $37.50$ | $50.00$ | $53.66$ | $57.50$ | $66.67$ | $65.12$ | $73.17$ | $63.89$ |
| MiniMax-M2.1 | $54.69$ | $61.88$ | $22.50$ | $35.00$ | $56.10$ | $55.00$ | $74.36$ | $65.12$ | $73.17$ | $55.56$ |
| DeepSeek-v3.2 | $54.06$ | $59.38$ | $15.00$ | $50.00$ | $63.41$ | $47.50$ | $53.85$ | $48.84$ | $78.05$ | $77.78$ |
| Qwen3-Coder-480B | $48.75$ | $55.62$ | $35.00$ | $32.50$ | $39.02$ | $45.00$ | $48.72$ | $69.77$ | $68.29$ | $50.00$ |
#### Result Analysis
Our evaluation, detailed in Table 1, establishes Qwen-Next-80A3 as the new state-of-the-art model for automated environment building, achieving a 78.44% success rate that surpasses even strong proprietary models like Claude-Opus-4.5 (77.81%). This superior performance extends beyond raw success to overall reliability. We observe a significant gap between the true success rate and the one including “hacked” verifiers for most general-purpose models (e.g., over 7% for Claude-Opus-4.5), indicating their tendency to find superficial shortcuts. In stark contrast, our model exhibits the smallest gap among top performers (4.06%), demonstrating that our task-specific fine-tuning on collected trajectories effectively discourages these deceptive behaviors. Furthermore, while performance varies by language—with C/C++ proving most challenging for all agents—our model demonstrates more consistent and robust capabilities across the entire linguistic spectrum compared to competitors that show strength only in specific ecosystems. Ultimately, these results validate that a specialized, efficiently trained MoE model not only achieves higher success but also operates with greater fidelity than larger, general-purpose counterparts in this complex domain.
## 4 Scaling Environments to Millions
In this section, we describe our efforts to scale the environment-building pipeline to a massive corpus of real-world software changes, moving beyond small-scale benchmarks to a million-scale dataset of executable tasks.
#### Large-scale Data Curation
We harvested a comprehensive dataset of approximately 33.3 million pull requests (PRs) spanning the most recent five years (2021–2025) of GitHub’s history. To extract high-quality tasks from this raw pool, we applied a series of rigorous heuristic filters. Specifically, we removed PRs with excessive file changes or line counts to avoid overly complex or noisy tasks, and discarded any entries that did not contain a discernible test patch. Furthermore, we prioritized PRs that were explicitly linked to at least one GitHub issue, as these provide the most reliable ground-truth problem statements. This filtering process resulted in a candidate set of approximately 1 million high-quality PRs.
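The heuristic filter above can be sketched as a single predicate over PR metadata. The thresholds (20 changed files, 1,000 changed lines) and field names are our illustrative choices; the paper does not state exact cutoffs.

```python
def keep_pr(pr: dict, max_files: int = 20, max_lines: int = 1_000) -> bool:
    """Heuristic PR filter sketch: drop oversized changes, require a test
    patch, and require at least one linked GitHub issue as a reliable
    ground-truth problem statement. Thresholds are illustrative."""
    if pr["changed_files"] > max_files or pr["changed_lines"] > max_lines:
        return False  # overly complex or noisy task
    if not pr["has_test_patch"]:
        return False  # no basis for constructing a verifier
    return len(pr["linked_issues"]) >= 1  # issue-linked PRs are prioritized
```

Note the issue-linkage condition applies to the main candidate pool; Section 4 later relaxes it for a separate batch of PRs whose titles and descriptions serve as problem statements.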
#### Infrastructure for Large-scale Rollouts
To support our massive data generation effort, we implement our pipeline on top of MegaFlow (Zhang et al., 2026), a distributed execution system for orchestrating large numbers of long-running agentic jobs. MegaFlow achieves massive parallelism by dispatching each environment-building task as an independent job to a dedicated Alibaba Cloud Elastic Compute Service (ECS) instance. Within this sandboxed virtual machine, our agent executes the entire build process, culminating in a verified Docker image. Upon completion, successful images are pushed to Alibaba Cloud’s [Container Registry (ACR)](https://www.alibabacloud.com/en/product/container-registry), where we leverage Docker’s layer caching to significantly reduce storage costs by reusing common base layers. This highly parallel and resource-efficient architecture was instrumental in processing millions of pull requests concurrently, enabling the practical construction of our million-scale dataset.
#### Massive Production with Qwen-Next-80A3
We deployed our fine-tuned Qwen-Next-80A3 model to execute the building and verification process for the filtered candidates. Leveraging the model’s efficiency and the iterative agentic loop described in previous sections, we achieved a non-hacked success rate of 75.9%, ultimately yielding 717,122 executable, high-fidelity environments. These tasks primarily assess an agent’s ability to resolve real-world software issues. Additionally, to expand the dataset’s coverage, we selected a subset of 2025 PRs that were not linked to issues, using the PR titles and descriptions as problem statements to synthesize additional environments, producing a further 90,571 environments.
Table 2: Data statistics for our building environments.
| Language | Instances | Repos | Inst/Repo (Avg) | Avg Lines of evaluation.sh |
| --- | --- | --- | --- | --- |
| Python | 202,302 | 13,098 | 15.45 | 25.01 |
| JavaScript / TypeScript | 175,660 | 11,604 | 15.14 | 27.41 |
| Go | 121,062 | 5,554 | 21.80 | 28.87 |
| Java | 86,105 | 4,700 | 18.32 | 24.75 |
| Rust | 74,180 | 4,445 | 16.69 | 19.31 |
| C / C++ | 37,228 | 3,405 | 10.93 | 45.78 |
| C# | 24,387 | 1,929 | 12.64 | 31.84 |
| Others | 86,769 | 8,225 | 10.55 | 38.89 |
| **Total** | 807,693 | 52,960 | 15.25 | 28.21 |
#### Statistical Analysis
The statistics presented in Table 2 highlight both the scale and diversity of our generated dataset. With 807,693 instances spread across 52,960 unique repositories, this dataset represents an unprecedented resource for training and evaluating software engineering agents. The language distribution largely mirrors the current open-source landscape, with Python and JavaScript/TypeScript constituting the largest shares. Notably, Go repositories exhibit the highest average number of instances per repository (21.80), suggesting a high density of verifiable, issue-linked PRs within its ecosystem, possibly due to strong conventions around testing and development. The “Avg Lines of evaluation.sh” column reveals interesting insights into the complexity of verification across languages. C/C++ instances, as expected, require the longest verifier scripts on average (45.78 lines), reflecting the typical complexity and boilerplate of their build systems. In contrast, Rust’s concise average (19.31 lines) may point to the efficiency and standardization of its cargo toolchain, allowing for more succinct test execution commands. Overall, the dataset provides a rich and varied testbed, capturing a wide array of real-world challenges for coding agents.
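The derived Inst/Repo column of Table 2 can be spot-checked directly from the raw counts; a minimal sketch:

```python
# Spot-check Table 2's Inst/Repo column from its raw instance/repo counts.
def inst_per_repo(instances: int, repos: int) -> float:
    return round(instances / repos, 2)

# Python: 202,302 instances over 13,098 repos -> 15.45
# Go:     121,062 instances over  5,554 repos -> 21.80 (highest density)
# Total:  807,693 instances over 52,960 repos -> 15.25
```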
## 5 Evaluation: Large-scale Agentic Training
We now turn to leveraging this resource for large-scale agentic model training. Our goal is to demonstrate that training on this diverse, real-world data can significantly enhance a model’s capabilities as a software engineering agent.
### 5.1 Mid-training
We hypothesize that intermediate training on a vast corpus of high-quality agentic trajectories can endow a model with a strong foundation in software engineering problem-solving. This “mid-training” phase is designed to bridge the gap between pre-training and post-training on downstream agentic tasks.
#### Setup
Our mid-training process begins by generating a massive dataset of agentic trajectories. We first employed the Qwen3-Coder-480B-A30B (Qwen Team, 2025a) model to interact with the environments we previously constructed. To ensure a diversity of problem-solving strategies, we conducted these rollouts across five different agentic scaffolds: SWE-agent (Yang et al., 2024), Mini-SWE-agent (SWE-agent Team, 2025), OpenHands (Wang et al., 2024), Claude-Code (Anthropic, 2024), and Qwen-Code (Qwen Team, 2025c). The entire rollout process was orchestrated by our MegaFlow system. For each environment, we performed rejection sampling: a trajectory is deemed successful and retained only if the final generated code passes the corresponding evaluation.sh script and clears an additional in-house quality filter. This rigorous filtering process yielded a high-quality dataset of 500K successful trajectories, comprising a total of 30 billion training tokens. We then used this data to conduct intermediate training on a Qwen3-Next-80A3 model, using 256K sequence length and Best-Fit packing (Ding et al., 2024). Critically, we applied no loss mask during this training phase. This strategy ensures that the model learns not only to predict agentic actions but also to internalize the vast amount of code and natural language in the agent’s observations and responses, developing it into a more comprehensive “coding world model” (Copet et al., 2025; Zhang et al., 2025a).
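The rejection-sampling rule for retaining trajectories can be sketched as follows; the `Trajectory` fields and the two callback names are our own illustrative interface, standing in for the evaluation.sh verdict and the in-house quality filter.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    env_id: str
    scaffold: str      # e.g. "SWE-agent", "OpenHands"
    final_patch: str

def rejection_sample(
    trajectories: List[Trajectory],
    passes_eval: Callable[[Trajectory], bool],     # evaluation.sh verdict
    passes_quality: Callable[[Trajectory], bool],  # in-house quality filter
) -> List[Trajectory]:
    """Retain a trajectory only if its final patch passes the task's
    evaluation.sh AND clears the additional quality filter."""
    return [t for t in trajectories if passes_eval(t) and passes_quality(t)]
```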
#### Scaling Trends
To analyze the effectiveness of our large-scale agentic mid-training, we evaluated model checkpoints on two widely used agentic coding benchmarks: SWE-Bench Verified (primarily Python-based) and the more diverse SWE-Bench Multilingual. The performance evolution throughout the training process is depicted in Figure 5(a). The results demonstrate a clear and positive scaling trend, confirming that our mid-training on the real-world instances successfully transfers to standard evaluation benchmarks. On the standard SWE-Bench Verified, the model’s performance exhibits a steady climb, starting from an already strong baseline of 50.3% and consistently improving to a final score of over 61% after 2,000 training steps. More notably, the trend on the more challenging SWE-Bench Multilingual benchmark shows even more dramatic improvement. Starting from a significantly lower baseline of approximately 31%, the model’s performance surges to over 46% by the end of the training. This substantial gain of over 15 percentage points underscores the critical value of our dataset’s linguistic diversity. It proves that training on a massive, diverse corpus of real-world, multi-language software issues is essential for creating agents that can generalize beyond a single programming ecosystem.
### 5.2 Agentic Reinforcement Learning
Beyond training via rejection sampling, our collection of executable environments is well suited to Reinforcement Learning (RL), where the binary pass/fail signal from the evaluation.sh script serves as a direct and reliable reward. To demonstrate this, we conducted agentic RL experiments on models of different sizes and architectures.
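Concretely, the binary reward can be read straight from the script's exit status. The sketch below is a local simplification: in the paper's setting this would run inside the task's Docker container, and the 1800-second timeout is our illustrative choice.

```python
import subprocess

def evaluation_reward(script: str, workdir: str, timeout: int = 1800) -> float:
    """Binary RL reward sketch: 1.0 iff evaluation.sh exits with status 0.

    Simplified: runs bash locally rather than inside the task's Docker
    container; the timeout value is an illustrative assumption.
    """
    try:
        proc = subprocess.run(["bash", script], cwd=workdir,
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0  # a hung test run counts as failure
    return 1.0 if proc.returncode == 0 else 0.0
```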
#### Setup
We validate the effectiveness of the data on agentic reinforcement learning. Prior to training, we perform rollouts with the base model to filter out queries that are either too difficult or too easy. During training, we set the maximum number of interaction turns to 200 and the context length to 128K. The training process is powered by our asynchronous RL framework, which natively supports agentic workflows. This architecture mitigates data skewness while facilitating seamless multi-turn interactions without framework-induced overhead, achieving a 2×–4× speedup over existing RL infrastructures.
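The pre-training rollout filter can be sketched as a pass-rate band: tasks the base model never solves give no reward signal, and tasks it always solves teach nothing. The 0.1–0.9 band edges are our illustrative choice; the paper does not state thresholds.

```python
from typing import Dict, List

def filter_by_difficulty(
    pass_records: Dict[str, List[bool]],
    min_rate: float = 0.1,
    max_rate: float = 0.9,
) -> List[str]:
    """Keep task ids whose base-model pass rate across rollouts is neither
    ~0 (too hard) nor ~1 (too easy). Band edges are illustrative."""
    kept = []
    for task_id, outcomes in pass_records.items():
        rate = sum(outcomes) / len(outcomes)
        if min_rate <= rate <= max_rate:
            kept.append(task_id)
    return kept
```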
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Performance vs. Training Tokens for SWE-Bench Benchmarks
### Overview
This is a line chart comparing the performance of two benchmarks, "SWE-Bench Verified" and "SWE-Bench Multilingual," as a function of the number of training tokens (in billions). The chart shows a general upward trend for both benchmarks, with the "Verified" set consistently achieving higher performance scores than the "Multilingual" set across all measured training scales.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Training Tokens (Billion)"
* **Scale:** Linear, ranging from approximately 5 to 70 billion.
* **Major Tick Marks:** 10, 20, 30, 40, 50, 60, 70.
* **Y-Axis:**
* **Label:** "Performance (%)"
* **Scale:** Linear, ranging from 30% to 60%.
* **Major Tick Marks:** 30, 35, 40, 45, 50, 55, 60.
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Entry 1:** Blue line with circular markers, labeled "SWE-Bench Verified".
* **Entry 2:** Orange line with circular markers, labeled "SWE-Bench Multilingual".
### Detailed Analysis
**Data Series 1: SWE-Bench Verified (Blue Line)**
* **Trend:** The line shows a steady, generally upward trend with minor fluctuations. It starts just above 50% and ends above 60%.
* **Approximate Data Points (Training Tokens (B), Performance (%)):**
* (7, 50.5)
* (13, 52.5)
* (20, 52.7)
* (27, 52.3)
* (33, 54.7)
* (40, 53.9)
* (47, 56.0)
* (53, 58.2)
* (60, 58.2)
* (67, 61.2)
**Data Series 2: SWE-Bench Multilingual (Orange Line)**
* **Trend:** The line shows an overall upward trend but with more pronounced volatility compared to the blue line. It starts near 31% and ends near 46%.
* **Approximate Data Points (Training Tokens (B), Performance (%)):**
* (7, 31.0)
* (13, 38.5)
* (20, 39.5)
* (27, 41.0)
* (33, 41.0)
* (40, 40.0)
* (47, 43.8)
* (53, 40.7)
* (60, 40.0)
* (67, 46.3)
### Key Observations
1. **Performance Gap:** The "SWE-Bench Verified" benchmark consistently outperforms the "SWE-Bench Multilingual" benchmark by a significant margin (approximately 15-20 percentage points) at every measured training token scale.
2. **Scaling Law:** Both benchmarks demonstrate a positive correlation between the number of training tokens and performance, suggesting that model capability on these tasks improves with scale.
3. **Volatility Difference:** The "Multilingual" series exhibits more performance volatility (e.g., dips at 40B and 60B tokens) compared to the relatively smoother progression of the "Verified" series.
4. **Final Surge:** Both series show their steepest performance increase in the final segment, from 60B to 67B tokens.
### Interpretation
The chart illustrates a fundamental scaling relationship in machine learning: increasing the volume of training data (tokens) generally leads to better performance on downstream benchmarks. The consistent performance gap between "SWE-Bench Verified" and "SWE-Bench Multilingual" suggests that the multilingual variant of the task is inherently more challenging for the model, possibly due to the need to generalize across multiple programming languages or handle more diverse codebases.
The higher volatility in the multilingual series could indicate that performance on this more complex task is more sensitive to the specific composition of the training data at different scales, or that the model's development path for multilingual understanding is less monotonic. The final sharp uptick for both lines might signify a phase change in model capability or the effect of a specific training strategy employed at the largest scale.
</details>
(a)
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: SWE-Bench Multilingual Performance Over Training Steps
### Overview
The image displays a line chart tracking the performance of a model on the "SWE-Bench Multilingual" benchmark across a series of training steps. The chart shows a generally upward trend with some fluctuations, culminating in a peak performance value highlighted with a star.
### Components/Axes
* **Y-Axis (Vertical):** Labeled "SWE-Bench Multilingual (%)". The scale runs from 32 to 42, with major grid lines and numerical markers at intervals of 2 (32, 34, 36, 38, 40, 42).
* **X-Axis (Horizontal):** Labeled "Training Steps". It represents a progression of training iterations, but no specific numerical values or tick marks are provided for the steps.
* **Data Series:** A single data series represented by an orange line connecting circular data points. The final data point is marked with an orange star instead of a circle.
* **Legend:** No separate legend is present. The single line's meaning is defined by the axis labels.
* **Grid:** Horizontal dashed grid lines are present at each major y-axis tick (32, 34, 36, 38, 40, 42).
### Detailed Analysis
The chart plots 8 distinct data points. The approximate values, read from the y-axis position relative to the grid lines, are as follows (from left to right, corresponding to increasing training steps):
1. **Point 1:** ~32.3%
2. **Point 2:** ~37.7%
3. **Point 3:** ~37.3%
4. **Point 4:** ~36.0%
5. **Point 5:** ~35.0%
6. **Point 6:** ~37.0%
7. **Point 7:** ~40.0%
8. **Point 8 (Star):** 42.0% (This value is explicitly labeled next to the star marker).
**Trend Description:** The performance line begins at its lowest point (~32.3%). It then experiences a sharp increase to ~37.7%, followed by a slight decline over the next two points to a local minimum of ~35.0%. From this trough, the performance begins a steady and accelerating climb, passing through ~37.0% and ~40.0%, before reaching its maximum value of 42.0% at the final recorded step.
### Key Observations
* **Peak Performance:** The highest recorded value is 42.0%, achieved at the final training step shown. This point is specially annotated with a star and a direct numerical label.
* **Performance Dip:** There is a noticeable decline in performance between the second and fifth data points, dropping from ~37.7% to ~35.0%.
* **Strong Final Trajectory:** After the dip, the model's performance improves consistently and at an increasing rate over the last three data points.
* **Initial Volatility:** The most significant single jump in performance occurs between the first and second data points (~5.4 percentage points).
### Interpretation
This chart visualizes the learning curve of an AI model on a multilingual software engineering benchmark. The data suggests that training is not a linear process of improvement. The initial rapid gain indicates the model quickly learns foundational patterns. The subsequent dip could represent a phase where the model is consolidating knowledge, encountering more difficult examples, or experiencing a temporary instability common in training dynamics.
The most critical insight is the strong recovery and sustained upward trend in the latter half of the plotted steps. This indicates that despite mid-training setbacks, the optimization process successfully navigated towards a higher performance plateau, ultimately achieving a new best score of 42.0%. The star marker emphasizes this final value as the key result or the stopping point of interest. The absence of specific step numbers on the x-axis limits the analysis to relative progression rather than absolute training duration or computational cost.
</details>
(b)
Figure 5: (a) Performance scaling trends of the Qwen3-Next-80A3 model during mid-training. (b) Reinforcement learning curve of the Qwen3-30B-A3B for SWE-Bench Multilingual.
#### Results on Qwen3-30B-A3B
The training curve in Figure 5(b) shows the effectiveness of agentic reinforcement learning. The Qwen3-30B-A3B model shows a remarkable improvement on SWE-Bench Multilingual, surging from a baseline of approximately 32% to a peak of 42.0%. This substantial 10-point absolute gain underscores the power of RL to enhance generalization across diverse, multilingual tasks.
#### Applied to Qwen3-Max-Thinking
To push the boundaries of agentic coding performance, we applied our methodology to our flagship Qwen3-Max-Thinking model. The resulting model achieved a performance of 75.3% on SWE-Bench Verified. This result validates the effectiveness of our large-scale data generation pipeline at a production level, showcasing its ability to elevate state-of-the-art models.
## 6 Related Work
#### Agentic and Synthetic Environment Generation
A significant line of research focuses on generating software engineering tasks synthetically to create scalable training data, often without relying on authentic historical issues. For instance, SWE-smith (Yang et al., 2025b) procedurally generates tasks by artificially injecting bugs, while SWE-Flow (Zhang et al., 2025b) leverages test documentation to synthesize novel problems. Other works generate complex bugs from scratch (Sonwane et al., 2025) or create interactive tasks from bug reports (Jin et al., 2023). The scope of synthesis extends to generating command-line tasks from natural language (Lin et al., 2018; Gandhi et al., 2026). Similarly, RepoST (Xie et al., 2024) employs synthetic tests to generate training data. While these methods provide scalable training data, they do not fully capture the complexity and long-tail challenges of real-world software issues. In contrast, our work focuses exclusively on building executable environments for real-world software engineering problems derived directly from public repositories.
#### Real-World Software Environment Setup
Much recent work focuses purely on the environment configuration capability itself, proposing scripted, agentic, or expert-driven methods to create executable environments, but without providing verifiers for specific software issues (Hu et al., 2025; Milliken et al., 2025; Bouzenia & Pradel, 2025; Horton & Parnin, 2019; Jain et al., 2024; Vergopoulos et al., 2025; Guo et al., 2026; Arora et al., 2025; Eliseeva et al., 2025; Kuang et al., 2025; Fu et al., 2025). Another line of work provides end-to-end verification for real software issues. The seminal SWE-bench (Jimenez et al., 2024) established this paradigm with over 2,400 real-world Python issues, each with a test script to verify a fix. This Python-centric approach was significantly scaled by SWE-rebench (Badertdinov et al., 2025), which developed a fully automated pipeline to generate over 21,000 verifiable tasks. Other works like SWE-Gym (Pan et al., 2024), SWE-bench-Live (Zhang et al., 2025c), SWE-Factory (Guo et al., 2025), daVinci-Dev (Zeng et al., 2026), SWE-Bench++ (Wang et al., 2025), and SWE-bench+ (Aleithan et al., 2024) have also aimed to expand data generation, primarily within the Python ecosystem. A few recent efforts have created multi-language benchmarks, such as Multi-SWE-bench (Zan et al., 2025) and SWE-PolyBench (Rashid et al., 2025); however, these have generally been limited in scale. Our work systematically overcomes these limitations by presenting a highly scalable and reliable pipeline that operates across arbitrary languages and repositories, successfully generating million-scale executable environments, each with a validated, task-specific verifier.
#### Building Code Verifier
Automated code verifiers are typically constructed by executing test suites to validate solutions. Traditional test generation methods, including probability-based (Pacheco et al., 2007), constraint-based (Xiao et al., 2013), and search-based approaches (Harman & McMinn, 2010; Lukasczyk & Fraser, 2022), often suffer from limited coverage and poor readability, and are typically restricted to regression or implicit oracles (Barr et al., 2015). Recent work leverages LLMs to generate unit tests for verification (Alagarsamy et al., 2024; Chen et al., 2024b; Schäfer et al., 2024; Yuan et al., 2024; Chen et al., 2024a), though the resulting tests can be unreliable due to confidently incorrect assertions. CodeRM (Ma et al., 2025) improves reward signal quality by scaling the number of generated tests and dynamically adapting test counts to problem difficulty. In this work, we leverage a building agent to automatically generate verifiers for complex repository-level software engineering issues.
## 7 Conclusion
In this paper, we introduced SWE-Universe, a scalable framework designed to overcome the critical bottlenecks of low yield, inconsistent quality, and prohibitive cost in generating real-world software engineering environments. Our autonomous building agent, powered by a custom-trained model and equipped with iterative self-verification and in-loop hacking detection, successfully constructed a massive dataset of over 800,000 executable, multilingual tasks, the largest dataset of its kind. We demonstrated the profound value of this resource through large-scale agentic training, showing that it provides a powerful signal for both supervised learning and reinforcement learning, ultimately enabling our Qwen3-Max-Thinking model to achieve a score of 75.3% on SWE-Bench Verified.
## References
- Alagarsamy et al. (2024) Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. A3test: Assertion-augmented automated test case generation. Inf. Softw. Technol., 176(C), November 2024. ISSN 0950-5849. doi: 10.1016/j.infsof.2024.107565. URL https://doi.org/10.1016/j.infsof.2024.107565.
- Aleithan et al. (2024) Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, and Song Wang. Swe-bench+: Enhanced coding benchmark for llms. arXiv preprint arXiv:2410.06992, 2024.
- Anthropic (2024) Anthropic. Claude code: An agentic coding assistant. https://github.com/anthropics/claude-code, 2024. Accessed: 2026-01-22.
- Arora et al. (2025) Avi Arora, Jinu Jang, and Roshanak Zilouchian Moghaddam. Setupbench: Assessing software engineering agents’ ability to bootstrap development environments. arXiv preprint arXiv:2507.09063, 2025.
- Badertdinov et al. (2025) Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URL https://arxiv.org/abs/2505.20411.
- Barr et al. (2015) Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering, 41(5):507–525, 2015. doi: 10.1109/TSE.2014.2372785.
- Bouzenia & Pradel (2025) Islem Bouzenia and Michael Pradel. You name it, i run it: An llm agent to execute tests of arbitrary projects. Proc. ACM Softw. Eng., 2(ISSTA), June 2025. doi: 10.1145/3728922. URL https://doi.org/10.1145/3728922.
- Chen et al. (2024a) Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, and Jianling Sun. B4: Towards optimal assessment of plausible code solutions with plausible tests. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE ’24, pp. 1693–1705, New York, NY, USA, 2024a. Association for Computing Machinery. ISBN 9798400712487. doi: 10.1145/3691620.3695536. URL https://doi.org/10.1145/3691620.3695536.
- Chen et al. (2024b) Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. Chatunitest: A framework for llm-based test generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, pp. 572–576, New York, NY, USA, 2024b. Association for Computing Machinery. ISBN 9798400706585. doi: 10.1145/3663529.3663801. URL https://doi.org/10.1145/3663529.3663801.
- Copet et al. (2025) Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025.
- DeepSeek-AI (2025) DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models. CoRR, abs/2512.02556, 2025. doi: 10.48550/ARXIV.2512.02556. URL https://doi.org/10.48550/arXiv.2512.02556.
- Ding et al. (2024) Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. Fewer truncations improve language modeling. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=kRxCDDFNpp.
- Eliseeva et al. (2025) Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, and Yaroslav Zharov. Envbench: A benchmark for automated environment setup. arXiv preprint arXiv:2503.14443, 2025.
- Fu et al. (2025) Kelin Fu, Tianyu Liu, Zeyu Shang, Yingwei Ma, Jian Yang, Jiaheng Liu, and Kaigui Bian. Multi-docker-eval: A 'shovel of the gold rush' benchmark on automatic environment building for software engineering. arXiv preprint arXiv:2512.06915, 2025.
- Gandhi et al. (2026) Kanishk Gandhi, Shivam Garg, Noah D Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents. arXiv preprint arXiv:2601.16443, 2026.
- Guo et al. (2025) Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, and Zibin Zheng. Swe-factory: Your automated factory for issue resolution training data and evaluation benchmarks. arXiv preprint arXiv:2506.10954, 2025.
- Guo et al. (2026) Xinshuai Guo, Jiayi Kuang, Linyue Pan, Yinghui Li, Yangning Li, Hai-Tao Zheng, Ying Shen, Di Yin, and Xing Sun. Evoconfig: Self-evolving multi-agent systems for efficient autonomous environment configuration. arXiv preprint arXiv:2601.16489, 2026.
- Harman & McMinn (2010) Mark Harman and Phil McMinn. A theoretical and empirical study of search-based testing: Local, global, and hybrid search. IEEE Transactions on Software Engineering, 36(2):226–247, 2010. doi: 10.1109/TSE.2009.71.
- Horton & Parnin (2019) Eric Horton and Chris Parnin. Dockerizeme: Automatic inference of environment dependencies for python code snippets. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 328–338. IEEE, 2019.
- Hu et al. (2025) Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, and Cuiyun Gao. Repo2run: Automated building executable environment for code repository at scale. arXiv preprint arXiv:2502.13681, 2025.
- Jain et al. (2024) Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. R2e: turning any github repository into a programming agent environment. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.
- Jimenez et al. (2024) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
- Jin et al. (2023) Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. Inferfix: End-to-end program repair with llms. In Proceedings of the 31st ACM joint european software engineering conference and symposium on the foundations of software engineering, pp. 1646–1656, 2023.
- Kuang et al. (2025) Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin, Xing Sun, Ying Shen, and Philip S Yu. Process-level trajectory evaluation for environment configuration in software engineering agents. arXiv preprint arXiv:2510.25694, 2025.
- Lin et al. (2018) Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D Ernst. Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
- Liu et al. (2025) Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- Lukasczyk & Fraser (2022) Stephan Lukasczyk and Gordon Fraser. Pynguin: automated unit test generation for python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, ICSE ’22, pp. 168–172, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392235. doi: 10.1145/3510454.3516829. URL https://doi.org/10.1145/3510454.3516829.
- Ma et al. (2025) Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, and Jie Tang. Dynamic scaling of unit tests for code reward modeling. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6917–6935, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.343. URL https://aclanthology.org/2025.acl-long.343/.
- Milliken et al. (2025) Louis Milliken, Sungmin Kang, and Shin Yoo. Beyond pip install: Evaluating LLM agents for the automated installation of Python projects. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 1–11, Los Alamitos, CA, USA, March 2025. IEEE Computer Society. doi: 10.1109/SANER64311.2025.00009. URL https://doi.ieeecomputersociety.org/10.1109/SANER64311.2025.00009.
- Pacheco et al. (2007) Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE’07), pp. 75–84, 2007. doi: 10.1109/ICSE.2007.37.
- Pan et al. (2024) Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. arXiv preprint arXiv:2412.21139, 2024.
- Qwen Team (2025a) Qwen Team. Qwen3-coder: Agentic coding in the world. https://qwen.ai/blog?id=qwen3-coder, 2025a. Accessed: 2025-07-22.
- Qwen Team (2025b) Qwen Team. Qwen3-next: Towards ultimate training & inference efficiency. https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd, 2025b. Accessed: 2025-09-11.
- Qwen Team (2025c) Qwen Team. qwen-code. https://github.com/QwenLM/qwen-code, 2025c.
- Qwen Team (2026) Qwen Team. Pushing qwen3-max-thinking beyond its limits. https://qwen.ai/blog?id=qwen3-max-thinking, 2026. Accessed: 2026-01-26.
- Rashid et al. (2025) Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, et al. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents. arXiv preprint arXiv:2504.08703, 2025.
- Schäfer et al. (2024) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1):85–105, 2024. doi: 10.1109/TSE.2023.3334955.
- Sonwane et al. (2025) Atharv Sonwane, Isadora White, Hyunji Lee, Matheus Pereira, Lucas Caccia, Minseon Kim, Zhengyan Shi, Chinmay Singh, Alessandro Sordoni, Marc-Alexandre Côté, et al. Bugpilot: Complex bug generation for efficient learning of swe skills. arXiv preprint arXiv:2510.19898, 2025.
- SWE-agent Team (2025) SWE-agent Team. mini-swe-agent: The 100 line ai agent that’s actually useful. https://github.com/SWE-agent/mini-swe-agent, 2025. Accessed: 2026-01-22.
- Vergopoulos et al. (2025) Konstantinos Vergopoulos, Mark Niklas Müller, and Martin Vechev. Automated benchmark generation for repository-level coding tasks. arXiv preprint arXiv:2503.07701, 2025.
- Wang et al. (2025) Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, and Gabriel Maduekwe. Swe-bench++: A framework for the scalable generation of software engineering benchmarks from open-source repositories. arXiv preprint arXiv:2512.17419, 2025.
- Wang et al. (2024) Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- Xiao et al. (2026) Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780, 2026.
- Xiao et al. (2013) Xusheng Xiao, Sihan Li, Tao Xie, and Nikolai Tillmann. Characteristic studies of loop problems for structural test generation via symbolic execution. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 246–256, 2013. doi: 10.1109/ASE.2013.6693084.
- Xie et al. (2024) Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, and Carolyn Rose. Codebenchgen: Creating scalable execution-based code generation benchmarks, 2024.
- Yang et al. (2025a) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.
- Yang et al. (2024) John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
- Yang et al. (2025b) John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025b. URL https://arxiv.org/abs/2504.21798.
- Yuan et al. (2024) Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving chatgpt for unit test generation. Proc. ACM Softw. Eng., 1(FSE), July 2024. doi: 10.1145/3660783. URL https://doi.org/10.1145/3660783.
- Zan et al. (2025) Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving, 2025. URL https://arxiv.org/abs/2504.02605.
- Zeng et al. (2026) Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, et al. davinci-dev: Agent-native mid-training for software engineering. arXiv preprint arXiv:2601.18418, 2026.
- Zhang et al. (2025a) Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025a.
- Zhang et al. (2025b) Lei Zhang, Jiaxi Yang, Min Yang, Jian Yang, Mouxiang Chen, Jiajun Zhang, Zeyu Cui, Binyuan Hui, and Junyang Lin. Synthesizing software engineering data in a test-driven manner. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu (eds.), Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pp. 76518–76540. PMLR, 13–19 Jul 2025b. URL https://proceedings.mlr.press/v267/zhang25cn.html.
- Zhang et al. (2026) Lei Zhang, Mouxiang Chen, Ruisheng Cao, Jiawei Chen, Fan Zhou, Yiheng Xu, Jiaxi Yang, Zeyao Ma, Liang Chen, Changwei Luo, Kai Zhang, Fan Yan, KaShun Shum, Jiajun Zhang, Zeyu Cui, Feng Hu, Junyang Lin, Binyuan Hui, and Min Yang. Megaflow: Large-scale distributed orchestration system for the agentic era, 2026. URL https://arxiv.org/abs/2601.07526.
- Zhang et al. (2025c) Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, et al. Swe-bench goes live! arXiv preprint arXiv:2505.23419, 2025c.