2405.15793v3

Model: gemini-3-flash-free

## SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering ## John Yang ∗ Carlos E. Jimenez ∗ Alexander Wettig Kilian Lieret Shunyu Yao Karthik Narasimhan Ofir Press Princeton Language and Intelligence, Princeton University ## Abstract Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@ 1 rate of 12 . 5 %and 87 . 7 %, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance. ## 1 Introduction Recent work has demonstrated the efficacy of LM agents for code generation with execution feedback [39]. However, applying agents to more complex code tasks like software engineering remains unexplored. To solve programming tasks, LM agents are typically designed to use existing applications, such as the Linux shell or Python interpreter [53, 57, 59]. However, to perform more complex programming tasks such as software engineering [20], human engineers benefit from sophisticated applications like VSCode with powerful tools and extensions. Inspired by human-computer interaction (HCI) studies on the efficacy of user interfaces for humans [7], we investigate whether LM agents could similarly benefit from better-designed interfaces for performing software engineering tasks. Figure 1: SWE-agent is an LM interacting with a computer through an agent-computer interface (ACI), which includes the commands the agent uses and the format of the feedback from the computer. <details> <summary>Figure 1 Details</summary> ![b981303ab1929a3e9e0d4c84e370163fe153654149b3ceb177a35616c4282160](http://localhost:8000/v1/image/b981303ab1929a3e9e0d4c84e370163fe153654149b3ceb177a35616c4282160) ### Visual Description # Technical Document Extraction: SWE-agent Architecture Diagram ## 1. Overview This image is a high-level architectural diagram illustrating the interaction between a Software Engineering Agent (**SWE-agent**) and a **Computer** system via an intermediate layer called the **Agent-Computer Interface (ACI)**. The diagram uses a closed-loop feedback system to show how a Language Model (LM) interacts with a file system and terminal. --- ## 2. Component Segmentation ### Region A: SWE-agent (Left Section) * **Header Label:** `SWE-agent` (accompanied by an icon of a hand interacting with a computer monitor). * **Primary Component:** `LM Agent` * **Visuals:** Contained within a blue-bordered rounded rectangle. Includes a robot emoji icon. * **Function:** Acts as the central processing unit that initiates actions and receives feedback. ### Region B: Agent-Computer Interface (Middle Section) * **Header Label:** `Agent-Computer Interface` * **Visuals:** A light gray background box containing two primary communication channels. * **Outbound Channel (Top):** `LM-friendly commands` * **Visuals:** Blue-bordered box. * **Specific Commands Listed:** * `Navigate repo` (Icon: Paper plane/navigation) * `Search files` (Icon: Magnifying glass) * `View files` (Icon: Stacked lines/document) * `Edit lines` (Icon: Pen/writing) * **Inbound Channel (Bottom):** `LM-friendly environment feedback` * **Visuals:** Brown-bordered box. ### Region C: Computer (Right Section) * **Header Label:** `Computer` (accompanied by a desktop computer icon). * **Visuals:** A light orange/peach background box. * **Sub-component 1:** `Terminal` (Icon: Command prompt `>_`) * **Sub-component 2:** `Filesystem` (Icon: List/hierarchy) * **Directory/File Structure Example:** * `sklearn/` (Folder icon) * `examples/` (Folder icon) * `README.rst` (File icon) --- ## 3. Process Flow and Logic The diagram depicts a bidirectional loop between the **LM Agent** and the **Computer**: 1. **Action Initiation (Blue Path):** * The **LM Agent** generates an intent. * This intent passes through the **Agent-Computer Interface** where it is translated into **LM-friendly commands** (Navigate, Search, View, Edit). * The blue arrow points from the interface directly into the **Computer** system (Terminal/Filesystem). 2. **Environment Response (Brown Path):** * The **Computer** executes the commands within the Terminal or Filesystem. * The resulting state or output is processed back through the **Agent-Computer Interface**. * The information is formatted as **LM-friendly environment feedback**. * The brown arrow points back to the **LM Agent**, completing the cycle. --- ## 4. Textual Transcription Summary | Category | Extracted Text | | :--- | :--- | | **Main Titles** | SWE-agent, Agent-Computer Interface, Computer | | **Agent Components** | LM Agent | | **Interface Commands** | LM-friendly commands: Navigate repo, Search files, View files, Edit lines | | **Interface Feedback** | LM-friendly environment feedback | | **System Components** | Terminal, Filesystem | | **Filesystem Data** | sklearn/, examples/, README.rst | **Language Declaration:** All text in this image is in **English**. </details> ∗ Equal contribution. Correspondence to johnby@stanford.edu , carlosej@princeton.edu . Data, code, and leaderboard at swe-agent.com Consider the simple setting of an agent interacting directly with a Linux shell [59]. In practice, we find that LM agents can struggle to reliably take actions in this environment. For example, it fails to provide simple commands to edit a small file segment, and does not provide any feedback if the user makes an invalid edit. These deficits substantially hamper performance, motivating the need for an agent-computer interface (ACI), i.e., an abstraction layer between the LM agent and computer, to enhance the LM agent's abilities in computer environments (Figure 1). From this effort, we introduce SWE-agent, an agent composed of an LM and ACI, that can interact with a computer to solve challenging real-world software engineering problems, such as those proposed in SWE-bench [20]. In contrast to the Linux Shell's granular, highly configurable action space, SWE-agent's ACI instead offers a small set of simple actions for viewing, searching through and editing files. The ACI uses guardrails to prevent common mistakes, and an agent receives specific, concise feedback about a command's effects at every turn. We show that ACIs tailored specifically for LMs outperform existing user interfaces (UIs) designed for human users , such as the Linux shell. Using GPT-4 Turbo as a base LM, SWE-agent solves 12 . 47 %of the 2 , 294 SWE-bench test tasks, substantially outperforming the previous best resolve rate of 3 . 8 % by a non-interactive, retrievalaugmented system [20]. We perform an ablation study on a subset of 300 SWE-bench test instances (SWE-bench Lite) to analyze our ACI design choices. The results show that SWE-agent solves 10 . 7 percentage points more instances than the baseline agent, which uses only the default Linux shell. Although our ACI was developed for GPT-4 Turbo, we show that it is portable to a different LM; SWE-agent with Claude 3 Opus can solve 10 . 5 %of the benchmark tasks. Our contributions are twofold. First, we introduce the concept of the agent-computer interface (ACI) and demonstrate how careful ACI design can substantially improve LM agent performance without modifying the underlying LM's weights. Second, we build, evaluate, and open-source SWE-agent, a system that provides LMs an ACI for solving real-world software engineering tasks. Unlike prior works that independently explore the merits of tool use, prompting techniques, and code execution in interactive settings, our approach unifies these factors within the ACI framework. We show that crafting LM-centric interactive components has meaningful effects on downstream task performance. ## 2 The Agent-Computer Interface An LM acts as an agent when it interacts with an environment by iteratively taking actions and receiving feedback [42, 62]. Typically, the environment has hard constraints, as in robotics, where agents control actuators in the physical world. On the other hand, digital environments can be molded by abstractions in the form of application programming interfaces and user interfaces for software and humans respectively. Naturally, existing interfaces have been designed with one of these users in mind. We argue that LM agents represent a new category of end user, with their own needs and abilities. We refer to the interface LM agents use to interact with computers as the agent-computer interface (ACI). Figure 2 illustrates how ACIs provide LM agents with important functionality to interface with computers, similar to how code editors also help humans use computers more effectively. Figure 2: Specialized applications like IDEs (e.g., VSCode, PyCharm) make scientists and software engineers more efficient and effective at computer tasks. Similarly, ACI design aims to create a suitable interface that makes LM agents more effective at digital work such as software engineering. <details> <summary>Figure 2 Details</summary> ![4195a4857537a9dcedf72a6ae972d014d9f5966a32b549822b721cddbbcb3f32](http://localhost:8000/v1/image/4195a4857537a9dcedf72a6ae972d014d9f5966a32b549822b721cddbbcb3f32) ### Visual Description # Technical Document Extraction: Agent vs. Human Computer Interaction ## 1. Overview This image is a comparative technical diagram illustrating two different interaction models with a computer system. The diagram is split into two primary sections by a vertical black divider line. The left side depicts an **LM Agent** (Language Model Agent) interacting with an **ACI** (Agent-Computer Interface), while the right side depicts a **Human** interacting with a **UI** (User Interface). --- ## 2. Component Isolation and Analysis ### Region A: Left Side (LM Agent Interaction) This region describes how an automated agent interacts with a computer environment. * **Subject:** A blue rounded rectangle labeled **"LM Agent"** with a robot head icon. * **Interaction Flow:** Two curved arrows (one blue pointing toward the computer, one orange pointing back toward the agent) indicate a bidirectional feedback loop. * **Target System:** A large orange-bordered box labeled **"Computer"** (accompanied by a monitor and tower icon). * **Internal Component (ACI):** Inside the computer box is a light-grey sub-container labeled **"ACI"** (Agent-Computer Interface). * **ACI Tools/Sub-categories:** * **File Viewer:** Represented by a light blue pill-shaped button with a list icon. * **File Editor:** Represented by a light purple pill-shaped button with a pen/nib icon. * **Code Search:** Represented by a light yellow pill-shaped button with a magnifying glass icon. ### Region B: Right Side (Human Interaction) This region describes how a human user interacts with a computer environment. * **Subject:** A green rounded rectangle labeled **"Human"** with an icon depicting two people (one with a laptop, one with a beaker). * **Interaction Flow:** Two curved arrows (one green pointing toward the computer, one orange pointing back toward the human) indicate a bidirectional feedback loop. * **Target System:** A large orange-bordered box labeled **"Computer"** (accompanied by a monitor and tower icon). * **Internal Component (UI):** Inside the computer box is a light-grey sub-container labeled **"UI"** (User Interface). * **UI Tools/Software Icons:** * **Visual Studio Code (VS Code):** Represented by the blue "ribbon" logo. * **PyCharm:** Represented by the green and yellow geometric logo with the "PC" text box. --- ## 3. Comparative Data Table | Feature | Left Side (Automated) | Right Side (Manual) | | :--- | :--- | :--- | | **Primary Actor** | LM Agent | Human | | **Interface Type** | ACI (Agent-Computer Interface) | UI (User Interface) | | **Interaction Method** | Programmatic/API-driven tools | Graphical Software Applications | | **Specific Tools** | File Viewer, File Editor, Code Search | VS Code, PyCharm | | **Visual Indicator** | Blue/Orange Flow | Green/Orange Flow | --- ## 4. Logical Flow and Trends 1. **Symmetry:** The diagram uses a symmetrical layout to highlight the functional equivalence between an Agent and a Human. Both occupy the same relative position on their respective sides of the divider. 2. **Abstraction Level:** The **ACI** is depicted as a set of discrete, functional capabilities (View, Edit, Search), suggesting a structured, command-based interaction. In contrast, the **UI** is depicted via complex IDE (Integrated Development Environment) logos, suggesting a visual and feature-rich environment designed for human perception. 3. **Feedback Loops:** Both systems emphasize a closed-loop interaction where the actor sends a command/action and receives information back from the computer system. </details> Disparities in humans' and LMs' abilities and limitations motivates different interface design guidelines. For instance, the current generation of LMs lack the visual understanding abilities to directly operate GUI-based applications with rich visual components and signals. However, many of the features provided by these applications, such as syntax checking and navigation tools, could be useful to LM agents if they were presented in a suitable manner. Additionally, humans can flexibly ignore unnecessary information, whereas all content has a fixed cost in memory and computation for LMs and distracting context can harm performance [27]. Therefore, LM agents may be more effective at interacting with computers when provided an interface that was built informed by these differences. Ultimately, a well-designed ACI should help the LM agent understand the state of the application given previous changes, manage history to avoid unnecessary context from prior observations, and provide actions that models can use efficiently and reliably. The ACI specifies both the commands available to the LM and how the environment state is communicated back to the LM. It also tracks the history of all previous commands and observations and, at each step, manages how these should be formatted and combined with high-level instructions into a single input for the LM. In this paper, we assume a fixed LM and focus on designing the ACI to improve its performance. This means that we shape the actions, their documentation, and environment feedback to complement an LM's limitations and abilities. We draw inspiration from the field of HCI, where user studies elicit insights about how compatible different interfaces are with respect to human intuition and performance [7]. We use two approaches to enhance performance on a development set: (1) manually inspect agent behavior to identify difficulties and propose improvements, and (2) run a grid search to select the best ACI configuration. Taking these two actions resulted in several insights about design principles that seem especially important for building effective ACIs: 1. Actions should be simple and easy to understand for agents. Many bash commands have documentation that includes dozens of options. Simple commands with a few options and concise documentation are easier for agents to use, reducing the need for demonstrations or fine-tuning. This is a defining principle for all SWE-agent commands that we describe in Section 3. 2. Actions should be compact and efficient. Important operations (e.g., file navigation, editing) should be consolidated into as few actions as possible. Efficient actions help agents make meaningful progress towards a goal in a single step. A poor design would therefore have many simple actions that must be composed across multiple turns for a higher order operation to take effect. We show this idea in action in the Editing and Search interface analyses in Section 5.1. 3. Environment feedback should be informative but concise. High quality feedback should provide the agent with substantive information about the current environment state (and the effect of the agent's recent actions) without unnecessary details. For instance, when editing a file, updating the agent about revised content is helpful. Figures 3a, 3b and Table 3 show this. 4. Guardrails mitigate error propagation and hasten recovery. Like humans, LMs make mistakes when editing or searching and can struggle to recover from these errors. Building in guardrails, such as a code syntax checker that automatically detects mistakes, can help agents recognize and quickly correct errors. We show the effect of editing guardrails in Table 3. Analysis and ablation studies in Section 5 demonstrate how alternative ACIs affect LM performance. Our studies shows how these principles appear recurrently across actions, feedback, and workflows. ## 3 SWE-agent: Designing an ACI for Software Engineering Here we describe how SWE-agent provides an ACI for LMs to act as software engineering agents, enabling them to effectively search, navigate, edit, and execute code commands. The ACI comprises several principal components, including search/navigation, file viewer, file editor, and context management. At each step, SWE-agent generates a thought and a command, then incorporates the feedback from the command's execution in the environment (ReAct; Yao et al. [62]). Built atop the Linux shell, SWE-agent also allows access to common Linux commands and utilities when needed. Search and navigation. Navigating codebases requires finding the relevant file and content. A common strategy to do this involves looking up terms that might be useful, e.g., files, functions, or class definitions mentioned in an issue. We introduce the special commands find\_file , search\_file , and search\_dir , which output a summary of search results when searching for filenames and strings within files or directories. Figure 10 shows examples of these search result formats. The find\_file command searches for filenames in the repository, while the search\_file and search\_dir locates strings in a file(s) of a subdirectory. Our interface encourages efficient searches by suppressing verbose results. The search commands return at most 50 (a) Observation from the file viewer. (b) Action using the edit interface. <details> <summary>Figure 3 Details</summary> ![37edc1084f0561a73b4c8773155820ce733b6644ddc536849e2c7c2c133479d5](http://localhost:8000/v1/image/37edc1084f0561a73b4c8773155820ce733b6644ddc536849e2c7c2c133479d5) ### Visual Description # Technical Document Extraction: Response (Thought + Edit Action) ## 1. Component Isolation The image is a UI element representing a software agent's reasoning and subsequent code modification. It is segmented into three distinct regions: * **Header:** Title of the action. * **Thought Block:** Natural language explanation of the intent. * **Edit Action Block:** A code snippet showing the specific modification to a source file. --- ## 2. Textual Transcription ### Header **Text:** `Response (Thought + Edit Action)` ### Thought Block (Main Text) **Text:** "We need to add a check to ensure that all powers are the same. Let's edit the code to include this check." ### Edit Action Block (Code Snippet) This block is contained within a light-gray rounded rectangle and uses syntax highlighting. **Line-by-Line Transcription:** ```python edit 404:407 elif (total_degree > 3 and len(set(k.exp[k.is_Pow])) == 1): if all(coeff[k] == 1): diop_type = 'general_sum_even' end_of_edit ``` --- ## 3. Technical Analysis & Logic Flow ### Component Description * **Action Type:** Code Edit. * **Target Range:** Lines 404 through 407 of the source file. * **Language:** Python (based on syntax: `elif`, `len()`, `set()`, and indentation). ### Logic Flow and Trend Verification The code implements a conditional check within a larger logic structure (likely a mathematical expression parser or solver). 1. **Primary Condition (`elif`):** * Checks if `total_degree` is greater than 3. * **Trend/Logic:** It uses a set comprehension logic `len(set(...)) == 1` to verify that all exponents (`k.exp`) in the expression are identical. This confirms the "Thought" block's intent to ensure "all powers are the same." 2. **Secondary Condition (`if`):** * Checks if all coefficients (`coeff[k]`) are equal to 1. 3. **Assignment:** * If both conditions are met, the variable `diop_type` is assigned the string value `'general_sum_even'`. ### Syntax Highlighting (Color Legend) * **Orange:** Keywords (`elif`, `if`, `and`). * **Blue:** Functions and variables (`total_degree`, `len`, `set`, `k.exp`, `k.is_Pow`, `all`, `coeff`, `diop_type`). * **Purple:** Line numbers and markers (`404:407`). * **Green:** String literals (`'general_sum_even'`). * **Black:** Structural markers (`edit`, `end_of_edit`). --- ## 4. Summary of Information The image depicts a specific logic update to a Python-based mathematical processing script. The update introduces a constraint that requires all terms in an expression to have the same exponent and a coefficient of 1 before classifying the Diophantine equation type as a "general sum even" type. </details> Figure 3: The file viewer and edit command are integrated. (a) The file viewer shows the agent the open file's content with line numbers. (b) The agent invokes the edit function to replace lines 404-407 in the open file. After the edit, the file viewer shows the agent the now updated version of the file. results for each search query; if a search exceeds this number, we do not report the results and instead suggest that the agent write a more specific query. File viewer. After finding a file they want to view, agents use the interactive file viewer by calling the command open on the relevant file path. The file viewer presents a window of at most 100 lines of the file at a time. The agent can move this window with the commands scroll\_down and scroll\_up or access a specific line with the goto command. To facilitate in-file navigation and code localization, we display: the full path of the open file, the total number of lines in the file, the number of lines omitted before and after the current window, and the line number (prepended to each visible line). Figure 3a shows an example of this interface. File editor. We provide a few commands that let LMs create and edit files. The edit command works in conjunction with the file viewer, allowing agents to replace a specific range of lines in the open file. This command takes 3 required arguments: the start line, end line, and replacement text. In a single step, agents can replace all lines between the start and end lines with the replacement text, as shown in Figure 3b. After edits are applied, the file viewer automatically displays the updated content, helping the agent observe the effects of its edit immediately without invoking additional commands. Figure 3b shows an example agent response, including a file edit. Similar to how humans can use tools like syntax highlighting to help them notice format errors when editing files in an IDE, we integrate a code linter into the edit function to alert the agent of mistakes it may have introduced when editing a file. Select errors from the linter are shown to the agent along with a snippet of the file contents before/after the error was introduced. Invalid edits are discarded, and the agent is asked to try editing the file again. Context management. The SWE-agent system uses informative prompts, error messages, and history processors to keep agent context concise and informative. Agents receive instructions, documentation, and demonstrations on the correct use of bash and ACI commands. At each step, the system instructs them to generate both a thought and an action [62]. Malformed generations trigger an error response, shown in Figure 32, asking the agent to try again, which is repeated until a valid generation is received. Once received, all past error messages except the first are omitted. The agent's environment responses display computer output using the template shown in Figure 30; however, if no output is generated, a specific message ('Your command ran successfully and did not produce any output') is included to enhance clarity. To further improve context relevance, observations preceding the last 5 are each collapsed into a single line, shown in Figure 31. By removing most content from prior observations, we maintain essential information about the plan and action history while reducing unnecessary context, which allows for more interaction cycles and avoids showing outdated file information. §A provides further implementation details. ## 4 Experimental Setup Datasets. We primarily evaluate on the SWE-bench dataset, which includes 2 , 294 task instances from 12 different repositories of popular Python packages [20]. We report our main agent results on the full SWE-bench test set and ablations and analysis on the SWE-bench Lite test set, unless otherwise specified. SWE-bench Lite is a canonical subset of 300 instances from SWE-bench that focus on evaluating self-contained functional bug fixes. We also test SWE-agent's basic code editing abilities with HumanEvalFix, a short-form code debugging benchmark [32]. Models. All results, ablations, and analyses are based on two leading LMs, GPT-4 Turbo ( gpt-4-1106-preview ) [34] and Claude 3 Opus ( claude-3-opus-20240229 ) [6]. We experimented with a number of additional closed and open source models, including Llama 3 and DeepSeek Coder [14], but found their performance in the agent setting to be subpar. Many LMs' context window is too small, such as Llama 3's context window of 8 k. GPT-4 Turbo and Claude 3 Opus have 128 k and 200 k token context windows, respectively, which provides sufficient room for the LM to interact for several turns after being fed the system prompt, issue description, and optionally, a demonstration. Baselines. We compare SWE-agent to two baselines. The first setting is the non-interactive, retrievalaugmented generation (RAG) baselines established in Jimenez et al. [20]. Here, a BM25 retrieval system retrieves the most relevant codebase files using the issue as the query; given these files, the model is asked to directly generate a patch file that resolves the issue. The second setting, called Shell-only, is adapted from the interactive coding framework introduced in Yang et al. [59]. Following the InterCode environment, this baseline system asks the LM to resolve the issue by interacting with a shell process on Linux. Like SWE-agent, model prediction is generated automatically based on the final state of the codebase after interaction. Metrics. We report %Resolved or pass @1 as the main metric, which is the proportion of instances for which all tests pass successfully after the model generated patch is applied to the repository [20]. We also report the $ Avg. Cost metric, the API inference cost incurred by SWE-agent averaged over all successfully resolved instances. Due to budget constraints, we set the per-instance budget to $4; if a run exceeded this budget, existing edits were submitted automatically. Configuration search. During the design process of SWE-agent, we arrived at the final ACI design through qualitative analysis of system behavior on a small set of hand-picked examples from the development split of SWE-bench. For the remaining hyperparameter choices, we performed a sweep over the window size, history processing, and decoding temperature, shown in §B.1. ## 5 Results Across all systems, SWE-agent w/ GPT-4 Turbo achieves the best performance all-around, successfully solving 12 . 47 %( 286 / 2 , 294 ) of the full SWE-bench test set and 18 . 00 %( 54 / 300 ) of the Lite split. As shown in Table 1, compared to RAG on Lite, SWE-agent is 8 -13 x more costly but yields a 6 . 7 -fold improved % Resolved rate. An LM-friendly ACI's value is confirmed by SWE-agent's 64 % relative increase compared to Shell-only, both with GPT-4 Turbo. In Table 2, SWE-agent yields strong performance on HumanEvalFix with 88 . 3 % pass@1 rate. Figure 4 reveals that average performance variance is relatively low, but per-instance resolution can change considerably. More results are given in the appendix: §B.2 shows that the success rate is uncorrelated to the issue age (controlling for possible test pollution), B.5 presents more details on performance variance and pass @ k , and B.7 discusses extra evaluation details. ## 5.1 Analysis of ACI Design We perform several ablations of the SWE-agent interface, specifically with respect to the SWE-agent w/ GPT-4 configuration, summarized in Table 3. Our case studies shed light on interesting agent behavior along with the impact of different ACI designs. Human user interfaces are not always suitable as agent-computer interfaces. Current LMs are vulnerable to a number of pitfalls when searching for relevant content in a Linux shell environment. Some exploration patterns (e.g., chains of cd , ls , cat ) are extremely inefficient. grep or find look ups can perform better but occasionally produce many lines of irrelevant results. We hypothesize that better localization is possible with faster navigation and a more informative search interface. https://github.com/meta-llama/llama3 Token counts for different models are not directly comparable since they use different tokenizers. Table 1: Main results for SWE-agent performance on the full and Lite splits of the SWE-bench test set. We benchmark models in the SWE-agent, Basic CLI, and Retrieval Augmented Generation (RAG) settings established in SWE-bench [20]. | | SWE-bench | SWE-bench | SWE-bench Lite | SWE-bench Lite | |-------------------|-------------|-------------|------------------|------------------| | Model | %Resolved | $ Avg. Cost | %Resolved | $ Avg. Cost | | RAG | | | | | | w/ GPT-4 Turbo | 1.31 | 0.13 | 2.67 | 0.13 | | w/ Claude 3 Opus | 3.79 | 0.25 | 4.33 | 0.25 | | Shell-only agent | | | | | | w/ GPT-4 Turbo | - | - | 11.00 | 1.46 | | w/o Demonstration | - | - | 7.33 | 0.79 | | SWE-agent | | | | | | w/ GPT-4 Turbo | 12.47 | 1.59 | 18.00 | 1.67 | | w/ Claude 3 Opus | 10.46 | 2.59 | 13.00 | 2.18 | Table 2: Pass@1 results on HumanEvalFix [32]. Except for SWE-agent, we use scores as reported in Yu et al. [65]. Figure 4: SWE-agent w/ GPT-4 Turbo Pass @ k performance across 6 runs on SWE-bench Lite. | Model | Python | JS | Java | |-------------------------------|----------|------|--------| | CodeLLaMa-instruct-13B | 29.2 | 19.5 | 32.3 | | GPT-4 | 47 | 48.2 | 50 | | DeepseekCoder-CodeAlpaca-6.7B | 49.4 | 51.8 | 45.1 | | WaveCoder-DS-6.7B | 57.9 | 52.4 | 57.3 | | SWE-agent w/ GPT-4 Turbo | 87.7 | 89.7 | 87.9 | <details> <summary>Figure 4 Details</summary> ![65db4d86ff6ff16ee53f2048349f27f24e0800dd66d57c2dc6e5e05cb2c43d76](http://localhost:8000/v1/image/65db4d86ff6ff16ee53f2048349f27f24e0800dd66d57c2dc6e5e05cb2c43d76) ### Visual Description # Technical Data Extraction: Resolution Performance Chart ## 1. Image Classification and Overview This image is a **line graph** depicting the relationship between a variable **k** and the percentage of items resolved. The chart utilizes a single data series represented by a blue line with circular markers at each data point. ## 2. Component Isolation ### A. Header/Title * **Content:** None present. ### B. Main Chart Area (Axes and Data) * **Y-Axis Label:** `% Resolved` (Oriented vertically, 90 degrees counter-clockwise). * **Y-Axis Scale:** Numerical range from 15 to 35. * **Y-Axis Markers:** 15, 20, 25, 30, 35. * **X-Axis Label:** `k` (Centered horizontally below the axis). * **X-Axis Scale:** Discrete integer values from 1 to 6. * **X-Axis Markers:** 1, 2, 3, 4, 5, 6. ### C. Legend * **Placement:** No explicit legend is provided as there is only one data series. * **Visual Encoding:** Blue line with solid circular markers. ## 3. Trend Verification The data series exhibits a **positive, monotonic trend with a decreasing rate of growth** (concave down). * As the value of **k** increases, the **% Resolved** also increases. * The steepest increase occurs between $k=1$ and $k=2$. * The slope gradually flattens as **k** approaches 6, suggesting diminishing returns for higher values of **k**. ## 4. Data Table Reconstruction Based on the spatial positioning of the markers relative to the axis tick marks, the following values are extracted: | k (X-axis) | % Resolved (Y-axis) [Approximate] | | :--- | :--- | | 1 | 18.0 | | 2 | 24.0 | | 3 | 27.5 | | 4 | 29.5 | | 5 | 31.5 | | 6 | 33.0 | ## 5. Technical Summary The chart illustrates a performance metric where increasing the parameter **k** from 1 to 6 results in an improvement in the resolution rate from approximately **18%** to **33%**. The most significant gain (approx. +6%) is achieved by moving from $k=1$ to $k=2$, while subsequent increments of **k** yield smaller marginal improvements (averaging approx. +2.25% per step thereafter). </details> Table 3: SWE-bench Lite performance under ablations to the SWE-agent interface, which is denoted by . We consider different approaches to searching and editing (see Figures 5 and 6, respectively). We also verify how varying the file viewer window size affects performance, and we ablate the effect of different context management approaches. | Editor | Editor | Search | File Viewer | Context | |-----------------|------------|----------------------|----------------------|-------------------------| | edit action | 15.0 ↓ 3.0 | Summarized 18.0 | 30 lines 14.3 ↓ 3.7 | Last 5 Obs. 18.0 | | w/ linting 10.3 | 18.0 | Iterative 12.0 ↓ 6.0 | 100 lines 18.0 | Full history 15.0 ↓ 3.0 | | No edit | ↓ 7.7 No | search 15.7 ↓ 2.3 | Full file 12.7 ↓ 5.3 | w/o demo. 16.3 ↓ 1.7 | Figure 5 compares the Shell-only setting to two different search interfaces. Iterative search, directly inspired by traditional user interfaces for search, e.g., Vim or VSCode, shows results one by one via the file viewer. Agents can look through results using next and prev actions. Each result displays the matching line along with n surrounding lines of context. An advantage is that an agent can begin editing directly after seeing the relevant code in its search. However, when given a large number of search results, agents tend to look through every match exhaustively, calling next until each result has been inspected. This inefficient behavior can exhaust an agent's cost budget or context window, leading to even worse performance than the not having additional search tools at all ( 15 . 7 % ↓ 2.3 for No search vs. 12 . 0 % ↓ 6.0 with Iterative search). Compact, efficient file editing is critical to performance. SWE-agent's file editor and viewer are designed to consolidate the editing process into a single command that enables easy multi-line edits with consistent feedback and automatically updates the agent's view of the file after editing. In the No edit setting, editing options are restrictive and prone to errors; the primary methods available are either replacing entire files through redirection and overwriting or using utilities like sed for single-line or search-and-replace edits. Both methods have significant drawbacks. Redirection involves copying and rewriting entire files for even minor changes, which is both inefficient and error-prone. Although sed can facilitate specific edits, executing multi-line edits is cumbersome and can lead to unintended consequences that are challenging to detect. Moreover, both strategies Figure 5: Three different Search interfaces for task instance pvlib\_\_pvlib-python-1224 . In Shell-only, an agent performs localization using only standard bash commands and utilities. Compared to Iterative search, Summarized search shows an exhaustive list of search results and provides guidance on refining under-specified queries. <details> <summary>Figure 5 Details</summary> ![c9d847f806d038cc52f800e910fc3dc16789480cad9fbcc35b9218587bb39000](http://localhost:8000/v1/image/c9d847f806d038cc52f800e910fc3dc16789480cad9fbcc35b9218587bb39000) ### Visual Description # Technical Document Extraction: Search Methodology Comparison This image is a comparative diagram illustrating three different search strategies employed by an AI agent when navigating a codebase. The diagram is divided into three vertical panels, each representing a specific methodology: **No Search**, **Iterative Search**, and **Summarized Search**. --- ## 1. Panel: No Search (Left) **Background Color:** Light Red/Pink **Description:** This panel illustrates a manual, inefficient approach where the agent relies on basic file system navigation commands rather than specialized search tools. ### Header Information * **Title:** 🔍 No Search * **Summary Box:** Agent performs manual search with `ls` + `cd`, or uses `grep` but struggles to interpret the result. ### Process Flow (Dialogue & Terminal) 1. **Agent Thought/Action (Blue Box):** * *Text:* "First, I should list out all the files in the codebase that may contain the definition of PVSystem." * *Command:* `$ ls` 2. **System Output (Orange Box):** * *Content:* ```text benchmarks/ setup.cfg docs/ versioneer.py pvlib/ README.md setup.py ... ``` 3. **Agent Thought/Action (Blue Box):** * *Text:* "It's most likely the source file is in pvlib/ folder. Let's change directory." * *Command:* `$ cd pvlib/` --- ## 2. Panel: Iterative Search (Center) **Background Color:** Light Orange/Tan **Description:** This panel illustrates a repetitive process where the agent uses a search command but must manually cycle through numerous individual results. ### Header Information * **Title:** 🔍 Iterative Search * **Summary Box:** Actions to show next/prev search result are repeated many times until results are exhausted. ### Process Flow (Dialogue & Terminal) 1. **Agent Thought/Action (Blue Box):** * *Text:* "I will use the `search_all` command to find occurrences of "PVSystem"..." * *Command:* `$ search "PVSystem"` 2. **System Output (Orange Box):** * *Content:* ```text [Result 1/15 (Line 70)] [File: pvlib/pvsystem.py] 22:# a dict of required pa... 23:_DC_MODEL_PARAMS = { 24: 'sapm': { (97 more lines omitted) ``` 3. **Agent Thought/Action (Blue Box):** * *Text:* "This is not the PVSystem class def..." * *Command:* `$ next` --- ## 3. Panel: Summarized Search (Right) **Background Color:** Light Green **Description:** This panel illustrates an optimized approach where the search tool provides a high-level summary, allowing the agent to identify the correct file immediately. ### Header Information * **Title:** 🔍 Summarized Search * **Summary Box:** 1. Show all results in single output 2. Tell agent to retry if too many results. ### Process Flow (Dialogue & Terminal) 1. **Agent Thought/Action (Blue Box):** * *Text:* "The next step is to locate the `PVSystem` class in the codebase and modify the constructor..." * *Command:* `$ search "class PVSystem"` 2. **System Output (Orange Box):** * *Content:* ```text Found 1 matches for "class PVSystem" in /pvlib-python: /pvlib--pvlib-python/pvlib/ pvsystem.py (1 matches) End of matches ``` 3. **Agent Thought/Action (Blue Box):** * *Text:* "It's located in `pvlib/pvsystem.py`..." * *Command:* `$ open pvlib/pvsystem.py` --- ## Summary of Components * **Blue Outlined Boxes:** Represent the AI Agent's internal reasoning and the specific command it executes. * **Orange Outlined Boxes:** Represent the terminal/system output returned to the agent. * **Trend/Logic Observation:** * **No Search** is slow and requires multiple steps of manual navigation. * **Iterative Search** provides specific data but creates a bottleneck by forcing the agent to process results one by one (1/15). * **Summarized Search** is the most efficient, providing a targeted overview that allows the agent to jump directly to the relevant file. </details> Figure 6: Three different Edit interfaces for task instance sympy\_\_sympy-24102 . Editing with bash commands requires several actions to successfully modify a file. The Editing component defines an edit command that leverages the File Viewer component to replace the bash style of editing workflow with a single command. Linting is beneficial for stymieing cascading errors that often start with an error-introducing edit by the agent. <details> <summary>Figure 6 Details</summary> ![feb628576ef9b320c2ce9246586488fdea225d9d477734fd21ea6ca9f039473c](http://localhost:8000/v1/image/feb628576ef9b320c2ce9246586488fdea225d9d477734fd21ea6ca9f039473c) ### Visual Description # Technical Document Extraction: Comparison of Code Editing Workflows This document analyzes a comparative diagram illustrating three different Large Language Model (LLM) interaction patterns for code editing: **No edit**, **edit w/o Linting**, and **edit w/ Linting**. --- ## 1. Component Isolation The image is segmented into three vertical panels, each representing a different workflow strategy. ### Panel A: No edit (Left) * **Background Color:** Light Red * **Header Summary:** * **Failure Modes:** (1) No std. out causes uncertainty (2) cat floods context window (3) LMs must track many states (e.g. line #s, pre/post-edit) * **Workflow Steps:** 1. **Action (Blue Box):** * *Thought:* "I'll use cat to display `sympy/parsing/mathematica.py`..." * *Command:* `$ cat parsing/mathematica.py` 2. **System Response (Orange Box):** `File Content (~1079 Lines)` 3. **Action (Blue Box):** * *Thought:* "To fix the issue, we need to modify the `MathematicaParser` class..." * *Command:* `$ sed -i "/# trig/i \ 'λ':Symbol('\*\λ\*')'..."` 4. **System Response (Orange Box):** `(No Output)` 5. **Action (Blue Box):** * *Thought:* "It seems there's a misunderstanding... I'll write a Python script to add the rules..." * *Command:* `$ echo (30 more lines omitted)` ### Panel B: edit w/o Linting (Center) * **Background Color:** Light Orange * **Header Summary:** * `edit` consolidates necessary steps → 1 action, but **recovery** from self-incurred errors is difficult. * **Workflow Steps:** 1. **Action (Blue Box):** * *Thought:* "Let's edit the `_atom_conversions` dictionary to include the mapping..." * *Command:* `$ edit 1055:1059 [Code Snippet] end_of_edit` 2. **System Response (Orange Box - Code Block):** ```python 1053: Function=_parse... 1054: ) 1055: _atom_conversions = { 1056: "I": I, 1057: "Pi": pi, 1058: "E": E, 1059: "λ": symbols('lamda'), 1060: } 1061: } 1062: def _from_fullform... ``` 3. **Error Indicator:** A red "no entry" icon points to line 1060 with the text: **"Syntax Error!"** (Note: The snippet shows an extra closing brace at line 1061). ### Panel C: edit w/ Linting (Right) * **Background Color:** Light Green * **Header Summary:** * **"LM-centric" edit** = (1) Show file before & after edit (2) Block edits with syntax or compile errors. * **Workflow Steps:** 1. **Action (Blue Box):** * *Thought:* "Let's edit the `_atom_conversions`... (Same as first Editing action)" 2. **System Response (Orange Box - Feedback):** * "Your proposed edit has introduced new syntax error(s). - E999 IndentationError" * "How your edit would have looked... [File Viewer w/ Edit Applied]" * "The original code before your edit [File Viewer of Original Code]" * "Your changes have NOT been applied. Fix your edit command & try again." 3. **Action (Blue Box - Correction):** * *Thought:* "The edit introduces an extra `{`. I will update my edit command to fix this..." * *Command:* `$ edit 1055:1060 [Code Snippet] end_of_edit` --- ## 2. Key Trends and Comparative Analysis | Feature | No edit | edit w/o Linting | edit w/ Linting | | :--- | :--- | :--- | :--- | | **Efficiency** | Low (Multiple steps: cat, sed, echo) | High (Single `edit` command) | High (Single command + feedback) | | **Context Usage** | High (Dumps entire files) | Low (Targeted line ranges) | Low (Targeted line ranges) | | **Error Handling** | Silent failure (No output) | State corruption (Applies bad code) | **Proactive** (Blocks bad code) | | **Feedback Loop** | Manual verification required | Requires manual fix of broken state | Automated linting guides the LM | --- ## 3. Technical Observations * **Spatial Grounding:** The diagram uses a left-to-right progression of "maturity" in tool design. * **The "No edit" workflow** demonstrates the "Context Window Flood" problem where reading a 1079-line file consumes significant tokens and makes tracking line numbers difficult for the LM. * **The "edit w/o Linting" workflow** shows that while a specialized `edit` tool reduces token usage, it creates a "Recovery" problem. If the LM makes a syntax error (shown at line 1060/1061), the codebase is now broken, and the LM must spend more effort fixing its own mistake. * **The "edit w/ Linting" workflow** introduces a "Guardrail." By refusing to apply the edit and showing a "Before vs. After" comparison, the system prevents the codebase from entering a broken state and provides the LM with the exact error (E999 IndentationError) needed to self-correct. </details> lack immediate feedback about file updates, making these silent operations potentially confusing for models to interpret and increasing the risk of errors. Without SWE-agent's file editor interface, performance drops to ( 10 . 3 % ↓ 7.7 ). We also find that agents are sensitive to the number of lines the file viewer displays. Either too little content (30 lines, 14 . 3 % ↓ 3.7 ) or too much (entire file, 12 . 7 % ↓ 5.3 ) lowers performance. Guardrails can improve error recovery. A prominent failure mode occurs when models repeatedly edit the same code snippet. The usual suspect for this behavior is an agent introducing a syntax error (e.g., incorrect indentation, extra parenthesis) via an errant edit . As discussed in Section 3, we add an intervention to the edit logic that lets a modification apply only if it does not produce major errors. We compare this interface with the No edit and edit w/o linting alternatives in Figure 6. This intervention improves performance considerably (without linting, 15 . 0 % ↓ 3.0 ). ## 5.2 Analysis of Agent Behavior Recurring problem-solving patterns emerge when LMs are equipped with a useful, intuitive ACI. We describe several model behaviors and problem-solving patterns that can be discerned from model performance and each model's corresponding trajectories. Reproduction and/or localization is the first step. SWE-agent usually begins with either writing reproduction code and/or localizing the issue's cause to specific lines of code. As shown in Figure 7, all trajectories begin with either create (reproduction) or find\_file / search\_dir (localization). To reproduce, models will create a new file, add reproduction code to it with an edit , then run with python ; this is the most popular triple of actions in Table 8. Using this feedback along with file Figure 7: The frequency with which actions are invoked at each turn by SWE-agent w/ GPT-4 for task instances that it solved on the SWE-bench full test set ( 286 trajectories). <details> <summary>Figure 7 Details</summary> ![826389ec909f0dd671dfe7addfa379f86fc20f2a65e1296917d5df2f177701eb](http://localhost:8000/v1/image/826389ec909f0dd671dfe7addfa379f86fc20f2a65e1296917d5df2f177701eb) ### Visual Description # Technical Data Extraction: Action Frequency and Failure Analysis This document provides a comprehensive extraction of the data contained in the provided image, which consists of two primary components: a stacked bar chart showing action frequency over time and a pie chart detailing failure modes. --- ## 1. Component Isolation: Stacked Bar Chart (Left) ### Metadata and Axes * **Type:** Stacked Bar Chart. * **Y-Axis Label:** Frequency (Scale: 0 to 250+). * **X-Axis Label:** Turn (Scale: 0 to 39, marked in increments of 3). * **Legend Location:** Top center, between the two charts. ### Legend Categories (Actions) The legend identifies 14 distinct actions categorized by color groups: * **Search/File Operations (Reds/Oranges):** `search_dir` (Dark Red), `search_file` (Rust), `find_file` (Orange), `find` (Peach). * **Editing/System (Greens/Grays):** `create` (Dark Green), `edit` (Light Green), `exit_cost` (Black), `submit` (Gray). * **Navigation (Blues):** `open` (Dark Blue), `scroll_up` (Medium Blue), `scroll_down` (Bright Blue), `goto` (Light Blue). * **Execution (Yellows):** `python3` (Olive/Gold), `python` (Yellow), `pytest` (Light Yellow). ### Trend Analysis * **Overall Trend:** The total frequency of actions per turn follows a sharp downward decay curve. It starts at its peak (approx. 280 actions) at Turn 0 and tapers off to near zero by Turn 39. * **Early Phase (Turns 0-6):** Dominated by search and navigation actions (`search_dir`, `find_file`, `open`). There is a significant spike in `edit` (green) and `python3/pytest` (yellow) actions starting around Turn 3. * **Middle Phase (Turns 9-21):** The proportion of `edit` (green) and `submit` (gray) actions remains relatively high compared to the shrinking total volume. * **Late Phase (Turns 24-39):** Action volume is very low. The remaining actions are primarily `edit`, `submit`, and occasional `python` executions. --- ## 2. Component Isolation: Failure Mode Pie Chart (Right) ### Metadata * **Type:** Pie Chart. * **Data Labels:** Numerical percentages are printed directly on the slices. * **Legend Location:** Bottom right, below the pie chart. ### Data Table: Failure Modes The following table reconstructs the data presented in the pie chart, cross-referencing the legend colors with the percentage values. | Color | Failure Category | Percentage (%) | | :--- | :--- | :--- | | Rust | Incorrect Implementation | 39.9 | | Peach | Overly Specific Implementation | 12.1 | | Dark Green | Failed to Recover from Edit | 23.4 | | Dark Blue | Failed to Find Edit Location | 12.9 | | Light Blue | Failed to Find Relevant File | 2.4 | | Olive/Gold | Gave Up Prematurely | 4.8 | | Yellow | Can't Reproduce | 2.4 | | Light Yellow | Ran Out of Time | 2.0 | **Total Calculated:** 100.0% --- ## 3. Synthesis and Observations * **Primary Failure Driver:** "Incorrect Implementation" is the single largest cause of failure at 39.9%. * **Process Correlation:** The stacked bar chart shows a high frequency of `edit` and `python/pytest` actions throughout the middle turns. The pie chart reveals that a combined 63.3% of failures (Incorrect Implementation + Failed to Recover from Edit) occur during or after the editing phase. * **Navigation Efficiency:** While navigation actions (`open`, `scroll`, `find`) dominate the early turns, "Failed to Find Relevant File" and "Failed to Find Edit Location" account for a combined 15.3% of total failures, suggesting that the initial search phase is successful in the majority of cases. * **Temporal Constraint:** "Ran Out of Time" is the least frequent failure mode (2.0%), suggesting that the decay in the bar chart is due to task completion or other failure types rather than reaching a hard time limit. </details> Figure 8: Failure mode distribution for SWEagent w/ GPT-4 Turbo trajectories of unresolved instances. Each instance is labeled automatically using an LM with the categories from Table 9. names and symbols in the issue description, an agent will start with a broad, directory-level keyword search, before then zooming into specific files and lines. This is reflected in Figure 22, where the most likely actions following localization sequences like ( python , find\_file ) and ( search\_dir , open ) are search\_file and goto , indicative of how an agent 'zooms in" on a bug. Extensive analysis on correlations between different groups of actions are discussed in §B.3.3 Remaining turns are mostly 'edit, then execute" loops. As exhibited in Figure 7, from turn 5 onwards, the most frequent two actions for all turns are edit and python . Captured as high probability next actions following ( edit , python ) in Figure 22, additional localization operations are often interspersed across these later turns, where agents might look at more in-file code with search\_file , scroll\_up/down , or other files altogether with search\_dir , find\_file . This behavior usually arises in response to new information from re-running the reproduction script. Submissions are distributed normally from turn 10 onwards, although resolved task instances correlate more with earlier submit s (see §B.3.1). A walk-through of common trajectory phases is in §B.3.2. Editing remains challenging for agents. A non-trivial minority of edit actions raise a linting error; out of 2 , 294 task instances, 1 , 185 ( 51 . 7 %) of SWE-agent w/ GPT-4 Turbo trajectories have 1 + failed edits. While agents generally recover more often than not from failed edits, the odds of recovery decrease as the agent accumulates more failed edits. Recovery refers to a sequence of consecutive failed edits followed immediately by a successful edit. Any attempt at editing has a 90 . 5 %chance of eventually being successful. This probability drops off to 57 . 2 %after a single failed edit. More editing phenomena are discussed in §B.3.3, and data about agents' generated fixes are in §B.6. Agents succeed quickly and fail slowly. We find that runs submitted relatively early are much more likely to be successful compared to those submitted after a larger number of steps or cost. We show in Table 15 the distribution of resolved and unresolved instances, including only instances that did not exhaust their budget. We observe that successful runs complete earlier and at a cheaper cost than unsuccessful ones. In general, successful instances solved by SWE-agent w/ GPT 4 finish with a median cost of $ 1 . 21 and 12 steps compared to a mean of $ 2 . 52 and 21 steps for unsuccessful ones. Furthermore, we find that 93 . 0 %of resolved instances are submitted before exhausting their cost budget, compared to 69 . 0 %of instances overall. For these reasons, we suspect that increasing the maximum budget or token limit are unlikely to substantially increase performance. More statistics about how trajectories typically conclude are in §B.9. Most failures are incorrect implementations. We use GPT-4o to automatically categorize unresolved trajectories (SWE-agent w/ GPT-4 Turbo on SWE-bench Lite, n =248 ) into one of 9 manually defined categories described in Table 9. On a hand-labeled validation set, the LM's judgment agrees with the authors' on 87 %of instances. From Figure 8, about half ( 52 . 0 %) of unresolved instances fall into the Incorrect Implementation or Overly Specific Implementation categories, suggesting that agents' proposed solutions often simply fail to functionally address the issue or are insufficiently general solutions. Cascading failed edits make up another 23 . 4 %of failures. More details in §B.4. ## 6 Related Work ## 6.1 Software Engineering Benchmarks Code generation benchmarks, which evaluate models on the task of synthesizing code from natural language descriptions, have served as a long-standing bellwether for measuring LM performance [5, 1, 15, 30]. Subsequent works have built upon the code generation task formulation to contribute new benchmarks that translate problems to different (programming) languages [3, 49], incorporate third-party libraries [25, 29], introduce derivative code completion tasks [18, 32], increase test coverage [26], change the edit scope [8, 9, 64], and add robustness to dataset contamination [19]. Code generation problems are largely self-contained, with short problem descriptions ( ∼ 100 lines) and corresponding solutions that are similarly brief, requiring nothing more complex than basic language primitives. Tests are either handwritten or generated synthetically via fuzz testing. In recent months, the rapid development of LMs has begun to saturate many of these benchmarks. For instance, the top method solves 94 . 4 %of HumanEval [70]. Gauging future trends with the code generation task paradigm can be limited by the simplicity of this setting and cost of human-in-the-loop problem creation. In response, recent efforts have demonstrated that software engineering (SE) can serve as a diverse, challenging testbed for LM evaluation [68, 20, 28]. Repository-level code editing introduces many reasoning challenges grounded in real SE subtasks, such as spotting errant code and identifying cross-file relationships and understanding codebasespecific symbols and conventions. As a field, SE has generally studied tasks in a more isolated manner; prior benchmarks tended to frame problems in isolation from the rest of a codebase [21, 23]. We use SWE-bench because it unites many separate SE tasks, such as automated program repair [10, 40, 55], bug localization [4, 58], and testing [22, 46, 56] under a single task formulation that faithfully mirrors practical SE. Furthermore, SWE-bench task instances are diverse, having been automatically collected from real GitHub issues across 12 different repositories. In addition, SWEbench performance is based on rigorous, execution-based evaluation with human-written unit tests. ## 6.2 Language Models as Agents The co-emergence of stronger LMs, increasingly challenging benchmarks, and practical use cases have together motivated a paradigm shift in LMs' inference setting. Instead of traditional zero/fewshot generation, LM agents [17, 42, 47, 54] that interact with a real/virtual world have proliferated as the default setting for web navigation [24, 33, 36, 41, 45, 61, 62, 71], computer control [35, 53, 57], and code generation tasks [16, 50, 63]. Interaction and code generation are increasingly used together, with code as the modality of choice for actions [48, 59], tool construction [13, 51, 69], and reasoning [39, 66, 67]. Coding agents have also been applied to offensive security [11, 37, 60], theorem proving [44], and clinical tasks [38, 43, 52]. To the best of our knowledge, SWE-agent is the first work to explore language agents for end-to-end software engineering (SE). ## 7 Discussion We introduce SWE-agent, an agent composed of an LM and ACI capable of autonomously solving software engineering tasks. Through our design methodology, results, and analysis, we demonstrate the value of ACIs tailored to leverage LMs' strengths and mitigate their weaknesses. Beyond empirical applications, we hope the further study of ACIs can also make principled use of and contribute to our understanding of language models and agents, analogous to the synergy between human-computer interaction (HCI) and psychology [2]. Humans and LMs have different characteristics, training objectives, specialities, and limitations [12, 31], and the interaction design processes can be seen as systematic behavioral experimentation that could reveal more insights into these differences towards establishing a comparative understanding of human and artificial intelligence. ## Acknowledgements We thank Austin W. Hanjie, Sam Ainsworth, Xindi Wu, Yuhan Liu, Mengzhou Xia, Dan Friedman, Tianyu Gao, Adithya Bhaskar, Aatmik Gupta, Louisa Nyhus, Alisa Liu, Ori Yoran and Richard Zhu for their valuable feedback and advice. We would also like to thank the broader Princeton Language and Intelligence community for supporting our work. We acknowledge support from an Oracle Collaborative Research award and the National Science Foundation under Grant No. 2239363. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation ## References - [1] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021. - [2] J. M. Carroll. Human-computer interaction: psychology as a science of design. Annual review of psychology , 48(1):61-83, 1997. - [3] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda. Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022. - [4] S. Chakraborty, Y. Li, M. Irvine, R. Saha, and B. Ray. Entropy guided spectrum based bug localization using statistical language model. arXiv preprint arXiv:1802.06947 , 2018. - [5] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, and J. K. et. al. Evaluating large language models trained on code, 2021. - [6] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. - [7] A. Cooper, R. Reimann, and D. Cronin. About face 3: the essentials of interaction design . John Wiley & Sons, Inc., USA, 2007. ISBN 9780470084113. - [8] Y. Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, and B. Xiang. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2023. URL https://openreview.net/forum? id=wgDcbBMSfh . - [9] X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation, 2023. - [10] Z. Fan, X. Gao, M. Mirchev, A. Roychoudhury, and S. H. Tan. Automated repair of programs from large language models, 2023. - [11] R. Fang, R. Bindu, A. Gupta, Q. Zhan, and D. Kang. Llm agents can autonomously hack websites, 2024. - [12] T. L. Griffiths. Understanding human intelligence through human limitations. Trends in Cognitive Sciences , 24(11):873-883, 2020. - [13] Y. Gu, Y. Shu, H. Yu, X. Liu, Y. Dong, J. Tang, J. Srinivasa, H. Latapie, and Y . Su. Middleware for llms: Tools are instrumental for language agents in complex environments, 2024. - [14] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming - the rise of code intelligence. CoRR , abs/2401.14196, 2024. URL https: //arxiv.org/abs/2401.14196 . - [15] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with apps, 2021. - [16] S. Holt, M. R. Luyten, and M. van der Schaar. L2MAC: Large language model automatic computer for unbounded code generation. In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum?id=EhrzQwsV4K . - [17] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2023. - [18] Q. Huang, J. Vora, P. Liang, and J. Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation, 2024. - [19] N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. - [20] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum? id=VTF8yNQM66 . - [21] R. Just, D. Jalali, and M. D. Ernst. Defects4J: A Database of existing faults to enable controlled testing studies for Java programs. In ISSTA 2014, Proceedings of the 2014 International Symposium on Software Testing and Analysis , pages 437-440, San Jose, CA, USA, July 2014. Tool demo. - [22] S. Kang, J. Yoon, and S. Yoo. Large language models are few-shot testers: Exploring llm-based general bug reproduction, 2023. - [23] R.-M. Karampatsis and C. Sutton. How often do single-statement bugs occur? the manysstubs4j dataset. 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR) , pages 573-577, 2019. URL https://api.semanticscholar.org/CorpusID: 173188438 . - [24] J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024. - [25] Y. Lai, C. Li, Y . Wang, T. Zhang, R. Zhong, L. Zettlemoyer, S. W. tau Yih, D. Fried, S. Wang, and T. Yu. Ds-1000: A natural and reliable benchmark for data science code generation, 2022. - [26] J. Liu, C. S. Xia, Y . Wang, and L. Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210 , 2023. - [27] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts, 2023. - [28] T. Liu, C. Xu, and J. McAuley. Repobench: Benchmarking repository-level code autocompletion systems. In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum?id=pPjZIOuQuF . - [29] Y. Liu, X. Tang, Z. Cai, J. Lu, Y. Zhang, Y. Shao, Z. Deng, H. Hu, K. An, R. Huang, S. Si, S. Chen, H. Zhao, L. Chen, Y. Wang, T. Liu, Z. Jiang, B. Chang, Y. Qin, W. Zhou, Y. Zhao, A. Cohan, and M. Gerstein. Ml-bench: Evaluating large language models for code generation in repository-level machine learning tasks, 2024. - [30] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation, 2021. - [31] R. T. McCoy, S. Yao, D. Friedman, M. Hardy, and T. L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638 , 2023. - [32] N. Muennighoff, Q. Liu, A. R. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, L. V. Werra, and S. Longpre. Octopack: Instruction tuning code large language models. In The Twelfth International Conference on Learning Representations , 2024. URL https: //openreview.net/forum?id=mw1PWNSWZP . - [33] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022. - [34] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. S. Gu, Y . Guo, C. Hallacy, J. Han, J. Harris, Y . He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, Łukasz Kaiser, A. Kamali, I. Kanitscheider, N. S. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, J. H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, Łukasz Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. M. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, O. Murk, D. Mély, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, L. Ouyang, C. O'Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. de Oliveira Pinto, Michael, Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. P. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. B. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, and B. Zoph. Gpt-4 technical report, 2023. - [35] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. - [36] O. Press, M. Zhang, S. Min, L. Schmidt, N. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 5687-5711, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp. 378. URL https://aclanthology.org/2023.findings-emnlp.378 . - [37] M. Shao, B. Chen, S. Jancheska, B. Dolan-Gavitt, S. Garg, R. Karri, and M. Shafique. An empirical evaluation of llms for solving offensive security challenges, 2024. - [38] W. Shi, R. Xu, Y. Zhuang, Y. Yu, J. Zhang, H. Wu, Y. Zhu, J. Ho, C. Yang, and M. D. Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records, 2024. - [39] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. - [40] D. Sobania, M. Briesch, C. Hanna, and J. Petke. An analysis of the automatic bug fixing performance of chatgpt, 2023. - [41] A. Sridhar, R. Lo, F. F. Xu, H. Zhu, and S. Zhou. Hierarchical prompting assists large language model on web navigation, 2023. - [42] T. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths. Cognitive architectures for language agents, 2023. - [43] X. Tang, A. Zou, Z. Zhang, Z. Li, Y. Zhao, X. Zhang, A. Cohan, and M. Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning, 2024. - [44] A. Thakur, G. Tsoukalas, Y. Wen, J. Xin, and S. Chaudhuri. An in-context learning agent for formal theorem-proving, 2024. - [45] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. AgueraArcas, C. Cui, M. Croak, E. Chi, and Q. Le. Lamda: Language models for dialog applications, 2022. - [46] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang. Software testing with large language model: Survey, landscape, and vision, 2023. - [47] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. Zhao, Z. Wei, and J. Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science , 18(6), Mar. 2024. ISSN 2095-2236. doi: 10.1007/s11704-024-40231-1. URL http://dx.doi.org/10.1007/s11704-024-40231-1 . - [48] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji. Executable code actions elicit better llm agents, 2024. - [49] Z. Wang, G. Cuenca, S. Zhou, F. F. Xu, and G. Neubig. Mconala: A benchmark for code generation from multiple natural languages, 2023. - [50] Z. Wang, S. Zhou, D. Fried, and G. Neubig. Execution-based evaluation for open-domain code generation, 2023. - [51] Z. Wang, D. Fried, and G. Neubig. Trove: Inducing verifiable and efficient toolboxes for solving programmatic tasks, 2024. - [52] M. Wornow, A. Narayan, K. Opsahl-Ong, Q. McIntyre, N. H. Shah, and C. Re. Automating the enterprise with foundation models, 2024. - [53] Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024. - [54] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui. The rise and potential of large language model based agents: A survey, 2023. - [55] C. S. Xia and L. Zhang. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages 959-971, 2022. - [56] C. S. Xia, M. Paltenghi, J. L. Tian, M. Pradel, and L. Zhang. Universal fuzzing via large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering , 2023. - [57] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. - [58] A. Z. H. Yang, C. Le Goues, R. Martins, and V. Hellendoorn. Large language models for test-free fault localization. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering , ICSE '24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400702174. doi: 10.1145/3597503.3623342. URL https://doi. org/10.1145/3597503.3623342 . - [59] J. Yang, A. Prabhakar, K. R. Narasimhan, and S. Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2023. URL https://openreview.net/forum?id=fvKaLF1ns8 . - [60] J. Yang, A. Prabhakar, S. Yao, K. Pei, and K. R. Narasimhan. Language agents as hackers: Evaluating cybersecurity skills with capture the flag. In Multi-Agent Security Workshop@ NeurIPS'23 , 2023. - [61] S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023. - [62] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations , 2023. URL https://openreview.net/forum?id=WE\_vluYUL-X . - [63] P. Yin, W.-D. Li, K. Xiao, A. Rao, Y. Wen, K. Shi, J. Howland, P. Bailey, M. Catasta, H. Michalewski, A. Polozov, and C. Sutton. Natural language to code generation in interactive data science notebooks, 2022. - [64] H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li, T. Xie, and Q. Wang. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In International Conference on Software Engineering , 2023. URL https://api. semanticscholar.org/CorpusID:256459413 . - [65] Z. Yu, X. Zhang, N. Shang, Y. Huang, C. Xu, Y. Zhao, W. Hu, and Q. Yin. Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint arXiv:2312.14187 , 2023. - [66] E. Zelikman, Q. Huang, G. Poesia, N. D. Goodman, and N. Haber. Parsel: Algorithmic reasoning with language models by composing decompositions, 2022. URL https://arxiv.org/ abs/2212.10561 . - [67] E. Zelikman, E. Lorch, L. Mackey, and A. T. Kalai. Self-taught optimizer (stop): Recursively self-improving code generation, 2024. - [68] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J.-G. Lou, and W. Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. In The 2023 Conference on Empirical Methods in Natural Language Processing , 2023. URL https://openreview.net/forum?id=q09vTY1Cqh . - [69] S. Zhang, J. Zhang, J. Liu, L. Song, C. Wang, R. Krishna, and Q. Wu. Training language model agents without modifying language models, 2024. - [70] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang. Language agent tree search unifies reasoning acting and planning in language models, 2023. - [71] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, and G. Neubig. Webarena: A realistic web environment for building autonomous agents, 2023. ## Appendix In the appendix, we provide additional analyses and more extensive discussions about SWE-agent, agent-computer interface (ACI) design, and model performance on various evaluation benchmarks. We also provide several thorough case studies of SWE-agent behavior on select task instances. Data, code, and leaderboard at swe-agent.com. ## A SWE-agent Design In this section, we go into greater discussion about the design methodology, appearance, and implementation of each of the SWE-agent components. As described in Section 3, the SWE-agent interface consists of several components that enable agents to accomplish key subtasks that are fundamental to solving software engineering problems. These are generally the following: 1. Localization : Identify file(s)/line(s) causing the issue. 3. Testing : Write new scripts or modify existing test files to reproduce the issue and/or verify if fixes are correct. 2. Editing : Generate fixes addressing the given issue. To enable LM-based agents to efficiently carry out these individual functions and progress towards the overarching goal of resolving a codebase issue, we provide a file viewer, file editor, search / navigation system, and context management system. In Section A.1, we provide a thorough breakdown of each of these components. In Section A.2, we discuss the technical design decisions and challenges of building SWE-agent. In Section A.3, we discuss how SWE-agent is configured to support the final interface, along with how SWE-agent is built to enable easy extensibility and customization to alter the interface. ## A.1 ACI Design In this section, we revisit each component discussed in Section 3. Per section, we first briefly review the component. We then discuss the underlying motivation for the component with respect to existing software tools. Finally, we note any additional thoughts that influenced the design process of the component with some occasional discussion of what aspects of the component heavily influence language model behavior. For a quick, text-free overview, comprehensive documen- message history for SWE-agent. Each prompt template is discussed thoroughly in Section C. Figure 9: An overview over the structure of a trajectory: We first present the system prompt, demonstration (optional), and issue statement. The agent then interacts in turn with the environment. Past observations may be collapsed , i.e. we truncate any long output, as described in Section 3. <details> <summary>Figure 8 Details</summary> ![33169a89042f114b4f0734924f0053bf1b49028721a2bf03e05f8efb47fcd6a5](http://localhost:8000/v1/image/33169a89042f114b4f0734924f0053bf1b49028721a2bf03e05f8efb47fcd6a5) ### Visual Description # Technical Document Extraction: LLM Agent Prompt and Trajectory Structure This image is a technical diagram illustrating the structured components of a Large Language Model (LLM) prompt and the subsequent iterative trajectory for solving a software engineering task. ## 1. Component Isolation The diagram is organized vertically into three primary segments: * **Header (Prompt Configuration):** Three large, black-bordered blocks defining the initial context. * **Main Trajectory (Iterative Loop):** A series of alternating blue and orange bars representing the interaction between the agent and the environment. * **Footer (Output):** A final yellow block representing the resulting artifact. --- ## 2. Detailed Content Transcription ### Region 1: Prompt Configuration (Header) These blocks establish the "System" and "User" context before the agent begins work. | Block Title | Content / Bullet Points | | :--- | :--- | | **System Prompt** | • Describe environment and commands<br>• Specify response format | | **Demonstration** | Full trajectory of a successful example | | **Issue statement** | • Give reported issue description<br>• Instructions to resolve issue<br>• High-level strategy tips | ### Region 2: Iterative Trajectory (Main Chart) This section depicts a ReAct-style (Reasoning and Acting) loop. The colors indicate the source of the information. * **Blue Outlined Bars (Agent Output):** Labeled **"Thought & Action"**. These represent the model's internal reasoning and the command it chooses to execute. * **Orange Outlined Bars (Environment Output):** Labeled **"Environment Response"**. * The first two orange bars feature a diagonal grey-striped pattern and are labeled **"(collapsed)"**, indicating historical context that may be truncated or summarized. * A vertical ellipsis (**⋮**) indicates that this loop repeats multiple times. * The final orange bar is solid, labeled **"Environment Response"**, representing the most recent feedback. * **Final Blue Bar:** Labeled **"Submit"**, indicating the agent has completed the task. ### Region 3: Resulting Artifact (Footer) A yellow-shaded block representing the final output of the process. | Label | Text Content (Code/Diff) | | :--- | :--- | | **Patch File** | `diff --git a/src/sqlfluff/rules/L060.py` <br> `b/src/sqlfluff/rules/L060.py` <br> `--- a/src/sqlfluff/rules/L060.py` <br> `+++ b/src/sqlfluff/rules/L060.py` | --- ## 3. Flow and Logic Analysis 1. **Initialization:** The process begins with a comprehensive **System Prompt**, a **Demonstration** (few-shot learning), and the specific **Issue statement**. 2. **The Loop:** The agent enters a cycle of **Thought & Action** (Blue) followed by an **Environment Response** (Orange). * *Trend:* The diagram shows that earlier responses are "collapsed" to manage context window limits, while the most recent response remains fully visible. 3. **Termination:** Once the agent determines the issue is resolved, it issues a **Submit** command. 4. **Output:** The final result is a **Patch File**, specifically shown as a git diff targeting a Python file in the `sqlfluff` library (`L060.py`). ## 4. Technical Metadata * **Language:** English (100%). * **Visual Coding:** * **Black Borders:** Static prompt instructions. * **Blue Borders:** Agent-generated content. * **Orange Borders:** System/Environment-generated feedback. * **Yellow Background:** Final data product/artifact. </details> tation for all commands, their usage, and docstrings are included in Table 4. Figure 9 visualizes the File viewer. As discussed in Section 3, the File Viewer is fundamental to a language agent's ability to understand file content and understand how different programmatic entities relate to one another. The File Viewer refers to an interface that consists of the four commands, as shown in Table 4, and a customized standard output for displaying n lines of a file at a time. Using the file viewer, an agent can look at n lines of a file at a time and jump around the file. The File Viewer enables agents to perform fine-grained localization steps and also understand relationships between intra-file entities. First, we discuss why existing software systems and graphical user interfaces are sub-optimal for LM use. In a Shell-only setting, there are several commands that can be used to inspect file content. However, out of the box command line tools are sub-optimal or limiting for language agents for Table 4: In additional to the standard Linux Bash commands, we provide SWE-agent with specialized tools, including an interactive file viewer, search functionalities, and edit tools for the open file. Required arguments are enclosed in <> and optional arguments are in [] . The last column shows the documentation presented to the LM. | Category | Command | Documentation | |--------------|---------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | File viewer | open <path> [<line_number>] | Opens the file at the given path in the editor. If line_number is provided, the window will move to include that line. | | File viewer | goto <line_number> | Moves the window to show line_number. | | File viewer | scroll_down | Moves the window up 100 lines. | | File viewer | scroll_up | Moves the window down 100 lines. | | Search tools | search_file <search_term> [<file>] search_dir <search_term> [<dir>] | Searches for search_term in file. If file is not provided, searches in the current open file. Searches for search_term in all files in dir. If dir is not provided, searches in the current directory. | | Search tools | find_file <file_name> [<dir>] | Finds all files with the given name in dir. If dir is not provided, searches in the current | | Search tools | | directory. | | File editing | edit <n>:<m> <replacement_text> end_of_edit | Replaces lines n through m (inclusive) with the given text in the open file. All of the replacement_text will be entered, so make sure your indentation is formatted properly. Python files will be checked for syntax errors after the edit. If an error is found, the edit will not be executed. Reading the error message and modifying your command is recommended as issuing the same command will return the same error. | | File editing | create <filename> | Creates and opens a new file with the given name. | | Task | submit | Generates and submits the patch from all previ- ous edits and closes the shell. | several reasons. First, commands that print files to standard output (e.g. cat , printf ) can easily flood a language agent's context window with too much file content, the majority of which is usually irrelevant to the issue. Enabling a language agent to filter out distractions and focus on relevant code snippets is crucial to generating effective edits. While commands like head and tail reduce length to the first/last n lines, it is not intuitive to use bash commands to perform in-file navigation. It is either impossible or requires a long list of arguments to show specific file lines. Furthermore, since such Bash commands are stateless, 'scrolling" up/down relative to the current file position typically requires regenerating the same lengthy command with minor changes. Interactive tools like more and less accommodate this, but (1) representing navigation actions (multiple key up/down clicks) is intuitive for humans, but is verbose and costly for language agents, and (2) even if jumping to a specific line number is allowed, it is not possible to quickly identify what classes/methods/symbols are declared in a file and then immediately go to their definitions. There are a couple features of the File Viewer interface that make it friendlier and more operable than the Shell-only setting. First, the File Viewer standard output contextualizes code snippets with prepended line numbers and indicators of the number of lines above/below the current region. These details give a more focused view of a file without compromising easy viewing of other parts of the codebase. This kind of file presentation also makes precise and consistent editing commands possible, as we discuss more thoroughly in the following section. Figure 10: The File Viewer and Search components of the SWE-agent interface. The corresponding commands for each component are shown in blue. These examples are copied from trajectories generated by SWE-agent w/ GPT-4 Turbo on the pvlib\_\_pvlib-python-1603 task instance. <details> <summary>Figure 9 Details</summary> ![046ceec42f119d785061f751ad211ad1f9ecd6c5852bda5f19487153876d8b8d](http://localhost:8000/v1/image/046ceec42f119d785061f751ad211ad1f9ecd6c5852bda5f19487153876d8b8d) ### Visual Description # Technical Document Extraction: Software Interface Overview This image depicts a multi-paneled graphical user interface (GUI) for a code development environment, specifically showing file viewing and search functionalities related to a Python project named `pvlib-python`. ## 1. Component Isolation The interface is divided into two primary columns: * **Left Column:** File Viewer (Large panel). * **Right Column:** Search Tools (Three stacked panels: Find Files, Search Directory, and Search File). --- ## 2. Detailed Content Extraction ### Region: File Viewer (Left Column) This section displays the contents of a specific Python file. * **Header Command:** `open atmosphere.py` * **Content Transcription:** * `(255 more lines above)` * **Line 256:** `def gueymard94_pw(temp_air, relative_humidity):` * **Line 257:** `r"""` * **Line 258:** `Calculates precipitable water (cm) from ambient air temperature (C)` * **Line 259:** `and relatively humidity (%) using an empirical model. The` * **Line 260:** `accuracy of this method is approximately 20% for moderate PW (1-3` * **Line 261:** `cm) and less accurate otherwise.` * **Line 262:** * **Line 263:** `The model was developed by expanding Eq. 1 in [2]_:` * **Line 264:** * **Line 265:** `.. math::` * **Line 266:** * **Line 267:** `Pw = 0.1 H_v \rho_v` * `(76 lines omitted)` * **Line 343:** * **Line 344:** * **Line 345:** `def bird_hulstrom80_aod_bb(aod380, aod500):` * **Line 346:** `"""` * **Line 347:** `Approximate broadband aerosol optical depth.` * **Line 348:** * **Line 349:** `Bird and Hulstrom developed a correlation for broadband aerosol` * **Line 350:** `depth (AOD) using two wavelengths, 380 nm and 500 nm.` * **Line 351:** * **Line 352:** `Parameters` * **Line 353:** `----------` * **Line 354:** `aod380 : numeric` * **Line 355:** `AOD measured at 380 nm. [unitless]` * `(180 more lines below)` --- ### Region: Find Files (Top Right) * **Header Command:** `find_file atmosphere.py` * **Output Text:** * Found 2 matches for "atmosphere.py" in `/pvlib__pvlib-python/pvlib`: * `/pvlib__pvlib-python/pvlib/tests/atmosphere.py` * `/pvlib__pvlib-python/pvlib/atmosphere.py` --- ### Region: Search Directory (Middle Right) * **Header Command:** `search_dir APPARENT_ZENITH_MODELS` * **Output Text:** * Found 4 matches for "APPARENT_ZENITH_MODELS" in `/pvlib__pvlib-python`: * `/pvlib__pvlib-python/docs/sphinx/source/whatsnew/v0.3.0.txt` (1 matches) * `/pvlib__pvlib-python/pvlib/atmosphere.py` (2 matches) * `/pvlib__pvlib-python/pvlib/location.py` (1 matches) * End of matches for "APPARENT_ZENITH_MODELS" in `/pvlib__pvlib-python` --- ### Region: Search File (Bottom Right) * **Header Command:** `search_file APPARENT_ZENITH_MODELS` * **Output Text:** * Found 2 matches for "APPARENT_ZENITH_MODELS" in `/pvlib__pvlib-python/pvlib/atmosphere.py`: * **Line 12:** `APPARENT_ZENITH_MODELS = ('simple', 'kasten1966', 'kastenyoung1989',` * **Line 15:** `AIRMASS_MODELS = APPARENT_ZENITH_MODELS + TRUE_ZENITH_MODELS` * End of matches for "APPARENT_ZENITH_MODELS" in `/pvlib__pvlib-python/pvlib/atmosphere.py` --- ## 3. Technical Summary The image illustrates a workflow for navigating a Python library (`pvlib`). 1. **Discovery:** The user locates the file `atmosphere.py`. 2. **Global Search:** The user searches the entire directory for the constant `APPARENT_ZENITH_MODELS`, finding it in documentation, the atmosphere module, and the location module. 3. **Local Search:** The user drills down into `atmosphere.py` to see the specific lines (12 and 15) where the constant is defined and used. 4. **Inspection:** The user opens the file to read the implementation of functions like `gueymard94_pw` and `bird_hulstrom80_aod_bb`. </details> Another advantage of the File Viewer is that the commands are designed to be complementary and grounded in the File Viewer standard output. This saves the model from having to do repetitive or additional actions that unnecessarily increase the potential for error. As a concrete example, if an agent used a sed command to view the first 100 lines of a file and wants to look at the next 100 lines, it will have to recalculate parameters such as the start line and end line and reflect these updates correctly in the subsequent generation. As a rule of thumb, reducing the need for models to do this arithmetic by constructing actions and standard output that complement one another and build upon the effects of prior actions is highly preferable. File editor. The File Editor, working in conjunction with the File Viewer, primarily refers to the edit command and the guardrails it enforces to protect models against self-incurred cascading edit errors. Editing and testing are crucial to language agents' success on programming tasks, and a well-designed interface directly influences how well an agent's capabilities can be elicited. In other words, a bad interface undermines model performance. As discussed in Section 3, editing can be very difficult in a Shell-only setting. Built in commands (e.g., sed ) often require a lengthy list of arguments, and the mis-specification of an argument can easily throw a model off track as it attempts to correct self-incurred errors. We also observe that when agents use such commands directly, they struggle with the arithmetic skills required to generate an edit. Details such as including the correct indentation level, inserting delimiters at specific points in a line, and adhering to stylistic preferences of the codebase all require some amount of planning or calculation. Similar to the Shell-only file viewing process, file editing may also require repeating many commands. For instance, performing a multi-line edit can only be represented as multiple sed calls with requisite, delicate tweaks to the arguments for every turn. Furthermore, as referenced in Section 5.1, editing in Shell-only is usually a 'silent" procedure. Confirming whether an edit succeeded and viewing its effects requires additional steps that can bloat the editing process with extra, needless commands. The edit command, documented in Table 4, addresses the Shell-only failure modes by being grounded in the File Viewer standard output. The line numbers argument eliminates the need for any additional arithmetic, and the find-and-replace edit mechanism is a format that existing models are more used to. With this functionality, agents can also perform multi-line edits in a single action. Finally, as mentioned in Section 5.2, an important feature of the edit command is that it does not apply changes which incur a linting error. A fair and verified assumption we make when considering this feature is that the original codebase associated with each task instance is well-formed. In other words, we assume that codebase maintainers will only push syntactically sound code that can be compiled successfully. When an agent issues an edit, it is applied to the codebase. Then, we run the following linting command ( CURRENT\_FILE refers to the file that is currently open): ``` flake8 --isolated --select=F821,F822,F831,E111,E112,E113,E999,E902 "$CURRENT_FILE" 2>&1 ``` The arguments for select are error codes that refer to syntax issues such as indentation. F 821 and F 822 indicate undefined names/symbols. F 831 indicates a duplicate argument in a function definition. E 111 , E 112 , E 113 are indentation errors. E 999 denotes a syntax error and an E 902 occurs if flake 8 cannot read the source file. If the edit does not introduce any of these errors, this command will produce no output. The edit is kept and the updated file content is shown using the File Viewer centered around the lines the edit occurred. If however the linting command produces output, which indicates the edit introduces a syntax error, the edit is reverted. In place of the file viewer, a message shown in Figure 11 is displayed to the agent which shows the error that was caused, what the edit would have looked like, and the original file content. During the development process, we experimented with variations to this message, including the omission of one or more parts. Our takeaway was that having all three messages is helpful. Without the error type, the agent might misdiagnose what the mistake was. Without a snippet of the changed file content, the agent will re-issue the same command more frequently. Without a snippet of the original file content, the agent has to attend to the same content from several turns ago; agents also sometimes generate edit 's with respect to wrong, errant file content because it is from a more recent turn. ``` was. Without a snippet of the changed file content, the agent will re-issue the same command more frequently. Without a snippet of the original file content, the agent has to attend to the same content from several turns ago; agents also sometimes generate edit 's with respect to wrong, errant file content because it is from a more recent turn. Linting Error Message ``` ``` 72: solar_position_method='nrel_numpy', 73: airmass_model='kastenyoung1989', 74: altitude=None, pressure=None, 75: **kwargs): (1816 more lines below) -------------------------------------------------- Your changes have NOT been applied. Please fix your edit command and try again. You either need to 1) Specify the correct start/end line arguments or 2) Correct your edit code. DO NOT re-run the same failed edit command. Running it again will lead to the same error. ``` Figure 11: A linting error message. This is emitted if a model generates an edit command that introduces a syntax error into the codebase. The error message shows the before and after of the proposed edit along with what error messages were thrown. The problem with this edit is that it omits the orientation\_strategy field in its edit of the basic\_chain method definition. The editing guardrail has a drawback. To a certain degree, it forces some edits to be done in a particular order. For instance, in Figure 11, if the model's intention was in fact to remove the orientation\_strategy argument, due to the SWE-agent editing guardrails, it would have to remove all references from the function implementation either at the same time in a single action, or before removing it from the method header if split into two separate actions. For this particular scenario, the latter is necessary because the file snippet is not large enough to show the entirety of the basic\_chain implementation. This example highlights the trade-offs between the flexibility and guardrails of a command. Deciding whether to introduce a guardrail depends on how well it reduces common model errors compared to whether such restrictions hamper models' preferred workflows. Search & navigation. The File Viewer and File Editor together allow agents to make edits, write tests, and perform localization at a file level. The Search & navigation module complements these capabilities by giving agents the tools to perform keyword-driven localization at both a directory level and file level. As discussed, the main struggles with using built in Shell-only search commands such as grep and find are (1) given a general enough term, they are prone to producing too many search results that can consume an inordinate amount of space in the context window, and (2) they are highly configurable, making search result outcomes potentially inconsistent in appearance. The alternative to these search utilities is to navigate the file system directly with cd and look at what's in each folder with variations of ls and cat ; this kind of approach can take a large number of turns without yielding any particularly useful information. Figure 10 visualizes the standard output for the three different search commands. The search\_dir and find\_file helps agents perform directory level searches. The reason we provide two commands is due to the kinds of keywords that are present in an issue description (e.g., class references, file names). The search\_file command allows agents to search for terms at a file-level, which is helpful for efficient fine-grained localization. Taking a step back, the goal of these search commands is to make it easy for the agent to utilize any signal (e.g., line number, stack trace, natural language) about the root cause of an issue that may be present in the issue description or codebase. Once again, simpler command usage patterns with consistent output formats are easier for agents to use and reduces the chance for mistakes or irrelevant outputs. The main guardrail in place for all three search commands is curbing the number of search results to 50 or fewer. The downside is that reporting an error forces the model to generate another search query which can be an expensive operation. This reflects a trade-off between keeping observations concise and making additional calls to the base LM. ## A.2 Implementation The SWE-agent codebase is generally composed of three modules: the environment, the agent, and the logging mechanism for saving task episodes into trajectories and patch generations. Environment. The SWE-agent environment is heavily influenced by the InterCode library [59]. For the general pipeline of agent interactions with the environment, our work directly adopts InterCode's interactive coding task formulation. The environment integrates large parts of the interaction handling logic from the InterCode-Bash environment, which is essentially the Shell-only setting referenced in the paper. As a part of this adoption, SWE-agent also uses Docker containers to ensure reproducible and safe execution. Because of this, SWE-agent's infrastructure makes it easy for a user to swap out the Dockerfile (a domain specific language for defining a container) to support other codebases and programming languages beyond the scope of SWE-bench task instances. One difference is that SWE-agent makes minor adjustments to the underlying communication logic that transfers actions and observations between the Docker container and agent entity. Agent. Beyond serving as an agentic wrapper for facilitating multi-turn queries from an LM, the agent module defines the functions that render the ACI (e.g., context management, commands, interface logic, input/output format) and supports inference for closed/open, API-based/local language models. The main workflow is to define an interface as a class and/or set of commands, which can then be specified via a configuration file, discussed more thoroughly in Section A.3. The commands for the top performing SWE-agent with GPT 4 configuration are shown in Table 4. Logging. For each task episode, the main artifacts produced are the trajectory, which contains a history of the interactions between the agent and environment, and the final patch generation, which can represents a summary of the changes proposed by the agent during the interaction. The patch generation can be used directly for SWE-bench [20] evaluation. ## A.3 Configuration The SWE-agent system is instantiated by three components: an LM, a SWE-bench style dataset or GitHub issue, and a configuration file. The configuration file serves to specify the design of the ACI. Iteratively refining the configuration file is the main way we achieved better agent performance and carried out different analyses for the main paper. In this section, we will present a thorough review of what a SWE-agent configuration file looks like. An agent-computer interface is generally made up of four categories of configurable components: 1. Prompt templates: These prompt templates are used to inform the language model of the task setting, show the list of available commands, augment environment responses with the values of state variables, and provide the initial task setting. 2. Command files: These files contain the source code of bash or Python functions and scripts. Commands are easily modified, added, and removed through manipulating these files' code contents directly. Documentation added in these files can also be injected into prompts to inform the model of the available commands. 3. Control flow: Methods for parsing model responses and processing history can be specified through these configuration arguments. 4. Environment variables: Initial values of variables that may interact with commands and the shell can also be specified in the configuration. In the following Figure 12, we include an annotated example of the contents of a configuration file. <details> <summary>Figure 10 Details</summary> ![b780b31af167b2fa802e23fb24d28684195b3bace13a02f3d21c2c8fcb5c6de0](http://localhost:8000/v1/image/b780b31af167b2fa802e23fb24d28684195b3bace13a02f3d21c2c8fcb5c6de0) ### Visual Description # Technical Document Extraction: Configuration (.yaml) ## 1. Component Isolation The image is a structured technical snippet representing a configuration file. It is divided into two primary regions: * **Header:** A title bar identifying the content type. * **Main Content Area:** A code block containing YAML-formatted configuration keys, comments, and descriptions. --- ## 2. Detailed Content Extraction ### Header * **Text:** `Configuration (.yaml)` * **Description:** This serves as the title for the data block, indicating that the following text follows YAML (YAML Ain't Markup Language) syntax. ### Main Content Area (YAML Transcription) The following is a precise transcription of the text within the configuration block. Note that the arrow symbol (`↪`) indicates a line wrap from the original visual representation. ```yaml # Prompt Templates: Control how observations of environment are shown # ↪ to agent system_template: | First `system` message shown to agent instance_template: |- Instance prompt, contains task instance-specific content next_step_template: |- Format template of per-turn observation (Contains standard output ↪ from agent's action) next_step_no_output_template: |- ``` --- ## 3. Component Analysis and Flow This document defines the structure of prompts used to interact with an AI agent. The flow follows a standard LLM (Large Language Model) interaction pattern: 1. **Commentary:** The top lines explain the purpose of the file: controlling how the environment's observations are presented to the agent. 2. **`system_template`**: Defines the high-level instructions or persona (the "system" message) that initializes the agent's behavior. 3. **`instance_template`**: Defines the specific task or problem the agent is currently assigned to solve. 4. **`next_step_template`**: Defines the structure for iterative feedback. It formats the results/observations resulting from the agent's previous actions. 5. **`next_step_no_output_template`**: A placeholder for handling scenarios where an agent's action results in no observable output. --- ## 4. Technical Metadata * **Language:** English. * **Format:** YAML. * **Syntax Notations:** * The pipe symbols (`|` and `|-`) are YAML block scalars used to handle multi-line strings. * The `|-` specifically denotes a "strip" block scalar, which removes trailing newlines. * **Visual Indicators:** The use of a monospaced font suggests this is intended to be read as source code or a configuration schema. </details> ``` Format template of observation when there is no standard output -> from the agent's action format_error_template: |- Format template of error message (Used when agent's action causes -> an error) demonstration_template: | Format template for showing a demonstration to the agent demonstrations: - 'trajectory/<username>/<experiment folder>/*.traj' - File is a demonstration of how to solve a task. This could an agent -> generated trajectory. - You can include 1+ demonstrations # Environment States: Define features of the SWEEnv environment env_variables: # Default variables for SWEEnv at the beginning of each instance CURRENT_FILE: 0 CURRENT_LINE: OVERLAP: SEARCH_FILES: SEARCH_INDEX: SEARCH_RESULTS: WINDOW_SIZE: START_INDEX: END_INDEX: START_CURSOR: END_CURSOR: START_CURSORS_MARK: END_CURSOR_MARK: state_command: | # `state_command` allows you to update state variables to reflect any -> aspect of the environment (e.g. current working directory) name: state code: | state() { echo '{"pwd": "'$PWD'"}'; # Action Interface: Define how an agent interacts with the SWEEnv -> environment command_files: - path/to/bash_file.sh - Each file contains a list of commands implemented in bash - You can include 1+ command files parse_command: Reference to functionality for defining command -> documentation history_processor: Reference to functionality for controlling agent's -> message history parse_function: Parser run on agent output ``` Figure 12: An example of the configuration file that defines the SWE-agent ACI. A configuration is represented as a single .yaml file, allowing you to define the commands that agents may use, write prompts shown to the agent over the course of a single trajectory, and control the input/output interface that sits between the agent and environment. The prompt templates are explained in detail in Section C. The environment variables and command files work in tandem; environment variables make the interfaces stateful, and when commands are invoked, the corresponding states are updated to reflect the changes to the environment and interface. The parse\_command , parse\_function , and history\_processor all reference implementations declared within the agent module. The parse\_command file describes how command documentation should be presented to the agent. The parse\_function is what enforces the input/output formats for the agent. The history\_processor points to the logic for controlling and modifying the message history enforced at each turn throughout a single task episode. The configuration-based workflow of SWE-agent makes it easy to test new ACIs by incorporating novel commands, input/output formats, context managers, and more into the existing codebase. In the following subsections, we showcase existing implementations of several of these components and discuss how they can be extended. Commands. We describe how to implement your own commands for the SWE-agent ACI. As shown in the above Figure 12, commands are declared as a list of one or more file paths in the command\_files argument. Individual commands must be declared as separate functions in .py or .sh files. Every command subscribes to the following skeleton code in Figure 13. ``` Command_Skeleton Code # @yaml # signature: [command] [argument(s)] # docstring: [Brief description of what your command does.] # arguments: # [argument 1 name]: # type: [type (i.e. integer, string)] # description: [Brief description of this argument] # required: [true|false] # [argument 2 name]: # ... [command]() { # Implementation here } ``` Figure 13: The skeleton code for defining a command that can be accessed in the SWE-agent ACI. The function definition includes both the underlying implementation along with several arguments that describe how to use the command, which is compiled into the System template's command documentation at run time. The choice of Python or Bash based implementations of commands means they can be written to do a wide variety of actions, and the use of Docker means that the commands and system can be co-designed. Here is a list of guidelines around how to implement commands correctly. - Command arguments can be referenced via positional parameters notation (i.e. $1 ). - The implementation for your command is unconstrained. There are no limitations on the form of the underlying command code. - If there are no arguments, omit the arguments section. - The minimal documentation requirements are signature and docstring . - Third party libraries can be freely imported and used by commands (e.g., flake8 ). - Global variables can be used to make stateful changes to the environment. For instance, for the commands associated with the File Viewer, you'll see we define the CURRENT\_LINE variable for the file viewer. This variable is modified across multiple commands, including open , goto , scroll\_up , scroll\_down , and edit . - To show effects of a command, print to standard output (e.g., with echo ). The command parsing logic is implemented such that it does not look for a return value. Once the file path containing the command is added to command\_docs as an argument, the command is immediately available for use in subsequent task episodes. Including a demonstration that uses more complicated commands can be helpful to showcase proper use and may increase the frequency with which the agent uses the command. Input/output format. The input/output format defines what a correctly formatted response for an agent should look like. Selecting a suitable format greatly affects how well agents can interact with the environment. The methods for communicating and enforcing the input/output format are separated across several arguments. In Figure 12, the value of parse\_function should point to a class definition that enforces the format and actually parses the agent's responses. Informing the agent of the expectations around the input/output format should take place in system\_template , and the agent can be reminded of these standards via the format\_error\_template . New input/output formats can be easily devised and enforced by updating these arguments to point to a new class or display different natural language instructions. Context management. Context management is implemented as a class within the agent module. The history\_processor argument allows one to specify which context manager to use via the configuration file. Underneath the hood, the context manager is invoked per turn of the interactive loop. From the entire recorded history of the agent's interactions so far, the context manager constructs the literal history to be fed to the agent to invoke the next response. The general design of history\_processor s allows for easy experimentation towards more sophisticated strategies for managing history. ## B Extended Results In this section, we provide additional results, including performance marginalized against different dimensions, patch generation statistics, and problem solving patterns reflected by SWE-agent trajectories. Per analysis, we provide numerical or qualitative evidence that supports our findings, describe our takeaways from each finding, and discuss both the strengths of SWE-agent relative to prior baselines along with future directions based on improving common failure modes. ## B.1 Hyperparameter Sweep We performed a hyperparameter sweep using a subset of 37 instances sampled randomly from the dev split of SWE-bench. We present the results in Table 5, where we perform the sweeps for both the GPT-4 Turbo and Claude 3 Opus models. For GPT-4 Turbo the best configuration has a % Resolved rate of 15 . 1 %, with a temperature of 0 . 0 , window length of 100 and history set to last five observations (described in §3). There is a three way tie for Claude 3 Opus between the aforementioned configuration along with two additional settings (Temperature/Window/History of 0 . 2 / 100 /Last-5 and 0 . 2 / 200 /Full). We elect to run inference of both models on the SWE-bench test sets (both full and Lite splits) using the 0 . 0 / 100 /Last-5 configuration. Table 5: Hyper parameter sweep results on a subset of the SWE-bench dev split. % Resolved shows the mean score across 5 samples. | Model | Temperature | Window | History | %Resolved | |---------------|---------------|----------|-------------|-------------| | GPT-4 Turbo | 0 | 100 | Full | 14.1 | | GPT-4 Turbo | 0 | 100 | Last 5 Obs. | 15.1 | | GPT-4 Turbo | 0 | 200 | Full | 9.2 | | GPT-4 Turbo | 0 | 200 | Last 5 Obs. | 10.8 | | GPT-4 Turbo | 0.2 | 100 | Full | 10.8 | | GPT-4 Turbo | 0.2 | 100 | Last 5 Obs. | 12.4 | | GPT-4 Turbo | 0.2 | 200 | Full | 8.7 | | GPT-4 Turbo | 0.2 | 200 | Last 5 Obs. | 10.8 | | Claude 3 Opus | 0 | 100 | Full | 5.4 | | Claude 3 Opus | 0 | 100 | Last 5 Obs. | 8.1 | | Claude 3 Opus | 0 | 200 | Full | 7 | | Claude 3 Opus | 0 | 200 | Last 5 Obs. | 7.1 | | Claude 3 Opus | 0.2 | 100 | Full | 7.4 | | Claude 3 Opus | 0.2 | 100 | Last 5 Obs. | 8.1 | | Claude 3 Opus | 0.2 | 200 | Full | 8.1 | | Claude 3 Opus | 0.2 | 200 | Last 5 Obs. | 6.8 | ## B.2 Model Performance We present analyses of model performance marginalized across different dimensions and categories. Performance by Repository. We include a breakdown of model performance by repository on the SWE-bench Lite dataset in Table 6. We also include and adjust the performance of Claude 2 on SWE-bench, inherited from the baseline performances established in the original work. As presented above, SWE-agent performance is superior to prior approaches, solving not only a higher percentage of problems across repositories, but also resolving problems in repositories that were previously nearly or completely unsolved by prior retrieval augmented generation baselines used in the original SWE-bench work (e.g. matplotlib, sympy/sympy). Temporal Analysis. In Table 7, we provide a temporal breakdown that shows the % Resolved statistics for task instances from different years. There is no clear correlation between a task instance's https://github.com/matplotlib/matplotlib/ https://github.com/sympy/sympy Table 6: % Resolved performance across repositories represented in the SWE-bench Lite dataset. Each row corresponds to a repository while each column is the model's performance for that repository. The numbers in parentheses in the 'Repo" column is the number of task instances in SWE-bench Lite that are from the corresponding repository. | | SWE-agent | SWE-agent | RAG | RAG | RAG | |--------------------------------|-------------|---------------|--------|---------------|----------| | Repo | GPT 4 | Claude 3 Opus | GPT 4 | Claude 3 Opus | Claude 2 | | astropy/astropy (6) | 16.67% | 33.33% | 0.00% | 0.00% | 0.00% | | django/django (114) | 26.32% | 16.67% | 4.39% | 6.14% | 5.26% | | matplotlib/matplotlib (23) | 13.04% | 13.04% | 0.00% | 0.00% | 0.00% | | mwaskom/seaborn (4) | 25.00% | 0.00% | 25.00% | 25.00% | 0.00% | | pallets/flask (3) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | | psf/requests (6) | 33.33% | 16.67% | 0.00% | 0.00% | 0.00% | | pydata/xarray (5) | 0.00% | 0.00% | 20.00% | 20.00% | 0.00% | | pylint-dev/pylint (6) | 16.67% | 0.00% | 0.00% | 0.00% | 0.00% | | pytest-dev/pytest (17) | 17.65% | 5.88% | 0.00% | 5.88% | 5.88% | | scikit-learn/scikit-learn (23) | 17.39% | 17.39% | 0.00% | 4.35% | 8.70% | | sphinx-doc/sphinx (16) | 6.25% | 6.25% | 0.00% | 0.00% | 0.00% | | sympy/sympy (77) | 10.39% | 5.19% | 1.30% | 2.60% | 0.00% | Table 7: % Resolved performance for task instances from different years represented in the SWEbench Lite dataset. Each row corresponds to a year while each column is the model's performance for task instances with a created\_at timestamp from that year. The numbers in parentheses in the Year column is the number of task instances in SWE-bench Lite from that corresponding year. | | SWE-agent | SWE-agent | RAG | RAG | RAG | |-------------------|-------------|---------------|-------|---------------|----------| | Year | GPT 4 | Claude 3 Opus | GPT 4 | Claude 3 Opus | Claude 2 | | 2023 (30) | 23.33% | 13.33% | 3.33% | 3.33% | 0.0% | | 2022 (57) | 21.05% | 17.54% | 5.26% | 7.02% | 1.75% | | 2021 (42) | 23.81% | 11.90% | 2.38% | 4.76% | 2.38% | | 2020 (66) | 10.61% | 7.58% | 3.03% | 1.52% | 1.52% | | Before 2020 (105) | 17.14% | 10.48% | 0.95% | 4.76% | 5.71% | creation year and its resolution rate across either models or setting. For instance, while the SWE-agent w/ GPT-4 approach solves the highest percentage of problems from 2021, while the RAG w/ GPT-4 and SWE-agent w/ Claude 3 Opus approaches perform better on task instances from 2022. ## B.3 Trajectory Analysis We present additional characterizations of trajectories corresponding to task instances that were successfully resolved by SWE-agent w/ GPT-4 Turbo (unless otherwise specified). ## B.3.1 Turns to Resolution Figure 14 visualizes the distribution of the number of turns SWE-agent needed to complete task instances that were successfully resolved. On the full SWE-bench test set, SWE-agent w/ GPT-4 takes an average of 14.71 turns to finish a trajectory, with a median of 12 turns and 75% of trajectories being completed within 18 turns. On the Lite split of the SWE-bench test set, SWE-agent w/ Claude 3 Opus takes an average of 12.71 turns to finish a trajectory, with a median of 13 turns and 75% of trajectories being completed within 15 turns. From the distribution, it is evident that across models and SWE-bench splits, the majority of task instances are typically solved and completed comfortably within the allotted budget. This also points to a general area of improvement for language agent systems - if a language agent's initial problem solving approach, typically reflected in the first 10 to 20 turns, does not yield a good Figure 14: Distribution of the number of turns for interactive trajectories corresponding to solved task instances on SWE-bench. The left histogram shows this distribution for SWE-agent w/ GPT 4 on the full SWE-bench test set (286 trajectories). The right histogram is the performance of SWE-agent w/ Claude 3 Opus on the Lite split of the SWE-bench test set (35 trajectories). <details> <summary>Figure 11 Details</summary> ![434c3bbc97f6137735ac3d3eaa1c2b9b14ccb55c458a3c01038949217c5ce3f3](http://localhost:8000/v1/image/434c3bbc97f6137735ac3d3eaa1c2b9b14ccb55c458a3c01038949217c5ce3f3) ### Visual Description # Technical Data Extraction: Resolved by Turn Histograms This document contains a detailed extraction of data from two side-by-side histograms comparing the performance of two AI models (GPT-4 and Claude 3 Opus) in terms of how many "turns" it takes to resolve a task. ## 1. General Metadata * **Image Type:** Comparative Histograms (Frequency Distributions). * **Primary Language:** English. * **Visual Style:** Blue bars with black outlines on a white background. * **Common Y-Axis Label:** Count (Frequency of occurrences). * **Common X-Axis Label:** Turn (The number of interactions/steps taken). --- ## 2. Component Analysis: Left Chart (GPT-4) ### Header Information * **Title:** Resolved by Turn (GPT 4, Full) ### Axis Scales * **Y-Axis (Count):** Ranges from 0 to 60, with major tick marks every 10 units (0, 10, 20, 30, 40, 50, 60). * **X-Axis (Turn):** Ranges from approximately 4 to 40, with major tick marks every 5 units (5, 10, 15, 20, 25, 30, 35, 40). ### Data Distribution & Trends * **Trend Description:** The distribution is unimodal and right-skewed. There is a significant peak (mode) between turns 8 and 13, followed by a steady decline with a small secondary bump around turn 18-20. The data trails off significantly after turn 35. * **Estimated Data Points (Bin Counts):** * **Turns 4-6:** ~24 * **Turns 6-8:** ~27 * **Turns 8-11 (Peak):** ~59 * **Turns 11-13:** ~55 * **Turns 13-15:** ~23 * **Turns 15-18:** ~20 * **Turns 18-20:** ~26 * **Turns 20-23:** ~8 * **Turns 23-25:** ~3 * **Turns 25-28:** ~9 * **Turns 28-30:** ~7 * **Turns 30-32:** ~5 * **Turns 32-34:** ~13 * **Turns 34-37:** ~3 * **Turns 37-40:** ~4 --- ## 3. Component Analysis: Right Chart (Claude 3 Opus) ### Header Information * **Title:** Resolved by Turn (Claude 3 Opus, Full) ### Axis Scales * **Y-Axis (Count):** Ranges from 0 to 30+, with major tick marks every 5 units (0, 5, 10, 15, 20, 25, 30). Note: The highest bar exceeds the 30 mark. * **X-Axis (Turn):** Ranges from approximately 4 to 26, with major tick marks every 5 units (5, 10, 15, 20, 25). ### Data Distribution & Trends * **Trend Description:** The distribution is multimodal and more compressed than the GPT-4 chart. It shows three distinct peaks: a primary peak around turn 10-11, a secondary peak at turn 14, and a tertiary peak at turn 16-17. The resolution process for this model appears to conclude much earlier (by turn 26) compared to GPT-4 (which goes to turn 40). * **Estimated Data Points (Bin Counts):** * **Turns 4-5:** ~3 * **Turns 6-7:** ~3 * **Turns 7-8:** ~12 * **Turns 8-10:** ~16 * **Turns 10-11 (Highest Peak):** ~32 * **Turns 11-13:** ~8 * **Turns 13-14 (Secondary Peak):** ~29 * **Turns 14-16:** ~9 * **Turns 16-17 (Tertiary Peak):** ~27 * **Turns 17-19:** ~7 * **Turns 19-20:** ~12 * **Turns 20-22:** ~9 * **Turns 22-23:** ~17 * **Turns 23-25:** ~7 * **Turns 25-26:** ~1 --- ## 4. Comparative Summary | Feature | GPT-4 (Full) | Claude 3 Opus (Full) | | :--- | :--- | :--- | | **Max Turn Count** | ~40 Turns | ~26 Turns | | **Highest Frequency** | ~59 (at Turn 10) | ~32 (at Turn 10) | | **Distribution Shape** | Single dominant peak, long tail. | Multiple distinct peaks, shorter range. | | **Efficiency Indicator** | Resolves most tasks by turn 15, but has many outliers. | Resolves tasks in a tighter window, finishing all by turn 26. | </details> Figure 15: The distribution of agent trajectories by total steps (left) and cost (right) for SWE-agent with GPT-4 Turbo on SWE-bench. The distributions of resolved instances are shown in orange and unresolved are shown in blue. Resolved instances clearly display an earlier mean and fewer proportion of trajectories with many steps or that cost near the maximum budget of $ 4 . 00 . <details> <summary>Figure 12 Details</summary> ![1cb5574865d3c486a9a71750c997c178ce5d5997917266f76515ae38c778f6e1](http://localhost:8000/v1/image/1cb5574865d3c486a9a71750c997c178ce5d5997917266f76515ae38c778f6e1) ### Visual Description # Technical Data Extraction: Submitted Prediction Analysis This document provides a detailed extraction and analysis of two histograms comparing prediction outcomes based on "Total Steps" and "Cost (USD)". ## 1. General Metadata * **Language:** English * **Image Type:** Comparative Histograms (Stacked) * **Primary Categories (Legend):** * **Resolved: True** (Represented by **Orange** bars) * **Resolved: False** (Represented by **Blue** bars) * **Layout:** Two side-by-side charts sharing a common Y-axis label ("Frequency"). --- ## 2. Chart 1: Submitted Prediction Total Steps by Resolution ### Component Isolation * **Header:** Submitted Prediction Total Steps by Resolution * **X-Axis Title:** Total Steps * **Y-Axis Title:** Frequency * **Legend Location:** Top right [x=0.85, y=0.90] ### Axis Markers * **X-Axis:** 5, 10, 15, 20, 25, 30, 35 * **Y-Axis:** 0, 20, 40, 60, 80, 100, 120, 140, 160 ### Data Trends and Observations * **Overall Distribution:** The data is right-skewed, with a primary peak occurring between 10 and 15 steps. * **Resolved (True - Orange):** This series shows a relatively low frequency across all step counts. It peaks early around 10-12 steps (approx. 55 frequency) and tapers off significantly after 20 steps. * **Resolved (False - Blue):** This series dominates the frequency. It shows a bimodal distribution with a major peak between 10-15 steps (reaching ~170 frequency) and a secondary, smaller peak around 18-20 steps. * **Key Data Point:** The highest total frequency occurs at approximately 13 steps, where the "False" resolution significantly outweighs the "True" resolution. --- ## 3. Chart 2: Submitted Prediction Cost by Resolution ### Component Isolation * **Header:** Submitted Prediction Cost by Resolution * **X-Axis Title:** Cost (USD) * **Y-Axis Title:** Frequency * **Legend Location:** Top right [x=0.85, y=0.90] ### Axis Markers * **X-Axis:** 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0 * **Y-Axis:** 0, 20, 40, 60, 80, 100, 120, 140 ### Data Trends and Observations * **Overall Distribution:** The cost distribution is heavily right-skewed. The majority of predictions cost between $0.75 and $1.75 USD. * **Resolved (True - Orange):** Successful resolutions are concentrated at the lower end of the cost spectrum. The peak frequency for "True" occurs around $0.90 - $1.00 USD (approx. 48 frequency). As cost increases beyond $2.00, the frequency of "True" resolutions becomes negligible. * **Resolved (False - Blue):** Unsuccessful resolutions follow a similar initial curve but maintain a much higher frequency across the entire cost range. The peak occurs at approximately $1.10 USD (reaching ~140 frequency). * **Key Data Point:** There is a long tail for "False" resolutions extending all the way to $4.00 USD, whereas "True" resolutions almost entirely disappear after the $3.50 mark. --- ## 4. Comparative Summary | Metric | Peak Frequency Range | Success (True) Trend | Failure (False) Trend | | :--- | :--- | :--- | :--- | | **Total Steps** | 10 - 15 Steps | Low, peaks early at ~11 steps. | High, bimodal peaks at ~13 and ~19 steps. | | **Cost (USD)** | $0.80 - $1.50 | Concentrated under $1.50. | Dominates all price points; long tail to $4.00. | **Conclusion:** Both charts indicate that "False" resolutions (failures) are significantly more frequent than "True" resolutions (successes) across all measured steps and costs. Success is most likely to occur at lower step counts (approx. 10) and lower costs (approx. $1.00). </details> solution, it struggles to make use of later turns that build upon past mistakes. To remedy this issue and induce stronger error recovery capabilities in language agents, future directions could consider improving either the model, the ACI, or both. ## B.3.2 Walkthrough of Trajectory Phases We describe what happens in different phases of an agent's problem solving trajectory. To support our observations, we present several tables and distributions that help highlight consistent trends. Initial reproduction, localization steps. First, the initial steps that SWE-agent usually takes is heavily dominated by Localization and Reproduction operations. The most commonly occurring pattern in general is the create , edit , python triplet. Across these commands, an agent creates an empty python file, adds an executable code snippet via edit , and then attempts to run it. As an alternative, the agent also sometimes decides to start off instead with Localization, or identifying the files/lines causing the issue. Depending on how informative the issue description and results for initial search queries are, agents will run additional search queries with finer grained search tools to zoom in on the target problematic code area (e.g., search\_dir , open , search\_file / scroll\_down ). These trends are also reflected in Figure 16, which shows a distribution of patterns across turns according to the categories defined in Table 8. The three leftmost bars reflect that Reproduction followed by Localization constitutes the lion's share of operations that occur in the early phases of a trajectory. For a more thorough breakdown, we also include Figure 17, which shows an estimated distribution of each action with respect to different turns, normalized across the total number of times Table 8: We present a table of the most frequently occurring action patterns at each turn ('frequently" means ≥ 4 times) in trajectories of task instances resolved by SWE-agent w/ GPT-4. For instance, the pattern create , edit , python appears 156 times at the first to third turns. In addition, we also manually assign each entry a category (Reproduction, Localization (File), Localization (Line), Editing, Submission) that generally captures the underlying purpose of such a pattern. 'Reproduction" refers to the sub-task of recreating the error or request described by the issue. 'Localization" refers to the sub-task of identifying the code that is the cause of the issue. | Turns | Pattern | Count | Category | |---------|----------------------------------|---------|---------------------| | 1-3 | create , edit , python | 156 | Reproduction | | 1-3 | search_dir , open , search_file | 21 | Localization (File) | | 1-3 | search_dir , open , scroll_down | 12 | Localization (Line) | | 1-3 | create , edit , edit | 11 | Reproduction | | 1-3 | search_dir , open , edit | 10 | Localization (Line) | | 2-4 | edit , python , find_file | 71 | Localization (File) | | 2-4 | edit , python , edit | 37 | Reproduction | | 2-4 | edit , python , search_dir | 26 | Localization (File) | | 2-4 | edit , python , open | 15 | Localization (File) | | 2-4 | open , edit , edit | 13 | Editing | | 2-4 | open , edit , create | 13 | Editing | | 2-4 | open , scroll_down , scroll_down | 9 | Localization (Line) | | 2-4 | open , scroll_down , edit | 5 | Editing | | 2-4 | open , edit , submit | 5 | Submission | | 3-5 | python , find_file , open | 61 | Localization (File) | | 3-5 | python , edit , python | 25 | Editing | | 3-5 | search_file , goto , edit | 24 | Localization (Line) | | 3-5 | python , search_dir , open | 23 | Localization (File) | | 3-5 | edit , create , edit | 13 | Editing | | 3-5 | python , edit , edit | 11 | Editing | | 3-5 | python , open , edit | 7 | Editing | | 3-5 | python , find_file , find_file | 7 | Localization (File) | | 3-5 | edit , edit , submit | 4 | Submission | | 3-5 | edit , edit , create | 4 | Editing | | 4-6 | find_file , open , edit | 28 | Editing | | 4-6 | find_file , open , search_file | 19 | Localization (Line) | | 4-6 | edit , edit , python | 11 | Reproduction | | 4-6 | goto , edit , edit | 8 | Editing | | 4-6 | find_file , open , goto | 8 | Localization (Line) | | 4-6 | goto , edit , submit | 7 | Submission | | 4-6 | goto , edit , create | 7 | Editing | | 4-6 | find_file , open , scroll_down | 6 | Localization (Line) | | 4-6 | scroll_down , scroll_down , edit | 5 | Localization (Line) | | 4-6 | find_file , find_file , open | 5 | Localization (File) | | 5-7 | open , search_file , goto | 29 | Localization (Line) | | 5-7 | open , edit , python | 20 | Editing | | 5-7 | open , goto , edit | 7 | Editing | | 5-7 | scroll_down , edit , submit | 4 | Submission | | 6-8 | scroll_down (x3) | 6 | Localization (Line) | | 6-8 | search_file , goto , scroll_down | 4 | Localization (Line) | | 7-9 | edit , python , rm | 20 | Editing | | 7-9 | goto , edit , python | 12 | Editing | | 8-10 | python , rm , submit | 19 | Submission | | 8-10 | search_file , goto , search_file | 4 | Localization (File) | | 9-11 | edit (x3) | 18 | Editing | | 9-11 | edit , open , edit | 6 | Editing | | 9-11 | goto , search_file , goto | 4 | Localization (Line) | Figure 16: We assign each pattern to one of five categories (as presented in Table 8) and present a histogram of the turns at which patterns from specific categories show up frequently. <details> <summary>Figure 13 Details</summary> ![911de671b55dc4dcdd9439e1ec7c3d69bc137be12b9c1d766be2fa0a431d8c52](http://localhost:8000/v1/image/911de671b55dc4dcdd9439e1ec7c3d69bc137be12b9c1d766be2fa0a431d8c52) ### Visual Description # Technical Document Extraction: Freq. of Pattern Categories per Turn ## 1. Document Metadata * **Title:** Freq. of Pattern Categories per Turn * **Type:** Stacked Bar Chart * **Primary Language:** English * **X-Axis Label:** Turns (Categorical ranges) * **Y-Axis Label:** Frequency (Numerical, 0 to 200+) ## 2. Component Isolation ### Header * **Text:** "Freq. of Pattern Categories per Turn" ### Main Chart Area * **X-Axis Categories (Turns):** 1-3, 2-4, 3-5, 4-6, 5-7, 6-8, 7-9, 8-10, 9-11, 10-12, 11-13, 12-14. * **Y-Axis Markers:** 0, 50, 100, 150, 200. * **Data Series:** Five stacked categories represented by specific colors. ### Legend (Spatial Grounding: Top Right [x~0.8, y~0.2]) * **Light Orange:** Localization (Line) * **Dark Red:** Localization (File) * **Dark Green:** Editing * **Bright Green:** Reproduction * **Grey:** Submission --- ## 3. Data Extraction and Trend Analysis ### Trend Verification * **Overall Trend:** The total frequency of all categories combined shows a sharp downward trend as the number of "Turns" increases. The highest activity occurs in the 1-3 turn range (~220 units) and tapers off to near zero by the 12-14 turn range. * **Reproduction (Bright Green):** Dominant in the earliest turns (1-3), then rapidly decreases and disappears after the 4-6 turn range. * **Localization (File) (Dark Red):** Peaks in the 2-4 and 3-5 turn ranges, then diminishes significantly. * **Editing (Dark Green):** Maintains a relatively consistent presence in the middle ranges (3-5 through 5-7) and is the primary activity seen in later turns (7-9, 9-11). * **Submission (Grey):** Appears sporadically as a small "cap" on the bars, notably in ranges 2-4, 3-5, 4-6, 5-7, and 8-10. ### Data Table (Estimated Values) The following table reconstructs the visual data points from the stacked bars. Values are approximated based on the Y-axis scale. | Turns | Localization (Line) [Orange] | Localization (File) [Red] | Editing [Dk Green] | Reproduction [Lt Green] | Submission [Grey] | Total (Approx) | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | | **1-3** | 22 | 26 | 0 | 172 | 0 | 220 | | **2-4** | 12 | 115 | 32 | 38 | 5 | 202 | | **3-5** | 27 | 90 | 65 | 3 | 3 | 188 | | **4-6** | 38 | 5 | 43 | 11 | 7 | 104 | | **5-7** | 29 | 0 | 26 | 0 | 5 | 60 | | **6-8** | 10 | 0 | 0 | 0 | 0 | 10 | | **7-9** | 0 | 0 | 38 | 0 | 0 | 38 | | **8-10** | 0 | 4 | 0 | 0 | 22 | 26 | | **9-11** | 4 | 0 | 24 | 0 | 0 | 28 | | **10-12** | 0 | 0 | 3 | 0 | 0 | 3 | | **11-13** | 6 | 0 | 0 | 0 | 0 | 6 | | **12-14** | 0 | 0 | 3 | 0 | 0 | 3 | --- ## 4. Summary of Findings The chart illustrates the lifecycle of a process across "Turns." 1. **Initial Phase (Turns 1-3):** Heavily dominated by **Reproduction**. 2. **Middle Phase (Turns 2-5):** Shift toward **Localization (File)** and the beginning of **Editing**. 3. **Late Phase (Turns 7+):** Activity drops significantly, with **Editing** and occasional **Submission** or **Localization (Line)** being the only remaining patterns. 4. **Submission** activity is most prominent as a concluding action in the 8-10 turn range, despite the overall lower frequency in that bracket. </details> the command occurs across all turns. From these graphs, we can see that create is invoked much more frequently in the very first turn than in any other turn. The search\_dir and search\_file distributions are roughly bi-modal, with a peak of occurrences for both actions showing up in Turn 1 (if the agent decides to do Localization immediately) and the Turn 4 (if the agent decides to do Localization after Reproduction). We also present Figure 18, which communicates similar information as Figure 17, but presented instead as a stacked bar chart with more commands. The chart is created directly from Figure 7, with the frequency of actions at each turn n normalized across the total number of trajectories with a length greater than or equal to n turns. ## Density Plots of Actions across Turns Figure 17: This density plot shows a normalized distribution of actions across different turns of a trajectory. exit\_cost refers to when the token budget cost was exhausted and the episode's changes are automatically submitted (contrary to an intentional submit invoked by the agent). <details> <summary>Figure 14 Details</summary> ![3956b7f29dd5e5f7a3a72c1ed3f57f11b7c7e1d1ef184bf281b3d74f980e0dd4](http://localhost:8000/v1/image/3956b7f29dd5e5f7a3a72c1ed3f57f11b7c7e1d1ef184bf281b3d74f980e0dd4) ### Visual Description # Technical Document Extraction: Density Plots of Actions across Turns ## 1. Document Metadata * **Title:** Density Plots of Actions across Turns * **Chart Type:** Kernel Density Estimation (KDE) Plot / Multi-series Density Plot * **Primary Language:** English ## 2. Component Isolation ### A. Header * **Text:** "Density Plots of Actions across Turns" (Centered at the top). ### B. Main Chart Area * **Y-Axis Label:** "Density" * **Y-Axis Scale:** Linear, ranging from 0.00 to 0.25 with markers at [0.00, 0.05, 0.10, 0.15, 0.20, 0.25]. * **X-Axis Label:** "Turn" * **X-Axis Scale:** Linear, ranging from 0 to 40 with markers every 5 units [0, 5, 10, 15, 20, 25, 30, 35, 40]. ### C. Legend (Spatial Grounding: Top-Center/Right) The legend is located in the upper-middle portion of the plot area, containing eight categories with corresponding color swatches: 1. **search_dir:** Dark Red / Maroon 2. **search_file:** Burnt Orange / Rust 3. **find_file:** Bright Orange 4. **create:** Dark Green 5. **edit:** Medium Green 6. **exit_cost:** Black 7. **submit:** Grey 8. **python:** Gold / Yellow --- ## 3. Data Series Analysis and Trends ### Early-Phase Actions (Turns 0–10) These actions show high density at the beginning of the process, indicating they are primarily initialization or discovery steps. * **create (Dark Green):** * *Trend:* Sharp, immediate spike. * *Peak:* Reaches the highest density of all series (~0.26) at approximately Turn 1. It drops off rapidly, becoming negligible after Turn 5. * **search_dir (Dark Red):** * *Trend:* Rapid rise and fall with a secondary smaller hump. * *Peak:* Primary peak at Turn 1 (~0.15 density); secondary peak at Turn 4 (~0.11 density). * **find_file (Bright Orange):** * *Trend:* Bimodal distribution. * *Peak:* First peak at Turn 1 (~0.13); second, higher peak at Turn 4 (~0.18). * **search_file (Burnt Orange):** * *Trend:* Broad distribution in the early phase. * *Peak:* Peaks around Turn 6 (~0.10 density). * **python (Gold):** * *Trend:* Gradual rise in the early phase. * *Peak:* Peaks around Turn 3 (~0.11 density) and maintains a low-level "tail" across the mid-turns. ### Mid-Phase Actions (Turns 5–25) * **edit (Medium Green):** * *Trend:* Broad, low-amplitude distribution. * *Peak:* Peaks around Turn 7 (~0.07 density) and slowly tapers off toward Turn 30. * **submit (Grey):** * *Trend:* Shifted further right than discovery actions; multi-modal. * *Peak:* Primary peak at Turn 10 (~0.11 density). It shows smaller fluctuations (local maxima) around Turn 18 and Turn 28. ### Late-Phase Actions (Turns 20–40) * **exit_cost (Black):** * *Trend:* Distinctly separated from all other actions. It is nearly zero for the first 20 turns, followed by a small spike and then a massive surge at the end. * *Peak:* Small local peak at Turn 22 (~0.04). The dominant peak occurs at Turn 34 (~0.23 density), with a secondary late peak at Turn 37 (~0.10). --- ## 4. Summary of Key Findings * **Initialization:** The "create" and "search_dir" actions are the most immediate, occurring almost exclusively in the first 1-3 turns. * **Exploration/Execution:** "find_file" and "python" actions bridge the gap between initialization and the middle phase. * **Processing:** "edit" and "submit" represent the middle of the workflow, with "submit" peaking significantly later (Turn 10) than the initial search actions. * **Termination:** The "exit_cost" action is a clear outlier in time, representing the finality of the process, occurring almost entirely after Turn 30. </details> Cycle of edit, then evaluate. From the fifth turn onwards, the distribution of actions per turn can be generally described as alternating edit and python / pytest actions. After reproducing the issue and localizing the file(s) responsible for the problem, agents will typically make edits to the file, then run the reproduction script or existing tests to check whether the proposed edits resolve the original issue and maintain existing desirable behavior. This pair of actions will often repeat for several turns, as an initial edit usually does not successfully resolve the given issue. Multiple rounds of editing that are supplemented by execution feedback from prior turns are conducive to more well-formed, successful subsequent edits. As reflected in Table 8, for turn 4 onwards, the most popular pattern that begins at each turn usually falls under the Editing category. This is also made obvious by Figure 18, where the edit command is the most popular command for Turns 5 to 31 , with only one exception (Turn 30 ). From Figure 17, it is also notably that the distributions of the edit and python commands are quite similar, as they typically follow one another. Figure 18: A normalized view of Figure 7. The distributions for turn n are normalized across the number of trajectories that have a length of at least n or more turns. <details> <summary>Figure 15 Details</summary> ![8261c321c56ba66137eacb1fecf29db14c2b713274aa95b68f9aeaaefba3d3b7](http://localhost:8000/v1/image/8261c321c56ba66137eacb1fecf29db14c2b713274aa95b68f9aeaaefba3d3b7) ### Visual Description # Technical Document Extraction: Distribution of Actions per Turn ## 1. Document Metadata * **Title:** Distribution of Actions per Turn (Normalized) * **Image Type:** Stacked Bar Chart (Normalized to 1.0) * **Primary Language:** English ## 2. Component Isolation ### A. Header * **Text:** "Distribution of Actions per Turn (Normalized)" ### B. Main Chart Area * **Y-Axis Label:** "Frequency" * **Y-Axis Scale:** 0.0 to 1.0 (intervals of 0.2) * **X-Axis Label:** "Turn" * **X-Axis Scale:** Discrete integers from 0 to 38. * **Data Representation:** Each vertical bar represents a single "Turn." The bar is segmented by color to show the relative frequency (proportion) of different actions performed during that specific turn across a dataset. ### C. Legend (Spatial Placement: Right Side [x=0.85, y=0.5]) The legend contains 15 categories, grouped by color families: * **Red/Orange (Search/Find):** `search_dir` (dark red), `search_file` (burnt orange), `find_file` (orange), `find` (light orange). * **Green (File Manipulation):** `create` (dark green), `edit` (medium green). * **Grayscale (Completion):** `exit_cost` (black), `submit` (gray). * **Blue (Navigation):** `open` (dark blue), `scroll_up` (medium blue), `scroll_down` (bright blue), `goto` (light blue). * **Yellow/Gold (Execution):** `python3` (olive/dark gold), `python` (gold), `pytest` (light yellow). --- ## 3. Data Extraction and Trend Analysis ### General Trends 1. **Early Phase (Turns 0-5):** High concentration of search and navigation actions. Turn 0 is dominated by `create` (dark green) and `search_dir` (dark red). Turn 1 shows a massive spike in `open` (dark blue). 2. **Middle Phase (Turns 6-30):** The chart is dominated by `edit` (medium green) and `submit` (gray), interspersed with navigation (`scroll_up/down`, `goto`) and execution (`python/pytest`). 3. **Late Phase (Turns 31-38):** A sharp transition occurs. `edit` actions diminish. `exit_cost` (black) and `submit` (gray) become the primary components, indicating the conclusion of tasks. ### Detailed Action Distribution by Turn (Approximate Values) | Turn | Dominant Actions (Visual Estimate) | Observations | | :--- | :--- | :--- | | **0** | `create` (~60%), `search_dir` (~20%), `find_file` (~20%) | Initialization phase. | | **1** | `open` (~35%), `edit` (~60%) | First file opening and editing. | | **2** | `python` (~55%), `edit` (~30%) | Early execution attempt. | | **3** | `edit` (~35%), `find_file` (~25%), `open` (~15%) | Mixed search and edit. | | **5** | `edit` (~40%), `python` (~15%), `find_file` (~10%) | Heavy editing begins. | | **10** | `edit` (~40%), `submit` (~15%), `open` (~10%), `python` (~20%) | Sustained development cycle. | | **20** | `edit` (~40%), `submit` (~10%), `python` (~25%), `goto` (~10%) | High execution/testing frequency. | | **30** | `edit` (~35%), `submit` (~15%), `python` (~15%), `scroll_up` (~10%) | Transition toward completion. | | **34** | `edit` (~60%), `exit_cost` (~20%), `open` (~15%) | Final edits before exit. | | **36** | `edit` (~35%), `exit_cost` (~65%) | Exit process dominates. | | **37** | `submit` (100%) | Pure submission turn. | | **38** | `exit_cost` (100%) | Final termination. | --- ## 4. Component Analysis * **`edit` (Medium Green):** This is the most consistent action throughout the lifecycle, appearing in almost every turn from 1 to 36, typically occupying 30-60% of the turn's volume. * **`python` / `pytest` (Yellows):** These appear consistently from Turn 2 through Turn 33, representing the iterative "run/test" cycle of a technical task. * **`submit` (Gray):** This action starts appearing around Turn 4 and remains a steady background component until it takes over completely at Turn 37. * **`exit_cost` (Black):** This action is negligible for the first 30 turns but grows rapidly from Turn 31 onwards, representing the overhead or finality of ending a session. * **Navigation (Blues):** `open`, `scroll`, and `goto` are most prevalent in the first half of the process (Turns 1-25), suggesting that information seeking and context setting happen early. ## 5. Summary of Information The image depicts the lifecycle of a technical task (likely coding or debugging). It shows a logical progression: 1. **Discovery:** Searching and creating files. 2. **Iteration:** A long middle phase of editing, navigating code, and running Python scripts/tests. 3. **Finalization:** A concluding phase where editing stops, and the user submits work and exits the environment. </details> Interspersed across these later turns are additional Localization operations for inspecting other parts of the current file (e.g., scroll\_down , scroll\_up ) or opening other files (e.g., open , search\_dir/file , find\_file ). These minor trend lines reflect the tasks that involve multiline or multi-file edits. Figure 18 displays a steady presence of such actions from Turn 6 onwards. Agents will invoke such actions to read different parts (e.g., documentation, implementation) of a long function, especially when it does not fit entirely within the file viewer's number of lines. After editing one function A , running the reproduction script will occasionally propagate an error in a different function B , where function B invokes A . This is a common reason for the additional directory and file level navigation that occurs in the later stages of a trajectory. Concluding submission turns. There is a consistent proportion of submit actions per turn, with a relative peak around Turn 10 , as shown in Figure 17. As mentioned in Section 5.2 and above, the majority of resolved task instances end with an intentional submit command. As suggested by both Figure 15 and Figure 18, submissions are concentrated between Turns 10 and 20 , becoming less frequent for each turn beyond this range. This trend reflects how agents struggle to use later turns to their advantage, particularly when the original problem solving approach fails, which is fairly evident by Turn 20 . Effectively utilizing later turns to either remedy multiple prior errors or pivot to a different problem solving approach are all viable strategies given the 20 + turns that remain. However, due to overwhelming context or greedy tendencies, agents do not reflect such dynamic behavior, instead opting to focus on continued local editing rather than additional exploration. Finally, there is a sharp cut off of exit\_cost actions scattered throughout Turns 30 to 40 ; this reflects that the $ 4 cost limit we impose on runs roughly corresponds to this number of turns. The discrepancies mainly comes from variations in the size of observations, with trajectories containing multiple observations that have a high number of tokens corresponding to ones that terminate relatively earlier. Increasing the cost allowance per task episode would directly increase the maximum number of the turns per episode. ## B.3.3 Breakdowns of Action Sequences In this part, we include more granular examinations of patterns of actions that emerge frequently in trajectories. We also identify consistent associations between groups of actions and how their effects build off one another across several turns. Editing Trends. Editing is a core facet of agents' ability to reproduce issues and propose fixes effectively. It is also the action that models typically struggle with the most. Here, we list several trends we were able to discern about how agents edit. First, across the full SWE-bench test set, a non-trivial minority of edit actions are unsuccessful, meaning the edit invocation raises a linting error. Going forwards, we refer to such an occurrence as a failed edit. Out of 2 , 294 task instances, 1 , 185 ( 51 . 7 %) have at least one turn with an failed edit. Of these trajectories, there is a median of 3 failed edits per trajectory, with a max of 33 . The rate of failed edits is smaller for resolved task instances. Out of 286 resolved instances, 113 ( 31 . 5 %) have at least one turn with an failed edit, with a median/mean/max of 2 failed edits per trajectory, with a max of 26 . Figure 19 shows corresponding distributions. Figure 19: Distribution of the number of failed edit actions per trajectory across all (left) and resolved (right) task instances by SWE-agent with GPT-4 Turbo. A 'failed" edit refers to an edit action that raised a linting error. The left-most bar for both graphs corresponds to the number of trajectories with no failed edits. <details> <summary>Figure 16 Details</summary> ![a96fd83732edba71fefc3f8d68a55989d6626ad19f0e5931c3bcfdad3fbc144c](http://localhost:8000/v1/image/a96fd83732edba71fefc3f8d68a55989d6626ad19f0e5931c3bcfdad3fbc144c) ### Visual Description # Technical Data Extraction: Failed Edits per Trajectory This document contains a detailed extraction of data from two side-by-side histograms illustrating the frequency of failed edits within trajectories. ## 1. Image Overview The image consists of two histograms plotted with blue bars and black outlines. Both charts share the same X-axis label and scale but differ in their Y-axis scales and data subsets. * **Language:** English * **Chart Type:** Histogram * **Color Coding:** Blue bars (Data), Black (Axes and Text) --- ## 2. Left Chart: Failed Edits per Trajectory (All) ### Metadata and Labels * **Title:** Failed Edits per Trajectory (All) * **Y-Axis Label:** Count * **X-Axis Label:** Num. Failed Edits * **X-Axis Markers:** 0, 5, 10, 15, 20, 25, 30, 35, 40 * **Y-Axis Markers:** 0, 200, 400, 600, 800, 1000 ### Trend Analysis The data shows a **heavy-tailed distribution** (power-law-like). The vast majority of trajectories have very few failed edits (0-1), with a rapid exponential decay as the number of failed edits increases. There is a very thin "long tail" extending out to 40 failed edits. ### Data Point Extraction (Estimated) | Num. Failed Edits (Bin) | Count (Approx.) | | :--- | :--- | | 0 - 1 | ~1100 | | 1 - 2 | ~320 | | 2 - 3 | ~180 | | 3 - 4 | ~110 | | 4 - 5 | ~80 | | 5 - 10 | Decreasing from ~60 to ~20 | | 10 - 40 | Sparse distribution, mostly < 20 counts per bin | --- ## 3. Right Chart: Failed Edits per Trajectory (Resolved) ### Metadata and Labels * **Title:** Failed Edits per Trajectory (Resolved) * **Y-Axis Label:** Count * **X-Axis Label:** Num. Failed Edits * **X-Axis Markers:** 0, 5, 10, 15, 20, 25, 30, 35, 40 * **Y-Axis Markers:** 0, 25, 50, 75, 100, 125, 150, 175 ### Trend Analysis Similar to the "All" chart, this follows a **heavy-tailed distribution**. However, the total volume (Count) is significantly lower, indicated by the Y-axis maxing out at 175 rather than 1000+. This suggests that "Resolved" trajectories represent a smaller subset of the total data, but they still predominantly feature low numbers of failed edits. ### Data Point Extraction (Estimated) | Num. Failed Edits (Bin) | Count (Approx.) | | :--- | :--- | | 0 - 1 | ~170 | | 1 - 2 | ~45 | | 2 - 3 | ~22 | | 3 - 4 | ~12 | | 4 - 5 | ~8 | | 5 - 10 | Rapidly tapers to near zero | | 10 - 40 | Extremely sparse; most bins are empty or have 1-2 counts | --- ## 4. Comparative Summary * **Scale Difference:** The "All" dataset has a peak frequency roughly 6.5 times higher than the "Resolved" dataset (~1100 vs ~170). * **Consistency:** Both datasets show that the most common scenario is 0 to 1 failed edits. * **Success Correlation:** The presence of a long tail in the "Resolved" chart indicates that even trajectories with a high number of failed edits (up to ~30) can eventually reach a "Resolved" state, though this is statistically rare. </details> Second, with linting enabled editing, agents 'recover" more often than not from failed edits. To understand whether and how effectively agents use linting error feedback to construct a subsequent, well-formed edit action, we define two terms. Recovery refers to a sequence of failed edits followed immediately by a successful edit, suggesting the agent used linting feedback to make a well-formatted edit. An unsuccessful recovery is consecutive failed edits followed immediately by a non-edit action. Figure 20: Probability of successful edit after n failed edits. The likelihood of recovery decreases as n increases. <details> <summary>Figure 17 Details</summary> ![bcf007e0806260ba1e874c5e256223132a085aa02b22ce7e4d289f35c9f45dfa](http://localhost:8000/v1/image/bcf007e0806260ba1e874c5e256223132a085aa02b22ce7e4d289f35c9f45dfa) ### Visual Description # Technical Document Extraction: Probability of Successful Edit ## 1. Document Metadata * **Language:** English * **Image Type:** Line Graph with Data Markers * **Subject:** Statistical analysis of edit success rates relative to prior failures. ## 2. Component Isolation ### Header * **Title:** Chance of Successful Edit Given $\ge n$ Failed Edits ### Main Chart Area * **Y-Axis Label:** Probability * **Y-Axis Markers:** 0.0, 0.2, 0.4, 0.6, 0.8 * **X-Axis Label:** # of Consecutive Failed Edits * **X-Axis Markers:** Integers from 0 to 20 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20) * **Data Series:** A single blue line with circular markers at each integer coordinate. ### Footer/Legend * **Legend:** Not explicitly present as a separate box; the single data series is defined by the chart title. --- ## 3. Trend Verification and Data Extraction ### Visual Trend Analysis The data series exhibits a **strong downward slope**, indicating a negative correlation between the number of consecutive failed edits and the probability of the next edit being successful. * **Initial Drop:** There is a sharp decrease in probability between 0 and 1 failed edits. * **Steady Decline:** From 1 to 8 failures, the probability decreases at a relatively constant, steep rate. * **Plateauing:** After approximately 15 failures, the curve flattens significantly, approaching a probability near zero (asymptotic behavior). ### Data Table Reconstruction The following values are estimated based on the spatial alignment of the blue markers against the Y-axis grid. | # of Consecutive Failed Edits ($n$) | Estimated Probability of Success | | :--- | :--- | | 0 | ~0.92 | | 1 | ~0.58 | | 2 | ~0.51 | | 3 | ~0.46 | | 4 | ~0.40 | | 5 | ~0.35 | | 6 | ~0.30 | | 7 | ~0.24 | | 8 | ~0.18 | | 9 | ~0.16 | | 10 | ~0.13 | | 11 | ~0.10 | | 12 | ~0.08 | | 13 | ~0.07 | | 14 | ~0.06 | | 15 | ~0.04 | | 16 | ~0.03 | | 17 | ~0.02 | | 18 | ~0.02 | | 19 | ~0.02 | | 20 | ~0.02 | --- ## 4. Technical Summary This chart illustrates the "sunk cost" or "diminishing returns" aspect of editing attempts. * **Baseline:** With 0 prior failures, the success rate is at its peak (over 90%). * **Critical Threshold:** Once a user has failed 4 consecutive times, the probability of success drops below 50% (tossing a coin). * **Terminal State:** After 17 consecutive failures, the probability of success becomes negligible (approx. 2%), suggesting that further attempts after this point are statistically unlikely to result in a successful edit. </details> Across trajectories corresponding to resolved task instances, there are 135 occurrences of 1 + failed edit attempts. Out of these, the agent recovers successfully 104 times. The number of consecutive failed edit attempts before a successful versus failed recovery is also vastly different. Successful recoveries are usually preceded by 2 . 03 edit attempts, less than the average 4 . 22 failed edit attempts of unsuccessful recoveries. Across all task instances, the relative rate of unsuccessful recoveries increases, with 810 successful recoveries versus 555 unsuccessful ones. While the number of consecutive failed edit attempts resulting in a recovery remains steady ( 2 . 2 ), it increases significantly for unsuccessful recoveries ( 5 . 59 ). Third, the odds of recovery decreases as the agent accumulates more failed edit attempts. Figure 20 displays a line plot of the probability of a successful edit given n failed edit attempts in a row. The leftmost data point of n = 0 means that any attempt at editing has a 90 . 5 % chance of eventually being successful. This value drops off once the agent incurs a single failed edit; there is only a 57 . 2 % chance the edit is ultimately successful. In other words, there is a 42 . 8 % chance the agent never recovers upon encountering 1 edit error. Action sequence analysis. We calculate the transition probabilities showing the likelihood of the next action given the previous n actions. To perform this analysis, we first determine the 15 most commonly occurring sequences of n actions, for n ∈ { 1 , 2 , 3 , 4 } . We then count how frequently each command appears after this sequence and finally normalize the counts across the total number of occurrences of the sequence to get a likelihood of the 'Next Action" with respect to the preceding n sequence of actions. We show these transition probability heatmaps, with n = 1 in Figure 21, n = 2 in Figure 22, n = 3 in Figure 23, and n = 4 in Figure 24. From these graphs, it is immediately obvious that several action sequences emerge consistently across many task instances. The high likelihood cells in these heatmaps suggest that SWE-agent uses common problem solving patterns which correspond to higher order operations such as reproducing an issue, localizing buggy code, and proposing/verifying edits. In Figure 21, we see direct associations between pairs of actions. There are several obvious trends. All trajectories begin with create , find\_file , search\_dir , and end on either a submit or exit\_cost . The most popular next action is edit ; it is the most likely action to follow create , edit , goto , pytest , and python . Scroll (e.g., scroll\_down/up ) and search (e.g. find\_file , search\_dir ) actions tend to be repeated. Other interesting correlations are also present. The edit/evaluate pattern is reflected in the correlation between the edit and python pair. A variety of localization patterns are also conspicuous. Sometimes, searching for a file turns out to be less fruitful than searching for a keyword, and visa versa. This is reflected in the find\_file and search\_dir pair. The invocation of open is representative of an agent honing in on a specific file to then continue localizing ( search\_file 0 . 35 , scroll\_down 0 . 18 , goto 0 . 09 ) or begin editing ( edit 0 . 25 ). As the number of prior actions considered increases, more complex operations carried across multiple commands become apparent, echoing the observations from Table 8. In Figure 23, reproduction (e.g. [ create , edit , python ]) is typically followed by adjustments to the script ( edit 0 . 39 ) or localization ( find\_file 0 . 31 , search\_dir 0 . 22 ). Fruitful localization patterns are once again reflected by [ find\_file / search\_dir , open , search\_file ] are followed by goto . In Figure 24, the most popular 4 -grams are related to reproduction or editing. The [ edit , python , rm , submit ] pattern is a popular way for trajectories to finish. Common failure modes are also apparent; repeated actions like edit ( 4 x) and scroll\_down ( 4 x) typically continues cascading. ## B.4 Failure Modes In this section, we provide insight on categorizing common agent failure modes. We perform an automated analysis of the unresolved trajectories ( n = 248 ) from the SWE-bench Lite split with our default configuration. We first create a list of possible failure categories based on model behavior analyzed in Sections B.3.2, which are described in full detail in Table 9. A validation set of 15 instances are then sampled from the 248 instances left marked unresolved by SWE-agent and the authors hand-label them according to these categories. Finally, we combine the agent's trajectory with the patch generated by its changes and the gold patch for reference and use an LM to categorize We use gpt-4o-2024-05-13 from OpenAI. ## Transition Probabilities Heatmap Figure 21: Heatmap displaying the relative frequency of different actions being invoked after the most popular actions in SWE-agent w/ GPT-4 Turbo trajectories across all task instances. <details> <summary>Figure 18 Details</summary> ![ee34f295af4c17448c432be109570fdde08be1af560d190047150c7e1f28d037](http://localhost:8000/v1/image/ee34f295af4c17448c432be109570fdde08be1af560d190047150c7e1f28d037) ### Visual Description # Technical Document Extraction: Transition Probabilities Heatmap ## 1. Document Overview This image is a **Transition Probabilities Heatmap**, likely representing a Markov Chain or a state transition model for a software interaction or coding environment. It visualizes the probability of moving from a "Current Action" to a "Next Action." ## 2. Component Isolation ### A. Header * **Title:** Transition Probabilities Heatmap ### B. Main Chart (Heatmap) * **Y-Axis (Current Action):** Contains 14 categories (including start state). * **X-Axis (Next Action):** Contains 14 categories (including end state). * **Right-side Secondary Y-Axis:** Numerical counts representing the total occurrences for each "Current Action" row. * **Color Scale (Legend):** Located at the far right. It is a gradient scale from white (0.0) to dark blue (1.0), representing the probability value. ### C. Data Categories (Labels) The following labels are used for both Current and Next actions (with slight variations for start/end): 1. `<START>` (Current only) 2. `create` 3. `edit` 4. `exit_cost` 5. `find_file` 6. `goto` 7. `open` 8. `pytest` 9. `python` 10. `scroll_down` 11. `scroll_up` 12. `search_dir` 13. `search_file` 14. `submit` 15. `<END>` (Next only) --- ## 3. Data Table Reconstruction The following table transcribes the probability values within the heatmap. The final column "Total Count" is extracted from the right-hand axis. | Current Action | create | edit | exit_cost | find_file | goto | open | pytest | python | scroll_down | scroll_up | search_dir | search_file | submit | <END> | **Total Count** | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- | | **<START>** | 0.65 | 0.00 | 0.00 | 0.16 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.19 | 0.00 | 0.00 | 0.00 | 2214 | | **create** | 0.01 | 0.98 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1959 | | **edit** | 0.03 | 0.51 | 0.02 | 0.01 | 0.01 | 0.03 | 0.00 | 0.33 | 0.00 | 0.00 | 0.01 | 0.01 | 0.04 | 0.00 | 4745 | | **exit_cost** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 2851 | | **find_file** | 0.01 | 0.01 | 0.00 | 0.21 | 0.00 | 0.62 | 0.00 | 0.00 | 0.00 | 0.00 | 0.13 | 0.00 | 0.00 | 0.00 | 3335 | | **goto** | 0.01 | 0.39 | 0.02 | 0.02 | 0.13 | 0.05 | 0.00 | 0.00 | 0.17 | 0.04 | 0.01 | 0.14 | 0.01 | 0.00 | 15341 | | **open** | 0.01 | 0.25 | 0.01 | 0.03 | 0.09 | 0.06 | 0.00 | 0.01 | 0.18 | 0.00 | 0.02 | 0.35 | 0.00 | 0.00 | 2117 | | **pytest** | 0.11 | 0.30 | 0.04 | 0.12 | 0.05 | 0.14 | 0.05 | 0.02 | 0.00 | 0.00 | 0.11 | 0.04 | 0.02 | 0.00 | 4417 | | **python** | 0.02 | 0.47 | 0.01 | 0.15 | 0.03 | 0.13 | 0.00 | 0.01 | 0.00 | 0.00 | 0.11 | 0.01 | 0.04 | 0.00 | 678 | | **scroll_down** | 0.01 | 0.13 | 0.02 | 0.02 | 0.07 | 0.04 | 0.00 | 0.00 | 0.61 | 0.02 | 0.02 | 0.04 | 0.01 | 0.00 | 1587 | | **scroll_up** | 0.00 | 0.16 | 0.02 | 0.01 | 0.11 | 0.01 | 0.00 | 0.00 | 0.08 | 0.53 | 0.02 | 0.05 | 0.01 | 0.00 | 2358 | | **search_dir** | 0.01 | 0.01 | 0.00 | 0.09 | 0.00 | 0.71 | 0.00 | 0.00 | 0.00 | 0.00 | 0.16 | 0.01 | 0.00 | 0.00 | 610 | | **search_file** | 0.00 | 0.04 | 0.01 | 0.05 | 0.67 | 0.06 | 0.00 | 0.00 | 0.01 | 0.00 | 0.03 | 0.12 | 0.00 | 0.00 | 3888 | | **submit** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 56 | --- ## 4. Key Trends and Observations ### High Probability Transitions (Dark Blue) * **Terminal Actions:** Both `exit_cost` and `submit` have a **1.00** probability of transitioning to `<END>`, indicating these are final steps in a sequence. * **Creation Flow:** `create` has a **0.98** probability of being followed by `edit`. * **Search Flow:** `search_dir` has a high probability (**0.71**) of being followed by `open`. * **Navigation Flow:** `search_file` frequently leads to `goto` (**0.67**). * **Initial State:** The `<START>` state most frequently leads to `create` (**0.65**). ### Self-Looping/Repetitive Actions * **Scrolling:** `scroll_down` (0.61) and `scroll_up` (0.53) show significant self-transition probabilities, suggesting users often scroll multiple times consecutively. * **Editing:** `edit` has a **0.51** probability of staying in the `edit` state. ### Common Intermediate Actions * **Edit as a Hub:** Many actions have a moderate probability of transitioning to `edit`. For example, `python` (0.47), `goto` (0.39), and `pytest` (0.30) all frequently lead back to editing. ### Data Volume * The action with the highest frequency of occurrence is `goto` with **15,341** instances. * The action with the lowest frequency is `submit` with only **56** instances. </details> Figure 22: Heatmap displaying the relative frequency of different actions being invoked after the most popular pairs of actions in SWE-agent w/ GPT-4 Turbo trajectories across all task instances. <details> <summary>Figure 19 Details</summary> ![775a12701b86e61d0a46d2ecaaf741ddd4b57bde2af0a4300420211caeb168c4](http://localhost:8000/v1/image/775a12701b86e61d0a46d2ecaaf741ddd4b57bde2af0a4300420211caeb168c4) ### Visual Description # Technical Document Extraction: Transition Probabilities Heatmap ## 1. Document Overview This image is a **Heatmap** titled **"Transition Probabilities Heatmap"**. It visualizes the conditional probability of a "Next Action" (x-axis) occurring given a sequence of "Previous 2 Actions" (y-axis). The data represents state transitions in a computational or workflow environment (likely software development or file management). ## 2. Component Isolation ### Header * **Title:** Transition Probabilities Heatmap ### Main Chart Area * **Type:** Heatmap / Transition Matrix. * **Y-Axis Label:** Previous 2 Actions. * **X-Axis Label:** Next Action. * **Color Scale (Legend):** Located on the far right. A vertical gradient bar ranging from **0.0 (White)** to **1.0 (Dark Blue)**. * **Right-side Margin:** Contains a column of integers representing the total count (N) for each row's history. ### Footer * Contains the X-axis labels rotated at a 45-degree angle. --- ## 3. Data Extraction ### Axis Categories **X-Axis (Next Action):** 1. `<END>` 2. `create` 3. `edit` 4. `exit_cost` 5. `find_file` 6. `goto` 7. `open` 8. `pytest` 9. `python` 10. `scroll_down` 11. `scroll_up` 12. `search_dir` 13. `search_file` 14. `submit` **Y-Axis (Previous 2 Actions) & Row Totals:** 1. `<START>, create` (1384) 2. `create, edit` (1658) 3. `edit (2x)` (1880) 4. `edit, python` (1278) 5. `find_file, open` (7777) 6. `goto, edit` (1312) 7. `open, edit` (1438) 8. `open, scroll_down` (2217) 9. `open, search_file` (4298) 10. `python, edit` (2074) 11. `python, find_file` (825) 12. `rm, submit` (1169) 13. `scroll_down (2x)` (662) 14. `search_dir, open` (605) 15. `search_file, goto` (2334) --- ## 4. Transition Probability Matrix (Reconstructed Table) Values represent the probability $P(\text{Next Action} | \text{Previous 2 Actions})$. | Previous 2 Actions | <END> | create | edit | exit_cost | find_file | goto | open | pytest | python | scroll_down | scroll_up | search_dir | search_file | submit | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | **<START>, create** | 0.00 | 0.00 | **1.00** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | | **create, edit** | 0.00 | 0.09 | 0.12 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01 | **0.73** | 0.00 | 0.00 | 0.02 | 0.00 | 0.01 | | **edit (2x)** | 0.00 | 0.02 | **0.72** | 0.03 | 0.00 | 0.01 | 0.02 | 0.00 | 0.15 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | | **edit, python** | 0.00 | 0.02 | **0.48** | 0.01 | 0.15 | 0.03 | 0.12 | 0.00 | 0.01 | 0.00 | 0.00 | 0.12 | 0.01 | 0.04 | | **find_file, open** | 0.00 | 0.02 | 0.19 | 0.00 | 0.04 | 0.08 | 0.04 | 0.00 | 0.00 | 0.18 | 0.01 | 0.01 | **0.42** | 0.00 | | **goto, edit** | 0.00 | 0.06 | **0.38** | 0.01 | 0.01 | 0.05 | 0.03 | 0.00 | **0.31** | 0.01 | 0.01 | 0.01 | 0.02 | 0.10 | | **open, edit** | 0.00 | 0.06 | **0.32** | 0.01 | 0.02 | 0.01 | 0.08 | 0.01 | **0.41** | 0.00 | 0.00 | 0.02 | 0.01 | 0.06 | | **open, scroll_down** | 0.00 | 0.00 | 0.19 | 0.01 | 0.02 | 0.05 | 0.07 | 0.00 | 0.00 | **0.57** | 0.02 | 0.02 | 0.03 | 0.00 | | **open, search_file** | 0.00 | 0.00 | 0.02 | 0.01 | 0.02 | **0.74** | 0.05 | 0.00 | 0.01 | 0.00 | 0.00 | 0.02 | 0.11 | 0.00 | | **python, edit** | 0.00 | 0.00 | **0.34** | 0.01 | 0.01 | 0.00 | 0.05 | 0.00 | **0.57** | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | | **python, find_file** | 0.00 | 0.00 | 0.02 | 0.00 | 0.18 | 0.00 | **0.70** | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.00 | 0.00 | | **rm, submit** | **1.00** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | | **scroll_down (2x)** | 0.00 | 0.01 | 0.10 | 0.03 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | **0.66** | 0.01 | 0.02 | 0.04 | 0.01 | | **search_dir, open** | 0.00 | 0.01 | 0.11 | 0.01 | 0.01 | 0.04 | 0.05 | 0.00 | 0.00 | 0.23 | 0.00 | 0.02 | **0.53** | 0.00 | | **search_file, goto** | 0.00 | 0.01 | **0.36** | 0.01 | 0.02 | 0.11 | 0.05 | 0.00 | 0.00 | 0.17 | 0.04 | 0.02 | 0.19 | 0.01 | --- ## 5. Key Trends and Observations ### Deterministic Transitions (Probability = 1.00) * **Start Sequence:** After `<START>, create`, the next action is always `edit` (1.00). * **End Sequence:** After `rm, submit`, the next action is always `<END>` (1.00). ### High Probability Clusters (Strong Trends) * **Editing Loops:** If the previous actions were `edit (2x)`, there is a 0.72 probability the next action is `edit` again, suggesting a repetitive task. * **Navigation Patterns:** * `open, search_file` is followed by `goto` with a 0.74 probability. * `python, find_file` is followed by `open` with a 0.70 probability. * **Execution Patterns:** * `create, edit` is followed by `python` (0.73), likely indicating a "create-edit-run" cycle. * `python, edit` is followed by `python` (0.57), indicating a "fix-and-run" cycle. ### Low Probability / Sparse Areas * The `<END>` action is almost never reached except through the `rm, submit` sequence. * The `pytest` action has extremely low transition probabilities across all observed histories (mostly 0.00 or 0.01), suggesting it is a rare or specific terminal action not well-captured by these common histories. * The `exit_cost` and `scroll_up` actions also show very low engagement across the board. </details> Figure 23: Heatmap displaying the relative frequency of different actions being invoked after the most popular triplets of actions in SWE-agent w/ GPT-4 Turbo trajectories across all task instances. <details> <summary>Figure 20 Details</summary> ![a28d73a461a62a5c54d2b8263a7434097e34170a3a912eaa1ab9831cc00a1b5b](http://localhost:8000/v1/image/a28d73a461a62a5c54d2b8263a7434097e34170a3a912eaa1ab9831cc00a1b5b) ### Visual Description # Technical Document Extraction: Transition Probabilities Heatmap ## 1. Document Overview This image is a **Transition Probabilities Heatmap** representing a Markov-style analysis of user or agent actions. It visualizes the probability of a "Next Action" occurring given a sequence of the "Previous 3 Actions." ## 2. Component Isolation ### A. Header * **Title:** Transition Probabilities Heatmap ### B. Main Chart Area (Heatmap) * **Y-Axis (Rows):** Labeled "Previous 3 Actions". Contains 15 unique sequences of three actions. * **X-Axis (Columns):** Labeled "Next Action". Contains 13 unique individual actions. * **Data Cells:** Each cell contains a numerical probability value (ranging from 0.00 to 1.00). * **Right-side Margin:** A column of integers representing the total count (frequency) of each "Previous 3 Actions" sequence in the dataset. ### C. Legend (Spatial Grounding: Right Side) * **Type:** Vertical Color Bar. * **Scale:** 0.0 to 0.8+. * **Color Gradient:** Light white/blue (0.0) to deep navy blue (0.8+). * **Visual Trend:** Darker cells indicate a higher probability of transition, while lighter cells indicate low to zero probability. --- ## 3. Data Extraction ### X-Axis: Next Action Categories 1. `create` 2. `edit` 3. `exit_cost` 4. `find_file` 5. `goto` 6. `open` 7. `pytest` 8. `python` 9. `scroll_down` 10. `scroll_up` 11. `search_dir` 12. `search_file` 13. `submit` ### Y-Axis: Previous 3 Actions & Frequency Count | Sequence (Previous 3 Actions) | Frequency (Count) | | :--- | :--- | | `<START>, create, edit` | 729 | | `create, edit, python` | 1223 | | `edit (2x), python` | 666 | | `edit (3x)` | 5614 | | `edit, python, edit` | 556 | | `edit, python, find_file` | 1401 | | `edit, python, open` | 971 | | `edit, python, rm` | 2048 | | `find_file, open, search_file` | 703 | | `open, search_file, goto` | 1513 | | `python, edit (2x)` | 1071 | | `python, edit, python` | 651 | | `scroll_down (3x)` | 546 | | `search_dir, open, search_file` | 532 | | `search_file, goto, edit` | 1518 | --- ## 4. Transition Probability Matrix (Reconstruction) *Note: Values are probabilities. Rows sum to approximately 1.00.* | Previous 3 Actions | create | edit | exit_cost | find_file | goto | open | pytest | python | scroll_down | scroll_up | search_dir | search_file | submit | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | **\<START>, create, edit** | 0.04 | 0.11 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | **0.82** | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | | **create, edit, python** | 0.03 | **0.36** | 0.00 | **0.28** | 0.00 | 0.09 | 0.01 | 0.01 | 0.00 | 0.00 | **0.20** | 0.00 | 0.02 | | **edit (2x), python** | 0.03 | **0.57** | 0.01 | 0.09 | 0.03 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.02 | 0.05 | | **edit (3x)** | 0.01 | **0.82** | 0.04 | 0.00 | 0.01 | 0.01 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | | **edit, python, edit** | 0.00 | **0.34** | 0.01 | 0.01 | 0.00 | 0.05 | 0.00 | **0.57** | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | | **edit, python, find_file** | 0.00 | 0.02 | 0.00 | 0.18 | 0.00 | **0.70** | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | 0.00 | 0.00 | | **edit, python, open** | 0.00 | **0.44** | 0.02 | 0.02 | **0.24** | 0.06 | 0.00 | 0.00 | 0.11 | 0.01 | 0.01 | 0.08 | 0.01 | | **edit, python, rm** | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | **0.95** | | **find_file, open, search_file** | 0.00 | 0.01 | 0.01 | 0.05 | **0.71** | 0.02 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.15 | 0.00 | | **open, search_file, goto** | 0.01 | **0.39** | 0.01 | 0.02 | 0.09 | 0.06 | 0.00 | 0.00 | 0.17 | 0.04 | 0.02 | 0.17 | 0.01 | | **python, edit (2x)** | 0.01 | **0.36** | 0.01 | 0.01 | 0.00 | 0.05 | 0.00 | **0.57** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | | **python, edit, python** | 0.01 | **0.63** | 0.01 | 0.12 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | 0.00 | 0.03 | | **scroll_down (3x)** | 0.01 | 0.09 | 0.04 | 0.02 | 0.04 | 0.03 | 0.00 | 0.00 | **0.69** | 0.01 | 0.03 | 0.03 | 0.01 | | **search_dir, open, search_file** | 0.00 | 0.02 | 0.00 | 0.01 | **0.80** | 0.04 | 0.00 | 0.00 | 0.02 | 0.00 | 0.01 | 0.08 | 0.00 | | **search_file, goto, edit** | 0.08 | **0.41** | 0.01 | 0.02 | 0.04 | 0.00 | 0.00 | **0.24** | 0.00 | 0.00 | 0.02 | 0.03 | 0.11 | --- ## 5. Key Trends and Observations 1. **Strongest Predictor:** The sequence `edit, python, rm` has the highest single transition probability in the chart, with a **0.95** probability of the next action being `submit`. 2. **Repetitive Behavior:** * `edit (3x)` leads to another `edit` with a probability of **0.82**. * `scroll_down (3x)` leads to another `scroll_down` with a probability of **0.69**. 3. **Common Transitions:** * Sequences involving `search_file` and `open` frequently lead to `goto` (probabilities of **0.71** and **0.80**). * The sequence `edit, python, find_file` strongly leads to `open` (**0.70**). 4. **Data Volume:** The most frequent sequence observed is `edit (3x)` with **5,614** occurrences, indicating that editing is the most common repetitive task in the dataset. The least frequent sequence shown is `search_dir, open, search_file` with **532** occurrences. 5. **Dead Ends:** The `pytest` column is almost entirely white (0.00), suggesting it is rarely the immediate next action following these specific sequences. </details> Figure 24: Heatmap displaying the relative frequency of different actions being invoked after the most popular quadruplets of actions in SWE-agent w/ GPT-4 Turbo trajectories across all task instances. <details> <summary>Figure 21 Details</summary> ![9606d77e2a0bba204d01c454428984b175013f93bf268549d7c37cba5e4c6da3](http://localhost:8000/v1/image/9606d77e2a0bba204d01c454428984b175013f93bf268549d7c37cba5e4c6da3) ### Visual Description # Technical Document Extraction: Transition Probabilities Heatmap ## 1. Document Overview This image is a technical heatmap titled **"Transition Probabilities Heatmap"**. It visualizes the conditional probability of a "Next Action" (x-axis) occurring given a sequence of the "Previous 4 Actions" (y-axis). The data represents state transitions in a software development or command-line environment (likely involving Python coding and file navigation). ## 2. Component Isolation ### A. Header * **Title:** Transition Probabilities Heatmap ### B. Main Chart (Heatmap Grid) * **Y-Axis Label:** Previous 4 Actions * **X-Axis Label:** Next Action * **Data Representation:** A grid of 15 rows by 14 columns. Each cell contains a numerical probability value ranging from 0.00 to 1.00. * **Color Encoding:** A sequential blue color scale where white/light blue represents low probability (0.0) and dark blue represents high probability (1.0). * **Right-side Margin:** A column of integers representing the total count (sample size) for each row's sequence. ### C. Legend (Spatial Grounding: [x=right, y=center]) * **Type:** Vertical Color Bar. * **Scale:** 0.0 (White) to 1.0 (Dark Blue), with markers at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0. --- ## 3. Data Extraction ### X-Axis Categories (Next Action) 1. `<END>` 2. `create` 3. `edit` 4. `exit_cost` 5. `find_file` 6. `goto` 7. `open` 8. `pytest` 9. `python` 10. `scroll_down` 11. `scroll_up` 12. `search_dir` 13. `search_file` 14. `submit` ### Y-Axis Categories (Previous 4 Actions) and Row Data The following table reconstructs the heatmap data. Each row represents a sequence of 4 actions, followed by the probability of the next action, and the total count of that sequence. | Previous 4 Actions | <END> | create | edit | exit_cost | find_file | goto | open | pytest | python | scroll_down | scroll_up | search_dir | search_file | submit | Count | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | `<START>`, create, edit, python | 0.00 | 0.02 | 0.32 | 0.00 | 0.36 | 0.00 | 0.06 | 0.01 | 0.00 | 0.00 | 0.00 | 0.24 | 0.00 | 0.00 | 581 | | create, edit, python, edit | 0.00 | 0.00 | 0.35 | 0.00 | 0.01 | 0.00 | 0.02 | 0.00 | 0.60 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 474 | | create, edit, python, find_file | 0.00 | 0.00 | 0.01 | 0.00 | 0.17 | 0.00 | 0.74 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 | 0.00 | 547 | | edit (2x), python, edit | 0.00 | 0.01 | 0.44 | 0.02 | 0.01 | 0.00 | 0.04 | 0.00 | 0.44 | 0.00 | 0.00 | 0.01 | 0.00 | 0.02 | 693 | | edit (3x), python | 0.00 | 0.02 | 0.58 | 0.02 | 0.06 | 0.04 | 0.11 | 0.00 | 0.00 | 0.01 | 0.01 | 0.07 | 0.04 | 0.04 | 1123 | | edit (4x) | 0.00 | 0.01 | 0.85 | 0.05 | 0.00 | 0.01 | 0.01 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 539 | | edit, python, edit (2x) | 0.00 | 0.01 | 0.36 | 0.01 | 0.01 | 0.00 | 0.04 | 0.00 | 0.57 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1059 | | edit, python, edit, python | 0.00 | 0.01 | 0.64 | 0.01 | 0.12 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.00 | 0.03 | 679 | | edit, python, find_file, open | 0.00 | 0.00 | 0.24 | 0.00 | 0.03 | 0.12 | 0.01 | 0.00 | 0.13 | 0.01 | 0.00 | 0.00 | 0.45 | 0.00 | 457 | | edit, python, rm, submit | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 419 | | open, search_file, goto, edit | 0.00 | 0.09 | 0.39 | 0.00 | 0.01 | 0.03 | 0.04 | 0.00 | 0.27 | 0.01 | 0.00 | 0.01 | 0.02 | 0.12 | 520 | | python, edit (2x), python | 0.00 | 0.02 | 0.64 | 0.01 | 0.12 | 0.01 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.01 | 0.03 | 363 | | python, edit, python, edit | 0.00 | 0.00 | 0.27 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.68 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 336 | | scroll_down (4x) | 0.00 | 0.01 | 0.07 | 0.04 | 0.02 | 0.04 | 0.02 | 0.00 | 0.00 | 0.72 | 0.01 | 0.03 | 0.03 | 0.01 | 1034 | | search_dir, open, search_file, goto | 0.00 | 0.01 | 0.41 | 0.01 | 0.01 | 0.09 | 0.06 | 0.00 | 0.00 | 0.22 | 0.04 | 0.03 | 0.12 | 0.01 | 4564 | --- ## 4. Key Trends and Observations ### Deterministic Transitions * **The "Submit" Sequence:** The sequence `edit, python, rm, submit` has a **1.00** probability of transitioning to `<END>`. This is the only perfectly deterministic transition in the dataset, indicated by the darkest blue cell. ### High-Probability Loops * **Editing Persistence:** The sequence `edit (4x)` has an **0.85** probability of the next action being `edit` again, suggesting a high likelihood of continuous editing once a threshold is reached. * **Scrolling Persistence:** `scroll_down (4x)` has a **0.72** probability of the next action being `scroll_down`. * **Python/Edit Cycle:** Sequences ending in `edit` often lead to `python` (e.g., `python, edit, python, edit` -> `python` at **0.68**), and sequences ending in `python` often lead back to `edit` (e.g., `python, edit (2x), python` -> `edit` at **0.64**). ### Navigation Patterns * **File Opening:** After `create, edit, python, find_file`, there is a **0.74** probability the user will `open` a file. * **Search to Edit:** The sequence `search_dir, open, search_file, goto` leads to `edit` with a probability of **0.41**, which is the highest probability for that specific sequence, though it is more distributed across other actions like `scroll_down` (0.22). ### Sample Size (Counts) * The most frequent sequence observed is `search_dir, open, search_file, goto` with **4,564** occurrences. * The least frequent sequence shown is `python, edit, python, edit` with **336** occurrences. </details> each trajectory. In Figure 8, we show the results of this automated categorization. Evaluated on our validation set, the LM generated labels agree with the authors' labels on 87 %of instances. We find that about half ( 52 . 0 %) of the unresolved instances fall into the Incorrect Implementation or Overly Specific Implementation categories, suggesting that agents' proposed solutions often simply fail to functionally address the issue or are insufficiently general solutions. Another significant category is the Failed Edit Recovery category, making up 23 . 4 %of instances, which happens when models fail to generate well-formed edits to files, which can seriously inhibit their performance. The remaining failure modes make up less than 25 %of instances, but highlight different aspects of the challenges faced by the agent in the problem-solving process. Table 9: Descriptions of failure mode categories. | Category | Description | |--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Failed to Reproduce | The agent tried but was not able to successfully reproduce the problem in the issue. | | Failed to Find Relevant File | The agent never opened or saw the correct file. | | Failed to Find Edit Location | The agent opened and viewed the correct file but didn't find or edit a relevant location. | | Overly Specific Implementation | The agent made a relevant change but its solution was not sufficiently general; in this case it might solve the very specific issue suggested but it does so in a way that might change the behavior of the code in other, more general, cases. | | Incorrect Implementation | The agent made a change to a reasonable area but their solution didn't correctly address the issue. | | Ran Out of Budget | The agent seemed to be on the right track to a solution, but the episode ended before they could complete their changes. | | Failed Edit Recovery | The agent went into an edit loop, making recurrent failing edits without recovering. | | Gave Up Prematurely | The agent decides to stop solving the problem after encountering some difficulty. | | Other | There was some other problem that prevented the agent from resolving this issue. | ## B.5 Performance Variance and Pass@k Rate Since running SWE-agent on SWE-bench can be rather expensive, we perform, all results, unless otherwise stated, are reported using a pass@ 1 metric (% Resolved). However, we also test our main SWE-agent configuration for a higher number of runs to test the variance and pass @ k performance for k ∈ { 3 , 6 } . These results are shown in Table 10, suggesting that average performance variance is relatively low, though per-instance resolution can change considerably. Table 10: Performance for 6 separate runs of SWE-agent with GPT-4 on SWE-bench Lite. The % Resolved rate for each individual run is shown in the first table, and the pass @ k rate in the second. | | SWE-bench Lite | SWE-bench Lite | SWE-bench Lite | SWE-bench Lite | SWE-bench Lite | SWE-bench Lite | | |----------|------------------|------------------|------------------|------------------|------------------|------------------|--------------| | | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Avg. | | Resolve% | 17.33 | 18.00 | 18.00 | 18.67 | 17.33 | 18.33 | 17.94 0 . 49 | | | Pass@1 | Pass@2 | Pass@3 | | Pass@4 | Pass@5 | Pass@6 | | Pass @ k | 17.94 | 23.89 | 27.35 | | 29.67 | 31.33 | 32.67 | ## B.6 Patch Generations In this section, we present some statistics and analysis around the edits generated by SWE-agent. At the end of a task episode, the edits made by SWE-agent are aggregated and saved as a single .patch file, the canonical representation for code changes of a pull request on GitHub. From these patch representations, we can quantitatively characterize an agent's generations and see how they compare to the original solutions written by human codebase maintainers. Table 11 presents a summary of four basic statistics about the model generations. Lines added and lines removed refer to the total number of lines that were added or deleted in the patch, an indicator of the size of the modification. The number of hunks and files is more indicative of how many 'regions" of the codebase were modified. A higher number of hunks and files suggests that there are more distinct, separate places in the codebase where the patch made changes. For both 'Resolved" and 'All" categories of task instances, models tend to generate 'larger" edits (e.g., more lines added, hunks, and files) than the corresponding gold solution. Prior RAG baselines in Jimenez et al. [20] typically produce smaller edits on average. The source of this increase for agent-generated solutions can largely be attributed to additional reproduction code. Table 11: We show the (median) / (mean) value for several statistics characterizing patch generations. We calculate these statistics across two dimensions. First, the 'Resolved" / 'All" labels denote whether the patch resolved the issue. Second, for the task instances specific to each model, we calculate the same statistics across the gold patches. To diminish the effect of outliers, we calculate these statistics based on values falling within within the 90 th percentile of the distribution. | Model | Outcome | Lines + | Lines - | Hunks | Files | |----------------------------|--------------|-------------------------|-----------------------|-----------------------|-----------------------| | SWE-agent w/ GPT-4 Turbo | Resolved Any | 3.0 / 5.7 12.0 / 16.58 | 1.0 / 1.32 1.0 / 1.35 | 1.0 / 1.52 2.0 / 1.83 | 1.0 / 1.22 1.0 / 1.53 | | Gold | Resolved Any | 2.0 / 3.58 7.0 / 11.67 | 1.0 / 1.98 2.0 / 4.05 | 1.0 / 1.3 2.0 / 2.45 | 1.0 / 1.0 1.0 / 1.24 | | SWE-agent w/ Claude 3 Opus | Resolved Any | 3.0 / 5.09 11.0 / 15.25 | 1.0 / 1.59 1.0 / 1.79 | 1.0 / 1.56 2.0 / 2.14 | 1.0 / 1.26 2.0 / 1.87 | | Gold | Resolved Any | 3.0 / 3.91 | 1.0 / 1.94 | 1.0 / 1.4 | 1.0 / 1.0 | | | | 6.0 / 10.68 | 2.0 / 3.61 | 2.0 / 2.22 | 1.0 / 1.13 | When comparing the 'Resolved" and 'All" categories, we see that successfully resolved edits are relatively smaller than the original distribution. This trend is consistent with the RAG based solutions; issues that require multiple edits across a codebase remains challenging for agents. ## B.7 HumanEvalFix Evaluation In this section, we include further discussion about our evaluation of SWE-agent on HumanEvalFix. We choose to evaluate on the HumanEvalFix task because it focuses on code editing and debugging, which was empirically demonstrated in Muennighoff et al. [32] to be a more difficult task for LMs (as reported in their work, GPT 4 scores 78 . 3 %on HumanEval, compared to 47 . 8 %on HumanEvalFix). The code editing task can also be thought of as a 'subtask" in SWE-bench; being able to identify and fix bugs is a major part of software engineering. We adopt the HumanEvalFix dataset ( 164 problems per language) to be compatible with the SWEagent setting. Following the documentation in Muennighoff et al. [32], SWE-agent is initialized in a directory with a single file containing a buggy code snippet and example test(s) if available. It is then asked to edit the code and verify its fixes. The configuration file is identical to the one used for SWE-bench, with the exception of a language-specific demonstration. For this task, localization and navigating a large codebase are not necessary; the main focus is on generating the correct edit. SWE-agent achieves the best performance on the HumanEvalFix benchmark for three of the languages we evaluate on, as shown in Table 2. Figure 25 also suggests that the large majority of task instances are solved within the first 10 turns. Figure 25: Similar to Figure 14, we show the distribution of the number of turns for trajectories corersponding to solved task instances from the HumanEvalFix dataset. <details> <summary>Figure 22 Details</summary> ![06660a1e5e3f5a415a73f4b72d8146fe66d3ffe598b0ff86adf2b31aab4b297b](http://localhost:8000/v1/image/06660a1e5e3f5a415a73f4b72d8146fe66d3ffe598b0ff86adf2b31aab4b297b) ### Visual Description # Technical Data Extraction: HumanEvalFix Turn Frequency Histograms This document provides a detailed extraction of the data presented in three side-by-side histograms illustrating the distribution of "Turns" across different programming language benchmarks: JavaScript (js), Java, and Python. ## 1. General Layout and Metadata The image consists of three separate histogram plots arranged horizontally. - **Language:** English. - **Common Y-Axis:** Frequency (Scale: 0 to 100, increments of 20). - **Common X-Axis:** Turn (Scale: 0 to 40, markers at 0, 10, 20, 30). - **Component Isolation:** - **Left Plot:** HumanEvalFix-js (Color: Dark Red) - **Center Plot:** HumanEvalFix-java (Color: Orange) - **Right Plot:** HumanEvalFix-python (Color: Green) --- ## 2. Detailed Data Extraction by Benchmark ### A. HumanEvalFix-js (Left Plot) * **Color:** Dark Red (#99001A) * **Trend Analysis:** The distribution is highly leptokurtic (peaked) and right-skewed. The vast majority of data points are concentrated between 5 and 10 turns, with a very thin tail extending toward 35. * **Estimated Data Points (Frequency per Bin):** * **Bin ~5:** ~35 * **Bin ~7.5 (Peak):** ~90 * **Bin ~10:** ~13 * **Bin ~12.5:** ~6 * **Bin ~15:** ~2 * **Bin ~17.5:** ~1 * **Bin ~32.5:** ~1 (Outlier) ### B. HumanEvalFix-java (Center Plot) * **Color:** Orange (#FF7F24) * **Trend Analysis:** Similar to the JS plot, this shows a strong peak around 7-8 turns. However, the "tail" of the distribution is more populated than the JS version, indicating more instances requiring 10-20 turns. * **Estimated Data Points (Frequency per Bin):** * **Bin ~5:** ~29 * **Bin ~7.5 (Peak):** ~81 * **Bin ~10:** ~12 * **Bin ~12.5:** ~7 * **Bin ~15:** ~4 * **Bin ~17.5:** ~5 * **Bin ~20:** ~2 * **Bin ~25:** ~1 * **Bin ~27.5:** ~1 ### C. HumanEvalFix-python (Right Plot) * **Color:** Green (#458B00) * **Trend Analysis:** This plot shows the highest peak of the three. The distribution is extremely tight, with the overwhelming majority of tasks completed in under 10 turns. The tail is the shortest and least populated of the three languages. * **Estimated Data Points (Frequency per Bin):** * **Bin ~5 (Peak):** ~96 * **Bin ~7.5:** ~26 * **Bin ~10:** ~12 * **Bin ~12.5:** ~2 * **Bin ~15:** ~3 * **Bin ~17.5:** ~1 * **Bin ~20:** ~1 --- ## 3. Comparative Summary Table | Metric | HumanEvalFix-js | HumanEvalFix-java | HumanEvalFix-python | | :--- | :--- | :--- | :--- | | **Peak Frequency** | ~90 | ~81 | ~96 | | **Peak Turn Bin** | ~7.5 | ~7.5 | ~5.0 | | **Distribution Shape** | Peaked, right-skewed | Peaked, moderate tail | Highly peaked, short tail | | **Max Turn Observed** | ~33 | ~28 | ~22 | ## 4. Conclusion The data indicates that for all three languages, the "HumanEvalFix" process typically concludes within 5 to 10 turns. **Python** shows the highest efficiency (highest frequency at the lowest turn count), while **Java** shows a slightly higher tendency for tasks to require a moderate number of extra turns (10-20 range) compared to the others. </details> ## B.8 Dataset Information In the following Table 12, we provide descriptions of the two datasets that we use for evaluation: SWE-bench [20] and HumanEvalFix [32]. Both datasets have been released under permissive software licenses that allow for evaluation use, and can be used in proprietary systems. Table 12: Information about each of the datasets that we evaluate SWE-agent on. | Dataset | Released | License | Splits | Count | Languages | GitHub Repo | |--------------|------------|-----------|---------------|--------------|--------------------------------|---------------------------| | SWE-bench | 10/10/2023 | MIT | Test Lite Dev | 2294 300 225 | Python | princeton-nlp/ SWE-bench | | HumanEvalFix | 07/23/2023 | MIT | Test | 164 | Python, JS, Go Java, C++, Rust | bigcode-project/ octopack | ## B.9 Miscellaneous In this section, we include additional minor analyses around agent behavior and their generations. Agents are better at localizing files than BM25. The interactive setting also enables agents to identify the correct file(s) to edit more often compared to the RAG baselines in Jimenez et al. [20]. To measure this, we calculate the F1 score between the set of [edited, removed] files by the agent's prediction versus the gold patch. SWE-agent w/ GPT-4 Turbo achieves an F1 score of 59 . 05 %, while BM25 w/ Claude 3 Opus produces an F1 score of just 45 . 47 %. Most resolved task instances are intentionally submitted. There are four ways a task episode ends. - 'Submit" refers to a task episode that ends when the agent generates the submit command. - 'Exit Cost (Submit)" refers to the scenario where the episode ends because the cost limit was hit, and the changes so far are gathered and submitted as an edit. - 'Exit Cost (No Submit)" refers to when the cost limit was hit and no edit 's were made, so there was nothing to submit. In this scenario, the instance is guaranteed to be unresolved. - 'Early Exit" refers to when the task episode terminates because an agent issued too many malformed responses in a row. Any changes are submitted as an edit. Table 13 shows the counts for the number of trajectories that ended on these four different outcomes, categorized across the agent, SWE-bench split, and whether or not that task instance was resolved. For SWE-agent with GPT-4 Turbo, the majority of 'All" task instances are submit . For the trajectories corresponding to'All" task instances by SWE-agent with Claude 3 Opus, slightly less than 50 %of task instances are submitted, while the slight majority are auto-submitted when the cost limit is hit. Table 13: This table showcases the counts for the four ways ('Submit", 'Exit Cost (Submit)", 'Exit Cost (No Submit)", 'Early Exit") a task episode could conclude. | Model | Split | Outcome | Submit | Exit Cost (Submit) | Exit Cost (No Submit) | Early Exit | |----------------------------|-----------|---------------------------|-----------------|----------------------|-------------------------|--------------| | SWE-agent w/ GPT-4 Turbo | Full Lite | Resolved All Resolved All | 266 1589 50 203 | 20 630 4 95 | 0 48 0 2 | 0 1 0 0 | | SWE-agent w/ Claude 3 Opus | Full Lite | Resolved All Resolved All | 206 882 32 133 | 35 1048 3 156 | 0 73 0 11 | 0 1 0 0 | However, these trends do not hold for 'Resolved" task instances. For SWE-agent with both models, the large majority of these task instances are submit . Reiterating the conclusion in Section 5.2 and prior visualizations in Section B.3, we see here again that resolved task instances often imply that the agent is able to produce and verify an edit within the allotted number of turns. The SWE-agent ACI is also effective at eliciting well-formed thoughts and actions from agents. Across all runs, there are only two 'Early Exit" occurrences, where the episode terminated because the agent generated too many malformed responses in a row. Finally, Table 13 also upholds an expected trend. Task instances that finish with a submit action are more likely to be resolved than those that are cutoff by cost. For instance, for SWE-agent with GPT-4 Turbo on full SWE-bench, 14 . 3 %of task instances that end with a submit are resolved, which is much higher than 3 . 1 %for those finishing on exit\_cost . ## C Prompts In this section, we go through the prompt templates that make up the agent's history, discussing them in the order of presentation to SWE-agent. Per template, we describe its purpose, walk through its content, and note any additional motivations that influenced how we wrote the template. The companion figures of template content are all drawn from our default configuration, using SWE-agent w/ GPT-4. The template content can and should be adapted slightly to fit the agent's intended use case. The purpose of this section is to describe our thought process for how we designed each template for these tasks to serve as reference for future work. Across templates, we find that providing tips which tell agents to not make specific mistakes, avoid common pitfalls, and use helpful execution signals are effective for eliciting more successful problem solving. Prompt Workflow. We present Figure 26 which shows the order in which different prompt templates are invoked. This flow of prompts reflects the logic that generates trajectories similar to the one that is visualized in Figure 9. Figure 26: The flow of prompt templates throughout a single SWE-agent task instance episode. The system, demonstration, and issue templates are shown all together at the beginning of the task episode, followed by turn-specific prompts that are shown depending on whether the agent response is well-formatted and whether the action has standard output. <details> <summary>Figure 23 Details</summary> ![a4e6781785e94309c89c97fdb4bfc9063f2a5b488b99a75f8ffd3b93bfc8da93](http://localhost:8000/v1/image/a4e6781785e94309c89c97fdb4bfc9063f2a5b488b99a75f8ffd3b93bfc8da93) ### Visual Description # Technical Document Extraction: Language Model Task Episode Flow This document describes a technical flow diagram illustrating the lifecycle of a task episode involving a Language Model (LM). The process is iterative, beginning with initial prompts and continuing through multiple turns until specific termination criteria are met. ## 1. Termination Criteria (Header Region) Located at the top left of the diagram, marked with a red octagonal stop icon, are the conditions under which the process **Terminates**: 1. Action is `submit`, or 2. Cost limit is exceeded, or 3. 2+ Consecutive Format Errors --- ## 2. Initial Input Phase (Left Region) The process begins at the bottom left with the label **"Task Episode Begins"**. This leads into a stacked block titled **"Initial Messages"**, which contains three components: * **(1) System**: (Grey background) The base system prompt. * **(2) Demonstration**: (Light blue background) Examples provided to the model. * **(3) Instance**: (Purple background) The specific task or problem to be solved. --- ## 3. Core Processing Loop (Main Chart Region) The diagram follows a linear flow from left to right, which then loops back. ### Turn #1 The **Initial Messages** are fed into the first stage: * **Component**: Language Model Inference (represented by a brain icon). * **Output**: The model produces a **"Thought & Action"**. ### Execution The "Thought & Action" output moves to the next stage: * **Component**: Execute Action in Env. (represented by a gear icon). * **Function**: The environment processes the model's requested action. ### Outcome Branches Following execution, the flow branches into one of three possible states: 1. **Next Step** (Grey box): "Show output of model's action + prompt for next action." 2. **Next Step (No Output)** (Grey box): "Show custom message stating action succeeded quietly." 3. **Format Error** (Light orange box): "Indicate prior action could not be parsed, model must retry." --- ## 4. Iteration Logic (Feedback Loop) A feedback line labeled **"Turns #2 - #N (Till `submit` action)"** originates from the right side of the outcome boxes. * **Direction**: The line moves from the right side of the outcome branches, across the top of the diagram, and points back down into the **Language Model Inference** component. * **Logic**: This indicates that the results of the previous turn (whether a success, a quiet success, or an error) are fed back into the model to generate the next "Thought & Action." * **Termination Check**: The loop continues until one of the termination criteria (listed in Section 1) is triggered, specifically highlighting the `submit` action in the loop label. --- ## Summary of Components and Flow | Component | Description | | :--- | :--- | | **Inputs** | System, Demonstration, and Instance messages. | | **Inference** | The Language Model generates a reasoning chain and a specific command. | | **Environment** | The system where the action is performed. | | **Feedback** | The environment's response is returned to the model for subsequent turns. | | **Exit Conditions** | Submission, budget exhaustion, or repeated formatting failures. | </details> System Template. The system template describes the interactive task setting, the commands at the agent's disposal, and the expected response format. It is the first message for any episode, does not change in content across task instances, and is not removed or collapsed at any point from the message history. The agent is told of the general task setting, which is a command line that comes with a special file viewer interface. After this, the agent is presented the command documentation, which shows a usage example and docstring for every custom command, mirroring the content of Figure 27. As discussed before, from manual observation, we find that agents need a lot of support to make effective use of the edit command. ## System Prompt SETTING: You are an autonomous programmer, and you're working directly in the command line with a special interface. The special interface consists of a file editor that shows you 100 lines of a file at a time. In addition to typical bash commands, you can also use the following commands to help you navigate and edit files. COMMANDS: {documentation} Please note that THE EDIT COMMAND REQUIRES PROPER INDENTATION. If you'd like to add the line ' print(x)' you must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run. ## RESPONSE FORMAT: (Open file: <path>) <cwd> $ Your shell prompt is formatted as follows: You need to format your output using two fields; discussion and command. Your output should always include one discussion and one command field EXACTLY as in the following example: <details> <summary>Figure 24 Details</summary> ![29527801f4ce425083d3b7c3334bc367ab4f286e733451fedc55bba297eeef45](http://localhost:8000/v1/image/29527801f4ce425083d3b7c3334bc367ab4f286e733451fedc55bba297eeef45) ### Visual Description # Technical Document Extraction: Command Interface Instructions ## 1. Document Overview This image contains a block of instructional text and a demonstration of a command-line interface protocol. It outlines the rules for a user or agent interacting with a shell environment within a specific "DISCUSSION" and command framework. ## 2. Component Isolation ### Region 1: Header and Example (Top) This section contains a heading and a brief narrative example of how to initiate a directory search. **Transcription:** > DISCUSSION > First I'll start by using ls to see what files are in the current directory. Then maybe we can look at some relevant files to see what they look like. > ```bash > ls -a > ``` ### Region 2: Instructional Body (Bottom) This section provides strict operational constraints for the user/agent regarding command execution and environment limitations. **Transcription:** > You should only include a *SINGLE* command in the command section and then wait for a response from the shell before continuing with more discussion and commands. Everything you include in the DISCUSSION section will be saved for future reference. If you'd like to issue two commands at once, PLEASE DO NOT DO THAT! Please instead first submit just the first command, and then after receiving a response you'll be able to issue the second command. You're free to use any other bash commands you want (e.g. find, grep, cat, ls, cd) in addition to the special commands listed above. However, the environment does NOT support interactive session commands (e.g. python, vim), so please do not invoke them. --- ## 3. Key Technical Constraints & Rules Based on the extracted text, the following operational parameters are defined: * **Command Limit:** Only one (1) command is permitted per submission. * **Workflow:** The user must wait for a shell response before proceeding to the next discussion or command block. * **Persistence:** All text within the "DISCUSSION" section is saved for future reference. * **Prohibited Actions:** * Issuing multiple commands simultaneously. * Invoking interactive session commands (specifically mentioning `python` and `vim`). * **Permitted Tools:** Standard bash commands are allowed, including but not limited to: * `find` * `grep` * `cat` * `ls` * `cd` ## 4. Language Declaration The entirety of the text in this image is in **English**. No other languages are present. </details> Figure 27: The system prompt for SWE-agent describes the environment. The documentation field is populated with brief description of all enabled commands, similar to Table 4. An agent will occasionally generate an edit with either the wrong level of indentation or incorrectly specified line numbers. Because of this, we include a note telling the agent to pay attention to proper indentation. Finally, the system prompt describes what the agent's response should look like, communicated with an example (e.g. JSON format, XML delimiters) followed by a paragraph reinforcing the importance of issuing a single thought/action pair per turn. Because of the constraints imposed by Docker containers, we include one last point about the command line environment not supporting any interactive session commands, such as vi or python . The system template does not introduce any task instance specific information. Demonstration Template. If provided, the demonstration template immediately follows the system template as the second message showing the agent a trajectory which resulted in the successful resolution of a task instance from the development set. As confirmed by the ablation in Table 3, including a demonstration slightly helps agents with understanding proper command usage and reduces the rate of errant responses. Rather than being written out entirely like the system template, the demonstration template is put together based on two fields as shown in the configuration file discussed in Section A.3. First, the demonstrations argument points at 1+ .traj file(s), each containing a multi-turn, SWE-agent style history of interactions. Second, the demonstration\_template and put\_demos\_in\_history arguments control how the demonstration is represented in the message history. If the put\_demos\_in\_history argument is set True , every turn of the demonstration trajectory is used as a separate message in the history. The alternative is to write out the entire demonstration as a single message, wrapped in a natural language instruction indicating it is a demonstration (this is the method that is displayed in Figure 28). The effectiveness of each method varies by model. The first method of adding a demonstration as separate messages can be preferable because it gives an agent the impression that it has already solved a task instance and is working on the next one. However, for a different model, it can be confusing to have the contexts from two tasks in a single trajectory, which is why the second method of explicitly telling the model that it's being given a demonstration can be better. <details> <summary>Figure 25 Details</summary> ![a02840b50900b5bf0163e431ede81f842302b98fc66f6f4153f14d96f670d40b](http://localhost:8000/v1/image/a02840b50900b5bf0163e431ede81f842302b98fc66f6f4153f14d96f670d40b) ### Visual Description # Technical Document Extraction: Demonstration Interface ## 1. Component Isolation The image is a single-frame UI element, likely a modal or a callout box, characterized by a light cream background, a black border with rounded corners, and a distinct header section. * **Header:** Contains the title "Demonstration". * **Main Content Body:** Contains instructional text, a placeholder for dynamic content, and a simulated terminal or log output. --- ## 2. Textual Transcription ### Header Section **Title:** Demonstration ### Main Content Body The text is written in a monospaced font, suggesting a technical or coding environment. **Instructional Text:** > Here is a demonstration of how to correctly accomplish this task. > It is included to show you how to correctly use the interface. > You do not need to follow exactly what is done in the demonstration. > > --- DEMONSTRATION --- > > {{INSTANCE PROMPT}} **Simulated Output/Log Text:** The following block uses line-continuation arrows ($\hookrightarrow$) to indicate wrapped text from a single logical line or a continuous stream of thought. > The code has been updated to use the `round` function, which should > $\hookrightarrow$ fix the rounding issue. Before submitting the changes, it would > $\hookrightarrow$ be prudent to run the reproduce.py code again to ensure that this > $\hookrightarrow$ change has the desired effect. I see that "Current directory" is > $\hookrightarrow$ still `/marshmallow-code__marshmallow`, so we can just run > $\hookrightarrow$ `reproduce.py` directly. > ... --- ## 3. Technical Analysis & Flow ### Components and Logic 1. **Context Setting:** The first three lines establish the purpose of the box: it is a non-mandatory guide for user interaction. 2. **Placeholder:** The string `{{INSTANCE PROMPT}}` is a template variable, indicating where specific task instructions would be injected programmatically. 3. **Action Log/Reasoning:** The lower block of text represents a "Chain of Thought" or a log from an automated agent (likely an AI coding assistant). * **Action:** Implementation of the `round` function. * **Validation Step:** Suggestion to run a script named `reproduce.py`. * **Environmental Awareness:** The agent identifies the current working directory as `/marshmallow-code__marshmallow`. * **Continuation:** The ellipsis (`...`) at the bottom suggests the process or log is ongoing. ### Formatting Observations * **Language:** English. * **Typography:** Monospaced (Courier-style), typical for code editors or terminal emulators. * **Visual Cues:** The use of backticks (e.g., `` `round` ``) indicates code-level references or terminal commands. </details> ``` python reproduce.py ``` (Open file: /marshmallow-code__marshmallow/src/marshmallow/fields.py) (Current directory: /marshmallow-code__marshmallow) bash-$ The output has changed from 344 to 345, which suggests that the ``rounding issue has been fixed. Let's remove the reproduce.py file ``since it is no longer needed. ``` rm reproduce.py ``` Your command ran successfully and did not produce any output. (Open file: /marshmallow-code__marshmallow/src/marshmallow/fields.py) (Current directory: /marshmallow-code__marshmallow) bash-$ rm doesn't have any output when it deletes successfully, so that must ``have worked. Now that we have fixed the issue, we should submit ``our changes to the code base. Let's submit the changes using the ``submit command. ``` submit ``` --- END OF DEMONSTRATION --- ``` Figure 28: A simplified demonstration template showing how demonstrations are provided to the model as a single message. Here we show only the final 3 turns in the demonstration for brevity. We are unsure if demonstrations actually help agents understand the nuances of domain specific problem solving. Because of the diversity of software engineering issues, we think the role the demonstration plays is primarily to help the agent learn to issue properly formatted commands. Prior work has demonstrated that fine tuning may have the potential to imbue agents with a certain degree of expertise around how to adaptively solve task instances that may vary in terms of what strategy is most successful. Instance Template. The instance template introduces the agent to the task instance. The problem statement is shown, followed by a brief set of instructions that reiterate important points from the system template. These points are the one thought/action per-turn requirement, mentioning the lack of support for interactive shell commands, and a reminder of the importance of editing indentation. Finally, a notably effective part of the instance template is the inclusion of tips which serve as an additional guidelines for how to operate successfully in the bash environment, shown in Figure 29. These tips were developed manually and iteratively; after running SWE-agent with a particular configuration on the development set, we manually looked at the trajectories for failure modes. The tips were born out of these failures, and through repeated inspection, we found that such tips did reduce the frequency of errant problem solving strategies that they are meant to address. While our manual approach to writing tips certainly does not scale, representing feedback for common mistakes as tips is surprisingly effective. Developing better methods for this process of identifying failure modes and writing natural language instructions that describe the correct alternative behavior could be an avenue to better performance for future SWE-agent based systems. Finally, at the end of the message, the agent is presented with a command line prompt indicating that the task has begun and that the agent should issue its first command. ``` Instance Message We're currently solving the following issue within our repository. Here's the issue text: ISSUE: {issue} ``` INSTRUCTIONS: Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want. Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command. When you're satisfied with all of the changes you've made, you can submit your changes to the code base by simply running the submit command. Note however that you cannot use any interactive session commands (e.g. python, vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with python <script\_name>.py . ↪ → ↪ → ↪ → ↪ → ↪ → ↪ → ↪ → ↪ → ↪ → ↪ → ↪ → (Open file: {open\_file}) (Current directory: bash-$ ``` ↠ to help you. Edit all the files you need to and run any checks or ↠ tests that you want. Remember, YOU CAN ONLY ENTER ONE COMMAND AT ↠ A TIME. You should always wait for feedback after every command. ↠ When you're satisfied with all of the changes you've made, you ↠ can submit your changes to the code base by simply running the ↠ submit command. Note however that you cannot use any interactive ↠ session commands (e.g. python, vim) in this environment, but you ↠ can write scripts and run them. E.g. you can write a python ↠ script and then run it with `python <script_name>.py`. NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing ↠ a file, make sure to insert appropriate indentation before each ↠ line! IMPORTANT TIPS: 1. Always start by trying to replicate the bug that the issues ↠ discusses. If the issue includes code for reproducing the bug, we ↠ recommend that you re-implement that in your environment, and run ↠ it to make sure you can reproduce the bug. Then start trying to ↠ fix it. When you think you've fixed the bug, re-run the bug ↠ reproduction script to make sure that the bug has indeed been ↠ fixed. 2. If you run a command and it doesn't work, try running a different ↠ command. A command that did not work once will not work the ↠ second time unless you modify it! 3. If you open a file and need to get to an area around a specific ↠ line that is not in the first 100 lines, say line 583, don't just ↠ use the scroll_down command multiple times. Instead, use the goto ↠ 583 command. It's much quicker. 4. If the bug reproduction script requires inputting/reading a ↠ specific file, such as buggy-input.png, and you'd like to ↠ understand how to input that file, conduct a search in the ↠ existing repo code, to see whether someone else has already done ↠ that. Do this by running the command: find_file "buggy-input.png" ↠ If that doesn't work, use the linux 'find' command. 5. Always make sure to look at the currently open file and the ↠ current working directory (which appears right after the ↠ currently open file). The currently open file might be in a ↠ different directory than the working directory! Note that some ↠ commands, such as 'create', open files, so they might change the ↠ current open file. 6. When editing files, it is easy to accidentally specify a wrong ↠ line number or to write code with incorrect indentation. Always ↠ check the code after you issue an edit to make sure that it ↠ reflects what you wanted to accomplish. If it didn't, issue ↠ another command to fix it. (Open file: {open_file}) (Current directory: {working_dir}) bash-$ ``` Figure 29: The instance template. This message shows the task instance's problem statement (referenced by the {issue} field), shows additional task instance-specific information, and provides a set of tips suggesting recommended problem solving approaches and pitfalls to look out for. Next Step Template. Assuming an agent's response is well formed and contains an action, there are two simple templates used to present the corresponding output of the agent's action, as shown in Figure 30. If an agent's action produces some standard output, the agent is simply shown this output with a command line prompt to indicate that the agent should respond with the next action. <details> <summary>Figure 26 Details</summary> ![5cac2fccefeabe34c22b6cf125cc3eddd7afd3633990803cc2b4d2b5e11c9fe5](http://localhost:8000/v1/image/5cac2fccefeabe34c22b6cf125cc3eddd7afd3633990803cc2b4d2b5e11c9fe5) ### Visual Description # Technical Document Extraction: Next Step Template ## 1. Component Isolation The image is a structured text box, likely a UI element or a documentation snippet representing a command-line interface or an agentic workflow prompt. It is divided into two primary regions: * **Header Region:** Contains the title of the template. * **Main Content Region:** Contains placeholder text, environmental status indicators, and a command prompt. --- ## 2. Detailed Text Transcription ### Header Region * **Text:** `Next Step Template` * **Styling:** Bold, Serif font, aligned to the top-left within a distinct header bar. ### Main Content Region The main body is enclosed in a light-cream colored box with a black border. The text is rendered in a monospaced (fixed-width) font, typical of code or terminal outputs. * **Line 1:** `{OBSERVATION}` * *Note: This is a placeholder variable enclosed in curly braces.* * **Line 2:** `(Open file: /path/to/open/file.py)` * *Note: Indicates the current state of an editor or an active file path.* * **Line 3:** `(Current directory: /path/to/cwd)` * *Note: Indicates the current working directory (CWD) of the environment.* * **Line 4:** `bash-$` * *Note: Represents a standard Bash shell prompt, awaiting user input.* --- ## 3. Technical Analysis and Flow This image represents a **State-Aware Prompt Template**. It is designed to provide context to a user or an AI agent before they execute a command. ### Components and Logic: 1. **Contextual Variable (`{OBSERVATION}`):** A slot intended to be filled with the results of a previous action or a description of the current system state. 2. **Environmental Metadata:** * **File Context:** Explicitly tracks which file is currently being accessed or edited (`/path/to/open/file.py`). * **Spatial Context:** Explicitly tracks the location in the file system (`/path/to/cwd`). 3. **Action Interface:** The `bash-$` prompt signifies that the "Next Step" involves executing a terminal command based on the provided observation and environmental context. --- ## 4. Summary of Information | Field | Content / Value | | :--- | :--- | | **Template Name** | Next Step Template | | **Primary Variable** | `{OBSERVATION}` | | **Active File Path** | `/path/to/open/file.py` | | **Working Directory** | `/path/to/cwd` | | **Interface Type** | Bash Shell (`bash-$`) | **Language Declaration:** The text is entirely in **English**. No other languages are present. </details> Figure 30: The environment's 'next step" template. This is emitted after each observation to inform the model of the current state of the shell and programs. However, if an agent's action runs silently and produces no output (e.g. rm abc.py , touch abc.py ), we found that simple showing no output along with a prompt can be confusing for agents to interpret, and it will often run additional, unnecessary commands to determine the effect of the prior action. To guard against this situation, the agent is informed verbosely that the command ran successfully and did not produce output. While the System, Demonstration, and Instances template are only used a single time, the next step template is used repeatedly. In the SWE-agent configuration described in this work, the next step templates are fairly simple, as they essentially just add the command line prompt to the end of the execution standard output. We have not explored other variations to this style. Collapsed Observation Template. As shown in Figure 9 and discussed in Section 2, old observations are collapsed ; meaning that the structure and order of the agent's interaction history is preserved, but the content of old observations are replaced with a one-line placeholder. This summary simply states that the observation is omitted with the number of lines that were removed, as shown in Figure 31. <details> <summary>Figure 27 Details</summary> ![84daac7d7c845365ca7b688cee5fb070950e8ceabfa5c971004ff4f1bea99193](http://localhost:8000/v1/image/84daac7d7c845365ca7b688cee5fb070950e8ceabfa5c971004ff4f1bea99193) ### Visual Description ### Technical Document Extraction: Environment Response Template #### 1. Component Isolation The image is a single UI (User Interface) element, specifically a container or "card" representing a collapsed state of a template. It is segmented into two horizontal regions: * **Header Region:** Contains the title of the component. * **Body Region:** Contains a placeholder text indicating omitted content. #### 2. Text Transcription The image contains the following English text: **Header Text:** > Environment Response (collapsed) Template **Body Text (Monospaced Font):** > Old output omitted (101 lines) #### 3. Visual and Structural Description * **Type:** UI Component / Data Container. * **Shape:** Rectangular with rounded corners. * **Border:** A thin, solid black outline defines the perimeter and separates the header from the body. * **Background Color:** The entire component has a light cream or off-white background. * **Typography:** * The header uses a **bold, serif font**, centered vertically and aligned to the left within the header section. * The body uses a **regular, monospaced font** (typical of code or terminal output), aligned to the left. * **Functional Context:** The text "collapsed" and "omitted (101 lines)" indicates that this is a summary view of a much larger dataset or log file, intended to save vertical space in a technical interface. #### 4. Data and Facts * **Status:** Collapsed. * **Quantity of Hidden Data:** 101 lines. * **Subject Matter:** Environment Response. </details> Figure 31: A collapsed environment response. The content of old observations are replaced with this template. The purpose of collapsing old observations is twofold. First, it simply reduces the number of tokens needing to be processed at each interaction step. Second, by removing old observations' content, it can also reduce the presence of outdated or duplicate information, such as directory and file contents or command output. We think that both of these purposes can serve to improve agent performance. Error Message Template. An agent's response is not always well formed. As discussed, an agent is generally expected to respond with a single thought and action per turn. In our work, we experimented with asking the agent to generate one of three different styles of responses for communicating one thought and one action (same type of response across any single episode). 1. A string where the action is wrapped as a markdown-style code block (```). 3. An XML style generation with ' < thought > " and ' < action > " delimiters. 2. A JSON style dictionary with 'thought" and 'action" keys. On occasion, an agent may generate a response that doesn't conform to the requested format. If this occurs, we show an error message to the agent indicating that the prior message was malformed and to issue another response that does not make the same mistake, as presented in Figure 32. If a model generates 3 malformed responses in a row, the episode will terminate early. Figure 32: The environment's error message. This is emitted if a model generation doesn't conform to the thought-action format suggested. <details> <summary>Figure 28 Details</summary> ![9fd78c2bdbf8b337c49664d0818efa08487f3740ffad719388ab4607024613e2](http://localhost:8000/v1/image/9fd78c2bdbf8b337c49664d0818efa08487f3740ffad719388ab4607024613e2) ### Visual Description # Technical Document Extraction: Error Message Interface ## 1. Component Isolation The image is a digital interface element, specifically a system error notification box. It is segmented into two primary regions: * **Header:** Contains the title of the notification. * **Main Body:** Contains the descriptive error text and a required formatting template. --- ## 2. Textual Transcription ### Header Region * **Title:** `Error Message` (Bold, Serif font) ### Main Body Region The body text is rendered in a monospaced (typewriter-style) font, indicating technical instructions or code-related requirements. **Transcribed Text:** > Your output was not formatted correctly. You must always include one discussion and one command as part of your response. Make sure you do not have multiple discussion/command tags. > Please make sure your output precisely matches the following format: > DISCUSSION > Discuss here with yourself about what your planning and what you're going to do in this step. > > ``` > command(s) that you're going to run > ``` --- ## 3. Technical Analysis and Flow This image represents a **Validation Error** from a system that requires a specific two-part response structure from a user or an AI agent. ### Required Components: 1. **Discussion Segment:** A self-reflective planning phase where the agent describes its intended actions. 2. **Command Segment:** The actual execution code or command. ### Constraints Identified: * **Cardinality:** Exactly one "discussion" and one "command" per response. * **Uniqueness:** Multiple tags of the same type are prohibited. * **Formatting:** The output must strictly adhere to the provided template, which uses the keyword `DISCUSSION` followed by a triple-backtick code block for the command. --- ## 4. Visual Specifications * **Background Color:** Light cream/off-white. * **Border:** Thin black solid line with rounded corners. * **Layout:** A horizontal line separates the "Error Message" header from the body content. * **Language:** English (100%). No other languages are present. </details> Another context management trick is that if models generate a malformed response, but then subsequently respond with a valid one, the message history is modified such that the action and response correspond to the malformed generation is removed. This kind of de-noising reduces unnecessary context and helps prevent future malformed generations. Each well-formatted response becomes an additional in-context demonstration of how to interact with the environment correctly; this 'momentum' of correct responses is effective for helping agents continue to issue correct actions at later turns in trajectories when there is a lot of information in the message history. ## D Qualitative Analysis We choose 2 successful and unsuccessful task instances from SWE-bench Lite to perform a full qualitative analysis, based on the trajectories generated by the default configuration using GPT-4 Turbo. ## Analysis for psf/requests-2317 Issue Summary. The agent was tasked with resolving a problem in the requests library where the method variable, a bytes object (e.g., b'GET' ), was being incorrectly converted to a literal string ( "b'GET'" ) by the builtin\_str function. This caused the requests library to raise a 404 Not Found response due to the incorrect method string. The issue was observed in Python 3.4 and not in Python 2.6, indicating a version-specific difference in handling bytes. ## Interaction Summary 1. Locate the Target File : The agent began by finding the sessions.py file in the requests library directory. It correctly identified that the relevant file was in the requests directory rather than the build directory. 2. Search for the Issue : It opened sessions.py and used the search\_file command to locate the builtin\_str function call. It found the relevant line (line 428) where the method variable was being converted. 3. Edit the Code : The agent edited the sessions.py file to check if the method was a bytes object and, if so, decoded it to ASCII before applying the builtin\_str function. 4. Test Changes : To verify the fix, the agent created a new file, reproduce\_issue.py , which attempted to send an HTTP request using a binary method string. The script printed the response status code to check if the request was handled correctly. It ran the reproduce\_issue.py script, which executed successfully and received a 200 OK response, indicating that the issue was resolved. 5. Clean Up and Submit : Finally, the agent deleted the reproduction script and submitted the changes to the code base. Evaluation. Upon evaluation, the agent's proposed changes successfully pass the instance's unit tests. When comparing the agent's generation to the 'gold patch', we notice that the gold patch uses the to\_native\_string function, defined in requests/utils.py#L655, which behaves identically to the agent's proposed changes. Ultimately, while the agent successfully resolves the issue, it fails to use existing utility functions, which could lead to less maintainable or portable code in production. Figure 33: Trajectory for instance psf\_\_requests-2317 . ## Issue psf\_\_requests-2317 We're currently solving the following issue within our repository. Here's the issue text: method = builtin\_str(method) problem ISSUE: In requests/sessions.py is a command: Converts method from method = builtin\_str(method) b'GET' "b'GET'" to Which is the literal string, no longer a binary string. When requests tries to use the method "b'GET'', it gets a 404 Not Found response. I'm not sure if this is a bug with neutronclient or a bug with requests, but I'm starting here. Seems if requests handled the method value being a binary string, we wouldn't have any problem. I am using python3.4 and python-neutronclient (2.3.9) with requests (2.4.3). neutronclient is broken because it uses this "args = utils.safe\_encode\_list(args)" command which converts all the values to binary string, including method. Also, I tried in python2.6 and this bug doesn't exist there. Some difference between 2.6 and 3.4 makes this not work right. ``` INSTRUCTIONS:(31 lines) (Open file: n/a) (Current directory: /psf__requests) bash-\$ ``` ## SWE-agent ## psf\_\_requests-2317 To address the issue, we first need to replicate the behavior described. The issue mentions that the 'builtin\_str' function in 'requests/sessions.py' is converting a binary string method like 'b'GET'' into the literal string ''b'GET''', which is incorrect. We should start by locating the 'sessions.py' file within the 'requests' library to examine the code around the 'builtin\_str' function call. Let's begin by finding the 'sessions.py' file in the current repository. ``` find_file "sessions.py" ``` ## Observation ``` Found 2 matches for "sessions.py" in /psf__requests: /psf__requests/build/lib/requests/sessions.py /psf__requests/requests/sessions.py (Open file: n/a) (Current directory: /psf__requests) bash-$ ``` ## SWE-agent ## psf\_\_requests-2317 ## psf\_\_requests-2317 We have found two matches for 'sessions.py'. The one in the 'build/lib/requests' directory is likely a build artifact, and the one we're interested in is probably '/psf\_\_requests/requests/sessions.py'. Let's open this file to inspect the code around the 'builtin\_str' function call. open requests/sessions.py ## Observation ## psf\_\_requests-2317 ``` Observation psf__requests-2317 [File: /psf__requests/requests/sessions.py (671 lines total)] 1:# -*- coding: utf-8 -*- 2: 3:""" 4:requests.session 5:~~~~~~~~~~~~~~~ 6: 7:This module provides a Session object to manage and persist 8: settings across 9: requests (cookies, auth, proxies). 10:"""" 11:import os 12:from collections import Mapping 13:from datetime import datetime 14: 15:from .auth import _basic_auth_str 16:from .compat import cookielib, OrderedDict, urljoin, urlparse, 18: built_str ``` ``` 17:from .cookies import ( 18: cookiejar_from_dict, extract_cookies_to_jar, 19: RequestsCookieJar, merge_cookies) 19: from .models import Request, PreparedRequest, 20: DEFAULT_REDIRECT_LIMIT 20: from .hooks import default_hooks, dispatch_hook 21: from .utils import to_key_val_list, default_headers, 22: to_native_string 23: from .exceptions import ( 23: TooManyRedirects, InvalidSchema, ChunkedEncodingError, 24: ContentDecodingError) 25: from .structures import CaseInsensitiveDict 25: 26: from .adapters import HTTPAdapter 27: 28: from .utils import ( 29: requote_uri, get_environ_proxies, get_netrc_auth, 30: should_bypass_proxies, 31: get_auth_from_url 32:) 33: 34: from .status_codes import codes 35:# formerly defined here, reexposed here for backward 36: compatibility 37: from .models import REDIRECT_STATI 37: 38: 39:def merge_setting(request_setting, session_setting, 40: dict_class=OrderedDict): 41: """ 42: Determines appropriate setting for a given request, taking 43: into account the 44: explicit setting on that request, and the setting in the 45: session. If a 46: setting is a dictionary, they will be merged together using 47: `dict_class` 48: """ 49: 50: if session_setting is None: 51: return request_setting 52: 53: if request_setting is None: 54: return session_setting 55: 56: # Bypass if not a dictionary (e.g. verify) 57: if not ( 58: isinstance(session_setting, Mapping) and 59: isinstance(request_setting, Mapping) 60: ): 61: return request_setting 62: 63: merged_setting = 64: dict_class(to_key_val_list(session_setting)) 65: merged_setting.update(to_key_val_list(request_setting)) 66: 67: # Remove keys that are set to None. 68: for (k, v) in request_setting.items(): 69: if v is None: 70: del merged_setting[k] 71: 72: merged_setting = dict((k, v) for (k, v) in 73: merged_setting.items() if v is not None) ``` ``` 68: 69: return merged_setting 70: 71: 72:def merge_hooks(request_hooks, session_hooks, 73: dict_class=OrderedDict): 74: """ 75: Properly merges both requests and session hooks. 76: This is necessary because when request_hooks == 77: {'response': []}, the 78: merge breaks Session hooks entirely. 79: """ 80: if session_hooks is None or session_hooks.get('response') 81: == []: 82: return request_hooks 83: if request_hooks is None or request_hooks.get('response') 84: == []: 85: return session_hooks 86: return merge_setting(request_hooks, session_hooks, 87: dict_class) 88: 89:class SessionRedirectMixin(object): 90: def resolve_redirects(self, resp, req, stream=False, 91: verify=True, cert=None, proxies=None): 92: """Receives a Response. Returns a generator of 93: Responses.""" 94: i = 0 95: hist = [] # keep track of history 96: 97: while resp.is_redirect: 98: prepared_request = req.copy() 99: 100: if i > 0: 101: # Update history and keep track of redirects. 102: (571 more lines below) (Open file: n/a) (Current directory: /psf__requests) bash-$ ``` ## SWE-agent ## psf\_\_requests-2317 The 'sessions.py' file is now open, but the 'builtin\_str' function call is not visible in the first 100 lines of the file. Since we don't have the exact line number where the 'builtin\_str' function is called, we'll need to search within the file for it. Let's use the 'search\_file' command to find the usage of 'builtin\_str' within 'sessions.py'. ``` search_file "builtin_str" ``` ## Observation ## psf\_\_requests-2317 ``` Found 2 matches for "builtin_str" in / /psf__requests/requests/sessions.py: Line 16:from .compat import cookielib, OrderedDict, urljoin, / urlparse, builtin_str Line 428: method = builtin_str(method) End of matches for "builtin_str" in / /psf__requests/requests/sessions.py (Open file: /psf__requests/requests/sessions.py) (Current directory: /psf__requests) bash-$ ``` ## SWE-agent psf\_\_requests-2317 We have found two occurrences of 'builtin\_str' in 'sessions.py'. The second occurrence on line 428 is where the method is being converted to a string. We should inspect this line to understand the context and see how we can address the issue. Let's navigate to line 428 in 'sessions.py' to see the relevant code. goto 428 ## Observation ## psf\_\_requests-2317 ``` goto 428 ``` ``` ``` ``` 495: :param data: (optional) Dictionary, bytes, or → file-like object to send in the body of the :class:`Request`. 496: :param json: (optional) json to send in the body of → the :class:`Request`. 497: :param \*kwargs: Optional arguments that ``request`` → takes. 498: """ 499: 500: return self.request('POST', url, data=data, json=json, → **kwargs) 501: 502: def put(self, url, data=None, **kwargs): 503: """Sends a PUT request. Returns :class:`Response` → object. 504: 505: :param url: URL for the new :class:`~Request` object. 506: :param data: (optional) Dictionary, bytes, or → file-like object to send in the body of the :class:`Request`. 507: :param \*kwargs: Optional arguments that ``request`` → takes. 508: """ 509: return self.request('PUT', url, data=data, **kwargs) 510: 511: def patch(self, url, data=None, **kwargs): 512: (159 more lines below) (Open file: /psf__requests/requests/sessions.py) (Current directory: /psf__requests) bash-$ ``` ## SWE-agent ## psf\_\_requests-2317 We are now looking at the section of the 'sessions.py' file where the 'builtin\_str' function is used to convert the 'method' variable. According to the issue, this conversion is causing a problem when the 'method' is a binary string because it's being converted to a literal string that includes the binary prefix 'b'...''. Let's edit the line to add a check that ensures binary strings are decoded to regular strings before being passed to 'builtin\_str'. To address this issue, we need to ensure that the 'method' variable is correctly converted to a string without the binary prefix. This might involve checking the type of the 'method' variable and handling binary strings differently from regular strings. ``` edit 428:428 # Ensure method is a string, decode if it's a binary string if isinstance(method, bytes): method = method.decode('ascii') method = builtin_str(method) end_of_edit ``` ## Observation ## psf\_\_requests-2317 ``` [File: /psf__requests/requests/sessions.py (674 lines total)] (377 more lines above) 378: def request(self, method, url, 379: params=None, 380: data=None, 381: headers=None, 382: cookies=None, ``` ``` 383: files=None, 384: auth=None, 385: timeout=None, 386: allow_redirects=True, 387: proxies=None, 388: hooks=None, 389: stream=None, 390: verify=None, 391: cert=None, 392: json=None): 393: """Constructs a :class:`Request <Request>`, prepares it 394: and sends it. 395: Returns :class:`Response <Response>` object. 396: :param method: method for the new :class:`Request` 397: object. 398: :param url: URL for the new :class:`Request` object. 399: :param params: (optional) Dictionary or bytes to be 400: sent in the query 401: string for the :class:`Request`. 402: :param data: (optional) Dictionary or bytes to send in 403: the body of the 404: :class:`Request`. 405: :param headers: (optional) Dictionary of HTTP Headers 406: to send with the 407: :class:`Request`. 408: :param cookies: (optional) Dict or CookieJar object to 409: send with the 410: :class:`Request`. 411: :param files: (optional) Dictionary of ``'filename': 412: file-like-objects`` 413: for multipart encoding upload. 414: :param auth: (optional) Auth tuple or callable to 415: enable 416: Basic/Digest/Custom HTTP Auth. 417: :param timeout: (optional) How long to wait for the 418: server to send 419: data before giving up, as a float, or a (`connect 420: timeout, read 421: timeout <user/advanced.html#timeouts>`_) tuple. 422: :type timeout: float or tuple 423: :param allow_redirects: (optional) Set to True by 424: default. 425: :type allow_redirects: bool 426: :param proxies: (optional) Dictionary mapping protocol 427: to the URL of 428: the proxy. 429: :param stream: (optional) whether to immediately 430: download the response 431: content. Defaults to ``False``. 432: :param verify: (optional) if ``True``, the SSL cert 433: will be verified. 434: A CA_BUNDLE path can also be provided. 435: :param cert: (optional) if String, path to ssl client 436: cert file (.pem). 437: If Tuple, ('cert', 'key') pair. 438: """ 439: # Ensure method is a string, decode if it's a binary 440: string 441: if isinstance(method, bytes): ``` ``` __________________________________________________________________________________________________ 430: method = method.decode('ascii') 431: method = builtin_str(method) 432: 433: # Create the Request. 434: req = Request( 435: method = method.upper(), 436: url = url, 437: headers = headers, 438: files = files, 439: data = data or {}, 440: json = json, 441: params = params or {}, 442: auth = auth, 443: cookies = cookies, 444: hooks = hooks, 445: ) 446: prep = self.prepare_request(req) 447: 448: proxies = proxies or {} 449: 450: settings = self.merge_environment_settings( 451: prep.url, proxies, stream, verify, cert 452: ) 453: 454: # Send the request. 455: send_kwargs = { 456: 'timeout': timeout, 457: 'allow_redirects': allow_redirects, 458: } 459: send_kwargs.update(settings) 460: resp = self.send(prep, **send_kwargs) 461: 462: return resp 463: 464: def get(self, url, **kwargs): 465: """Sends a GET request. Returns :class:`Response` 466: 467: :param url: URL for the new :class:`~Request` object. 468: :param \*\*kwargs: Optional arguments that ``request`` 469: takes. 470: 471: """ 472: return self.request('GET', url, **kwargs) 473: 474: def options(self, url, **kwargs): 475: """Sends a OPTIONS request. Returns :class:`Response` 476: 477: object. 478: 479: :param url: URL for the new :class:`~Request` object. 480: (197 more lines below) 481: File updated. Please review the changes and make sure they are 482: correct (correct indentation, no duplicate lines, etc). Edit 483: the file again if necessary. 484: (Open file: /psf__requests/requests/sessions.py) 485: (Current directory: /psf__requests) 486: bash-$ ``` ## SWE-agent ## psf\_\_requests-2317 The edit has been made to decode binary strings before converting them to a string with 'builtin\_str'. This should prevent the issue where a binary string like 'b'GET'' is incorrectly converted to the literal string ''b'GET'''. Let's create a new file called 'reproduce\_issue.py' to test the fix. To verify that this change resolves the issue, we should create a script that replicates the problem as described. We'll write a script that uses the 'requests' library to make a request with a binary method and check if it now handles the method correctly. create reproduce\_issue.py ## Observation ## psf\_\_requests-2317 ``` [File: /psf__requests/reproduce_issue.py (1 lines total)] 1: (Open file: /psf__requests/requests/sessions.py) (Current directory: /psf__requests) bash-$ ``` ## SWE-agent ## psf\_\_requests-2317 Now that we have created the 'reproduce\_issue.py' file, let's write a script that attempts to send a request with a binary method string, which should now be handled correctly due to our fix in 'sessions.py'. We'll write a simple script that imports the 'requests' library, sets a binary method, and tries to make a request to a test URL. ``` request to a test URL. ----------------------------------------------------------------------------- edit 1:1 import requests # Define a binary method string method = b'GET' # Test URL (example) url = 'https://httpbin.org/get' # Attempt to send a request with a binary method response = requests.request(method, url) # Print the response status code print('Response status code:', response.status_code) # Print a success message if the script completes print("Script completed successfully, no errors.") end_of_edit ``` ``` # Print the response status code. print('Response status code:', response.status_code) # Print a success message if the script completes print("Script completed successfully, no errors.") end_of_edit ``` ## Observation ``` [File: /psf__requests/reproduce_issue.py (16 lines total)] 1:import requests 2: 3:# Define a binary method string 4:method = b'GET' 5: 6:# Test URL (example) 7:url = 'https://httpbin.org/get' ``` ## psf\_\_requests-2317 ``` total) ] ``` ``` 8: 9:# Attempt to send a request with a binary method 10:response = requests.request(method, url) 11: 12:# Print the response status code 13:print('Response status code:', response.status_code) 14: 15:# Print a success message if the script completes 16:print("Script completed successfully, no errors.") File updated. Please review the changes and make sure they are ↪ → correct (correct indentation, no duplicate lines, etc). Edit ↪ → the file again if necessary. (Open file: /psf__requests/reproduce_issue.py) (Current directory: /psf__requests) bash-$ ``` ## SWE-agent psf\_\_requests-2317 The script has been written and should now attempt to send a request using a binary method string. This will help us verify if the issue has been resolved by the changes we made to 'sessions.py'. Let's run the script to see if it executes successfully without encountering the issue described. python reproduce\_issue.py ## Observation ``` Response status code: 200 Script completed successfully, no errors. (Open file: /psf__requests/reproduce_issue.py) (Current directory: /psf__requests) bash-$ ``` ## SWE-agent psf\_\_requests-2317 psf\_\_requests-2317 The script ran successfully and received a 200 OK response, which indicates that the issue with handling binary method strings has been resolved. The 'requests' library is now correctly processing the binary string 'b'GET'' as a valid HTTP method. Let's remove the 'reproduce\_issue.py' file. Since the fix appears to be working, we can clean up by removing the 'reproduce\_issue.py' script, as it is no longer needed. After that, we can submit our changes to the code base. rm reproduce\_issue.py ## Observation ``` (Open file: /psf__requests/reproduce_issue.py) (Current directory: /psf__requests) bash-$ ``` psf\_\_requests-2317 ## SWE-agent psf\_\_requests-2317 The 'reproduce\_issue.py' file has been successfully removed. Now that we have verified the fix and cleaned up the test script, we can proceed to submit our changes to the code base. Let's submit the changes using the 'submit' command. submit ## Observation psf\_\_requests-2317 ``` Observation ``` ## Gold Patch psf\_\_requests-2317 ``` Gold Patch diff --git a/requests/sessions.py b/requests/sessions.py --- a/requests/sessions.py +++ b/requests/sessions.py @@ -13,7 +13,7 @@ from datetime import datetime from .auth import _basic_auth_str -from .compat import cookielib, OrderedDict, urljoin, urlparse, ↔ builtin_str +from .compat import cookielib, OrderedDict, urljoin, urlparse from .cookies import ( cookiejar_from_dict, extract_cookies_to_jar, → RequestsCookieJar, merge_cookies) from .models import Request, PreparedRequest, ↔ DEFAULT_REDIRECT_LIMIT @@ -425,7 +425,7 @@ def request(self, method, url, If Tuple, ('cert', 'key') pair. """ -method = builtin_str(method) +method = to_native_string(method) # Create the Request. req = Request( ``` ## Analysis for pylint-dev/pylint-5859 Issue Summary. The issue involves the -notes option in pylint , which fails to report fixme warnings (W0511) for note tags that are entirely punctuation. Specifically, when running pylint with the -notes="YES,???" option, only the 'YES' tag is recognized, and the '???' tag is ignored. The expected behavior is to recognize the punctuation-only tag normally. ## Interaction Summary ## 1. Reproducing the Issue : - The agent started by creating a test.py file and added comments to replicate the issue described. This was followed by running pylint to confirm the issue, which only reported the 'YES' note and not the '???' note. ## 2. Searching for Relevant Code : - The agent attempted to search for the handling of the -notes option using the search\_dir command. After some initial confusion with the command options, it successfully identified the pylint/checkers/misc.py file as containing the relevant code. ## 3. Editing the Code : - The agent located the regular expression responsible for matching note tags within the open method of the EncodingChecker class. It modified the regular expression to handle punctuation-only note tags by replacing the word boundary \b with a lookahead assertion (? ¯ \W|$) . ## 4. Test Changes with Reproduction Code : - After making the modification, the agent re-ran pylint on the test.py file. The output confirmed that both note tags ('YES' and '???') were now correctly identified. - Clean Up and Submit : - Finally, the agent removed the test.py file and submitted the changes. Evaluation The agent effectively replicated the issue, identified and navigated to the relevant code, made a targeted modification to the regular expression, and verified the fix through testing. While the agent's solution successfully passes the necessary tests, there are slight differences in behavior between the agent's solution and the gold patch. In particular, the agent replaced the \b word boundary in the regular expression to handle note tags with the regex pattern (\W|$) to mark the tag boundary. Comparing this to the gold patch solution which uses the pattern (:|\s|\Z) in the same way, it may be that the agent's solution causes other unintended consequences that aren't captured by the existing tests. Figure 34: Trajectory for instance pylint-dev\_\_pylint-5859 . <details> <summary>Figure 29 Details</summary> ![4121e49307a7cd61ce6ccf9cd3550c497d54aae4c074d81c051c15595e3117f7](http://localhost:8000/v1/image/4121e49307a7cd61ce6ccf9cd3550c497d54aae4c074d81c051c15595e3117f7) ### Visual Description # Technical Document Extraction: Issue pylint-dev__pylint-5859 ## 1. Document Overview This image is a structured technical report or "Issue Card" documenting a specific software bug in the `pylint` repository. The document is contained within a rounded rectangular border and uses a combination of bold headers and monospaced text for code examples. ## 2. Component Isolation ### Region 1: Header * **Left Alignment:** `Issue` * **Right Alignment:** `pylint-dev__pylint-5859` (Identifier for the repository and issue number). ### Region 2: Main Content (Issue Description) This section contains the prose and technical details of the bug. * **Introductory Text:** "We’re currently solving the following issue within our repository. Here’s the issue text:" * **ISSUE Title:** `"--notes" option ignores note tags that are entirely punctuation` * **Section Header:** **Bug description** * **Description Text:** "If a note tag specified with the ‘--notes’ option is entirely punctuation, pylint won’t report a fixme warning (W0511)." * **Code/Example Block (Monospaced):** ```text YES: yes ???: no ``` * **Explanation Text:** `pylint test.py -notes="YES,???"` will return a fixme warning (W0511) for the first line, but not the second. ### Region 3: Configuration and Command Metadata This section specifies the environment/settings used to reproduce the issue. * **Section Header:** **Configuration** * **Format:** `ini` * **Sub-label:** `Default` * **Section Header:** **Command used** * **Environment:** `shell` --- ## 3. Technical Data Summary | Field | Value | | :--- | :--- | | **Repository/Issue ID** | pylint-dev__pylint-5859 | | **Bug Category** | Functional / Reporting | | **Affected Option** | `--notes` | | **Error Code** | `W0511` (fixme warning) | | **Trigger Condition** | Note tags consisting entirely of punctuation (e.g., `???`) | | **Observed Behavior** | Punctuation-only tags are ignored; alphanumeric tags (e.g., `YES`) work correctly. | | **Configuration Type** | Default `.ini` | | **Execution Context** | Shell command | ## 4. Language Declaration The entire document is in **English**. No other languages are present. </details> ``` pylint test.py -notes="YES,???" ### Pylint output shell ********** Module test test.py:1:1: W0511: YES: yes (fixme) Expected behavior ********** Module test test.py:1:1: W0511: YES: yes (fixme) test.py:2:1: W0511: ????: no (fixme) Pylint version shell pylint 2.12.2 astroid 2.9.0 Python 3.10.2 (main, Feb 2 2022, 05:51:25) [Clang 13.0.0 (clang-1300.0.29.3)] ### OS / Environment macOS 11.6.1 ### Additional dependencies _No response_ INSTRUCTIONS: (31 lines) (Open file: n/a) (Current directory: /pylint) bash-$ ``` ## SWE-agent pylint-dev\_\_pylint-5859 To begin addressing the issue, we should first try to replicate the bug as described. We can do this by creating a new Python script that includes the comments with the note tags 'YES' and '???' and then running pylint with the '-notes' option. Let's start by creating a new file called 'test.py' and adding the relevant comments to it. create test.py ## Observation ## pylint-dev\_\_pylint-5859 ``` [File: /pylint-dev__pylint/test.py (1 lines total)] 1: (Open file: n/a) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent Now let's add the example comments from the issue into 'test.py'. ``` edit 1:1 # YES: yes # ????: no end_of_edit ``` ## Observation pylint-dev\_\_pylint-5859 ## pylint-dev\_\_pylint-5859 ``` [File: /pylint-dev__pylint/test.py (2 lines total)] 1:# YES: yes 2:# ???: no File updated. Please review the changes and make sure they are <> correct (correct indentation, no duplicate lines, etc). Edit <> the file again if necessary. (Open file: /pylint-dev__pylint/test.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 The 'test.py' file has been updated with the example comments. Next, we should run pylint with the '-notes' option to see if we can replicate the issue where the second line with punctuation-only tags is ignored. ``` pylint test.py --notes="YES, ???" ``` ## Observation ``` ************ Module test test.py:1:1: W0511: YES: yes (fixme) (Open file: /pylint-dev__pylint/test.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 ## pylint-dev\_\_pylint-5859 The output confirms the issue: pylint only reports a fixme warning for the first line with the 'YES' tag and ignores the second line with the '???' tag. This matches the behavior described in the issue. The next step is to locate the part of the pylint code that handles the '-notes' option and determine why it's ignoring note tags that are entirely punctuation. We should start by finding the file that contains the relevant code. Since we don't know the exact file name, we can search the directory for files that might contain the string '-notes' to find where the option is handled. ``` search_dir "--notes" ``` ## Observation ## pylint-dev\_\_pylint-5859 ``` grep: unrecognized option '--notes' Usage: grep [OPTION]... PATTERNS [FILE]... Try 'grep --help' for more information. Usage: grep [OPTION]... PATTERNS [FILE]... Try 'grep --help' for more information. No matches found for "--notes" in /pylint-dev__pylint (Open file: /pylint-dev__pylint/test.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 It seems that the 'search\_dir' command did not work as expected because '-notes' was interpreted as an option for 'grep' rather than a search term. To avoid this issue, we can enclose the search term in quotes to ensure it's treated as a literal string. Let's try searching again using the correct format for the search term. ``` search_dir "--notes" ``` ## Observation ## pylint-dev\_\_pylint-5859 ``` grep: unrecognized option '--notes' Usage: grep [OPTION]... PATTERNS [FILE]... Try 'grep --help' for more information. Usage: grep [OPTION]... PATTERNS [FILE]... Try 'grep --help' for more information. No matches found for "--notes" in /pylint-dev__pylint (Open file: /pylint-dev__pylint/test.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 It appears that the issue persists because the search term is still being interpreted as an option. To circumvent this, we can use a different search term that is less likely to be confused with a command-line option. Let's try searching for just 'notes' instead, which should still lead us to the relevant part of the code. search\_dir "notes" ## Observation ## pylint-dev\_\_pylint-5859 ``` Found 24 matches for "notes" in /pylint-dev__pylint: /pylint-dev__pylint/ChangeLog (2 matches) /pylint-dev__pylint/doc/development_guide/testing.rst (1 matches) /pylint-dev__pylint/doc/whatsnew/2.5.rst (1 matches) /pylint-dev__pylint/examples/pylintrc (2 matches) /pylint-dev__pylint/pylint/checkers/misc.py (9 matches) /pylint-dev__pylint/pylintrc (2 matches) ``` ``` /pylint-dev__pylint/script/bump_changelog.py (1 matches) /pylint-dev__pylint/tests/checkers/unittest_misc.py (2 matches) /pylint-dev__pylint/tests/functional/f/fixme.py (1 matches) /pylint-dev__pylint/tests/functional/f/fixme.rc (2 matches) /pylint-dev__pylint/tests/functional/f/fixme.txt (1 matches) End of matches for "notes" in /pylint-dev__pylint (Open file: /pylint-dev__pylint/test.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 The search results indicate that there are several matches for 'notes' in the repository, but the most likely candidate for the code we need to examine is in 'pylint/checkers/misc.py', as it contains the most matches. This file is probably where the miscellaneous checkers, including the fixme warning (W0511), are implemented. Let's open the 'misc.py' file to investigate further. open pylint/checkers/misc.py ## Observation ## pylint-dev\_\_pylint-5859 ``` open pylint/checkers/misc.py ``` ``` """Check source code is ascii only or has an encoding -> declaration (PEP 263).""" 29: 30: import re 31: import tokenize 32: from typing import TYPE_CHECKING, List, Optional 33: 34: from astroid import nodes 35: 36: from pylint.checkers import BaseChecker 37:from pylint.interfaces import IRawChecker, ITokenChecker 38:from pylint.typing import ManagedMessage 39:from pylint.utils.pragma_parser import OPTION_PO, -> PragmaParserError, parse_pragma 40: 41:if TYPE_CHECKING: 42: from pylint.lint import PyLinter 43: 44: 45:class ByIdManagedMessagesChecker(BaseChecker): 46: 47: """Checks for messages that are enabled or disabled by id 48: instead of symbol.""" 49: __implements__ = IRawChecker 50: name = "miscellaneous" 51: msgs = { 52: "I0023": ( 53: "%s", 54: "use-symbolic-message-instead", 55: "Used when a message is enabled or disabled by id.", 56: ) 57: } 58: options = () 59: def _clear_by_id_managed_msgs(self) -> None: 60: self.linter._by_id_managed_msgs.clear() 61: def _get_by_id_managed_msgs(self) -> List[ManagedMessage]: 62: return self.linter._by_id_managed_msgs 65: def process_module(self, node: nodes.Module) -> None: 66: """Inspect the source file to find messages activated 67: or deactivated by id.""" 68: managed_msgs = self._get_by_id_managed_msgs() 69: for (mod_name, msgid, symbol, lineno, is_disabled) in 70: managed_msgs: 71: if mod_name == node.name: 72: verb = "disable" if is_disabled else "enable" 73: txt = f"'{msgid}' is cryptic: use '# pylint: 74: {verb}={symbol}' instead" 75: 76: self.add_message("use-symbolic-message-instead", line=lineno, 77: args=txt) 78: self._clear_by_id_managed_msgs() 79:class EncodingChecker(BaseChecker): 80: 81: """Checks for: 82: * warning notes in the code like FIXME, XXX 83: * encoding issues. 84: """ ``` ``` 83: __implements__ = (IRawChecker, ITokenChecker) 84: __implements__ = (IRawChecker, ITokenChecker) 85: # configuration section name 86: name = "miscellaneous" 88: msgs = { "W0511": ( "%s", "fixme", "Used when a warning note as FIXME or XXX is 91: -> detected.", 92: ) 93: } 94: options = ( 95: { 96: "notes", 97: { 98: "type": "csv", 99: "type": "csv", 100: (Open file: /pylint-dev__pylint/test.py) 101: (Current directory: /pylint-dev__pylint) 102: bash-$ 103: } ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 We have opened the 'misc.py' file which likely contains the code for the fixme warning (W0511) related to the '-notes' option. Since the file is 203 lines long and we are currently viewing the first 100 lines, we need to scroll down or use the 'goto' command to find the relevant section that handles the '-notes' option. Let's scroll down to see more of the file. scroll\_down ## Observation ## pylint-dev\_\_pylint-5859 ``` Observation pylint-dev__pylint-5859 [File: /pylint-dev__pylint/pylint/checkers/misc.py (203 lines <> total)] (98 more lines above) 99: { 100: "type": "csv", 101: "metavar": "<comma separated values>", 102: "default": ("FIXME", "XXX", "TODO"), 103: "help": ( 104: "List of note tags to take in 105: consideration, " 106: "separated by a comma." 107: ), 108: }, 109: ), 110: { 111: "notes-rgx", 112: { 113: "type": "string", 114: "metavar": "<regexp>", 115: "help": "Regular expression of note tags to 116> take in consideration.", 117: }, ``` ``` 116: ), 117: ) 118: 119: def open(self): 120: super().open() 121: 122: notes = "|".join(re.escape(note) for note in → self.config.notes) 123: if self.config.notes_rgx: 124: regex_string = 125: rf"\s*({notes}|{self.config.notes_rgx})\b" 126: else: 127: regex_string = rf"#\s*({notes})\b" 128: 129: self._fixme_pattern = re.compile(regex_string, re.I) 130: 131: def _check_encoding( 132: self, lineno: int, line: bytes, file_encoding: str 133: ) -> Optional[str]: 134: try: 135: return line.decode(file_encoding) 136: except UnicodeDecodeError: 137: pass 138: except LookupError: 139: if ( 140: line.startswith(b"#") 141: and "coding" in str(line) 142: and file_encoding in str(line) 143: ): 144: msg = f"Cannot decode using encoding" 145: '{file_encoding}', bad encoding" 146: self.add_message("syntax-error", line=lineno, 147: args=msg) 148: return None 149: 150: def process_module(self, node: nodes.Module) -> None: 151: """Inspect the source file to find encoding problem.""" 152: encoding = node.file_encoding if node.file_encoding 153: else "ascii" 154: 155: with node.stream() as stream: 156: for lineno, line in enumerate(stream): 157: self._check_encoding(lineno + 1, line, 158: 159: encoding) 160: 161: def process_tokens(self, tokens): 162: """Inspect the source to find fixme problems.""" 163: if not self.config.notes: 164: return 165: comments = ( 166: token_info for token_info in tokens if 167: token_info.type == tokenize.COMMENT 168: ) 169: for comment in comments: 170: comment_text = comment.string[1:].lstrip() # trim 171: 172: '#' and whitespace 173: 174: # handle pylint disable clauses 175: disable_option_match = 176: OPTION_PO.search(comment_text) 177: if disable_option_match: 178: try: 179: values = [] ``` ``` 170: try: 171: for pragma_repr in ( 172: p_rep 173: for p_rep in 174: parse_pragma(disable_option_match.group(2)) 175: if p_rep.action == "disable" 176: ): 177: values.extend(pragma_repr.messages) 178: except PragmaParserError: 179: # Printing useful information dealing 180: pass 181: if set(values) & set(self.config.notes): 182: continue 183: except ValueError: 184: self.add_message( 185: "bad-inline-option", 186: args=disable_option_match.group(1).strip(), 187: line=comment.start[0], 188: ) 189: continue 190: # emit warnings if necessary 191: match = self._fixme_pattern.search("#" + 192: comment_text.lower()) 193: if match: 194: self.add_message( 195: "fixme", 196: col_offset=comment.start[1] + 1, 197: args=comment_text, 198: line=comment.start[0], 199: ) 200: (5 more lines below) (Open file: /pylint-dev__pylint/pylint/checkers/misc.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 We have found the relevant section of the 'misc.py' file that handles the '-notes' option. The regular expression pattern used to match note tags is constructed in lines 122-128. It seems that the issue might be related to how the regular expression is constructed or how it is used to match comments in the source code. The issue might be that the regular expression pattern does not account for note tags that consist entirely of punctuation characters. To fix this, we may need to adjust the regular expression pattern to ensure it matches these tags correctly. The regular expression is built from the 'notes' configuration, which is a list of comma-separated values, and an optional 'notes-rgx' regular expression. The pattern is then compiled and used in the 'process\_tokens' method to search for matching comments. Before we make any changes, let's examine the regular expression pattern construction in more detail. We should look at the 'open' method where the pattern is compiled, specifically lines 122-128. goto 122 ## Observation ## pylint-dev\_\_pylint-5859 ``` """ Observation [File: /pylint-dev__pylint/pylint/checkers/misc.py (203 lines -> total)] (103 more lines above) 104: -> consideration, " 105: 106: 107: }, 108: ), 109: ( 110: "notes-rgx", 111: { 112: 113: "metavar": "<regexp>", 114: "help": "Regular expression of note tags to 115: take in consideration.", 116: }, 117: ) 118: 119: def open(self): 120: super().open() 121: 122: notes = "|".join(re.escape(note) for note in 123: self.config.notes) 124: if self.config.notes_rgx: 125: regex_string = 126: rf"#\s*({notes}|{self.config.notes_rgx})\b" 127: else: 128: regex_string = rf"#\s*({notes})\b" 129: 130: self._fixme_pattern = re.compile(regex_string, re.I) 131: 132: def _check_encoding( 133: self, lineno: int, line: bytes, file_encoding: str 134: ) -> Optional[str]: 135: try: 136: return line.decode(file_encoding) 137: except UnicodeDecodeError: 138: pass 139: except LookupError: 140: if ( 141: line.startswith(b"#") 142: and "coding" in str(line) 143: and file_encoding in str(line) 144): msg = f"Cannot decode using encoding 145: '{file_encoding}', bad encoding" 146: self.add_message("syntax-error", line=lineno, 147: args=msg) 148: return None 149: 150: def process_module(self, node: nodes.Module) -> None: 151: """Inspect the source file to find encoding problem.""" 152: encoding = node.file_encoding if node.file_encoding 153: else "ascii" 154: 155: with node.stream() as stream: 156: for lineno, line in enumerate(stream): 157: self._check_encoding(lineno + 1, line, ``` ``` 154: 155: def process_tokens(self, tokens): 156: """Inspect the source to find fixme problems.""" 157: if not self.config.notes: 158: return 159: comments = ( 160: token_info for token_info in tokens if 161: token_info.type == tokenize.COMMENT 162: ) 163: for comment in comments: 164: comment_text = comment.string[1:].lstrip() # trim 165: '#' and whitespaces 166: # handle pylint disable clauses 167: disable_option_match = 168: OPTION_PO.search(comment_text) 169: if disable_option_match: 170: try: 171: values = [] 172: try: 173: for pragma_repr in ( 174: p_rep 175: for p_rep in 176: parse_pragma(disable_option_match.group(2)) 177: if p_rep.action == "disable" 178: ): 179: values.extend(pragma_repr.messages) 180: except PragmaParserError: 181: # Printing useful information dealing 182: with this error is done in the lint package 183: pass 184: if set(values) & set(self.config.notes): 185: continue 186: except ValueError: 187: self.add_message( 188: "bad-inline-option", 189: line=comment.start[0], 190: ) 191: continue 192: ) 193: # emit warnings if necessary 194: match = self._fixme_pattern.search("#" + 195: comment_text.lower()) 196: if match: 197: self.add_message( 198: "fixme", 199: col_offset=comment.start[1] + 1, 200: args=comment_text, 201: line=comment.start[0], 202: ) 203: 204: def register(linter: "PyLinter") -> None: 205: linter.register_checker(EncodingChecker(linter)) 206: linter.register_checker(ByIdManagedMessagesChecker(linter)) (Open file: /pylint-dev__pylint/pylint/checkers/misc.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 We are now looking at the section of the 'misc.py' file where the regular expression for the '-notes' option is constructed. The regular expression is built from the 'notes' configuration option, which is a list of note tags. These tags are escaped using 're.escape' and joined with a pipe '|' to form an alternation pattern. This pattern is then used to create a regular expression that matches comments starting with one of the note tags followed by a word boundary '\b'. To fix this issue, we can modify the regular expression to ensure that it matches note tags that are entirely punctuation. We can do this by removing the word boundary and allowing the tag to be followed by any non-word character or the end of the string. The issue might be with the word boundary '\b', which does not match if the tag is followed by punctuation. Since the tag '???' is entirely punctuation, the word boundary may not be recognized correctly, causing the tag not to match. Let's edit the regular expression construction to fix the issue. ``` edit 124:126 regex_string = ~ rf"#\s*({notes}|{self.config.notes_rgx})(?=\W|$)" else: regex_string = rf"#\s*({notes})(?=\W|$)" end_of_edit ``` ## Observation ## pylint-dev\_\_pylint-5859 ``` ``` ``` 105: "separated by a comma." 106: ), 107: }, 108: ), 109: ( 110: "notes-rgx", 111: { 112: "type": "string", 113: "metavar": "<regexp>", 114: "help": "Regular expression of note tags to 115: take in consideration.", 116: }, 117: ), 118: ) 119: def open(self): 120: super().open() 121: 122: notes = "|".join(re.escape(note) for note in 123: self.config.notes) 124: if self.config.notes_rgx: 125: regex_string = 126: rf"#\s*({notes}|{self.config.notes_rgx})(?=\W|$)" 127: else: 128: regex_string = rf"#\s*({notes})(?=\W|$)" 129: 130: self._fixme_pattern = re.compile(regex_string, re.I) 131: 132: def _check_encoding( 133: self, lineno: int, line: bytes, file_encoding: str 134: ) -> Optional[str]: 135: try: 136: return line.decode(file_encoding) 137: except UnicodeDecodeError: 138: pass 139: except LookupError: 140: if ( 141: line.startswith(b"#") 142: and "coding" in str(line) 143: and file_encoding in str(line) 144: ): 145: msg = f"Cannot decode using encoding 146: '{file_encoding}', bad encoding" 147: self.add_message("syntax-error", line=lineno, 148: args=msg) 149: return None 150: 151: def process_module(self, node: nodes.Module) -> None: 152: """Inspect the source file to find encoding problem.""" 153: encoding = node.file_encoding if node.file_encoding 154: else "ascii" 155: 156: with node.stream() as stream: 157: for lineno, line in enumerate(stream): 158: self._check_encoding(lineno + 1, line, 159: encoding) 160: 161: encoding) 162: 163: def process_tokens(self, tokens): 164: """Inspect the source to find fixme problems.""" 165: if not self.config.notes: 166: return 167: comments = ( ``` ``` 160: token_info for token_info in tokens if ↪ → token_info.type == tokenize.COMMENT 161: ) 162: for comment in comments: 163: comment_text = comment.string[1:].lstrip() # trim ↪ → '#' and whitespaces 164: 165: # handle pylint disable clauses 166: disable_option_match = ↪ → OPTION_PO.search(comment_text) 167: if disable_option_match: 168: try: 169: values = [] 170: try: 171: for pragma_repr in ( 172: p_rep 173: for p_rep in 174: parse_pragma(disable_option_match.group(2)) 175: ) 176: File updated. Please review the changes and make sure they are ↪ → correct (correct indentation, no duplicate lines, etc). Edit ↪ → the file again if necessary. (Open file: /pylint-dev__pylint/pylint/checkers/misc.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent pylint-dev\_\_pylint-5859 The edit to the regular expression has been made successfully. We replaced the word boundary '\b' with a lookahead assertion '(?=\W|$)' that matches either a non-word character or the end of the string after the note tag. This should allow the regular expression to match note tags that consist entirely of punctuation. Let's run pylint on 'test.py' with the updated code. Before submitting the changes, we should test to ensure that the fix works as expected. We can do this by running pylint on the 'test.py' file again with the '-notes' option to see if it now correctly identifies the '???' note tag. ``` pylint test.py --notes="YES, ???" ``` ## Observation ## pylint-dev\_\_pylint-5859 ``` ************ Module test test.py:1:1: W0511: YES: yes (fixme) test.py:2:1: W0511: ????: no (fixme) (Open file: /pylint-dev__pylint/pylint/checkers/misc.py) (Current directory: /pylint-dev__pylint) bash ``` ## SWE-agent pylint-dev\_\_pylint-5859 The pylint output now shows that it correctly identifies both the 'YES' and '???' note tags, which means the fix has resolved the issue. With the bug fixed, we can proceed to clean up any test files and then submit our changes to the code base. First, let's remove the 'test.py' file since it was only needed for reproducing the issue. rm test.py ## Observation ## pylint-dev\_\_pylint-5859 ``` (Open file: /pylint-dev__pylint/pylint/checkers/misc.py) (Current directory: /pylint-dev__pylint) bash-$ ``` ## SWE-agent ## pylint-dev\_\_pylint-5859 The 'test.py' file has been successfully removed. Now that we have fixed the issue and cleaned up the test files, we can submit our changes to the code base. Let's submit the changes using the 'submit' command. submit ## Observation ## pylint-dev\_\_pylint-5859 ``` diff --git a/pylint/checkers/misc.py b/pylint/checkers/misc.py index 69149e61a..28978365f 100644 --- a/pylint/checkers/misc.py +++ b/pylint/checkers/misc.py @@ -121,9 +121,9 @@ class EncodingChecker(BaseChecker): notes = "|".join(re.escape(note) for note in -> self.config.notes) if self.config.notes_rgx: - regex_string = - rf"#\s*({notes}|{self.config.notes_rgx})\b" + regex_string = - rf"#\s*({notes}|{self.config.notes_rgx})(?=\W|$)" else: - regex_string = rf"#\s*({notes})\b" + regex_string = rf"#\s*({notes})(?=\W|$)" self._fixme_pattern = re.compile(regex_string, re.I) ``` ## Gold Patch ## pylint-dev\_\_pylint-5859 ``` Gold Patch pylint-dev__pylint-5859 diff --git a/pylint/checkers/misc.py b/pylint/checkers/misc.py --- a/pylint/checkers/misc.py +++ b/pylint/checkers/misc.py @@ -121,9 +121,9 @@ def open(self): notes = "|".join(re.escape(note) for note in -> self.config.notes) if self.config.notes_rgx: - regex_string = - > rf"#\s*({notes}|{self.config.notes_rgx})\b" + regex_string = - > rf"#\s*({notes}|{self.config.notes_rgx})(?=(:|\s|\Z))" else: - regex_string = rf"#\s*({notes})\b" ``` ``` + regex_string = rf"#\s*({notes})(?=(:\s|\Z))" self._fixme_pattern = re.compile(regex_string, re.I) ``` ## Analysis for sympy/sympy-21614 Issue Summary. The issue is reporting an UnedinedKind type returned when printing a sympy.Derivative 's kind attribute. They expect something like a DerivativeKind instead of UndefinedKind . ## Interaction Summary 1. Reproduce the Issue : The agent began by creating a Python script ( reproduce\_issue.py ) to reproduce the issue, using the reproduction code provided in the issue report. It ran the script and successfully reproduced the issue. 2. Searching for Derivative Code : The agent begins to search for code where the Derivative class is defined. It makes several attempts with different strategies to find the relevant code. After a few unsuccessful approaches, looking for particular filenames, it searches simply for 'class Derivative', which provides a single search result. It continues to read the Derivative class definition, which spans many hundreds of lines. 3. Exhausting Budget : It continuously scrolls through the function.py file where Derivative is defined until is exhausts its budget and terminates the episode without submitting. Evaluation The agent makes a few mistakes using the search function effectively. The agent should have made more efficient searches by using more commonsensical patterns (e.g. searching for 'class Derivative') early on, instead of more speculative search attempts. In this particular setting, the gold patch shows that the updated code involves overriding an inherited property kind from the sympy.basic object. This fix would require a more thorough understanding of the codebase and inheritence structure of Derivative to be effective. An agent attempting to resolve this issue would need to have more targeted and efficient search approaches to improve its ability to identify this problem. Figure 35: Trajectory for instance sympy\_\_sympy-21614 . <details> <summary>Figure 30 Details</summary> ![692de06420026ed5be9b32dbbc7ab51ed3019d9c495ab27b46c4d6657c6336f4](http://localhost:8000/v1/image/692de06420026ed5be9b32dbbc7ab51ed3019d9c495ab27b46c4d6657c6336f4) ### Visual Description # Technical Document Extraction: Issue sympy__sympy-21614 ## 1. Document Header * **Category:** Issue * **Identifier:** sympy__sympy-21614 ## 2. Issue Description The document describes a bug report regarding the `kind` attribute in the SymPy library, specifically comparing the behavior of `Integral` versus `Derivative` when applied to a `MatrixSymbol`. ### Text Transcription: "We're currently solving the following issue within our repository. Here's the issue text: ISSUE: Wrong Derivative kind attribute I'm playing around with the 'kind' attribute. The following is correct:" ### Code Block 1: Correct Behavior (Integral) ```python from sympy import Integral, Derivative from sympy import MatrixSymbol from sympy.abc import x A = MatrixSymbol('A', 2, 2) i = Integral(A, x) i.kind # MatrixKind(NumberKind) ``` * **Observation:** When an `Integral` is performed on a 2x2 `MatrixSymbol`, the `.kind` attribute correctly returns `MatrixKind(NumberKind)`. ### Text Transcription (Continued): "This one is wrong:" ### Code Block 2: Incorrect Behavior (Derivative) ```python d = Derivative(A, x) d.kind # UndefinedKind ``` * **Observation:** When a `Derivative` is performed on the same `MatrixSymbol`, the `.kind` attribute returns `UndefinedKind`, which is identified as the bug. ## 3. Environment and Instructions * **INSTRUCTIONS:** (31 lines) * **Open file:** n/a * **Current directory:** `/sympy__sympy` * **Terminal Prompt:** `bash-$` ## 4. Summary of Technical Data | Component | Input Object | Operation | Expected/Observed Kind | Status | | :--- | :--- | :--- | :--- | :--- | | Integral | MatrixSymbol('A', 2, 2) | `Integral(A, x)` | `MatrixKind(NumberKind)` | Correct | | Derivative | MatrixSymbol('A', 2, 2) | `Derivative(A, x)` | `UndefinedKind` | **Incorrect** | **Technical Conclusion:** The `Derivative` class in SymPy fails to properly propagate or assign the `kind` attribute when operating on matrix types, whereas the `Integral` class handles it correctly. </details> ## SWE-agent ## sympy\_\_sympy-21614 First, I'll create a new Python script to reproduce the issue as described. This script will attempt to create a 'Derivative' of a 'MatrixSymbol' and print out the 'kind' attribute to verify if it is indeed 'UndefinedKind'. ``` create reproduce_issue.py ``` ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/reproduce_issue.py (1 lines total)] 1: (Open file: n/a) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 Now let's paste in the example code from the issue that demonstrates the incorrect behavior for the 'kind' attribute of a 'Derivative' of a 'MatrixSymbol'. ``` ------------------- edit 1:1 from sympy import Integral, Derivative from sympy import MatrixSymbol from sympy.abc import x A = MatrixSymbol('A', 2, 2) d = Derivative(A, x) print(d.kind) # Expected: MatrixKind(NumberKind), but getting UndefinedKind end_of_edit ``` ## Observation ## sympy\_\_sympy-21614 ``` Observation sympy__sympy/reproduce_issue.py (7 lines total)] [File: /sympy__sympy/reproduce_issue.py (7 lines total)] 1:from sympy import Integral, Derivative 2:from sympy import MatrixSymbol 3:from sympy.abc import x 4:A = MatrixSymbol('A', 2, 2) 5:d = Derivative(A, x) 6:print(d.kind) 7:# Expected: MatrixKind(NumberKind), but getting UndefinedKind File updated. Please review the changes and make sure they are <> correct (correct indentation, no duplicate lines, etc). Edit <> the file again if necessary. (Open file: /sympy__sympy/reproduce_issue.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 The reproduction script has been created and updated with the code that demonstrates the issue. Now, let's run the script to confirm that the 'kind' attribute of the 'Derivative' is indeed 'UndefinedKind'. python reproduce\_issue.py ## Observation UndefinedKind ``` (Open file: /sympy__sympy/reproduce_issue.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 ## sympy\_\_sympy-21614 The output confirms that the 'kind' attribute of the 'Derivative' is 'UndefinedKind', which is not the expected behavior. The next step is to locate the part of the code that handles the 'kind' attribute for the 'Derivative' class and make the necessary corrections. Since the 'Derivative' class is likely to be defined in a file related to calculus or differentiation, I will search for the file that contains the 'Derivative' class definition. find\_file "derivative.py" ## Observation sympy\_\_sympy-21614 /sympy\_\_sympy sympy\_\_sympy-21614 ``` No matches found for "derivative.py" in /sympy__sympy (Open file: /sympy__sympy/reproduce_issue.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent It seems that the file name I searched for does not match the actual file name where the 'Derivative' class is defined. The 'Derivative' class might be defined in a file with a different name, possibly something more general related to symbolic calculus in SymPy. Let's try a broader search for files that might contain the 'Derivative' class definition. Since the class name is 'Derivative', it's likely that the file name contains 'calculus' or a similar keyword. I'll use the 'search\_dir' command to look for the term 'Derivative' within the Python files in the current directory. ``` search_dir "Derivative" . ``` ## Observation ``` More than 182 files matched for "Derivative" in /sympy__ > Please narrow your search. (Open file: /sympy__sympy/reproduce_issue.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 ``` in /sympy__sympy. ``` ## sympy\_\_sympy-21614 The search for 'Derivative' returned too many results, so I need to narrow it down. Since the 'Derivative' class is a core part of SymPy, it's likely to be in a core module. A common convention is to have such core classes in a file named after the module, so I will search for a file that might be named 'core.py' or similar. ``` find_file "core.py" ``` ## Observation ``` Found 5 matches for "core.py" in /sympy__sympy: /sympy__sympy/sympy/core/core.py /sympy__sympy/sympy/multipledispatch/core.py /sympy__sympy/sympy/unify/core.py /sympy__sympy/sympy/strategies/branch/core.py /sympy__sympy/sympy/strategies/core.py (Open file: /sympy__sympy/reproduce_issue.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 sympy\_\_sympy-21614 The search has returned several 'core.py' files, but the most likely candidate for containing the 'Derivative' class definition is within the main 'sympy/core' directory. Let's open the 'core.py' file in the 'sympy/core' directory to investigate further. open sympy/core/core.py ## Observation ## sympy\_\_sympy-21614 ``` Observation sympy-21614 [File: /sympy__sympy/sympy/core/core.py (104 lines total)] 1:""" The core's core. """ 2: 3:# used for canonical ordering of symbolic sequences 4:# via __cmp__ method: 5:# FIXME this is *so* irrelevant and outdated! 6:ordering_of_classes = [ 7: # singleton numbers 8: 'Zero', 'One', 'Half', 'Infinity', 'NaN', 'NegativeOne', 9: -> 'NegativeInfinity', 10: # numbers 11: 'Integer', 'Rational', 'Float', 12: # singleton symbols 13: 'Exp1', 'Pi', 'ImaginaryUnit', 14: # symbols 15: 'Symbol', 'Wild', 'Temporary', 16: # arithmetic operations 17: 'Pow', 'Mul', 'Add', 18: # function values 19: 'Derivative', 'Integral', 20: # defined singleton functions 21: 'Abs', 'Sign', 'Sqrt', 22: 'Floor', 'Ceiling', 23: 'Re', 'Im', 'Arg', 24: 'Conjugate', 25: 'Exp', 'Log', 26: 'Sin', 'Cos', 'Tan', 'Cot', 'ASin', 'ACos', 'ATan', 'ACot', 27: 'Sinh', 'Cosh', 'Tanh', 'Coth', 'ASinh', 'ACosh', 'ATanh', 28: -> 'ACoth', ``` ``` class RisingFactorial, 'FallingFactorial', 'factorial', 'binomial', 'Gamma', 'LowerGamma', 'UpperGamma', 'PolyGamma', 'Erf', # special polynomials 'Chebyshev', 'Chebyshev2', # undefined functions 'Function', 'WildFunction', # anonymous functions 'Lambda', # Landau O symbol 'Order', # relational operations 'Equality', 'Unequality', 'StrictGreaterThan', 'StrictLessThan', 'GreaterThan', 'LessThan', 42:] 43: 44: 45:class Registry: 46: """ 47: Base class for registry objects. 48: 49: Registries map a name to an object using attribute 50: notation. Registry 51: classes behave singletonically: all their instances share 52: the same state, 53: which is stored in the class object. 54: 55: All subclasses should set `__slots__ = ()`. 56: """ 57: __slots__ = () 58: 59: def __setattr__(self, name, obj): 60: setattr(self.__class__, name, obj) 61: 62: def __delattr__(self, name): 63: delattr(self.__class__, name) 64: 65:#A set containing all sympy class objects 66:all_classes = set() 67: 68: 69: class BasicMeta(type): 70: 71: def __init__(cls, *args, **kws): 72: all_classes.add(cls) 73: cls.__sympy__ = property(lambda self: True) 74: 75: def __cmp__(cls, other): 76: # If the other object is not a Basic subclass, then we 77: are not equal to 78: # it. 79: if not isinstance(other, BasicMeta): 80: return -1 81: n1 = cls.__name__ 82: n2 = other.__name__ 83: if n1 == n2: 84: return 0 85: 86: UNKNOWN = len(ordering_of_classes) + 1 87: try: 88: i1 = ordering_of_classes.index(n1) ``` ``` 86: except ValueError: 87: i1 = UNKNOWN 88: try: 89: i2 = ordering_of_classes.index(n2) 90: except ValueError: 91: i2 = UNKNOWN 92: if i1 == UNKNOWN and i2 == UNKNOWN: 93: return (n1 > n2) - (n1 < n2) 94: return (i1 > i2) - (i1 < i2) 95: 96: def __lt__(cls, other): 97: if cls.__cmp__(other) == -1: 98: return True 99: return False 100: (Open file: /sympy__sympy/reproduce_issue.py) (Current directory: /sympy__) ``` ## SWE-agent sympy\_\_sympy-21614 The 'core.py' file in the 'sympy/core' directory does not seem to contain the implementation details of the 'Derivative' class. It's more likely that the implementation is in another file within the 'sympy/core' directory. Since the 'Derivative' class is a fundamental part of symbolic calculus, it might be in a file named something like 'function.py' or 'operations.py'. Let's try to find a file that specifically contains the implementation of the 'Derivative' class within the 'sympy/core' directory. find\_file "function.py" sympy/core ## Observation ## sympy\_\_sympy-21614 ``` Found 1 matches for "function.py" in /sympy__sympy/sympy/core: /sympy__sympy/sympy/core/function.py (Open file: /sympy__sympy/sympy/core/core.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 Great, we have found the 'function.py' file in the 'sympy/core' directory, which is likely to contain the implementation of the 'Derivative' class. Let's open this file to investigate and make the necessary changes. open sympy/core/function.py ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] 1:"""" 2:There are three types of functions implemented in SymPy: 3: ``` ``` 4: 1) defined functions (in the sense that they can be 5: evaluated) like 6: exp or sin; they have a name and a body: 7: f = exp 8: 2) undefined function which have a name but no body. 9: Undefined 10: functions can be defined using a Function class as 11: follows: 12: f = Function('f') 13: (the result will be a Function instance) 14: 3) anonymous function (or lambda function) which have a 15: body (defined 16: with dummy variables) but have no name: 17: f = Lambda(x, exp(x)*x) 18: f = Lambda((x, y), exp(x)*y) 19: The fourth type of functions are composites, like (sin + 20: cos)(x); these work in 21: SymPy core, but are not yet part of SymPy. 17: 18: Examples 19: ======= 20: 21: >>> import sympy 22: >>> f = sympy.Function("f") 23: >>> from sympy.abc import x 24: >>> f(x) 25: f(x) 26: >>> print(sympy.srepr(f(x).func)) 27: Function('f') 28: >>> f(x).args 29: (x,) 30: 31: """ 32: 33:from typing import Any, Dict as tDict, Optional, Set as tSet, 34: Tuple as tTuple, Union 35: 36:from .add import Add 37:from .assumptions import ManagedProperties 38:from .basic import Basic, _atomic 39:from .cache import cacheit 40:from .compatibility import iterable, is_sequence, as_int, 41: ordered, Iterable 42:from .decorators import _sympifyit 43:from .expr import Expr, AtomicExpr 44:from .numbers import Rational, Float 45:from .operations import LatticeOp 46:from .rules import Transform 47:from .singleton import S 48:from .sympify import sympify 49: 50:from sympy.core.containers import Tuple, Dict 51:from sympy.core.parameters import global_parameters 52:from sympy.core.logic import fuzzy_and, fuzzy_or, fuzzy_not, 53: FuzzyBool 54:from sympy.utilities import default_sort_key 55:from sympy.utilities.exceptions import SymPyDeprecationWarning 56:from sympy.utilities.iterables import has_dups, sift 57:from sympy.utilities.misc import filldedent 58: 59:import mpmath 60:import mpmath.libmp as mlib 61: ``` ``` 59:import inspect 60:from collections import Counter 61: 62:def _coeff_isneg(a): 63: """Return True if the leading Number is negative. 64: 65: Examples 66: ======== 67: 68: >>> from sympy.core.function import _coeff_isneg 69: >>> from sympy import S, Symbol, oo, pi 70: >>> _coeff_isneg(-3*pi) 71: True 72: >>> _coeff_isneg(S(3)) 73: False 74: >>> _coeff_isneg(-oo) 75: True 76: >>> _coeff_isneg(Symbol('n', negative=True)) # coeff is 1 77: False 78: 79: For matrix expressions: 80: 81: >>> from sympy import MatrixSymbol, sqrt 82: >>> A = MatrixSymbol("A", 3, 3) 83: >>> _coeff_isneg(-sqrt(2)*A) 84: True 85: >>> _coeff_isneg(sqrt(2)*A) 86: False 87: """ 88: 89: if a.is_MatMul: 90: a = a.args[0] 91: if a.is_Mul: 92: a = a.args[0] 93: return a.is_Number and a.is_extended_negative 94: 95: 96:class PoleError(Exception): 97: pass 98: 99: 100:class ArgumentIndexError(ValueError): 101 (3289 more lines below) (Open file: /sympy/__sympy/sympy/core/core.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 The 'function.py' file is now open, and it's quite large with 3389 lines. The 'Derivative' class is likely to be defined in this file, but we need to find the exact location. Since the file is too long to scroll through manually, I'll use the 'search\_file' command to find the 'Derivative' class definition within this file. search\_file "class Derivative" ``` Observation Found 1 matches for "class Derivative" in → /sympy__sympy/sympy/core/function.py: Line 1053:class Derivative(Expr): End of matches for "class Derivative" in → /sympy__sympy/sympy/core/function.py (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 We have found that the 'Derivative' class is defined on line 1053 of 'function.py'. Let's navigate to that line to inspect the implementation and determine how to correct the 'kind' attribute issue. goto 1053 ## Observation ## sympy\_\_sympy-21614 ``` goto 1053 ``` ``` 1072: Contiguously identical symbols are merged into a tuple -> giving 1073: the symbol and the count: 1074: 1075: >>> Derivative(f(x), x, x, y, x) 1076: Derivative(f(x), (x, 2), y, x) 1077: 1078: If the derivative cannot be performed, and evaluate is -> True, the 1079: order of the variables of differentiation will be made -> canonical: 1080: 1081: >>> Derivative(f(x, y), y, x, evaluate=True) 1082: Derivative(f(x, y), x, y) 1083: 1084: Derivatives with respect to undefined functions can be -> calculated: 1085: 1086: >>> Derivative(f(x)**2, f(x), evaluate=True) 1087: 2*f(x) 1088: 1089: Such derivatives will show up when the chain rule is used -> to 1090: evaluate a derivative: 1091: 1092: >>> f(g(x)).diff(x) 1093: Derivative(f(g(x)), g(x))*Derivative(g(x), x) 1094: 1095: Substitution is used to represent derivatives of -> functions with 1096: arguments that are not symbols or functions: 1097: 1098: >>> f(2*x + 3).diff(x) == 2*Subs(f(y).diff(y), y, 2*x -> + 3) 1099: True 1100: 1101: Notes 1102: ==== 1103: 1104: Simplification of high-order derivatives: 1105: 1106: Because there can be a significant amount of -> simplification that can be 1107: done when multiple differentiations are performed, -> results will be 1108: automatically simplified in a fairly conservative fashion -> unless the 1109: keyword ``simplify`` is set to False. 1110: 1111: >>> from sympy import sqrt, diff, Function, symbols 1112: >>> from sympy.abc import x, y, z 1113: >>> f, g = symbols('f,g', cls=Function) 1114: 1115: >>> e = sqrt((x + 1)**2 + x) 1116: >>> diff(e, (x, 5), simplify=False).count_ops() 1117: 136 1118: >>> diff(e, (x, 5)).count_ops() 1119: 30 1120: 1121: Ordering of variables: 1122: 1123: If evaluate is set to True and the expression cannot be ``` ``` 1124: list of differentiation symbols will be sorted, that is, <> the expression is 1125: assumed to have continuous derivatives up to the order <> asked. 1126: 1127: Derivative wrt non-Symbols: 1128: 1129: For the most part, one may not differentiate wrt <> non-symbols. 1130: For example, we do not allow differentiation wrt `x*y` <> because 1131: there are multiple ways of structurally defining where <> x*y appears 1132: in an expression: a very strict definition would make 1133: (x*y*z).diff(x*y) == 0. Derivatives wrt defined functions <> (like 1134: cos(x)) are not allowed, either: 1135: 1136: >>> (x*y*z).diff(x*y) 1137: Traceback (most recent call last): (2252 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent sympy\_\_sympy-21614 We are now at the beginning of the 'Derivative' class definition. To address the issue with the 'kind' attribute, we need to find where the 'kind' property is defined or handled within this class. Since the file is large, I will scroll down to find the relevant section where the 'kind' property is likely to be implemented. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` Observation sympy__sympy-21614 [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1135 more lines above) 1136: >>> (x*y*z).diff(x*y) 1137: Traceback (most recent call last): ... 1138: ... 1139: ValueError: Can't calculate derivative wrt x*y. 1140: 1141: To make it easier to work with variational calculus, 1142: however, 1143: derivatives wrt AppliedUndef and Derivatives are allowed. 1144: For example, in the Euler-Lagrange method one may write 1145: F(t, u, v) where u = f(t) and v = f'(t). These variables 1146: written explicitly as functions of time:: 1147: >>> from sympy.abc import t 1148: >>> F = Function('F') 1149: >>> U = f(t) 1150: >>> V = U.diff(t) 1151: 1152: The derivative wrt f(t) can be obtained directly: ``` ``` 1154: >>> direct = F(t, U, V).diff(U) 1155: 1156: When differentiation wrt a non-Symbol is attempted, the 1157: non-Symbol 1158: is temporarily converted to a Symbol while the 1159: differentiation 1160: is performed and the same answer is obtained: 1161: >>> indirect = F(t, U, V).subs(U, x).diff(x).subs(x, 1162: 1163: The implication of this non-symbol replacement is that all 1164: functions are treated as independent of other functions 1165: symbols are independent of the functions that contain 1166: 1167: >>> x.diff(f(x)) 1168: 0 1169: >>> g(x).diff(f(x)) 1170: 0 1171: 1172: It also means that derivatives are assumed to depend only 1173: on the variables of differentiation, not on anything 1174: contained 1175: within the expression being differentiated:: 1176: >>> F = f(x) 1177: >>> Fx = F.diff(x) 1178: >>> Fx.diff(F) # derivative depends on x, not F 1179: 0 1180: >>> Fxx = Fx.diff(x) 1181: >>> Fxx.diff(Fx) # derivative depends on x, not Fx 1182: 0 1183: 1184: The last example can be made explicit by showing the 1185: of Fx in Fxx with y: 1186: 1187: >>> Fxx.subs(Fx, y) 1188: Derivative(y, x) 1189: 1190: Since that in itself will evaluate to zero, 1191: differentiating 1192: 1193: >>> _.doit() 1194: 1195: 1196: Replacing undefined functions with concrete expressions 1197: 1198: One must be careful to replace undefined functions with 1200: expressions 11199: that contain variables consistent with the function 11201: definition and 12002: the variables of differentiation or else insconsistent 11201: result will 12021: be obtained. Consider the following example: 12022: 12023: >>> eq = f(x)*g(y) 12024: >>> eq.subs(f(x), x*y).diff(x, y).doit() 12025: y*Derivative(g(y), y) + g(y) ``` ``` ``` ## SWE-agent sympy\_\_sympy-21614 We have scrolled down through the 'function.py' file, but we have not yet found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1233 more lines above) [File contents omitted for brevity] (2056 more lines below) ``` ``` (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1331 more lines above) [File contents omitted for brevity] (1958 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1429 more lines above) [File contents omitted for brevity] (1860 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1527 more lines above) [File contents omitted for brevity] (1762 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1625 more lines above) [File contents omitted for brevity] (1664 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1723 more lines above) [File contents omitted for brevity] (1566 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1821 more lines above) [File contents omitted for brevity] (1468 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (1919 more lines above) [File contents omitted for brevity] (1370 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2017 more lines above) [File contents omitted for brevity] (1272 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2115 more lines above) [File contents omitted for brevity] (1174 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2213 more lines above) [File contents omitted for brevity] (1076 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2311 more lines above) [File contents omitted for brevity] (978 more lines below) ``` ``` (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, but we still haven't found the implementation of the 'kind' property for the 'Derivative' class. We need to continue scrolling to find the relevant section where the 'kind' property is likely to be implemented or overridden. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2409 more lines above) [File contents omitted for brevity] (880 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, and we have now found the 'diff' function, which is related to differentiation. However, we still need to find the implementation of the 'kind' property for the 'Derivative' class. Since we are in the section related to differentiation, we should be close to the relevant code. Let's continue scrolling to find the 'Derivative' class implementation. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2507 more lines above) [File contents omitted for brevity] (782 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, and we have now found the 'diff' function, which is related to differentiation. However, we still need to find the implementation of the 'kind' property for the 'Derivative' class. Since we are in the section related to differentiation, we should be close to the relevant code. Let's continue scrolling to find the 'Derivative' class implementation. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2605 more lines above) [File contents omitted for brevity] (684 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, and we have now found the 'diff' function, which is related to differentiation. However, we still need to find the implementation of the 'kind' property for the 'Derivative' class. Since we are in the section related to differentiation, we should be close to the relevant code. Let's continue scrolling to find the 'Derivative' class implementation. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2703 more lines above) [File contents omitted for brevity] (586 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 We have scrolled further down the 'function.py' file, and we have now found the 'expand' function, which is related to expression expansion. However, we still need to find the implementation of the 'kind' property for the 'Derivative' class. Since we are in the section related to expression manipulation, we should be close to the relevant code. Let's continue scrolling to find the 'Derivative' class implementation. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2801 more lines above) [File contents omitted for brevity] (488 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 Wehave scrolled further down the 'function.py' file, and we have now found various expansion functions and wrappers, but we still need to find the implementation of the 'kind' property for the 'Derivative' class. Since we are in the section related to expression manipulation, we should be close to the relevant code. Let's continue scrolling to find the 'Derivative' class implementation. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2899 more lines above) [File contents omitted for brevity] (390 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 Wehave scrolled further down the 'function.py' file, and we have now found various expansion functions and wrappers, but we still need to find the implementation of the 'kind' property for the 'Derivative' class. Since we are in the section related to expression manipulation, we should be close to the relevant code. Let's continue scrolling to find the 'Derivative' class implementation. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (2997 more lines above) [File contents omitted for brevity] (292 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent ## sympy\_\_sympy-21614 Wehave scrolled further down the 'function.py' file, and we have now found various expansion functions and wrappers, but we still need to find the implementation of the 'kind' property for the 'Derivative' class. Since we are in the section related to expression manipulation, we should be close to the relevant code. Let's continue scrolling to find the 'Derivative' class implementation. scroll\_down ## Observation ## sympy\_\_sympy-21614 ``` [File: /sympy__sympy/sympy/core/function.py (3389 lines total)] (3095 more lines above) [File contents omitted for brevity] (194 more lines below) (Open file: /sympy__sympy/sympy/core/function.py) (Current directory: /sympy__sympy) bash-$ ``` ## SWE-agent Exit due to cost limit exit\_cost ## Observation Exited ## Gold Patch sympy\_\_sympy-21614 sympy\_\_sympy-21614 ## sympy\_\_sympy-21614 ``` Gold Patch ``` ## Analysis for django/django-14411 Issue Summary. The issue involves the ReadOnlyPasswordHashWidget in the Django admin interface. The label element for this widget has a 'for' attribute that points to a non-labelable element. Since the ReadOnlyPasswordHashWidget renders text rather than an input element, the label should not have a 'for' attribute. ## Interaction Summary 1. Locate the Relevant File : The agent began by searching for the ReadOnlyPasswordHashWidget definition using the find\_file command, but found no results. It then used the search\_dir command to find references to the widget within the repository and located relevant matches in several files. The agent chose to first examine the helpers.py file, but found it only contained a reference to the widget and did not include the label rendering code. 2. Explore Related Files : Next, the agent opened the forms.py file in the django/contrib/auth directory, which defined the ReadOnlyPasswordHashWidget and its associated ReadOnlyPasswordHashField . It identified that the widget used a template ( read\_only\_password\_hash.html ) for rendering, but the template did not include relevant information relating to the Widget's for attribute. 3. Edit the Class Definition : Returning to forms.py , the agent found the UserChangeForm which used the ReadOnlyPasswordHashField . It reasoned that Django's form rendering system likely added the 'for' attribute automatically. To address the issue, the agent attempted to write a label\_tag method for the ReadOnlyPasswordHashField to exclude the 'for' attribute. The agent made several edits to this effect, importing the necessary utilities from django.utils.html and adding a custom label\_tag method. 4. Submit Without Testing : After the desired changes were made, the agent proceeds to submit without testing as it assumes that a Django server cannot be run and tested in this environment. Evaluation The agent correctly identified the need to customize the label rendering for the ReadOnlyPasswordHashField class. However, its solution failed to solve the issue since it did not understand how labels were defined for a Widget object. In actuality, the model needed to override the id\_for\_label property defined in the Widget class to change this behavior properly. Similar functionality can be seen in other parts of the codebase. Figure 36: Trajectory for instance django\_\_django-14411 . ## Issue ## django\_\_django-14411 We're currently solving the following issue within our repository. Here's the issue text: ISSUE: Description Label for ReadOnlyPasswordHashWidget points to non-labelable element. (last modified by David Sanders) INSTRUCTIONS: (31 lines) In the admin, the label element for the ReadOnlyPasswordHashWidget widget has a 'for' attribute which points to a non-labelable element, since the widget just renders text, not an input. There's no labelable element for the widget, so the label shouldn't have a 'for' attribute. ``` (Open file: n/a) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 To address the issue, we need to locate the file that contains the 'ReadOnlyPasswordHashWidget' widget's rendering code. Since this is likely part of the Django admin, we should start by searching for the file that defines 'ReadOnlyPasswordHashWidget'. Let's use the 'find\_file' command to locate this file within the repository. find\_file "ReadOnlyPasswordHashWidget" ## Observation ## django\_\_django-14411 ``` No matches found for "ReadOnlyPasswordHashWidget" in -> /django__django (Open file: n/a) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 It seems that the 'find\_file' command did not yield any results for 'ReadOnlyPasswordHashWidget'. This could be because the file name does not exactly match the widget name. Since the widget is likely to be part of the Django admin, it might be located in a file related to forms or widgets in the admin app. Let's try a broader search using the 'search\_dir' command to look for the term 'ReadOnlyPasswordHashWidget' within the files of the repository. This should help us find the file where the widget is defined or used. search\_dir "ReadOnlyPasswordHashWidget" ## Observation ## django\_\_django-14411 ``` Found 9 matches for "ReadOnlyPasswordHashWidget" in <> /django__django: /django__django/django/contrib/admin/helpers.py (1 matches) /django__django/django/contrib/auth/forms.py (2 matches) /django__django/docs/releases/1.11.1.txt (1 matches) /django__django/tests/auth_tests/test_forms.py (4 matches) /django__django/tests/auth_tests/test_views.py (1 matches) End of matches for "ReadOnlyPasswordHashWidget" in /django__django: (Open file: n/a) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 The search results indicate that there are matches for 'ReadOnlyPasswordHashWidget' in several files, but the most likely places where the rendering code would be defined are in 'helpers.py' and 'forms.py' within the 'django/contrib/admin' and 'django/contrib/auth' directories, respectively. Since the issue is related to the admin interface, let's first check the 'helpers.py' file in the 'django/contrib/admin' directory. We'll open this file and look for the relevant code. open django/contrib/admin/helpers.py ## Observation ## django\_\_django-14411 ``` Observation [File: /django__django/django/contrib/admin/helpers.py (424 lines ↔ total)] 1:import json 2: 3:from django import forms 4:from django.contrib.admin.utils import ( 5: display_for_field, flatten_fieldsets, help_text_for_field, 6: label_for_field, 7: lookup_field, quote, 8:) 9:from django.core.exceptions import ObjectDoesNotExist 10:from django.db.models.fields.related import ( 11: ForeignObjectRel, ManyToManyRel, OneToOneField, 12:) 13:from django.forms.utils import flatatt 14:from django.template.defaultfilters import capfirst, 15: linebreaksbr 16: from django.urls import NoReverseMatch, reverse 17: from django.utils.html import conditional_escape, format_html ``` ``` 16:from django.utils.safestring import mark_safe 17:from django.utils.translation import gettext, gettext_lazy as _ 18: 19:ACTION_CHECKBOX_NAME = '_selected_action' 20: 21: 22:class ActionForm(forms.Form): 23: action = forms.ChoiceField(label=_('Action:')) 24: select_across = forms.BooleanField( 25: label='', 26: required=False, 27: initial=0, 28: widget=forms.HiddenInput({'class': 'select-across'}), 29: ) 30: 31: 32:checkbox = forms.CheckboxInput({'class': 'action-select'}, 33: lambda value: False) 34: 35:class AdminForm: 36: def __init__(self, form, fieldsets, prepopulated_fields, 37: readonly_fields=None, model_admin=None): 38: self.form, self.fieldsets = form, fieldsets 39: self.prepopulated_fields = [{ 40: 'field': form[field_name], 41: 'dependencies': [form[f] for f in dependencies] 42: } for field_name, dependencies in 43: prepopulated_fields.items()] 44: self.model_admin = model_admin 45: if readonly_fields is None: 46: readonly_fields = () 47: self.readonly_fields = readonly_fields 48: 49: def __iter__(self): 50: for name, options in self.fieldsets: 51: yield Fieldset( 52: self.form, name, 53: readonly_fields=self.readonly_fields, 54: model_admin=self.model_admin, 55: **options 56: ) 57: 58: @property 59: def errors(self): 60: return self.form.errors 61: 62: @property 63: def non_field_errors(self): 64: return self.form.non_field_errors 65: 66: @property 67: def media(self): 68: media = self.form.media 69: for fs in self: 70: media = media + fs.media 71: return media 72: 73: 74:class Fieldset: 75: def __init__(self, form, name=None, readonly_fields=(), 76: fields=(), classes=(), 77: description=None, model_admin=None): ``` ``` 75: self.form = form 76: self.name, self.fields = name, fields 77: self.classes = ' '.join(classes) 78: self.description = description 79: self.model_admin = model_admin 80: self.readonly_fields = readonly_fields 81: 82: @property 83: def media(self): 84: if 'collapse' in self.classes: 85: return forms.Media(js=['admin/js/collapse.js']) 86: return forms.Media() 87: 88: def __iter__(self): 89: for field in self.fields: 90: yield Fieldline(self.form, field, 91: self.field_name, model_admin=self.model_admin) 92: 93:class Fieldline: 94: def __init__(self, form, field, readonly_fields=None, 95: model_admin=None): 96: self.form = form # A django.forms.Form instance 97: if not hasattr(field, "__iter__") or isinstance(field, str): 98: self.fields = [field] 99: else: 100: self.fields = field 101: self.has_visible_field = not all( 102: (324 more lines below) (Open file: n/a) (Current directory: /django__django) bash-$ ``` ## SWE-agent django\_\_django-14411 The 'helpers.py' file is now open, but we need to find the specific part of the code that deals with the 'ReadOnlyPasswordHashWidget'. Since the file is quite long, we should use the 'search\_file' command to locate the term within this file. search\_file "ReadOnlyPasswordHashWidget" ## Observation ## django\_\_django-14411 ``` Found 1 matches for "ReadOnlyPasswordHashWidget" in → /django__django/django/contrib/admin/helpers.py: Line 215: # ReadOnlyPasswordHashWidget. End of matches for "ReadOnlyPasswordHashWidget" in → /django__django/django/contrib/admin/helpers.py (Open file: /django__django/django/contrib/admin/helpers.py) (Current directory: /django__django) bash-$ ``` ``` (Open file: /django__django/django/contrib/admin/helpers.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 We have found a reference to 'ReadOnlyPasswordHashWidget' on line 215 of the 'helpers.py' file. Let's navigate to that line to examine the context and see how the label is being rendered. goto 215 ## Observation ## django\_\_django-14411 ``` goto 215 ``` ``` 237: return conditional_escape(result_repr) 238: 239: 240:class InlineAdminFormSet: 241: """ 242: A wrapper around an inline formset for use in the admin 243: system. 244: """ 245: def __init__(self, inline, formset, fieldsets, 246: prepopulated_fields=None, 247: readonly_fields=None, model_admin=None, 248: has_add_permission=True, 249: has_change_permission=True, 250: has_delete_permission=True, 251: has_view_permission=True): 252: self.opts = inline 253: self.formset = formset 254: self.fieldsets = fieldsets 255: self.model_admin = model_admin 256: if readonly_fields is None: 257: readonly_fields = () 258: self.readonly_fields = readonly_fields 259: if prepopulated_fields is None: 260: prepopulated_fields = {} 261: self.prepopulated_fields = prepopulated_fields 262: self.classes = ' '.join(inline.classes) if 263: inline.classes else '' 264: self.has_add_permission = has_add_permission 265: self.has_change_permission = has_change_permission 266: self.has_delete_permission = has_delete_permission 267: self.has_view_permission = has_view_permission 268: 269: def __iter__(self): 270: if self.has_change_permission: 271: readonly_fields_for_editing = self.readonly_fields 272: else: 273: readonly_fields_for_editing = self.readonly_fields 274: + flatten_fieldsets(self.fieldsets) 275: 278: for form, original in zip(self.formset.initial_forms, 279: self.formset.get_queryset()): 280: view_on_site_url = 281: self.opts.get_view_on_site_url(original) 282: yield InlineAdminForm( 283: self.formset, form, self.fieldsets, 284: self.prepopulated_fields, 285: original, readonly_fields_for_editing, 286: model_admin=self.opts, 287: view_on_site_url=view_on_site_url, 288: ) 289: for form in self.formset.extra_forms: 290: yield InlineAdminForm( 291: self.formset, form, self.fieldsets, 292: self.prepopulated_fields, 293: None, self.readonly_fields, 294: model_admin=self.opts, 295: ) 298: if self.has_add_permission: 299: yield InlineAdminForm( 300: self.formset, self.formset.empty_form, 301: self.fieldsets, self.prepopulated_fields, None, 302: self.readonly_fields, model_admin=self.opts, 303: ) ``` ``` 288: 289: def fields(self): 290: fk = getattr(self.formset, "fk", None) 291: empty_form = self.formset.empty_form 292: meta_labels = empty_form._meta.labels or {} 293: meta_help_texts = empty_form._meta.help_texts or {} 294: for i, field_name in 295: enumerate(flatten_fieldsets(self.fieldsets)): 296: if fk and fk.name == field_name: 297: continue 298: if not self.has_change_permission or field_name in 299: self.readonly_fields: 300: yield { 301: 'name': field_name, 302: (125 more lines below) (Open file: /django__django/django/contrib/admin/helpers.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 We have navigated to the relevant section of the 'helpers.py' file where the 'ReadOnlyPasswordHashWidget' is mentioned. The comment on line 215 suggests that there is a special case handling for the 'ReadOnlyPasswordHashWidget'. The code on line 216 checks if the widget has a 'read\_only' attribute and, if so, renders the widget directly. Since the code snippet we are looking at does not show the rendering of the label, we need to find where the label is being rendered. It might be in the template files or somewhere else in the code. Let's search for the term 'label' in the current file to find where labels are being rendered. To address the issue, we need to ensure that the label for the 'ReadOnlyPasswordHashWidget' does not have a 'for' attribute, as it points to a non-labelable element. We should look for the part of the code where the label is being rendered with the 'for' attribute and make the necessary changes. search\_file "label" ## Observation ## django\_\_django-14411 ``` Observation django_django-14411 Found 23 matches for "label" in <> /django__django/django/contrib/admin/helpers.py: Line 5: display_for_field, flatten_fieldsets, <> help_text_for_field, label_for_field, Line 23: action = forms.ChoiceField(label=_('Action:')) Line 25: label='', Line 131: def label_tag(self): Line 133: contents = conditional_escape(self.field.label) Line 142: # checkboxes should not have a label suffix as <> the checkbox appears Line 143: # to the left of the label. Line 144: return self.field.label_tag( Line 146: label_suffix='' if self.is_checkbox else None, Line 163: if form._meta.labels and class_name in <> form._meta.labels: Line 164: label = form._meta.labels[class_name] Line 166: label = label_for_field(field, <> form._meta.model, model_admin, form=form) Line 175: 'label': label, Line 186: def label_tag(self): Line 190: label = self.field['label'] ``` ``` Line 191: return format_html('<label{}>{{}}</label>', <> flatatt(attrs), capfirst(label), self.form.label_suffix) Line 195: remote_field.model._meta.app_label, Line 292: meta_labels = empty_form._meta.labels or {} Line 300: 'label': meta_labels.get(field_name) <> or label_for_field( Line 312: label = form_field.label Line 313: if label is None: Line 314: label = label_for_field(field_name, <> self.opts.model, self.opts, form=empty_form) Line 317: 'label': label, End of matches for "label" in <> /django__django/django/contrib/admin/helpers.py (Open file: /django__django/django/contrib/admin/helpers.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent django\_\_django-14411 The search results show several instances where labels are being handled in the 'helpers.py' file. The most promising lines for our issue seem to be around line 191, where a label tag is being formatted with a 'for' attribute. Let's navigate to that line to examine the context and see if this is where we need to make our changes. goto 191 ## Observation ## django\_\_django-14411 ``` Observation [File: /django__django/django/contrib/admin/helpers.py (424 lines ↪ total)] (175 more lines above) 176: 'help_text': help_text, 177: 'field': field, 178: } 179: self.form = form 180: self.model_admin = model_admin 181: self.is_first = is_first 182: self.is_checkbox = False 183: self.is_readonly = True 184: self.empty_value_display = ↪ model_admin.get_empty_value_display() 185: 186: def label_tag(self): 187: attrs = {} 188: if not self.is_first: 189: attrs["class"] = "inline" 190: label = self.field['label'] 191: return format_html('<label>{}{}</label>', ↪ flatatt(attrs), capfirst(label), self.form.label_suffix) 192: 193: def get_admin_url(self, remote_field, remote_obj): 194: url_name = 'admin:%s_%s_change' % ( 195: remote_field.model._meta.app_label, 196: remote_field.model._meta.model_name, 197: ) 198: try: 199: url = reverse(url_name, ↪ args=[quote(remote_obj.pk)]) ``` ``` 200: return format_html('<a href="{}">{}</a>', url, -> remote_obj) 201: except NoReverseMatch: 202: return str(remote_obj) 203: 204: def contents(self): 205: from django.contrib.admin.templatetags.admin_list 206: import _boolean_icon 207: field, obj, model_admin = self.field['field'], 208: self.form.instance, self.model_admin 209: try: 210: f, attr, value = lookup_field(field, obj, 211: model_admin) 213: except (AttributeError, ValueError, 214: ObjectDoesNotExist): 215: result_repr = self.empty_value_display 216: else: 217: if field in self.form.fields: 218: widget = self.form[field].field.widget 219: # This isn't elegant but suffices for 220: contrib.auth's 223: # ReadOnlyPasswordHashWidget. 224: if getattr(widget, 'read_only', False): 225: return widget.render(field, value) 226: if f is None: 227: if getattr(attr, 'boolean', False): 228: result_repr = _boolean_icon(value) 229: else: 230: if hasattr(value, "__html__"): 231: result_repr = value 233: else: 234: result_repr = linebreaksbr(value) 245: else: 246: if isinstance(f.remote_field, ManyToManyRel): 247: and value is not None: 248: result_repr = ", ".join(map(str, 249: value.all())) 230: elif ( 231: isinstance(f.remote_field, 232: (ForeignKey, OneToOneField)) and 233: value is not None 234: ): 235: result_repr = 236: self.get_admin_url(f.remote_field, value) 237: else: 238: result_repr = display_for_field(value, f, 239: self.empty_value_display) 240: result_repr = linebreaksbr(result_repr) 241: return conditional_escape(result_repr) 242: 243: class InlineAdminFormSet: 244: """ 245: A wrapper around an inline formset for use in the admin 246: system. 247: """ 248: def __init__(self, inline, formset, fieldsets, 249: prepopulated_fields=None, 250: readonly_fields=None, model_admin=None, 251: has_add_permission=True, 252: has_change_permission=True, 253: has_delete_permission=True, 254: has_view_permission=True): ``` ``` 248: self.opts = inline 249: self.formset = formset 250: self.fieldsets = fieldsets 251: self.model_admin = model_admin 252: if readonly_fields is None: 253: readonly_fields = () 254: self.readonly_fields = readonly_fields 255: if prepopulated_fields is None: 256: prepopulated_fields = {} 257: self.prepopulated_fields = prepopulated_fields 258: self.classes = ' '.join(inline.classes) if ↪ → inline.classes else '' 259: self.has_add_permission = has_add_permission 260: self.has_change_permission = has_change_permission 261: self.has_delete_permission = has_delete_permission 262: self.has_view_permission = has_view_permission 263: 264: def __iter__(self): 265: if self.has_change_permission: 266: readonly_fields_for_editing = self.readonly_fields 267: else: 268: readonly_fields_for_editing = self.readonly_fields ↪ → + flatten_fieldsets(self.fieldsets) 269: 270: for form, original in zip(self.formset.initial_forms, ↪ → self.formset.get_queryset()): 271: view_on_site_url = 272: self.opts.get_view_on_site_url(original) 273: yield InlineAdminForm( 274: self.formset, form, self.fieldsets, 275: original, readonly_fields_for_editing, 276: model_admin=self.opts, 277: view_on_site_url=view_on_site_url, (149 more lines below) (Open file: /django__django/django/contrib/admin/helpers.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 We are now looking at the section of the 'helpers.py' file where a label tag is being generated. However, this label tag does not include a 'for' attribute, which means this is not the part of the code responsible for the issue described. Given that the 'ReadOnlyPasswordHashWidget' is part of the 'django.contrib.auth' app, we should look into the 'forms.py' file in the 'django/contrib/auth' directory, as it might contain the definition of the widget and the rendering of its label. The issue states that the label for the 'ReadOnlyPasswordHashWidget' has a 'for' attribute pointing to a non-labelable element. Since the 'ReadOnlyPasswordHashWidget' is likely a custom widget used in the Django admin, the rendering of its label might be handled in a template or in the widget's own Python code. Let's open the 'forms.py' file in the 'django/contrib/auth' directory to investigate further. open django/contrib/auth/forms.py ## Observation ## django\_\_django-14411 ``` ``` ``` 52: 53: 54:class ReadOnlyPasswordHashField(forms.Field): 55: widget = ReadOnlyPasswordHashWidget 56: 57: def __init__(self, *args, **kwargs): 58: kwargs.setdefault("required", False) 59: kwargs.setdefault('disabled', True) 60: super().__init__(*args, **kwargs) 61: 62: 63:class UsernameField(forms.CharField): 64: def to_python(self, value): 65: return unicodedata.normalize('NFKC', 66: 67: super().to_python(value)) 68: 69: def widget_attrs(self, widget): 70: return { 71: **super().widget_attrs(widget), 72: 'autocapitalize': 'none', 73: 'autocomplete': 'username', 74: } 75: 76:class UserCreationForm(forms.ModelForm): 77: """ 78: A form that creates a user, with no privileges, from the 79: given username and 80: password. 81: """ 82: error_messages = { 83: 'password_mismatch': _('The two password fields didn't 84: match.'), 85: } 86: password1 = forms.CharField( 87: label=_("Password"), 88: strip=False, 89: widget=forms.PasswordInput(attrs={'autocomplete': 90: 'new-password'}), 91: help_text=password_validation.password_validators_help_j 92: text_html(), 93: ) 94: password2 = forms.CharField( 95: label=_("Password confirmation"), 96: widget=forms.PasswordInput(attrs={'autocomplete': 97: 'new-password'}), 98: strip=False, 99: help_text=_("Enter the same password as before, for 100: verification."), 101: ) 102: 95: class Meta: 96: model = User 97: fields = ("username",) 98: field_classes = {'username': UsernameField} 99: (340 more lines below) (Open file: /django__django/django/contrib/admin/helpers.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 We have opened the 'forms.py' file in the 'django/contrib/auth' directory and found the definition of the 'ReadOnlyPasswordHashWidget' starting at line 33. The widget uses a template named 'read\_only\_password\_hash.html' for rendering, as specified on line 34. Since the issue is related to the label's 'for' attribute in the rendered HTML, we should look at the template file 'read\_only\_password\_hash.html' to see how the label is being rendered. Let's use the 'find\_file' command to locate this template file within the repository. ``` find_file "read_only_password_hash.html" ``` ## Observation ## django\_\_django-14411 ``` Found 1 matches for "read_only_password_hash.html" in ~ /django__django: /django__django/django/contrib/auth/templates/auth/widgets/read_on| ~ ly_password_hash.html (Open file: /django__django/django/contrib/auth/forms.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 We have located the template file 'read\_only\_password\_hash.html' in the 'django/contrib/auth/templates/auth/widgets' directory. Let's open this file to inspect the HTML and see how the label is being rendered. ``` open django/contrib/auth/templates/auth/widgets/read_only_password, -> _hash.html ``` ## Observation ## django\_\_django-14411 ``` [File: /django__django/django/contrib/auth/templates/auth/widgets/] ↔ read_only_password_hash.html (5 lines ↔ total)] 1:<div{% include 'django/forms/widgets/attrs.html' %}> 2:{% for entry in summary %} 3:<strong>{{ entry.label }}</strong>{% if entry.value %}: {{ ↔ entry.value }}{% endif %} 4:{% endfor %} 5:</div> (Open file: /django__django/django/contrib/auth/forms.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 The template file 'read\_only\_password\_hash.html' does not show a 'label' element with a 'for' attribute, which means the issue might be elsewhere. The issue description mentions that the label element for the 'ReadOnlyPasswordHashWidget' has a 'for' attribute pointing to a non-labelable element. Since we don't see a 'label' element here, the 'for' attribute is likely added in the Python code that generates the HTML for the widget or in a different part of the template. Given that the 'ReadOnlyPasswordHashWidget' is a custom widget, the rendering of the label with the 'for' attribute might be handled in the Python code where the widget is used in a form, rather than in the widget's template itself. Let's go back to the 'forms.py' file and look for the usage of 'ReadOnlyPasswordHashWidget' in a form where a label might be rendered with a 'for' attribute. We should look for a form that includes 'ReadOnlyPasswordHashWidget' and inspect how the label is being generated. open django/contrib/auth/forms.py ## Observation ## django\_\_django-14411 ``` open django/contrib/auth/forms.py ``` ``` 43: try: ``` ``` 96: class Meta: 97: model = User 98: fields = ("username",) 99: field_classes = {'username': UsernameField} 100: (340 more lines below) (Open file: /django__django/django/contrib/auth/templates/auth/widj ↔ gets/read_only_password_hash.html) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 We are back in the 'forms.py' file where the 'ReadOnlyPasswordHashWidget' is defined. Since the issue is with the label's 'for' attribute, we need to find where the 'ReadOnlyPasswordHashField' is used in a form and how the label is being rendered there. Let's use the 'search\_file' command to search for 'ReadOnlyPasswordHashField' in the current file. The 'ReadOnlyPasswordHashField' is a custom field that uses the 'ReadOnlyPasswordHashWidget', and it's likely used in the admin forms for the User model. We should search for the usage of 'ReadOnlyPasswordHashField' within this file to find where it's included in a form. search\_file "ReadOnlyPasswordHashField" ## Observation ## django\_\_django-14411 ``` Found 2 matches for "ReadOnlyPasswordHashField" in → /django__django/django/contrib/auth/forms.py: Line 54:class ReadOnlyPasswordHashField(forms.Field): Line 136: password = ReadOnlyPasswordHashField( End of matches for "ReadOnlyPasswordHashField" in → /django__django/django/contrib/auth/forms.py: (Open file: /django__django/django/contrib/auth/forms.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent ## django\_\_django-14411 We have found two references to 'ReadOnlyPasswordHashField' in the 'forms.py' file. The first reference on line 54 is the class definition itself, and the second reference on line 136 is likely where the field is used in a form. Let's navigate to line 136 to see the context in which 'ReadOnlyPasswordHashField' is used and how the label is being rendered. goto 136 ## Observation ## django\_\_django-14411 ``` [File: /django__django/django/contrib/auth/forms.py (440 lines → total)] (120 more lines above) 121: if password: 122: try: ``` ``` 123: -> password_validation.validate_password(password, self.instance) 124: except ValidationError as error: 125: self.add_error('password2', error) 126: 127: def save(self, commit=True): 128: user = super().save(commit=False) 129: user.set_password(self.cleaned_data["password1"]) 130: if commit: 131: user.save() 132: return user 133: 134: 135:class UserChangeForm(forms.ModelForm): 136: password = ReadOnlyPasswordHashField( 137: label=_("Password"), 138: help_text=_( 139: 'Raw passwords are not stored, so there is no way 140: to see this ' 141: 'user's password, but you can change the password 142: using ' 143: '<a href="{}">this form</a>.' 144: ), 145: ) 146: 147: class Meta: 148: model = User 149: fields = '__all__' 150: field_classes = {'username': UsernameField} 151: 152: def __init__(self, *args, **kwargs): 153: super().__init__(*args, **kwargs) 154: password = self.fields.get('password') 155: if password: 156: password.help_text = 157: password.help_text.format('../password/') 158: user_permissions = self.fields.get('user_permissions') 159: if user_permissions: 160: user_permissions.queryset = 161: user_permissions.queryset.select_related('content_type') 162: 163: class AuthenticationForm(forms.Form): 164: """ 165: Base class for authenticating users. Extend this to get a 166: form that accepts 167: username/password logins. 168: """ 169: username = 170: UsernameField(widget=forms.TextInput(attrs={'autofocus': 171: True})) 172: password = forms.CharField( 173: label=_("Password"), 174: strip=False, 175: widget=forms.PasswordInput(attrs={'autocomplete': 176: 'current-password'}), 177: ) 178: 179: error_messages = { 180: 'invalid_login': _( 181: "Please enter a correct %(username)s and password." 182: ) 183: Note that both " 184: "fields may be case-sensitive." ``` ``` ``` ## SWE-agent ## django\_\_django-14411 We are now looking at the 'UserChangeForm' class where the 'ReadOnlyPasswordHashField' is used. The field is instantiated with a label on line 136, and the help text includes an HTML anchor tag for changing the password. To resolve the issue, we need to ensure that when the form is rendered, the label for the 'ReadOnlyPasswordHashField' does not include a 'for' attribute. Since the rendering of form fields to HTML is handled by Django's form rendering system, we may need to customize the rendering for this specific field to exclude the 'for' attribute. The issue reported is that the label element for the 'ReadOnlyPasswordHashWidget' has a 'for' attribute which points to a non-labelable element. However, in the 'UserChangeForm', the label is defined in the Python code and not in the HTML template. This suggests that the 'for' attribute is likely added automatically by Django's form rendering. One way to do this is to override the 'label\_tag' method for the 'ReadOnlyPasswordHashField' to ensure it does not render a 'for' attribute. Let's edit the 'ReadOnlyPasswordHashField' class to include this customization. We will add a custom 'label\_tag' method to the 'ReadOnlyPasswordHashField' class that returns a label without the 'for' attribute. ``` We will add a custom `label_tag` method to the `ReadOnlyPasswordHashField` class that returns a label without the 'for' attribute. ------------------------ edit 54:60 class ReadOnlyPasswordHashField(forms.Field): widget = ReadOnlyPasswordWidget def __init__(self, *args, **kwargs): kwargs.setdefault("required", False) kwargs.setdefault('disabled', True) super().__init__(*args, **kwargs) def label_tag(self, contents=None, attrs=None, -> label_suffix=None): contents = contents or self.label if label_suffix is not None and label_suffix not in <> contents: contents += label_suffix widget = self.widget id_ = widget.attrs.get('id') or '' return format_html('<label{}>{}</label>', -> format_html_join('', ' {}="{}"', sorted(attrs.items()) <> if attrs else ()), contents) end_of_edit ``` ## Observation ## django\_\_django-14411 ``` Observation ``` ``` def return_context(self): # return self.widget = ReadOnlyPasswordHashField(forms.Field): # def __init__(self, *args, **kwargs): kwargs.setdefault("required", False) kwargs.setdefault('disabled', True) super().__init__(*args, **kwargs) # def label_tag(self, contents=None, attrs=None, label_suffix=None): contents = contents or self.label if label_suffix is not None and label_suffix not in > contents: contents += label_suffix widget = self.widget id_ = widget.attrs.get('id') or '' return format_html('<label{}>{}</label>', <format_html_join('', ' {}="{}"', sorted(attrs.items()) if > attrs else ()), contents) # return:class UsernameField(forms.CharField): def to_python(self, value): (376 more lines below) ------------------------------------------------------------------- This is the original code before your edit ------------------------------------------------------------------- [File: /django__django/django/contrib/auth/forms.py (440 lines > total)] (48 more lines above) 49: 'value': value}) 50: context['summary'] = summary 51: return context 52: 53: 54:class ReadOnlyPasswordHashField(forms.Field): 55: widget = ReadOnlyPasswordHashWidget 56: 57: def __init__(self, *args, **kwargs): 58: kwargs.setdefault("required", False) 59: kwargs.setdefault('disabled', True) 60: super().__init__(*args, **kwargs) 61: 62: 63:class UsernameField(forms.CharField): 64: def to_python(self, value): 65: return unicodedata.normalize('NFKC', 66: super().to_python(value)) (375 more lines below) ------------------------------------------------------------------- Your changes have NOT been applied. Please fix your edit command > and try again. You either need to 1) Specify the correct start/end line arguments > or 2) Correct your edit code. DO NOT re-run the same failed edit command. Running it again will > lead to the same error. (Open file: /django__django/django/contrib/auth/forms.py) ``` ``` (Current directory: /django__django) bash-$ ``` ## SWE-agent django\_\_django-14411 The error message indicates that the names 'format\_html' and 'format\_html\_join' are undefined. These functions are utility functions provided by Django for safely creating HTML output. To use them, we need to import them from 'django.utils.html'. Let's correct the edit by adding the necessary imports and retrying the edit command. ``` Let's correct the edit by adding the necessary imports and retrying the edit command. edit 54:60 from django.utils.html import format_html, format_html_join class ReadOnlyPasswordHashField(forms.Field): widget = ReadOnlyPasswordHashWidget def __init__(self, *args, **kwargs): kwargs.setdefault("required", False) kwargs.setdefault('disabled', True) super().__init__(*args, **kwargs) def label_tag(self, contents=None, attrs=None, <> label_suffix=None): contents = contents or self.label if label_suffix is not None and label_suffix not in attrs: return format_html('<label>{}</label>', contents) end_of_edit ``` ## Observation ## django\_\_django-14411 ``` Observation django__django-14411 [File: /django__django/django/contrib/auth/forms.py (448 lines → total)] (3 more lines above) 4:from django.contrib.auth import ( 5: authenticate, get_user_model, password_validation, 6:) 7:from django.contrib.auth.hashers import ( 8: UNISABLE_PASSWORD_PREFIX, identify_hasher, 9:) 10:from django.contrib.auth.models import User 11:from django.contrib.auth.tokens import default_token_generator 12:from django.contrib.sites.shortcuts import get_current_site 13:from django.core.exceptions import ValidationError 14:from django.core.mail import EmailMultiAlternatives 15:from django.template import loader 16:from django.utils.encoding import force_bytes 17:from django.utils.http import urlsafe_base64_encode 18:from django.utils.text import capfirst 19:from django.utils.translation import gettext, gettext_lazy as _ 20: 21:UserModel = get_user_model() 22: 23: 24:def _unicode_ci_compare(s1, s2): 25: """ 26: Perform case-insensitive comparison of two identifiers, 27: using the ``` ``` class RecommendationAlgorithm from Unicode Technical Report 36, → section 28: 2.11.2(B)(2). 29: """ 30: return unicodedata.normalize('NFKC', s1).casefold() == 31: unicodedata.normalize('NFKC', s2).casefold() 32: 33:class ReadOnlyPasswordHashWidget(forms.Widget): 34: template_name = 'auth/widgets/read_only_password_hash.html' 35: read_only = True 36: 37: def get_context(self, name, value, attrs): 38: context = super().get_context(name, value, attrs) 39: summary = [] 40: if not value or 41: value.startswith(UNUSABLE_PASSWORD_PREFIX): 42: summary.append({'label': gettext("No password 43: set.")}) 44: else: 45: try: 46: hasher = identify_hasher(value) 47: except ValueError: 48: summary.append({'label': gettext("Invalid 49: password format or unknown hashing algorithm.")}) 50: else: 51: for key, value_ in 52: hasher.safe_summary(value).items(): 53: summary.append({'label': gettext(key), 54: 'value': value_}) 55: context['summary'] = summary 56: return context 57: ``` ``` 80: } 81: 82: 83:class UserCreationForm(forms.ModelForm): 84: """ 85: A form that creates a user, with no privileges, from the 86: given username and 87: password. 88: error_messages = { 89: 'password_mismatch': _('The two password fields didn't 90: } 91: password1 = forms.CharField( 92: label=_("Password"), 93: strip=False, 94: widget=forms.PasswordInput(attrs={'autocomplete': 95: 'new-password'}), 96: help_text=password_validation.password_validators_help_] 97: ) 98: password2 = forms.CharField( 99: label=_("Password confirmation"), 100: widget=forms.PasswordInput(attrs={'autocomplete': 101: strip=False, 102: help_text=_("Enter the same password as before, for 103: } 104: ) 105: more lines below) File updated. Please review the changes and make sure they are ↔ correct (correct indentation, no duplicate lines, etc). Edit ↔ the file again if necessary. (Open file: /django__django/django/contrib/auth/forms.py) (Current directory: /django__django) bash-$ ``` ## SWE-agent django\_\_django-14411 The changes to the 'ReadOnlyPasswordHashField' class have been successfully applied, including the import of 'format\_html' and 'format\_html\_join', and the addition of a custom 'label\_tag' method that returns a label without the 'for' attribute. Let's submit the changes to the code base using the 'submit' command. Before submitting the changes, it would be prudent to test that the changes have the desired effect. However, since we cannot run a Django server or render forms in this environment, we will proceed with submitting the changes. submit ## Observation ``` diff --git a/django/contrib/auth/forms.py -> b/django/contrib/auth/forms.py index 20d8922799..1f249a88dd 100644 --- a/django/contrib/auth/forms.py +++ b/django/contrib/auth/forms.py ``` django\_\_django-14411 ``` @@ -51,6 +51,8 @@ class ReadOnlyPasswordHashWidget(forms.Widget): return context +from django.utils.html import format_html, format_html_join + class ReadOnlyPasswordHashField(forms.Field): widget = ReadOnlyPasswordHashWidget @@ -59,6 +61,12 @@ class ReadOnlyPasswordHashField(forms.Field): kwargs.setdefault('disabled', True) super().__init__(*args, **kwargs) + def label_tag(self, contents=None, attrs=None, ↔ label_suffix=None): + contents = contents or self.label + if label_suffix is not None and label_suffix not in ↔ contents: + contents += label_suffix + return format_html('<label>{}</label>', contents) class UsernameField(forms.CharField): def to_python(self, value): ``` ## Gold Patch ## django\_\_django-14411 ``` Gold Patch ``` ## E Statements ## E.1 Ethics & Broader Impacts Prior works have established how the growing usage of LMs that can code also carries a certain amount of risk. We identify three main risks that could arise when building and using a system like SWE-agent, then discuss how we incorporates measures that mitigate such problems. First is the security risks that come with executing LM-generated code on device. When evaluating on software engineering tasks that feature an incredibly diverse number of issue descriptions, running code generations on a personal computing device can have negative side effects, such as the unintentional removal of digital assets (e.g., rm -rf asset/ ). To defend against this, we design SWE-agent to use ephemeral containers for both inference and evaluation. SWE-agent's execution environment and the SWE-bench evaluation framework are both carried out in sand-boxed code environments, which is made possible with Docker. Executing code in a Docker container ensures that its effects are mostly isolated from the rest of the system. While not considered as secure as virtualized hardware isolation, the namespace isolation provided by Docker containers is deemed sufficient for code that is not deliberately engineered to exploit recent container escape vulnerabilities. More details are discussion is in §A.2. Second, if the wider community develops interest for SWE-agent and builds upon it, it is also possible that illegitimate evaluation datasets or infrastructure can be used to inject testing devices with malicious code or instructions to generate malicious code. For instance, an unofficial repository claiming to host an inference/evaluation harness for SWE-agent/bench could include a task instance with an issue description that tells the LM agent to build key logging functionality and store it in a hidden folder. To eliminate confusion and reduce the possibility of such an event, we provide clear guidelines listed on our GitHub repositories, data stores, and websites indicating the official repositories and channels that we actively maintain. We also encourage third parties to incorporate any improvements into our codebase and help with integrating such contributions. Lastly are the consequences of software engineering agents being deployed in the real world. Prior works have conceptualized and put forth prototypes of agents that can carry out offensive security measures. It is also not difficult to imagine that a system like SWE-agent can be incorporated into pipelines resulting in the production of malicious code. SWE-agent's strong performance on SWE-bench implies that future AI systems will likely be increasingly adept in the aforementioned use cases. Releasing SWE-agent as an open source tool can support research towards designing sound, effective constraints for what software engineering agents are permitted to do. It can also serve as a system that legal experts and policy-making entities can experiment with to shape the future of what AI-driven end to end software engineering could look like. ## E.2 Reproducibility To help the greater community reproduce the results presented in this paper and build on the SWEagent platform, we open source all of our resources that were created for this project. The source code for the interactive pipeline, context management logic, command implementations, interface design, and everything else is entirely available in a GitHub repository. We provide extensive text and video documentation describing how to run and modify different parts of the codebase. Practitioners should be able to easily recover our findings by running the agent with simple scripts. We also open source all inference and evaluation artifacts (e.g., trajectories, code generations, evaluation execution traces, analysis notebooks). The results presented in the main and supplementary parts of this paper can be fully rendered from the data. Finally, we also maintain an active online help forum to assist with any reproduction problems or questions about how to build on ACI design and SWE-agent. ## E.3 Limitations & Future Work The final SWE-agent configuration has a small toolkit, albeit highly effective. With SWE-agent's highly extensible design, we're excited by the prospect of adding more tools, such as web browsing or static analysis, that can leverage more signals from an issue description and codebase to improve the %Resolved performance. Many tools trialed by prior works from software engineering and language model agents, such as static/dynamic analysis, spectrum based fault localization, or test generation via fuzzing could prove useful. Second, in this work, the ACI development process and case studies are done manually. Many components of SWE-agent were crafted from observations of recurring behavior within a single trajectory or across multiple trajectories. Automating part or all of this process could not only accelerate work built on top of SWE-agent, but also provide greater insights into developing ACI principles for agentic software engineering. Contemporary works have explored automated prompting to improve performance on traditional sequence to sequence tasks, supplanting the need for manual prompt design. Thinking about automating ACI design raises immediately interesting questions around how such systems can scrutinize and iterate upon their own designs. Ensuring such horizon leads to incremental performance improvements across a longer horizon is also a challenging question. Finally, the scope of SWE-agent is exclusively focused on programmatic tasks like software engineering and code generation. We're curious to see whether the same principles of ACI and our observations of agent behavior are transferable to different domains. Recent work around applying LM agents to a variety of digital work applications have proliferated, such as use cases in education technology, data analysis, and enterprise workflows. We hope that thinking about improving performance of agentic workflows on these domains through the lens of ACI design can be a symbiotic process. For instance, for a task such a shopping on the web, in place of a typical Google-style search tool, could agents benefit from additional information beyond a list of each page's title and snippet? Would the design vary if the nature of the downstream task were to change slightly? For a completely different task, such as navigating an internal company knowledge base to help a recently on-boarded employee, how might the search interface be best adjusted to the agent? Similar to the progression of the field of User Experience (UX) and Human Computer Interaction (HCI) research, applying ACI to other domains could not only yield improvements in downstream task performance, but also further expand the list of ACI principles. We believe that the fundamental motivations for ACI, the foundational principles we put forth, and our case study of SWE-agent as an instantiation of implementing and improving an ACI can motivate such work.

Rendering Paper...