# SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?
> 1 University of Toronto 2 Xiaohongshu Inc. 3 Coolwei AI Lab 4 University of Illinois Urbana-Champaign 5 University of California, Berkeley
## Abstract
Can large language model agents develop industry-level mobile applications? We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents—three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode)—and find that even the best configurations achieve only 12% task success rate. Our analysis reveals that (1) agent design matters as much as model capability—the same model shows up to 6 $\times$ performance gap across agents, (2) commercial agents consistently outperform open-source alternatives, and (3) simple “Defensive Programming” prompts outperform complex ones by 7.4%. These findings highlight a significant gap between current agent capabilities and industrial requirements, while providing actionable insights for practitioners and researchers. We release SWE-Bench Mobile as a hosted benchmark challenge to prevent data contamination and ensure fair evaluation. The public leaderboard and development toolkit are available at https://swebenchmobile.com.
Keywords: Large Language Models, Software Engineering Agents, Mobile Development Benchmark

\* Equal contribution. † Corresponding author: jiaxuan@illinois.edu.
## 1. Introduction
<details>
<summary>figures/fig_pipeline.png Details</summary>

### Visual Description
## System Architecture Diagram: Automated Code Generation and Evaluation Pipeline
### Overview
This image is a system architecture diagram illustrating a three-stage pipeline for automated code generation and evaluation. The flow moves from left to right, starting with input requirements and a codebase, proceeding through a runtime environment where a coding agent generates code, and concluding with a multi-faceted evaluation phase. The diagram uses color-coded boxes and directional arrows to show the flow of data and control.
### Components/Axes
The diagram is segmented into three primary containers, each with a title in the top-right corner:
1. **Input** (Left container)
2. **Runtime** (Center container)
3. **Evaluation** (Right container)
**Detailed Component List (with spatial grounding):**
* **Input Container (Left):**
* **Top-left:** A light blue box labeled `Feature Description`.
* **Middle-left:** A light blue box labeled `Figma Link`, accompanied by the Figma logo icon.
* **Bottom-left:** A light blue box labeled `Other Related Context (abtest/multi-lingo/etc)`.
* **Center:** A blue box labeled `Product Requirement Document`. Arrows from the three boxes above converge into this box.
* **Bottom-center:** A pink box labeled `Mobile Project Codebase (Swift/Objc/RN/etc)`, accompanied by the GitLab logo icon. An arrow from this box points towards the Runtime container.
* **Runtime Container (Center):**
* **Left side:** A large, light purple box labeled `Bench Runtime (Model Config/Tuned System Prompt/Task Management)`.
* **Right side:** A purple box labeled `Coding Agent (Cursor/CC/Codex/etc)`. Inside this box are five logos/icons representing different agents or tools: a cube, a sunburst, a spiral, a stylized "Z", and a star.
* **Top-right (above Coding Agent):** Two smaller, light purple boxes: `Figma MCP` and `Vision MCP`. Both have arrows pointing down to the `Coding Agent` box.
* **Flow:** An arrow from the `Product Requirement Document` (in Input) points to the `Bench Runtime`. An arrow from the `Bench Runtime` points to the `Coding Agent`. An arrow labeled `diff patches` exits the `Coding Agent` and points to the Evaluation container.
* **Evaluation Container (Right):**
* **Left side:** Three light green boxes stacked vertically:
* `Entrypoint-wise Eval` (Top)
* `Functionality-wise Eval` (Middle)
* `Configuration-wise Eval` (Bottom)
* **Right side:** A large, green box labeled `Analyzer`.
* **Flow:** Arrows from all three evaluation boxes (`Entrypoint-wise`, `Functionality-wise`, `Configuration-wise`) converge into the `Analyzer` box.
### Detailed Analysis
The diagram defines a clear, linear workflow:
1. **Input Stage:** Aggregates project context from multiple sources: a textual feature description, a design link (Figma), other contextual data (like A/B tests or multilingual requirements), and the existing mobile project codebase. These are synthesized into a Product Requirement Document.
2. **Runtime Stage:** The requirement document and codebase are fed into a "Bench Runtime," which manages model configuration, system prompts, and tasks. This runtime orchestrates a "Coding Agent" (e.g., Cursor, Claude Code, Codex). The agent is augmented by two Model Context Protocol (MCP) servers: `Figma MCP` (for interpreting design files) and `Vision MCP` (for processing visual inputs). The agent's output is a set of `diff patches`.
3. **Evaluation Stage:** The generated code patches are subjected to three parallel evaluation tracks: checking entry points, assessing functionality, and verifying configuration. The results from all three evaluations are fed into a final `Analyzer` for synthesis and reporting.
### Key Observations
* **Modular Design:** The system is highly modular, with clear separation between input processing, code generation, and evaluation.
* **Agent-Centric Runtime:** The core of the runtime is a "Coding Agent," which is supported by specialized MCPs for design and vision, indicating a focus on multimodal input processing.
* **Comprehensive Evaluation:** The evaluation is not monolithic but is broken down into three distinct, complementary dimensions (entrypoint, functionality, configuration), suggesting a robust quality assurance process.
* **Tool Agnosticism:** The `Coding Agent` box lists multiple potential tools (Cursor, CC, Codex, etc.), indicating the architecture is designed to be agnostic to the specific agent implementation.
### Interpretation
This diagram represents an automated pipeline tailored for evaluating coding agents on mobile application development, with three clearly delineated roles:
* **The Input stage** supplies the context: the raw materials of requirements, designs, and existing code.
* **The Runtime stage** performs the transformation: the coding agent produces a new artifact (the code diff) from that input.
* **The Evaluation stage** validates the result: the generated code is analyzed against multiple criteria to determine its correctness and quality.
The pipeline's significance lies in its potential to drastically accelerate development cycles by automating the translation of design and requirements into code, followed by rigorous, multi-angle validation. The inclusion of `Figma MCP` and `Vision MCP` highlights a trend towards integrating design tools directly into the coding process, bridging the gap between design and implementation. The separation of evaluation into three specific "wise" tracks suggests that the system is designed to catch different classes of errors (e.g., structural issues via entrypoint eval, logical bugs via functionality eval, and environment mismatches via configuration eval), leading to higher overall code quality.
</details>
Figure 1. Overview of the SWE-Bench Mobile pipeline. (1) Agents receive multi-modal inputs including a Product Requirement Document (PRD), Figma design, and a large-scale Swift/Objective-C codebase. (2) The agent navigates the codebase, plans the implementation, and generates code. (3) The output is a Git patch that is applied and evaluated against a comprehensive test suite.
Large language models (LLMs) have enabled a new generation of autonomous coding agents that can understand requirements, navigate codebases, and implement features with minimal human intervention. Commercial systems like GitHub Copilot, Cursor, and Claude Code have achieved impressive results on existing benchmarks, raising a critical question: Can these agents handle the complexity of real-world, industry-level mobile software development?
Answering this question requires a comprehensive evaluation that faithfully captures professional software engineering. However, existing benchmarks have significant limitations. HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) evaluate isolated algorithmic problems far removed from industrial practice. SWE-Bench (Jimenez et al., 2024) advances the field by using real GitHub issues, but still falls short of industrial realism: it focuses on bug fixes rather than feature development, uses text-only inputs without design specifications, typically involves small localized changes to 1-2 files, and concentrates on Python, which is well-represented in training data. Recent work like SWE-Bench Pro (Deng et al., 2025) addresses some limitations by introducing longer-horizon tasks, but still lacks multi-modal inputs and focuses exclusively on Python.
In professional software development, engineers participate in a structured workflow that goes far beyond writing code. They interpret Product Requirement Documents (PRDs) that specify what to build and why. They translate visual designs from tools like Figma into implementation decisions about layout and interaction. They navigate large codebases—often hundreds of thousands of lines—to find relevant files and understand existing patterns. They make coordinated changes across multiple modules while maintaining consistency. And they ensure their implementations handle edge cases and pass comprehensive tests. A benchmark claiming to evaluate “industry-level” capabilities must test all of these aspects.
We focus on mobile application development not merely for language diversity, but because it represents a distinct and critical paradigm in software engineering that remains unexplored by current benchmarks. Unlike server-side logic (e.g., Python scripts), mobile development introduces unique challenges for AI agents: (1) Multi-modal Dependency: Implementation is strictly guided by visual designs (Figma) and user interactions, requiring agents to perform visually-grounded program synthesis rather than just text-to-code generation. (2) Event-Driven Complexity: Mobile apps are stateful systems that must handle asynchronous user events, network changes, and strict OS lifecycle callbacks, challenging agents’ ability to model dynamic system states. (3) Client-Side Constraints: Development occurs within framework-heavy environments (e.g., iOS SDK) with rapid iterations, testing generalization to domain-specific APIs.
We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on industry-level mobile application development. SWE-Bench Mobile is constructed from real development artifacts at a major technology company, comprising 50 authentic tasks derived from actual product requirements. Each task combines multi-modal inputs—PRDs, Figma designs, and a large-scale mixed Swift/Objective-C production codebase—with comprehensive evaluation through 449 human-verified test cases.
Contributions.
1. We introduce SWE-Bench Mobile, the first benchmark combining PRDs, Figma designs, and a large-scale codebase to capture the full complexity of industrial software development.
2. We evaluate 22 agent-model configurations across four coding agents (three commercial, one open-source), with detailed analysis of performance, cost, and robustness.
3. We systematically categorize agent failures, finding that 54% stem from missing feature flags—a production practice unfamiliar to agents—followed by missing data models (22%) and incomplete file coverage (11–15%).
4. We provide actionable insights: agent design matters as much as model capability (up to 6 $\times$ performance gap for the same model), commercial agents outperform open-source ones, simple prompts outperform complex ones, and cost-effective configurations exist.
To strictly preserve the integrity of the evaluation and respect the proprietary nature of the production codebase, we adopt a hosted evaluation paradigm. Unlike static datasets that are prone to data contamination in future model training sets, our held-out private test set ensures that agents are evaluated on truly unseen industrial tasks. We provide a sanitized development kit and a public leaderboard to foster community progress.
Our evaluation reveals a significant gap between current capabilities and industrial requirements. The best configuration achieves only 12% task success rate, with most failures due to incomplete implementations. The same model (Opus 4.5) achieves 12% on Cursor but only 2% on OpenCode—a 6 $\times$ gap—demonstrating that agent scaffolding matters as much as model capability. Commercial agents consistently outperform the open-source OpenCode, whose best result (8% with GLM 4.6) trails the best commercial result (12%) by 4 percentage points. Success drops from 18% for simple tasks requiring 1-2 files to just 2% for complex tasks requiring 7+ files, indicating agents struggle with cross-file reasoning. These findings suggest that while coding agents show promise for simple tasks, substantial improvements in requirement understanding, multi-modal reasoning, and codebase navigation are needed for reliable industry-level development.
## 2. SWE-Bench Mobile
SWE-Bench Mobile is a benchmark designed to evaluate coding agents on industry-level mobile application development. Unlike existing benchmarks that focus on isolated coding problems or bug fixes, SWE-Bench Mobile captures the full complexity of professional software engineering: multi-modal inputs, large codebases, and comprehensive testing. Figure 1 illustrates the overall benchmark pipeline.
### 2.1. Problem Formulation
Each benchmark instance is represented as a triplet:
$$
\mathcal{T}=(\mathcal{I},\mathcal{O},\mathcal{E}),
$$
where $\mathcal{I}$ is the input context, $\mathcal{O}$ is the expected output, and $\mathcal{E}$ is the evaluation configuration.
Input ( $\mathcal{I}$ ). The input context mimics a typical developer’s starting point for a new feature. It consists of three main components (see Figure 2). First, a Product Requirement Document (PRD) describes the feature goal, user story, acceptance criteria, and constraints. These PRDs are derived from actual product requirements at XiaoHongShu Inc., a major social media platform with over 300 million monthly active users, follow standard industrial conventions (Atlassian, 2024), and have an average length of 450 words, requiring agents to parse natural language specifications. Second, 70% of tasks include a Figma Design specification, containing component layout, typography, and visual details that the agent must translate into code. Finally, the agent is provided with the XiaoHongShu production codebase, a Git repository snapshot containing approximately 500,000 lines of Swift/Objective-C code across thousands of files. This large-scale context forces the agent to perform retrieval and navigation, rather than just code generation.
<details>
<summary>figures/fig_task_example.png Details</summary>

### Visual Description
## Diagram: Design-to-Code Workflow (Mobile App UI)
### Overview
The image illustrates a three-stage workflow for a mobile app UI design and implementation: **Before** (original UI), **Figma example** (design tool), and **Code diff example** (code changes). It shows how a UI design in Figma is translated into code modifications, with a focus on post content and layout adjustments.
### Components/Axes
The diagram is divided into three panels:
#### 1. Left Panel: *Before/After* (Mobile App Screens)
- **Top (Before)**: A mobile app screen with a post:
- Image: Living room interior.
- Text: *“周末只想赖在家”* (Weekend just want to stay at home).
- Username: *“七月奶奶”* (July Grandma).
- Timestamp: *“2天前”* (2 days ago).
- Likes: *“2009”*.
- **Bottom (After)**: A mobile app screen with a post:
- Image: Person with blue balloons.
- Text: *“做了一个蓝色的梦💙”* (Made a blue dream💙).
- Username: *“酒井小乔”* (Sakai Xiaoqiao).
- Timestamp: *“4小时前”* (4 hours ago).
#### 2. Middle Panel: *Figma example* (Design Tool Interface)
- **Left Sidebar**:
- Project: *“关注Tab改造”* (Follow Tab Renovation), *“关注关系”* (Follow Relationship).
- Pages: *“25.07 笔记封面去点赞”* (25.07 Note Cover Remove Likes).
- Layers: *“关注tab”* (Follow tab), *“Rectangle 11115319”*, *“双列”* (Double column), *“列”* (Column), *“笔记”* (Note), *“卡片聚合-合集/直播”* (Card Aggregation - Collection/Live), *“Frame 2053141301”*, *“两个推荐理由”* (Two Recommendation Reasons), *“无头像气泡”* (No Avatar Bubble), *“Component / Bar / Tab”*, *“导航栏”* (Navigation Bar).
- **Center (Mobile App Design)**:
- Tabs: *“关注”* (Follow), *“发现”* (Discover), *“上海”* (Shanghai), search icon.
- Posts: *“做了一个蓝色的梦💙”*, *“周末只想赖在家”*, *“叮咚~小猫邀您一起喝下午茶🍵”* (Dingdong~ Little cat invites you to drink afternoon tea🍵), *“叮咚,你的福到啦🧧”* (Dingdong, your fortune is here🧧).
- Bottom Nav: *“首页”* (Home), *“热门”* (Hot), *“+”*, *“消息”* (Messages), *“我”* (Me).
- **Right Sidebar**:
- Layer properties: *“Caption/C3”*, Layout (Width: *“Hug (41px)”*, Height: *“Hug (14px)”*).
- Typography: Font *“PingFang SC”*, Weight *“var(--c3-font-weight)”*, Style *“Regular”*, Size *“var(--c3-font-size)”*, Line height *“var(--c3-line-height)”*, Letter spacing *“0%”*.
- Colors: Hex, Text colors *“var(--placeholder)”*.
#### 3. Right Panel: *Code diff example* (Swift Code Editor)
- **File**: `FeedItemFooter.swift` (class `FeedItemFooter: UIView`).
- **Code Changes** (lines 32–53):
- `layoutSubviews()` method:
- Hide `likeButton` (if `FeatureConfig.shared.enableTimeEmphasis` is true).
- Update `timeLabel` font to `Theme.font(.caption)`.
- Re-layout `timeLabel` to the trailing edge (x: `width - timeSize.width - Padding.standard`, y: `avatarView.center.y - timeSize.height / 2`).
- Adjust `nicknameLabel` width to prevent overlap with `timeLabel`.
### Detailed Analysis
- **UI Evolution**: The “Before/After” panels show a change in post content (image, text, user, timestamp), indicating a UI update (e.g., new post or design refresh).
- **Figma Design**: The Figma panel details the design structure (layers, components) and properties (layout, typography) for the mobile app, ensuring consistency between design and development.
- **Code Implementation**: The code diff shows Swift code changes to adjust UI elements (time label, like button, nickname label) for better readability and layout, aligning with the Figma design.
### Key Observations
- **Design-to-Code Alignment**: The Figma design (middle panel) and code changes (right panel) are tightly coupled, with code implementing design specifications (e.g., typography, layout).
- **UI Refinement**: The code focuses on enhancing time display and preventing UI overlap, indicating a focus on user experience and readability.
- **Content Variation**: The “Before/After” panels show different post content, suggesting a dynamic UI (e.g., new posts or feature updates).
### Interpretation
This diagram demonstrates a typical **design-to-code workflow** for mobile app development:
1. **Design Phase**: Figma is used to create and refine the UI (layers, components, properties), defining visual and structural specifications.
2. **Implementation Phase**: Code changes (Swift) translate the design into functional UI, adjusting layout, typography, and interactions.
3. **UI Evolution**: The “Before/After” shows how post content (images, text, users) changes, likely due to design updates or feature enhancements.
The workflow highlights the importance of design tools (Figma) in defining UI specifications and how code implements those specifications, ensuring consistency between design and development. The code changes focus on improving time display and preventing overlap, emphasizing user experience and readability.
*(Note: All Chinese text is transcribed directly, with English translations provided for clarity.)*
</details>
Figure 2. A concrete example of a SWE-Bench Mobile task (Task 056). The agent must interpret the PRD requirements (replace interaction button with publish time label) and visual design (Figma), locate the relevant files in the codebase (FeedItemFooter.swift), and implement the changes while handling edge cases and feature configuration.
Output ( $\mathcal{O}$ ). The expected output is a unified diff patch that, when applied to the codebase, implements the feature described in the PRD. This format matches the standard pull request workflow used in industry.
Evaluation ( $\mathcal{E}$ ). Each task is paired with a task-specific pytest suite (9.1 tests per task on average) that evaluates the generated patch directly. Concretely, tests operate on the unified diff text without compiling or running the iOS application, and therefore avoid build-time overhead and simulator/device nondeterminism. This patch-level evaluation is designed to verify the presence of necessary UI-facing edits (e.g., view construction, layout logic) and data/logic edits (e.g., control-flow, state updates), while remaining tolerant to superficial variability such as identifier naming, refactoring style, and minor structural reorganization.
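Concretely, one instance of the triplet $\mathcal{T}=(\mathcal{I},\mathcal{O},\mathcal{E})$ could be represented as in the following sketch (the field names are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskInput:
    """The input context I handed to the agent."""
    prd: str                   # Product Requirement Document text
    figma_link: Optional[str]  # present for ~70% of tasks
    repo_snapshot: str         # path to the Git repository checkout

@dataclass
class TaskInstance:
    """One SWE-Bench Mobile instance: T = (I, O, E)."""
    task_id: str
    inputs: TaskInput          # I: PRD, design, codebase
    reference_patch: str       # O: human reference unified diff
    test_files: list[str] = field(default_factory=list)  # E: task-specific pytest suite
```

Because $\mathcal{E}$ operates on the diff text alone, a `TaskInstance` like this is self-contained: evaluation needs neither a repository checkout nor a build environment.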
### 2.2. Design Principles
SWE-Bench Mobile is constructed under guiding principles to ensure relevance to professional software engineering. End-to-End Realism is paramount; tasks span the full engineering process from PRD to testing, preserving real-world dependencies and incomplete specifications. Unlike synthetic benchmarks, our tasks come from actual product development cycles. Multi-Modal Reasoning is required, as agents must jointly interpret textual requirements (PRD), visual designs (Figma), and structured code. Diverse Coverage ensures robustness, with tasks covering multiple categories (Table 3) and difficulty levels, from simple UI adjustments to complex architectural refactoring. Finally, by focusing on Swift/Objective-C, an Under-Represented Language in LLM training data compared to Python or JavaScript, SWE-Bench Mobile serves as a challenging test of an agent’s ability to generalize to less familiar syntax and frameworks.
### 2.3. Dataset Statistics
Table 1 summarizes the key statistics of SWE-Bench Mobile. The benchmark consists of 50 tasks with 449 total test cases. The majority of tasks (70%) include Figma designs, and 92% include reference images, highlighting the multi-modal nature of the dataset. The average PRD length is 450 words, providing substantial context. The codebase scale is significant, with the repository size reaching approximately 5GB.
Table 1. SWE-Bench Mobile dataset statistics.
| Metric | Value |
| --- | --- |
| Total Tasks | 50 |
| Total Test Cases | 449 |
| Avg. Test Cases per Task | 9.1 |
| Tasks with Figma Design | 35 (70%) |
| Tasks with Reference Images | 46 (92%) |
| Avg. PRD Length (words) | 450 |
| Codebase Size | Large Scale ( $\sim$ 5GB) |
| Programming Language | Swift/Objective-C (iOS) |
| Avg. Files Modified per Task | 4.2 |
<details>
<summary>x1.png Details</summary>

### Visual Description
## [Pie Charts]: Task Distribution by Category and Difficulty
### Overview
The image displays two distinct pie charts presented side-by-side on a white background. The charts visualize the same dataset of 50 items, categorized in two different ways: first by functional category, and second by difficulty level. The charts are simple 2D pie charts with labeled segments.
### Components/Axes
**Left Chart: "By Category"**
* **Title:** "By Category" (centered above the chart).
* **Segments & Labels (clockwise from top-right):**
* UI Components (18) - Light orange segment.
* Data Mgmt (10) - Salmon pink segment.
* Gesture & Interaction (8) - Light pink segment.
* Media & Assets (7) - Purple segment.
* Networking (4) - Blue segment.
* Other (3) - Light green segment.
* **Total Count:** 50 (sum of all segment values).
**Right Chart: "By Difficulty"**
* **Title:** "By Difficulty" (centered above the chart).
* **Segments & Labels (clockwise from top-right):**
* Easy (15) - Light green segment.
* Medium (25) - Light orange segment.
* Hard (10) - Salmon pink segment.
* **Total Count:** 50 (sum of all segment values).
**Spatial Grounding:** The "By Category" chart occupies the left half of the image. The "By Difficulty" chart occupies the right half. Labels are placed directly adjacent to their corresponding pie segments, with no separate legend box.
### Detailed Analysis
**By Category Chart Data:**
* **UI Components:** 18 items (36% of total). This is the largest segment.
* **Data Mgmt:** 10 items (20% of total).
* **Gesture & Interaction:** 8 items (16% of total).
* **Media & Assets:** 7 items (14% of total).
* **Networking:** 4 items (8% of total).
* **Other:** 3 items (6% of total). This is the smallest segment.
**By Difficulty Chart Data:**
* **Medium:** 25 items (50% of total). This is the largest segment, representing exactly half of all tasks.
* **Easy:** 15 items (30% of total).
* **Hard:** 10 items (20% of total).
**Trend Verification:**
* In the "By Category" chart, the visual size of the segments corresponds directly to the numerical values, with "UI Components" being the most prominent.
* In the "By Difficulty" chart, the "Medium" segment is visually dominant, occupying half the circle, which matches its value of 25 out of 50.
### Key Observations
1. **Dominant Category:** "UI Components" is the most frequent category, comprising over a third of all items.
2. **Difficulty Skew:** Half of all items are classified as "Medium" difficulty, indicating a central tendency in the dataset's complexity.
3. **Consistent Total:** Both charts sum to 50, confirming they represent the same underlying dataset partitioned differently.
4. **Color Reuse:** The color light orange is used for both "UI Components" (largest category) and "Medium" (largest difficulty), potentially implying a visual association between the most common category and the most common difficulty level. Similarly, salmon pink is used for "Data Mgmt" and "Hard".
### Interpretation
This data suggests a structured set of tasks or components, likely for a software development project, training curriculum, or technical assessment. The primary focus is on **User Interface (UI) development**, as evidenced by the "UI Components" category being the largest. The fact that half the tasks are of **"Medium" difficulty** indicates a balanced design, avoiding an excess of trivial or overly complex items, which is optimal for progressive learning or a sustainable workload.
The cross-categorization allows for insightful analysis. For instance, one could investigate if the 18 "UI Components" tasks are predominantly of "Medium" difficulty, or if the 10 "Hard" tasks are concentrated in categories like "Data Mgmt" or "Networking". The presence of an "Other" category (6%) acknowledges minor, miscellaneous tasks that don't fit the primary classifications. Overall, the charts depict a well-defined and categorized body of work with a clear emphasis on UI and a central difficulty level.
</details>
Figure 3. Task distribution by category (left) and difficulty (right). Each label shows the count, percentage, and average agent pass rate. UI Components (36%) dominate the benchmark, while performance drops sharply from Easy (18.5% pass) to Hard (5.8% pass).
### 2.4. Task Construction
Source. Tasks are derived from real product requirements at XiaoHongShu Inc., a leading social media platform in China with over 300 million monthly active users. Each task represents a feature that was actually implemented by XiaoHongShu engineers in the production iOS application, ensuring realistic complexity and scope. Unlike existing benchmarks that use synthetic problems or isolated bug fixes from open-source repositories, our tasks capture the full complexity of feature development in a commercial mobile application: multi-file changes, UI/UX implementation from design specs, integration with existing business logic, and handling of edge cases and feature flags. This industry-sourced approach ensures that our benchmark reflects the actual challenges faced by software engineers in production environments.
Quality Control. Each task undergoes a rigorous multi-stage review process. First, the PRDs are reviewed to ensure requirements are clear and self-contained. Next, comprehensive test suites are designed to verify both correctness and quality. Finally, we perform human validation to verify that the reference implementation passes all tests.
Difficulty Calibration. Tasks are labeled by implementation complexity based on several factors: the number of files to modify (1-2 for Easy, 3-5 for Medium, 6+ for Hard), the lines of code changed ( $<$ 50 for Easy, 50-150 for Medium, $>$ 150 for Hard), and the architectural complexity, distinguishing between localized changes and cross-module refactoring.
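Under a simplifying assumption (when the file-count and line-count factors disagree, take the harder label; the architectural-complexity factor is judged by humans and omitted here), this calibration can be sketched as:

```python
def calibrate_difficulty(files_modified: int, loc_changed: int) -> str:
    """Label a task Easy/Medium/Hard from the paper's thresholds:
    files: 1-2 Easy, 3-5 Medium, 6+ Hard;
    LOC:   <50 Easy, 50-150 Medium, >150 Hard.
    Assumption: the harder of the two factors wins."""
    def level(x: int, easy_max: int, medium_max: int) -> int:
        if x <= easy_max:
            return 0  # Easy
        return 1 if x <= medium_max else 2  # Medium / Hard

    score = max(level(files_modified, 2, 5), level(loc_changed, 49, 150))
    return ["Easy", "Medium", "Hard"][score]
```

For example, a task touching 4 files with 100 changed lines would be labeled Medium, while a 2-file task with a 200-line change would be pushed to Hard by the LOC factor.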
### 2.5. Evaluation Pipeline
Unlike traditional code benchmarks that rely solely on unit tests, SWE-Bench Mobile performs comprehensive verification through a multi-step pipeline.
Patch-to-Task Routing. SWE-Bench Mobile evaluates submissions as unified diff patches and associates each patch with a specific task. This routing step ensures that each submission is evaluated under the task’s PRD-defined intent and its corresponding test suite, while keeping the evaluation independent of repository checkout, compilation, or runtime execution. In practice, the test harness exposes the patch text to the task-specific tests, enabling purely diff-based verification.
Static Analysis. Before running task-specific assertions, we perform lightweight static checks on the diff text. This includes verifying unified diff structure (e.g., diff --git headers), rejecting empty or near-empty patches, and ensuring that added lines contain meaningful code changes rather than only whitespace or comments. We also check whether the patch touches relevant files using flexible path patterns (e.g., accepting file moves/renames), and apply basic language-agnostic sanity checks to filter malformed submissions early.
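A minimal sketch of such a static check, covering only the structure and non-emptiness rules described above (the flexible path-pattern matching is omitted), might look like:

```python
import re

def static_check(patch: str) -> list[str]:
    """Lightweight sanity checks on a unified diff.
    Returns a list of failure reasons; an empty list means the patch
    passes this stage. The specific rules are an illustrative subset."""
    problems: list[str] = []
    # Structural check: a unified diff should carry a 'diff --git' header.
    if not re.search(r"^diff --git ", patch, re.MULTILINE):
        problems.append("missing 'diff --git' header")
    # Added lines start with '+' but the '+++' file header does not count.
    added = [line[1:] for line in patch.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    # Meaningful content: reject patches that only add whitespace or comments.
    meaningful = [l for l in added
                  if l.strip() and not l.strip().startswith("//")]
    if not meaningful:
        problems.append("no meaningful added lines")
    return problems
```

Running this filter first lets the harness reject empty or malformed submissions cheaply, before any task-specific assertions execute.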
Diff-Based Intent Tests. Direct runtime evaluation for mobile applications is challenging to scale. Unit tests are ill-suited for validating visual correctness, while end-to-end UI testing introduces substantial compilation overhead and environmental nondeterminism. To address these constraints, SWE-Bench Mobile adopts a diff-based evaluation strategy: our pytest suites inspect the patch diff and verify structural intent and architectural compliance. This allows us to evaluate high-level architectural decisions and requirement compliance at scale. Tests are constructed from the PRD and a human reference patch, emphasizing:
- Goal-oriented checks: verifying modification patterns (the “what”) rather than exact code shape.
- Feature entry points: checking integration surfaces (e.g., routing, hooks).
- Removal of blocking behavior: ensuring constraints or legacy guards are lifted.
- Cohesion across files: verifying related edits across modules.
- Semantics-aware matching: using flexible pattern matching to accommodate alternative naming.
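For instance, a goal-oriented, semantics-aware check might be sketched as follows (the helper names and regex are illustrative, not taken from the actual suite):

```python
import re

def added_lines(patch: str) -> list[str]:
    """Extract added lines ('+' prefix) from a unified diff,
    skipping the '+++' file header."""
    return [line[1:] for line in patch.splitlines()
            if line.startswith("+") and not line.startswith("+++")]

def matches_intent(patch: str, pattern: str) -> bool:
    """Goal-oriented check: does any added line match the intent
    pattern? Regex alternation tolerates alternative identifier names,
    so the test verifies the 'what' rather than the exact code shape."""
    rx = re.compile(pattern)
    return any(rx.search(line) for line in added_lines(patch))
```

A test for the task in Figure 2 could then assert `matches_intent(patch, r"[tT]ime\w*Label.*font")`, accepting `timeLabel`, `timestampLabel`, or similar names instead of requiring the reference implementation verbatim.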
Batch Reporting and Error Analysis. Beyond pass/fail decisions, our evaluator produces both task-level and test-case-level summaries. For large-scale runs, we classify failures into coarse categories (e.g., missing critical file edits, missing UI components, empty patches). This analysis provides interpretable diagnostics of common agent failure modes and supports systematic iteration on prompts and agent scaffolding.
Metrics. We report two complementary metrics. Task Success Rate is the percentage of tasks where all tests pass, representing the strict standard for a completed feature. Test Pass Rate is the percentage of individual test cases passed, which reveals partial progress even when the full task is not completed. The gap between these metrics reveals how often agents make partial progress without fully completing tasks.
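The two metrics can be computed from per-task test outcomes as in this sketch (`results` maps task IDs to pass/fail lists; a task with no submitted patch is recorded as all-fail, consistent with the fixed denominators):

```python
def compute_metrics(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Task Success Rate: fraction of tasks where ALL tests pass.
    Test Pass Rate: fraction of individual test cases that pass."""
    n_tasks = len(results)
    n_tests = sum(len(r) for r in results.values())
    task_success = sum(all(r) for r in results.values()) / n_tasks
    test_pass = sum(sum(r) for r in results.values()) / n_tests
    return task_success, test_pass
```

A run where one of two tasks passes fully and the other passes half its tests yields a Task Success Rate of 0.5 but a higher Test Pass Rate, illustrating the partial-progress gap discussed above.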
### 2.6. Comparison with Existing Benchmarks
Table 2 compares SWE-Bench Mobile with existing coding benchmarks. SWE-Bench Mobile distinguishes itself by being multi-modal, including PRDs and Figma designs rather than just code or text descriptions. It operates on a large-scale codebase ( $\sim$ 5GB), significantly larger than the individual repositories or snippets used in other benchmarks. Furthermore, it targets mixed Swift/Objective-C, which is under-represented in training data compared to Python, and focuses on feature implementation rather than bug fixing.
Table 2. Comparison with existing benchmarks.
| Benchmark | Multi-Modal | Codebase | Language |
| --- | --- | --- | --- |
| HumanEval | ✗ | None | Python |
| MBPP | ✗ | None | Python |
| SWE-Bench | ✗ | Medium | Python |
| SWE-Bench Mobile | ✓ | Large | Swift/ObjC |
## 3. Experiments
We evaluate leading coding agents on SWE-Bench Mobile to answer several key research questions. First, we investigate how state-of-the-art coding agents perform on industry-level mobile development tasks (RQ1). Second, we analyze how task complexity affects agent performance (RQ2). Third, we examine the cost-performance trade-off (RQ3). Fourth, we assess the robustness of agent results across multiple runs (RQ4). Finally, we explore how prompt engineering affects performance (RQ5).
### 3.1. Experimental Setup
Agents and Models. We evaluate four coding agents spanning commercial and open-source systems: Cursor, an AI-powered code editor with an agent mode; Codex, OpenAI’s coding agent CLI; Claude Code, Anthropic’s coding agent CLI; and OpenCode, an open-source coding agent. We test these agents with multiple backbone models including Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku, GLM 4.6, GLM 4.7, GPT 5, GPT 5.1, GPT 5.2, and Gemini 3 Pro, yielding 22 agent-model configurations in total.
Metrics. We report two primary metrics: Task Success Rate, which is the percentage of tasks where all test cases pass, and Test Pass Rate, which is the percentage of individual test cases passed. All rates are computed with a fixed denominator of 50 tasks and 449 test cases. When an agent fails to produce a patch for a task (e.g., due to timeout or error), the missing patch is counted as failing all associated tests.
### 3.2. Main Results (RQ1)
Figure 4 presents the main experimental results across all agent-model configurations.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Horizontal Bar Chart: Task Success Rate by Tool-Model Combination
### Overview
This image is a horizontal bar chart comparing the "Task Success Rate (%)" of various combinations of development tools (Cursor, Codex, Claude Code, OpenCode) paired with different AI models (e.g., Opus 4.5, Sonnet 4.5, GLM 4.6, GPT 5). The chart ranks these combinations from highest to lowest success rate.
### Components/Axes
* **Chart Type:** Horizontal Bar Chart.
* **X-Axis:** Labeled "Task Success Rate (%)". The scale runs from 0 to 16, with major tick marks at intervals of 2 (0, 2, 4, 6, 8, 10, 12, 14, 16).
* **Y-Axis:** Lists 22 distinct tool-model combinations. The labels are left-aligned.
* **Legend:** Located in the bottom-right corner of the chart area. It maps colors to four tool categories:
* **Red:** Cursor
* **Blue:** Codex
* **Green:** Claude Code (abbreviated as "CC" in some y-axis labels)
* **Orange:** OpenCode (abbreviated as "OC" in some y-axis labels)
* **Data Labels:** Each bar has its exact percentage value displayed at its right end.
### Detailed Analysis
The chart presents the following data points, ordered from highest to lowest success rate:
**12.0% Success Rate:**
* Cursor + Opus 4.5 (Red bar)
* Cursor + Sonnet 4.5 (Red bar)
* Codex + GLM 4.6 (Blue bar)
**10.0% Success Rate:**
* Codex + Sonnet 4.5 (Blue bar)
* Codex + GPT 5 (Blue bar)
* CC + GLM 4.6 (Green bar)
* CC + Sonnet 4.5 (Green bar)
**8.0% Success Rate:**
* Cursor + GPT 5.2 (Red bar)
* CC + Opus 4.5 (Green bar)
* CC + Haiku (Green bar)
* OC + GLM 4.6 (Orange bar)
**6.0% Success Rate:**
* Cursor + Gemini 3 Pro (Red bar)
* OC + GPT 5.1 (Orange bar)
**4.0% Success Rate:**
* Codex + Opus 4.5 (Blue bar)
* OC + Sonnet 4.5 (Orange bar)
* OC + GLM 4.7 (Orange bar)
* OC + Gemini 3 Pro (Orange bar)
* OC + GPT 5.2 (Orange bar)
**2.0% Success Rate:**
* Cursor + GPT 5.1 (Red bar)
* OC + Opus 4.5 (Orange bar)
* OC + GPT 5 (Orange bar)
**0.0% Success Rate:**
* Codex + GPT 5.1 (Blue bar)
### Key Observations
1. **Performance Ceiling:** The highest observed task success rate is 12.0%, achieved by three different combinations.
2. **Tool Performance Clusters:**
* **Cursor (Red):** Shows a wide performance range, from 12.0% (top) down to 2.0%.
* **Codex (Blue):** Also has a wide range, from 12.0% down to 0.0% (the lowest value on the chart).
* **Claude Code (Green):** Performance is clustered in the middle-to-high range (8.0% - 10.0%).
* **OpenCode (Orange):** Dominates the lower half of the chart, with most results between 2.0% and 8.0%.
3. **Model Performance:** Certain models like Opus 4.5, Sonnet 4.5, and GLM 4.6 appear frequently in the higher-performing combinations across different tools.
4. **Notable Outlier:** The combination "Codex + GPT 5.1" is the only one with a 0.0% success rate.
### Interpretation
This chart provides a comparative benchmark of how effectively different AI-assisted coding toolchains complete tasks. The data suggests that the choice of both the underlying tool (Cursor, Codex, etc.) and the AI model it's paired with significantly impacts success rates.
* **Tool-Model Synergy:** The top results (12.0%) are not exclusive to one tool, indicating that high performance is achievable through different technological paths (e.g., Cursor with Anthropic models, Codex with GLM).
* **Claude Code's Consistency:** The green bars (Claude Code) show no results below 8.0%, suggesting it may offer more reliable, if not always top-tier, performance across the tested models.
* **OpenCode's Lower Tier:** The prevalence of orange bars (OpenCode) in the lower success brackets could indicate it is either less optimized for these specific tasks or is being tested with a different, potentially more challenging, set of models.
* **The GPT 5.1 Anomaly:** The 0.0% result for "Codex + GPT 5.1" is a critical data point. It could signify a severe incompatibility, a configuration error in the test, or that this specific model version fails completely on the evaluated task suite when used with Codex. This warrants further investigation.
In summary, the chart is a valuable snapshot for developers or researchers to identify promising tool-model combinations and avoid potentially ineffective ones for the class of tasks measured.
</details>
Figure 4. Task Success Rate across all configurations. Best performance is 12%, achieved by Cursor + Opus/Sonnet and Codex + GLM.
Key Findings. Our evaluation reveals generally low performance: even the best agents solve only 12% of tasks, indicating a significant gap between current capabilities and industrial requirements. However, the Test Pass Rate (up to 28.1%) is much higher than the Task Success Rate (12%), indicating that agents often make partial progress but fail to complete tasks fully. Notably, the choice of agent matters significantly: the same model (Opus 4.5) achieves 12% on Cursor but only 2% on OpenCode, a 6 $\times$ difference. Commercial agents consistently outperform the open-source OpenCode agent: the best OpenCode configuration (GLM 4.6, 8%) trails the best commercial configuration (12%) by 4 percentage points.
### 3.3. Task Complexity Analysis (RQ2)
We analyze how task complexity affects agent performance. Figure 5 shows the relationship between task complexity (measured by number of files modified and patch size) and success rate.
<details>
<summary>x3.png Details</summary>

### Visual Description
## [Bar Charts with Error Bars]: Performance vs. Change Complexity
### Overview
The image displays two side-by-side bar charts, labeled (a) and (b), which analyze the relationship between software change complexity and task success rate. Both charts use bar heights to represent mean success rates and vertical error bars to indicate variability. The data suggests an inverse relationship: as the complexity of a change increases (either by number of files modified or lines changed), the average task success rate decreases.
### Components/Axes
**Chart (a) - Left:**
* **Title:** (a) Performance vs. Files Modified
* **Y-axis:** Label: "Task Success Rate (%)". Scale: Linear, from -20 to 60, with major ticks at intervals of 20.
* **X-axis:** Label: "Number of Files Modified". Categories: "1-2", "3-4", "5-6", "7+".
* **Data Series:** Blue bars with black error bars.
* **Annotations:** Sample size (`n=`) is written above each bar.
**Chart (b) - Right:**
* **Title:** (b) Performance vs. Patch Size
* **Y-axis:** Label: "Task Success Rate (%)". Scale: Linear, from -10 to 40, with major ticks at intervals of 10.
* **X-axis:** Label: "Lines Changed (Added + Deleted)". Categories: "1-50", "51-100", "101-200", "200+".
* **Data Series:** Green bars with black error bars.
* **Annotations:** Sample size (`n=`) is written above each bar.
### Detailed Analysis
**Chart (a) Analysis:**
* **Trend:** The blue bars show a clear downward trend. The mean task success rate is highest for the smallest changes and decreases monotonically as more files are modified.
* **Data Points (Approximate):**
* **1-2 Files (n=3):** Mean ≈ 18%. Error bar range ≈ -25% to 60%.
* **3-4 Files (n=10):** Mean ≈ 10%. Error bar range ≈ -8% to 28%.
* **5-6 Files (n=5):** Mean ≈ 5%. Error bar range ≈ -14% to 24%.
* **7+ Files (n=11):** Mean ≈ 2%. Error bar range ≈ -6% to 10%.
**Chart (b) Analysis:**
* **Trend:** The green bars also show a clear downward trend. The mean task success rate is highest for the smallest patches and decreases as the number of lines changed increases.
* **Data Points (Approximate):**
* **1-50 Lines (n=10):** Mean ≈ 20%. Error bar range ≈ -5% to 45%.
* **51-100 Lines (n=5):** Mean ≈ 12%. Error bar range ≈ -18% to 40%.
* **101-200 Lines (n=10):** Mean ≈ 6%. Error bar range ≈ -8% to 20%.
* **200+ Lines (n=4):** Mean ≈ 3%. Error bar range ≈ -14% to 19%.
### Key Observations
1. **Consistent Inverse Relationship:** Both metrics of change complexity (files modified and lines changed) correlate with a lower average task success rate.
2. **High Variability:** The error bars are very large relative to the mean values, especially for the lower-complexity categories (1-2 files, 1-50 lines). This indicates a wide spread in outcomes for tasks involving small changes.
3. **Diminishing Returns on Success:** The drop in success rate is most pronounced when moving from the smallest category to the next. The rate of decrease slows for higher complexity categories.
4. **Sample Size Variation:** The number of observations (`n`) varies per category, with the smallest samples in the extreme categories (n=3 for 1-2 files, n=4 for 200+ lines), which may affect the reliability of those specific mean estimates.
### Interpretation
The data demonstrates a clear **complexity penalty** in software development tasks. Tasks that require modifying a larger number of files or a greater volume of code (lines changed) are, on average, less likely to be completed successfully. This aligns with software engineering principles that advocate for small, focused changes to reduce risk and cognitive load.
The **high variability**, particularly for small changes, is a critical finding. It suggests that while small changes have a higher *average* success rate, their outcomes are highly unpredictable—some succeed brilliantly, while others fail significantly. This could be due to factors not captured here, such as the nature of the bug being fixed or the developer's expertise.
From a practical standpoint, this analysis supports strategies like **incremental development** and **pull request scoping**. Keeping changes small (few files, few lines) not only raises the expected success rate but also makes outcomes more predictable (as seen by the slightly tighter error bars for the largest categories). The charts provide empirical evidence that complexity is a key risk factor to manage in software workflows.
</details>
Figure 5. Performance decreases sharply with task complexity. (a) Tasks requiring 1-2 file modifications have 18% success rate vs. 2% for 7+ files. (b) Small patches ( $<$ 50 lines) achieve 20% success vs. 3% for large patches ( $>$ 200 lines). Error bars show 95% confidence intervals based on binomial proportions.
Key Findings. Performance drops sharply as complexity increases. The success rate drops from 18% for tasks requiring 1-2 file modifications to just 2% for tasks requiring 7+ files, suggesting that agents struggle with cross-file reasoning. Similarly, larger patches correlate with lower success, indicating difficulty with complex implementations.
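The 95% intervals in Figure 5 extend below zero for small bins, which is consistent with a normal-approximation (Wald) interval for a binomial proportion; the exact interval construction is an assumption here, sketched as:

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% CI for a binomial proportion;
    it can dip below 0 for small n, as the figure's error bars do."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

# The 1-2 file bin from Figure 5(a): mean 18%, n=3 tasks.
lo, hi = wald_ci(0.18, 3)
print(f"95% CI: [{lo:.0%}, {hi:.0%}]")  # roughly [-25%, 61%]
```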
### 3.4. Model Comparison Across Agents
A surprising finding is that the same model performs very differently across agents. Figure 6 shows this comparison across all four agents.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Grouped Bar Chart: Task Success Rate by Model and Coding Tool
### Overview
This image is a grouped bar chart comparing the "Task Success Rate (%)" of four different AI models (Opus 4.5, Sonnet 4.5, GLM 4.6, GPT 5.1) when used with four distinct coding tools (Cursor, Codex, Claude Code, OpenCode). The chart visualizes performance differences across both models and tools.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **X-Axis (Horizontal):** Labeled "Model". It lists four categorical models:
1. Opus 4.5
2. Sonnet 4.5
3. GLM 4.6
4. GPT 5.1
* **Y-Axis (Vertical):** Labeled "Task Success Rate (%)". It is a linear scale ranging from 0 to 16, with major tick marks at intervals of 2 (0, 2, 4, 6, 8, 10, 12, 14, 16).
* **Legend:** Located in the top-right corner of the chart area. It maps colors to the four coding tools:
* **Red:** Cursor
* **Blue:** Codex
* **Green:** Claude Code
* **Yellow/Orange:** OpenCode
* **Data Labels:** Numerical values are printed directly above most bars, indicating the exact success rate percentage. The label "N/A" is used for missing data points.
### Detailed Analysis
The chart presents the following data points for each model-tool combination. The color of each bar is cross-referenced with the legend.
**1. Model: Opus 4.5**
* **Cursor (Red):** 12%
* **Codex (Blue):** 4%
* **Claude Code (Green):** 8%
* **OpenCode (Yellow):** 2%
**2. Model: Sonnet 4.5**
* **Cursor (Red):** 12%
* **Codex (Blue):** 10%
* **Claude Code (Green):** 10%
* **OpenCode (Yellow):** 4%
**3. Model: GLM 4.6**
* **Cursor (Red):** N/A (No bar present)
* **Codex (Blue):** 12%
* **Claude Code (Green):** 10%
* **OpenCode (Yellow):** 8%
**4. Model: GPT 5.1**
* **Cursor (Red):** 2%
* **Codex (Blue):** N/A (No bar present)
* **Claude Code (Green):** N/A (No bar present)
* **OpenCode (Yellow):** 6%
### Key Observations
* **Highest Performance:** The highest success rate shown is 12%, achieved by three different combinations: Cursor with Opus 4.5, Cursor with Sonnet 4.5, and Codex with GLM 4.6.
* **Tool Consistency:** Cursor shows high variance, performing best with Opus/Sonnet (12%) but worst with GPT (2%) and is not applicable to GLM. OpenCode shows the most consistent, albeit lower, performance across all models (2%, 4%, 8%, 6%).
* **Model Performance:** Sonnet 4.5 and GLM 4.6 show the most consistent high performance across multiple tools. GPT 5.1 shows the lowest overall performance among the models with available data.
* **Missing Data:** There are two "N/A" points: Cursor is not applicable to GLM 4.6, and both Codex and Claude Code are not applicable to GPT 5.1.
### Interpretation
This chart suggests that the effectiveness of a coding assistant is highly dependent on the specific pairing of the underlying AI model and the coding tool interface. There is no single "best" tool or model; performance is contextual.
* **Cursor** appears to be a powerful tool for the Opus and Sonnet model families but is ineffective with GPT 5.1 and incompatible with GLM 4.6 in this test.
* **Codex** performs exceptionally well with GLM 4.6 but poorly with Opus 4.5.
* **Claude Code** delivers solid, middle-of-the-road performance (8-10%) with the three models it was tested on.
* **OpenCode** is a universal but lower-performing option, suggesting it may be a more general-purpose or less specialized tool.
The "N/A" values are critical findings, indicating either a lack of integration between that tool and model or a failure to produce a measurable result in the test scenario. The data implies that users should select their tool based on the specific model they are employing to maximize task success rates.
</details>
Figure 6. Same model, different agents: Opus 4.5 achieves 12% on Cursor but only 2% on OpenCode—a 6 $\times$ gap. Commercial agents consistently outperform the open-source alternative.
Implications. This finding suggests that agent scaffolding (tool use, context management, iteration strategy) is as important as the underlying model capability. The performance gap between commercial agents (Cursor, Codex, Claude Code) and the open-source OpenCode is substantial across all models, suggesting that years of engineering investment in tool integration, context management, and iterative refinement provide significant advantages. Practitioners should evaluate agents holistically rather than focusing solely on model benchmarks.
### 3.5. Performance by Task Category
We analyze how agents perform across different task categories. Figure 7 shows the success rate breakdown.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Heatmap Chart: Task Success Rate by Category and Agent
### Overview
This image is a heatmap chart titled "Task Success Rate by Category and Agent." It visually compares the success rates (in percentages) of four different AI coding agents across five distinct task categories. The chart uses a color gradient to represent the success rate, with darker blues indicating higher percentages and lighter yellows indicating lower percentages.
### Components/Axes
* **Title:** "Task Success Rate by Category and Agent" (centered at the top).
* **Y-Axis (Vertical):** Labeled with five task categories. From top to bottom:
1. UI Components
2. Data Mgmt (Data Management)
3. Gesture
4. Media
5. Network
* **X-Axis (Horizontal):** Labeled with four AI agent names. From left to right:
1. Cursor
2. Codex
3. Claude Code
4. OpenCode
* **Legend/Color Scale:** Positioned on the right side of the chart. It is a vertical color bar labeled "Success Rate (%)". The scale ranges from approximately 4% (light yellow) at the bottom to 14% (dark blue) at the top, with intermediate markers at 6%, 8%, 10%, and 12%.
* **Data Grid:** A 5x4 grid of colored cells, each containing a percentage value representing the success rate for a specific agent-category pair.
### Detailed Analysis
The following table reconstructs the data presented in the heatmap. The color of each cell corresponds to the value, with darker blue for higher success rates.
| Category / Agent | Cursor | Codex | Claude Code | OpenCode |
| :--- | :--- | :--- | :--- | :--- |
| **UI Components** | 14% (Darkest Blue) | 10% (Medium Blue) | 8% (Teal) | 5% (Light Yellow-Green) |
| **Data Mgmt** | 12% (Medium Blue) | 15% (Very Dark Blue) | 11% (Medium Blue) | 7% (Light Green) |
| **Gesture** | 8% (Teal) | 6% (Light Green) | 7% (Light Green) | 3% (Lightest Yellow) |
| **Media** | 10% (Medium Blue) | 8% (Teal) | 9% (Teal) | 4% (Light Yellow) |
| **Network** | 11% (Medium Blue) | 12% (Medium Blue) | 10% (Medium Blue) | 5% (Light Yellow-Green) |
**Trend Verification per Category (Row):**
* **UI Components:** Success rate shows a clear **downward trend** from left to right: Cursor (14%) > Codex (10%) > Claude Code (8%) > OpenCode (5%).
* **Data Mgmt:** Success rate peaks in the second column (Codex) and is generally high: Codex (15%) > Cursor (12%) > Claude Code (11%) > OpenCode (7%).
* **Gesture:** Success rates are low overall, with a slight peak for Cursor: Cursor (8%) > Claude Code (7%) > Codex (6%) > OpenCode (3%).
* **Media:** Success rates are moderate, with a slight peak for Cursor: Cursor (10%) > Claude Code (9%) > Codex (8%) > OpenCode (4%).
* **Network:** Success rates are relatively consistent for the first three agents: Codex (12%) > Cursor (11%) > Claude Code (10%) > OpenCode (5%).
**Trend Verification per Agent (Column):**
* **Cursor:** Performs best in UI Components (14%) and worst in Gesture (8%).
* **Codex:** Achieves the highest single success rate in the chart (15% in Data Mgmt) and performs poorly in Gesture (6%).
* **Claude Code:** Shows relatively consistent performance, ranging from 7% (Gesture) to 11% (Data Mgmt).
* **OpenCode:** Consistently has the lowest success rate in every category, ranging from 3% (Gesture) to 7% (Data Mgmt).
### Key Observations
1. **Highest Performance:** The single highest success rate is **15%**, achieved by **Codex** in the **Data Mgmt** category.
2. **Lowest Performance:** The single lowest success rate is **3%**, achieved by **OpenCode** in the **Gesture** category.
3. **Agent Ranking:** In aggregate across all categories, **Cursor** and **Codex** appear to be the top-performing agents, while **OpenCode** is consistently the lowest-performing agent.
4. **Category Difficulty:** The **Gesture** category appears to be the most challenging overall, with the lowest average success rates. The **Data Mgmt** category shows the highest peak performance.
5. **Color-Value Correlation:** The color scale accurately reflects the numerical values. The darkest blue cell (Codex/Data Mgmt, 15%) is at the top of the scale, and the lightest yellow cell (OpenCode/Gesture, 3%) is below the bottom of the labeled scale (4%).
### Interpretation
This heatmap provides a comparative performance analysis of AI coding agents. The data suggests that task success is highly dependent on both the specific agent and the type of task.
* **Agent Specialization:** **Codex** demonstrates a clear strength in **Data Management** tasks. **Cursor** shows strong, often leading, performance in **UI Components** and **Media**. This implies different agents may be better suited for different development domains.
* **General-Purpose vs. Weak Performance:** **Claude Code** exhibits the most consistent, middle-of-the-road performance, suggesting it may be a more general-purpose agent without extreme strengths or weaknesses in these categories. **OpenCode**'s uniformly low scores indicate it may be less capable or less optimized for these specific task types compared to the others.
* **Task Complexity:** The uniformly low scores in the **Gesture** category suggest that tasks involving gesture recognition or implementation are particularly difficult for current AI coding agents, representing a potential area for future improvement.
* **Practical Implication:** A developer or team choosing an agent for a project heavy on data management might prioritize Codex, while one focused on building user interfaces might lean towards Cursor. For a project involving gesture control, all agents show significant room for improvement, and human oversight would be crucial.
**Language Declaration:** All text within the image is in English. No other languages are present.
</details>
Figure 7. Task Success Rate by Category and Agent. Agents generally perform better on Data Management tasks but struggle with Gesture & Interaction and Media tasks, which require complex multi-modal reasoning.
### 3.6. Cost and Time Analysis (RQ3)
Table 3 presents cost and time metrics for representative configurations. We measure API cost per task and average execution time.
Table 3. Cost and time comparison across all agents. Best value in each column is bold. OpenCode costs are reported via OpenRouter API billing.
| Agent | Model | Cost ($/task) | Time (min) |
| --- | --- | --- | --- |
| Cursor | Opus 4.5 | 3.50 | 15.0 |
| Cursor | Sonnet 4.5 | 2.00 | 14.2 |
| Codex | GLM 4.6 | 1.30 | 13.3 |
| Codex | Sonnet 4.5 | 2.50 | 12.5 |
| CC | GLM 4.6 | 1.30 | 11.7 |
| CC | Sonnet 4.5 | 2.00 | 13.3 |
| CC | Opus 4.5 | 4.00 | 15.0 |
| CC | Haiku | 0.50 | 8.3 |
| OC | Opus 4.5 | 9.33 | 8.2 |
| OC | Sonnet 4.5 | 3.50 | 11.1 |
| OC | GLM 4.6 | 0.13 | 32.5 |
| OC | GLM 4.7 | 0.49 | 52.1 |
| OC | GPT 5 | 0.18 | 9.8 |
| OC | GPT 5.1 | **0.02** | **2.0** |
| OC | GPT 5.2 | 0.04 | 10.9 |
| OC | Gemini 3 Pro | 0.03 | 8.9 |
Key Findings. Among commercial agents, Codex + GLM 4.6 offers the best value, achieving 12% success at only $1.30/task—the same success rate as Cursor + Opus 4.5 but at less than half the cost ($3.50/task). OpenCode exhibits a striking cost–time trade-off: it is dramatically cheaper (GLM 4.6 at $0.13/task vs. $1.30 for Codex/CC), but GLM models run much slower (32–52 min vs. 11–13 min). OpenCode + Opus 4.5 is the most expensive configuration at $9.33/task yet achieves only 2% success, while OpenCode + GPT 5.1 is the cheapest at $0.02/task but completes tasks in only 2 minutes on average—likely because it fails quickly on most tasks (6% success, 7.1% test pass rate).
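One way to make the value comparison concrete (not a metric we report) is the expected cost per solved task, i.e., cost per attempt divided by success rate, using figures from Table 3 and Figure 4:

```python
# (cost per task in $, Task Success Rate) per configuration.
configs = {
    "Codex + GLM 4.6":     (1.30, 0.12),
    "Cursor + Opus 4.5":   (3.50, 0.12),
    "OpenCode + Opus 4.5": (9.33, 0.02),
}
for name, (cost_per_task, success_rate) in configs.items():
    # Expected spend to obtain one fully completed task.
    print(f"{name}: ${cost_per_task / success_rate:.2f} per solved task")
```

Under this normalization Codex + GLM 4.6 (about $11 per solved task) dominates both Cursor + Opus 4.5 (about $29) and OpenCode + Opus 4.5 (over $400).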
### 3.7. Robustness Analysis (RQ4)
To assess result stability, we run selected configurations multiple times. Figure 8 shows the variance across runs.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Robustness: Stability Across Multiple Runs
### Overview
This is a bar chart comparing the task success rate and stability (measured by standard deviation) of two different systems or methods across multiple runs. The chart visually contrasts the mean performance and consistency of "CC + Opus 4.5" versus "Codex + Opus 4.5".
### Components/Axes
* **Title:** "Robustness: Stability Across Multiple Runs" (centered at the top).
* **Y-Axis:** Labeled "Task Success Rate (%)". The scale runs from 0 to 12, with major tick marks at intervals of 2 (0, 2, 4, 6, 8, 10, 12).
* **X-Axis:** Contains two categorical labels:
* Left: "CC + Opus 4.5"
* Right: "Codex + Opus 4.5"
* **Legend:** Located in the top-right corner. It defines three graphical elements:
* A gray filled rectangle: "Mean Success Rate"
* An open circle (○): "Individual Run"
* A dark gray line (—): "Standard Deviation"
* **Data Series:** Two bars, each with associated annotations and markers.
* **Left Bar (Green):** Represents "CC + Opus 4.5".
* **Right Bar (Blue):** Represents "Codex + Opus 4.5".
### Detailed Analysis
**1. CC + Opus 4.5 (Green Bar, Left Side):**
* **Mean Success Rate (μ):** Annotated in green text to the right of the bar as **μ=6.7%**. The bar height corresponds to this value.
* **Standard Deviation (σ):** Annotated in gray text below the mean as **σ=1.15**. A vertical error bar (dark gray line) extends from approximately 5.55% to 7.85% (6.7% ± 1.15%), visually representing this spread.
* **Individual Runs:** Two open circle markers are plotted.
* One circle is positioned at approximately **6.0%** on the y-axis, aligned with the top of the bar.
* A second circle is positioned at approximately **8.0%** on the y-axis, at the top of the error bar.
* **Trend/Verification:** The bar shows a moderate success rate with noticeable variability, as confirmed by the spread of the two individual data points and the non-zero standard deviation.
**2. Codex + Opus 4.5 (Blue Bar, Right Side):**
* **Mean Success Rate (μ):** Annotated in blue text to the right of the bar as **μ=4.0%**. The bar height corresponds to this value.
* **Standard Deviation (σ):** Annotated in gray text below the mean as **σ=0.00**. There is no visible error bar, consistent with a standard deviation of zero.
* **Individual Runs:** Two open circle markers are plotted.
* Both circles are positioned at exactly **4.0%** on the y-axis, overlapping at the top of the bar.
* **Trend/Verification:** The bar shows a lower success rate than the green bar but perfect consistency, as all individual runs achieved the exact same result (4.0%), resulting in zero standard deviation.
### Key Observations
1. **Performance vs. Stability Trade-off:** The system with the higher mean success rate ("CC + Opus 4.5", 6.7%) exhibits variability (σ=1.15). The system with the lower mean ("Codex + Opus 4.5", 4.0%) exhibits perfect stability (σ=0.00).
2. **Individual Run Distribution:** For "CC + Opus 4.5", the two recorded runs are at the extremes of the standard deviation range (one at the mean, one at the upper bound). For "Codex + Opus 4.5", the runs are identical and equal to the mean.
3. **Visual Clarity:** The use of color (green vs. blue) and clear annotations for μ and σ makes the comparison straightforward. The legend accurately maps the graphical elements (bar, circle, line) to their statistical meaning.
### Interpretation
This chart demonstrates a classic engineering or machine learning trade-off between **peak performance** and **reliability/consistency**.
* **"CC + Opus 4.5"** appears to be a more capable system on average, achieving a higher task success rate. However, its performance is not guaranteed; it can vary significantly between runs (from ~6% to 8% in this sample). This suggests it may be sensitive to initial conditions, random seeds, or other stochastic elements in its process.
* **"Codex + Opus 4.5"** is a less capable but highly predictable system. It reliably produces the same, lower success rate every time. This could be desirable in scenarios where consistency and predictability are more critical than achieving the highest possible success rate, or where the cost of a failed run is high.
The data suggests that the choice between these two systems depends entirely on the application's priorities. If maximizing the chance of a high success rate is paramount and some failure is acceptable, "CC + Opus 4.5" is preferable. If guaranteed, predictable performance is required, even at a lower level, "Codex + Opus 4.5" is the better choice. The chart effectively argues that "robustness" (stability) and "performance" (success rate) are distinct metrics that must be evaluated together.
</details>
Figure 8. Result stability across multiple runs. Error bars indicate standard deviation. While Claude Code shows moderate variance ( $\sigma$ =1.15%), the absolute fluctuation is small ( $\pm$ 1 task), indicating that agent performance is relatively stable.
Observations. We observe moderate variance for Claude Code + Opus 4.5, with scores of 6%, 8%, and 6% across 3 runs ( $\mu$ =6.7%, $\sigma$ =1.15%). In contrast, Codex + Opus 4.5 is perfectly stable at 4% across runs.
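The reported statistics follow directly from the three Claude Code run scores, using the sample standard deviation:

```python
import statistics

cc_runs = [6, 8, 6]  # Task Success Rate (%) in each of the 3 runs
mu = statistics.mean(cc_runs)
sigma = statistics.stdev(cc_runs)  # sample standard deviation
print(f"mu={mu:.1f}%, sigma={sigma:.2f}")  # mu=6.7%, sigma=1.15
```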
### 3.8. Prompt Engineering (RQ5)
We conduct a systematic ablation study with 12 prompt variants using Claude Code + GLM 4.6. Table 4 shows the results.
Table 4. Prompt ablation results. Best and worst highlighted. Full prompts in Appendix C.
| Prompt Strategy | Task (%) | Test (%) |
| --- | --- | --- |
| Defensive Programming | 10.0 | 26.7 |
| Quality Focused | 10.0 | 26.3 |
| Example Driven | 10.0 | 23.4 |
| Chain of Thought | 10.0 | 21.8 |
| Baseline | 10.0 | 19.3 |
| Explicit Instructions | 8.0 | 17.8 |
| Figma Emphasis | 8.0 | 18.0 |
| Test Driven | 6.0 | 22.0 |
| Detailed Role | 4.0 | 20.7 |
| Structured Checklist | 4.0 | 20.7 |
| Context Rich | 4.0 | 22.7 |
| Comprehensive | 4.0 | 22.7 |
Key Findings. The “Defensive Programming” prompt strategy performs best, improving the Test Pass Rate by 7.4 percentage points over the baseline (19.3% $\rightarrow$ 26.7%) while maintaining the same Task Success Rate (10.0%). This indicates that while both prompts complete the same number of tasks fully, Defensive Programming handles edge cases better in partially-completed tasks, passing significantly more individual test cases. This suggests that emphasizing defensive coding practices helps agents avoid common pitfalls even when they cannot complete all requirements. Interestingly, complexity appears to hurt performance: overly detailed prompts reduce Task Success from 10.0% to 4.0%. Overall, prompts focusing on code quality outperform those emphasizing workflow.
### 3.9. Error Analysis
We categorize failure modes across all experiments by analyzing test failure messages from the best-performing agents. The most critical failure pattern is Missing Feature Flags (54%), where agents implement core functionality but fail to add proper feature toggles or experiment flags—a standard practice in production mobile development for gradual rollout and A/B testing. Missing Data Models (22%) occurs when agents fail to create or update data structures required by the PRD. Missing Files (11-15%) represents cases where agents identify some but not all required files to modify. Missing UI Components (11-15%) captures failures to implement specific UI elements like buttons, labels, or views. Missing Required Methods (9%) reflects incomplete class implementations. While Incomplete Multi-File Implementation affects only 4-7% of tasks, it disproportionately impacts complex features requiring coordination across 5+ files. The dominance of feature flag failures highlights a gap between agents’ code generation capabilities and their understanding of production deployment practices.
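The categorization above was performed by analyzing test failure messages; a minimal sketch of how such keyword-based bucketing might look. The category names come from the analysis above, but the regex rules and example messages are illustrative assumptions, not the actual procedure:

```python
import re
from collections import Counter

# Illustrative keyword rules (assumptions); the real analysis inspected
# failure messages from the best-performing agents.
CATEGORIES = {
    "Missing Feature Flags": r"feature.?flag|experiment|toggle",
    "Missing Data Models": r"data model|missing model|struct",
    "Missing Files": r"file not (found|modified)",
    "Missing UI Components": r"button|label|view not",
}

def categorize(failure_message: str) -> str:
    """Assign a failure message to the first matching category."""
    for name, pattern in CATEGORIES.items():
        if re.search(pattern, failure_message, re.IGNORECASE):
            return name
    return "Other"

failures = [
    "Expected experiment flag for gradual rollout, none found",
    "Required data model CardRefInfo not created",
]
print(Counter(categorize(m) for m in failures))
```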
## 4. Discussion and Conclusion
Our evaluation reveals a significant gap between current agent capabilities and the demands of industrial mobile development, with the best configurations achieving only a 12% success rate. This shortfall, primarily driven by failures in cross-file reasoning and requirement understanding, underscores that autonomous software engineering remains an open challenge.
Implications. For practitioners, our results suggest that agents should currently be viewed as “copilots” requiring human oversight rather than autonomous developers. The high variance in performance across agents for the same model (e.g., Cursor 12% vs. OpenCode 2% for Opus 4.5) highlights the critical role of agent scaffolding—practitioners should evaluate the complete system, not just the underlying model. The consistent gap between commercial and open-source agents suggests that engineering investment in tool integration and context management provides significant practical value. Furthermore, cost-effective models like GLM 4.6 can match the performance of expensive frontier models when paired with effective agent frameworks, offering a viable path for scalable adoption. For researchers, the sharp performance drop on complex, multi-file tasks (18% vs. 2%) points to a need for better code context retrieval and graph-based reasoning. The 25% failure rate due to requirement misunderstanding calls for improved grounding of natural language PRDs into code. Additionally, the under-utilization of visual designs suggests that future work must better integrate multi-modal signals into the coding loop.
Future Work. We plan to expand SWE-Bench Mobile along several dimensions. First, we will add Android (Kotlin) tasks to enable cross-platform comparison and investigate whether agents exhibit consistent strengths and weaknesses across mobile ecosystems. Second, we will integrate simulator-based runtime evaluation to verify UI rendering, gesture handling, and state management—aspects that text-based diff inspection cannot capture. Third, we aim to evaluate additional open-source agents like OpenHands and SWE-Agent, and open-weight models like Qwen-Coder, to further broaden the benchmark’s coverage. Finally, we plan to develop a public API for continuous evaluation, allowing agent providers to track their progress over time as both models and scaffolding improve.
In conclusion, SWE-Bench Mobile provides a rigorous testbed for the next generation of coding agents. While current performance is modest, the benchmark offers a clear roadmap for advancing agents from simple script generation to complex, industry-level software development.
## 5. Related Work
### 5.1. Code Generation Benchmarks
Early benchmarks for code generation focused on algorithmic problem-solving. HumanEval (Chen et al., 2021) introduced 164 hand-crafted Python programming problems with unit tests, becoming a standard evaluation for code LLMs. MBPP (Austin et al., 2021) expanded this with 974 crowd-sourced problems. While influential, these benchmarks test isolated function generation rather than realistic software engineering.
SWE-Bench (Jimenez et al., 2024) marked a significant advance by evaluating agents on real GitHub issues from popular Python repositories. Agents must understand issue descriptions, navigate codebases, and generate patches that pass existing tests. The benchmark has since evolved into a family of tasks, including SWE-bench Multimodal (Yang et al., 2025a), which incorporates visual elements such as screenshots and diagrams to test visual software domains; SWE-bench Multilingual (Zan et al., 2025; Yang et al., 2025b), which expands evaluation to 9 programming languages beyond Python; and SWE-bench Pro (Deng et al., 2025), which introduces longer-horizon instances and includes proprietary/commercial codebases. Even with these extensions, many existing benchmarks still derive tasks from GitHub issue and pull-request artifacts, which more often emphasize bug fixing and localized improvements than new feature implementation from high-level specifications.
Other benchmarks target specific domains: DS-1000 (Lai et al., 2023) for data science, ODEX (Wang et al., 2022) for open-domain execution, and ClassEval (Du et al., 2024) for class-level generation. DevBench (Li et al., 2024) evaluates repository-level coding but still focuses on Python.
SWE-Bench Mobile differs from these benchmarks in several key aspects: (1) multi-modal inputs including PRDs and Figma designs, (2) a large-scale production codebase (approx. 5GB), (3) mixed Swift/Objective-C target languages, and (4) feature implementation rather than bug fixing.
### 5.2. Coding Agents
The emergence of powerful LLMs has enabled a new generation of autonomous coding agents. These systems go beyond simple code completion to perform multi-step reasoning, tool use, and iterative refinement.
Commercial agents include GitHub Copilot (Microsoft), Cursor (Anysphere), Claude Code (Anthropic), and Codex CLI (OpenAI). These agents integrate with development environments and can navigate codebases, run tests, and iterate on solutions.
Open-source agents have emerged as alternatives. OpenCode provides a terminal-based coding agent supporting multiple LLM backends. SWE-Agent (Yang et al., 2024) introduces an agent-computer interface optimized for software engineering. AutoCodeRover (Zhang et al., 2024b) combines code search with LLM reasoning. Agentless (Xia et al., 2024) shows that simpler approaches without complex agent loops can be competitive. CodeAgent (Zhang et al., 2024a) uses a repository-level code graph for navigation.
Our work provides a challenging benchmark for evaluating both commercial and open-source agents on industry-level tasks, revealing significant gaps in current capabilities and the importance of agent scaffolding.
### 5.3. Multi-Modal Code Understanding
Recent work has explored combining visual and textual information for code-related tasks. Design2Code (Si et al., 2024) evaluates generating code from webpage screenshots. Screenshot2Code systems convert UI designs to implementation.
SWE-Bench Mobile extends this direction by incorporating Figma designs as part of the input specification, requiring agents to reason about visual layouts alongside textual requirements.
### 5.4. Prompt Engineering for Code
Prompt engineering significantly impacts LLM performance on coding tasks. Chain-of-thought prompting (Wei et al., 2022) improves reasoning. Self-debugging (Chen et al., 2023) enables iterative refinement. Structured prompts with role definitions and examples often outperform simple instructions.
Our ablation study (Section 3.8) systematically evaluates 12 prompt strategies, finding that “Defensive Programming” prompts emphasizing edge cases outperform both simple baselines and complex multi-step prompts.
## Limitations
Platform Scope. SWE-Bench Mobile focuses on a single production iOS codebase from XiaoHongShu, which ensures depth and realism but limits generalization to other mobile platforms (Android, cross-platform frameworks like Flutter/React Native) and programming paradigms. The Swift/Objective-C mixed-language codebase, while representative of many large iOS projects, may not capture challenges unique to Kotlin-based Android development or cross-platform toolchains.
Evaluation Methodology. Our evaluation uses text-based diff inspection rather than runtime execution, which means we validate structural correctness and architectural compliance but cannot detect issues that only manifest during runtime interactions, on specific devices, or under particular OS versions. Future work should integrate simulator-based testing to capture dynamic behaviors such as UI rendering, memory management, and concurrency issues.
Prompt and Model Coverage. Our prompt ablation study covers one agent-model configuration (Claude Code + GLM 4.6) and 12 prompt variants. While this provides insights into prompt sensitivity, different models may respond differently to these strategies. Additionally, API costs reported are based on pricing at experiment time and may vary with different prompting strategies or model updates.
Benchmark Scale. The benchmark’s 50 tasks, while derived from real product development, represent a snapshot of mobile development challenges and may not cover all possible feature types (e.g., real-time communication, payment integration, accessibility features) or edge cases encountered in production. We plan to continuously expand the task set to improve coverage.
## Ethics Statement
The tasks and codebase in SWE-Bench Mobile are derived from XiaoHongShu Inc. with explicit permission for research use. The codebase snapshot excludes sensitive credentials and business logic. Human validation was performed by the authors and XiaoHongShu engineers; no crowdworkers were employed.
Our work evaluates AI agents for software engineering tasks. Current performance (12% task success rate) indicates that human oversight remains essential. We view these agents as assistive tools rather than replacements for human developers. Practitioners should use comprehensive testing and code review when deploying AI-generated code, as emphasized by our benchmark’s evaluation approach.
## References
- Atlassian (2024) Atlassian. 2024. How to Write a Product Requirements Document (PRD). https://www.atlassian.com/agile/product-management/requirements.
- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021).
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021).
- Chen et al. (2023) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug. arXiv preprint arXiv:2304.05128 (2023).
- Deng et al. (2025) Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv preprint arXiv:2509.16941 (2025).
- Du et al. (2024) Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation. In International Conference on Machine Learning.
- Jimenez et al. (2024) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world GitHub Issues?. In The Twelfth International Conference on Learning Representations.
- Lai et al. (2023) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. In International Conference on Machine Learning.
- Li et al. (2024) Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Xiong, and Karthik Narasimhan. 2024. DevBench: A Comprehensive Benchmark for Software Development. arXiv preprint arXiv:2403.08604 (2024).
- Si et al. (2024) Chenglei Si, Yanzhe Li, Zhengyuan Jiang, Xinyang Liu, Zheng Lu, Yuqing Jiang, Yong Liu, Yu Wang, Yujiu Yuan, Lydia Liu, et al. 2024. Design2Code: How Far Are We From Automating Front-End Engineering? arXiv preprint arXiv:2403.03163 (2024).
- Wang et al. (2022) Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. 2022. Execution-Based Evaluation for Open-Domain Code Generation. arXiv preprint arXiv:2212.10481 (2022).
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- Xia et al. (2024) Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based Software Engineering Agents. arXiv preprint arXiv:2407.01489 (2024).
- Yang et al. (2024) John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv preprint arXiv:2405.15793 (2024).
- Yang et al. (2025a) John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ofir Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, et al. 2025a. SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?. In The Thirteenth International Conference on Learning Representations.
- Yang et al. (2025b) John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025b. SWE-smith: Scaling Data for Software Engineering Agents. arXiv preprint arXiv:2504.21798 (2025).
- Zan et al. (2025) Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. 2025. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. arXiv:2504.02605 [cs.SE] https://arxiv.org/abs/2504.02605
- Zhang et al. (2024a) Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024a. CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. arXiv preprint arXiv:2401.07339 (2024).
- Zhang et al. (2024b) Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024b. AutoCodeRover: Autonomous Program Improvement. arXiv preprint arXiv:2404.05427 (2024).
## Appendix A Task Examples
We present two representative tasks from SWE-Bench Mobile to illustrate the benchmark format. Each task includes a Product Requirement Document (PRD) with design specifications, translated from the original Chinese used by the development team.
### A.1. Task 003: Custom Emoji Limit Adjustment
Difficulty: Easy. Files to Modify: 3. Test Cases: 5.
Adjust Custom Emoji Collection Limit
Background. The current custom emoji (saved stickers) limit is hardcoded to 300 on the client side. As user demand grows, we need to increase this limit to better serve our users.

Requirements.

1. Increase limit: Change from 300 to 999
2. Update UI prompts: Adjust warning messages to reflect the new limit
3. Server-driven config: Remove hardcoded values; future changes should not require app updates
4. Comprehensive coverage: Apply to all emoji-saving scenarios (chat, comments, etc.)

Competitor Analysis.

| App | Emoji Limit |
| --- | --- |
| WeChat | 999 |
| Douyin (TikTok) | 599 |
| Kuaishou | 158 |
Design Mockups See Figure 9 for the original design specifications provided to developers.
Evaluation Criteria
- Hardcoded limit (300) removed or increased to $\geq$ 450
- New limit (999) properly configured
- Server-driven configuration implemented
- Changes applied across multiple files
- Non-empty, meaningful code changes
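Evaluation for a task like this inspects the generated patch text rather than executing it (see Appendix E). A minimal sketch of what pattern-matching checks for the first two criteria might look like; the helper names, regexes, and the sample patch are illustrative assumptions, not the benchmark's actual test code:

```python
import re

def hardcoded_limit_removed(patch: str) -> bool:
    # A removed diff line ("-") containing the old hardcoded limit.
    return any(re.match(r"^-.*\b300\b", line)
               for line in patch.splitlines())

def new_limit_configured(patch: str) -> bool:
    # An added diff line ("+") referencing the new limit.
    return any(re.match(r"^\+.*\b999\b", line)
               for line in patch.splitlines())

# Hypothetical patch fragment for illustration only.
patch = """\
--- a/EmojiConfig.swift
+++ b/EmojiConfig.swift
-    let maxCustomEmoji = 300
+    let maxCustomEmoji = config.int("emoji_limit", default: 999)
"""
assert hardcoded_limit_removed(patch) and new_limit_configured(patch)
```

Such structural checks validate that the right lines changed, which is why runtime-only defects are out of scope (a limitation discussed above).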
### A.2. Task 007: Card Message Click Decoupling
Difficulty: Medium. Files to Modify: 5. Test Cases: 5.
iOS Card Reference Click Decoupling
Background. Card messages have been added to the app. While most iOS code is decoupled from the messaging module, the click logic for card message references remains coupled in AppChatBaseViewController. This task decouples the click handling for better maintainability.

Architecture Design. Abstract the click logic into CardRefBaseProvider. The view controller should find the concrete implementation based on card type, following the provider pattern.

Implementation Sketch.
```swift
@objc(AppRefMessageDataService)
public class AppRefMessageDataService: NSObject {
    var chatType: String?
    var chatId: String?
    var senderId: String?
    var messageId: String?
}
```
Impact Scope: Shopping card, Advertisement card
Evaluation Criteria
- New AppRefMessageDataService class created
- Click handling moved out of AppChatBaseViewController
- Provider pattern correctly implemented
- Shopping and advertisement card handling works
- No regression in existing functionality
## Appendix B Complete Experimental Results
Table 5. Complete evaluation results on SWE-Bench Mobile. Task Success measures the percentage of tasks where all test cases pass (out of 50 tasks). Test Pass measures the percentage of individual test cases passed (out of 449 tests). Best results per agent in bold.
| Agent | Model | Task Success (%) | Test Pass (%) | Cost ($/task) | Time (min) |
| --- | --- | --- | --- | --- | --- |
| Cursor | Claude Opus 4.5 | 12.0 | 28.1 | 3.50 | 15.0 |
| | Claude Sonnet 4.5 | 12.0 | 26.7 | 2.00 | 14.2 |
| | GPT-5.2 | 8.0 | 27.4 | 1.80 | 20.0 |
| | Gemini 3 Pro | 6.0 | 23.2 | 1.00 | 12.5 |
| | GPT-5.1 | 2.0 | 19.6 | 1.10 | 14.2 |
| Codex | GLM-4.6 | 12.0 | 19.6 | 1.30 | 13.3 |
| | Claude Sonnet 4.5 | 10.0 | 28.1 | 2.50 | 12.5 |
| | GPT-5 | 10.0 | 21.4 | 1.50 | 10.0 |
| | Claude Opus 4.5 | 4.0 | 20.7 | 3.50 | 14.2 |
| | GPT-5.1 | 0.0 | 7.1 | 1.00 | 13.3 |
| Claude Code | GLM-4.6 | 10.0 | 26.7 | 1.30 | 11.7 |
| | Claude Sonnet 4.5 | 10.0 | 24.7 | 2.00 | 13.3 |
| | Claude Opus 4.5 | 8.0 | 21.8 | 4.00 | 15.0 |
| | Claude Haiku | 8.0 | 18.3 | 0.50 | 8.3 |
| OpenCode | GLM-4.6 | 8.0 | 17.8 | 0.13 | 32.5 |
| | GPT-5.1 | 6.0 | 7.1 | 0.02 | 2.0 |
| | Claude Sonnet 4.5 | 4.0 | 14.7 | 3.50 | 11.1 |
| | GLM-4.7 | 4.0 | 14.3 | 0.49 | 52.1 |
| | Gemini 3 Pro | 4.0 | 13.4 | 0.03 | 8.9 |
| | GPT-5.2 | 4.0 | 12.0 | 0.04 | 10.9 |
| | Claude Opus 4.5 | 2.0 | 12.0 | 9.33 | 8.2 |
| | GPT-5 | 2.0 | 12.0 | 0.18 | 9.8 |
### B.1. Cross-Agent Model Comparison
Table 6 reveals that the same model can perform very differently across agents, highlighting the importance of agent design.
Table 6. Same model, different agents: Task Success Rate (%). The gap between best and worst agent can be as large as 6 $\times$ .
| Model | Cursor | Codex | CC | OpenCode | Gap |
| --- | --- | --- | --- | --- | --- |
| Opus 4.5 | 12 | 4 | 8 | 2 | 6 $\times$ |
| Sonnet 4.5 | 12 | 10 | 10 | 4 | 3 $\times$ |
| GLM-4.6 | — | 12 | 10 | 8 | 1.5 $\times$ |
| GPT-5.1 | 2 | 0 | — | 6 | $\infty$ |
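The Gap column is the ratio of the best to the worst available score per model, with $\infty$ when the worst score is zero. A minimal sketch reproducing it from Table 6 (missing agent-model entries are simply omitted):

```python
import math

# Task Success (%) per agent, from Table 6.
results = {
    "Opus 4.5":   {"Cursor": 12, "Codex": 4, "CC": 8, "OpenCode": 2},
    "Sonnet 4.5": {"Cursor": 12, "Codex": 10, "CC": 10, "OpenCode": 4},
    "GLM-4.6":    {"Codex": 12, "CC": 10, "OpenCode": 8},
    "GPT-5.1":    {"Cursor": 2, "Codex": 0, "OpenCode": 6},
}

def gap(scores: dict) -> float:
    """Best-to-worst ratio; infinite when the worst agent scores 0."""
    vals = list(scores.values())
    worst = min(vals)
    return math.inf if worst == 0 else max(vals) / worst

print({model: gap(s) for model, s in results.items()})
# Opus 4.5 -> 6.0, Sonnet 4.5 -> 3.0, GLM-4.6 -> 1.5, GPT-5.1 -> inf
```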
## Appendix C Prompt Templates
We designed 12 prompt variants for the ablation study. Below we present the key prompts. All prompts share a common structure: role definition, task description, and output format. The differentiating factor is the emphasis placed on different aspects.
### C.1. Best Prompt: Defensive Programming
P10: Defensive Programming (Best)
“You are a senior iOS engineer known for writing robust, production-ready code. Implement the feature with a focus on defensive programming and edge case handling. Don’t just implement the happy path. Think about everything that could go wrong:

- Empty data, nil values, invalid formats
- Very long/short text, different screen sizes
- Slow network, timeouts, concurrent operations
- First-time user, offline mode, low memory

Your code should handle all of this gracefully without crashing.”
### C.2. Baseline Prompt
P1: Baseline
“You are an iOS developer. Read the PRD carefully and implement the required changes. Generate a unified diff patch that can be applied to the codebase.”
### C.3. Worst Performing Prompts
P12: Comprehensive (Worst)
“You are a senior iOS engineer. Before implementing:

1. Analyze the PRD thoroughly
2. Identify all affected files
3. Plan your implementation strategy
4. Consider edge cases
5. Review the Figma design
6. Check for existing patterns
7. Implement with tests in mind
8. Validate against requirements

Generate a complete, production-ready patch.”
Why Comprehensive Failed: The overly detailed checklist appears to overwhelm the model, causing it to focus on process rather than actual implementation. Simpler, focused prompts consistently outperform complex ones.
### C.4. Other Notable Prompts
P7: Chain of Thought. Asks the model to “think step by step” before coding. Achieved 10% Task Success but a lower Test Pass Rate (21.8%) than Defensive Programming.

P9: Figma Emphasis. Emphasizes matching the Figma design exactly. Surprisingly underperformed (8% Task Success), possibly because many tasks do not require UI changes.

P11: Test Driven. Asks the model to “think about what tests would verify your implementation.” Achieved only 6% Task Success despite the intuitive appeal of test-driven thinking.
## Appendix D Dataset Statistics
We provide detailed statistics of the SWE-Bench Mobile dataset in Table 7. The benchmark consists of 50 tasks with varying levels of complexity, involving multi-modal inputs (PRDs and Figma designs) and a large-scale production codebase. The tasks are designed to cover a wide range of mobile development scenarios, ensuring a comprehensive evaluation of agent capabilities.
Table 7. SWE-Bench Mobile dataset statistics.
| Metric | Value |
| --- | --- |
| Task Composition | |
| Total Tasks | 50 |
| Tasks with Figma Design | 35 (70%) |
| Tasks with Reference Images | 46 (92%) |
| Task Complexity | |
| Avg. PRD Length (words) | 450 |
| Avg. Test Cases per Task | 9.1 |
| Total Test Cases | 449 |
| Avg. Files to Modify | 4.2 |
| Codebase | |
| Programming Language | Swift/Objective-C (iOS) |
| Codebase Size | $\sim$ 500K LoC |
## Appendix E Reproducibility
Environment. All experiments were conducted on macOS 14.x with:
- Cursor: v2.3 with Agent mode enabled
- Codex: OpenAI Codex CLI v0.77.0
- Claude Code: Anthropic Claude Code CLI v2.1.37
- OpenCode: v1.1.44 (open-source coding agent)
Model API Configuration.
For reproducibility, we specify the exact API endpoints and configurations used:
- GPT Models (GPT 5, 5.1, 5.2): Accessed via Microsoft Azure OpenAI API with default temperature and top-p settings
- Claude Models (Opus 4.5, Sonnet 4.5, Haiku): Accessed via Google Vertex AI API for Anthropic models
- Gemini 3 Pro: Accessed via Google Vertex AI API with standard configuration
- GLM Models (GLM 4.6, 4.7): Used GLM Coding Plan with default agent scaffolding
Multi-Modal Input Handling.
To handle Figma designs and reference images, we configured Model Context Protocol (MCP) integrations:
- Vision-capable models (GPT, Claude, Gemini): Used official Figma MCP to directly access design specifications
- GLM Models: Since GLM 4.6 is not a native vision model, we used the official GLM Vision MCP to process images and Figma designs, converting visual inputs into structured descriptions for the text-only model
Evaluation Pipeline.
1. Load generated patch file as text
2. Run task-specific pytest test suite (tests inspect the patch diff text using pattern matching and structural analysis)
3. Record pass/fail status for each test case
4. Aggregate results across all 50 tasks
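The aggregation step distinguishes the two headline metrics: Task Success counts only tasks whose entire suite passes, while Test Pass pools individual test cases. A minimal sketch with hypothetical per-task counts (the function name and result format are illustrative, not the released toolkit):

```python
def aggregate(results: list[dict]) -> dict:
    """Aggregate per-task pass counts into the two benchmark metrics.

    Each entry: {"task": id, "passed_tests": int, "total_tests": int}.
    """
    n_tasks = len(results)
    # Task Success: fraction of tasks where every test case passed.
    task_success = sum(r["passed_tests"] == r["total_tests"]
                       for r in results) / n_tasks
    # Test Pass: passed test cases over all test cases (449 overall).
    test_pass = (sum(r["passed_tests"] for r in results)
                 / sum(r["total_tests"] for r in results))
    return {"task_success_pct": round(100 * task_success, 1),
            "test_pass_pct": round(100 * test_pass, 1)}

# Example: one fully solved task (5/5) and one partial task (2/4).
print(aggregate([
    {"task": "003", "passed_tests": 5, "total_tests": 5},
    {"task": "007", "passed_tests": 2, "total_tests": 4},
]))
# {'task_success_pct': 50.0, 'test_pass_pct': 77.8}
```

This asymmetry is why a prompt can raise Test Pass Rate without moving Task Success, as observed in the ablation study (Section 3.8).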
Availability and Hosted Evaluation. The SWE-Bench Mobile benchmark is derived from a proprietary production codebase with permission from XiaoHongShu Inc. Due to the confidential nature of the source code and product requirements, the full dataset cannot be publicly released. We view this constraint as a feature rather than a limitation: by keeping the test set private, we eliminate the risk of data contamination—a well-known issue with public benchmarks where test instances may leak into LLM training corpora (Jimenez et al., 2024).
SWE-Bench Mobile is designed as a standardized evaluation platform for coding agent providers and foundation model vendors. We host a public leaderboard at https://swebenchmobile.com where agent companies (e.g., Cursor, Codex, Claude Code) and model providers (e.g., OpenAI, Anthropic, Google, Zhipu AI) can submit their systems for evaluation against our held-out industrial test suite. This provides an objective, contamination-free comparison on real-world mobile development tasks that complements existing Python-centric benchmarks. Submission guidelines and evaluation configurations are available at https://github.com/realtmxi/mobile-bench.
## Appendix F Task Design Mockups
Figure 9 shows the design mockups provided to agents for Task 003 (Custom Emoji Limit). These real-world screenshots demonstrate the user pain point and expected UI behavior that agents must understand to implement the feature correctly.
<details>
<summary>figures/fig_mockup_001.jpg Details</summary>

### Visual Description
## [Screenshot]: Social Media Post Query about Emoji Pack Limits
### Overview
This image is a screenshot from the Chinese social media platform Xiaohongshu (Little Red Book). It displays a user's post asking for help with a technical issue, alongside the platform's interface showing the post's engagement and comments. The image is split into two primary regions: a left panel featuring the main query in large text, and a right panel showing the standard social media post view.
### Components/Axes
The image is divided into two main vertical sections:
1. **Left Panel (Query Highlight):** A stylized, off-white notepad graphic with a blue tape accent in the top-left corner. It contains the user's core question in large, black Chinese characters. Two phrases are highlighted with a yellow, wavy underline.
2. **Right Panel (Social Media Interface):** A standard post view containing:
* **Header:** User profile picture, username ("一条鱼🦋"), and a red "关注" (Follow) button.
* **Post Body:** Title, main text, a hashtag, and a timestamp.
* **Engagement Metrics:** A comment count indicator.
* **Comment Section:** A scrollable list of user comments, each with a profile picture, username, comment text, timestamp, and interaction buttons (like, reply).
* **Footer Bar:** A text input field ("说点什么..." / "Say something...") and aggregate engagement icons with counts for likes, favorites, comments, and shares.
### Detailed Analysis
**Left Panel - Transcribed Text:**
* **Main Text:** "我想问一下,要是小红书表情包上线太多,无法添加表情,该怎么办?"
* **English Translation:** "I want to ask, if there are too many emoji packs on Xiaohongshu and you can't add more emojis, what should you do?"
* **Highlighted Phrases (Yellow Underline):**
1. "小红书" (Xiǎo hóng shū / Xiaohongshu)
2. "表情包上线太多" (biǎo qíng bāo shàng xiàn tài duō / emoji packs online too many)
**Right Panel - Post Details:**
* **Username:** 一条鱼🦋 (A Fish 🦋)
* **Post Title:** "表情包上限太多了" (The emoji pack upper limit is too high)
* **Post Body:** "我想问一下,要是小红书表情包上线太多,无法添加表情,该怎么办? #用一句话迎接五月"
* **English Translation:** "I want to ask, if there are too many emoji packs on Xiaohongshu and you can't add more emojis, what should you do? #UseOneSentenceToWelcomeMay"
* **Post Date:** 2024-05-04
* **Comment Count:** "共 40 条评论" (Total 40 comments)
**Right Panel - Comment Section (Visible Comments):**
1. **Comment 1:**
* **Username:** 憶宅^.
* **Text:** "我没有表情包但是显示表情包上线太多🍑"
* **English Translation:** "I don't have emoji packs but it shows 'too many emoji packs online'🍑"
* **Date:** 02-08
* **Interactions:** 9 likes, 5 replies.
2. **Comment 1 - Author Reply:**
* **Username:** 一条鱼🦋 作者 (Author)
* **Text:** "泥...🙈"
* **English Translation:** "You...🙈" (An expression of surprise or speechlessness)
* **Date:** 02-08
* **Interactions:** 3 likes, with a "回复" (Reply) button.
* **Note:** "展开 4 条回复" (Expand 4 replies) is visible below.
3. **Comment 2:**
* **Username:** 治愈宝宝 朱丽叶吉安娜 (Healing Baby Juliet Gianna)
* **Text:** "有300个吗,根本不够用"
* **English Translation:** "Are there 300? It's not enough at all."
* **Date:** Not fully visible.
* **Note:** This comment includes a small, partially visible image thumbnail.
**Right Panel - Footer Engagement Metrics:**
* **Likes:** ♡ 50
* **Favorites:** ☆ 10
* **Comments:** 💬 40
* **Shares:** ↗ (Icon only, count not visible)
### Key Observations
* **Problem Statement:** The core issue is a perceived or actual limit on the number of emoji packs/stickers a user can have active or saved on Xiaohongshu, preventing the addition of new ones.
* **Highlighted Emphasis:** The left panel graphically emphasizes the platform name ("小红书") and the specific problem ("表情包上线太多"), framing it as the central topic.
* **Community Response:** The visible comments show users experiencing similar issues ("I don't have emoji packs but it shows...") and others questioning the limit's sufficiency ("Are there 300? It's not enough...").
* **Platform Context:** The post is tagged with a timely hashtag (#用一句话迎接五月 - May 1st related), indicating it was posted during a specific social media trend or event.
### Interpretation
This screenshot captures a common user experience issue within a feature-rich social media app. The data suggests a potential point of friction for active users who engage heavily with the platform's expressive tools (emoji/sticker packs). The user's query, amplified by the graphic on the left, seeks a technical solution or workaround. The comments reveal that this is not an isolated incident; at least one other user reports the same error message even without many packs, hinting at a possible bug or a very low default limit. The author's reply ("You...🙈") suggests shared frustration or surprise at the problem's occurrence. The post's engagement (50 likes, 40 comments) indicates it resonates with a segment of the user base. Ultimately, the image documents a user-generated support request that highlights a potential limitation in the platform's resource management for user-generated content like emoji packs.
</details>
(a) User Complaint. Social media post showing frustration with the 300-emoji limit: “Xiaohongshu’s emoji limit is too high, I can’t add more emojis.”
<details>
<summary>figures/fig_mockup_002.jpg Details</summary>

### Visual Description
## [Screenshot]: Social Media Post Regarding Feature Request on Xiaohongshu (Little Red Book)
### Overview
The image is a screenshot of a social media post, likely from the platform Xiaohongshu (Little Red Book). The post is a user's plea for the platform to remove the limit on how many stickers (表情包) a user can save to their collection. The screenshot is composed of two main visual sections: a large, stylized text overlay on the left and the actual social media post interface on the right.
### Components/Axes
The image is divided into two primary regions:
1. **Left Region (Text Overlay):** A large, bold text graphic on a light gray background.
* **Text:** "小红书能不能取消表情包收藏上限呀?" (Can Xiaohongshu remove the sticker collection limit?).
* **Annotation:** The first three characters, "小红书" (Xiaohongshu), are circled with a thick, hand-drawn orange line, emphasizing the platform being addressed.
2. **Right Region (Social Media Post Interface):** This contains the full post, user information, and interaction elements.
* **Header:**
* **User Avatar:** A circular profile picture of a man.
* **Username:** Not fully visible, but the avatar is present.
* **Button:** A red "关注" (Follow) button in the top-right corner.
* **Post Body:**
* **Main Text:** A paragraph of Chinese text detailing the user's frustration.
* **Hashtags:** A series of blue, clickable hashtags at the end of the post body.
* **Metadata:** The post date "02-14" and location "广西" (Guangxi).
* **Comments Section:**
* **Header:** "共 2 条评论" (Total 2 comments).
* **Comment 1:**
* **Commenter Avatar:** A circular image of a cartoon duck.
* **Commenter Username:** "鱼啊鱼你係边" (Fish, oh fish, where are you?).
* **Comment Content:** An image meme (sticker) of a screaming cat with overlaid text.
* **Comment Metadata:** "4 天前" (4 days ago) and location "广东" (Guangdong).
* **Interaction Buttons:** "赞" (Like) and "回复" (Reply) icons.
* **Footer Interaction Bar:**
* **Input Field:** "说点什么..." (Say something...).
* **Interaction Counts & Icons:**
* Heart icon with count "9" (Likes).
* Star icon labeled "收藏" (Favorite/Save).
* Speech bubble icon with count "2" (Comments).
* Share icon.
### Detailed Analysis / Content Details
**1. Transcription of Primary Post Text:**
* **Original Chinese:**
> 小红书能不能取消表情包收藏上限呀?
> 遇到喜欢的表情包我就收藏,看一个见一个爱一个,爱一个收一个,逐渐达到表情包收藏上限。看到新的喜欢的表情包,还是忍不住收藏,但是他已经达到上限了,每次我都只能忍痛放弃,实在喜欢的话就去表情包收藏里面挑挑拣拣的删掉一个相比之下不怎么喜欢的。心痛,哦,太心痛了😭#表情包 #小红书 #收藏上限 #表情包分享 #分享 #喜欢 #停不下来
* **English Translation:**
> Can Xiaohongshu remove the sticker collection limit?
> When I see a sticker I like, I save it. I see one, fall in love with one, love one, save one, and gradually reach the sticker collection limit. When I see a new sticker I like, I still can't help but save it, but I've already hit the limit. Every time, I have to painfully give up. If I really like it, I have to go into my sticker collection, pick and choose, and delete one that I like comparatively less. It's heartbreaking, oh, so heartbreaking 😭 #stickers #Xiaohongshu #collectionlimit #stickersharing #sharing #like #cantstop
**2. Transcription of Comment Meme Text:**
* **Image Text (Chinese):** "发疯" (Going crazy / Freaking out) at the top. "啊!啊!啊!!" (Ah! Ah! Ah!!) at the bottom.
* **Visual:** A white cat with its mouth wide open, appearing to scream or roar, with a small flame effect near its mouth.
**3. Key Data Points & Metadata:**
* **Post Date:** 02-14 (February 14th).
* **Post Location:** 广西 (Guangxi, China).
* **Comment Count:** 2.
* **Like Count (for the post):** 9.
* **Comment Age:** 4 days old (at time of screenshot).
* **Comment Location:** 广东 (Guangdong, China).
### Key Observations
* **Emotional Tone:** The post expresses strong frustration and "heartache" (心痛) over a perceived platform limitation. The use of the crying emoji (😭) and the dramatic language ("忍痛放弃" - painfully give up) underscores this.
* **User Behavior Pattern:** The user describes a compulsive collecting behavior ("看一个见一个爱一个" - see one, fall in love with one) that is directly thwarted by the platform's technical limit.
* **Community Engagement:** The comment section features a highly expressive meme that visually amplifies the original poster's sentiment of frustration ("发疯" - going crazy), indicating shared understanding or empathy from other users.
* **Platform Context:** The hashtags (#小红书, #收藏上限) categorize the post within discussions about the platform's features and user experience pain points.
### Interpretation
This screenshot captures a specific user experience (UX) friction point on the Xiaohongshu platform. The data suggests that the **sticker collection limit is perceived as an arbitrary barrier that disrupts a core user behavior**—the spontaneous and emotional curation of content (stickers) that users find delightful or relatable.
* **The Problem:** The limit creates a "scarcity mindset" within a digital collection, forcing users into a tedious management task (deleting less-liked items) that detracts from the joy of discovery and saving. This is a classic conflict between **user desire for unlimited personalization** and **platform resource management** (likely server storage or app performance considerations).
* **The Emotional Impact:** The user's detailed description and the supportive, exaggerated meme in the comments highlight that this is not merely an inconvenience but a source of genuine annoyance. It transforms a positive interaction (finding a fun sticker) into a negative one (having to make painful choices).
* **Broader Implication:** For platform designers, this post is a direct piece of user feedback indicating that the current limit may be set too low for power users or enthusiastic collectors. It raises questions about whether the limit is a necessary technical constraint or an artificial one that could be adjusted, made more flexible (e.g., allowing users to earn more slots), or removed entirely to improve user satisfaction and engagement. The post itself, and the meme-driven comment, are forms of **user-led advocacy for a feature change**.
</details>
(b) Community Feedback. Another user asking “Can Xiaohongshu remove the emoji collection limit?” showing widespread user demand.
<details>
<summary>figures/fig_mockup_003.jpg Details</summary>

### Visual Description
## [Screenshot: Social Media Post & Sticker Management Interface]
### Overview
The image is a split-screen screenshot of a Chinese social media platform (likely Xiaohongshu) interface. The left side shows a sticker/emoji management interface with a limit-related warning, and the right side displays a user post complaining about sticker quantity restrictions, along with its comment section.
### Components/Sections
#### Left Section (Sticker Management Interface)
- **Header Area**: There is a text input field with the prompt “有话要说 快来评论” (Have something to say, come comment). A black warning bubble overlays it, containing the text “表情包已达上限,无法添加” (Sticker pack has reached the limit, cannot add).
- **Sticker Category Icons**: Below the header, there are icons representing different sticker packs, such as a house-shaped icon, a smiley-face icon, a heart-shaped icon, an icon with the character “好” (good), a peach-shaped icon, a cow-shaped icon, and an icon labeled “年度表情包” (Annual Sticker Pack).
- **“已添加的表情” (Added Stickers) Section**: This section presents a grid of stickers. There is a “+” button (for adding more stickers, but it is disabled due to the limit). The stickers include various types: memes (e.g., a cartoon girl with a boy, a photo of two people with the text “恨一行 干一行” — “hate the job, do the job anyway”), text-based stickers (e.g., “造成破防 100%” — “100% emotional damage”, “贴贴!” — “Cuddles!”, “老公” — “Husband”, “老公我爱你” — “Husband, I love you”, “我日,攻击性好强” — “Damn, so aggressive”), and character-based stickers (e.g., Jerry from *Tom & Jerry* holding a book).
#### Right Section (Xiaohongshu Post & Comments)
- **Author & Follow Button**: The author is “耳机陆” (Earphone Lu), and there is a red “关注” (Follow) button.
- **Post Title**: “小红书能不能别限制表情包数量啊” (Can Xiaohongshu stop limiting the number of stickers?).
- **Post Content**: “我微信都900+表情包,小红书还得挑挑删删才能加新的 #表情包 #表情包分享 #搞笑表情包 #斗图表情包 #小红书表情包 #记录生活 #爱小红书爱官方爱官方推我上热门” (I have 900+ stickers on WeChat, but on Xiaohongshu I have to pick and delete to add new ones. Hashtags: #StickerPack #StickerSharing #FunnyStickers #MemeStickers #XiaohongshuStickers #RecordLife #LoveXiaohongshuLoveOfficialLoveOfficialPushMeToHot).
- **Date**: 2024-08-31.
- **Comment Section**: There are 24 comments in total. The first comment is from “有心人(接好运版🍀)” (Thoughtful person, good-luck edition🍀): “把你的表情包都分享给我 谢谢🥺” (Share all your stickers with me, thanks🥺), dated 2024-08-31, with 20 likes. The author replies “微信见” (See you on WeChat) and includes a sticker of Jerry from *Tom & Jerry* holding a book.
- **Bottom Interaction Bar**: There is a comment input field “说点什么...” (Say something...), along with like (6 likes), collect, comment (24 comments), and share icons.
### Detailed Analysis
- **Sticker Limit Warning**: The black bubble on the left clearly indicates that the sticker pack has reached its limit and no more stickers can be added, which is the core issue the post is addressing.
- **Sticker Variety**: The added stickers cover a wide range, including memes, text-based, and character-based ones, reflecting users' diverse needs for self-expression through stickers.
- **Social Interaction in Comments**: A user requests stickers, and the author directs them to WeChat, showing that users often use other platforms to share stickers when the current platform has restrictions.
- **Hashtag Usage**: The post uses multiple hashtags to increase its visibility, which is a common content-promotion strategy on Xiaohongshu.
### Key Observations
- **UI-Driven Pain Point**: The sticker limit warning in the UI directly shows the user's frustration with the platform's restriction.
- **Sticker Sharing Behavior**: The comment-section interaction reveals that users are eager to share stickers but are limited by the platform, so they turn to other apps like WeChat.
- **Content Engagement Strategy**: The use of hashtags and the complaint-based tone of the post are designed to attract users who face the same issue, thereby increasing engagement (likes, comments, shares).
### Interpretation
- **User Frustration with Platform Limitation**: The post and the UI screenshot highlight a significant user pain point. Xiaohongshu's sticker limit is much more restrictive than WeChat's (where the author has 900+ stickers), forcing users to curate their sticker collection (delete old ones to add new ones), which is inconvenient.
- **Alternative Sharing Channels**: The comment asking for stickers and the author's reply suggest that users are willing to use other platforms (like WeChat) to share stickers when the current platform's restrictions are too tight.
- **Platform Design Consideration**: The sticker limit might be a design choice by Xiaohongshu, but it is causing user dissatisfaction. This could potentially impact user experience and engagement if not addressed.
- **Content Resonance**: The post's complaint-based tone and use of hashtags aim to resonate with other users facing the same sticker-limit issue, which can help the post gain more engagement (likes, comments, shares) as it taps into a common user frustration.
</details>
(c) Emoji Collection UI. The sticker collection interface with warning dialog “Emoji limit reached, cannot add more.” Agents must increase this limit from 300 to 999.
Figure 9. Design mockups for Task 003 (Custom Emoji Limit). These mockups are provided to agents as part of the PRD to guide implementation. They show real user complaints about the 300-emoji limit and the current UI that needs modification.
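The change Task 003 asks for might, in a simplified form, look like the following Swift sketch. All identifiers here are hypothetical illustrations; the paper does not disclose the actual symbols in the production codebase, only that the limit must be raised from 300 to 999 and that the UI shows a “limit reached” warning.

```swift
// Hypothetical sketch of the Task 003 change (all names are illustrative,
// not taken from the real Xiaohongshu codebase).
struct EmojiCollectionConfig {
    // Previously 300; the task requires raising the cap to 999.
    static let maxCustomEmojis = 999
}

enum EmojiCollectionError: Error {
    case limitReached(max: Int)
}

final class EmojiCollection {
    private(set) var emojiIDs: [String] = []

    // Adding an emoji past the cap throws, which a view layer could map
    // to the toast "Emoji limit reached, cannot add more."
    func add(_ emojiID: String) throws {
        guard emojiIDs.count < EmojiCollectionConfig.maxCustomEmojis else {
            throw EmojiCollectionError.limitReached(
                max: EmojiCollectionConfig.maxCustomEmojis)
        }
        emojiIDs.append(emojiID)
    }
}
```

In a real mixed Swift/Objective-C codebase, such a limit may also be duplicated in server-side validation and in Objective-C constants, which is part of what makes even this “simple” task non-trivial for agents.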
## Appendix G Qualitative Analysis of Agent Outputs
We present detailed analyses of agent-generated patches to provide insights into both successful implementations and common failure modes. These examples illustrate the practical challenges agents face when implementing features from PRDs and Figma designs in a production codebase.
Table 8. Successful implementation by Cursor + GPT-5.2 on Task 007 (Medium difficulty). The agent correctly created the AppRefMessageDataService class with all required fields and methods, demonstrating strong architectural understanding.
| Task Context |
| --- |
| Difficulty: Medium Files to Modify: 5 Category: Architecture Refactoring |
| Agent: Cursor + GPT-5.2 Result: ✓ PASS (5/5 tests) |
| Problem Statement (Summary) |
| Decouple card message click handling from AppChatBaseViewController by abstracting logic into CardRefBaseProvider. The click logic for card references should be moved to a new AppRefMessageDataService class following the provider pattern. |
| Key Requirements |
| • Create AppRefMessageDataService class with fields: chatType, chatId, senderId, messageId, sender, innerContentDict • Move click handling out of view controller • Implement provider pattern for different card types • Support shopping and advertisement cards |
| Generated Patch (Key Excerpts) |