# Technical Document Extraction: OSWorld Agent Framework
This document provides a comprehensive technical extraction of the provided image, which illustrates a framework for an AI agent interacting with an operating system environment to perform complex tasks.

---
## 1. Task Instruction Examples (Top Section)
The top half of the image displays two distinct task scenarios, each visualized through a three-step sequence showing the agent's interaction with a GUI.
### Task Instruction 1: Bookkeeping Update
**Text Prompt:** "Update the bookkeeping sheet with my recent transactions over the past few days in the provided folder."
**Visual Sequence:**
1. **Initial State:** A spreadsheet application (LibreOffice Calc) is open, titled "Bookkeeping simple".
2. **Action:** The agent navigates to a folder containing image files of physical receipts (e.g., a receipt showing the number "112").
3. **Final State:** The spreadsheet is updated with new rows.
* **Data Table Extraction (Initial State):**
| Description | Category | Type | Amount | Balance |
| :--- | :--- | :--- | :--- | :--- |
| Office Supplies Purchase | Office Supplies | Expense | -150 | 850 |
| Client Payment Received | Sales | Income | 500 | 1350 |
| Internet Bill | Utilities | Expense | -60 | 1290 |
| Freelance Services | Services | Income | 300 | 1590 |
| Rent Payment | Rent | Expense | -700 | 890 |
| Software Subscription | Software | Expense | -100 | 790 |
* **Data Table Extraction (Updated State):** New entries are added below the initial list with amounts such as -36.93, -5.7, -154.06, and -8.1, and the running balance is updated for each new row.
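
The Balance column is a running total of the Amount column, which can be checked with a short sketch. The opening balance of 1000 is an assumption inferred from the first row (850 + 150), and the four new amounts are only the examples visible in the image, so the updated sheet may contain more rows.

```python
from decimal import Decimal

# Amounts from the initial-state table, in row order.
initial_amounts = ["-150", "500", "-60", "300", "-700", "-100"]
# Example amounts read from the updated sheet; the image may contain more rows.
new_amounts = ["-36.93", "-5.7", "-154.06", "-8.1"]

# Assumption: opening balance inferred from the first row (850 - (-150) = 1000).
OPENING_BALANCE = Decimal("1000")


def running_balances(opening, amounts):
    """Return the balance after each transaction, in order."""
    balances = []
    balance = opening
    for amount in amounts:
        balance += Decimal(amount)
        balances.append(balance)
    return balances


initial_balances = running_balances(OPENING_BALANCE, initial_amounts)
updated_balances = running_balances(initial_balances[-1], new_amounts)
print(initial_balances)
print(updated_balances)
```

The first list reproduces the Balance column of the table above. If the four example amounts were the only new rows, the last balance would be 585.21.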
### Task Instruction 2: Coding Assistance
**Text Prompt:** "...some details about snake game omitted... Could you help me tweak the code so the snake can actually eat the food?"
**Visual Sequence:**
1. **Initial State:** VS Code is open with a Python file (`snake.py`). A game window shows a snake moving but not interacting with food.
2. **Action:** The agent modifies the Python code in the editor. Dotted lines indicate the agent's focus moving between the code logic and the file explorer.
3. **Final State:** The game window shows the snake successfully approaching/consuming a green food pixel.
---
## 2. System Architecture Diagram (Bottom Section)
This section describes the flow of information between the user, the AI agent, and the simulated environment.
### Components and Flow
1. **Input Phase:**
    * **Task Instruction:** The user provides a natural language prompt (e.g., the examples above).
    * **Task Initial State Setup Config:** This configuration prepares the environment based on the instruction.
2. **The Agent:**
    * **Label:** "Agent (e.g., GPT-4V)"
    * **Function:** Receives the "input" and interacts with the environment through a loop of observation and action.
3. **OSWorld Environment (Virtual Machine):**
    * **OS Support:** Icons for Apple (macOS), Ubuntu (Linux), and Windows.
    * **Arbitrary Apps:** Icons for Google Chrome, VLC Media Player, GIMP, Microsoft Excel, VS Code, and Microsoft PowerPoint.
    * **Interfaces:** Represented by "GUI" and "Terminal" icons.
4. **Interaction Loop (Observation & Action):**
    * **Observation (Environment -> Agent):**
        * `screenshot` (Camera icon)
        * `a11y-tree` (Accessibility tree/Code icon)
    * **Action (Agent -> Environment):**
        * `mouse` (Clicking/Movement)
        * `keyboard` (Typing)
    * The agent "predicts" these actions based on the observation.
5. **Evaluation Phase:**
    * **get env state:** The system retrieves the final state of the VM.
    * **Final State:** The resulting condition of the OS after agent actions.
    * **Execution-based Evaluation:** The final step where the success of the task is measured based on the actual state change in the environment.
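
The flow above (setup config, observation, predicted action, final state) can be sketched as a small loop. This is a toy illustration only: `ToyEnv`, `ScriptedAgent`, `run_episode`, and the `task` dictionary are hypothetical names invented for this sketch, not the framework's actual API, and the "environment" is a single text field rather than a virtual machine.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    screenshot: bytes  # raw image bytes of the screen (empty in this toy)
    a11y_tree: str     # serialized accessibility tree


@dataclass
class Action:
    kind: str          # "mouse", "keyboard", or "done"
    payload: dict = field(default_factory=dict)


class ToyEnv:
    """Hypothetical stand-in for the VM: one text field the agent can type into."""

    def __init__(self, setup_config):
        # The setup config defines the task's initial state.
        self.text = setup_config.get("initial_text", "")

    def observe(self):
        tree = f'<textbox value="{self.text}"/>'
        return Observation(screenshot=b"", a11y_tree=tree)

    def step(self, action):
        if action.kind == "keyboard":
            self.text += action.payload["text"]

    def get_state(self):
        return {"text": self.text}


class ScriptedAgent:
    """Hypothetical stand-in for a model: types the target text, then stops."""

    def __init__(self, instruction):
        self.instruction = instruction

    def predict(self, obs):
        if "hello" in obs.a11y_tree:
            return Action("done")
        return Action("keyboard", {"text": "hello"})


def run_episode(env, agent, max_steps=10):
    """Alternate observation and action, then return the final environment state."""
    for _ in range(max_steps):
        obs = env.observe()
        action = agent.predict(obs)
        if action.kind == "done":
            break
        env.step(action)
    return env.get_state()


task = {
    "instruction": "Type hello into the text box.",
    "setup_config": {"initial_text": ""},
}
env = ToyEnv(task["setup_config"])
agent = ScriptedAgent(task["instruction"])
final_state = run_episode(env, agent)
print(final_state)
```

The loop returns the environment's state rather than the agent's own report, which is the input an execution-based evaluation would inspect.
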
---
## 3. Technical Summary of Trends and Logic
* **Multimodal Interaction:** The framework relies on both visual (screenshots) and structural (a11y-tree) data to understand the OS state.
* **Cross-Application Workflow:** Task 1 demonstrates the agent's ability to move data between different applications and data types, reading amounts from receipt images in a folder and entering them into a spreadsheet.
* **Closed-Loop Execution:** The "Execution-based Evaluation" indicates that the system does not just check if the agent *said* it finished, but verifies if the files/code in the VM were actually altered correctly.
* **Spatial Grounding:** The dotted lines in the task sequences appear to trace the agent's focus or cursor path between on-screen elements, which suggests that mouse actions are grounded in specific screen coordinates for clicks and drags.
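
To make the closed-loop point concrete, the following minimal sketch checks the final state of a bookkeeping sheet instead of trusting the agent's claim. It assumes the sheet has been exported as CSV text with an `Amount` column. The `evaluate` function, the `Receipt` description, and the sample data are hypothetical and are not the framework's actual evaluator.

```python
import csv
import io


def evaluate(final_state_csv, expected_amounts):
    """Return True if the Amount column ends with the expected new rows."""
    if not expected_amounts:
        raise ValueError("expected_amounts must not be empty")
    rows = list(csv.DictReader(io.StringIO(final_state_csv)))
    amounts = [float(row["Amount"]) for row in rows]
    return amounts[-len(expected_amounts):] == expected_amounts


# Hypothetical final states: one where nothing changed, one with new rows.
unchanged = "Description,Amount\nSoftware Subscription,-100\n"
updated = unchanged + "Receipt,-36.93\nReceipt,-5.7\n"
expected = [-36.93, -5.7]

# The agent may report success in both cases; only the state decides.
print(evaluate(unchanged, expected))
print(evaluate(updated, expected))
```

The check fails on the unchanged sheet and passes on the updated one, regardless of what the agent reported.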