## Technical Diagram: AI/Robotics System Comparison
### Overview
The image is a technical diagram spanning two task environments: **ALFWorld** (a text-based household environment, left) and **Franka Kitchen** (a robotic manipulation environment, right). The ALFWorld panel contrasts a baseline method (ReAct) with a proposed method (RAP, labeled "Ours"), while the Franka Kitchen panel illustrates a vision-language model pipeline for robotic task execution. The diagram uses text, arrows, and colored boxes to explain components, data flow, and memory-augmented learning.
### Components/Axes (Diagram Structure)
#### Left Panel: ALFWorld
- **ReAct (Top-Left):**
- A box labeled "Current Task" with text: *"You are in the middle of room. ~~ Task: put a hot tomato on desk Act: ..."*
- A box labeled "Manual Examples" with text: *"Task: put a hot apple in fridge. Act: go to countertop Obs: you see ~ ..."*
- An arrow labeled **"ICL"** (In-Context Learning) points from "Manual Examples" to "Current Task".
- **RAP (Ours, Bottom-Left):**
- A box labeled "Current Task" with text: *"You are in the middle of room. ~~ Task: put a hot tomato on desk Act: think: I need to find a tomato. Obs: OK. Act: go to fridge. ... Act: think: Next, I need to heat it Obs: OK. Act: heat tomato with microwave. Obs: you heat the tomato."*
- A cylinder labeled **"Memory (Past Experiences)"** contains two example tasks:
- *"Task: put a tomato in fridge Act: go to fridge Obs: you see tomato,~ ..."*
- *"Task: put a hot apple in fridge Act: heat mug with microwave. Obs: you heat mug ~."*
- A **"Retriever"** arrow (orange) points from "Current Task" to "Memory", and an **"ICL"** arrow (blue) points from the retrieved "Memory" examples back to "Current Task".
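The retrieve-then-prompt loop depicted for RAP can be sketched in Python. This is a minimal illustration, not the paper's implementation: the `Experience` dataclass, the toy bag-of-letters `embed`, and `build_icl_prompt` are all hypothetical names, and a real retriever would use a learned text encoder.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    task: str         # e.g. "put a hot apple in fridge"
    trajectory: str   # the interleaved Act/Obs log stored in memory

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding so the sketch is self-contained;
    # a real retriever would use a learned text encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory: list[Experience], task: str, k: int = 2) -> list[Experience]:
    # Orange "Retriever" arrow: rank stored experiences by task similarity.
    q = embed(task)
    return sorted(memory, key=lambda e: cosine(q, embed(e.task)), reverse=True)[:k]

def build_icl_prompt(memory: list[Experience], current_task: str) -> str:
    # Blue "ICL" arrow: retrieved experiences become in-context examples.
    shots = "\n\n".join(f"Task: {e.task}\n{e.trajectory}"
                        for e in retrieve(memory, current_task))
    return f"{shots}\n\nTask: {current_task}\nAct:"
```

With a memory holding the two example tasks above, `build_icl_prompt(memory, "put a hot tomato on desk")` produces a prompt whose in-context examples are the most similar stored trajectories, in contrast to ReAct's fixed manual examples.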
#### Right Panel: Franka Kitchen
- **Current Task (Top-Left of Right Panel):**
- A dashed box with text: *"Task: Open a microwave door Actions: 1. Move the robot arm to the microwave"* (accompanied by a small image of a robot arm).
- **Memory (Past Experience, Orange Cylinder):**
- Contains two example tasks (with robot images):
- *"Task: Open a microwave door Actions: 1. Move the robot arm to the microwave"*
- *"Task: Open a cabinet door Actions: 1. Move the robot arm to the cabinet door. 2. Grip the handle"*
- **Visual Observation (Green Box):**
- Connected to "Current Task" (green arrow) and to a **"+"** symbol (combining inputs).
- **Text Prompt (Blue Box):**
- Text: *"Text Prompt: Task Information + Trajectory Retrieved Memory"*
- Connected to "Memory" (blue arrow) and to the **"+"** symbol.
- **Vision Language Model (Yellow Box):**
- Receives input from the **"+"** (Visual Observation + Text Prompt) and outputs to "Plan of Actions".
- **Plan of Actions (Teal Box):**
- Text: *"To complete the task, 1. Move robot arm to microwave 2. ... 3. ..."*
- **Policy Network (Red Box):**
- Receives "Plan of Actions" and outputs **"Low-level Actions"** to "Updated Observation, Task, Actions".
- **Updated Observation, Task, Actions (Gray Box):**
- Text: *"Task: Open microwave door Actions: 1. Move robot arm to microwave"* (accompanied by a robot image).
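The right-panel data flow (visual observation + text prompt → vision-language model → plan of actions → policy network → low-level actions) can be summarized as a control loop. The following is a minimal sketch assuming hypothetical callables for the camera, the VLM, and the policy network; none of these names come from the diagram.

```python
from typing import Any, Callable

def run_episode(task: str,
                retrieved_memory: list[str],
                capture_observation: Callable[[], Any],
                vlm: Callable[[Any, str], list[str]],
                policy: Callable[[Any, str], str],
                max_steps: int = 10) -> list[str]:
    # Blue box: text prompt = task information + trajectory retrieved from memory.
    prompt = f"Task: {task}\nRetrieved memory:\n" + "\n".join(retrieved_memory)
    executed: list[str] = []
    for _ in range(max_steps):
        obs = capture_observation()        # green box: visual observation
        plan = vlm(obs, prompt)            # yellow -> teal: plan of actions
        if not plan:                       # empty plan: task considered done
            break
        action = policy(obs, plan[0])      # red box: low-level action
        executed.append(action)
        # gray box: updated observation/task/actions feed the next iteration
        prompt += f"\nExecuted: {action}"
    return executed
```

The loop mirrors the diagram's closed cycle: each low-level action updates the textual context, which is fed back alongside a fresh visual observation on the next step.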
### Key Observations
- **ALFWorld Comparison:** ReAct uses static manual examples for ICL, while RAP uses a retriever to access dynamic past experiences (memory), enabling more adaptive task-solving (e.g., recalling "heating a mug" to solve "heating a tomato").
- **Franka Kitchen Pipeline:** Integrates visual observation, text prompts (with retrieved memory), a vision-language model, and a policy network to generate low-level actions, emphasizing memory-augmented robotic autonomy.
- **Color Coding:** Green (Visual Observation), Blue (Text Prompt), Yellow (Vision Language Model), Teal (Plan of Actions), Red (Policy Network), Orange (Memory), and Gray (Task Boxes) distinguish components.
### Interpretation
- **ALFWorld:** The diagram contrasts a baseline (ReAct) with a proposed method (RAP) that leverages memory retrieval for in-context learning. RAP’s use of past experiences (e.g., heating a mug) to solve new tasks (heating a tomato) suggests improved adaptability in text-based task environments.
- **Franka Kitchen:** The pipeline demonstrates a modular approach to robotic task execution: visual input + text (with memory) → vision-language model → action plan → policy network → low-level actions. This highlights the role of memory and multimodal (vision + text) processing in physical robotic autonomy.
- **Overall:** Both sections emphasize memory-augmented AI for task planning, with ALFWorld focusing on text-based environments and Franka Kitchen on physical robotics, showcasing the versatility of such approaches across domains.