## Diagram: Robotic Task Planning and Execution
### Overview
This diagram compares three setups for robotic task planning and execution: ReAct, RAP (Ours), and a pipeline that uses visual observation in the simulated Franka Kitchen environment. It shows how information flows through each setup and how memory and retrieval are used. The two text-based panels center on manipulating objects (tomato, apple, mug) in a kitchen environment.
### Components/Axes
The diagram is segmented into three main sections, each representing a different approach. Each section includes:
* **Current Task:** Displays the current task being processed by the system.
* **Memory (Past Experiences):** Represents the system's memory of previous interactions.
* **Retrieval Mechanisms:** Illustrates how past experiences are retrieved to inform current actions.
* **ICL (In-Context Learning):** Shows the use of examples to guide the system.
* **Vision Language Model:** (Franka Kitchen section only) Represents the component that processes visual information.
* **Policy Network:** (Franka Kitchen section only) Represents the component that generates low-level actions.
The diagram also includes labels for actions ("Act:") and observations ("Obs:").
### Detailed Analysis
**1. ReAct (Top-Left)**
* **Current Task:** "You are in the middle of room." and "Task: put a hot tomato on desk". "Act: ..."
* **Manual Examples:**
    * "Task: put a hot apple in fridge." "Act: go to countertop" "Obs: you see ~"
* The flow is a direct arrow from the Current Task to the Manual Examples (ICL).
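
The fixed-example prompting shown in the ReAct panel can be sketched as below. This is a minimal illustration under stated assumptions, not the ReAct implementation: the `MANUAL_EXAMPLES` constant and the `build_react_prompt` helper are hypothetical names, and the example text is copied from the diagram as transcribed.

```python
# Hand-written in-context examples, transcribed from the diagram's
# Manual Examples box. The same lines are reused for every task.
MANUAL_EXAMPLES = [
    "Task: put a hot apple in fridge.",
    "Act: go to countertop",
    "Obs: you see ~",
]


def build_react_prompt(situation: str, task: str) -> str:
    """Prepend the fixed manual examples to the current task (hypothetical helper)."""
    lines = MANUAL_EXAMPLES + ["", situation, f"Task: {task}", "Act:"]
    return "\n".join(lines)


prompt = build_react_prompt(
    "You are in the middle of room.",
    "put a hot tomato on desk",
)
print(prompt)
```

The point of the sketch is that the examples do not depend on the task: any two tasks receive the same leading lines.
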
**2. RAP (Ours) (Center-Left)**
* **Current Task:** "You are in the middle of room." and "Task: put a hot tomato on desk". "Act: think: I need to find a tomato." "Obs: OK." "Act: go to fridge."
* **Memory (Past Experiences):** Contains two examples:
    * "Task: put a tomato in fridge." "Act: go to fridge" "Obs: you see tomato, ~"
    * "Task: put a hot apple on desk." "Act: heat mug with microwave." "Obs: you heat the tomato."
* **Retrieval Mechanisms:** Two "Retriever" blocks with arrows pointing from the Current Task to the Memory.
* **ICL:** Arrows connect the Retriever blocks to the Current Task, indicating the use of retrieved information.
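
The retrieve-then-prompt flow in the RAP panel can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the `Experience` record, the `retrieve` and `build_prompt` helpers, and the word-overlap (Jaccard) score are all hypothetical, since the diagram does not say how the retrievers rank past experiences. Querying memory once with the task and once with the latest thought is one possible reading of the two Retriever blocks.

```python
import re
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored trajectory: a task plus its Act/Obs lines (hypothetical record)."""
    task: str
    steps: list[str]


def words(text: str) -> set[str]:
    """Lowercased alphabetic tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for the real retrieval score."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def retrieve(query: str, memory: list[Experience], k: int = 1) -> list[Experience]:
    """Return the k stored experiences whose task text best matches the query."""
    return sorted(memory, key=lambda e: jaccard(query, e.task), reverse=True)[:k]


def build_prompt(task: str, history: list[str], examples: list[Experience]) -> str:
    """Place retrieved experiences ahead of the current task as in-context examples."""
    parts: list[str] = []
    for ex in examples:
        parts.append(f"Task: {ex.task}")
        parts.extend(ex.steps)
        parts.append("")
    parts.append(f"Task: {task}")
    parts.extend(history)
    parts.append("Act:")
    return "\n".join(parts)


# Memory contents copied from the diagram as transcribed.
memory = [
    Experience("put a tomato in fridge", ["Act: go to fridge", "Obs: you see tomato, ~"]),
    Experience("put a hot apple on desk", ["Act: heat mug with microwave.", "Obs: you heat the tomato."]),
]

by_task = retrieve("put a hot tomato on desk", memory)
by_thought = retrieve("think: I need to find a tomato.", memory)
prompt = build_prompt(
    "put a hot tomato on desk",
    ["Act: think: I need to find a tomato.", "Obs: OK."],
    by_thought,
)
```

With this scoring, the task query favors the "hot apple on desk" experience and the thought query favors the "tomato in fridge" experience, so the examples placed in context change as the episode progresses.
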
**3. Franka Kitchen (Right)**
* **Visual Observation:** An image of a microwave is shown.
* **Text Prompt:** "Task Information + Trajectory Retrieved Memory"
* **Memory (Past Experience):**
* **Vision Language Model:** Receives input from the Text Prompt and outputs "Task: Open a microwave door".
* **Policy Network:** Receives input from the Vision Language Model and outputs "Plan of Actions: 1. Move robot arm to microwave 2. ..."
* **Low-level Actions:** "Updated Observation, Task, Actions"
* **Task Examples (within Vision Language Model):**
    * "Task: Open a microwave door" "Actions: 1. Move robot arm to microwave 2. Grip the handle"
    * "Task: Open a cabinet door" "Actions: 1. Move robot arm to cabinet door 2. Grip the handle"
    * "Task: Open microwave door" "Actions: 1. Move robot arm to microwave 2. Grip the handle"
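
The loop implied by the Franka Kitchen panel (observe, prompt, plan, act, store) can be sketched as below. Everything here is an assumption made for illustration: `Record`, `Agent`, `fake_vlm`, and `fake_policy` are hypothetical names, the learned components are replaced by stub functions, and the prompt includes all stored records rather than a retrieved subset.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Record:
    """One memory entry: what was observed, what was asked, what was done."""
    observation: str
    task: str
    actions: list[str]


@dataclass
class Agent:
    # Both callables are hypothetical stand-ins for the diagram's learned components.
    vlm: Callable[[str, str], list[str]]   # (observation, text prompt) -> plan steps
    policy: Callable[[str, str], str]      # (plan step, observation) -> low-level action
    memory: list[Record] = field(default_factory=list)

    def text_prompt(self, task: str) -> str:
        """Combine task information with stored trajectories (simplified: no retrieval)."""
        lines = [f"Task: {task}"]
        for rec in self.memory:
            lines.append(f"Past task: {rec.task} | Actions: {'; '.join(rec.actions)}")
        return "\n".join(lines)

    def step(self, observation: str, task: str) -> list[str]:
        """Plan with the VLM stub, act with the policy stub, then update memory."""
        plan = self.vlm(observation, self.text_prompt(task))
        actions = [self.policy(plan_step, observation) for plan_step in plan]
        self.memory.append(Record(observation, task, actions))
        return actions


def fake_vlm(observation: str, prompt: str) -> list[str]:
    """Stub planner that always returns the plan shown in the diagram."""
    return ["Move robot arm to microwave", "Grip the handle"]


def fake_policy(plan_step: str, observation: str) -> str:
    """Stub controller that wraps the plan step in a placeholder command string."""
    return f"execute({plan_step})"


agent = Agent(vlm=fake_vlm, policy=fake_policy)
actions = agent.step("image of a microwave", "Open a microwave door")
```

After one call to `step`, the agent's memory holds a single observation/task/actions record, which then appears in the text prompt for the next task.
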
### Key Observations
* RAP (Ours) explicitly utilizes a memory of past experiences and a retrieval mechanism to inform current actions.
* ReAct relies on manual examples (ICL) without a dedicated memory component.
* Franka Kitchen integrates visual observation with language processing and policy planning.
* The Franka Kitchen section shows a more detailed breakdown of the task into sub-actions.
* The RAP section shows the system reasoning about the task ("think: I need to find a tomato.") before taking action.
### Interpretation
The diagram presents the three panels in order of increasing complexity. ReAct is the baseline and relies on pre-defined examples. RAP (Ours) adds a memory of past experiences and retrieves the relevant ones to guide the current task. The Franka Kitchen panel shows a pipeline that combines visual perception, language understanding, and action planning.
The use of memory and retrieval in RAP suggests an attempt to address the limitations of relying solely on pre-defined examples: the system can adapt to new situations by drawing on its past experiences. The Franka Kitchen panel highlights the role of visual information for robots acting in physical or simulated environments.
Taken together, the panels suggest that learning from experience, visual perception, and language understanding are complementary ingredients for robust and adaptable robotic systems. The "plus" symbol connecting the Franka Kitchen section to the other two suggests a potential integration of these methods. The diagram is a conceptual illustration of different architectures and does not present quantitative data; it focuses on the flow of information and the components involved in each approach.