## Diagram: Do Better Video Models Lead to Higher Embodied Success?
### Overview
This diagram illustrates "World-in-World," an open platform that evaluates state-of-the-art (SoTA) video models on embodied AI tasks. It traces the flow of information from the video models through a closed-loop interaction system to a set of embodied tasks, and includes a leaderboard and scaling charts comparing model performance.
### Components/Axes
The diagram is divided into three main sections:
1. **SoTA Video Models:** Lists several video models with their logos.
2. **World-in-World System:** Depicts the core interaction loop.
3. **Performance Evaluation:** Shows a leaderboard and scaling charts.
The diagram includes the following labels:
* **Title:** "Do Better Video Models Lead to Higher Embodied Success?"
* **Subtitle:** "World-in-World, an open platform"
* **SoTA video models:** Wan, Cosmos, SVD, Runway, LTX, Hunyuan
* **Embodied Policy:** A label indicating the policy used by the robot.
* **Unified Action:** A label indicating the action taken by the robot.
* **Closed-Loop Interaction:** A label describing the interaction between the robot and the environment.
* **Embodied Tasks:** Active Recognition, Navigation, Question Answering, Manipulation.
* **Leaderboard:** Lists models and their task success rates.
* **Data Scaling:** Shows task success vs. seen examples.
* **Inference-time Scaling:** Shows task success vs. inference count.
* **Correlation:** Shows task success vs. visual quality.
* **Task Success:** Label for the y-axis of the charts.
* **Seen Examples:** Label for the x-axis of the first chart.
* **Inference Count:** Label for the x-axis of the second chart.
* **Visual Quality:** Label for the x-axis of the third chart.
### Detailed Analysis or Content Details
**SoTA Video Models:**
The following models are listed with their logos:
* Wan
* Cosmos
* SVD
* Runway
* LTX
* Hunyuan
**World-in-World System:**
A video input (represented by a play-button icon) is fed into the "World-in-World" system. The system then generates an "Embodied Policy" that drives a robot to perform a "Unified Action" within a "Closed-Loop Interaction" with the environment.
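The closed-loop interaction described above can be sketched as a simple perceive-imagine-act cycle. All names here (`VideoWorldModel`, `EmbodiedPolicy`, `Env`) are hypothetical placeholders, since the diagram does not specify an API; this is a structural sketch, not the platform's implementation.

```python
# Sketch of the closed loop: a video world model imagines rollouts for
# candidate actions, the embodied policy picks one, the environment
# executes it, and the new observation feeds back into the loop.
# All class and method names are hypothetical.

class VideoWorldModel:
    def predict(self, observation, action_candidates):
        """Imagine a short video rollout for each candidate action."""
        return [(a, f"rollout_for_{a}") for a in action_candidates]

class EmbodiedPolicy:
    def choose(self, rollouts):
        """Score imagined rollouts and return the chosen action."""
        # Placeholder scoring: pick the lexicographically smallest action.
        return min(rollouts, key=lambda pair: pair[0])[0]

class Env:
    def __init__(self):
        self.history = []

    def step(self, action):
        """Execute the action; signal done after a fixed number of steps."""
        self.history.append(action)
        done = len(self.history) >= 3
        return f"obs_after_{action}", done

def closed_loop(world_model, policy, env, observation, actions):
    """Run the interaction loop until the environment signals done."""
    while True:
        rollouts = world_model.predict(observation, actions)
        action = policy.choose(rollouts)
        observation, done = env.step(action)
        if done:
            return env.history
```

The key property is that the world model's predictions inform each action *before* it is executed, and the real observation, not the imagined one, closes the loop.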
**Embodied Tasks:**
The output of the system is demonstrated through four embodied tasks:
* **Active Recognition:** Shown with an image of a person looking at a screen.
* **Navigation:** Shown with an image of a robot navigating a room.
* **Question Answering:** Shown with an image of a robot interacting with objects and a screen.
* **Manipulation:** Shown with an image of a robot manipulating colored blocks.
**Leaderboard:**
The leaderboard lists three models with their task success rates and image-quality scores:
* **WanTF:** 82.6% Task Success, 0.389 Image Quality
* **LDP:** 61.0% Task Success, 0.365 Image Quality
* **Cosmos:** 60.3% Task Success, 0.369 Image Quality
**Data Scaling:**
The "Data Scaling" chart shows task success increasing with the number of "Seen Examples". Three colored lines are plotted:
* **S (Blue):** Line slopes upward, starting at approximately 50% task success with 0 seen examples and reaching approximately 85% task success with 1000 seen examples.
* **Wan (Green):** Line slopes upward, starting at approximately 50% task success with 0 seen examples and reaching approximately 75% task success with 1000 seen examples.
* **R (Purple):** Line slopes upward, starting at approximately 40% task success with 0 seen examples and reaching approximately 65% task success with 1000 seen examples.
**Inference-time Scaling:**
The "Inference-time Scaling" chart shows task success increasing with the "Inference Count". Three colored lines are plotted:
* **S (Blue):** Line slopes upward, starting at approximately 50% task success with 0 inference count and reaching approximately 85% task success with 100 inference count.
* **Wan (Green):** Line slopes upward, starting at approximately 50% task success with 0 inference count and reaching approximately 75% task success with 100 inference count.
* **R (Purple):** Line slopes upward, starting at approximately 40% task success with 0 inference count and reaching approximately 65% task success with 100 inference count.
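Inference-time scaling of this kind typically means drawing more candidate rollouts from the world model and keeping the best one (best-of-N selection). A minimal sketch under that assumption follows; the `score` function and sampler are hypothetical stand-ins, not the platform's actual procedure.

```python
import random

def score(rollout):
    """Hypothetical scorer: higher means the rollout looks more promising."""
    return rollout

def best_of_n(sample_rollout, n, seed=0):
    """Draw n rollouts with a seeded RNG and keep the highest-scoring one."""
    rng = random.Random(seed)
    rollouts = [sample_rollout(rng) for _ in range(n)]
    return max(rollouts, key=score)

# With a fixed seed, the n=1 draw is the first of the n=100 draws, so the
# best-of-100 score can never be lower: this monotonicity mirrors the
# upward slope of the inference-time scaling chart.
sample = lambda rng: rng.random()
```

The design choice here is that extra compute is spent purely at inference time; the model itself is unchanged, which is what distinguishes this axis from data scaling.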
**Correlation:**
The "Correlation" chart plots task success against "Visual Quality". Three colored lines are plotted:
* **S (Blue):** Line slopes downward, starting at approximately 85% task success with low visual quality and decreasing to approximately 50% task success with high visual quality.
* **Wan (Green):** Line slopes downward, starting at approximately 75% task success with low visual quality and decreasing to approximately 50% task success with high visual quality.
* **R (Purple):** Line slopes downward, starting at approximately 65% task success with low visual quality and decreasing to approximately 40% task success with high visual quality.
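The downward slope in this panel can be quantified with a Pearson correlation coefficient. The sketch below uses illustrative, made-up numbers (not values read from the chart) purely to show the computation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative data only: visual quality rising while task success falls,
# echoing the chart's downward-sloping lines. A coefficient near -1
# indicates a strong negative linear relationship.
visual_quality = [0.1, 0.3, 0.5, 0.7, 0.9]
task_success = [85, 75, 65, 55, 50]
r = pearson(visual_quality, task_success)  # negative, close to -1
```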
### Key Observations
* WanTF consistently outperforms LDP and Cosmos in task success.
* Task success generally increases with more seen examples and higher inference counts.
* Task success is negatively correlated with visual quality, indicating that better-looking video does not translate into higher embodied success.
* The "S" model consistently shows the highest task success across all scaling charts.
### Interpretation
The diagram suggests that embodied success does not follow directly from the visual quality of video models. While visual fidelity matters, the amount of training data ("Seen Examples") and the inference-time compute budget ("Inference Count") play at least as large a role. The negative correlation between task success and visual quality is particularly noteworthy: models optimized for embodied tasks may prioritize task-relevant dynamics over photorealism. The "World-in-World" platform provides a framework for evaluating and comparing video models by their embodied performance rather than their visual output alone, and its leaderboard ranks WanTF first on this criterion. Overall, the diagram argues for shifting evaluation of video models away from purely aesthetic measures toward practical, task-oriented performance.