## Scatter Plot: Per-Task Success Rates: Real World vs World Model
### Overview
This image is a scatter plot comparing the success rates of three models (RT-1-X, Octo, OpenVLA) across tasks, measured in two environments: the "Real World" and a "World Model" (a simulated environment). Each data point represents one task for one model. The chart includes a linear regression fit line and correlation statistics.
### Components/Axes
* **Title:** "Per-Task Success Rates: Real World vs World Model"
* **X-Axis:** "Real World Success Rate (%)". Scale ranges from 0 to 100 with major ticks at 0, 20, 40, 60, 80, 100.
* **Y-Axis:** "World Model Success Rate (%)". Scale ranges from 0 to 100 with major ticks at 0, 20, 40, 60, 80, 100.
* **Legend:** Located in the bottom-right quadrant.
    * Blue Circle: "RT-1-X"
    * Orange Square: "Octo"
    * Red Triangle: "OpenVLA"
    * Black Dashed Line: "Fit"
* **Statistical Annotation:** A box in the top-left quadrant contains the text: "r = 0.78" and "p < 0.001".
* **Grid:** Light gray dashed grid lines are present at major tick intervals on both axes.
### Detailed Analysis
**Data Series and Trends:**
1. **RT-1-X (Blue Circles):**
    * **Trend:** Data points are clustered in the lower-left portion of the chart, indicating generally lower success rates in both environments. The trend is weakly positive but with high variance.
    * **Approximate Data Points (Real World %, World Model %):** (0, 0), (5, 0), (10, 0), (10, 5), (10, 10), (15, 25), (20, 5), (20, 30), (25, 5), (30, 10), (30, 20), (50, 15), (60, 50).
2. **Octo (Orange Squares):**
    * **Trend:** Data points are spread across the low-to-mid range of the chart. There is a positive trend, but with significant scatter, especially at lower real-world success rates.
    * **Approximate Data Points (Real World %, World Model %):** (0, 0), (0, 10), (0, 20), (0, 30), (5, 15), (10, 20), (10, 30), (25, 25), (35, 35), (40, 50), (40, 60), (50, 10), (55, 50), (60, 10).
3. **OpenVLA (Red Triangles):**
    * **Trend:** Data points are predominantly in the upper-right quadrant, showing high success rates in both environments. The trend is positive and broadly follows the fit line, although several points, such as (80, 40) and (10, 60), sit well away from it.
    * **Approximate Data Points (Real World %, World Model %):** (10, 60), (45, 60), (40, 40), (50, 50), (70, 40), (70, 90), (75, 75), (75, 80), (75, 90), (80, 40), (80, 50), (90, 95), (95, 85), (100, 60), (100, 70), (100, 100).
4. **Fit Line (Black Dashed):**
    * **Trend:** A linear regression line with a positive slope. It starts near (0, 10) and ends near (100, 80).
    * **Equation (Visual Estimate):** Approximately y = 0.7x + 10, read off the two endpoints: slope ≈ (80 - 10) / 100 = 0.7, intercept ≈ 10.
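
The visual estimate of the fit line and the annotated correlation can be sanity-checked against the approximate points listed above. The following is a minimal sketch, assuming the transcribed coordinates are close to the plotted ones; the `POINTS` dictionary and the helper `fit_and_correlate` are illustrative names, not part of the original figure. It computes an ordinary least-squares line, the Pearson r, and the corresponding t-statistic using only the standard library.

```python
from math import sqrt

# Approximate (real-world %, world-model %) pairs transcribed from the
# description above; these are visual estimates, not the underlying data.
POINTS = {
    "RT-1-X": [(0, 0), (5, 0), (10, 0), (10, 5), (10, 10), (15, 25), (20, 5),
               (20, 30), (25, 5), (30, 10), (30, 20), (50, 15), (60, 50)],
    "Octo": [(0, 0), (0, 10), (0, 20), (0, 30), (5, 15), (10, 20), (10, 30),
             (25, 25), (35, 35), (40, 50), (40, 60), (50, 10), (55, 50),
             (60, 10)],
    "OpenVLA": [(10, 60), (45, 60), (40, 40), (50, 50), (70, 40), (70, 90),
                (75, 75), (75, 80), (75, 90), (80, 40), (80, 50), (90, 95),
                (95, 85), (100, 60), (100, 70), (100, 100)],
}


def fit_and_correlate(pairs):
    """Return (slope, intercept, r, t) for an ordinary least-squares line."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    # t-statistic for the null hypothesis of no linear correlation,
    # with n - 2 degrees of freedom.
    t = r * sqrt((n - 2) / (1 - r * r))
    return slope, intercept, r, t


# Pool all three models, as the single fit line in the chart does.
all_pairs = [pair for pairs in POINTS.values() for pair in pairs]
slope, intercept, r, t = fit_and_correlate(all_pairs)

print(f"n = {len(all_pairs)}")
print(f"fit: y = {slope:.2f}x + {intercept:.2f}")
print(f"r = {r:.2f}, r^2 = {r * r:.2f}, t = {t:.1f}")
```

The printed slope, intercept, and r can be compared with the visual estimate y = 0.7x + 10 and the annotated r = 0.78. For about 40 degrees of freedom, a two-sided p < 0.001 requires a t-statistic of roughly 3.5 or more, which gives a rough check on the annotated p-value.
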
### Key Observations
1. **Strong Positive Correlation:** The overall dataset shows a strong positive correlation (r = 0.78, p < 0.001) between real-world success rate and world model success rate. Since r² ≈ 0.61, a linear fit accounts for roughly 61% of the variance, so performance in the simulated environment tracks real-world performance well in aggregate while leaving substantial per-task error.
2. **Model Performance Stratification:** There is a visible separation between the models. OpenVLA generally achieves the highest success rates in both environments. Octo and RT-1-X overlap in the low-to-mid range, with RT-1-X generally the lowest, most visibly on the world model axis.
3. **Variance at Low Success Rates:** Tasks with low real-world success rates (mostly RT-1-X and Octo) show wide scatter in world model success. For example, Octo tasks with 0% real-world success range from 0% to 30% world model success.
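4. **High-End Alignment:** For tasks with high real-world success rates (>70%), the world model success rates are mostly high as well, particularly for OpenVLA. The alignment is not uniform, however: among the approximate points, OpenVLA tasks near (80, 40) and (100, 60) sit 20 or more points below the fit line.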
5. **Notable Outliers:**
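    * Octo tasks at approximately (50, 10) and (60, 10) have a much lower world model success than the fit line predicts (roughly 45% and 52%, using y = 0.7x + 10).
    * An OpenVLA task at approximately (10, 60) has a very high world model success despite very low real-world success; the fit line predicts roughly 17% there.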
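
The stratification and outlier observations can be reproduced from the approximate data points. The sketch below assumes the transcribed coordinates and the visual fit estimate y = 0.7x + 10; `POINTS`, `model_means`, and `ranked_residuals` are illustrative names introduced here. It prints the mean success rate per model on each axis and ranks all points by their absolute distance from the fit line.

```python
# Approximate (real-world %, world-model %) pairs transcribed from the
# Detailed Analysis section; visual estimates, not the underlying data.
POINTS = {
    "RT-1-X": [(0, 0), (5, 0), (10, 0), (10, 5), (10, 10), (15, 25), (20, 5),
               (20, 30), (25, 5), (30, 10), (30, 20), (50, 15), (60, 50)],
    "Octo": [(0, 0), (0, 10), (0, 20), (0, 30), (5, 15), (10, 20), (10, 30),
             (25, 25), (35, 35), (40, 50), (40, 60), (50, 10), (55, 50),
             (60, 10)],
    "OpenVLA": [(10, 60), (45, 60), (40, 40), (50, 50), (70, 40), (70, 90),
                (75, 75), (75, 80), (75, 90), (80, 40), (80, 50), (90, 95),
                (95, 85), (100, 60), (100, 70), (100, 100)],
}

# Visual estimate of the fit line from the description (an assumption,
# not a refit of the original data).
FIT_SLOPE = 0.7
FIT_INTERCEPT = 10.0


def model_means(points):
    """Mean (real-world %, world-model %) per model."""
    means = {}
    for model, pairs in points.items():
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        means[model] = (mean_x, mean_y)
    return means


def ranked_residuals(points, slope, intercept):
    """All points as (residual, model, x, y), largest |residual| first."""
    rows = [
        (y - (slope * x + intercept), model, x, y)
        for model, pairs in points.items()
        for x, y in pairs
    ]
    return sorted(rows, key=lambda row: abs(row[0]), reverse=True)


means = model_means(POINTS)
for model, (mean_x, mean_y) in means.items():
    print(f"{model:8s} mean real = {mean_x:5.1f}%, mean world model = {mean_y:5.1f}%")

ranked = ranked_residuals(POINTS, FIT_SLOPE, FIT_INTERCEPT)
for residual, model, x, y in ranked[:5]:
    print(f"{model:8s} ({x:3d}, {y:3d}) residual = {residual:+.1f}")
```

A positive residual means the world model is more optimistic than the fit line predicts for that real-world success rate; a negative residual means it is more pessimistic.
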
### Interpretation
The data suggests that the "World Model" simulation environment is a useful tool for predicting real-world robotic task performance in aggregate, as evidenced by the strong correlation (r = 0.78, p < 0.001). The stratification suggests that the OpenVLA model is substantially more capable across both simulated and real environments than Octo and RT-1-X.
The wide scatter at low real-world success rates indicates that, for tasks a model mostly fails in the real world, the simulation may not accurately reflect the degree of real-world failure. At high performance levels the world model is usually also high, which suggests the simulation is more dependable for evaluating and iterating on high-performing systems, although individual high-performing tasks still fall well below the fit line. The outlier with high simulated success but low real-world success (OpenVLA at approximately (10, 60)) is the most important case: it represents a task where the simulation is overly optimistic, potentially due to a "sim-to-real gap" where the model exploits a shortcut in the simulation that doesn't translate to reality. The chart is therefore useful for checking the simulation's fidelity and for informing model selection and development.
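
Whether the scatter around the fit line really differs between low and high real-world success rates can be checked numerically. The following is a rough sketch, assuming the approximate points from the Detailed Analysis section and the visual fit estimate y = 0.7x + 10; the band boundaries (30% and 70%) and the names `PAIRS`, `BANDS`, and `rms_residual_by_band` are choices made for this illustration. It reports the root-mean-square residual within each band of real-world success rate.

```python
from math import sqrt

# Approximate (real-world %, world-model %) pairs for all three models,
# transcribed from the description; visual estimates only.
PAIRS = [
    # RT-1-X
    (0, 0), (5, 0), (10, 0), (10, 5), (10, 10), (15, 25), (20, 5),
    (20, 30), (25, 5), (30, 10), (30, 20), (50, 15), (60, 50),
    # Octo
    (0, 0), (0, 10), (0, 20), (0, 30), (5, 15), (10, 20), (10, 30),
    (25, 25), (35, 35), (40, 50), (40, 60), (50, 10), (55, 50), (60, 10),
    # OpenVLA
    (10, 60), (45, 60), (40, 40), (50, 50), (70, 40), (70, 90),
    (75, 75), (75, 80), (75, 90), (80, 40), (80, 50), (90, 95),
    (95, 85), (100, 60), (100, 70), (100, 100),
]

# Visual estimate of the fit line (an assumption, not a refit).
FIT_SLOPE = 0.7
FIT_INTERCEPT = 10.0

# Bands of real-world success rate, each covering low < x <= high.
# The boundaries are illustrative; -1 makes the first band include 0%.
BANDS = {"low": (-1, 30), "mid": (30, 70), "high": (70, 100)}


def rms_residual_by_band(pairs, slope, intercept, bands):
    """Return {band: (count, RMS residual about the fit line)}."""
    stats = {}
    for name, (low, high) in bands.items():
        residuals = [
            y - (slope * x + intercept) for x, y in pairs if low < x <= high
        ]
        if residuals:
            rms = sqrt(sum(res * res for res in residuals) / len(residuals))
        else:
            rms = float("nan")  # no points fell in this band
        stats[name] = (len(residuals), rms)
    return stats


band_stats = rms_residual_by_band(PAIRS, FIT_SLOPE, FIT_INTERCEPT, BANDS)
for name, (count, rms) in band_stats.items():
    print(f"{name:4s} band: n = {count:2d}, RMS residual = {rms:.1f}")
```

A smaller RMS residual in the high band than in the low band would support the reading that the simulation is more reliable for high-performing tasks; a similar or larger value would argue for treating that reading with caution. Because the coordinates are visual estimates, the comparison is indicative only.
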