# WorldGym: World Model as An Environment for Policy Evaluation
**Authors**:
- Percy Liang, Sherry Yang (Stanford University, NYU, Google DeepMind)
## Abstract
Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model that serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model correlate highly with real-world success rates. Moreover, we show that WorldGym preserves relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization to novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment. See videos and code at https://world-model-eval.github.io
## 1 Introduction
Robots can help humans in ways that range from home robots performing chores (Shafiullah et al., 2023; Liu et al., 2024) to hospital robots taking care of patients (Soljacic et al., 2024). One of the major roadblocks in the development of robots lies in evaluation: how can we ensure that these robots will work reliably without causing physical damage when deployed in the real world? Traditionally, handcrafted software simulators have been used to develop and evaluate robot control policies (Tedrake et al., 2019; Todorov et al., 2012; Erez et al., 2015). However, handcrafted simulation based on our understanding of the physical world can be limited, especially when it comes to hardcoding complex dynamics with high degrees of freedom or complex interactions such as manipulating soft objects (Sünderhauf et al., 2018; Afzal et al., 2020; Choi et al., 2021). As a result, the sim-to-real gap has hindered progress in robotics (Zhao et al., 2020; Salvato et al., 2021; Dulac-Arnold et al., 2019).
With the development of generative models trained on large-scale video data (Ho et al., 2022; Villegas et al., 2022; Singer et al., 2022), recent work has shown that video world models can visually emulate interactions with the physical real world by conditioning on control inputs in the form of text (Yang et al., 2023; Brooks et al., 2024) or keyboard strokes (Bruce et al., 2024). This raises an interesting question: could video world models be used to emulate robot interactions with the real world, and hence serve as an environment to evaluate robot policies before real-world testing or deployment?
Learning a dynamics model from past experience and performing rollouts in the learned dynamics model has been extensively studied in model-based reinforcement learning (RL) (Hafner et al., 2019; Fonteneau et al., 2013; Zhang et al., 2021; Kaiser et al., 2019; Yu et al., 2020). However, most existing work in model-based RL considers single-task settings, which puts it at a disadvantage compared to model-free RL, since learning a dynamics model can be much harder than learning a policy in the single-task setting. Nevertheless, we make the important observation that
- While there can be many tasks and policies, there is only one physical world in which we live that is governed by the same set of physical laws.
This makes it possible to learn a single world model that, in principle, can be used as an interactive environment to evaluate any policies on any tasks.
<details>
<summary>x1.png Details</summary>

### Visual Description
## System Architecture Diagram: Reinforcement Learning with World Model and VLM Reward
### Overview
This image is a technical system architecture diagram illustrating a reinforcement learning or robotic control pipeline. It depicts a process that starts with visual and language instructions, uses a world model and policy to generate action sequences, and finally employs a Vision-Language Model (VLM) to compute a reward signal. The diagram is structured into three main vertical sections: Input, Processing, and Reward Calculation.
### Components/Axes
The diagram is organized into distinct regions with labeled components and directional arrows indicating data flow.
**1. Left Panel: Initial Frame and Language Instruction (Input Region)**
* **Header:** "Initial Frame and Language Instruction"
* **Three Input Examples:**
* **Top Example:** Labeled "Evaluation Dataset Example". Contains an image of a robotic arm over a sink with objects and a yellow instruction box with the text: "Put the eggplant in the pot".
* **Middle Example:** Labeled "OOD Image Input" (OOD likely means Out-Of-Distribution). The image shows a different scene with more colorful objects. The instruction box is identical: "Put the eggplant in the pot".
* **Bottom Example:** Labeled "OOD Language Instruction". The image is similar to the top example, but the instruction box has a red border and different text: "Put the eggplant in the drying rack".
* **Flow:** Arrows from all three examples converge and point to a small yellow box labeled "g" in the central processing region.
**2. Central Processing Region**
* **Input Node:** A small yellow box labeled "g". It receives input from the left panel.
* **Initial Observation:** An image labeled "o₀" is shown below the "g" box. It depicts the initial state of the robotic workspace.
* **Policy Blocks:** Three identical gray rectangular blocks labeled "Policy". They are arranged horizontally.
* **World Model Block:** A large, light blue horizontal bar labeled "World Model" in bold text. It spans above the three Policy blocks.
* **Observation Sequence:** Three images labeled "o₁", "o₂", and "o₃" are positioned above the World Model bar. They show sequential states of the robotic arm performing the task.
* **Flow Arrows:**
* An arrow goes from "g" to the first "Policy" block.
* Arrows connect the "Policy" blocks to the "World Model" bar from below.
* Arrows point upward from the "World Model" bar to each of the observation images ("o₁", "o₂", "o₃").
* A final arrow leads from the last observation ("o₃") to the right panel.
**3. Right Panel: Reward Calculation**
* **Header:** "VLM as Reward"
* **Logo:** A black, stylized, interlocking circular logo (resembling the OpenAI logo) is centered in this panel.
* **Output Symbol:** An arrow points downward from the logo to a mathematical symbol: "R̂" (R with a hat), representing the estimated or predicted reward.
### Detailed Analysis
The diagram details a sequential decision-making process:
1. **Input Stage:** The system takes an initial visual observation (`o₀`) and a language instruction (encapsulated by `g`). The examples show the system is being tested on both in-distribution ("Evaluation Dataset") and out-of-distribution (OOD) scenarios, varying either the image context or the language command.
2. **Action & Prediction Stage:** The policy network, conditioned on the input `g`, generates actions. These actions and the current state are fed into a "World Model." The World Model's role is to predict future states of the environment, generating the sequence of predicted observations: `o₁`, `o₂`, `o₃`.
3. **Reward Assignment Stage:** The final predicted state (`o₃`) is passed to a "VLM as Reward" module. This module, represented by a large language model logo, evaluates how well the final state fulfills the original instruction and outputs a scalar reward value, `R̂`.
### Key Observations
* **OOD Testing:** The diagram explicitly highlights testing for robustness by including "OOD Image Input" and "OOD Language Instruction" as separate cases, indicating the system's generalization capability is a key focus.
* **World Model Centrality:** The "World Model" is the largest and most central component, suggesting it is the core innovation or focus of this architecture. It acts as a simulator or predictor of future states.
* **VLM as a Reward Function:** Using a Vision-Language Model (VLM) to compute reward (`R̂`) is a notable design choice. It implies the reward is not from a pre-defined metric but from a model that can understand both the visual outcome and the language goal.
* **Sequential Predictions:** The output of the World Model is a sequence of frames (`o₁` to `o₃`), not just a final state, which may allow for more granular reward assessment or planning.
### Interpretation
This diagram represents a **model-based reinforcement learning framework for language-conditioned robotic tasks**. The key investigative insight is the integration of a **World Model** for planning or simulation with a **Vision-Language Model (VLM)** as a flexible, semantic reward function.
* **What it demonstrates:** The system aims to learn policies that can follow natural language instructions in physical environments. By using a world model, it can "imagine" the consequences of its actions before executing them. The VLM reward allows the system to be trained or evaluated based on high-level, human-understandable goals ("put X in Y") rather than low-level coordinates.
* **Relationships:** The policy and world model are tightly coupled in a planning loop. The VLM sits outside this loop as an evaluator. The OOD examples stress that the entire pipeline, from perception (images) to understanding (language) to action (policy), must be robust.
* **Notable Implications:** This architecture could enable robots to generalize better to new objects and instructions. The use of a VLM as a reward also points toward **reinforcement learning from human feedback (RLHF)** or **goal-conditioned RL** paradigms, where the reward signal is derived from a model that encapsulates human preferences or task semantics. The hat on the R (`R̂`) signifies it is an estimate, acknowledging the potential noise or imperfection in the VLM's judgment.
</details>
Figure 1: Overview of WorldGym. Given an initial frame and an action sequence predicted by a policy, WorldGym uses a world model to interactively predict future frames, serving as a generative simulator. WorldGym then passes the generated rollout to a VLM which provides rewards. WorldGym can easily be used to test policies on OOD tasks and environments by changing the input language instruction or directly modifying the initial image.
Inspired by this observation, we propose a world-model-based policy evaluation environment (WorldGym), as shown in Figure 1. WorldGym first combines knowledge of the world across diverse environments by learning a single world model that generates videos conditioned on actions. To enable efficient rollouts of policies which predict different-length action chunks, WorldGym aligns its diffusion horizon length with policies' chunk sizes at inference time. With video rollouts from the world model, WorldGym then uses a vision-language model (VLM) to determine tasks' success from generated videos.
Our experiments show that WorldGym can emulate end-effector controls across different control axes highly effectively for robots with different morphologies. We then use the world model to evaluate VLA-based robot policies by rolling out the policies in the world model starting from real initial frames, and compare their success rates (policy values) in WorldGym to those achieved in real-world experiments. Our result suggests that policy values in WorldGym are highly correlated with policy performance in the real world, and the relative rankings of different policies are preserved.
Furthermore, as WorldGym requires only a single initial frame as input, we show how we can easily design out-of-distribution (OOD) tasks and environments and then use WorldGym to evaluate robot policies within these newly "created" environments. We find that modern robot policies still struggle to distinguish some classes of objects by their shape, and can even be distracted by adversarial facades of objects.
Although simulating realistic object interactions remains challenging, we believe WorldGym can serve as a highly useful tool for sanity-checking and testing robot policies safely and reproducibly before deploying them on real robots. Key contributions of this paper include:
- We propose to use a video world model to evaluate robot policies across different robot morphologies, and perform a comprehensive set of studies to understand its feasibility.
- We propose flexibly aligning the diffusion horizon length with policies' action chunk sizes for efficient rollouts of a variety of policies over hundreds of interactive steps.
- We show a single world model learned on data from diverse tasks and environments can enable policy value estimates that highly correlate with real-world policy success rates.
- We demonstrate the ease of testing robot policies on OOD tasks and environments within an autoregressive video generation-based world model.
## 2 Problem Formulation
In this section, we define relevant notations and review the formulation of offline policy evaluation (OPE). We also situate OPE in practical settings with partially observable environments and image-based observations.
Multi-Task POMDP.
We consider a multi-task, finite-horizon, partially observable Markov Decision Process (POMDP) (Puterman, 2014; Kaelbling et al., 1995), specified by $\mathcal{M}=(S,A,O,G,R,T,\mathcal{E},H)$, which consists of a state space, action space, observation space, goal space, reward function, transition function, emission function, and horizon length. A policy $\pi$ interacts with the environment given a goal and initial state $g,s_{0}\sim G$, producing a distribution $\pi(\cdot|s_{t},g)$ over $A$ from which an action $a_{t}$ is sampled and applied to the environment at each step $t\in[0,H]$. The environment produces a scalar reward $r_{t}=R(s_{t},g)$, transitions to a new state $s_{t+1}\sim T(s_{t},a_{t})$, and emits a new observation $o_{t+1}\sim\mathcal{E}(s_{t+1})$. We consider the sparse reward setting with $R(s_{H},g)\in\{0,1\}$ and $R(s_{t},g)=0,\forall t<H$, where $g$ is a language goal that defines the task. Data is logged from previous interactions into an offline dataset $D=\{g,s_{0},o_{0},a_{0},...,s_{H},o_{H},r_{H}\}$. The value of a policy $\pi$ can be defined as the total expected future reward:
$$
\rho(\pi)=\mathbb{E}\big[R(s_{H},g)\,\big|\,s_{0},g\sim G,\;a_{t}\sim\pi(s_{t},g),\;s_{t+1}\sim T(s_{t},a_{t}),\;\forall t\in[0,H]\big]. \tag{1}
$$
Estimating $\rho(\pi)$ from previously collected data $D$, known as offline policy evaluation (OPE) (Levine et al., 2020), has been extensively studied (Thomas & Brunskill, 2016; Jiang & Li, 2016; Fu et al., 2021; Yang et al., 2020; Thomas et al., 2015b). However, existing work in OPE mostly focuses on simulated settings with assumptions that are less practical (e.g., full observability, access to ground-truth states).
Model-Based Evaluation.
Motivated by characteristics of a real-robot system such as image based observations, high control frequencies, diverse offline data from different tasks/environments, and the lack of access to the ground truth state of the world, we consider the use of offline data to learn a single world model $\hat{T}(\cdot|\mathbf{o},\mathbf{a})$ , where $\mathbf{o}$ represents a sequence of previous image observations and $\mathbf{a}$ represents a sequence of next actions. A sequence of next observations can be sampled from the world model $\mathbf{o^{\prime}}\sim\hat{T}(\mathbf{o},\mathbf{a})$ . With this world model, one can estimate the policy value $\rho(\pi)$ with Monte-Carlo sampling using stochastic rollouts from the policy and the world model:
$$
\hat{\rho}(\pi)=\mathbb{E}\big[\hat{R}([o_{0},\dots,o_{H}],g)\,\big|\,s_{0},g\sim G,\;\mathbf{a}\sim\pi(\mathbf{o},g),\;\mathbf{o^{\prime}}\sim\hat{T}(\mathbf{o},\mathbf{a}),\;\mathbf{o}=\mathbf{o^{\prime}}\big]. \tag{2}
$$
where $\hat{R}$ is a learned reward function. Previously, model-free policy evaluation may have been preferable, since in a single-task setting dynamics models are potentially harder to learn than policy values themselves, and rollouts in a dynamics model may suffer from compounding errors (Xiao et al., 2019). However, we make the key observation that while there can be many tasks and many policies, there is only one physical world, governed by the same set of physical laws. As a result, learning a world model can benefit from diverse data from different tasks and environments with different state spaces, goals, and reward functions. More importantly, a world model can be directly trained on image-based observations, which is often the perception modality of real-world robots.
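To make Eq. 2 concrete, the Monte Carlo estimator can be sketched as below. The `policy`, `world_model`, and `reward_fn` callables are hypothetical stand-ins for the actual VLA policy, video world model, and VLM reward; their names and signatures are ours, not the paper's:

```python
import random

def estimate_policy_value(policy, world_model, reward_fn,
                          init_episodes, horizon, n_rollouts=50):
    """Monte Carlo estimate of rho(pi): average terminal reward over
    stochastic rollouts in a learned world model (cf. Eq. 2)."""
    returns = []
    for _ in range(n_rollouts):
        goal, obs = random.choice(init_episodes)  # sample (g, o_0)
        frames = [obs]
        for _ in range(horizon):
            action = policy(frames[-1], goal)      # a_t ~ pi(o_t, g)
            obs = world_model(frames[-1], action)  # o_{t+1} ~ T_hat(o_t, a_t)
            frames.append(obs)
        returns.append(reward_fn(frames, goal))    # R_hat on the full rollout
    return sum(returns) / n_rollouts
```

With a sparse $\{0,1\}$ reward, the estimate is simply the empirical success rate of the policy inside the world model.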
## 3 Building and Evaluating the World Model
In this section, we first describe our implementation of world model training and inference. Then, we discuss how we validate our world model's performance prior to rolling out real robot policies within it in the next section.
### 3.1 Building the World Model
First, we describe the architecture and key implementation details, followed by our proposed inference scheme for policy rollouts.
#### 3.1.1 World Model Training
We train a latent Diffusion Transformer (Peebles & Xie, 2023) on sequences of frames paired with actions, using Diffusion Forcing (Chen et al., 2024) to enable autoregressive frame generation. Per-frame robot action vectors are linearly projected to the model dimension and added elementwise to diffusion timestep embeddings, the result of which is used to condition the model through AdaLN-Zero modulation, similar to class conditioning in Peebles & Xie (2023). To ensure the world model is fully controllable by robot actions, we propose to randomly drop out actions for entire video clips, and use classifier-free guidance to improve the world model's adherence to action inputs. Conditioning on previous frames' latents is achieved via causal temporal attention blocks interleaved between spatial attention blocks, as in Bruce et al. (2024); Ma et al. (2025). See Appendix A for additional implementation details.
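A minimal sketch of the two ingredients above: training-time action dropout for whole clips, and inference-time classifier-free guidance that pushes the denoiser toward action adherence. The drop probability and guidance scale here are illustrative assumptions, and `model` is a placeholder for the actual Diffusion Transformer denoiser:

```python
import random

def maybe_drop_actions(actions, null_actions, p_drop=0.1, rng=random):
    """Training-time: with probability p_drop, replace the whole clip's
    actions with the null (unconditional) sequence."""
    return null_actions if rng.random() < p_drop else actions

def cfg_denoise(model, latents, actions, null_actions, guidance_scale=1.5):
    """Inference-time classifier-free guidance: run the denoiser with and
    without actions, then extrapolate toward the action-conditioned output."""
    eps_cond = model(latents, actions)
    eps_uncond = model(latents, null_actions)
    return [u + guidance_scale * (c - u)
            for c, u in zip(eps_cond, eps_uncond)]
```

A guidance scale above 1 amplifies the difference between the conditioned and unconditioned predictions, which is what improves adherence to the action inputs.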
<details>
<summary>x2.png Details</summary>

### Visual Description
## Comparative Visualization: Robotic Manipulation Task Sequences
### Overview
The image is a composite visualization comparing "Ground-truth" (real or reference) and "Generated" (synthesized or predicted) video sequences for four distinct robotic manipulation tasks. It is organized into a 2x2 grid of task panels. Each panel contains two rows of four sequential frames: the top row is labeled "Ground-truth" and the bottom row is labeled "Generated". The purpose is to visually assess the fidelity of the generated sequences against the ground truth.
### Components/Axes
* **Primary Labels:** The text "Ground-truth" and "Generated" appears vertically aligned to the left of each corresponding row of frames.
* **Structure:** The image is segmented into four independent task demonstration panels.
* **Top-Left Panel:** A robotic arm manipulating a blue cloth on a wooden table.
* **Top-Right Panel:** A robotic arm interacting with a wooden drawer unit.
* **Bottom-Left Panel:** A robotic arm interacting with a white plate and a colorful, patterned object (possibly a bag or container) on a black surface.
* **Bottom-Right Panel:** A robotic arm manipulating a small green object on a light-colored wooden table.
* **Frame Sequence:** Each row within a panel displays four consecutive frames, implying a temporal sequence from left to right.
### Detailed Analysis
**Panel 1 (Top-Left): Blue Cloth Manipulation**
* **Task:** A robotic gripper picks up and folds/moves a blue cloth.
* **Ground-truth Sequence:** Shows the gripper approaching, grasping, lifting, and repositioning the cloth. The cloth's folds and position change realistically.
* **Generated Sequence:** The sequence closely mirrors the ground truth in terms of gripper position, cloth deformation, and overall motion trajectory. Minor differences in the exact fold geometry of the cloth are perceptible upon close inspection.
**Panel 2 (Top-Right): Drawer Interaction**
* **Task:** A robotic arm opens a drawer in a wooden unit.
* **Ground-truth Sequence:** The gripper approaches the drawer handle, pulls it open, and retracts slightly. The drawer slides out smoothly.
* **Generated Sequence:** The generated frames replicate the action. The drawer's open position and the arm's posture in the final frames appear consistent with the ground truth.
**Panel 3 (Bottom-Left): Object Interaction on Table**
* **Task:** A robotic arm interacts with a white plate and a colorful object on a black tabletop.
* **Ground-truth Sequence:** The gripper moves towards the colorful object, appears to grasp or push it, causing it to shift position relative to the plate.
* **Generated Sequence:** The generated sequence shows a similar interaction. The movement of the colorful object and the arm's path are visually comparable to the ground truth.
**Panel 4 (Bottom-Right): Small Object Manipulation**
* **Task:** A robotic arm picks up or manipulates a small green object on a table.
* **Ground-truth Sequence:** The gripper descends, interacts with the green object, and lifts or moves it.
* **Generated Sequence:** The generated frames show the same fundamental action. The object's position and the gripper's configuration in each frame align well with the corresponding ground-truth frame.
### Key Observations
1. **High Fidelity:** Across all four tasks, the "Generated" sequences demonstrate a high degree of visual and temporal fidelity when compared to their "Ground-truth" counterparts. The core actions, object states, and robotic poses are faithfully reproduced.
2. **Consistent Structure:** The comparison is presented in a clear, consistent format, making side-by-side evaluation straightforward.
3. **Minor Discrepancies:** While the overall sequences match, subtle differences exist in fine details, such as the exact wrinkle pattern on the cloth (Panel 1) or the precise lighting reflection on an object (Panel 3). These are expected in generative model outputs.
4. **Task Diversity:** The visualization tests the generative model on a variety of manipulation primitives: deformable object manipulation (cloth), articulated object manipulation (drawer), and rigid object interaction.
### Interpretation
This image serves as a qualitative evaluation metric for a generative model, likely a video prediction or robotic simulation model. The "Ground-truth" represents the target reality: real video footage of a robot performing tasks. The "Generated" represents the model's attempt to synthesize or predict these video sequences.
The close correspondence between the two suggests the model has successfully learned the underlying physics, kinematics, and visual appearance of these robotic tasks. It can generate plausible future frames or novel views that maintain physical and temporal coherence. The minor discrepancies highlight the current limits of the model's precision, which could be due to factors like complex deformable object physics, lighting modeling, or fine-grained texture generation.
The choice of tasks is significant: manipulating a cloth (highly deformable) and a drawer (constrained articulation) are challenging problems in robotics. Success here indicates the model captures complex dynamics beyond simple rigid-body motion. This type of visualization is crucial for research in robot learning, computer vision, and generative AI, providing an intuitive, holistic assessment that numerical metrics alone cannot offer. It answers the question: "Does the model's imagined version of the task look and behave like reality?"
</details>
Figure 2: Qualitative evaluation of the world model on Bridge, RT-1, VIOLA, and Berkeley UR5. In each group, top row shows the ground truth video from the real robot. Bottom row shows the generated video from the world model conditioned on the same actions as the original video. The world model closely follows the true dynamics across different robot morphologies.
#### 3.1.2 Rolling Out a Policy in the World Model
Our policy evaluation pipeline operates through an iterative loop between the robot policy and the world model. First, the world model is initialized with an initial observation $o_{0}$ , which is then passed as input to a policy $\pi$ which produces a chunk of actions $\mathbf{a}_{\text{pred}}$ . The actions are passed back to the world model, which predicts a new frame for each action in $\mathbf{a}_{\text{pred}}$ . The latest frame produced by the world model is then returned to the policy as its next input observation.
Since different robot policies output a different number of actions at once (Kim et al.; Brohan et al., 2022; Chi et al., 2023), WorldGym needs to support efficient prediction of a chunk of frames conditioned on a chunk of (variable-length) actions. By virtue of being trained with Diffusion Forcing, as well as our usage of a causal temporal attention mask, we can flexibly control how many frames our world model denoises in parallel at inference time, i.e., its prediction horizon length. We propose setting the horizon equal to the policy's action chunk size, $|\mathbf{a}_{\text{pred}}|$. This has the benefit of efficient frame generation for policies with differing action chunk sizes, all from a single world model checkpoint. This contrasts with prior diffusion world models for robotics, such as Cosmos (NVIDIA et al., 2025), which, due to being trained with bidirectional attention and a fixed context length, must always denoise 16 latent frames in parallel. This constraint results in wasted compute for action chunk sizes less than the context length and unrealized parallelism for chunk sizes which are larger. On the other hand, our design allows parallelism to flexibly match the number of actions, thus utilizing hardware more effectively (see Appendix F.2).
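The policy-world-model loop with the denoising horizon matched to each returned action chunk can be sketched as follows; the function names and signatures are hypothetical stand-ins for the actual policy and world-model interfaces:

```python
def rollout(policy, world_model, o0, goal, max_steps=120):
    """Interactive rollout: each iteration, the world model denoises
    exactly len(chunk) frames in parallel, however many actions the
    policy emitted."""
    frames = [o0]
    while len(frames) - 1 < max_steps:
        chunk = policy(frames[-1], goal)  # variable-length action chunk
        new_frames = world_model(frames, chunk, horizon=len(chunk))
        frames.extend(new_frames)         # latest frame feeds the next policy call
    return frames
```

Because the horizon is set per call, a policy emitting 4-action chunks and one emitting 16-action chunks can both be rolled out from the same world model checkpoint without wasted denoising steps.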
#### 3.1.3 VLM as Reward
We opt for GPT-4o (OpenAI et al., 2024) as a reward model, passing in the sequence of frames from the generated rollout and the language instruction (see the prompt for the VLM in Appendix B). In certain cases where both policies being evaluated fail to perform a task end-to-end, it is still helpful to get signals on which policy is closer to completing a task. We can specify these partial credit criteria to the VLM to further distinguish performance between different policies, which has been done manually using heuristics in prior work (Kim et al.). We validate the accuracy of VLM-predicted rewards in Appendix B.2.
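As an illustration, a GPT-4o-style reward query might be assembled as below, encoding a subsample of rollout frames as image inputs. The prompt wording, subsampling rate, and helper name are our assumptions, not the paper's exact prompt (which appears in their Appendix B):

```python
import base64

def build_reward_request(frames_jpeg, instruction, n_keep=8):
    """Assemble a chat-completions-style request asking the VLM for a
    binary success judgment on a generated rollout. frames_jpeg is a
    list of JPEG-encoded frames (bytes)."""
    step = max(1, len(frames_jpeg) // n_keep)
    kept = frames_jpeg[::step][:n_keep]  # subsample to limit image tokens
    content = [{"type": "text",
                "text": f"Task: {instruction}. Based on the frames, "
                        f"did the robot succeed? Answer 1 or 0."}]
    for jpg in kept:
        b64 = base64.b64encode(jpg).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    return {"model": "gpt-4o", "messages": [{"role": "user", "content": content}]}
```

The returned dict would then be sent through an OpenAI-compatible client; partial-credit criteria could be appended to the text prompt in the same way.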
### 3.2 Evaluating the World Model
Next, we describe how we validate the performance of our world model prior to policy evaluation, ensuring that it exhibits realistic robot movement and adheres to arbitrary action controls.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Multi-Panel Robotic Manipulation Sequence
### Overview
The image is a composite figure showing four distinct sequences of a robotic arm performing manipulation tasks on a tabletop workspace. Each sequence is presented as a horizontal strip of three frames, illustrating a progression of actions. The figure is divided into four labeled panels: "Gripper", "Y Sweep", "X Sweep", and "Z Sweep".
### Components/Axes
* **Panel Layout:** The image is split into a 2x2 grid of panels.
* **Top-Left Panel:** Labeled "Gripper" on the left vertical axis.
* **Top-Right Panel:** Labeled "Y Sweep" on the left vertical axis.
* **Bottom-Left Panel:** Labeled "X Sweep" on the left vertical axis.
* **Bottom-Right Panel:** Labeled "Z Sweep" on the left vertical axis.
* **Common Elements:** Each panel contains three sequential photographic frames showing a white robotic arm (likely a collaborative robot) interacting with objects on a dark gray or black tabletop. The background shows a lab or workshop environment.
* **Annotations:** Each panel includes white text boxes and directional arrow icons overlaid on the images to describe the action or movement.
* **Objects:** Various objects are present on the table across the sequences, including what appears to be a red apple, a dark cylindrical can, a book or magazine, a white plate, and a yellow sponge.
### Detailed Analysis
**1. Gripper Panel (Top-Left)**
* **Sequence:** Three frames showing the robotic arm's end-effector (gripper) interacting with a dark cylindrical object (can).
* **Text Annotations:**
* Frame 1: "close" (in a white box).
* Frame 2: "open" (in a white box).
* Frame 3: "close" (in a white box).
* **Action Trend:** The gripper closes on the object, opens to release it, and then closes again. The object's position appears unchanged between frames 1 and 3.
**2. Y Sweep Panel (Top-Right)**
* **Sequence:** Three frames showing the arm moving an object along the Y-axis (front-to-back relative to the camera view).
* **Directional Icons:**
* Frame 1: Upward-pointing arrow (↑).
* Frame 2: Upward-pointing arrow (↑).
* Frame 3: Downward-pointing arrow (↓).
* **Objects & Action:** The arm holds a dark cylindrical object (can). An apple is visible on the table to the left. The trend shows the arm moving the can away from the camera (↑) in the first two frames, then back towards the camera (↓) in the third frame.
**3. X Sweep Panel (Bottom-Left)**
* **Sequence:** Three frames showing the arm moving an object along the X-axis (left-to-right).
* **Directional Icons:**
* Frame 1: Right-pointing arrow (→).
* Frame 2: Right-pointing arrow (→).
* Frame 3: Left-pointing arrow (←).
* **Objects & Action:** The arm holds a flat, rectangular object (a book or magazine). The trend shows the arm moving the object to the right (→) in the first two frames, then back to the left (←) in the third frame.
**4. Z Sweep Panel (Bottom-Right)**
* **Sequence:** Three frames showing the arm moving an object along the Z-axis (vertical up-and-down).
* **Directional Icons:**
* Frame 1: Downward-pointing arrow (↓).
* Frame 2: Downward-pointing arrow (↓).
* Frame 3: Upward-pointing arrow (↑).
* **Objects & Action:** The arm holds a yellow sponge. A white plate is visible on the table. The trend shows the arm lowering the sponge (↓) in the first two frames, then raising it (↑) in the third frame.
### Key Observations
* **Consistent Structure:** All four panels follow an identical three-frame narrative structure: initiate action, continue action, reverse action.
* **Action Types:** The figure isolates and demonstrates four fundamental robotic manipulation primitives: gripper actuation (open/close) and Cartesian translation along the three primary axes (X, Y, Z).
* **Object Variety:** Different objects are used in each sweep (can, book, sponge), suggesting the tests are performed on items with different physical properties (shape, weight, deformability).
* **Spatial Grounding:** The directional arrows are consistently placed in the lower portion of each frame. The "Gripper" text labels are centered within their frames. The panel labels ("Gripper", "Y Sweep", etc.) are vertically aligned on the far left of each row.
### Interpretation
This image is a technical demonstration or validation figure, likely from a robotics research paper or report. It systematically documents the successful execution of basic motor skills by a robotic system.
* **What it Demonstrates:** The figure provides visual proof that the robotic system can reliably perform discrete, repeatable actions: grasping/releasing an object and moving it precisely along each of the three cardinal axes. The use of different objects implies testing for generalization.
* **Relationship Between Elements:** The four panels are not a single continuous task but a catalog of independent capabilities. Together, they form a foundational skill set required for more complex manipulation tasks (e.g., pick-and-place, assembly). The "Gripper" panel tests the end-effector, while the "Sweep" panels test the arm's positional control.
* **Notable Patterns:** The strict three-frame, action-reverse pattern is a clear methodological choice for clear communication. It shows not just that the robot can move in a direction, but that it has controlled, bidirectional authority over that movement. The lack of visible errors or object drops across all sequences suggests a high degree of reliability in these controlled tests.
* **Underlying Purpose:** This figure serves to establish baseline competency. Before showing a robot performing a complex task like setting a table, a researcher would use a figure like this to prove the robot can first reliably move its arm to a location and open/close its gripper. It answers the fundamental question: "Can the hardware and low-level control system execute basic commands as expected?"
</details>
Figure 3: Results on end-effector control across action dimensions. Generated videos closely follow gripper controls such as opening and closing the gripper, as well as moving in different directions, starting from any initial observation frame. Results for control sweeps on the Bridge robot can be found in Figure 16 in Appendix E.1.
#### 3.2.1 Agreement with Validation Split
First, we test the world model's ability to generate videos similar to those produced by running a robot in the real world. Specifically, we take the validation split of initial images from the Open-X Embodiment dataset, and predict videos conditioned on the same action sequences as in the original data. Figure 2 shows that the generated rollouts generally follow the real-robot rollouts across different initial observations and different robot morphologies.
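The replay procedure above amounts to an autoregressive loop that conditions each generated frame on the recorded action sequence. A minimal sketch follows; `predict_next` is a hypothetical stand-in for the action-conditioned video model, and the toy dynamics exist only to make the sketch runnable:

```python
# Sketch of replaying a validation action sequence through an autoregressive
# world model. `predict_next` is a hypothetical stand-in; the real model is
# a learned action-conditioned video generator.
from typing import Callable, List

Frame = List[float]  # placeholder for an image tensor

def replay_rollout(
    predict_next: Callable[[List[Frame], List[float]], Frame],
    initial_frame: Frame,
    actions: List[List[float]],
) -> List[Frame]:
    """Autoregressively generate a video conditioned on a fixed action sequence."""
    frames = [initial_frame]
    for action in actions:
        # Each step conditions on the frames generated so far plus the next action.
        frames.append(predict_next(frames, action))
    return frames

# Toy dynamics standing in for the learned model: shift the frame by the action.
def toy_predict(frames, action):
    last = frames[-1]
    return [p + a for p, a in zip(last, action)]

video = replay_rollout(toy_predict, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The generated video can then be compared frame-by-frame against the held-out real-robot footage.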
#### 3.2.2 End-Effector Control Sweeps
Next, we need a way to evaluate whether our world model can emulate arbitrary action sequences beyond the kinds present in the training data. We propose hard-coding a robot control policy that moves only one action dimension at a time (keeping the other action dimensions at zero). The robot is then expected to move along that single dimension with non-zero input, corresponding to moving in different horizontal and vertical directions as well as opening and closing its gripper. Figure 3 shows that the generated videos faithfully follow the intended end-effector movement, despite the fact that these particular control sequences are not present in the training data. Results are best viewed as videos in the supplementary material.
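Such a hard-coded sweep policy can be sketched in a few lines. The 7-dimensional action space and step magnitude below are illustrative assumptions, not the paper's exact values:

```python
# Sketch of the hard-coded control sweep: only one action dimension is
# driven at a time (all others held at zero), moving forward for a few
# steps and then reversing. Dimension count and magnitude are assumptions.
def control_sweep(dim: int, n_dims: int = 7, steps: int = 4, magnitude: float = 0.05):
    """Yield actions that move along one dimension, then reverse direction."""
    actions = []
    for sign in (+1.0, -1.0):  # forward sweep, then back toward the start
        for _ in range(steps):
            a = [0.0] * n_dims
            a[dim] = sign * magnitude
            actions.append(a)
    return actions

x_sweep = control_sweep(dim=0)  # e.g., sweep along the X axis
```

Feeding each sweep into the world model and inspecting the generated video checks one action dimension in isolation.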
## 4 Evaluating Policies in WorldGym
Having established confidence in the world model's performance, we now use the world model to evaluate policies. We begin by rolling out three recent VLA policies in WorldGym and check whether WorldGym reflects real-world success (Section 4.1). We then assess whether relative policy performance is preserved, comparing different versions, sizes, and training stages of the same models (Section 4.2). Finally, we explore WorldGym's potential to test policies on out-of-distribution (OOD) tasks and environments (Section 4.3), including novel instructions and altered visual contexts.
### 4.1 Correlation between Real-World and Simulated Policy Performance
<details>
<summary>x4.png Details</summary>

### Visual Description
## Scatter Plot: Per-Task Success Rates: Real World vs World Model
### Overview
This image is a scatter plot comparing the success rates of three different models (RT-1-X, Octo, OpenVLA) on various tasks, measured in two different environments: the "Real World" and a "World Model" (a simulated environment). Each data point represents a single task. The chart includes a linear regression fit line and statistical correlation metrics.
### Components/Axes
* **Title:** "Per-Task Success Rates: Real World vs World Model"
* **X-Axis:** "Real World Success Rate (%)". Scale ranges from 0 to 100 with major ticks at 0, 20, 40, 60, 80, 100.
* **Y-Axis:** "World Model Success Rate (%)". Scale ranges from 0 to 100 with major ticks at 0, 20, 40, 60, 80, 100.
* **Legend:** Located in the bottom-right quadrant.
* Blue Circle: "RT-1-X"
* Orange Square: "Octo"
* Red Triangle: "OpenVLA"
* Black Dashed Line: "Fit"
* **Statistical Annotation:** A box in the top-left quadrant contains the text: "r = 0.78" and "p < 0.001".
* **Grid:** Light gray dashed grid lines are present at major tick intervals on both axes.
### Detailed Analysis
**Data Series and Trends:**
1. **RT-1-X (Blue Circles):**
* **Trend:** Data points are clustered in the lower-left portion of the chart, indicating generally lower success rates in both environments. The trend is weakly positive but with high variance.
* **Approximate Data Points (Real World %, World Model %):** (0, 0), (5, 0), (10, 0), (10, 5), (10, 10), (15, 25), (20, 5), (20, 30), (25, 5), (30, 10), (30, 20), (50, 15), (60, 50).
2. **Octo (Orange Squares):**
* **Trend:** Data points are spread across the low-to-mid range of the chart. There is a positive trend, but with significant scatter, especially at lower real-world success rates.
* **Approximate Data Points (Real World %, World Model %):** (0, 0), (0, 10), (0, 20), (0, 30), (5, 15), (10, 20), (10, 30), (25, 25), (35, 35), (40, 50), (40, 60), (50, 10), (55, 50), (60, 10).
3. **OpenVLA (Red Triangles):**
* **Trend:** Data points are predominantly in the upper-right quadrant, showing high success rates in both environments. The trend is strongly positive and aligns closely with the fit line.
* **Approximate Data Points (Real World %, World Model %):** (10, 60), (45, 60), (40, 40), (50, 50), (70, 40), (70, 90), (75, 75), (75, 80), (75, 90), (80, 40), (80, 50), (90, 95), (95, 85), (100, 60), (100, 70), (100, 100).
4. **Fit Line (Black Dashed):**
* **Trend:** A linear regression line showing a strong positive correlation. It starts near (0, 10) and ends near (100, 80).
* **Equation (Visual Estimate):** Approximately y = 0.7x + 10.
### Key Observations
1. **Strong Positive Correlation:** The overall dataset shows a strong positive correlation (r = 0.78, p < 0.001) between real-world success rate and world model success rate. This suggests that performance in the simulated environment is a good predictor of real-world performance.
2. **Model Performance Stratification:** There is a clear separation between the models. OpenVLA consistently achieves the highest success rates, followed by Octo, with RT-1-X generally performing the lowest.
3. **Variance at Low Success Rates:** Models with lower real-world success rates (RT-1-X and some Octo tasks) show much higher variance in their world model performance. Some tasks with 0% real-world success have world model success up to 30%.
4. **High-End Alignment:** For tasks with high real-world success rates (>70%), the world model success rates are also high and more tightly clustered around the trend line, particularly for OpenVLA.
5. **Notable Outliers:**
* An Octo task at approximately (50, 10) has a much lower world model success than predicted.
* An OpenVLA task at approximately (10, 60) has a very high world model success despite very low real-world success.
### Interpretation
The data demonstrates that the "World Model" simulation environment is a valid and useful tool for predicting real-world robotic task performance, as evidenced by the strong correlation. The stratification suggests that the OpenVLA model is significantly more robust and capable across both simulated and real environments compared to Octo and RT-1-X.
The higher variance at lower performance levels indicates that for tasks or models that are fundamentally flawed or poorly suited to the task, the simulation may not accurately reflect the degree of real-world failure. Conversely, the tight alignment at high performance levels suggests the simulation is highly reliable for evaluating and iterating on high-performing systems. The outlier with high simulated success but low real-world success (OpenVLA at ~10,60) is criticalâit represents a task where the simulation is overly optimistic, potentially due to a "sim-to-real gap" where the model exploits a shortcut in the simulation that doesn't translate to reality. This chart is essential for validating the simulation's fidelity and for model selection and development.
</details>
(a) Per-Task Success Rates. Each point represents a task from Table 5, with different policies being represented by different shaped markers. There is a strong correlation ( $r=0.78$ ) between policy performance in our world model (y-axis) and in the real world (x-axis).
<details>
<summary>x5.png Details</summary>

### Visual Description
## Bar Chart: Mean Success Rates: Real World vs World Model
### Overview
This is a grouped bar chart comparing the mean success rates (in percentage) of three different systems or modelsâRT-1-X, Octo, and OpenVLAâin two distinct environments: "Real World" and "World Model." The chart includes error bars for each data point.
### Components/Axes
* **Chart Title:** "Mean Success Rates: Real World vs World Model"
* **Y-Axis:**
* **Label:** "Success Rate (%)"
* **Scale:** Linear, ranging from 0 to 80, with major gridlines at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70, 80).
* **X-Axis:**
* **Categories (from left to right):** "RT-1-X", "Octo", "OpenVLA".
* **Legend:**
* **Position:** Top-left corner of the plot area.
* **Series 1:** "Real World" - Represented by light blue bars.
* **Series 2:** "World Model" - Represented by dark blue bars with a thick black outline.
* **Data Labels:** The exact percentage value is printed in white text inside each bar.
* **Error Bars:** Each bar has a black, vertical error bar extending above and below the top of the bar, indicating variability or confidence intervals.
### Detailed Analysis
The chart presents the following data points for each system:
**1. RT-1-X (Leftmost Group)**
* **Real World (Light Blue Bar):** Success Rate = **18.5%**. The error bar extends approximately from 15% to 22%.
* **World Model (Dark Blue Bar):** Success Rate = **15.6%**. The error bar extends approximately from 12% to 19%.
* **Trend:** The Real World performance is slightly higher than the World Model performance for this system.
**2. Octo (Middle Group)**
* **Real World (Light Blue Bar):** Success Rate = **20.0%**. The error bar extends approximately from 15% to 25%.
* **World Model (Dark Blue Bar):** Success Rate = **23.8%**. The error bar extends approximately from 20% to 28%.
* **Trend:** The World Model performance is slightly higher than the Real World performance for this system.
**3. OpenVLA (Rightmost Group)**
* **Real World (Light Blue Bar):** Success Rate = **70.6%**. The error bar extends approximately from 65% to 76%.
* **World Model (Dark Blue Bar):** Success Rate = **67.4%**. The error bar extends approximately from 62% to 72%.
* **Trend:** The Real World performance is slightly higher than the World Model performance for this system.
### Key Observations
1. **Dominant Performance:** The OpenVLA system demonstrates a dramatically higher success rate (around 70%) in both environments compared to RT-1-X and Octo (both around 20% or lower).
2. **Model-Real World Correlation:** For each system, the success rates in the "Real World" and "World Model" are relatively close, suggesting the world models are reasonably predictive of real-world performance, though not perfectly aligned.
3. **Performance Gap Direction:** The direction of the gap between Real World and World Model performance is not consistent. RT-1-X and OpenVLA perform better in the Real World, while Octo performs better in its World Model.
4. **Variability:** The error bars for OpenVLA are larger in absolute terms (spanning ~10-11 percentage points) compared to RT-1-X and Octo (spanning ~4-7 percentage points), though this is proportional to the higher mean values.
### Interpretation
The data strongly suggests that the **OpenVLA system is significantly more capable** than the RT-1-X and Octo systems for the tasks measured, achieving success rates over three times higher. The close correspondence between "Real World" and "World Model" results for each system indicates that the simulated environments (world models) used for evaluation are **valid proxies** for real-world performance, capturing the relative ranking and approximate magnitude of success. However, the consistent small discrepancies (2-3 percentage points) and the reversal of the gap direction for Octo highlight that these simulations are **not perfect replicas**; unmodeled factors in the real world or simplifications in the simulation can slightly alter outcomes. The larger error bars for OpenVLA may indicate greater sensitivity to initial conditions or task variations when operating at a higher performance level.
</details>
(b) Mean Success Rates. Robot policies' mean success rates in the world model differ by an average of only 3.3% from the real world, near the standard error range for each policy. Relative performance rankings between RT-1-X, Octo, and OpenVLA are also preserved.
Figure 4: Success rates of modern VLAs, as evaluated within WorldGym and the real world.
Qualitative Evaluation.
To ensure WorldGym is useful for policy evaluation, we test whether policy performance within the world model is similar to that of the real world. To do so, we perform a direct comparison with the Bridge evaluation trials from OpenVLA (Kim et al.). Specifically, the OpenVLA Bridge evaluation consists of 17 challenging tasks which are not present in the Bridge V2 (Walke et al., 2023) dataset. We use WorldGym to evaluate the three open-source policies evaluated in Kim et al.: RT-1-X (O'Neill et al., 2023), Octo (Octo Model Team et al., 2024), and OpenVLA (Kim et al.). For each task and each policy, Kim et al. perform 10 trials, each with randomized initial object locations. We obtain the first frame of the recorded rollouts for all trials of all tasks. We then simulate each of the 10 real-world trials by using the original initial frame to roll out the policy within the world model, as described in Section 3.1.2. We show qualitative rollouts in WorldGym from different policies in Figure 5: rollouts from OpenVLA generally perform better than rollouts from RT-1-X and Octo on the Bridge robot (top two rows). We further show that WorldGym can easily be used to perform rollouts in other environments with other robots, such as the Google Robot (bottom row in Figure 5).
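The trial-level procedure can be sketched as a simple loop: each recorded real-world trial contributes its initial frame, the policy is rolled out inside the world model, and a VLM judges success. All callables below are hypothetical stand-ins for the actual policy, world model, and reward model, and the toy usage exists only to make the sketch runnable:

```python
# Sketch of per-task policy evaluation in WorldGym. `policy`,
# `world_model_step`, and `vlm_success` are hypothetical stand-ins.
def evaluate_task(initial_frames, policy, world_model_step, vlm_success, horizon=20):
    """Return the fraction of world-model rollouts the VLM marks as successful."""
    successes = 0
    for frame in initial_frames:
        frames = [frame]
        for _ in range(horizon):
            action = policy(frames)                           # policy acts on observations
            frames.append(world_model_step(frames, action))   # world model transitions
        successes += int(vlm_success(frames))                 # VLM provides the reward
    return successes / len(initial_frames)

# Toy usage: frames are numbers, the "VLM" checks parity of the final frame.
rate = evaluate_task(
    initial_frames=[0, 1, 2, 3],
    policy=lambda frames: 1,
    world_model_step=lambda frames, a: frames[-1] + a,
    vlm_success=lambda frames: frames[-1] % 2 == 0,
)
```

Averaging `rate` over a task suite yields the per-policy success rates plotted in Figure 4.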
<details>
<summary>x6.png Details</summary>

### Visual Description
\n
## Comparative Robotic Task Execution Chart: Model Performance on Manipulation Tasks
### Overview
The image is a comparative chart displaying the performance of three different robotic manipulation models (OpenVLA, Octo, RT-1-X) on three distinct tasks. It is structured as a grid with three rows (tasks) and four columns (including the initial state and the three models). Each cell contains a sequence of three frames showing the progression of the robotic arm's attempt to complete the specified task.
### Components/Axes
* **Column Headers (Top Row):** The columns are labeled from left to right:
1. `Initial Frame`
2. `OpenVLA`
3. `Octo`
4. `RT-1-X`
* **Row Labels (Leftmost Column):** The rows are labeled with the task description, written vertically. From top to bottom:
1. `Put eggplant in pot`
2. `Put red block in pink bowl`
3. `Put blue object in bag`
* **Visual Content:** Each cell contains a sequence of three photographic frames showing a robotic arm (a black, multi-jointed arm with a gripper) interacting with objects on a workspace. The workspace appears to be a light blue tray or a wooden table, depending on the task.
### Detailed Analysis
**Task 1: Put eggplant in pot (Top Row)**
* **Initial Frame:** Shows a purple eggplant-shaped object and a small grey pot on a light blue tray. The robotic arm is positioned above.
* **OpenVLA:** The sequence shows the arm descending, grasping the purple object, moving it over the grey pot, and releasing it inside. The final frame shows the object resting within the pot.
* **Octo:** The sequence shows the arm descending and grasping the purple object. In the subsequent frames, the arm moves the object near the pot but appears to place it next to or partially on the rim, not fully inside. The final frame shows the object adjacent to the pot.
* **RT-1-X:** The sequence shows the arm descending, grasping the purple object, and moving it towards the pot. The final frame shows the object being held directly above the pot's opening, poised for release.
**Task 2: Put red block in pink bowl (Middle Row)**
* **Initial Frame:** Shows a red cube, a blue cup, and a pink bowl on a light blue tray. The robotic arm is positioned above.
* **OpenVLA:** The sequence shows the arm descending, grasping the red cube, moving it over the pink bowl, and releasing it. The final frame shows the red cube inside the pink bowl.
* **Octo:** The sequence shows the arm descending and grasping the red cube. In the following frames, the arm moves the cube towards the pink bowl but appears to place it next to the bowl, not inside. The final frame shows the red cube on the tray beside the pink bowl.
* **RT-1-X:** The sequence shows the arm descending, grasping the red cube, and moving it over the pink bowl. The final frame shows the cube being released and falling into the pink bowl.
**Task 3: Put blue object in bag (Bottom Row)**
* **Initial Frame:** Shows a blue rectangular object (resembling a book or box) and a brown paper bag on a wooden table. The robotic arm is positioned above.
* **OpenVLA:** The sequence shows the arm descending, grasping the blue object, moving it over the open bag, and releasing it. The final frame shows the blue object inside the bag.
* **Octo:** The sequence shows the arm descending and attempting to grasp the blue object. The grasp appears less secure. In the following frames, the arm moves the object towards the bag but drops it near the bag's opening. The final frame shows the blue object on the table next to the bag.
* **RT-1-X:** The sequence shows the arm descending, grasping the blue object, and moving it over the bag. The final frame shows the object being released and falling into the bag.
### Key Observations
1. **Performance Consistency:** The `OpenVLA` model successfully completes all three tasks, placing the target object inside the designated container in each case.
2. **Performance Variability:** The `Octo` model fails to complete any of the three tasks as specified. In each case, it grasps the object but places it adjacent to, rather than inside, the target container.
3. **Partial Success:** The `RT-1-X` model shows mixed results. It successfully completes Tasks 2 and 3 (red block in pink bowl, blue object in bag). For Task 1 (eggplant in pot), it positions the object correctly above the pot but the final frame does not show the release, leaving the completion ambiguous.
4. **Task Environment:** Tasks 1 and 2 are performed on a light blue tray with multiple objects present, while Task 3 is performed on a clearer wooden table surface.
### Interpretation
This chart serves as a qualitative benchmark comparing the generalization and precision of three robotic foundation models across varied manipulation tasks requiring pick-and-place actions.
* **What the data suggests:** `OpenVLA` demonstrates robust and reliable performance across different objects (eggplant, block, book), containers (pot, bowl, bag), and environments (tray, table). `Octo` consistently fails the final placement step, suggesting a potential limitation in its spatial understanding or policy for the "place" component of the task. `RT-1-X` performs well on two tasks but shows hesitation or incomplete execution on the third, indicating possible task-specific variability in its policy.
* **How elements relate:** The side-by-side comparison directly contrasts the models' policies for the same initial state. The "Initial Frame" column establishes a controlled starting point, making the differences in the subsequent frames attributable to the models' decision-making.
* **Notable patterns/anomalies:** The most striking pattern is the consistent failure mode of `Octo` (placing beside instead of inside). This is not a random error but a systematic deviation from the task goal. The ambiguity in `RT-1-X`'s first task (holding above the pot) is also notable, as it differs from its clear successes in the other two tasks. The chart effectively highlights that successful grasping does not guarantee successful task completion; the "place" action is a critical differentiator.
</details>
Figure 5: Qualitative policy rollouts on Bridge and Google Robot for RT-1-X, Octo, and OpenVLA. OpenVLA rollouts often lead to more visual successes than the other two policies across environments.
Quantitative Evaluation.
Using the simulated rollouts from WorldGym, we then compute the average task success rate similar to Kim et al., and plot the success rate for each task and each policy in Figure 4(a). We find that real-world task performance is strongly correlated with the task performance reported by the world model, achieving a Pearson correlation of $r=0.78$ . While per-task policy success rates within WorldGym still differ slightly from those in the real world (see Table 5), the mean success rates achieved by these policies within WorldGym are quite close to their real-world values, as shown in Figure 4(b). The success rates differ by an average of only 3.3%, with RT-1-X achieving 18.5% in the real world vs. 15.6% in the world model, Octo achieving 20.0% vs. 23.8%, and OpenVLA achieving 70.6% vs. 67.4%, respectively. See quantitative results of evaluating the three policies on the Google Robot in Appendix E.2.
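The headline correlation check is straightforward to reproduce given paired per-task success rates. The values below are toy numbers for illustration, not the paper's data (which yields $r=0.78$):

```python
# Pearson correlation between paired per-task success rates, implemented
# from its definition. The data points below are illustrative only.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

real = [10, 30, 50, 70, 90]   # toy real-world per-task success rates (%)
sim = [15, 25, 55, 65, 95]    # toy world-model per-task success rates (%)
r = pearson_r(real, sim)
```

In practice one would use `scipy.stats.pearsonr`, which also returns the p-value reported in Figure 4(a).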
### 4.2 Policy Ranking within a World Model
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar Chart: Mean Success Rates Across Different Model Versions
### Overview
The image is a vertical bar chart comparing the mean success rates (in percentage) of four different AI model versions. The chart includes error bars for each data point, indicating variability or confidence intervals around the mean.
### Components/Axes
* **Title:** "Mean Success Rates Across Different Model Versions" (centered at the top).
* **Y-Axis:** Labeled "Success Rate (%)". The scale runs from 0 to 70 with major gridlines at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70).
* **X-Axis:** Lists four categorical model versions. The labels are rotated approximately 30 degrees for readability.
* **Data Series:** Four bars, each a different color, with the mean percentage value printed inside the bar in white text. Each bar has a black error bar (whisker) extending above and below the top of the bar.
* **Legend:** There is no separate legend box. The model names are provided as x-axis labels directly beneath their corresponding bars.
### Detailed Analysis
The chart presents the following data points, from left to right:
1. **Octo Small 1.5**
* **Color:** Blue
* **Mean Success Rate:** 21.5%
* **Error Bar:** Extends from approximately 18% to 25% (±~3.5%).
* **Visual Trend:** This is the lowest-performing model in the set.
2. **Octo Base 1.5**
* **Color:** Orange
* **Mean Success Rate:** 23.8%
* **Error Bar:** Extends from approximately 20% to 28% (±~4%).
* **Visual Trend:** Shows a slight improvement over the Octo Small 1.5 model.
3. **OpenVLA v0.1 7B**
* **Color:** Green
* **Mean Success Rate:** 27.6%
* **Error Bar:** Extends from approximately 23% to 32% (±~4.5%).
* **Visual Trend:** Continues the upward trend, performing better than both Octo models.
4. **OpenVLA 7B**
* **Color:** Red
* **Mean Success Rate:** 67.4%
* **Error Bar:** Extends from approximately 62% to 72% (±~5%).
* **Visual Trend:** Shows a dramatic, non-linear increase in performance, more than doubling the success rate of the next best model.
### Key Observations
* **Performance Leap:** The most striking feature is the substantial performance gap between the "OpenVLA 7B" model and the three preceding models. Its success rate (67.4%) is approximately 2.4 times higher than the "OpenVLA v0.1 7B" (27.6%).
* **Incremental vs. Step Change:** The first three models (Octo Small, Octo Base, OpenVLA v0.1) show relatively incremental improvements in mean success rate (21.5% -> 23.8% -> 27.6%). The jump to OpenVLA 7B represents a step change.
* **Error Bar Consistency:** The size of the error bars (representing variability) appears roughly consistent across the first three models, spanning about 7-9 percentage points. The error bar for OpenVLA 7B is similar in absolute size (~10 points) but proportionally smaller relative to its much higher mean.
* **Clear Hierarchy:** The chart establishes a clear performance hierarchy: OpenVLA 7B >> OpenVLA v0.1 7B > Octo Base 1.5 > Octo Small 1.5.
### Interpretation
This chart demonstrates a significant advancement in model capability with the release of "OpenVLA 7B." The data suggests that whatever architectural changes, training data, or methodologies were introduced in this version resulted in a major breakthrough in task success rates compared to its predecessors and contemporaries.
The relatively small and consistent improvements among the first three models indicate a plateau or incremental progress within a certain paradigm. The dramatic spike for OpenVLA 7B implies a paradigm shiftâpossibly the effect of scaling model size (to 7B parameters), a more effective training approach, or a better-aligned objective function.
The presence of error bars is crucial, as it confirms that the observed differences, especially the large gap for OpenVLA 7B, are statistically meaningful and not just noise. The chart effectively communicates that OpenVLA 7B is not just marginally better but represents a new tier of performance for the evaluated task.
</details>
Figure 6: Success Rates of different model versions in WorldGym. We evaluate different generations of Octo and OpenVLA in the world model, showing that WorldGym assigns higher scores to larger and more recent versions.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Mean Success Rate Across Checkpoints
### Overview
The image displays a line chart comparing the performance of two machine learning policies, "Video Policy" and "Diffusion Policy," over the course of training. The chart plots the mean success rate (as a percentage) against the number of training steps, marked at specific checkpoints. Both lines include shaded regions, likely representing confidence intervals or standard deviation around the mean.
### Components/Axes
* **Chart Title:** "Mean Success Rate Across Checkpoints" (centered at the top).
* **Y-Axis:** Labeled "Success Rate (%)". The scale runs from 0 to 35, with major tick marks at intervals of 5 (0, 5, 10, 15, 20, 25, 30, 35).
* **X-Axis:** Labeled "Checkpoint (training steps)". The scale is non-linear, with labeled checkpoints at 0, 5k, 10k, 20k, 40k, and 60k steps.
* **Legend:** Positioned in the top-left corner of the chart area.
* **Video Policy:** Represented by a solid blue line with circular markers.
* **Diffusion Policy:** Represented by a solid orange line with square markers.
* **Data Series:** Two lines with associated shaded error bands.
* The **Video Policy (blue)** line has a light blue shaded region.
* The **Diffusion Policy (orange)** line has a light orange shaded region.
### Detailed Analysis
**Video Policy (Blue Line with Circles):**
* **Trend:** Shows a consistent upward trend, with a notable acceleration in improvement after the 10k step checkpoint.
* **Data Points (Approximate):**
* At 0 steps: ~19%
* At 10k steps: ~19.5%
* At 20k steps: ~26%
* At 40k steps: ~29%
* The shaded confidence band is relatively narrow, suggesting lower variance in performance at each checkpoint.
**Diffusion Policy (Orange Line with Squares):**
* **Trend:** Shows an overall upward trend but with more variability. Performance increases from 10k to 20k steps, dips at 40k steps, and then rises again by 60k steps.
* **Data Points (Approximate):**
* At 10k steps: ~4%
* At 20k steps: ~10%
* At 40k steps: ~8%
* At 60k steps: ~15%
* The shaded confidence band is wider than that of the Video Policy, indicating higher variance or uncertainty in the mean success rate.
### Key Observations
1. **Performance Gap:** The Video Policy consistently achieves a higher mean success rate than the Diffusion Policy at all comparable checkpoints (10k, 20k, 40k steps).
2. **Learning Trajectory:** The Video Policy shows a smooth, accelerating learning curve. The Diffusion Policy's learning curve is less smooth, exhibiting a performance regression between 20k and 40k steps before recovering.
3. **Data Availability:** The Video Policy has a data point at the 0-step checkpoint, while the Diffusion Policy's first recorded point is at 10k steps.
4. **Uncertainty:** The wider error bands for the Diffusion Policy suggest its performance is less consistent across training runs or evaluation episodes compared to the Video Policy.
### Interpretation
The chart demonstrates a clear comparative advantage for the "Video Policy" over the "Diffusion Policy" in this specific task, as measured by mean success rate. The Video Policy not only starts at a higher performance level but also learns more efficiently and reliably, as indicated by its steeper, smoother ascent and tighter confidence intervals.
The dip in the Diffusion Policy's performance at 40k steps is a critical anomaly. This could indicate a period of instability in training, such as catastrophic forgetting, overfitting to a specific subset of data, or a challenging phase in the optimization landscape. Its subsequent recovery by 60k steps suggests the training process eventually overcame this hurdle.
The absence of a 0-step checkpoint for the Diffusion Policy might imply it was initialized differently or that its baseline performance was not measured. Overall, the data suggests that for the evaluated task and within the observed training duration, the Video Policy is the more effective and robust approach. The shaded regions emphasize that while the mean trends are clear, there is inherent variability in the performance of both methods, more so for the Diffusion Policy.
</details>
Figure 7: Success Rate within WorldGym throughout training. We train a video-based policy and a diffusion policy from scratch and evaluate them within our world model as training progresses. Mean task success rate within the world model increases with additional training steps.
Now we test whether WorldGym can preserve policy rankings known a priori. We evaluated policies across different versions, sizes, and training stages within WorldGym on the OpenVLA Bridge evaluation task suite, and found their in-world-model performance rankings to be consistent with prior knowledge of their relative performance.
Different VLAs with Known Ranking. First, we average success rates across all 17 tasks and find that the relative performance rankings between RT-1-X, Octo, and OpenVLA are the same (Figure 4(b)) within both WorldGym and the real-world results reported in OpenVLA (Kim et al.).
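This ranking check amounts to comparing the orderings induced by the two sets of mean success rates. The scores below are the means reported in Figure 4(b):

```python
# Ranking-preservation check: WorldGym preserves the known ordering if
# sorting policies by world-model success reproduces the real-world order.
# Scores are the mean success rates (%) from Figure 4(b).
real_world = {"RT-1-X": 18.5, "Octo": 20.0, "OpenVLA": 70.6}
world_model = {"RT-1-X": 15.6, "Octo": 23.8, "OpenVLA": 67.4}

def ranking(scores):
    """Policies ordered from best to worst mean success rate."""
    return sorted(scores, key=scores.get, reverse=True)

rankings_agree = ranking(real_world) == ranking(world_model)
```

With more policies, a rank correlation such as Spearman's would quantify partial agreement rather than a binary match.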
Same Policies across Versions and Sizes. We further examine whether WorldGym preserves rankings between different versions and sizes of the same policy. In particular, we compare Octo-Small 1.5 against Octo-Base 1.5, and OpenVLA v0.1 7B, an undertrained development model, against OpenVLA 7B. As shown in Figure 6, the larger and more recent models outperform their smaller or earlier counterparts within WorldGym, consistent with the findings of real-world experiments performed in Octo Model Team et al. (2024) and Kim et al. This provides additional evidence that WorldGym faithfully maintains relative rankings even across model upgrades.
Same Policy across Training Steps. To examine whether WorldGym provides meaningful signals for policy training, hyperparameter tuning, and checkpoint selection, we train two robot policies from scratch. Building on prior evidence of WorldGym's effectiveness in evaluating VLA-based policies, we extend our study to two additional families: a video prediction-based policy (UniPi) (Du et al., 2023a) and a diffusion-based policy (DexVLA) (Wen et al., 2025), both trained on the Bridge V2 dataset (see Appendix C and Appendix D). We evaluate checkpoints of the video prediction policy at 2K, 8K, 12K, and 18K steps, and the diffusion policy at 10K, 20K, 40K, and 60K steps.
As shown in Figure 7, WorldGym tends to assign higher success rates to checkpoints as they increase in training steps, consistent with the lower mean squared error these policies achieve on their validation splits. This demonstrates WorldGym's ability to preserve policy rankings across models with different amounts of training compute.
Thus, we have shown how WorldGym can be used to obtain reasonable policy rankings. In particular, for the VLA-based policies we evaluate, we arrive at the same conclusions as real-world experiments about their relative performance. Notably, all of this is achieved without the manual effort of setting up real robot evaluation environments and monitoring policy rollouts. While real-world evaluation can sometimes take days to complete, all WorldGym rollouts reported here can be completed in under an hour on a single GPU and require only initial images for each trial.
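The evaluation protocol used throughout this section (Monte Carlo rollouts in the world model, scored by a VLM judge) can be sketched as follows. The `policy`, `world_model.rollout_step`, and `judge` interfaces are hypothetical stand-ins for the paper's components, not its actual API:

```python
def evaluate_policy(policy, world_model, judge, initial_frames, instruction,
                    horizon=40, n_rollouts=10):
    """Monte Carlo estimate of a policy's in-world-model success rate.

    Hypothetical interfaces (illustrative only):
      policy(frame, instruction)           -> action
      world_model.rollout_step(frame, a)   -> next predicted frame
      judge(frames, instruction)           -> 1.0 if a VLM judges the task done
    """
    outcomes = []
    for frame in initial_frames:              # one real start frame per trial
        for _ in range(n_rollouts):           # Monte Carlo rollouts per frame
            frames = [frame]
            for _ in range(horizon):          # autoregressive video rollout
                action = policy(frames[-1], instruction)
                frames.append(world_model.rollout_step(frames[-1], action))
            outcomes.append(judge(frames, instruction))
    return sum(outcomes) / len(outcomes)      # estimated success rate
```

Because only the initial frame must come from a real robot, the same loop works unchanged for the OOD evaluations in Section 4.3.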
### 4.3 Out-of-Distribution Inputs
In this section, we use WorldGym to explore policies' performance on both OOD input images and OOD language instructions.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Photograph Sequence: Robotic Arm Color-Based Object Manipulation
### Overview
The image is a composite of four photographs arranged in a 2x2 grid. It documents a robotic arm performing two distinct pick-and-place tasks based on color commands. The sequence demonstrates the robot's ability to identify and manipulate specific objects in a structured environment.
### Components/Axes
* **Image Layout:** A 2x2 grid of four separate photographs.
* **Text Overlays:** Two text labels are present.
* Top-left panel: A grey rectangular box in the upper-left corner contains the text "Pick red".
* Bottom-left panel: A grey rectangular box in the upper-left corner contains the text "Pick blue".
* **Physical Components in Scene:**
* **Robotic Arm:** A black, multi-jointed robotic arm with a two-finger gripper.
* **Work Surface:** A light-colored wooden table or platform.
* **Objects:** Two rectangular, flat objects.
* One is **blue**.
* One is **red**.
* **Background:** A wall with horizontal wooden paneling. A black structural frame is visible on the left side of each panel.
### Detailed Analysis
The image is segmented into four independent panels, processed below by quadrant.
**Top-Left Panel (Command: "Pick red")**
* **Content:** The robotic arm is positioned above the workspace. Its gripper is open and hovering approximately over the center of the wooden surface, between the blue and red objects. The blue object is on the left, the red object is on the right.
**Top-Right Panel (Execution: "Pick red")**
* **Content:** The robotic arm has moved. Its gripper is now closed around the top edge of the **red** object, successfully lifting it off the wooden surface. The blue object remains stationary on the left.
**Bottom-Left Panel (Command: "Pick blue")**
* **Content:** The scene resets. The robotic arm is again in a neutral, hovering position above the workspace, similar to the top-left panel. The blue object is on the left, the red object is on the right.
**Bottom-Right Panel (Execution: "Pick blue")**
* **Content:** The robotic arm has moved to the left side of the workspace. Its gripper is closed around the top edge of the **blue** object, lifting it. The red object remains stationary on the right.
### Key Observations
1. **Task Success:** The robotic arm successfully executes both commands, picking the correct colored object each time.
2. **Consistent Environment:** The starting position of the objects (blue left, red right) and the background are identical across all four panels, indicating a controlled experimental setup.
3. **Precise Movement:** The arm demonstrates precise spatial movement, transitioning from a central "home" position to the specific coordinates of the target object.
4. **Gripper Function:** The two-finger gripper is shown to effectively grasp the thin, flat objects by their edges.
### Interpretation
This image sequence is a technical demonstration of a robotic system's **visual perception and task execution capabilities**. The core function being tested is **color-based object recognition and manipulation**.
* **What it demonstrates:** The system can parse a natural language command ("Pick red"/"Pick blue"), identify the corresponding object in its visual field based on color, plan a motion path, and execute a precise grasp. The side-by-side comparison of command and result serves as clear validation.
* **How elements relate:** The text overlay is the input command. The first panel in each row shows the system's state upon receiving the command. The second panel shows the physical outcome, proving the command was correctly interpreted and acted upon. The unchanging position of the non-target object confirms the selectivity of the action.
* **Notable implications:** This is a foundational task for warehouse automation, assembly lines, or service robotics. The clean, repeatable setup suggests this is likely from a research paper, technical report, or demonstration video showcasing algorithm performance in a controlled environment. The absence of clutter or other objects highlights that the test is isolating the color-identification variable.
</details>
Figure 8: OOD: Color Classification. We add red and blue pieces of paper to a table, and ask the policies to “pick red” or “pick blue” (OOD image and language). OpenVLA excels, picking the correct colored paper in all trials, whereas all other policies score near chance.
OOD Image Input. Using modern image generation models like Nano Banana (Google, 2025), we can easily generate new input images to initialize our world model with. We evaluate robot policies under three OOD settings: unseen object interaction, distractor objects, and object classification (see detailed results in Table 6).
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: Sequential Image Editing for Robot Policy Instruction
### Overview
The image is a technical diagram illustrating a three-step process where natural language prompts are used to edit an initial scene image via an "Image Model," and the resulting edited image is then used to generate a "Robot Policy Instruction" for a robotic arm to perform a physical task. The diagram demonstrates a method for guiding robot actions through visual scene modifications rather than direct reprogramming.
### Components/Axes
The diagram is organized into a left column containing a legend and a main area with three horizontal rows, labeled (a), (b), and (c). Each row follows the same flow:
1. **Input:** An initial photograph of a scene.
2. **Process:** A yellow "Image Edit Prompt" box points to an "Image Model" box.
3. **Output:** Two sequential photographs showing the result of the image edit and the subsequent robot action.
**Legend (Located in the middle-left):**
* **Title:** "Legend"
* **Yellow Box:** "Image Edit Prompt"
* **Gray Box:** "Robot Policy Instruction"
**Row Labels:**
* **(a)** add an orange
* **(b)** swap carrot and orange
* **(c)** turn the carrot red
**Recurring Text in Output Images:**
* A gray text box in the top-left corner of each output image reads: "Put orange on plate". This is the constant "Robot Policy Instruction."
### Detailed Analysis
The diagram details three distinct editing scenarios applied to the same initial scene.
**Initial Scene (Common to all rows):**
A light blue plastic sink or tray contains:
* A yellow dish rack on the left.
* An orange carrot lying in the center.
* A beige plate on the right.
* A black robotic gripper is positioned above the scene.
**Row (a): "add an orange"**
1. **Edit Prompt:** "add an orange" (Yellow box).
2. **Image Model Output:** The scene is edited to include a new orange fruit placed inside the yellow dish rack. The original carrot and plate remain.
3. **Robot Action:** The robotic arm is shown descending, its gripper holding the newly added orange, moving it toward the plate.
**Row (b): "swap carrot and orange"**
1. **Edit Prompt:** "swap carrot and orange" (Yellow box).
2. **Image Model Output:** The positions of the carrot and the orange are exchanged. The orange is now in the center of the tray, and the carrot is in the dish rack.
3. **Robot Action:** The robotic arm is shown descending, its gripper holding the orange (now from the center of the tray), moving it toward the plate.
**Row (c): "turn the carrot red"**
1. **Edit Prompt:** "turn the carrot red" (Yellow box).
2. **Image Model Output:** The color of the carrot is changed from orange to red. Its position and the position of the orange in the dish rack remain unchanged from the initial state of row (a).
3. **Robot Action:** The robotic arm is shown descending, its gripper holding the now-red carrot, moving it toward the plate.
### Key Observations
1. **Consistent Robot Instruction:** Despite three different visual edits to the scene, the "Robot Policy Instruction" text ("Put orange on plate") remains identical in all output sequences.
2. **Action-Object Discrepancy:** In rows (a) and (b), the robot correctly acts on the *orange* as per the instruction. However, in row (c), the robot acts on the *red carrot*, which contradicts the literal text instruction but aligns with the visual salience created by the edit ("turn the carrot red").
3. **Spatial Consistency:** The plate (the target location) and the dish rack (a container) maintain their positions across all edits. The edits only manipulate the objects (carrot, orange) within this fixed environment.
4. **Process Flow:** The arrows clearly define a unidirectional pipeline: Text Prompt → Image Model → Edited Image → Robot Policy → Physical Action.
### Interpretation
This diagram illustrates a research concept in **vision-based robotic manipulation**, specifically exploring **instruction following via visual scene editing**.
* **Core Idea:** Instead of translating a language command directly into robot actions, the system first uses the command to *modify a visual representation* of the world. The robot's policy is then conditioned on this edited image, not the original text. This decouples high-level intent (the edit prompt) from low-level control.
* **What the Data Suggests:** The experiment shows that the robot's behavior is driven by the *visual outcome* of the edit, not the semantic meaning of the original instruction. In row (c), the instruction "Put orange on plate" is overridden by the visually dominant red carrot, which the robot then moves. This highlights the power and potential pitfall of visual conditioning: the robot follows what it *sees* as the salient object post-edit.
* **Significance:** This approach could make robot systems more flexible. A single, fixed "put object on plate" policy can be guided to manipulate different objects by simply editing the scene image with a prompt, without retraining the low-level policy. However, it also reveals a need for robust alignment between the visual edit's intent and the downstream policy's interpretation to avoid errors like the one in row (c).
* **Underlying Mechanism:** The "Image Model" is likely a text-guided image editing AI (e.g., a diffusion model). The "Robot Policy" is a pre-trained vision-based control system. The diagram proposes using the former as a "visual planner" for the latter.
</details>
Figure 9: OOD: Unseen object. We use Nano Banana (Google, 2025) to add an orange to the world model's initial frame. When both the orange and the carrot are present (a–b), OpenVLA grabs whichever is closer. After editing the carrot's color to red (c), however, the orange is correctly picked up.
Figure 10: OOD: Failure modes. Left: We add a laptop to the scene, which displays an image of a carrot. In 15% of trials, OpenVLA grabs the laptop instead of the real carrot. Right: We test the ability to distinguish between squares and circles, celebrity faces, and cats and dogs, with all policies scoring near chance.
- Unseen Objects: We edit a scene to contain both a carrot and an orange, asking the policy to pick up the orange (Figure 9). OpenVLA grabs whichever object is closer until we edit the carrot's color to be red, after which it always grabs the orange correctly. This suggests that it struggles to distinguish carrots and oranges by their shape.
- Distractor Objects: We use the image editing model to add a computer displaying an image of a carrot (Figure 10, left). OpenVLA mistakenly grabs the carrot on the computer screen in 15% of trials, suggesting a limited ability to distinguish real 3D objects from 2D depictions.
- Classification: We add a piece of paper on each side of a desk. We first color one paper red and the other blue and instruct the model to “pick red”/“pick blue” (Figure 8). OpenVLA achieves a perfect score, always moving towards the correct color. Octo and RT-1, on the other hand, typically move towards whichever paper is closer, scoring no better than chance. We also try more advanced classification tasks (Figure 10, right), but find that all policies score near chance.
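To make "near chance" concrete for the two-alternative pick tasks above, one can compare a policy's success count against the binomial chance baseline. A minimal sketch (this helper and its use are illustrative, not the paper's metric):

```python
import math

def chance_z(successes, trials, p_chance=0.5):
    """z-score of observed successes vs. chance on a two-alternative pick
    task, using the normal approximation to the binomial distribution.
    Illustrative helper, not taken from the paper."""
    mean = trials * p_chance                          # expected successes
    std = math.sqrt(trials * p_chance * (1 - p_chance))
    return (successes - mean) / std

# A perfect 10/10 on "pick red"/"pick blue" sits ~3.2 standard deviations
# above chance, while 6/10 (z ~= 0.63) is indistinguishable from guessing.
```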
For a more quantitative study, we modify all the initial frames of the OpenVLA Bridge evaluation task suite to include random OOD distractor items (see Figure 12), keeping the language instructions the same. We then repeat the rollout procedure from Section 4.1 to measure the degree to which the addition of unrelated objects affects policy performance. We find that all the tested VLAs degrade in performance, with OpenVLA being the most robust of the three (Figure 13).
<details>
<summary>x12.png Details</summary>

### Visual Description
## Photographic Sequence: Robotic Arm Task Demonstration
### Overview
The image is a composite of four distinct photographic sequences, arranged in a 2x2 grid. Each sequence documents a robotic arm performing a specific manipulation task within a simulated kitchen sink environment. The sequences are labeled with descriptive titles. The environment consists of a light blue sink basin, a wooden countertop, a yellow dish rack, and various colorful, toy-like objects (cups, plates, pots, and food-shaped items).
### Components/Axes
* **Layout:** Four separate image sequences, each containing 3-4 frames showing progressive steps of a task.
* **Titles (Text Labels):**
* Top-Left Sequence: "Put yellow corn in red cup"
* Top-Right Sequence: "Put plate on drying rack"
* Bottom-Left Sequence: "Put yellow corn in red cup" (Note: This title is identical to the top-left, but the objects and task differ.)
* Bottom-Right Sequence: "Move the pot to the counter"
* **Key Objects:**
* **Robotic Arm:** A black, multi-jointed robotic arm with a gripper, present in all frames.
* **Sink:** A light blue, rectangular basin.
* **Dish Rack:** A yellow, grid-like rack positioned to the left of the sink.
* **Countertop:** A light wood-grain surface surrounding the sink.
* **Manipulable Objects:** Various colored items including a red cup, a blue cup, a yellow corn-shaped object, a purple eggplant-shaped object, an orange carrot-shaped object, a white plate, and a silver pot.
### Detailed Analysis
**Sequence 1 (Top-Left): "Put yellow corn in red cup"**
* **Frame 1:** The robotic arm is positioned above the sink. Inside the sink are a red cup, a blue cup, a yellow corn object, and a purple eggplant object.
* **Frame 2:** The gripper has grasped the yellow corn object.
* **Frame 3:** The arm moves the yellow corn over the red cup.
* **Frame 4:** The yellow corn is released and placed inside the red cup. The task is complete.
**Sequence 2 (Top-Right): "Put plate on drying rack"**
* **Frame 1:** The robotic arm is above the sink. A white plate and an orange carrot object are in the sink.
* **Frame 2:** The gripper has grasped the white plate.
* **Frame 3:** The arm moves the plate to the left, over the yellow dish rack.
* **Frame 4:** The plate is placed onto the dish rack. The carrot remains in the sink.
**Sequence 3 (Bottom-Left): "Put yellow corn in red cup" (Title Discrepancy)**
* **Frame 1:** The robotic arm is above the sink. A silver pot and a blue cup are in the sink. **No yellow corn or red cup is visible.**
* **Frame 2:** The gripper has grasped the silver pot.
* **Frame 3:** The arm moves the pot over the blue cup.
* **Frame 4:** The pot is placed inside the blue cup. The title does not match the observed action.
**Sequence 4 (Bottom-Right): "Move the pot to the counter"**
* **Frame 1:** The robotic arm is above the sink. A red cup, a purple eggplant, and a silver pot are in the sink.
* **Frame 2:** The gripper has grasped the silver pot.
* **Frame 3:** The arm moves the pot to the right, out of the sink and over the wooden countertop.
* **Frame 4:** The pot is placed on the countertop. The red cup and eggplant remain in the sink.
### Key Observations
1. **Task Repetition with Variation:** The first and third sequences share the same title ("Put yellow corn in red cup") but depict completely different tasks involving different objects (corn/cup vs. pot/cup). This suggests a possible labeling error or a demonstration of the same high-level command applied to different object sets.
2. **Consistent Environment:** All sequences use the identical sink, rack, and countertop setup, indicating a controlled experimental or training environment.
3. **Object Permanence:** Objects not involved in the active task (e.g., the carrot in Sequence 2, the eggplant in Sequence 4) remain stationary throughout their respective sequences.
4. **Robotic Precision:** The arm demonstrates precise grasping, transport, and placement in each sequence, successfully completing the intended physical manipulation.
### Interpretation
This image serves as a technical demonstration of a robotic system's capability to perform sequential, object-manipulation tasks based on natural language instructions. The sequences illustrate the robot's ability to:
* **Parse Instructions:** Understand commands like "put X in Y" or "move A to B."
* **Perceive Objects:** Identify specific objects (corn, cup, plate, pot) within a cluttered scene.
* **Execute Motor Plans:** Perform the necessary sequence of movements (reach, grasp, transport, release) to fulfill the command.
The discrepancy in the third sequence's title is a critical observation. It highlights a potential issue in either the instruction-generation system, the data labeling process, or the demonstration curation. For a technical document, this inconsistency would need to be flagged for review, as it could indicate a breakdown in the alignment between language commands and physical actions, which is fundamental to reliable human-robot interaction.
The overall purpose is likely to showcase progress in embodied AI, where an agent must ground language in physical actions within a 3D environment. The clean, consistent setup suggests this is from a research paper, project demonstration, or training data collection effort for robotic learning.
</details>
Figure 11: OOD Language Instructions. We pick a set of tasks from the OpenVLA Bridge evaluation suite and modify the language instruction, e.g., changing the target object and/or its goal destination.
OOD Language. Additionally, even without access to an image editing model, we demonstrate that WorldGym can be used to evaluate policies' performance on OOD language instructions. Starting from a set of initial frames for the tasks listed in Table 5, we modify each task's language instruction, e.g., changing the target object and/or its goal state. Figure 11 shows rollouts from OpenVLA on these OOD language tasks. We can then easily obtain success rates for these unseen tasks by rolling them out within WorldGym, finding that OpenVLA generalizes best (see Table 1). Policies struggle across the board on the “Move the pot to the counter” task, with only OpenVLA achieving a single success. We suspect that OpenVLA consistently outperforms Octo and RT-1-X on OOD language tasks due to its strong VLM backbone and richer robot pretraining dataset (Kim et al., 2024).
| Task | RT-1-X | Octo | OpenVLA |
| --- | --- | --- | --- |
| Move Pot Into Drying Rack | 3 | 0 | 7 |
| Move The Pot To The Counter | 0 | 0 | 1 |
| Put Plate On Drying Rack | 4 | 2 | 8 |
| Put Yellow Corn In Red Cup | 1 | 2 | 3 |
Table 1: Policy Evaluation Results on Bridge OOD Language Tasks. “Move the pot to the counter” is perhaps the most challenging because the Bridge dataset does not contain trajectories that move objects outside of the sink basin. OpenVLA has the strongest performance, which we attribute to its more powerful language model backbone.
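As a sanity check on Table 1, the per-policy totals and the resulting ranking can be computed directly from the listed counts (the number of trials per task is not stated in this excerpt, so only raw success totals are compared):

```python
# Success counts copied from Table 1.
table1 = {
    "Move Pot Into Drying Rack":   {"RT-1-X": 3, "Octo": 0, "OpenVLA": 7},
    "Move The Pot To The Counter": {"RT-1-X": 0, "Octo": 0, "OpenVLA": 1},
    "Put Plate On Drying Rack":    {"RT-1-X": 4, "Octo": 2, "OpenVLA": 8},
    "Put Yellow Corn In Red Cup":  {"RT-1-X": 1, "Octo": 2, "OpenVLA": 3},
}

# Sum successes per policy and rank best-to-worst.
totals = {p: sum(row[p] for row in table1.values())
          for p in ("RT-1-X", "Octo", "OpenVLA")}
ranking = sorted(totals, key=totals.get, reverse=True)
```

The totals (RT-1-X: 8, Octo: 4, OpenVLA: 19) put OpenVLA first, matching the paper's conclusion that it generalizes best to OOD language.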
<details>
<summary>x13.png Details</summary>

### Visual Description
## Diagram: Sequential Process of Adding Distractions to a Sink Environment
### Overview
The image displays a four-panel sequence illustrating a process where objects are progressively added to a sink environment, with annotations indicating the involvement of an "Image Model." The sequence flows from left to right, connected by arrows.
### Components/Axes
* **Panels:** Four rectangular image panels arranged horizontally.
* **Flow Arrows:** Black arrows connect the panels, indicating the sequence direction (left to right).
* **Text Annotations:**
* Above the first arrow (between Panel 1 and Panel 2): A yellow rectangular box containing the text "add distractions".
* Below the first arrow: A white rectangular box containing the text "Image Model".
* Above the second arrow (between Panel 3 and Panel 4): A yellow rectangular box containing the text "add distractions".
* Below the second arrow: A white rectangular box containing the text "Image Model".
* **Scene Elements (Common to all panels):**
* A light blue kitchen sink basin.
* A white dish rack on the left side of the sink.
* A wooden countertop surface.
* A faucet and pipes visible in the background.
### Detailed Analysis
**Panel 1 (Leftmost):**
* **Content:** A relatively empty sink.
* **Objects:** A large, stainless steel cooking pot is placed inside the sink, leaning against the left side. The dish rack is empty.
**Panel 2 (Second from left):**
* **Transition:** Annotated with "add distractions" and "Image Model".
* **Objects Added:** Compared to Panel 1, three new objects appear:
1. A light blue bowl (in the dish rack).
2. A yellow rubber duck (in the sink, near the pot).
3. A small red toy car (in the sink, near the duck).
* **Objects Removed:** The large stainless steel pot from Panel 1 is no longer present.
**Panel 3 (Third from left):**
* **Content:** The sink is mostly empty again, but with new elements.
* **Objects:**
1. A purple, spherical object (possibly a toy or fruit) sits in the sink.
2. A large, black, multi-pronged object (resembling a robotic gripper or a complex tool) is suspended above the center of the sink.
* **Objects Removed:** The blue bowl, yellow duck, and red car from Panel 2 are absent.
**Panel 4 (Rightmost):**
* **Transition:** Annotated with "add distractions" and "Image Model".
* **Objects Added:** Compared to Panel 3, several new objects appear:
1. A red plastic cup (in the sink).
2. A green rectangular block (in the sink).
3. A blue toy car (in the sink, near the green block).
4. A crumpled piece of white paper or foil (in the sink).
* **Objects Retained:** The purple spherical object and the black suspended object from Panel 3 remain in place.
### Key Observations
1. **Process Pattern:** The sequence follows a clear pattern: a base scene (Panel 1 or 3) is shown, then an "add distractions" step mediated by an "Image Model" results in a more cluttered scene (Panel 2 or 4).
2. **Object Permanence:** Objects are not persistent across the entire sequence. The pot from Panel 1 disappears. The distractions from Panel 2 disappear before Panel 3. This suggests each "base scene" (1 and 3) might be a separate starting point for a test.
3. **Nature of "Distractions":** The added objects are varied in color, shape, and type (kitchenware, toys, abstract shapes), likely chosen to challenge visual recognition systems.
4. **The "Image Model":** Its placement below the arrow suggests it is the agent or system responsible for processing the scene or generating the "distractions." The black gripper-like object in Panels 3 and 4 could be a physical manifestation or tool associated with this model.
### Interpretation
This diagram visually documents an experimental or testing protocol for an **Image Model**, likely in the field of computer vision or robotics. The core demonstration is the model's interaction with a physical environment (the sink) under increasing visual complexity.
* **What it suggests:** The process tests the model's ability to operate or maintain performance when "distractions" (clutter, irrelevant objects) are introduced into its workspace. The two separate sequences (1→2 and 3→4) might represent different test trials or scenarios.
* **Relationship between elements:** The "Image Model" is the central actor. The "add distractions" command is the experimental variable. The sink environment is the controlled testbed. The changing objects are the measured outcome or the introduced noise.
* **Notable implication:** The removal of objects between sequences (e.g., the pot disappearing) indicates these are discrete trials, not a continuous accumulation. The presence of the black gripper in the second sequence (Panels 3 & 4) but not the first might indicate a change in the model's hardware configuration or a different type of test (e.g., manipulation vs. pure observation).
* **Underlying purpose:** This setup is characteristic of research into **robust perception** and **scene understanding**. The goal is likely to evaluate how well the Image Model can identify target objects, understand spatial relationships, or plan actions when the visual scene is non-ideal and filled with irrelevant items. The sink provides a consistent, structured background against which the variable distractions are placed.
</details>
Figure 12: OOD Distraction Examples. We use Nano Banana (Google, 2025) to add distractions to every image of the OpenVLA Bridge task suite. The resulting change in mean success rates can be seen in Figure 13.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Grouped Bar Chart: Effect of OOD Distractors on Success Rates
### Overview
This is a grouped bar chart comparing the performance of three different models (RT-1-X, Octo, OpenVLA) on a task, measured by success rate percentage. The chart specifically illustrates the negative impact of introducing Out-Of-Distribution (OOD) distractor images into the input. For each model, there are two bars: one representing the baseline "World Model" performance and another representing performance "with OOD input image."
### Components/Axes
* **Chart Title:** "Effect of OOD Distractors on Success Rates" (centered at the top).
* **Y-Axis:** Labeled "Success Rate (%)". The scale runs from 0 to 70, with major gridlines at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70).
* **X-Axis:** Lists three model categories: "RT-1-X", "Octo", and "OpenVLA".
* **Legend:** Positioned in the top-left corner of the plot area.
* A solid, light blue rectangle corresponds to "World Model".
* A light blue rectangle with diagonal hatching (///) corresponds to "World Model (with OOD input image)".
* **Data Series:** There are two data series, represented by paired bars for each model on the x-axis.
* **Series 1 (World Model):** Solid-colored bars. Colors are model-specific: light blue for RT-1-X, orange for Octo, and dark red for OpenVLA.
* **Series 2 (with OOD input image):** Hatched bars with the same base color as their solid counterpart for each model.
* **Error Bars:** Each bar has a black, vertical error bar extending above and below the top of the bar, indicating variability or confidence intervals.
### Detailed Analysis
The chart presents the following specific data points (values are labeled directly on the bars):
**1. Model: RT-1-X (Leftmost group)**
* **World Model (Solid Light Blue Bar):** Success Rate = **15.6%**. The error bar extends from approximately 12% to 19%.
* **With OOD input image (Hatched Light Blue Bar):** Success Rate = **7.6%**. The error bar extends from approximately 4% to 12%.
* **Trend:** Performance drops by approximately 8 percentage points (a ~51% relative decrease) when OOD distractors are introduced.
**2. Model: Octo (Center group)**
* **World Model (Solid Orange Bar):** Success Rate = **23.8%**. The error bar extends from approximately 20% to 28%.
* **With OOD input image (Hatched Orange Bar):** Success Rate = **4.1%**. The error bar extends from approximately 2% to 6%.
* **Trend:** Performance drops dramatically by approximately 19.7 percentage points (an ~83% relative decrease) with OOD distractors. This is the largest absolute and relative drop among the three models.
**3. Model: OpenVLA (Rightmost group)**
* **World Model (Solid Dark Red Bar):** Success Rate = **67.4%**. The error bar extends from approximately 62% to 72%.
* **With OOD input image (Hatched Dark Red Bar):** Success Rate = **39.4%**. The error bar extends from approximately 34% to 45%.
* **Trend:** Performance drops by approximately 28 percentage points (a ~42% relative decrease) with OOD distractors. Despite the drop, OpenVLA maintains the highest success rate in both conditions.
### Key Observations
1. **Universal Negative Impact:** All three models experience a significant decrease in success rate when tested with OOD input images compared to the standard World Model condition.
2. **Performance Hierarchy:** The baseline performance ranking (World Model) is OpenVLA (67.4%) > Octo (23.8%) > RT-1-X (15.6%). This hierarchy is preserved under the OOD condition: OpenVLA (39.4%) > RT-1-X (7.6%) > Octo (4.1%). Notably, Octo falls from second to last place under OOD conditions.
3. **Varying Robustness:** The models show different levels of robustness to OOD distractors.
* **Octo** is the most severely affected, losing over 80% of its baseline performance.
* **RT-1-X** and **OpenVLA** show more comparable relative degradation (~51% and ~42% loss, respectively), though OpenVLA's absolute drop is larger.
4. **Error Bar Overlap:** For RT-1-X and Octo, the error bars for the two conditions do not overlap, strongly suggesting the performance drop is statistically significant. For OpenVLA, the error bars also do not overlap.
### Interpretation
This chart demonstrates a critical vulnerability in the evaluated policies: their success rates are highly sensitive to the presence of out-of-distribution visual distractors. The data suggest that:
* **OOD Distractors are a Major Failure Mode:** Introducing irrelevant, unfamiliar visual elements severely degrades the policies' ability to complete their intended tasks. This indicates a lack of robustness and a potential over-reliance on specific, in-distribution visual cues.
* **Policy Architecture and Training Matter:** The stark difference in impact between Octo and the other policies implies that a policy's underlying design and training data strongly influence its resilience to visual noise. OpenVLA's superior baseline and relative robustness may point to advantages in its architecture or training regimen.
* **Practical Implications:** For real-world deployment, where environments are uncontrolled and contain unexpected objects, this vulnerability is a serious concern. A policy like Octo, which performs reasonably well in clean conditions, could fail catastrophically in a cluttered or novel setting. The results argue for testing robot policies under OOD conditions to assess their true reliability.
* **Robustness Gaps Surface Only Under OOD Testing:** Although the evaluation runs inside a world model, it is the policies' learned visual representations that break down when presented with slightly unfamiliar input, highlighting a gap between performance on curated benchmarks and potential real-world utility.
</details>
Figure 13: Effect of OOD Distractors. We use an image editing model to add distractor objects to the Bridge evaluation suite, finding that RT-1-X drops in performance by 51%, Octo by 83%, and OpenVLA by 41.5%, making OpenVLA the most robust to distractors. See Table 7 for details.
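The relative drops quoted in the caption follow directly from the baseline and OOD success rates reported above; a quick check:

```python
# Relative performance drop under OOD distractors, computed from the
# (baseline, OOD) success rates in percent reported for each policy.
rates = {
    "RT-1-X":  (15.6, 7.6),
    "Octo":    (23.8, 4.1),
    "OpenVLA": (67.4, 39.4),
}

def relative_drop(baseline: float, ood: float) -> float:
    """Percentage of the baseline success rate lost under OOD distractors."""
    return 100.0 * (1.0 - ood / baseline)

drops = {name: round(relative_drop(b, o), 1) for name, (b, o) in rates.items()}
# RT-1-X: 51.3, Octo: 82.8, OpenVLA: 41.5
```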
The ability to use WorldGym to quickly design and evaluate policies on OOD tasks and environments thus leads to new findings about policies' strengths and weaknesses. Future research can be prioritized to address these issues, all without the extra effort of setting up additional experiments in the real world or in handcrafted simulators.
## 5 Related Work
Action-Conditioned Video Generation.
Previous work has shown that video generation can simulate real-world interactions (Yang et al., 2023; Brooks et al., 2024), robotic plans (Du et al., 2024; 2023b), and games (AI et al., 2024; Bruce et al., 2024; Valevski et al., 2024; Alonso et al., 2024) when conditioned on text or keyboard controls. More recent work (NVIDIA et al., 2025) has begun to explore applying video generation to simulating complex robotic controls. We take this a step further by using video-based world models to quantitatively estimate robot policy success rates. WorldGym draws architectural inspiration from prior work on video generation such as Diffusion Forcing (Chen et al., 2024) and Diffusion Transformers (Peebles & Xie, 2023), but experiments with variable horizon lengths to support efficient long-horizon rollouts for policies with a variety of action chunk sizes.
Policy Evaluation.
Off-policy and offline policy evaluation has long been studied in the RL literature (Farajtabar et al., 2018; Jiang & Li, 2015; Kallus & Uehara, 2019; Munos et al., 2016; Precup et al., 2000; Thomas et al., 2015a). Some of these approaches are model-based, learning a dynamics model from previously collected data and rolling out the learned dynamics model for policy evaluation (Fonteneau et al., 2013; Zhang et al., 2021; Yu et al., 2020; Hafner et al., 2020). Since learning a dynamics model is challenging and subject to accumulation of error, a broader set of work has focused on model-free policy evaluation, which works by estimating the value function (Le et al., 2019; Duan & Wang, 2020; Sutton et al., 2009; 2016) or by policy correction (Kanamori et al., 2009; Nguyen et al., 2010; Nachum et al., 2019). WorldGym performs model-based policy evaluation, but proposes to learn a single world model on image-based observations that can be used to evaluate different policies on different tasks. SIMPLER (Li et al., 2024) aims to evaluate real-world policies by constructing software-based simulators from natural images, and showed highly correlated curves between simulated evaluation and real-robot execution; however, evaluating OOD language and image inputs in SIMPLER is hard without significant hand-engineering of the software simulator. Li et al. (2025) propose to evaluate robot policies in a world model in a specific bi-manual manipulation setup, whereas WorldGym focuses on evaluating policies across diverse environments and robot morphologies while enabling testing of OOD language and image inputs.
## 6 Conclusion
We have presented WorldGym, a world-model-based environment for evaluating robot policies. WorldGym emulates realistic robot interactions and shows strong correlations between simulated evaluation and real-world policy outcomes. It further offers the flexibility to evaluate OOD language instructions and tasks starting from an OOD initial frame. While not all interactions emulated by WorldGym are fully realistic, it serves as an important step towards safe and reproducible policy evaluation before deployment.
#### Acknowledgments
We thank Xinchen Yan and Doina Precup for reviewing versions of this manuscript. We thank Moo Jin Kim for help in setting up the OpenVLA policy. We thank Boyuan Chen and Kiwhan Song for the Diffusion Forcing GitHub repository.
## References
- Afzal et al. (2020) Afsoon Afzal, Deborah S Katz, Claire Le Goues, and Christopher S Timperley. A study on the challenges of using robotics simulators for testing. arXiv preprint arXiv:2004.07368, 2020.
- AI et al. (2024) Decart AI, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. 2024. URL https://oasis-model.github.io/.
- Alonso et al. (2024) Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. arXiv preprint arXiv:2405.12399, 2024.
- Brohan et al. (2022) Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- Brooks et al. (2024) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
- Bruce et al. (2024) Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
- Chen et al. (2024) Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. arXiv preprint arXiv:2407.01392, 2024.
- Chi et al. (2023) Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp. 02783649241273668, 2023.
- Choi et al. (2021) HeeSun Choi, Cindy Crump, Christian Duriez, Asher Elmquist, Gregory Hager, David Han, Frank Hearl, Jessica Hodgins, Abhinandan Jain, Frederick Leve, et al. On the use of simulation in robotics: Opportunities, challenges, and suggestions for moving forward. Proceedings of the National Academy of Sciences, 118(1):e1907856118, 2021.
- Chung et al. (2023) Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining, 2023. URL https://arxiv.org/abs/2304.09151.
- Du et al. (2023a) Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation, 2023a. URL https://arxiv.org/abs/2302.00111.
- Du et al. (2023b) Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023b.
- Du et al. (2024) Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024.
- Duan & Wang (2020) Yaqi Duan and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. arXiv preprint arXiv:2002.09516, 2020.
- Dulac-Arnold et al. (2019) Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
- Ebert et al. (2021) Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021.
- Erez et al. (2015) Tom Erez, Yuval Tassa, and Emanuel Todorov. Simulation tools for model-based robotics: Comparison of bullet, havok, mujoco, ode and physx. In 2015 IEEE international conference on robotics and automation (ICRA), pp. 4397–4404. IEEE, 2015.
- Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas MĂŒller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.
- Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.
- Fonteneau et al. (2013) Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.
- Fu et al. (2021) Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, et al. Benchmarks for deep off-policy evaluation. arXiv preprint arXiv:2103.16596, 2021.
- Google (2025) Google. Image editing in Gemini just got a major upgrade. Blog post on "The Keyword", Google, August 26, 2025. URL https://blog.google/products/gemini/updated-image-editing-model/.
- Hafner et al. (2019) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
- Hafner et al. (2020) Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385.
- Ho et al. (2022) Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
- Jiang & Li (2015) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
- Jiang & Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International conference on machine learning, pp. 652–661. PMLR, 2016.
- Kaelbling et al. (1995) Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Partially observable Markov decision processes for artificial intelligence. In International Workshop on Reasoning with Uncertainty in Robotics, pp. 146–163. Springer, 1995.
- Kaiser et al. (2019) Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
- Kallus & Uehara (2019) Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019.
- Kanamori et al. (2009) Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.
- Kim et al. (2024) Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246.
- Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. PMLR, 2019.
- Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Li et al. (2024) Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
- Li et al. (2025) Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025.
- Liu et al. (2024) Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Ok-robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024.
- Ma et al. (2025) Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation, 2025. URL https://arxiv.org/abs/2401.03048.
- Munos et al. (2016) R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
- Nachum et al. (2019) Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in neural information processing systems, 32, 2019.
- Nguyen et al. (2010) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
- NVIDIA et al. (2025) NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely KlĂĄr, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, and Artur Zolkowski. Cosmos world foundation model platform for physical ai, 2025. URL https://arxiv.org/abs/2501.03575.
- Octo Model Team et al. (2024) Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.
- O'Neill et al. (2023) Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023.
- OpenAI et al. (2024) OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob McGrew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler, Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson, David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler, Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such, Filippo Raso, Francis Zhang, Fred von Lohmann, Freddie Sulit, Gabriel 
Goh, Gene Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao, Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney, Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren, Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian O'Connell, Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick, Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie Kiros, Jan Leike, Jane Park, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh, Ji Lin, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joanne Jang, Joaquin Quinonero Candela, Joe Beutler, Joe Landers, Joel Parish, Johannes Heidecke, John Schulman, Jonathan Lachman, Jonathan McKay, Jonathan Uesato, Jonathan Ward, Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Jos Kraaijeveld, Josh Gross, Josh Kaplan, Josh Snyder, Joshua Achiam, Joy Jiao, Joyce Lee, Juntang Zhuang, Justyn Harriman, Kai Fricke, Kai Hayashi, Karan Singhal, Katy Shi, Kavin Karthik, Kayla Wood, Kendra Rimbach, Kenny Hsu, Kenny Nguyen, Keren Gu-Lemberg, Kevin Button, Kevin Liu, Kiel Howe, Krithika Muthukumar, Kyle Luther, Lama Ahmad, Larry Kai, Lauren Itow, Lauren Workman, Leher Pathak, Leo Chen, Li Jing, Lia Guy, Liam Fedus, Liang Zhou, Lien Mamitsuka, Lilian Weng, Lindsay McCallum, Lindsey Held, Long Ouyang, Louis Feuvrier, Lu Zhang, Lukas Kondraciuk, Lukasz Kaiser, Luke Hewitt, Luke Metz, Lyric Doshi, Mada Aflak, Maddie Simens, Madelaine Boyd, Madeleine Thompson, Marat Dukhan, Mark Chen, Mark Gray, Mark Hudnall, Marvin Zhang, Marwan Aljubeh, Mateusz Litwin, Matthew Zeng, Max Johnson, Maya Shetty, Mayank Gupta, Meghan Shah, Mehmet Yatbaz, Meng Jia Yang, Mengchao Zhong, Mia Glaese, Mianna Chen, Michael Janner, Michael Lampe, Michael Petrov, Michael Wu, Michele Wang, Michelle Fradin, Michelle 
Pokrass, Miguel Castro, Miguel Oom Temudo de Castro, Mikhail Pavlov, Miles Brundage, Miles Wang, Minal Khan, Mira Murati, Mo Bavarian, Molly Lin, Murat Yesildal, Nacho Soto, Natalia Gimelshein, Natalie Cone, Natalie Staudacher, Natalie Summers, Natan LaFontaine, Neil Chowdhury, Nick Ryder, Nick Stathas, Nick Turley, Nik Tezak, Niko Felix, Nithanth Kudige, Nitish Keskar, Noah Deutsch, Noel Bundick, Nora Puckett, Ofir Nachum, Ola Okelola, Oleg Boiko, Oleg Murk, Oliver Jaffe, Olivia Watkins, Olivier Godement, Owen Campbell-Moore, Patrick Chao, Paul McMillan, Pavel Belov, Peng Su, Peter Bak, Peter Bakkum, Peter Deng, Peter Dolan, Peter Hoeschele, Peter Welinder, Phil Tillet, Philip Pronin, Philippe Tillet, Prafulla Dhariwal, Qiming Yuan, Rachel Dias, Rachel Lim, Rahul Arora, Rajan Troll, Randall Lin, Rapha Gontijo Lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Ricky Wang, Rob Donnelly, Rob Honsby, Rocky Smith, Rohan Sahai, Rohit Ramchandani, Romain Huet, Rory Carmichael, Rowan Zellers, Roy Chen, Ruby Chen, Ruslan Nigmatullin, Ryan Cheu, Saachi Jain, Sam Altman, Sam Schoenholz, Sam Toizer, Samuel Miserendino, Sandhini Agarwal, Sara Culver, Scott Ethersmith, Scott Gray, Sean Grove, Sean Metzger, Shamez Hermani, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shino Jomoto, Shirong Wu, Shuaiqi, Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tao Xu, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shadwell, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tom Rubin, Tom Stasi, Tomer Kaftan, Tristan Heywood, Troy Peterson, Tyce Walters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yu Zhang, Yuchen He, 
Yuchen Zhang, Yujia Jin, Yunxing Dai, and Yury Malkov. Gpt-4o system card, 2024. URL https://arxiv.org/abs/2410.21276.
- Peebles & Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
- Precup et al. (2000) Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766, 2000.
- Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Salvato et al. (2021) Erica Salvato, Gianfranco Fenu, Eric Medvet, and Felice Andrea Pellegrino. Crossing the reality gap: A survey on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Access, 9:153171–153187, 2021.
- Shafiullah et al. (2023) Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023.
- Singer et al. (2022) Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- Soljacic et al. (2024) Fran Soljacic, Theresa Law, Meia Chita-Tegmark, and Matthias Scheutz. Robots in healthcare as envisioned by care professionals. Intelligent Service Robotics, pp. 1–17, 2024.
- Sünderhauf et al. (2018) Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, et al. The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 37(4-5):405–420, 2018.
- Sutton et al. (2009) Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning, pp. 993–1000, 2009.
- Sutton et al. (2016) Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research, 17(73):1–29, 2016.
- Tedrake et al. (2019) Russ Tedrake et al. Drake: Model-based design and verification for robotics. 2019.
- Thomas et al. (2015a) P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the 29th Conference on Artificial Intelligence, 2015a.
- Thomas & Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148. PMLR, 2016.
- Thomas et al. (2015b) Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015b.
- Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026–5033. IEEE, 2012.
- Valevski et al. (2024) Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.
- Villegas et al. (2022) Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
- Walke et al. (2023) Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pp. 1723–1736. PMLR, 2023.
- Wang et al. (2024) Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191.
- Wen et al. (2025) Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control, 2025. URL https://arxiv.org/abs/2502.05855.
- Xiao et al. (2019) Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, and Martin MĂŒller. Learning to combat compounding-error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206, 2019.
- Yang et al. (2020) Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized lagrangian. Advances in Neural Information Processing Systems, 33:6551–6561, 2020.
- Yang et al. (2023) Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023.
- Yu et al. (2020) Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
- Zhang et al. (2021) Michael R Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, and Mohammad Norouzi. Autoregressive dynamics models for offline policy evaluation and optimization. arXiv preprint arXiv:2104.13877, 2021.
- Zhao et al. (2020) Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE symposium series on computational intelligence (SSCI), pp. 737–744. IEEE, 2020.
# Appendix
## Appendix A Additional Details of the Autoregressive Diffusion Transformer
Implementation details: We use the VAE from Stable Diffusion 3 (Esser et al., 2024) to independently encode 256 $\times$ 256 image frames into latent space. We employ a 16-layer transformer with 1024 hidden dimensions and 16 attention heads. We train the world model on a diverse set of data sources, including 9 of the robot datasets from Open X-Embodiment whose action spaces can be unified, such as Bridge V2 (Walke et al., 2023) and RT-1 (Brohan et al., 2022). We encode actions as 7-dimensional vectors, using the 6-dimensional end-effector pose and a binary gripper state as our action space. Action spaces from different robots are aligned by normalizing each component's 10 ${}^{\text{th}}$ - and 90 ${}^{\text{th}}$ -percentile values to those of the RT-1 dataset. We train with a context length of 20 frames; for longer rollouts, we condition on a sliding window of the last 20 frames.
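The percentile alignment can be sketched as a per-component linear map; a minimal NumPy version, assuming the 10th/90th percentiles of each dataset are precomputed (the function name and argument layout are illustrative):

```python
import numpy as np

def align_action_space(actions, src_p10, src_p90, rt1_p10, rt1_p90):
    """Linearly rescale each action component so that the source dataset's
    10th/90th percentiles map onto RT-1's corresponding percentiles.
    The binary gripper dimension would typically be passed through unchanged."""
    actions = np.asarray(actions, dtype=np.float64)
    scale = (rt1_p90 - rt1_p10) / (src_p90 - src_p10)
    return rt1_p10 + (actions - src_p10) * scale

# Per-component percentiles could come from, e.g.:
#   src_p10, src_p90 = np.percentile(dataset_actions, [10, 90], axis=0)
```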
| Hyperparameter | Value |
| --- | --- |
| Total parameters | 609 M |
| Image Resolution | 256 $\times$ 256 |
| DiT Patch Size | 2 |
| Input Channels | 16 |
| Hidden Size | 1024 |
| Layers | 16 |
| Attention Heads | 16 |
| MLP Ratio | 4 |
| Optimizer | AdamW (weight decay = $0.002$ , $\beta_{1}=0.9,\beta_{2}=0.99$ ) |
| Learning rate | 8e-5 |
| Batch size | 16 |
| Action dimension | 7 |
| Training hardware | 2xA100 80GB |
| Training steps | 300k |
| Diffusion noise schedule | sigmoid |
| Sampling timesteps | 10 |
| Prediction target | $v$ |
Table 2: Hyperparameters for training WorldGym's video prediction model.
Algorithm 1 WorldGym policy evaluation loop.
Input: world model $\hat{T}$ with training context length $N_{\text{train}}$ and prediction horizon $h$ , rollout length $N_{\text{rollout}}$ , policy $\pi$ with action chunk size $|\mathbf{a}_{\text{pred}}|$ , reward model $\hat{R}$ , initial observation $o_{0}$ , goal $g$
$\mathbf{o}\leftarrow[o_{0}]$
$\mathbf{a}\leftarrow[a_{\text{null}}]$
$n=0$
while $n\leq N_{\text{rollout}}$ do
$\mathbf{a}_{\text{pred}}\leftarrow\pi(o_{n},g)$
for $i=0$ to $\lceil|\mathbf{a}_{\text{pred}}|/h\rceil-1$ do
$\mathbf{a}_{\text{ctx}}\leftarrow\mathbf{a}_{-N_{\text{train}}:}$
$\mathbf{o}_{\text{ctx}}\leftarrow\mathbf{o}_{-N_{\text{train}}:}$
$\mathbf{o}_{\text{pred}}\leftarrow\hat{T}(\mathbf{o}_{\text{ctx}},\mathbf{a}_{\text{ctx}}||\mathbf{a}_{\text{pred},h\cdot i:h\cdot(i+1)})$ $\triangleright$ predict a block of $h$ frames in parallel
$\mathbf{o}\leftarrow\mathbf{o}||\mathbf{o}_{\text{pred}}$ $\triangleright$ concatenate generated block of observation frames with observation history
$\mathbf{a}\leftarrow\mathbf{a}||\mathbf{a}_{\text{pred},h\cdot i:h\cdot(i+1)}$
end for
$n\leftarrow n+|\mathbf{a}_{\text{pred}}|$
end while
$r\leftarrow\hat{R}(\mathbf{o})$
Algorithm 1 shows the detailed algorithm for performing a sliding window rollout of a policy with action chunk size $|\mathbf{a}_{\text{pred}}|$ and a world model with prediction horizon $h$ . Note that in practice we always choose $h=|\mathbf{a}_{\text{pred}}|$ at inference time.
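The loop in Algorithm 1 can be sketched in Python; `world_model`, `policy`, and `reward_model` are placeholder callables standing in for $\hat{T}$, $\pi$, and $\hat{R}$, not the released implementation:

```python
import math

def evaluate_policy(world_model, policy, reward_model, o0, goal,
                    n_rollout, n_train, h):
    """Sliding-window rollout of a policy inside a video world model.
    `world_model(obs_ctx, act_ctx, action_block)` returns the next h frames;
    `policy(obs, goal)` returns an action chunk; `reward_model(frames)`
    scores the full rollout."""
    obs, acts = [o0], [None]          # None plays the role of a_null
    n = 0
    while n <= n_rollout:
        a_pred = policy(obs[-1], goal)              # predict an action chunk
        for i in range(math.ceil(len(a_pred) / h)):
            block = a_pred[h * i:h * (i + 1)]
            obs_ctx = obs[-n_train:]                # sliding context window
            act_ctx = acts[-n_train:]
            obs.extend(world_model(obs_ctx, act_ctx, block))  # h frames at once
            acts.extend(block)
        n += len(a_pred)
    return reward_model(obs)
```

With $h=|\mathbf{a}_{\text{pred}}|$ (the setting used in practice), the inner loop runs exactly once per chunk.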
## Appendix B Details of VLM as Reward
### B.1 Prompt for VLM as Reward
Prompt for GPT-4o as reward $\hat{R}$ . Note that has_partial is True if the chosen task has a partial-credit criterion, which is the case for some tasks used in OpenVLA (Kim et al., 2024).
Here is a sequence of frames from a robot policy which has been rolled out in a video-generation-based world model. I need your help determining whether the policy is successful. How successfully does the robot complete the following task? Instruction: {instruction} {rubric.strip()} Provide brief reasoning (2-3 sentences). Then output EXACTLY one final line: Final Score: X Where X is {"one of 0, 0.5, or 1" if has_partial else "0 or 1"}. No extra numbers after that line. Note: Since this video was generated by a video prediction model (conditioned on robot actions), it may contain some artifacts due to the video model capacity.
If the task has a partial-credit criterion, the rubric is:
> 0 = Failure: little or no progress toward: "{instruction}"
> 0.5 = Partial: "{partial_desc}" achieved BUT the instruction not fully completed
> 1 = Success: Instruction fully completed (counts even if partial also true)
Otherwise, the rubric is:
> Score rubric:
> 0 = Failure: instruction "{instruction}" not completed.
> 1 = Success: instruction completed.
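The templates above can be assembled into the final prompt string as in the following sketch; the function and constant names are illustrative, not the paper's code.

```python
PARTIAL_RUBRIC = '''0 = Failure: little or no progress toward: "{instruction}"
0.5 = Partial: "{partial_desc}" achieved BUT the instruction not fully completed
1 = Success: Instruction fully completed (counts even if partial also true)'''

BINARY_RUBRIC = '''Score rubric:
0 = Failure: instruction "{instruction}" not completed.
1 = Success: instruction completed.'''

def build_reward_prompt(instruction, has_partial=False, partial_desc=""):
    """Assemble the GPT-4o reward prompt from the templates above (sketch)."""
    rubric = (PARTIAL_RUBRIC.format(instruction=instruction, partial_desc=partial_desc)
              if has_partial else BINARY_RUBRIC.format(instruction=instruction))
    scores = "one of 0, 0.5, or 1" if has_partial else "0 or 1"
    return (
        "Here is a sequence of frames from a robot policy which has been rolled out "
        "in a video-generation-based world model. I need your help determining whether "
        "the policy is successful. How successfully does the robot complete the "
        f"following task?\n\nInstruction: {instruction}\n{rubric.strip()}\n\n"
        "Provide brief reasoning (2-3 sentences). Then output EXACTLY one final line:\n"
        f"Final Score: X\nWhere X is {scores}. No extra numbers after that line."
    )
```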
### B.2 Validating VLM Success Predictions
Table 3: Performance of VLM as reward (mean and standard error across 4 runs) on videos from RT-1 (Brohan et al., 2022) using ground-truth task-success labels. GPT-4o achieves high true-positive and true-negative rates. Notably, GPT-4o as reward has a very low false-positive rate, which is especially important for not overestimating a policy's value.
| | RT-1 Success | RT-1 Fail |
| --- | --- | --- |
| VLM Success | 0.81 $\pm$ 0.14 (TP) | 0.03 $\pm$ 0.05 (FP) |
| VLM Fail | 0.19 $\pm$ 0.14 (FN) | 0.97 $\pm$ 0.05 (TN) |
To determine whether a VLM can serve as a reliable reward function, we pass rollout videos from the RT-1 dataset, along with prompts constructed from the templates above, as inputs to query GPT-4o. As ground truth, we use the task-success labels from the RT-1 data (validation split). Table 3 shows that GPT-4o achieves high true-positive and true-negative rates on real videos, indicating that it is an effective evaluator of task success. Notably, GPT-4o produces very few false positives (i.e., rollouts that failed but that the VLM judges as successes), which is highly useful in policy evaluation.
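The rates in Table 3 are normalized within each ground-truth class (TP/FN over true successes, FP/TN over true failures). A minimal sketch of that computation, with illustrative names:

```python
def confusion_rates(vlm_pred, gt):
    """Per-class rates as in Table 3 (sketch): predictions and ground-truth
    labels are 0/1; TP and FN are normalized over ground-truth successes,
    FP and TN over ground-truth failures."""
    pos = [p for p, g in zip(vlm_pred, gt) if g == 1]   # rollouts that truly succeeded
    neg = [p for p, g in zip(vlm_pred, gt) if g == 0]   # rollouts that truly failed
    tp = sum(pos) / len(pos)
    fp = sum(neg) / len(neg)
    return {"TP": tp, "FN": 1 - tp, "FP": fp, "TN": 1 - fp}
```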
## Appendix C Architecture and Training Details of Video Based Policy
Our video-based policy follows the framework of UniPi (Du et al., 2023a), combining a language-conditioned video prediction model with an inverse dynamics model.
The video prediction module shares the same architecture as our world model, but replaces the conditioning on robot actions at each timestep with language instructions. For language conditioning, we employ the pretrained and frozen UMT5-xxl encoder (Chung et al., 2023) to obtain token-level embeddings. These embeddings are aggregated via mean pooling to form a 4096-dimensional instruction representation. This representation is projected to match the model dimensionality and is used to modulate the diffusion transformer through adaptive layer normalization (adaLN-Zero). In this way, task semantics are directly integrated into the video prediction process. We train our video generation model for 180k steps on Bridge V2 (Walke et al., 2023). The visualization of the video generation policy on validation scenes can be seen in Figure 14.
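The mean-pooling and adaLN-Zero conditioning step can be sketched as follows in plain numpy. All names and shapes are illustrative assumptions; the key property of the zero-initialized modulation layer is that the conditioned branch vanishes at initialization, so the block starts as an identity map.

```python
import numpy as np

def adaln_zero_block(x, instr_tokens, W_mod, b_mod, eps=1e-6):
    """adaLN-Zero conditioning (sketch): token embeddings are mean-pooled,
    projected to per-channel shift/scale/gate, and used to modulate a
    layer-normalized branch of a transformer block."""
    cond = instr_tokens.mean(axis=0)                    # mean-pool token embeddings
    shift, scale, gate = np.split(cond @ W_mod + b_mod, 3)
    # layer norm without learned affine parameters
    xn = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)
    return x + gate * (xn * (1 + scale) + shift)        # residual + gated modulated branch
```

With `W_mod` and `b_mod` zero-initialized, `gate` is zero and the block returns `x` unchanged, which is the "Zero" in adaLN-Zero.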
The inverse dynamics model predicts the action sequence given a short video clip of 10 frames. Each frame is encoded with a ResNet-50 backbone (He et al., 2015), producing per-frame features $f_{t}$ . To capture motion, we compute both $f_{t}$ and temporal differences $\Delta f_{t}=f_{t+1}-f_{t}$ , concatenate them, and flatten across the clip. The resulting representation is passed through an MLP to predict $10\times d_{a}$ outputs, i.e., a $d_{a}$-dimensional action for each of the 10 timesteps. Input images are normalized with ImageNet statistics, and the model is trained with mean squared error on ground-truth actions. The inverse dynamics model is trained independently on the Bridge V2 dataset for 200k steps.
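The feature-difference head can be sketched as below; a single linear map stands in for the MLP, and all shapes are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def inverse_dynamics_head(feats, W, b, d_a=7):
    """Sketch of the inverse dynamics head: per-frame ResNet features f_t and
    temporal differences df_t = f_{t+1} - f_t are concatenated, flattened over
    the 10-frame clip, and mapped to 10 x d_a action outputs."""
    T, d = feats.shape                                 # T = 10 frames in the clip
    diffs = np.diff(feats, axis=0)                     # df_t, shape (T-1, d)
    diffs = np.vstack([diffs, np.zeros((1, d))])       # pad the last frame
    rep = np.concatenate([feats, diffs], axis=1).ravel()  # flatten to (T * 2d,)
    return (rep @ W + b).reshape(T, d_a)               # one action per timestep
```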
### C.1 Validation Visualization of Language Conditioned Video Generation Model
<details>
<summary>x15.png Details</summary>

### Visual Description
## Robotic Task Sequence Comparison: Ground-Truth vs. Generated
### Overview
The image is a composite figure displaying six distinct robotic manipulation tasks. Each task is presented in a separate panel, arranged in a 3x2 grid (3 rows, 2 columns). Within each panel, two horizontal rows of four sequential image frames are shown. The top row is labeled "Ground-truth" and the bottom row is labeled "Generated". This structure is designed to visually compare real-world recorded sequences (Ground-truth) with sequences produced by a generative model (Generated) for the same robotic task.
### Components/Axes
* **Layout:** A 3x2 grid of task panels.
* **Panel Structure (per task):**
* **Left-side Labels:** Vertical text labels "Ground-truth" (top row) and "Generated" (bottom row) are positioned to the left of their respective image sequences.
* **Image Sequences:** Each row contains four frames showing a temporal progression of a robotic arm performing a task. The frames are ordered left to right.
* **Content:** Each panel features a black robotic arm (likely a WidowX or similar model) operating in a simulated or controlled real-world environment with various objects on a tabletop or counter.
### Detailed Analysis
The image contains no charts, graphs, or data tables with numerical values. It is a qualitative visual comparison. Below is a breakdown of each task panel, proceeding left-to-right, top-to-bottom.
**Panel 1 (Top-Left): Pouring Liquid**
* **Task:** A robotic arm pours liquid from a clear bottle with an orange label into a clear cup.
* **Objects:** Bottle, cup, other background items (purple container, green object).
* **Sequence:** The arm moves the bottle over the cup, tilts it, and returns it upright.
* **Comparison:** The "Ground-truth" and "Generated" sequences appear visually very similar in object placement and arm motion.
**Panel 2 (Top-Right): Stacking Blocks**
* **Task:** The robotic arm stacks colored blocks (blue, yellow, red, green) into a vertical tower.
* **Objects:** Four colored blocks on a wooden surface.
* **Sequence:** The arm picks up and places blocks sequentially to build the stack.
* **Comparison:** The final stacked configuration in the last frame of both rows is identical. The intermediate positions of the arm and blocks show high correspondence.
**Panel 3 (Middle-Left): Placing Objects**
* **Task:** The arm places a small green object next to a pink object on a blue mat.
* **Objects:** Green object, pink object, blue mat, wooden surface.
* **Sequence:** The arm moves the green object from a starting position to a target location beside the pink object.
* **Comparison:** The spatial relationship between the green and pink objects in the final frame is consistent between the two sequences.
**Panel 4 (Middle-Right): Wiping a Surface**
* **Task:** The robotic arm uses a white cloth or paper towel to wipe a wooden surface.
* **Objects:** White cloth, wooden surface.
* **Sequence:** The arm moves the cloth in a back-and-forth or circular wiping motion across the surface.
* **Comparison:** The path and coverage of the cloth appear closely matched between the ground-truth and generated sequences.
**Panel 5 (Bottom-Left): Operating a Stove**
* **Task:** The arm turns the knob on a simulated stovetop burner.
* **Objects:** Stovetop with black surface and red coil burners, control knob.
* **Sequence:** The arm approaches the knob, grips it, and rotates it.
* **Comparison:** The interaction point and the resulting state of the knob (e.g., rotated position) are consistent.
**Panel 6 (Bottom-Right): Opening a Drawer**
* **Task:** The robotic arm pulls open a wooden drawer.
* **Objects:** Wooden cabinet with a drawer, small pink object on the counter.
* **Sequence:** The arm grips the drawer handle and pulls it outward.
* **Comparison:** The drawer's open position in the final frames of both sequences is visually identical.
### Key Observations
1. **High Fidelity:** The "Generated" sequences demonstrate a high degree of visual and procedural fidelity when compared to the "Ground-truth" sequences across all six diverse tasks.
2. **Task Diversity:** The tasks cover a range of fundamental robotic manipulation skills: pouring, stacking, placing, wiping, turning, and pulling.
3. **Consistent Framing:** The camera angle, lighting, and environment are consistent within each task panel between the two rows, isolating the comparison to the action sequence itself.
4. **No Obvious Artifacts:** At this resolution, the generated frames do not show significant visual artifacts, blurring, or object distortions that would distinguish them from the real frames.
### Interpretation
This image serves as a qualitative evaluation figure. Its primary purpose is to demonstrate that a generative model can produce realistic and accurate video sequences of robotic tasks that are nearly indistinguishable from real recordings.
* **What it suggests:** The model has learned the physical dynamics, object interactions, and sequential logic required for these tasks. It can generate plausible future frames given an initial state or a task description.
* **How elements relate:** The side-by-side, frame-by-frame comparison is the core analytical method. The "Ground-truth" provides the target, and the "Generated" row is the model's prediction. Their visual similarity is the key result.
* **Notable implications:** Such a capability is crucial for training robots using simulated data (sim-to-real transfer), planning future actions by "imagining" outcomes, or creating large-scale synthetic training datasets. The lack of visible discrepancies implies the model's predictions are temporally coherent and physically plausible for these specific, likely well-represented, task types. A full technical assessment would require quantitative metrics (e.g., FID, SSIM) and evaluation on more complex or novel tasks.
</details>
Figure 14: Validation Visualization of the language-conditioned video generation model on Bridge-V2. At inference, the model takes a UMT5-xxl instruction encoding and an initial frame, then predicts the next nine frames to complete the task.
## Appendix D Architecture and Training Details of Diffusion Policy
We followed the recipe of DexVLA (Wen et al., 2025) for training the diffusion policy. We load the Qwen2-VL-2B (Wang et al., 2024) backbone and the pre-trained control head, and perform an adaptation on Bridge V2 using LoRA (Hu et al., 2022), which inserts low-rank adapter matrices inside the backbone's attention and feed-forward blocks. These matrices, along with the policy head, are the only trainable modules in the adaptation stage. This preserves the backbone's general vision-language competence and makes adaptation compute- and memory-efficient. We fine-tune the model with adapters on Bridge V2 for 60k steps. During training, we rescale the actions to $(-1,1)$ to match the diffusion target range.
DexVLA's policy head is trained as a denoising diffusion model with the standard $\epsilon$ -prediction DDPM objective: at each update, Gaussian noise is added to ground-truth action sequences at a randomly sampled diffusion step, and the network is trained to predict that noise with an MSE loss. At inference, actions are generated with DDIM in a small number of steps, progressively denoising from a Gaussian initialization to a trajectory. We use AdamW as the optimizer with standard decoupled weight decay, applying decay to linear/attention weights but excluding biases and LayerNorm parameters.
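The action rescaling and the $\epsilon$-prediction objective can be sketched as follows; `eps_model` stands in for DexVLA's policy head, and all names here are illustrative assumptions.

```python
import numpy as np

def rescale_actions(a, lo, hi):
    """Rescale raw actions to (-1, 1) to match the diffusion target range."""
    return 2.0 * (a - lo) / (hi - lo) - 1.0

def ddpm_eps_loss(actions, eps_model, alphas_cumprod, rng):
    """Standard epsilon-prediction DDPM objective (sketch): noise ground-truth
    action sequences at a random diffusion step and regress the noise with MSE."""
    t = int(rng.integers(len(alphas_cumprod)))       # random diffusion step
    eps = rng.standard_normal(actions.shape)         # Gaussian noise target
    a_bar = alphas_cumprod[t]
    noisy = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps_model(noisy, t) - eps) ** 2) # MSE on the noise
```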
## Appendix E Additional Experimental Results
### E.1 Additional Results on Real-Robot Videos
<details>
<summary>x16.png Details</summary>

### Visual Description
## [Comparative Image Grid]: Ground-Truth vs. Generated Robotic Manipulation Sequences
### Overview
The image is a composite grid containing six distinct panels arranged in a 3x2 layout. Each panel presents a side-by-side comparison of two image sequences: a "Ground-truth" sequence (top row) and a "Generated" sequence (bottom row). Each sequence consists of four sequential frames depicting a robotic arm performing a specific manipulation task in a simulated or controlled environment. The primary purpose is to visually compare the fidelity of generated video sequences against real or reference (ground-truth) sequences for various robotic tasks.
### Components/Axes
* **Structure:** A 3-row by 2-column grid of panels.
* **Panel Labeling:** Each panel is divided into two horizontal rows.
* **Top Row Label:** "Ground-truth" (text rotated 90 degrees counter-clockwise, positioned vertically along the left edge of the row).
* **Bottom Row Label:** "Generated" (text rotated 90 degrees counter-clockwise, positioned vertically along the left edge of the row).
* **Content per Row:** Each row within a panel contains four sequential image frames, showing a progression of a robotic task.
* **Language:** All labels are in English.
### Detailed Analysis
The image contains no charts, graphs, or data tables. It is a visual comparison grid. The analysis below isolates each panel and describes the visible components and actions.
**Panel 1 (Top-Left):**
* **Task:** A robotic arm interacts with a light-colored rectangular box and a smaller blue object on a wooden surface.
* **Ground-truth Sequence:** Shows the arm approaching, contacting, and slightly displacing the blue object next to the box. Lighting and shadows are consistent.
* **Generated Sequence:** Appears to closely replicate the scene geometry, object colors (box: beige, object: blue), and the arm's motion trajectory. The visual fidelity is high, with minor potential differences in shadow softness.
**Panel 2 (Top-Right):**
* **Task:** A robotic arm manipulates a crumpled green object (possibly a cloth or bag) on a dark tabletop.
* **Ground-truth Sequence:** The arm's gripper pinches and lifts the green object.
* **Generated Sequence:** Replicates the complex deformation of the green object and the gripper's interaction. The texture and color of the object and the arm's metallic finish are well-matched.
**Panel 3 (Middle-Left):**
* **Task:** A robotic arm performs a task involving a white cylindrical container and a black object on a white surface with a grid pattern.
* **Ground-truth Sequence:** The arm moves the black object towards or into the white container.
* **Generated Sequence:** Accurately reproduces the high-contrast scene (black object, white container/surface) and the precise motion of the arm.
**Panel 4 (Middle-Right):**
* **Task:** A robotic arm moves a pink spherical object on a light wooden table. A blue partition is visible in the background.
* **Ground-truth Sequence:** The arm pushes or rolls the pink sphere across the table.
* **Generated Sequence:** Successfully generates the distinct pink color of the sphere, the wood grain texture, and the rolling motion. The background blue partition is also present.
**Panel 5 (Bottom-Left):**
* **Task:** A robotic arm interacts with objects near a blue bin. A red sphere and a yellow cube are on the floor.
* **Ground-truth Sequence:** The arm appears to be positioning itself relative to the blue bin and the objects.
* **Generated Sequence:** Correctly renders the bright blue bin, red sphere, and yellow cube, along with the arm's configuration. The spatial relationships between objects are maintained.
**Panel 6 (Bottom-Right):**
* **Task:** A close-up view focusing on a robotic gripper with red markers at its tips, interacting with a metallic, grooved surface (possibly a heatsink or part of a machine).
* **Ground-truth Sequence:** Shows the gripper approaching and making contact with the grooved surface.
* **Generated Sequence:** Replicates the detailed geometry of the gripper (including the red tips) and the fine, repetitive pattern of the metallic grooves. The lighting and reflections on the metal are consistent.
### Key Observations
1. **High Visual Fidelity:** In all six panels, the "Generated" sequences demonstrate a strong visual resemblance to their corresponding "Ground-truth" sequences in terms of scene layout, object identity, color, texture, and lighting.
2. **Motion Consistency:** The generated sequences appear to capture the correct temporal progression and motion dynamics of the robotic arms and manipulated objects.
3. **Task Diversity:** The grid showcases a variety of robotic manipulation tasks involving different objects (rigid, deformable), surfaces, and camera perspectives (wide shots and close-ups).
4. **Labeling Consistency:** The "Ground-truth" and "Generated" labels are applied uniformly across all panels, establishing a clear comparative framework.
### Interpretation
This image serves as a qualitative evaluation figure. The core message is that the generative model being demonstrated can produce highly realistic and physically plausible video sequences of robotic manipulation tasks that are visually indistinguishable (or very close) from real-world recordings ("Ground-truth").
The absence of quantitative metrics implies the argument is being made through visual evidence, trusting the viewer to perceive the similarity. The selection of diverse tasks (handling rigid boxes, deformable cloths, spheres, close-up interactions) is intended to demonstrate the model's generalization capability across different manipulation scenarios. The high fidelity suggests potential applications in generating training data for robots, creating realistic simulations, or validating planning algorithms.
</details>
Figure 15: Additional Qualitative Evaluation of simulating actions from different robots. Conditioned on the same actions that produced each original video in the real world, the world model generally generates videos that look very similar to the original.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Diagram: Robotic Arm Task Sequences
### Overview
The image is a composite diagram consisting of six distinct panels arranged in a 3x2 grid. Each panel contains a sequence of four sequential frames illustrating a robotic arm performing a specific manipulation task in a simulated kitchen or workshop environment. The tasks involve interacting with various objects using directional commands (Right, Left, Forward, Backward) and action commands (Open, Close). The purpose appears to be demonstrating or training robotic control for everyday object manipulation.
### Components/Axes
* **Structure:** Six rectangular panels, each with a white background and a thin black border.
* **Content per Panel:** Four sequential image frames arranged horizontally.
* **Labels:** Each frame has a white rectangular label with black text positioned at the bottom-left corner of the frame. The text indicates the command or action being performed in that frame.
* **Objects:** The scenes include a variety of objects: a kitchen sink, a cutting board, vegetables (carrot, cucumber, bell pepper), a purple block, a silver sphere, a red cylinder, a drawer, and various small containers and tools.
* **Agent:** A black robotic arm with a gripper is the active agent in all sequences.
### Detailed Analysis
**Panel 1 (Top-Left): Sink Interaction**
* **Sequence:** The robotic arm is positioned over a white kitchen sink containing a blue plate, a green sponge, and other small items.
* **Frame 1:** Label: **"Right"**. The arm is positioned towards the left side of the sink.
* **Frame 2:** Label: **"Right"**. The arm has moved slightly to the right.
* **Frame 3:** Label: **"Left"**. The arm has moved back towards the left.
* **Frame 4:** No label visible. The arm is in a similar position to Frame 3.
* **Trend:** The sequence demonstrates lateral (side-to-side) movement control over the sink area.
**Panel 2 (Top-Right): Cutting Board Interaction**
* **Sequence:** The arm is over a wooden cutting board with a green cucumber, an orange carrot, and a yellow bell pepper.
* **Frame 1:** Label: **"Right"**. The arm is positioned over the left side of the board.
* **Frame 2:** Label: **"Right"**. The arm has moved to the right, now over the center of the board.
* **Frame 3:** Label: **"Left"**. The arm has moved back to the left.
* **Frame 4:** No label visible. The arm is positioned over the left side.
* **Trend:** Similar lateral movement demonstration, but in a different workspace context.
**Panel 3 (Middle-Left): Object Relocation (Purple Block)**
* **Sequence:** The arm interacts with a purple rectangular block on a wooden surface. A red sphere is also present.
* **Frame 1:** Label: **"Backward"**. The arm is positioned behind (further from the camera) the purple block.
* **Frame 2:** Label: **"Backward"**. The arm has moved further backward.
* **Frame 3:** Label: **"Forward"**. The arm has moved forward, now positioned over the purple block.
* **Frame 4:** No label visible. The arm appears to be grasping or has just released the purple block.
* **Trend:** Demonstrates depth-axis (forward/backward) movement for approaching and interacting with an object.
**Panel 4 (Middle-Right): Object Relocation (Silver Sphere)**
* **Sequence:** The arm interacts with a silver metallic sphere on a wooden surface. A purple block is also present.
* **Frame 1:** Label: **"Backward"**. The arm is positioned behind the silver sphere.
* **Frame 2:** Label: **"Backward"**. The arm has moved further backward.
* **Frame 3:** Label: **"Forward"**. The arm has moved forward, now positioned over the silver sphere.
* **Frame 4:** No label visible. The arm is positioned over the sphere.
* **Trend:** Another demonstration of depth-axis control with a different object (sphere vs. block).
**Panel 5 (Bottom-Left): Drawer Interaction (Red Cylinder)**
* **Sequence:** The arm interacts with a red cylinder placed near a drawer handle on a wooden countertop.
* **Frame 1:** Label: **"Close"**. The drawer appears closed. The arm is near the red cylinder.
* **Frame 2:** Label: **"Open"**. The drawer is now partially open. The arm is positioned over the red cylinder.
* **Frame 3:** Label: **"Close"**. The drawer is closed again. The arm is near the cylinder.
* **Frame 4:** No label visible. The scene is similar to Frame 3.
* **Trend:** Demonstrates a discrete action ("Open"/"Close") applied to an articulated object (the drawer), with the red cylinder as a potential manipulation target or obstacle.
**Panel 6 (Bottom-Right): Drawer Interaction (Empty)**
* **Sequence:** The arm interacts with the same drawer, but the red cylinder is absent.
* **Frame 1:** Label: **"Close"**. The drawer is closed.
* **Frame 2:** Label: **"Open"**. The drawer is open.
* **Frame 3:** Label: **"Close"**. The drawer is closed.
* **Frame 4:** No label visible. The drawer is closed.
* **Trend:** Isolates the drawer "Open"/"Close" action without the secondary object present.
### Key Observations
1. **Consistent Structure:** All six panels follow an identical format: four frames with command labels for the first three frames.
2. **Command Types:** Two distinct command vocabularies are used: **Directional** (Right, Left, Forward, Backward) for continuous movement, and **Action** (Open, Close) for discrete state changes.
3. **Environmental Variety:** The tasks are set in slightly different but related environments (sink area, cutting board, open counter, drawer), suggesting training for generalization across contexts.
4. **Object Variety:** The arm interacts with objects of different shapes (block, sphere, cylinder, vegetables), materials, and functions.
5. **Label Placement:** All labels are consistently placed in the bottom-left corner of their respective frames, ensuring clear association.
### Interpretation
This diagram is a technical visualization of **robotic manipulation primitives**. It breaks down complex tasks (like "prepare food" or "tidy a counter") into fundamental, teachable sub-skills:
* **Spatial Navigation:** The "Right/Left" and "Forward/Backward" sequences train the robot's ability to position its end-effector precisely in 3D space relative to objects and work surfaces.
* **Object Interaction:** The "Open/Close" sequences train the robot to apply forces to change the state of articulated objects (like drawers), a crucial skill for environmental manipulation.
* **Context Generalization:** By showing similar actions in slightly different settings (e.g., lateral movement over a sink vs. a cutting board), the diagram implies a focus on developing robust policies that work across varied but semantically similar environments.
The absence of labels on the fourth frame in each sequence likely indicates the **result state**: the moment after the commanded action is completed.
</details>
Figure 16: Additional End-Effector Control Sweep on Bridge. We simulate different gripper controls along different action dimensions corresponding to left-right, forward-backward, and gripper open-close. The world model generally generates videos that follow the actions.
### E.2 Additional Results on Google Robot
To assess the generalizability of WorldGym, we performed rollouts with different policies on the Google Robot. For this analysis, we chose a subset of tasks from the RT-1 dataset (Brohan et al., 2022), assigning a partial score of 0.5 to a rollout if the robot attempted to reach the target location. OpenVLA again outperformed Octo and RT-1-X (see Table 4), though in this environment they trail only narrowly. The comparatively strong performance of RT-1-X may stem from a higher proportion of Google Robot trajectories (relative to WidowX) in its pretraining mix.
Table 4: Policy rollouts on Google Robot (RT-1 subset). OpenVLA outperforms RT-1-X and Octo, but by a smaller margin than on the Bridge dataset.
| Task | # Trials | RT-1-X # Successes | Octo # Successes | OpenVLA # Successes |
| --- | --- | --- | --- | --- |
| Close Bottom Drawer | 10 | 9 | 8.5 | 6 |
| Open Left Fridge Door | 10 | 4.5 | 3.5 | 4 |
| Pick Blue Chip Bag | 10 | 5 | 5.5 | 9 |
| Place Redbull Can Upright | 10 | 1.5 | 3 | 3.5 |
### E.3 Detailed Results on the OpenVLA Bridge evaluation tasks
Table 5: Detailed Bridge Evaluation Results comparing RT-1-X (O'Neill et al., 2023), Octo (Octo Model Team et al., 2024), and OpenVLA (Kim et al.) on the Bridge evaluation suite of tasks from Kim et al. Real-world task success rates are taken directly from Kim et al.; WorldGym success rates come from rolling out policies within our world model.
| Task | # Trials | RT-1-X Real-world | RT-1-X WorldGym | Octo Real-world | Octo WorldGym | OpenVLA Real-world | OpenVLA WorldGym |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Put Eggplant into Pot (Easy Version) | 10 | 1 | 1 | 5 | 1 | 10 | 7 |
| Put Eggplant into Pot | 10 | 0 | 0 | 1 | 2 | 10 | 6 |
| Put Cup from Counter into Sink | 10 | 1 | 3 | 1 | 3 | 7 | 9 |
| Put Eggplant into Pot (w/ Clutter) | 10 | 1 | 0.5 | 3.5 | 3.5 | 7.5 | 8 |
| Put Yellow Corn on Pink Plate | 10 | 1 | 3 | 4 | 6 | 9 | 9.5 |
| Lift Eggplant | 10 | 3 | 2 | 0.5 | 1.5 | 7.5 | 7.5 |
| Put Carrot on Plate (w/ Height Change) | 10 | 2 | 0.5 | 1 | 3 | 4.5 | 6 |
| Put Carrot on Plate | 10 | 1 | 0 | 0 | 1 | 8 | 4 |
| Flip Pot Upright | 10 | 2 | 3 | 6 | 1 | 8 | 5 |
| Lift AAA Battery | 10 | 0 | 1 | 0 | 0 | 7 | 4 |
| Move Skull into Drying Rack | 10 | 1 | 2 | 0 | 3 | 5 | 5 |
| Lift White Tape | 10 | 3 | 1 | 0 | 1 | 1 | 6 |
| Take Purple Grapes out of Pot | 10 | 6 | 5 | 0 | 2 | 4 | 4 |
| Stack Blue Cup on Pink Cup | 10 | 0.5 | 0 | 0 | 0 | 4.5 | 6 |
| Put {Eggplant, Red Bottle} into Pot | 10 | 2.5 | 0.5 | 4 | 5 | 7.5 | 9 |
| Lift {Cheese, Red Chili Pepper} | 10 | 1.5 | 2.5 | 2.5 | 2.5 | 10 | 10 |
| Put {Blue Cup, Pink Cup} on Plate | 10 | 5 | 1.5 | 5.5 | 5 | 9.5 | 8.5 |
| Mean Success Rate | | 18.5 $\pm$ 4.0% | 15.5 $\pm$ 3.4% | 20.0 $\pm$ 5.3% | 23.8 $\pm$ 4.3% | 70.6 $\pm$ 6.1% | 67.4 $\pm$ 4.9% |
We report the mean success rate across tasks with standard error (SE) computed as
$$
\text{SE}=\frac{\mathrm{sd}(r_{1},\dots,r_{T})}{\sqrt{T}},
$$
where $r_{i}$ is the per-task success rate and $T$ is the number of tasks.
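A minimal sketch of this computation; the use of the sample standard deviation (`ddof=1`) is an assumption the text leaves unspecified.

```python
import numpy as np

def mean_and_se(task_rates):
    """Mean success rate across tasks with standard error,
    SE = sd(r_1, ..., r_T) / sqrt(T), as in Appendix E.3."""
    r = np.asarray(task_rates, dtype=float)
    return r.mean(), r.std(ddof=1) / np.sqrt(len(r))
```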
### E.4 Detailed Results on OOD Image Evaluation Tasks
Table 6: Detailed Bridge OOD Image task results. OpenVLA appears to be more robust across the different OOD settings of object generalization, distractions and classification.
| Category | Task | # Trials | RT-1-X # Successes | Octo # Successes | OpenVLA # Successes |
| --- | --- | --- | --- | --- | --- |
| Object Generalization | Pick up Orange (Carrot closer to Gripper) | 10 | 1 | 1 | 4 |
| Object Generalization | Pick up Orange (Orange closer to Gripper) | 10 | 3 | 4 | 9 |
| Object Generalization | Pick up Orange (Replace Carrot with Radish) | 10 | 1 | 4 | 10 |
| Distractor Robustness | Pick up Carrot (With Computer on side) | 10 | 6 | 7 | 9 |
| Distractor Robustness | Pick up Carrot (Computer closer to gripper) | 10 | 3 | 1 | 8 |
| Classification | Pick {Red, Blue} | 20 | 8 | 10 | 20 |
| Classification | Pick {Circle, Square} | 20 | 8 | 10 | 12 |
| Classification | Pick {Taylor Swift, Snoop Dogg} | 20 | 7 | 10 | 11 |
Table 7: Policy rollout performance in the presence of unrelated distractions. OpenVLA is more robust to distractions than RT-1-X and Octo; however, all policies suffer a significant performance drop in the presence of distractors.
| Task | # Trials | RT-1-X # Successes | Octo # Successes | OpenVLA # Successes |
| --- | --- | --- | --- | --- |
| Put Eggplant into Pot (Easy Version) | 10 | 1 | 1 | 3 |
| Put Eggplant into Pot | 10 | 0 | 0 | 6 |
| Put Cup from Counter into Sink | 10 | 0 | 1 | 8 |
| Put Eggplant into Pot (w/ Clutter) | 10 | 0 | 0 | 4 |
| Put Yellow Corn on Pink Plate | 10 | 0 | 0 | 2 |
| Lift Eggplant | 10 | 0 | 1 | 7 |
| Put Carrot on Plate (w/ Height Change) | 10 | 0 | 0 | 2 |
| Put Carrot on Plate | 10 | 0 | 2 | 4 |
| Flip Pot Upright | 10 | 0 | 0 | 0 |
| Lift AAA Battery | 10 | 1 | 0 | 1 |
| Move Skull into Drying Rack | 10 | 3 | 0 | 5 |
| Lift White Tape | 10 | 0 | 0 | 4 |
| Take Purple Grapes out of Pot | 10 | 7 | 0 | 3 |
| Stack Blue Cup on Pink Cup | 10 | 0 | 0 | 4 |
| Put {Eggplant, Red Bottle} into Pot | 10 | 0 | 0 | 5 |
| Lift {Cheese, Red Chili Pepper} | 10 | 1 | 2 | 6 |
| Put {Blue Cup, Pink Cup} on Plate | 10 | 0 | 0 | 3 |
| Mean Success Rate | | 7.6 $\pm$ 4.3% | 4.1 $\pm$ 1.7% | 39.4 $\pm$ 5.1% |
## Appendix F Ablation Studies
### F.1 Dataset Size Analysis
Table 8: Dataset ablation. A larger training dataset improves all three metrics comparing generated videos with ground-truth validation videos ($\downarrow$: lower is better; $\uparrow$: higher is better).
| | Subset (Bridge V1) | Full (Bridge V2) |
| --- | --- | --- |
| MSE $\downarrow$ | 0.015 | 0.010 |
| LPIPS $\downarrow$ | 0.131 | 0.073 |
| SSIM $\uparrow$ | 0.735 | 0.827 |
We measure MSE, LPIPS, and SSIM on generated videos from a model that is trained on less video data (Bridge V1 (Ebert et al., 2021)) and compare with a model that is trained on more data (Bridge V2 (Walke et al., 2023)). Table 8 shows that the model trained on more data leads to improvements in all three metrics.
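Of the three metrics, per-pixel MSE is simple enough to sketch directly; LPIPS and SSIM require dedicated implementations (e.g., the `lpips` and `scikit-image` packages) and are omitted here. The normalization assumption is ours.

```python
import numpy as np

def video_mse(generated, ground_truth):
    """Per-pixel mean squared error between a generated and a ground-truth
    video, with frames assumed normalized to [0, 1] (sketch)."""
    g = np.asarray(generated, dtype=float)
    t = np.asarray(ground_truth, dtype=float)
    return float(np.mean((g - t) ** 2))
```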
### F.2 Parallelism Efficiency Analysis
Table 9: Parallelism efficiency comparison. Inference time for generating 40-frame video rollouts on an A100 GPU with different horizon lengths, demonstrating the efficiency gains from parallel frame denoising.
| Prediction Horizon | Time (s) |
| --- | --- |
| $h=1$ | 93 |
| $h=4$ | 33 |
Increasing the horizon from 1 to 4 frames yields a 2.8Ă speedup. This is particularly useful for evaluating robot policies with differing action chunk sizes: OpenVLA (Kim et al.) predicts a single action at a time, while Octo (Octo Model Team et al., 2024) predicts chunks of 4 actions. Using the same world model checkpoint, we can improve rollout efficiency by matching the prediction horizon to the policy's chunk size at inference time.
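The source of the speedup is simple arithmetic on sequential model passes; the helper name below is illustrative.

```python
import math

def num_denoising_passes(n_frames, horizon):
    """Number of sequential world-model prediction passes needed to roll out
    n_frames when `horizon` frames are denoised in parallel per pass."""
    return math.ceil(n_frames / horizon)
```

For the 40-frame rollouts in Table 9, $h=1$ needs 40 passes while $h=4$ needs 10; the measured wall-clock speedup (93 s to 33 s, about 2.8x) is consistent with this but below the ideal 4x, presumably due to fixed per-pass overheads.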