# Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report - Team Revontuli
## Abstract
World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
## 1 Introduction
World models [11] equip agents (e.g. humanoid robots) with internal simulators of their environments. By “imagining” the consequences of their actions, agents can plan, anticipate outcomes, and improve decision-making without direct real-world interaction.
A central challenge in world modelling is the design of architectures that are both sufficiently expressive and computationally tractable. Early approaches have largely relied on recurrent networks [13, 14, 15] or multilayer perceptrons [16, 17, 34, 7]. More recently, advances in generative modelling have driven a new wave of architectural choices. A prominent line of work leverages autoregressive transformers over discrete latent spaces [6, 33, 41, 26, 3, 10], while others explore diffusion- and flow-based approaches [1, 8]. At scale, these methods underpin powerful foundation models [21, 39, 36, 28, 12, 23] capable of producing realistic and accurate video predictions.
<details>
<summary>figs/image_1.png Details</summary>

### Visual Description
## Diagram: Robotic Manipulation Pipeline with Tokenized State Representation
### Overview
The image is a technical diagram illustrating a three-stage pipeline for processing visual data from a robotic manipulation task. It shows the transformation of raw camera images into abstract tokenized representations, followed by a generation step and a final evaluation comparison. The diagram is divided into three vertical panels separated by dashed lines, each representing a distinct phase of the process.
### Components/Flow
The diagram is organized into three main vertical sections, each containing two horizontal layers:
1. **Top Layer (Visual Data):** Shows first-person perspective images from a robot's camera.
2. **Bottom Layer (Abstract Representation):** Shows 3D grid structures (cubes) representing tokenized data.
**Flow Direction:** The process flows from left to right, indicated by large, light-blue arrows connecting the stages. A secondary, double-headed arrow appears in the final stage.
**Text Labels (Transcribed):**
* **"Tokenizer"**: Located in the top-left section, with a dark teal arrow pointing down from the first image panel to the first cube grid.
* **"Context + States"**: Located at the bottom-left, below the first cube grid. The word "States" is highlighted in red.
* **"Generation"**: Located at the bottom-center, below the second cube grid.
* **"Evaluation"**: Located at the bottom-right, below the third cube grid.
### Detailed Analysis
**Panel 1: Context + States (Left Section)**
* **Top (Visual Data):** A sequence of images (approximately 5-6 frames) is shown, with the foremost frame highlighted by a dark teal border. The image depicts a robot's white and black arm reaching towards a desk. On the desk is a grey tray containing various small objects (tools, parts). In the background, there is a shelving unit with drawers.
* **Bottom (Abstract Representation):** A 3D grid of cubes (approximately 4x4x4) is shown in a dark teal wireframe. A large, dark teal arrow labeled "Tokenizer" points from the highlighted image above to this grid, indicating the conversion of visual data into a tokenized format.
* **Flow:** A large, light-blue arrow points from this panel to the next.
**Panel 2: Generation (Center Section)**
* **Top (Visual Data):** A similar sequence of images is shown, now with the foremost frame highlighted by a light green border. The robot arm's position has changed slightly, suggesting progression in the task.
* **Bottom (Abstract Representation):** A 3D grid of cubes, identical in structure to the first, is shown in a light green wireframe. This represents the generated state or prediction based on the tokenized context.
* **Flow:** A large, light-blue arrow points from this panel to the final panel.
**Panel 3: Evaluation (Right Section)**
* **Top (Visual Data):** Two images are shown side-by-side. The left image has a light green border (matching the "Generation" panel), and the right image has a dark grey border. A large, double-headed, light-blue arrow connects them, indicating a comparison or evaluation step between the generated state and a target or ground truth state.
* **Bottom (Abstract Representation):** Two 3D cube grids are shown side-by-side. The left grid is in light green wireframe (matching the "Generation" output), and the right grid is in dark grey wireframe. The same double-headed arrow connects them, mirroring the comparison shown in the visual data layer above.
### Key Observations
1. **Color-Coded Stages:** The diagram uses a consistent color scheme to link stages: dark teal for the initial "Context + States," light green for "Generation," and dark grey for the evaluation target.
2. **Parallel Representation:** The top (visual) and bottom (abstract) layers in each panel are presented in parallel, emphasizing that the tokenized grid is a direct representation of the visual scene.
3. **Evaluation is a Comparison:** The final stage is explicitly a comparison between two entities: the generated output (green) and a reference or ground truth (grey), applied at both the visual and token levels.
4. **Spatial Grounding:** The legend/color mapping is clear and consistent across the diagram. The green border and green cubes are co-located in the "Generation" panel. The grey border and grey cubes are co-located in the right side of the "Evaluation" panel.
### Interpretation
This diagram illustrates a core framework for a **vision-based robotic learning or planning system**. The process can be interpreted as follows:
1. **Tokenization as State Encoding:** The system first converts high-dimensional visual input (pixels from the robot's camera) into a lower-dimensional, structured "tokenized" state (the cube grid). This is a common technique in machine learning to make complex data manageable for algorithms. The label "Context + States" suggests this token grid encapsulates both the environmental context and the current state of the robot and objects.
2. **Generative Prediction:** The "Generation" phase likely involves a model (e.g., a world model or policy network) that takes the tokenized context as input and predicts the next state or a sequence of actions. The output is another tokenized representation (the green grid).
3. **Evaluation via Comparison:** The final "Evaluation" phase compares the generated/predicted state (green) against a target state (grey). This target could be a desired goal state, a ground-truth next frame from training data, or the output of a different model. The double-headed arrow signifies a loss function or similarity metric being computed between these two representations. This comparison is crucial for training the generative model (via backpropagation) or for assessing the performance of a robotic plan.
**Underlying Concept:** The diagram emphasizes a **learned, abstract state-space approach** to robotics. Instead of operating directly on pixels, the system learns a compressed representation (tokens) and performs prediction and evaluation within this latent space, which can be more efficient and robust. The parallel structure between visual and token layers argues for the fidelity and utility of the learned representation.
</details>
Figure 1: Overview of the 1X World Model Challenges. Left depicts the context (inputs), middle the model generations, and right the evaluations. Sampling challenge (top): The model observes 17 past frames along with past and future robot states, then generates future frames in pixel space. Performance is measured by PSNR between the predicted and ground-truth 77th frame. Compression challenge (bottom): The Cosmos $8\times 8\times 8$ tokeniser encodes the history of 17 RGB frames into three latent token grids of shape $3\times 32\times 32$ . Models must predict the next three latent token grids corresponding to the next 17 frames. Evaluation is based on Top-500 cross-entropy between predicted and ground-truth tokens.
Table 1: Performance on Public 1X World Model Leaderboard
| Challenge | Team | Test PSNR $[\uparrow]$ | Val PSNR $[\uparrow]$ | Test Top-500 CE $[\downarrow]$ | Val Top-500 CE $[\downarrow]$ | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| Sampling | Revontuli | 23.00 | 25.53 | – | – | 1st |
| Sampling | Duke | 21.56 | 25.30 | – | – | 2nd |
| Sampling | Michael | 18.51 | – | – | – | 3rd |
| Compression | Revontuli | – | – | 6.64 | 4.92 | 1st |
| Compression | Duke | – | – | 7.50 | 5.60 | 2nd |
| Compression | a27sridh | – | – | 7.99 | – | 3rd |
The 1X World Model Challenge evaluates predictive performance on two tracks: Sampling and Compression. Fig. 1 outlines the tasks, and Tab. 1 reports our results. These challenges capture core problems when using world models in robotics. Our methods show strong performance that we hope will shape future efforts.
## 2 Sampling Challenge
#### Problem Statement
In the sampling task, the model must predict the $512\times 512$ frame observed by the robot $2$ s into the future. Conditioning is provided by the first 17 frames $\mathbf{x}_{0:16}$ and the complete sequence of robot states $\mathbf{s}_{0:76}\in\mathbb{R}^{77\times 25}$ . Performance is evaluated using PSNR between the predicted and ground-truth last frames.
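For concreteness, the evaluation metric can be computed as in the following minimal sketch. We assume 8-bit pixel values with peak 255; the challenge's exact evaluation script may normalise differently.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two images of equal shape."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a prediction off by 1 intensity level everywhere gives MSE = 1,
# so PSNR = 20 * log10(255) ≈ 48.13 dB.
target = np.zeros((512, 512, 3), dtype=np.uint8)
pred = np.ones((512, 512, 3), dtype=np.uint8)
print(round(psnr(pred, target), 2))  # 48.13
```

Because PSNR is a log-scaled mean squared error, a single badly wrong region can dominate the score, which motivates the blurring and ensembling strategies discussed in Sec. 2.3.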
#### Data Pre-processing
We downsample the original 77-frame clips by a factor of four, yielding 21-frame clips. This gives five conditioning frames, $(\mathbf{x}_{0},\mathbf{x}_{4},\dots,\mathbf{x}_{16})$, while the remaining 16 frames serve as prediction targets. Wan2.2-VAE encodes the first frame into its own latent and applies temporal compression of 4 to the remaining frames, producing a latent sequence of length $1+(L-1)/4$ for a clip of length $L=21$.
### 2.1 Model
#### Base Model
For our solution, we adapt Wan 2.2 TI2V-5B [36], a flow-matching generative video model with a 30-layer DiT backbone [30]. The base model is designed for text-image-to-video (TI2V) generation, but we modify the architecture to condition predictions on videos and robot states. The model operates on latent video representations from Wan2.2-VAE, which compresses clips to size $(1+(L-1)/4)\times 16\times 16$.
#### Video-State Conditioning
To incorporate video conditioning, we modified the masking of the input latents. In a standard image-to-video model, the first latent in the time dimension is masked, treating the input image as fixed during generation and thereby establishing a conditional mapping. We extend this idea by fixing multiple frames during generation, effectively transforming the model from image-to-video into video-to-video. The original Wan 2.2 also conditions on textual prompts to generate videos. Since our dataset does not include textual descriptions, we use empty strings as text prompts while retaining the original cross-attention layers, enabling future work to leverage text conditioning.
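A hedged sketch of this mask extension (1 marks latent frames held fixed during generation, 0 marks frames the model must denoise; shapes and names are illustrative, not Wan 2.2's actual tensors):

```python
import numpy as np

def conditioning_mask(latent_len: int, num_cond: int) -> np.ndarray:
    """Binary mask over the latent time axis: fixed context vs. generated frames."""
    mask = np.zeros(latent_len, dtype=np.float32)
    mask[:num_cond] = 1.0
    return mask

# Image-to-video: only the first latent frame is fixed.
print(conditioning_mask(6, 1))   # [1. 0. 0. 0. 0. 0.]
# Video-to-video: the leading latents covering all context frames are fixed.
print(conditioning_mask(6, 2))   # [1. 1. 0. 0. 0. 0.]
```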
As shown in Fig. 2, we incorporate state conditioning into the model’s predictions using adaLN-Zero [30] within Wan’s DiT blocks. We first downsample the states to match those of the downsampled video. The continuous angle and velocity states are augmented with sinusoidal features, and all states are projected through an MLP to a hidden dimension of $r_{\text{dim}}=256$ .
Then, we compress the projected features along the temporal dimension with a 2-layer 1D convolutional network to match the compression that Wan-VAE applies to the video frames, mapping the state features to shape $(1+(L-1)/4)\times r_{\text{dim}}$. Finally, we feed the compressed features into an MLP to obtain the modulation used by the adaLN-Zero layers. The resulting robot modulation is added to the modulation of the flow-matching timestep. The two act on the latent differently: the timestep embedding is shared across the whole latent, whereas each state feature modulates only the latent slice associated with its corresponding frames.
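The shapes along this state-conditioning path can be traced with a small numpy sketch. This is an illustration only: the hidden sizes follow the text, but the random weights, the ReLU MLP, and the mean-pooling stand-in for the learned strided convolution are our simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
T, S_DIM, R_DIM = 21, 25, 256        # downsampled state steps, state dim, hidden dim

def sinusoidal(x: np.ndarray, n_freq: int = 4) -> np.ndarray:
    """Augment continuous angle/velocity states with sin/cos features."""
    freqs = 2.0 ** np.arange(n_freq)
    ang = x[..., None] * freqs                     # (T, S_DIM, n_freq)
    feat = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feat.reshape(x.shape[0], -1)            # (T, S_DIM * 2 * n_freq)

states = rng.normal(size=(T, S_DIM))
feat = sinusoidal(states)                          # (21, 200)
W1 = rng.normal(size=(feat.shape[1], R_DIM)) * 0.02
h = np.maximum(feat @ W1, 0.0)                     # MLP projection -> (21, 256)

def strided_pool(h: np.ndarray, stride: int = 4) -> np.ndarray:
    """Temporal compression matching Wan-VAE: keep step 0, pool the rest 4x."""
    first, rest = h[:1], h[1:]
    pooled = rest.reshape(-1, stride, h.shape[1]).mean(axis=1)
    return np.concatenate([first, pooled], axis=0)

mod = strided_pool(h)                              # one modulation slice per latent
print(mod.shape)                                   # (6, 256)
```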
<details>
<summary>x1.png Details</summary>

### Visual Description
## Neural Network Architecture Diagram: States-conditioned Wan 2.2 DiT Block
### Overview
The image is a technical block diagram illustrating the architecture of a single "States-conditioned Wan 2.2 DiT Block," which is repeated 30 times (`x30`). This block processes a `Latent` input, conditioned on a `Time Embedding` (derived from a `Flow-Matching Timestep t`) and a `States Embedding` (derived from `Robot States s`). The diagram details the internal layers, data flow, and parameter injection points within this transformer-based block.
### Components/Axes
The diagram is structured as a flowchart with labeled rectangular blocks, circular addition nodes, and directional arrows. The primary components are:
1. **Inputs (Left & Bottom):**
* `Latent`: The main input tensor, entering from the left.
* `Time Embedding` (Orange box): Derived from `Flow-Matching Timestep t` (Orange box below it).
* `States Embedding` (Blue box): Derived from `Robot States s` (Blue box below it).
2. **Main Processing Block (Central Gray Box):** Contains three sequential sub-blocks, each with a residual connection.
* **Sub-block 1:** `Layer Norm` (Green) -> `Scale, Shift` (Blue) -> `Self Attention` (Pink) -> `Scale` (Blue). Parameters `γ₁, β₁` and `α₁` are injected here.
* **Sub-block 2:** `Layer Norm` (Green) -> `Cross Attention` (Pink).
* **Sub-block 3:** `Layer Norm` (Green) -> `Scale, Shift` (Blue) -> `MLP` (Yellow) -> `Scale` (Blue). Parameters `γ₂, β₂` and `α₂` are injected here.
3. **Conditioning Injection Points:**
* The combined `Time Embedding` and `States Embedding` (via a circular addition node) feed into the `Scale, Shift` layers of Sub-block 1 and Sub-block 3, providing parameters `γ₁, β₁` and `γ₂, β₂` respectively.
* The combined embedding also feeds directly into the `Scale` layers following the `Self Attention` and `MLP` modules, providing parameters `α₁` and `α₂`.
4. **Output (Right):** The processed tensor exits to the right after the final residual addition.
5. **Metadata:**
* Title/Label: `States-conditioned Wan 2.2 DiT Block` (Bottom right of the main gray box).
* Repetition Factor: `x30` (Bottom right corner).
### Detailed Analysis
**Data Flow & Layer Sequence:**
1. The `Latent` input first passes through a `Layer Norm`.
2. It then enters a `Scale, Shift` layer, which is modulated by parameters (`γ₁, β₁`) derived from the combined Time and States embeddings.
3. The normalized and scaled features undergo `Self Attention`.
4. The output of the attention is scaled by a factor `α₁` (also from the combined embeddings).
5. A residual connection adds the original `Latent` input to this processed output.
6. This sum goes through another `Layer Norm` and then `Cross Attention`. The diagram does not explicitly show the source for the cross-attention, but it is typically a secondary input like text or, in this context, possibly the robot state information.
7. Another residual connection adds the input from step 5 to the cross-attention output.
8. This sum goes through a third `Layer Norm`, followed by another `Scale, Shift` layer modulated by (`γ₂, β₂`).
9. The features are processed by an `MLP` (Multi-Layer Perceptron).
10. The MLP output is scaled by `α₂`.
11. A final residual connection adds the input from step 7 to this scaled output, producing the final block output.
**Parameter Notation:**
* `γ₁, β₁`, `γ₂, β₂`: Scale and shift parameters for feature modulation, likely applied in an affine transformation (e.g., `γ * x + β`).
* `α₁`, `α₂`: Scalar scaling parameters applied after attention and MLP operations.
### Key Observations
1. **Conditioning Mechanism:** The architecture uses a sophisticated conditioning scheme where robot states and diffusion timestep are embedded and then used to generate multiple sets of parameters (`γ, β, α`) that modulate the main latent features at different stages (before attention/MLP and after).
2. **Hybrid Attention:** The block contains both `Self Attention` (for intra-latent reasoning) and `Cross Attention` (for integrating external information, presumably the robot states).
3. **Residual Design:** The diagram shows three distinct residual addition points (marked with `⊕`), creating a deep, gradient-friendly pathway.
4. **Modular Repetition:** The label `x30` indicates this entire complex block is stacked 30 times, forming a deep network.
5. **Color Coding:** The diagram uses color to group similar operations: Green for Normalization, Blue for Scaling/Shifting, Pink for Attention, Yellow for MLP, Orange for Time-related components, and Blue for State-related components.
### Interpretation
This diagram details a specialized **Diffusion Transformer (DiT)** block designed for **robotic control tasks**. The "Wan 2.2" designation likely refers to a specific model version or architecture variant.
* **Purpose:** The block's function is to iteratively denoise or refine a `Latent` representation (which could encode a robot's trajectory, image, or plan) over 30 layers. The process is guided by two critical pieces of information: the diffusion timestep (`t`), which controls the noise level, and the current robot states (`s`), which provide real-world context.
* **How Elements Relate:** The `Time Embedding` and `States Embedding` are not merely concatenated to the input. Instead, they are fused and then used to generate *dynamic parameters* that actively transform the features within the main network. This is a form of **adaptive normalization** or **hypernetwork-based conditioning**, allowing the model to drastically change its behavior based on the robot's current situation and the diffusion process stage.
* **Significance:** This architecture is designed for high-precision, state-aware generation. The cross-attention layer is crucial for grounding the generated output in the actual physical state of the robot. The deep stack (30 blocks) suggests a high-capacity model capable of learning complex, multi-step planning or control policies. The use of "Flow-Matching" indicates it may be based on a modern, continuous-time formulation of diffusion models, which can be more efficient than discrete noising schedules.
</details>
Figure 2: State conditioning of DiT-Block. Wan2.2 TI2V-5B DiT architecture was updated to enable state conditioning using adaLN-Zero [30] and combining it with the timestep of the Flow Matching scheduler [36].
### 2.2 Training
Models were trained for 23k steps with AdamW [25] at a constant learning rate of $4\cdot 10^{-4}$. We applied LoRA [22] fine-tuning with rank 32 to the Wan 2.2 DiT backbone. We experimented with and without classifier-free guidance (CFG) [20] during training but observed little improvement in PSNR (see Sec. 2.4). Training was conducted on a DataCrunch instant cluster of 4 nodes, each with $8\times$ NVIDIA B200 GPUs, with a total effective batch size of 1024. The B200's VRAM capacity of 184GB allows more efficient training of memory-hungry video generation models.
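For readers unfamiliar with LoRA, the adapter structure can be sketched in a few lines. This is a minimal numpy illustration of the rank-decomposed update (names follow the LoRA paper, not Wan's code), not our training implementation:

```python
import numpy as np

# LoRA sketch: a frozen weight W plus a trainable low-rank update B @ A,
# scaled by alpha / r. B is zero-initialised so the adapter starts as a no-op.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 32, 32

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init

def lora_forward(x: np.ndarray) -> np.ndarray:
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

x = rng.normal(size=(8, d_in))
# With B = 0 the adapted layer exactly matches the frozen layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only `A` and `B` receive gradients, which is why LoRA keeps the memory footprint of fine-tuning a 5B-parameter backbone manageable.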
### 2.3 Inference
Since the challenge does not restrict inference compute time, we experimented with different approaches for our submissions. In our initial attempts, we followed the post-processing pipeline of [24], applying Gaussian blur and histogram matching to the predicted frames. This post-processing improved the PSNR score by $1.2$ dB, as reported in [24]. Because PSNR heavily penalizes outlier deviations from the target image, sharper images with slight errors are typically scored worse than blurrier images with comparable errors.
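Histogram matching here follows the textbook CDF-matching algorithm; the sketch below is our own numpy version, not the exact code of [24]:

```python
import numpy as np

def match_histograms(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Per-channel histogram matching: remap source values so their empirical
    CDF matches that of the reference image."""
    out = np.empty_like(source)
    for c in range(source.shape[-1]):
        src, ref = source[..., c].ravel(), reference[..., c].ravel()
        s_vals, s_idx, s_cnt = np.unique(src, return_inverse=True, return_counts=True)
        r_vals, r_cnt = np.unique(ref, return_counts=True)
        s_cdf = np.cumsum(s_cnt) / src.size
        r_cdf = np.cumsum(r_cnt) / ref.size
        matched = np.interp(s_cdf, r_cdf, r_vals)   # map source quantiles to ref values
        out[..., c] = matched[s_idx].reshape(source[..., c].shape)
    return out

# Toy check: matching against a constant reference collapses the channel.
src = np.arange(16, dtype=np.float64).reshape(4, 4, 1)
ref = np.full((4, 4, 1), 5.0)
print(match_histograms(src, ref)[0, 0, 0])  # 5.0
```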
We found that exploiting predictive uncertainty with an ensemble of predictions outperformed Gaussian blurring, since it blurs mainly regions of high motion, such as the humanoid's arms (see Tab. 2). Increasing the number of ensemble samples improved PSNR on both the validation set and the public leaderboard; tuning the number of inference steps and the classifier-free guidance weight affected performance further, as shown in Tab. 2.
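Why averaging samples helps PSNR can be seen with a toy simulation (our illustration, not challenge data): under squared error, the mean of independent samples has lower expected error than an average individual sample, because per-sample variance averages out.

```python
import numpy as np

# Simulate 20 noisy "predictions" of a target and compare single-sample MSE
# against the MSE of the ensemble mean.
rng = np.random.default_rng(0)
target = rng.normal(size=(64, 64, 3))
samples = [target + rng.normal(scale=0.5, size=target.shape) for _ in range(20)]

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

per_sample = float(np.mean([mse(s, target) for s in samples]))
ensemble = mse(np.mean(samples, axis=0), target)
assert ensemble < per_sample   # the noise averages out
print(per_sample, ensemble)
```

In regions of genuine multimodality (e.g. fast arm motion) the samples disagree, so the average is blurry there while static regions stay sharp, which is the behaviour we observed.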
Table 2: Sampling results on validation and test sets. † The results on test set were obtained after the deadline. ∗ This model has been trained on the whole train + validation raw dataset.
| Sub. | Num. Inf. Steps | Num. Samples | CFG Scale | Val. PSNR $[\uparrow]$ | Test PSNR $[\uparrow]$ | Val. SSIM $[\uparrow]$ | Val. LPIPS $[\downarrow]$ | Val. FID $[\downarrow]$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 20 | 1 | – | 22.63 | 21.05 | 0.707 | 0.137 | 40.23 |
|  | 20 | 5 | – | 24.52 | 22.11 | 0.750 | 0.165 | 71.46 |
|  | 20 | 20 | – | 24.88 | 22.42 | 0.762 | 0.201 | 90.71 |
| 1st sub. ∗ | 20 | 20 | – | 26.62 | 23.00 | 0.836 | 0.082 | 31.70 |
|  | 20 | 20 | 2.0 | 24.20 | 22.26 | 0.734 | 0.164 | 71.83 |
| 2nd sub. | 20 | 20 | 1.5 | 24.59 | 22.53 | 0.746 | 0.169 | 74.10 |
|  | 100 | 20 | 1.5 | 25.07 | 22.55 † | 0.762 | 0.148 | 65.76 |
|  | 100 | 20 | 1.0 | 25.53 | 23.04 † | 0.773 | 0.158 | 69.25 |
### 2.4 Results
Tab. 2 reports the quantitative results of our model on the validation set using the PSNR metric. We further extend the evaluation by reporting Structural Similarity Index Measure (SSIM) [37], Learned Perceptual Image Patch Similarity (LPIPS) [40], and Fréchet Inception Distance (FID) [19], all computed on our model’s predictions over the validation set.
The table is divided into three blocks. The first block contains models trained without classifier-free guidance (CFG) [20]. We ablate over the number of averaged samples used for the final prediction, ranging from 1 to 20. Increasing the number of samples has a smoothing effect that improves Val. PSNR but degrades visual quality, as reflected in the other metrics. The bottom row of this block corresponds to a model additionally trained on the validation dataset, so its validation values are not comparable with the other entries in the table; on the public leaderboard, however, it gained $+0.58$ dB in PSNR.
The second and third blocks present models trained with CFG. Earlier experiments on the validation data showed that raising the cfg_scale beyond a certain point did not improve PSNR; we nevertheless retained a CFG run (cfg_scale 1.5) as our second competition submission. For completeness, we also report results obtained by increasing the number of sampling steps with the same checkpoint; these show consistent improvements over the previous CFG-based predictions.
## 3 Compression Challenge
<details>
<summary>x2.png Details</summary>

### Visual Description
## Neural Network Architecture Diagram: Spatio-Temporal Transformer Block
### Overview
The image displays a detailed architectural diagram of a neural network model designed for processing sequential spatio-temporal data, likely for tasks involving video understanding or robotics. The diagram illustrates the data flow from input tokens, through a series of processing blocks involving attention mechanisms and normalization layers, to output tokens. A key feature is the integration of a "Robot State" input early in the pipeline.
### Components/Axes
The diagram is organized horizontally, representing the flow of data from left (input) to right (output).
**1. Input Section (Left):**
* **Label:** `Input Tokens`
* **Structure:** A 3D tensor represented as a stack of grids.
* **Dimensions:** Labeled with `T` (vertical axis, likely Time/Sequence length), `H` (Height), and `W` (Width).
**2. Robot State Integration (Top-Left):**
* **Label:** `Robot State` (accompanied by a small image of a humanoid robot).
* **Processing Path:** The Robot State data flows through:
* `MLP` (Multi-Layer Perceptron)
* `Conv` (Convolutional layer)
* **Integration Point:** The processed Robot State is combined with the embedded input tokens via a summation operation (⊕ symbol).
**3. Core Processing Block (Center):**
This main section is enclosed in a rounded rectangle and is repeated `L` times (indicated by `L Layers` label at the top-right of the block). Each layer contains:
* **Spatial Attention Sub-block:**
* Input goes through a `Layer Norm` (green vertical bar).
* The core is a `Spatial Attention` mechanism, visualized as a grid with colored cells (blue, purple, pink gradients).
* A residual connection (arrow bypassing the block) adds the original input to the attention output via a summation (⊕).
* **Temporal Attention Sub-block:**
* The output from the spatial block goes through another `Layer Norm`.
* The core is a `Temporal Attention` mechanism, visualized as a grid with a different color pattern (orange, red, blue, green, purple).
* Another residual connection and summation (⊕) follow.
* **ST Block (Spatio-Temporal Block):**
* The output goes through a third `Layer Norm`.
* The core is an `MLP` (yellow block).
* A final residual connection and summation (⊕) for this layer.
**4. Output Section (Right):**
* After the `L` repeated layers, the data passes through a final `Layer Norm` and a `Linear` layer.
* **Final Output:** A 3D tensor labeled `Output Tokens`, with a structure mirroring the input.
### Detailed Analysis
* **Data Flow:** The primary path is `Input Tokens` -> `Embed` -> [Integration with processed `Robot State`] -> `L x (Spatial Attention -> Temporal Attention -> ST Block/MLP)` -> `Layer Norm` -> `Linear` -> `Output Tokens`.
* **Key Operations:** The diagram explicitly labels the following operations: `Embed`, `MLP`, `Conv`, `Layer Norm`, `Spatial Attention`, `Temporal Attention`, `Linear`, and summation (⊕) for residual connections.
* **Visual Coding:**
* **Layer Norm:** Consistently represented as vertical green bars.
* **Attention Mechanisms:** Represented by colored grids. The Spatial Attention grid uses a blue-to-pink vertical gradient. The Temporal Attention grid uses a more complex, multi-colored checkerboard pattern.
* **MLP:** Represented as a solid yellow block within the ST Block.
* **Residual Connections:** Represented by black arrows that bypass the main processing blocks and connect to summation circles (⊕).
### Key Observations
1. **Dual Attention Mechanism:** The architecture explicitly separates `Spatial Attention` and `Temporal Attention` into sequential sub-blocks within each layer. This suggests a design focused on independently modeling spatial relationships (within a frame) and temporal relationships (across frames) before combining them.
2. **Early Fusion of Robot State:** The `Robot State` is processed and injected into the network at the very beginning, after the initial token embedding. This indicates that proprioceptive or state information from the robot is a critical, foundational input for the model's predictions.
3. **Residual Learning Framework:** Every major sub-block (Spatial Attention, Temporal Attention, MLP) is followed by a residual connection. This is a standard technique to facilitate training deep networks by allowing gradients to flow more easily.
4. **Parameter Sharing:** The label `L Layers` indicates that the entire block containing Spatial Attention, Temporal Attention, and the ST Block is repeated `L` times, with the weights likely being shared across these layers.
### Interpretation
This diagram depicts a sophisticated **Spatio-Temporal Transformer** variant, tailored for embodied AI or robotics applications. The architecture is designed to process a sequence of observations (e.g., video frames or a history of sensor readings, represented as `Input Tokens` with dimensions T, H, W) while simultaneously conditioning on the robot's own state.
The separation of spatial and temporal attention is a strategic choice. It allows the model to first understand "what is where" in each individual observation (spatial attention) and then understand "how things change over time" (temporal attention). This is more interpretable and potentially more efficient than a single, monolithic spatio-temporal attention mechanism.
The early fusion of the `Robot State` is crucial. It grounds the visual or sensory processing in the context of the robot's own configuration (e.g., joint angles, position), enabling the model to make predictions or decisions that are physically plausible and relevant to the robot's immediate situation. The repeated `L` layers allow the model to build increasingly abstract and integrated representations of the scene and its dynamics, ultimately producing `Output Tokens` that could be used for tasks like action prediction, video captioning, or control signal generation. The overall design emphasizes hierarchical feature extraction and the integration of multimodal (exteroceptive and proprioceptive) information.
</details>
(a) Illustration of our ST-Transformer architecture for the compression challenge. Given three grids of past video tokens of shape $3\times 32\times 32$, as well as the robot state of shape $64\times 25$ as context, the transformer predicts the next three grids of shape $3\times 32\times 32$. The ST-Transformer consists of $L$ layers of spatio-temporal blocks, each containing per-time-step spatial attention over the $H\times W$ tokens at time step $t$, followed by causal temporal attention at the same spatial coordinate across time, and then a feed-forward network. Each colour in the spatial and temporal attention represents a single self-attention map.
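The two attention patterns reduce to different reshapes of the token grid. The sketch below illustrates this factorisation with single-head, unprojected attention in numpy; it is a shape-level illustration of the ST block, not our trained model:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x: np.ndarray) -> np.ndarray:
    """Mix the H*W tokens within each time step. x: (T, H, W, C)."""
    T, H, W, C = x.shape
    t = x.reshape(T, H * W, C)                      # one sequence per time step
    att = softmax(t @ t.transpose(0, 2, 1) / np.sqrt(C))
    return (att @ t).reshape(T, H, W, C)

def temporal_attention(x: np.ndarray) -> np.ndarray:
    """Causally mix the T tokens at each spatial coordinate. x: (T, H, W, C)."""
    T, H, W, C = x.shape
    t = x.reshape(T, H * W, C).transpose(1, 0, 2)   # (H*W, T, C)
    scores = t @ t.transpose(0, 2, 1) / np.sqrt(C)
    scores += np.where(np.tril(np.ones((T, T))) == 1, 0.0, -np.inf)  # causal mask
    out = softmax(scores) @ t
    return out.transpose(1, 0, 2).reshape(T, H, W, C)

x = np.random.default_rng(0).normal(size=(3, 32, 32, 8))
assert spatial_attention(x).shape == x.shape
assert temporal_attention(x).shape == x.shape
```

Factorising attention this way reduces the cost from one attention over all $T\cdot H\cdot W$ tokens to $T$ attentions over $H\cdot W$ tokens plus $H\cdot W$ attentions over $T$ tokens.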
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Training and Validation Cross Entropy Loss
### Overview
The image displays a line chart tracking the cross entropy loss of a machine learning model over the course of training. It compares three different loss curves: training loss using teacher forcing, validation loss using teacher forcing, and validation loss using autoregressive generation. The chart illustrates the model's learning progression and the performance gap between training and validation under different inference modes.
### Components/Axes
* **Chart Type:** Line chart.
* **Y-Axis (Vertical):**
* **Label:** "Cross Entropy Loss"
* **Scale:** Linear scale ranging from 4 to 8.
* **Major Tick Marks:** 4, 5, 6, 7, 8.
* **X-Axis (Horizontal):**
* **Label:** "Step"
* **Scale:** Linear scale representing training steps, marked in thousands (k).
* **Major Tick Marks:** 0k, 20k, 40k, 60k, 80k.
* **Legend:**
* **Position:** Top-right quadrant of the chart area.
* **Series 1:** Blue line, labeled "Train (Teacher Forced)".
* **Series 2:** Orange line, labeled "Val (Teacher Forced)".
* **Series 3:** Green line, labeled "Val (Autoregressive)".
### Detailed Analysis
**1. Train (Teacher Forced) - Blue Line:**
* **Trend:** Starts at a very high loss (off the chart, >8 at step 0), experiences a steep initial descent, then transitions to a noisy, gradually decreasing trend with significant variance.
* **Approximate Data Points:**
* Step ~0k: Loss > 8 (initial point not fully visible).
* Step ~5k: Loss ≈ 6.0.
* Step ~20k: Loss ≈ 5.3 (with fluctuations between ~5.1 and 5.5).
* Step ~40k: Loss ≈ 5.1 (fluctuating between ~4.9 and 5.3).
* Step ~60k: Loss ≈ 5.0 (fluctuating between ~4.8 and 5.2).
* Step ~80k: Loss ≈ 4.8 (fluctuating between ~4.7 and 5.0).
**2. Val (Teacher Forced) - Orange Line:**
* **Trend:** Starts high, descends very steeply and smoothly in the initial phase, then continues a steady, smooth decline with minimal noise, consistently maintaining the lowest loss of the three series.
* **Approximate Data Points:**
* Step ~0k: Loss ≈ 7.5.
* Step ~5k: Loss ≈ 5.5.
* Step ~20k: Loss ≈ 4.7.
* Step ~40k: Loss ≈ 4.5.
* Step ~60k: Loss ≈ 4.4.
* Step ~80k: Loss ≈ 4.2.
**3. Val (Autoregressive) - Green Line:**
* **Trend:** Starts at the highest visible point, descends steeply but less sharply than the orange line, then follows a smooth, gradual decline. It remains consistently above the "Val (Teacher Forced)" line throughout training.
* **Approximate Data Points:**
* Step ~0k: Loss ≈ 8.0.
* Step ~5k: Loss ≈ 6.2.
* Step ~20k: Loss ≈ 5.5.
* Step ~40k: Loss ≈ 5.2.
* Step ~60k: Loss ≈ 5.1.
* Step ~80k: Loss ≈ 5.0.
### Key Observations
1. **Performance Hierarchy:** The validation loss under teacher forcing (orange) is consistently the lowest, followed by the training loss (blue), with the validation loss under autoregressive generation (green) being the highest.
2. **Convergence:** All three loss curves show a clear downward trend, indicating the model is learning. The rate of improvement slows significantly after approximately 20,000 to 40,000 steps.
3. **Noise vs. Smoothness:** The training loss (blue) exhibits considerable high-frequency noise or variance, which is typical as it's calculated on mini-batches. Both validation curves (orange and green) are much smoother, as they are likely computed over the entire validation set.
4. **Generalization Gap:** There is a persistent gap between the two validation curves. The "Val (Autoregressive)" loss is approximately 0.8 to 1.0 points higher than the "Val (Teacher Forced)" loss at the end of training (80k steps). This quantifies the performance degradation when the model generates sequences autoregressively (using its own predictions) versus when it is guided by ground-truth tokens (teacher forcing) during validation.
### Interpretation
This chart is a diagnostic tool for sequence model training (e.g., a language model or time-series forecaster). The data suggests:
* **Successful Learning:** The model is effectively minimizing cross entropy loss on both training and validation data, indicating it is learning the underlying patterns in the dataset.
* **Teacher Forcing vs. Autoregressive Inference:** The significant and persistent gap between the orange and green validation curves highlights a core challenge in sequence modeling: **exposure bias**. The model performs better when its predictions are conditioned on perfect ground-truth data (teacher forcing) than when it must rely on its own, potentially erroneous, previous predictions during autoregressive generation. This gap represents the real-world performance penalty the model will incur during deployment.
* **Training Dynamics:** The noisy blue training curve suggests the use of stochastic gradient descent with mini-batches. The smooth validation curves indicate stable evaluation. The plateauing of all curves after ~40k steps suggests diminishing returns from further training under the current hyperparameters (learning rate, etc.), and that the model may be approaching its capacity for this specific task and dataset.
* **Potential for Overfitting:** While both validation losses are decreasing, the fact that the training loss (blue) remains higher than the "Val (Teacher Forced)" loss (orange) is unusual. Typically, training loss is lower than validation loss. This could indicate a specific characteristic of the loss calculation, regularization techniques (like dropout) active only during training, or that the training set is more challenging than the validation set. It does not show classic overfitting (where training loss continues to drop while validation loss rises).
In summary, the chart demonstrates a model that learns effectively but suffers from a measurable exposure bias, and its training process is stable but may benefit from hyperparameter tuning to close the generalization gap and reduce the noise in the training loss.
</details>
(b) Training curves for the compression challenge. At train time, we use teacher forcing (blue). We then evaluate on the validation set both with (unrealistic) teacher forcing (orange) and with the greedy autoregressive generation used at inference time (green).
Figure 3: Overall figure showing (a) the ST-Transformer world model architecture and (b) its training curves in the compression challenge.
Unlike the Sampling Challenge, which measures prediction directly in pixel space, the Compression Challenge evaluates models in a discrete latent space. Each video sequence is first compressed into a grid of discrete tokens using the Cosmos $8\times 8\times 8$ tokeniser [28], producing a compact sequence that can be modelled with sequence architectures.
#### Problem Statement
Given a context of $H=3$ grids of $32\times 32$ tokens and robot states for both past and future timesteps, the task is to predict the next $M=3$ grids of $32\times 32$ tokens:
$$
\hat{{\bm{z}}}_{H:H+M-1}\sim f_{\theta}({\bm{z}}_{0:H-1},{\bm{s}}_{0:63}) \tag{1}
$$
where $\hat{{\bm{z}}}_{H:H+M-1}$ are the predicted token grids for the future frames. The tokenized training dataset $\mathcal{D}$ contains approximately $306{,}000$ samples. Each sample consists of:
- Tokenised video: $6$ consecutive token grids (3 past, 3 future), each of size $32\times 32$ , giving $6144$ tokens per sample and $\sim 1.88$ B tokens overall.
- Robot state: a sequence ${\bm{s}}\in\mathbb{R}^{64\times 25}$ aligned with the corresponding raw video frames.
A block of three $32\times 32$ token grids corresponds to 17 RGB frames at $256\times 256$ resolution, so predictions in token space remain aligned with the original video. Performance is evaluated using top-500 cross-entropy, which considers only the top-500 logits per token.
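The token bookkeeping above can be verified with a few lines of arithmetic. In this sketch, the spatial and temporal factors come from the $8\times 8\times 8$ compression of the Cosmos tokeniser, and the temporal formula is one consistent reading of "17 frames correspond to 3 grids":

```python
# Cosmos 8x8x8 tokeniser: temporal compression 8, spatial compression 8x8.
FRAME_RES = 256       # raw RGB resolution (256x256)
SPATIAL_DOWN = 8      # spatial compression factor per axis
TEMPORAL_DOWN = 8     # temporal compression factor
RAW_FRAMES = 17       # frames covered by one block of 3 token grids

grid_side = FRAME_RES // SPATIAL_DOWN                    # 32 tokens per axis
grids_per_block = 1 + (RAW_FRAMES - 1) // TEMPORAL_DOWN  # first frame + 2 chunks of 8
tokens_per_sample = 2 * grids_per_block * grid_side**2   # 3 past + 3 future grids

print(grid_side)         # 32
print(grids_per_block)   # 3
print(tokens_per_sample) # 6144
print(306_000 * tokens_per_sample / 1e9)  # ~1.88 billion tokens overall
```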
### 3.1 Model
#### Spatio-temporal Transformer
Following Genie [6], our world model builds on the Vision Transformer (ViT) [9, 35]. An overview is shown in Fig. 3. To reduce the quadratic memory cost of standard Transformers, we use a spatio-temporal (ST) Transformer [38], which alternates spatial and temporal attention blocks followed by feed-forward layers. Spatial attention attends over $1\times 32\times 32$ tokens per frame, while temporal attention (with a causal mask) attends across $T\times 1\times 1$ tokens over time. This design makes spatial attention, the main bottleneck, scale linearly with the number of frames, improving efficiency for video generation. We apply pre-LayerNorm [2] and QKNorm [18] for stability. Positional information is added via learnable absolute embeddings for both spatial and temporal tokens. Our transformer uses 24 layers, 8 heads, an embedding dimension of $512$ , a sequence length of $T=5$ , and dropout of $0.1$ on all attention, MLPs, and residual connections.
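The efficiency argument can be made concrete by counting attention-score entries. For $T$ frames of $H\times W$ tokens, full spatio-temporal attention costs $(T\cdot H\cdot W)^2$ entries per layer, whereas the alternating design costs $T\cdot(HW)^2$ for spatial plus $HW\cdot T^2$ for temporal attention, i.e. linear in $T$ for the dominant spatial term. A back-of-the-envelope sketch (constants and head counts omitted):

```python
def full_attention_scores(T, H, W):
    """Joint attention over all T*H*W tokens: quadratic in the full sequence."""
    n = T * H * W
    return n * n

def st_attention_scores(T, H, W):
    """Spatial attention per frame (T blocks of (H*W)^2 scores) plus
    temporal attention per spatial position (H*W blocks of T^2 scores)."""
    return T * (H * W) ** 2 + (H * W) * T ** 2

T, H, W = 5, 32, 32  # sequence length and token grid used in our model
print(full_attention_scores(T, H, W))  # 26,214,400
print(st_attention_scores(T, H, W))    # 5,268,480 (~5x fewer)
```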
#### State Conditioning
Robot states are encoded as additive embeddings following Bruce et al. [6]. The state vector is projected with an MLP, processed by a 1D convolution (kernel size 3, padding 1), and enriched with absolute position embeddings before being combined with video tokens.
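The 1D convolution in the state encoder uses kernel size 3 with padding 1, which preserves the 64-step temporal alignment between states and video tokens. A dependency-free sketch of that same-length property (the kernel weights here are arbitrary placeholders; the real layer is learned):

```python
def conv1d_same(seq, kernel):
    """Minimal 1D convolution with kernel size 3 and zero padding 1,
    so the output has the same length as the input sequence."""
    assert len(kernel) == 3
    padded = [0.0] + list(seq) + [0.0]  # padding = 1 on both sides
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]

state_channel = [float(t) for t in range(64)]  # one of the 25 state channels
smoothed = conv1d_same(state_channel, [0.25, 0.5, 0.25])
print(len(smoothed))  # 64: still aligned with the 64 state timesteps
```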
### 3.2 Training
We implemented our model in PyTorch [29] and trained it using the fused AdamW optimiser [25] with $\beta_{1}=0.9$ and $\beta_{2}=0.95$ for $80$ epochs. Weight decay of $0.05$ was applied only to parameter matrices, while biases, normalisation parameters, gains, and positional embeddings were excluded. Following GPT-2 [32] and Press and Wolf [31], Bertolotti and Cazzola [5], we tied the input and output embeddings. This reduces the memory footprint by removing one of the two largest weight matrices and typically improves both training speed and final performance.
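The decay/no-decay split described above can be implemented by partitioning parameters before building the optimiser. A framework-agnostic sketch of the rule (parameter names are illustrative; in PyTorch the two groups would be passed to `AdamW` as separate param groups):

```python
def split_decay_groups(named_params):
    """Partition parameters into weight-decay and no-decay groups.
    `named_params` maps a parameter name to its number of dimensions.
    Matrices (ndim >= 2) are decayed; biases, norm parameters, gains,
    and positional embeddings (matched by name or ndim) are not."""
    no_decay_markers = ("bias", "norm", "gain", "pos_emb")
    decay, no_decay = [], []
    for name, ndim in named_params.items():
        if ndim >= 2 and not any(m in name for m in no_decay_markers):
            decay.append(name)
        else:
            no_decay.append(name)
    return decay, no_decay

params = {
    "blocks.0.attn.qkv.weight": 2,
    "blocks.0.attn.qkv.bias": 1,
    "blocks.0.norm1.weight": 1,
    "pos_emb.spatial": 3,
    "token_emb.weight": 2,  # tied with the output head, so it appears once
}
decay, no_decay = split_decay_groups(params)
print(decay)     # weight matrices (including the tied embedding)
print(no_decay)  # biases, norm parameters, positional embeddings
```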
#### Training Objective
The model was trained to minimise the cross-entropy loss between predicted and ground-truth tokens at future time steps:
$$
\min_{\theta}\,\mathbb{E}_{({\bm{z}}_{t},{\bm{s}}_{t})_{t=0:H+M-1}\sim\mathcal{D},\,\hat{{\bm{z}}}_{t}\sim f_{\theta}(\cdot)}\left[\sum_{t=H}^{H+M-1}\text{CE}\left(\hat{{\bm{z}}}_{t},{\bm{z}}_{t}\right)\right],
$$
where $\hat{{\bm{z}}}_{t}$ is the model output at time $t$ , CE denotes the cross-entropy loss over all tokens in the grid, and $\mathcal{D}$ is the dataset of tokenised video and state sequences. Training used teacher forcing to allow parallel computation across timesteps, with a linear learning rate schedule from peak $8\times 10^{-4}$ to $0$ after a warmup of $2000$ steps.
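The learning-rate schedule can be written as a small function: linear warmup to the peak over 2000 steps, then linear decay to zero at the final step. A sketch, where `total_steps` stands in for whatever the 80-epoch run works out to:

```python
PEAK_LR = 8e-4
WARMUP_STEPS = 2000

def lr_at(step, total_steps):
    """Linear warmup to PEAK_LR, then linear decay to 0 at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    frac_left = (total_steps - step) / (total_steps - WARMUP_STEPS)
    return PEAK_LR * max(0.0, frac_left)

total = 80_000  # illustrative; roughly matches the x-axis of Fig. 3(b)
print(lr_at(0, total))       # 0.0
print(lr_at(2_000, total))   # 0.0008 (peak)
print(lr_at(80_000, total))  # 0.0
```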
#### Implementation
Training used automatic mixed precision (AMP) with bfloat16, but inference used float32 due to degraded performance in bfloat16. Linear layer biases were zero-initialised, and weights (including embeddings) were drawn from $\mathcal{N}(0,0.02)$ . We trained with an effective batch size of 160 on the same B200 DataCrunch instant cluster as in the sampling challenge.
### 3.3 Inference
Our autoregressive model generates sequences via
$$
p({\bm{z}}_{H:H+M-1}\mid{\bm{z}}_{0:H-1},{\bm{s}}_{0:63})=\prod_{t=H}^{H+M-1}f_{\theta}({\bm{z}}_{t}\mid{\bm{z}}_{<t},{\bm{s}}_{0:63}),
$$
where each step outputs a categorical distribution over each spatial token. Sampling draws ${\bm{z}}_{t}\sim f_{\theta}(\cdot)$ , introducing diversity but typically yielding lower-probability trajectories and higher loss. Greedy decoding instead selects
$$
{\bm{z}}_{t}=\arg\max_{{\bm{z}}}f_{\theta}({\bm{z}}\mid{\bm{z}}_{<t},{\bm{s}}_{0:63}),
$$
producing deterministic, high-probability sequences that we found both effective and efficient.
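A minimal sketch of the greedy loop, using a toy `logits_fn` in place of the transformer (the real model conditions on all previous token grids and the full state sequence, and decodes a $32\times 32$ grid per step):

```python
def greedy_decode(logits_fn, context, num_steps):
    """Autoregressive greedy decoding: at each step, feed back the argmax
    token instead of sampling from the categorical distribution."""
    tokens = list(context)
    for _ in range(num_steps):
        logits = logits_fn(tokens)  # unnormalised scores over the vocabulary
        best = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(best)         # deterministic, high-probability choice
    return tokens[len(context):]

# Toy dynamics: the "model" always prefers (last token + 1) mod vocab size.
def toy_logits(tokens, vocab=8):
    target = (tokens[-1] + 1) % vocab
    return [1.0 if v == target else 0.0 for v in range(vocab)]

print(greedy_decode(toy_logits, context=[3], num_steps=3))  # [4, 5, 6]
```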
### 3.4 Results
Fig. 3(b) shows the training curves for our ST-Transformer. The blue curve corresponds to the training loss under teacher-forced training. While the teacher-forced validation loss is optimistic – since it conditions on ground-truth inputs – it can be interpreted as a lower bound on the achievable loss, representing the performance of an idealised autoregressive model with perfect inference. To reduce the gap between teacher-forced and autoregressive performance, we experimented with scheduled sampling [4, 27]. However, this did not lead to meaningful improvements.
## 4 Conclusion
In this report, we presented two complementary approaches that achieved strong performance across both 1X World Model Challenges. First, we showed how internet-scale data can be leveraged by fine-tuning a pre-trained image–text-to-video foundation model. Using multi-node training on the DataCrunch instant cluster, we reached first place on the leaderboard in only 36 hours—an order of magnitude faster than the runner-up, who required about a month. To further improve inference, we averaged over samples to selectively blur regions of high predictive uncertainty. While this proved effective for optimising PSNR, the most suitable inference strategy for downstream decision-making remains an open question. Second, we demonstrated how a spatio-temporal transformer world model can be trained on the tokenised dataset in under 17 hours. We found that greedy autoregressive inference offered a practical balance of speed and accuracy. Despite its simplicity, the model achieved substantially lower loss values than other leaderboard entries.
## References
- Alonso et al. [2024] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Curran Associates, Inc., 2024.
- Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv preprint arXiv:1607.06450, 2016.
- Bar et al. [2024] Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation World Models. arXiv preprint arXiv:2412.03572, 2024.
- Bengio et al. [2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2015.
- Bertolotti and Cazzola [2024] Francesco Bertolotti and Walter Cazzola. By Tying Embeddings You Are Assuming the Distributional Hypothesis. In Proceedings of the 41st International Conference on Machine Learning, pages 3584–3610, 2024.
- Bruce et al. [2024] Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, and Tim Rocktäschel. Genie: Generative Interactive Environments. arXiv preprint arXiv:2402.15391, 2024.
- Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. In Advances in Neural Information Processing Systems, 2018.
- Decart et al. [2024] Decart, Etched, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A Universe in a Transformer. 2024.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2020.
- Guo et al. [2025] Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft. arXiv preprint arXiv:2504.08388, 2025.
- Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018.
- HaCohen et al. [2024] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime Video Latent Diffusion. arXiv preprint arXiv:2501.00103, 2024.
- Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning, 2019.
- Hafner et al. [2022] Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with Discrete World Models. In International Conference on Learning Representations, 2022.
- Hafner et al. [2025] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Control Tasks through World Models. Nature, 640(8059):647–653, 2025.
- Hansen et al. [2023] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, Robust World Models for Continuous Control. In The Twelfth International Conference on Learning Representations, 2023.
- Hansen et al. [2022] Nicklas A. Hansen, Hao Su, and Xiaolong Wang. Temporal Difference Learning for Model Predictive Control. In Proceedings of the 39th International Conference on Machine Learning, 2022.
- Henry et al. [2020] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-Key Normalization for Transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253. Association for Computational Linguistics, 2020.
- Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
- Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. arXiv preprint arXiv:2205.15868, 2022.
- Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022.
- Kong et al. [2024] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv preprint arXiv:2412.03603, 2024.
- Liu et al. [2025] Peter Liu, Annabelle Chu, and Yiran Chen. Effective World Modeling for Humanoid Robots: Long-Horizon Prediction and Efficient State Compression. Technical Report Team Duke, Duke University, 2025. 1X World Model Challenge, CVPR 2025.
- Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
- Micheli et al. [2022] Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are Sample-Efficient World Models. In The Eleventh International Conference on Learning Representations, 2022.
- Mihaylova and Martins [2019] Tsvetomila Mihaylova and André F. T. Martins. Scheduled Sampling for Transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 351–356. Association for Computational Linguistics, 2019.
- NVIDIA et al. [2025] NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, and Artur Zolkowski. Cosmos World Foundation Model Platform for Physical AI. arxiv preprint arXiv:2501.03575, 2025.
- Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
- Peebles and Xie [2023] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
- Press and Wolf [2017] Ofir Press and Lior Wolf. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163. Association for Computational Linguistics, 2017.
- Radford et al. [2019] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. 2019.
- Robine et al. [2022] Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based World Models Are Happy With 100k Interactions. In The Eleventh International Conference on Learning Representations, 2022.
- Scannell et al. [2025] Aidan Scannell, Mohammadreza Nakhaei, Kalle Kujanpää, Yi Zhao, Kevin Luck, Arno Solin, and Joni Pajarinen. Discrete Codebook World Models for Continuous Control. In The Thirteenth International Conference on Learning Representations, 2025.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
- Wang et al. [2025] Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Xiaofeng Meng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, and Ziyu Liu. Wan: Open and Advanced Large-Scale Video Generative Models. CoRR, abs/2503.20314, 2025.
- Wang et al. [2003] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, pages 1398–1402. IEEE, 2003.
- Xu et al. [2021] Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-Temporal Transformer Networks for Traffic Flow Forecasting. arXiv preprint arXiv:2001.02908, 2021.
- Yang et al. [2024] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. arXiv preprint arXiv:2408.06072, 2024.
- Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595. IEEE Computer Society, 2018.
- Zhang et al. [2023] Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 27147–27166. Curran Associates, Inc., 2023.