2510.07092v1

Model: nemotron-free

# Generative World Modelling for Humanoids 1X World Model Challenge Technical Report - Team Revontuli > equal contribution. ## Abstract World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges. ## 1 Introduction World models [11] equip agents (e.g. humanoid robots) with internal simulators of their environments. By “imagining” the consequences of their actions, agents can plan, anticipate outcomes, and improve decision-making without direct real-world interaction. A central challenge in world modelling is the design of architectures that are both sufficiently expressive and computationally tractable. Early approaches have largely relied on recurrent networks [13, 14, 15] or multilayer perceptrons [16, 17, 34, 7]. More recently, advances in generative modelling have driven a new wave of architectural choices. A prominent line of work leverages autoregressive transformers over discrete latent spaces [6, 33, 41, 26, 3, 10], while others explore diffusion- and flow-based approaches [1, 8]. At scale, these methods underpin powerful foundation models [21, 39, 36, 28, 12, 23] capable of producing realistic and accurate video predictions. <details> <summary>figs/image_1.png Details</summary> ![223bda9d](/v1/image/223bda9dc2521b0861fc8f8176601414e2757529859489e5440008d5e6387fd4) ### Visual Description ## Flowchart: Process of Tokenization, Generation, and Evaluation ### Overview The image depicts a sequential workflow involving three stages: **Context + States**, **Generation**, and **Evaluation**. Each stage is represented by a 3D cube structure, with directional arrows indicating the flow of data or processes. The diagram uses color-coded arrows (teal, green, blue) to differentiate stages and includes a "Tokenizer" component as a bridge between the first and second stages. ### Components/Axes - **Sections**: 1. **Context + States**: A teal arrow points to a 3D cube labeled "Context + States". 2. **Tokenizer**: A green arrow labeled "Tokenizer" connects the first cube to the second. 3. **Generation**: A green cube labeled "Generation" follows the Tokenizer. 4. **Evaluation**: A blue arrow connects "Generation" to a third cube labeled "Evaluation". - **Arrows**: - Teal (Context + States → Tokenizer) - Green (Tokenizer → Generation) - Blue (Generation → Evaluation) - **Cubes**: - 3D structures with grid-like patterns, representing data or state complexity. - The first and third cubes have more visible layers than the second. ### Detailed Analysis - **Context + States**: - A teal arrow originates from the left, pointing to a 3D cube. - The cube is labeled "Context + States" in black text. - **Tokenizer**: - A green arrow labeled "Tokenizer" connects the first cube to the second. - The second cube is labeled "Generation" in black text. - **Generation**: - A green cube with a grid pattern, labeled "Generation". - A blue arrow extends from this cube to the third section. - **Evaluation**: - A blue arrow points to a third 3D cube labeled "Evaluation". - This cube has a similar grid pattern but appears slightly larger than the first. ### Key Observations 1. **Sequential Flow**: The process is linear, with each stage directly feeding into the next. 2. **Data Complexity**: The first and third cubes (Context + States and Evaluation) have more visible layers, suggesting higher complexity or data volume. 3. **Color Coding**: Arrows and cubes use distinct colors (teal, green, blue) to differentiate stages. 4. **No Numerical Data**: The diagram lacks explicit numerical values, focusing on structural relationships. ### Interpretation The flowchart illustrates a workflow where **Context + States** are processed through a **Tokenizer** to generate intermediate data ("Generation"), which is then evaluated. The increasing complexity of the cubes in the first and third stages implies that both the initial context and the final evaluation involve more detailed or layered data. The "Tokenizer" acts as a critical intermediary, transforming raw context into a format suitable for generation. The absence of numerical data suggests the diagram emphasizes process flow over quantitative metrics. </details> Figure 1: Overview of the 1X World Model Challenges Left depicts the context (inputs), middle the model generations, and right the evaluations. Sampling challenge (top): The model observes 17 past frames along with past and future robot states, then generates future frames in pixel space. Performance is measured by PSNR between the predicted and ground-truth 77th frame. Compression challenge (bottom): The Cosmos $8\times 8\times 8$ tokeniser encodes the history of 17 RGB frames into three latent token grids of shape $3\times 32\times 32$ . Models must predict the next three latent token grids corresponding to the next 17 frames. Evaluation is based on Top-500 cross-entropy between predicted and ground-truth tokens. Table 1: Performance on Public 1X World Model Leaderboard | Sampling | Revontuli Duke | Test 23.00 21.56 | Val 25.53 25.30 | Test (Top-500) – – | Val – – | 1st 2nd | | --- | --- | --- | --- | --- | --- | --- | | Michael | 18.51 | – | – | – | 3rd | | | Revontuli | – | – | 6.64 | 4.92 | 1st | | | Compression | Duke | – | – | 7.50 | 5.60 | 2nd | | a27sridh | – | – | 7.99 | – | 3rd | | The 1X World Model Challenge evaluates predictive performance on two tracks: Sampling and Compression. Fig. 1 outlines the tasks, and Tab. 1 reports our results. These challenges capture core problems when using world models in robotics. Our methods show strong performance that we hope will shape future efforts. ## 2 Sampling Challenge #### Problem Statement In the sampling task, the model must predict the $512\times 512$ frame observed by the robot $2$ s into the future. Conditioning is provided by the first 17 frames $\mathbf{x}_{0:16}$ and the complete sequence of robot states $\mathbf{s}_{0:76}\in\mathbb{R}^{77\times 25}$ . Performance is evaluated using PSNR between the predicted and ground-truth last frames. #### Data Pre-processing We downsample the original 77 frames clips by a factor of four, yielding shorter 21 sample clips. As a result, this gives us five conditioning frames, ( $\mathbf{x}_{0},\mathbf{x}_{4},\dots,\mathbf{x}_{16}$ ), and the remaining 16 serve as prediction targets. Wan2.2-VAE applies spatial compression to the first frame and temporal compression of 4 to the remaining frames, producing a latent sequence of length $(1+(L-1)/4)$ for a clip of length $L=21$ . ### 2.1 Model #### Base Model For our solution, we adapt Wan 2.2 TI2V-5B [36], a flow-matching generative video model with a 30-layer DiT backbone [30]. The base model is designed as a text-image-to-video (TI2V), but we modified the architecture to condition the predictions on videos and robot states. The model operates on latent video representations from Wan2.2-VAE, which compresses clips to a size $(1+(L-1)/4)\times 16\times 16$ . #### Video-State Conditioning To incorporate video conditioning, we modified the masking of the input latents. In a standard image-to-video model, the first latent in the time dimension is masked, treating the input image as fixed during generation, thereby establishing a conditional mapping. We extend this idea by fixing multiple frames during generation, effectively transforming the model from image-to-video to video-to-video. The original Wan 2.2 also conditions textual prompts to generate videos. Since our dataset does not include textual descriptions, we use empty strings as text prompts while retaining the original cross-attention layer, enabling future work to leverage text conditioning. As shown in Fig. 2, we incorporate state conditioning into the model’s predictions using adaLN-Zero [30] within Wan’s DiT blocks. We first downsample the states to match those of the downsampled video. The continuous angle and velocity states are augmented with sinusoidal features, and all states are projected through an MLP to a hidden dimension of $r_{\text{dim}}=256$ . Then, we compress the projected features along the temporal dimension with a 2-layer 1d convolutional network to match the compression of Wan-VAE for the video frames, mapping the state features to shape $((1+L//4),r_{\text{dim}})$ . Finally, we fed the compressed feature into an MLP layer to get the modulation used by adaLN-Zero layers. The obtained robot modulation is added to the modulation of the flow matching timestep. The robot modulation acts differently on latent since the timestep embedding is the same for the whole latent, while for the states, they will modulate the latent slice associated with the corresponding frames. <details> <summary>x1.png Details</summary> ![1cdfbced](/v1/image/1cdfbced7c080b2ecafadb9a05d2fb69396929ffe4d9d8f77b5f0e4cc81896e8) ### Visual Description ## Diagram: States-Conditioned W2.2 DiT Block Architecture ### Overview The diagram illustrates a neural network architecture labeled as a "States-conditioned W2.2 DiT Block," repeated 30 times (x30). It integrates latent inputs, time embeddings, and robot state embeddings through a series of attention mechanisms, normalization layers, and multi-layer perceptrons (MLPs). The flow is structured with explicit parameterization (γ, β, α) and conditional processing. --- ### Components/Axes 1. **Inputs**: - **Latent**: Gray rectangle on the left, serving as the primary input. - **Time Embedding**: Orange rectangle labeled "Flow-Matching Timestep t." - **States Embedding**: Blue rectangle labeled "Robot States s." 2. **Processing Layers**: - **Layer Norm**: Green rectangles (two instances). - **Scale/Shift**: Light blue rectangles (four instances, labeled with α₁, α₂, γ₁, β₁, γ₂, β₂). - **Self Attention**: Pink rectangle. - **Cross Attention**: Pink rectangle. - **MLP**: Yellow rectangle. 3. **Output**: - Final output arrow labeled "States-conditioned W2.2 DiT Block x30." 4. **Legend**: - Colors map components to their types (e.g., pink = attention, yellow = MLP). --- ### Detailed Analysis 1. **Flow Path**: - **Latent** → **Layer Norm** (γ₁, β₁) → **Scale** (α₁) → **Self Attention** → **Scale** (α₁) → **Layer Norm** (γ₂, β₂) → **Cross Attention** → **Layer Norm** (γ₂, β₂) → **Scale** (α₂) → **MLP** → **Scale** (α₂) → Output. - **Time Embedding** and **States Embedding** feed into the block, likely conditioning the attention and MLP layers. 2. **Parameterization**: - Scaling factors (α₁, α₂) and normalization parameters (γ₁, β₁, γ₂, β₂) are explicitly defined, suggesting trainable or fixed hyperparameters. 3. **Repetition**: - The block is repeated 30 times (x30), indicating a transformer-like architecture with stacked layers. --- ### Key Observations 1. **Modular Design**: - The block combines attention mechanisms (self and cross) with MLP layers, typical of diffusion models or transformers. - Normalization (Layer Norm) and scaling (Scale/Shift) are interspersed to stabilize gradients. 2. **Conditioning**: - **Time Embedding** and **States Embedding** are integrated to condition the model on temporal and robotic state information, critical for tasks like motion planning or control. 3. **Repetition**: - The x30 repetition suggests depth, enabling hierarchical feature extraction. --- ### Interpretation This architecture is likely part of a diffusion model or transformer-based system for robotics or time-series prediction. The **self-attention** and **cross-attention** layers enable the model to capture temporal dependencies and integrate external state information. The **MLP** processes high-level features, while **Layer Norm** and **Scale/Shift** ensure stable training. The explicit parameterization (γ, β, α) allows fine-grained control over layer behavior. The repetition (x30) implies a deep network capable of modeling complex dynamics, with conditioning on time and robot states enabling adaptability to specific tasks. </details> Figure 2: State conditioning of DiT-Block. Wan2.2 TI2V-5B DiT architecture was updated to enable state conditioning using adaLN-Zero [30] and combining it with the timestep of the Flow Matching scheduler [36]. ### 2.2 Training Models were trained for 23k steps with AdamW [25] with a constant learning rate of $4\cdot 10^{-4}$ . We applied LoRA [22] fine-tuning with rank 32 on the Wan 2.2 DiT backbone. We experimented with and without classifier-free guidance (CFG) [20] during training but observed little improvement in PSNR performance (see Sec. 2.4). Training was conducted on a DataCrunch instant cluster equipped with 4 nodes, each with $8\times$ NVIDIA B200 GPUs. We used a total effective batch size of 1024. The B200 VRAM capacity of 184GB allows for more efficient training of memory-hungry video generation models. ### 2.3 Inference Since the challenge does not restrict inference compute time, we experimented with different approaches for our submissions. In our initial attempts, we followed [24] post-processing pipeline, applying Gaussian blur and performing histogram matching on the predicted frames. This post-processing improved the PSNR score by $1.2$ dB, as reported in [24]. Because PSNR heavily penalizes outlier deviations from the target image, sharper images with slight errors are typically scored worse than blurrier images with comparable errors. We found that exploiting predictive uncertainty with an ensemble of predictions outperformed Gaussian blurring. This produces blurring mainly in regions of high motion, such as the humanoid’s arms (see Tab. 2). Increasing the number of ensemble samples improved PSNR on both the validation set and the public leaderboard, with different performance found from tuning the number of inference steps and the classifier-free guidance weight, as shown in Tab. 2. Table 2: Sampling results on validation and test sets. † The results on test set were obtained after the deadline. ∗ This model has been trained on the whole train + validation raw dataset. | Num. Inf. Samples 20 | Num. Samples 1 | CFG Scale – | Val. PSNR $[\uparrow]$ 22.63 | Test PSNR $[\uparrow]$ 21.05 | Val. SSIM $[\uparrow]$ 0.707 | Val. LPIPS $[\downarrow]$ 0.137 | Val. FID $[\downarrow]$ 40.23 | | --- | --- | --- | --- | --- | --- | --- | --- | | 5 | – | 24.52 | 22.11 | 0.750 | 0.165 | 71.46 | | | 20 | – | 24.88 | 22.42 | 0.762 | 0.201 | 90.71 | | | 1st sub. ∗ | 20 | – | 26.62 | 23.00 | 0.836 | 0.082 | 31.70 | | 20 | 20 | 2.0 | 24.20 | 22.26 | 0.734 | 0.164 | 71.83 | | 2nd sub. | 20 | 1.5 | 24.59 | 22.53 | 0.746 | 0.169 | 74.10 | | 100 | 20 | 1.5 | 25.07 | 22.55 † | 0.762 | 0.148 | 65.76 | | 20 | 1.0 | 25.53 | 23.04 † | 0.773 | 0.158 | 69.25 | | ### 2.4 Results Tab. 2 reports the quantitative results of our model on the validation set using the PSNR metric. We further extend the evaluation by reporting Structural Similarity Index Measure (SSIM) [37], Learned Perceptual Image Patch Similarity (LPIPS) [40], and Fréchet Inception Distance (FID) [19], all computed on our model’s predictions over the validation set. The table is divided into three blocks. The first block contains models trained without classifier-free guidance (CFG) [20]. We ablate over the number of averaged samples used for final predictions, ranging from 1 to 20. Increasing the number of samples has a smoothing effect that improves Val. PSNR scores but degrades visual quality, as reflected in the other metrics. The bottom row of this block contains a model that is additionally trained on the validation dataset. This makes the values reported on the validation dataset not comparable with the rest of the entries in the table. However, the result on the public leaderboard showed a $+0.58$ dB increase on PSNR. The second and third blocks present models trained with CFG applied during training. Earlier experiments on the validation data showed that raising the cfg_scale beyond a certain point did not improve PSNR scores. Nevertheless, we retained the run with cfg_scale as our second-best competition submission. For completeness, we also report results obtained by increasing the number of sampling steps using the same checkpoint. These results show consistent improvements over the previous CFG-based predictions. ## 3 Compression Challenge <details> <summary>x2.png Details</summary> ![888a3a2d](/v1/image/888a3a2d9ea8ab5b9e57dfc4d9f083bffe7d79ce480f3ad861642d8bcbcbf0c1) ### Visual Description ## Neural Network Architecture Diagram: Robot State Processing System ### Overview The diagram illustrates a multi-layered neural network architecture designed for processing robot state data. It features sequential processing of input tokens through embedding, spatial and temporal attention mechanisms, an ST block with an MLP, and linear output layers. The system emphasizes spatial-temporal feature integration for robotics applications. ### Components/Axes 1. **Input Section**: - **Input Tokens**: 3D tensor structure (W x H x T) representing spatial-temporal data - **Embed Layer**: Converts input tokens into dense vector representations 2. **Processing Layers**: - **Spatial Attention**: Processes W x H dimensions with attention weights - **Temporal Attention**: Processes T dimension with attention weights - **Layer Norm**: Normalization applied after each attention mechanism - **ST Block**: Contains MLP for spatial-temporal feature integration - **Linear Layer**: Final transformation to output tokens 3. **Output Section**: - **Output Tokens**: Processed data after L layers of transformation ### Detailed Analysis - **Input Dimensions**: - Width (W) and Height (H) represent spatial dimensions - Time steps (T) represent temporal dimension - Input tokens structured as W x H x T 3D tensor - **Attention Mechanisms**: - Spatial Attention: 3x3 grid visualization with attention weights - Temporal Attention: 3x3 grid visualization with attention weights - Both attention mechanisms use color-coded weights (blue, purple, pink for spatial; red, orange, green, yellow for temporal) - **Layer Normalization**: - Applied after each attention mechanism - Green rectangles indicate normalization operations - **ST Block**: - Contains MLP (orange rectangle) for feature integration - Followed by layer normalization - **Output Transformation**: - Linear layer (white rectangle) maps processed features to output tokens ### Key Observations 1. **Spatial-Temporal Integration**: - Separate attention mechanisms for spatial (W x H) and temporal (T) dimensions - Combined processing in ST block suggests hierarchical feature extraction 2. **Normalization Strategy**: - Layer normalization after each attention mechanism indicates focus on stable training dynamics 3. **Architecture Depth**: - "L Layers" notation suggests configurable depth for the network 4. **Output Structure**: - Final linear layer implies direct mapping to desired output space ### Interpretation This architecture demonstrates a sophisticated approach to robot state processing by: 1. **Multi-modal Attention**: Separating spatial and temporal attention allows specialized processing of different data dimensions 2. **Feature Integration**: The ST block's MLP combines attended features for higher-level representation 3. **Normalization**: Layer normalization after each attention mechanism helps manage gradient flow in deep networks 4. **Scalability**: The "L Layers" notation suggests the architecture can be deepened for complex tasks The design appears optimized for robotics applications requiring understanding of both spatial relationships (e.g., object positions) and temporal dynamics (e.g., movement sequences). The attention mechanisms enable the model to focus on relevant spatial regions and time steps, while the ST block facilitates cross-modal feature integration crucial for tasks like navigation or action prediction. </details> (a) Illustration of our ST-Transformer architecture for the compression challenge Given three grids of past video tokens of shape $3\times 32\times 32$ , as well as the robot state of shape $64\times 25$ as context, the transformer predicts the future three grids of shape $3\times 32\times 32$ . The ST-Transformer consists of $L$ layers of spatio-temporal blocks, each containing per time step spatial attention over the $H\times W$ tokens at time step $t$ , followed by causal temporal attention of the same spatial coordinate across time, and then a feed-forward network. Each colour in the spatial and temporal attention represents a single self-attention map. <details> <summary>x3.png Details</summary> ![2b4d560c](/v1/image/2b4d560caec3cba73c8999d1d1e8390764d77fee77dec6bb8047443dbdef6db7) ### Visual Description ## Line Graph: Cross Entropy Loss vs Training Steps ### Overview The image depicts a line graph tracking cross entropy loss across three data series during model training. The x-axis represents training steps (0k to 80k), while the y-axis shows cross entropy loss values (4 to 8). Three distinct lines represent different training/validation scenarios, with notable divergence in their trajectories. ### Components/Axes - **X-axis (Step)**: Labeled "Step" with increments of 20k (0k, 20k, 40k, 60k, 80k) - **Y-axis (Cross Entropy Loss)**: Labeled "Cross Entropy Loss" with increments of 1 (4 to 8) - **Legend**: Positioned on the right side, containing: - Blue line: "Train (Teacher Forced)" - Orange line: "Val (Teacher Forced)" - Green line: "Val (Autoregressive)" ### Detailed Analysis 1. **Train (Teacher Forced) [Blue Line]** - Starts at ~8.0 at 0k steps - Sharp decline to ~5.5 by 20k steps - Gradual stabilization between ~5.2-5.4 from 40k-80k steps - Final value ~4.9 at 80k steps 2. **Val (Teacher Forced) [Orange Line]** - Begins at ~8.0 at 0k steps - Steady decline to ~4.5 by 80k steps - Minimal fluctuations after 40k steps - Final value ~4.3 at 80k steps 3. **Val (Autoregressive) [Green Line]** - Initial value ~7.5 at 0k steps - Gradual decline to ~5.0 by 80k steps - Consistent ~0.2-0.3 point gap above orange line throughout - Final value ~4.8 at 80k steps ### Key Observations - All three lines show decreasing trends, indicating improving model performance over time - Training loss (blue) decreases faster initially than validation losses - Validation losses (orange/green) maintain higher values than training loss, suggesting potential overfitting - Autoregressive validation (green) consistently shows higher loss than teacher-forced validation (orange) - All lines plateau between 4.3-5.5 loss by 80k steps ### Interpretation The graph demonstrates typical training dynamics where the model rapidly reduces training loss (blue line) while validation losses (orange/green) decrease more gradually. The persistent gap between training and validation losses suggests possible overfitting to the training data. Notably, the autoregressive validation (green) maintains higher loss than teacher-forced validation (orange), indicating that the autoregressive approach may be less effective or that the model struggles more with autoregressive sequences. The convergence of all lines toward lower loss values after 60k steps suggests that extended training improves generalization across both training paradigms. </details> (b) Training curves for compression challenge At train time, we use teacher forcing (blue). We then evaluate on the validation set using unrealistic teacher forcing (orange), as well as with the greedy autoregressive generation that will be used at inference time (green). Figure 3: Overall figure showing (a) the ST-Transformer world model architecture and (b) its training curves in the compression challenge. Unlike the Sampling Challenge, which measures prediction directly in pixel space, the Compression Challenge evaluates models in a discrete latent space. Each video sequence is first compressed into a grid of discrete tokens using the Cosmos $8\times 8\times 8$ tokeniser [28], producing a compact sequence that can be modelled with sequence architectures. #### Problem Statement Given a context of $H=3$ grids of $32\times 32$ tokens and robot states for both past and future timesteps, the task is to predict the next $M=3$ grids of $32\times 32$ tokens: $$ \displaystyle\hat{\mathbf{z}}_{H:H+M-1} \displaystyle\sim f_{\theta}({\bm{z}}_{0:H-1},{\bm{s}}_{0:63}) \tag{1} $$ where $\hat{{\bm{z}}}_{H:H+M-1}$ are the predicted token grids for the future frames. The tokenized training dataset $\mathcal{D}$ contains approximately $306{,}000$ samples. Each sample consists of: - Tokenised video: $6$ consecutive token grids (3 past, 3 future), each of size $32\times 32$ , giving $6144$ tokens per sample and $\sim 1.88$ B tokens overall. - Robot state: a sequence ${\bm{s}}\in\mathbb{R}^{64\times 25}$ aligned with the corresponding raw video frames. A block of three $32\times 32$ token grids corresponds to 17 RGB frames at $256\times 256$ resolution, so predictions in token space remain aligned with the original video. Performance is evaluated using top-500 cross-entropy, which considers only the top-500 logits per token. ### 3.1 Model #### Spatio-temporal Transformer Following Genie [6], our world model builds on the Vision Transformer (ViT) [9, 35]. An overview is shown in Fig. 3. To reduce the quadratic memory cost of standard Transformers, we use a spatio-temporal (ST) Transformer [38], which alternates spatial and temporal attention blocks followed by feed-forward layers. Spatial attention attends over $1\times 32\times 32$ tokens per frame, while temporal attention (with a causal mask) attends across $T\times 1\times 1$ tokens over time. This design makes spatial attention, the main bottleneck, scale linearly with the number of frames, improving efficiency for video generation. We apply pre-LayerNorm [2] and QKNorm [18] for stability. Positional information is added via learnable absolute embeddings for both spatial and temporal tokens. Our transformer used 24 layers, 8 heads, an embedding dimension of $512$ , a sequence length of $T=5$ , and dropout of $0.1$ on all attention, MLPs, and residual connections. #### State Conditioning Robot states are encoded as additive embeddings following Bruce et al. [6]. The state vector is projected with an MLP, processed by a 1D convolution (kernel size 3, padding 1), and enriched with absolute position embeddings before being combined with video tokens. ### 3.2 Training We implemented our model in PyTorch [29] and trained it using the fused AdamW optimiser [25] with $\beta_{1}=0.9$ and $\beta_{2}=0.95$ for $80$ epochs. Weight decay of $0.05$ was applied only to parameter matrices, while biases, normalisation parameters, gains, and positional embeddings were excluded. Following GPT-2 [32] and Press and Wolf [31], Bertolotti and Cazzola [5], we tied the input and output embeddings. This reduces the memory footprint by removing one of the two largest weight matrices and typically improves both training speed and final performance. #### Training Objective The model was trained to minimise the cross-entropy loss between predicted and ground-truth tokens at future time steps: | | $\displaystyle\min_{\theta}\,\mathbb{E}_{({\bm{z}}_{t},{\bm{s}}_{t})_{t=0:K+M-1}\sim\mathcal{D},\hat{{\bm{z}}}_{t}\sim f_{\theta}(\cdot)}\left[\sum_{t=K}^{K+M-1}\text{CE}\left(\hat{{\bm{z}}}_{t},{\bm{z}}_{t}\right)\right],$ | | | --- | --- | --- | where $\hat{{\bm{z}}}_{t}$ is the model output at time $t$ , CE denotes the cross-entropy loss over all tokens in the grid, and $\mathcal{D}$ is the dataset of tokenised video and state sequences. Training used teacher forcing to allow parallel computation across timesteps, with a linear learning rate schedule from peak $8\times 10^{-4}$ to $0$ after a warmup of $2000$ steps. #### Implementation Training used automatic mixed precision (AMP) with bfloat16, but inference used float32 due to degraded performance in bfloat16. Linear layer biases were zero-initialised, and weights (including embeddings) were drawn from $\mathcal{N}(0,0.02)$ . We trained with an effective batch size of 160 on the same B200 DataCrunch instant cluster as in the sampling challenge. ### 3.3 Inference Our autoregressive model generates sequences via | | $\displaystyle p({\bm{z}}_{H:H+M-1}\mid{\bm{z}}_{0:H-1},{\bm{s}}_{0:63})=\prod_{t=H}^{H+M-1}f_{\theta}({\bm{z}}_{t}\mid{\bm{z}}_{<t},{\bm{s}}_{0:63}),$ | | | --- | --- | --- | where each step outputs a categorical distribution over each spatial token. Sampling draws ${\bm{z}}_{t}\sim f_{\theta}(\cdot)$ , introducing diversity but typically yields lower-probability trajectories and higher loss. Greedy decoding instead selects $$ {\bm{z}}_{t}=\arg\max_{{\bm{z}}}f_{\theta}({\bm{z}}\mid{\bm{z}}_{<t},{\bm{s}}_{0:63}), $$ producing deterministic, high-probability sequences that we found both effective and efficient. ### 3.4 Results Fig. 3(b) shows the training curves for our ST-Transformer. The blue curve corresponds to the training loss under teacher-forced training. While the teacher-forced validation loss is optimistic – since it conditions on ground-truth inputs – it can be interpreted as a lower bound on the achievable loss, representing the performance of an idealised autoregressive model with perfect inference. To reduce the gap between teacher-forced and autoregressive performance, we experimented with scheduled sampling [4, 27]. However, this did not lead to meaningful improvements. ## 4 Conclusion In this report, we presented two complementary approaches that achieved strong performance across both 1X World Model Challenges. First, we showed how internet-scale data can be leveraged by fine-tuning a pre-trained image–text-to-video foundation model. Using multi-node training on the DataCrunch instant cluster, we reached first place on the leaderboard in only 36 hours—an order of magnitude faster than the runner-up, who required about a month. To further improve inference, we averaged over samples to selectively blur regions of high predictive uncertainty. While this proved effective for optimising PSNR, the most suitable inference strategy for downstream decision-making remains an open question. Second, we demonstrated how a spatio-temporal transformer world model can be trained on the tokenised dataset in under 17 hours. We found that greedy autoregressive inference offered a practical balance of speed and accuracy. Despite its simplicity, the model achieved substantially lower loss values than other leaderboard entries. ## References - Alonso et al. [2024] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Curran Associates, Inc., 2024. - Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv preprint arXiv:1607.06450, 2016. - Bar et al. [2024] Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation World Models. arXiv preprint arXiv:2412.03572, 2024. - Bengio et al. [2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2015. - Bertolotti and Cazzola [2024] Francesco Bertolotti and Walter Cazzola. By Tying Embeddings You Are Assuming the Distributional Hypothesis. In Proceedings of the 41st International Conference on Machine Learning, pages 3584–3610, 2024. - Bruce et al. [2024] Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, de Nando Freitas, Satinder Singh, and Tim Rocktäschel. Genie: Generative Interactive Environments. arXiv preprint arXiv:2402.15391, 2024. - Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. In Advances in Neural Information Processing Systems, 2018. - Decart et al. [2024] Etched Decart, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A Universe in a Transformer. 2024. - Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2020. - Guo et al. [2025] Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft. arXiv preprint arXiv:2504.08388, 2025. - Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018. - HaCohen et al. [2024] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime Video Latent Diffusion. arXiv preprint arXiv:2501.00103, 2024. - Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning, 2019. - Hafner et al. [2022] Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with Discrete World Models. In International Conference on Learning Representations, 2022. - Hafner et al. [2025] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Control Tasks through World Models. Nature, 640(8059):647–653, 2025. - Hansen et al. [2023] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, Robust World Models for Continuous Control. In The Twelfth International Conference on Learning Representations, 2023. - Hansen et al. [2022] Nicklas A. Hansen, Hao Su, and Xiaolong Wang. Temporal Difference Learning for Model Predictive Control. In Proceedings of the 39th International Conference on Machine Learning, 2022. - [18] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-Key Normalization for Transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253. Association for Computational Linguistics. - Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017. - Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. - Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. arXiv preprint arXiv:2205.15868, 2022. - Hu et al. [2022] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022. - Kong et al. [2024] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv preprint arXiv:2412.03603, 2024. - Liu et al. [2025] Peter Liu, Annabelle Chu, and Yiran Chen. Effective World Modeling for Humanoid Robots: Long-Horizon Prediction and Efficient State Compression. Technical Report Team Duke, Duke University, 2025. 1X World Model Challenge, CVPR 2025. - Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019. - Micheli et al. [2022] Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are Sample-Efficient World Models. In The Eleventh International Conference on Learning Representations, 2022. - Mihaylova and Martins [2019] Tsvetomila Mihaylova and André F. T. Martins. Scheduled Sampling for Transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 351–356. Association for Computational Linguistics, 2019. - NVIDIA et al. [2025] NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, and Artur Zolkowski. Cosmos World Foundation Model Platform for Physical AI. arxiv preprint arXiv:2501.03575, 2025. - [29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems. Curran Associates, Inc. - Peebles and Xie [2023] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. - [31] Ofir Press and Lior Wolf. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163. Association for Computational Linguistics. - Radford et al. [2019] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. 2019. - Robine et al. [2022] Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based World Models Are Happy With 100k Interactions. In The Eleventh International Conference on Learning Representations, 2022. - Scannell et al. [2025] Aidan Scannell, Mohammadreza Nakhaei, Kalle Kujanpää, Yi Zhao, Kevin Luck, Arno Solin, and Joni Pajarinen. Discrete Codebook World Models for Continuous Control. In The Thirteenth International Conference on Learning Representations, 2025. - Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017. - Wang et al. [2025] Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Xiaofeng Meng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, and Ziyu Liu. Wan: Open and Advanced Large-Scale Video Generative Models. CoRR, abs/2503.20314, 2025. - Wang et al. [2003] Z Wang, EP Simoncelli, and AC Bovik. Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, pages 1398–1402. IEEE, 2003. - Xu et al. [2021] Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-Temporal Transformer Networks for Traffic Flow Forecasting. arXiv preprint arXiv:2001.02908, 2021. - Yang et al. [2024] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. arXiv preprint arXiv:2408.06072, 2024. - Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595. IEEE Computer Society, 2018. - Zhang et al. [2023] Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 27147–27166. Curran Associates, Inc., 2023.

Rendering Paper...