# Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
**Authors**: Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long
¹Tsinghua University  ²ByteDance Seed  \*Work done at ByteDance Seed  †Corresponding authors
(January 27, 2026)
## Abstract
Humans construct internal models of the world and reason by manipulating the concepts within these models. Recent advances in artificial intelligence (AI), particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Current systems have achieved expert-level performance in formal and abstract domains such as mathematics and programming, relying predominantly on verbal reasoning as their primary information-processing pathway. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though a clear consensus on their benefits has not yet been reached. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks, particularly those grounded in the physical world, visual generation more naturally serves as a world model, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of deliberate CoT reasoning and analyze distinctions among different forms of world models in terms of both informativeness and prior knowledge. Empirically, we identify and design tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Through controlled experiments on a state-of-the-art UMM, we show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling.
Conversely, it offers no clear advantage for tasks that do not require explicit visual modeling. Together, these insights and findings clarify the applicability and potential of multimodal world modeling and reasoning for more powerful, human-like multimodal AI. We publicly release our evaluation suite to facilitate further research.
**Project Lead**: Jialong Wu · **Corresponding**: Mingsheng Long, Xiaoying Zhang · **Project Page**: https://thuml.github.io/Reasoning-Visual-World
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: World Models in Human Minds & Multimodal AI
### Overview
This diagram illustrates the concept of world models, comparing how humans and AI reason about the world. It's divided into sections: World Models, Reasoning with Verbal World Modeling in Multimodal AI, and Reasoning with Visual World Modeling in Multimodal AI. The diagram uses a combination of illustrations, text blocks, and flowcharts to explain these concepts.
### Components/Axes
The diagram is structured into three main horizontal sections. Each section contains multiple sub-sections with descriptive titles. There are no explicit axes in the traditional sense, but the diagram flows from left to right, representing a progression of thought or action. Key components include:
* **World Model:** Mental Model of the World, Dual Representations of World Knowledge
* **Reasoning with Verbal World Modeling:** Mathematical Reasoning, Travel Planning, Everyday Activity Planning, Puzzle Solving
* **Reasoning with Visual World Modeling:** Real-World Spatial Reasoning
### Detailed Analysis or Content Details
**1. World Models: Mental Model of the World**
* Illustration: A human head with interconnected nodes representing thoughts and concepts.
* Label: "Approximate" connecting the head to an arrow pointing towards feedback.
* Label: "Feedback"
**2. World Models: Dual Representations of World Knowledge**
* Left side: "Verbal/Symbolic Knowledge" with the equation "y = ax² + bx + c" and "F = ma"
* Right side: "Visual/Imagery Knowledge" with a person disliking something on the left and preferring something on the right.
* Labels: "Dislike in Daily Life", "Prefer in Daily Life"
**3. Reasoning with Verbal World Modeling in Multimodal AI**
* **Mathematical Reasoning:**
* Question: "If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to"
* Response: "To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both..."
* **Travel Planning:**
* Task: "Plan a trip with a budget of $1,700."
* `<think>`: "INITIAL Budget: $1,700. Spent: $0."
* Day 1: "transportation: PENDING... Available: Flight F3573659. $474"
* ACTION: "Plan day 1 transportation. Select Flight F3573659."
* NEXT STATE: "After this action, you've spent $474, leaving $1226 from your $1700 budget..."
* **Everyday Activity Planning:**
* Goal: "Cooking tomato and eggs"
* `<think>`: "ACTION: Cook the eggs in the pan. STATE: This action changes the state of the eggs from liquid to a partially cooked state. The eggs are now partially cooked and are being transformed into curds. The heat from the pan cooks the eggs, and the stirring action ensures even cooking..."
* **Puzzle Solving:**
* `<think>`: "From [452, 59], the only available move is West to [359, 59]..."
* Illustration: A grid-like puzzle.
* Labels: "State/Observation", "Action"
**4. Reasoning with Visual World Modeling in Multimodal AI**
* **Real-World Spatial Reasoning:**
* Question: "Relative to where I took the photo in Figure 1, where is the refrigerator located?"
* `<think>`: "First, let's analyze the images. [...] It's not visible in that initial view, so I need to change my perspective..."
* Illustration: A series of images showing a room and a refrigerator.
* Label: "The refrigerator is not visible in this 90-degree left turn view. [...] To be thorough, I will also check the view to the right."
* `<think>`: "My initial turn was 90 degrees left, but the refrigerator isn't at that exact angle. Let's try a smaller turn to the left."
### Key Observations
* The diagram emphasizes the interplay between verbal and visual reasoning.
* The "think" blocks represent the internal reasoning process of an AI agent.
* The travel planning example demonstrates a step-by-step action and budget tracking.
* The spatial reasoning example highlights the importance of perspective and visual analysis.
* The diagram uses a consistent visual style with rounded rectangles for thought bubbles and arrows to indicate flow.
### Interpretation
The diagram presents a framework for understanding how AI can model the world, drawing parallels to human cognitive processes. It suggests that effective AI requires both verbal (symbolic) and visual reasoning capabilities. The examples provided – mathematical reasoning, travel planning, cooking, puzzle solving, and spatial reasoning – illustrate the diversity of tasks that benefit from this multimodal approach. The "think" blocks are crucial, as they represent the AI's internal state and decision-making process. The diagram implies that AI agents need to be able to observe, plan, act, and reflect on their actions to effectively navigate and interact with the world. The spatial reasoning example is particularly insightful, demonstrating how AI can overcome limitations in initial perception by actively seeking different viewpoints. The diagram is a high-level conceptual overview rather than a detailed technical specification, aiming to convey the core principles of multimodal AI and world modeling.
</details>
Figure 1: Overview of a world-model perspective on multimodal reasoning. (a) Humans construct mental models of the world, representing information and knowledge through two complementary channels–verbal and visual–to support reasoning, planning, and decision-making. (b) Recent advances in large language models (LLMs) and vision language models (VLMs) largely rely on verbal chain-of-thought reasoning, leveraging primarily verbal and symbolic world knowledge. (c) Unified multimodal models (UMMs) open a new paradigm by using visual generation for visual world modeling, advancing more human-like reasoning on tasks grounded in the physical world. Examples of reasoning with verbal world modeling are adapted from Guo et al. [18], Du et al. [14], Chen et al. [9], Zhang et al. [72].
Humans construct internal mental models of the external world that represent objects and concepts, along with their relationships, structures, and operational mechanisms [11, 16]. These models support reasoning and decision-making by enabling mental simulation, allowing individuals to anticipate the outcome of actions without actually taking them [40]. For example, if a glass of water is spilled on the table, people can rapidly mentally simulate the ensuing events: the water falling downward, spreading across the surface, and potentially dripping onto the floor. Such predictions lead them to quickly move valuable items away or reach for a towel. Beyond physical systems, mental models also extend to any domain where relational structures can be simulated, such as mathematics and logic [31, 32], making them fundamental to how humans understand and interact with all aspects of the world.
Cross-disciplinary researchers in philosophy, psychology, cognitive science, and related fields have a long history of developing computational models of human mental models [44]. Among them, artificial intelligence (AI) shares a core ambition of building machines that reason like people. Although debates remain, recent breakthroughs, especially in large language models (LLMs) and chain-of-thought (CoT) reasoning, have made a substantial step towards approximating human reasoning grounded in mental models of the world, often referred to as world models [24, 34] in the AI literature. During chain-of-thought reasoning, LLMs explore, reflect, and backtrack within the structured solution space, guided by world knowledge acquired through large-scale pre-training. These capabilities have already driven progress in diverse domains, including programming [18], mathematics [57, 18], scientific discovery [53], clinical medicine [58], and robotics [42].
Such reasoning capabilities have also been extended to multimodal AI systems, particularly vision language models (VLMs) [28, 6, 19, 70]. These systems typically incorporate visual inputs by aligning visual representations with the embedding space of LLMs, resulting in reasoning that remains primarily constrained to a linguistic space. In contrast, human mental models operate over multiple forms of mental representations. Dual-coding theory [45] suggests that the mind processes information through two complementary codes: verbal and imagery (particularly visual) representations. These pathways can function independently but often collaborate to support reasoning. Indeed, visual imagery has been shown to have advantages over words in memory encoding and retrieval [33], and individuals with aphantasia, who lack the ability to visualize mental imagery, exhibit worse performance on tasks such as visual search [43]. This evidence from psychology and cognitive science therefore suggests that the absence of a dedicated visual information pathway may explain why current multimodal AI systems excel in formal and abstract domains dominated by verbal world knowledge, yet continue to fall far short of human performance on tasks involving physical and spatial reasoning [49, 8], which fundamentally depend on visual world modeling.
Next-generation multimodal AI systems are evolving to be built upon unified multimodal models (UMMs) [54, 63, 62, 13], which seamlessly integrate both verbal and visual generation capabilities. The newly introduced visual generation component offers the potential to explicitly realize visual world modeling, a critical element of multimodal world models in human-like reasoning that current systems largely lack. This naturally makes us ponder: Can current UMMs truly leverage their visual generation capability to enhance reasoning and thereby narrow the performance gap between multimodal AI and humans? A growing body of preliminary research [36, 77, 38, 76, 17] has begun exploring this question from different perspectives. However, the findings so far remain inconclusive. Reported empirical results are mixed, showing no consistent trends that visual generation reliably improves reasoning performance. Moreover, the evaluation tasks used in current studies are designed heuristically, lacking a principled basis for understanding when and how visual generation can meaningfully contribute to multimodal reasoning.
In this paper, we present the first principled study of when and how visual generation benefits reasoning from a world-model perspective (see Figure 1), making both theoretical and empirical contributions.
Theoretically, we rigorously bridge the concepts of world models and reasoning. (1) World model formulations: We formulate multimodal world models to approximate the underlying multi-observable Markov decision processes (MOMDP) of tasks, and define two fundamental capabilities of world models, namely world reconstruction and world simulation. (2) World model-based reasoning: To realize world models for reasoning, we introduce three reasoning formulations. Two rely solely on verbal CoTs through implicit or verbal world modeling, while the third interleaves verbal and visual CoTs that explicitly incorporate visual generation as a form of visual world modeling. (3) The visual superiority hypothesis: Under this framework, we analyze the distinctions among different world models, highlighting the richer informativeness and complementary prior knowledge afforded by visual world modeling. These insights motivate our central hypothesis that visual world modeling is superior for certain tasks, particularly those grounded in the physical world.
Empirically, we validate these insights through a series of controlled experiments. (4) The VisWorld-Eval suite: We identify and design tasks that specifically isolate and demand each atomic world model capability, forming a new evaluation suite to facilitate future research. This suite, VisWorld-Eval, collects seven tasks spanning both synthetic and real-world domains. (5) Empirical evaluation: Experiments with a state-of-the-art UMM [13] on VisWorld-Eval reveal findings consistent with our insights and theoretical analysis. In tasks where verbal world modeling suffers from representational bottlenecks or insufficient prior knowledge, interleaved CoT delivers substantial performance improvements. By contrast, it offers no clear advantages in tasks such as mazes and Sokoban, whose simple states do not require explicit visual world modeling. We further conduct dedicated analyses, including evidence revealing emergent implicit world modeling in the maze task.
We hope this work provides early evidence for the central role of multimodal world models in general-purpose AI, in which complementary verbal and visual knowledge emerge from generative modeling across modalities, with the latter being especially valuable for bringing human-like intelligence into the physical world.
## 2 Related Work
World models. The field of world models is rapidly evolving, yet remains far from reaching consensus on definitions or methodologies. Although psychology and cognitive science suggest that human mental models rely on compact representations that discard irrelevant details, how to scale approaches capable of learning such abstract representations [48, 26, 34] to arbitrary domains and modalities is still unclear. Consequently, most current techniques preserve complete information of observations, either through reconstructable latent representations [24, 25] or directly at the level of raw data. Prominent examples include modern video generation world models [12, 1, 2, 64], which capture concrete pixel-level dynamics. In contrast, language inherently provides a higher level of abstraction, making it more similar to human mental representations [60, 65, 59, 71, 9]. This motivates the promise of unified multimodal models that generate both language and visuals as a new direction for building more human-like world models.
Unified multimodal models. Multimodal understanding [28, 6, 19] and visual generation [47, 50] have long developed in isolation. Recently, there has been growing interest in integrating these two capabilities into a single unified model. This can be straightforwardly achieved by forwarding the representations of vision language models to an external visual generation module [56, 46]. A more unified approach is to model both language and visual modalities within a single backbone. While language is predominantly modeled through autoregressive next-token prediction, the design space of visual modalities spans a wide spectrum, from discrete tokenization with autoregressive [62, 54, 63] or masked modeling [66, 22], to continuous tokenization with diffusion or flow-based modeling [75, 41, 13]. Among these efforts, BAGEL [13] is one of the most widely adopted open-source models achieving state-of-the-art performance. Despite substantial progress in building unified multimodal models (UMMs), existing evaluations still primarily assess their understanding and generation capabilities separately. One widely recognized advantage of UMMs lies in leveraging reasoning abilities to handle complex instructions, thereby enhancing visual generation or editing [74, 21]. Yet when and how visual generation, in turn, enhances reasoning remains insufficiently explored, lacking solid empirical evidence and community consensus.
Benchmarking visual generation for reasoning. This paper contributes to a growing line of research on visual generation for reasoning. RealUnify [52] and Uni-MMMU [77] design tasks in which generation is expected to enhance reasoning, but report mixed results without revealing clear trends regarding the benefits of visual generation. ROVER [38] reveals fundamental limitations of current models in generating meaningful visual reasoning steps, often resulting in minimal or even negative gains in final accuracy. In contrast, MIRA [76] conducts a sanity test by providing manually annotated visual cues, thereby bypassing the evaluation of visual world modeling capability. While the aforementioned works evaluate zero-shot performance, ThinkMorph [17] fine-tunes UMMs to reveal emergent reasoning behaviors but restricts each CoT to a single intermediate image, thereby not fully exploiting the potential of interleaved CoT. Our work distinguishes itself through a world-model perspective that enables a principled investigation, allowing us to both demonstrate and systematically explain when visual generation yields positive gains and when it does not.
## 3 A World Model Perspective on Multimodal Reasoning
Inspired by the aforementioned connections between human cognition and artificial intelligence, we formalize our world-model perspective on multimodal reasoning (see Figure 2) in this section.
### 3.1 Formulation: Multiple Observations of the World
Without loss of generality, the world of a specific task can be formulated as a multi-observable Markov decision process (MOMDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},p,\Phi,\mathcal{O}_{\phi},e_{\phi})$ , where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $p$ the transition function, $\Phi$ the parameter space of observation functions, $\mathcal{O}_{\phi}$ the observation space, and $e_{\phi}$ the observation function. Each $s\in\mathcal{S}$ represents the underlying state of the world, which is typically hidden and not directly observable. Instead, it can be perceived through different instantiations of observations (hereafter also referred to as views) [27], given by $o=e_{\phi}(s)\in\mathcal{O}_{\phi}$ , parameterized by $\phi\in\Phi$ . As illustrated in Figure 2 a, such views can span multiple modalities—for example, visual observations corresponding to different camera poses, or verbal descriptions expressed with different emphases or styles. When an action $a\in\mathcal{A}$ is applied to the current state, the world transitions according to the dynamics $s^{\prime}\sim p(s^{\prime}|s,a)$ and yields new observations.
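For concreteness, the MOMDP components above can be sketched in a few lines of Python on a trivial one-dimensional world. All names here are hypothetical illustrations, not part of the paper's codebase; the point is only that one hidden state yields distinct verbal and visual views through different observation functions $e_{\phi}$:

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = int   # hidden world state (here: agent position on a 5-cell line)
Action = int  # action (+1 / -1 step)

@dataclass
class MOMDP:
    """Toy multi-observable MDP: one transition function, many views."""
    transition: Callable[[State, Action], State]  # deterministic p here
    observe: Dict[str, Callable[[State], str]]    # {phi: e_phi}

    def step(self, s: State, a: Action) -> State:
        return self.transition(s, a)

# Two "views" of the same hidden state: a verbal description and a
# crude visual rendering of the 1-D world.
world = MOMDP(
    transition=lambda s, a: max(0, min(4, s + a)),
    observe={
        "verbal": lambda s: f"The agent stands at cell {s} of 5.",
        "visual": lambda s: "".join("A" if i == s else "." for i in range(5)),
    },
)

s = 2
print(world.observe["visual"](s))   # ..A..
s2 = world.step(s, +1)
print(world.observe["verbal"](s2))  # The agent stands at cell 3 of 5.
```

The two observation functions deliberately expose the same state with different emphases, mirroring how camera poses or description styles parameterize views in the formulation above.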
### 3.2 Atomic Capabilities of World Models
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: World Model Capabilities & Chain-of-Thought Formulations
### Overview
This diagram illustrates the capabilities of world models, specifically focusing on how they integrate verbal and visual observations, reconstruct and simulate worlds, and formulate chain-of-thought solutions to problems. It presents a framework for understanding how an agent can reason about and interact with its environment. The diagram is divided into three main sections: Multiple Observations of the World, Atomic Capabilities of World Models, and World Model-Based Chain-of-Thought Formulations.
### Components/Axes
The diagram consists of several interconnected components:
* **Multiple Observations of the World:** This section includes "Verbal Observations" (text descriptions) and "Visual Observations" (cube stack images).
* **Multi-Observable Markov Decision Process:** Depicts a state transition system with observations (σ), states (S), actions (α), and next states (S').
* **Atomic Capabilities of World Models:** Divided into "World Reconstruction" and "World Simulation" sections.
* **World Reconstruction:** Takes top, front, and right views of a cube stack as input and attempts to reconstruct the full structure.
* **World Simulation:** Uses a "World Model" to simulate the effects of actions on the reconstructed world.
* **World Model-Based Chain-of-Thought Formulations:** Demonstrates a step-by-step reasoning process to solve a cube stack problem.
* **Person Icon:** Represents the agent performing the task.
* **Cube Stack Images:** Used as visual inputs and outputs throughout the diagram.
* **Text Boxes:** Contain descriptions of the process and problem statements.
### Detailed Analysis or Content Details
**1. Multiple Observations of the World:**
* **Verbal Observations:** "A stack of cubes with an L-shaped front view and an inverted L-shaped right view." and "A stack of cubes positioned at (0,0,0), (1,0,0), (0,1,0), and (0,0,1)."
* **Visual Observations:** Two 3D cube stack arrangements are shown. The first is an L-shape, and the second is a set of four cubes at coordinates (0,0,0), (1,0,0), (0,1,0), and (0,0,1).
**2. Multi-Observable Markov Decision Process:**
* This section shows a sequence of observations (σ1, σ2, σ3…), a state (S), an action (α), and a resulting next state (S'). The ellipsis (…) indicates that this process continues iteratively.
**3. Atomic Capabilities of World Models:**
* **World Reconstruction:**
* Input Views: Top, Front, Right views of a cube stack.
* Intermediate Step: "World Model" is used to reconstruct the cube stack.
* Output: Reconstructed cube stack with coordinates (0,0,0), (1,0,0), (0,1,0), (0,0,1).
* **World Simulation:**
* Input: Reconstructed cube stack.
* Process: "World Model" simulates the effects of actions.
* Output: Simulated cube stack with coordinates (0,0,0), (1,0,0), (0,1,0), (2,0,0).
**4. World Model-Based Chain-of-Thought Formulations:**
* **Problem Statement:** "Given the three views of a cube stack… how can we modify the stack to match the desired back view?"
* **Chain-of-Thought Steps:**
* Step 1: Input Views (Top, Front, Right).
* Step 2: "Reconstruct the full structure" – resulting in a cube stack.
* Step 3: "Imagine the back view" – resulting in a cube stack.
* Step 4: "Try put a new cube" – resulting in a cube stack.
* Step 5: "Wait, retry another choice" – resulting in a cube stack.
* Step 6: "Imagine the back view" – resulting in a cube stack.
* Step 7: "Get the answer: Put at (2,0,0)" – resulting in a cube stack.
### Key Observations
* The diagram emphasizes the iterative nature of world modeling, with reconstruction and simulation steps being repeated.
* The chain-of-thought formulation demonstrates a problem-solving approach that involves hypothesis generation, testing, and refinement.
* The use of both verbal and visual observations highlights the importance of multi-modal input for world modeling.
* The Markov Decision Process section suggests that the agent's actions are based on observations and lead to state transitions.
### Interpretation
The diagram presents a conceptual framework for how an intelligent agent can build and utilize world models to understand and interact with its environment. The agent leverages both verbal and visual information to reconstruct a representation of the world, simulate potential actions, and formulate solutions to problems. The chain-of-thought approach demonstrates a deliberate reasoning process, where the agent explores different possibilities and refines its understanding based on feedback.
The Markov Decision Process component suggests that the agent operates within a probabilistic framework, where actions have uncertain outcomes. The world model serves as a crucial component in this process, allowing the agent to predict the consequences of its actions and make informed decisions.
The diagram highlights the importance of abstraction and representation in intelligent systems. The world model provides a simplified representation of the environment, enabling the agent to reason about complex scenarios without being overwhelmed by details. The use of coordinates (e.g., (0,0,0), (1,0,0)) suggests that the world model employs a symbolic representation of objects and their relationships.
The diagram's overall message is that effective world modeling requires the integration of multiple sources of information, a robust simulation capability, and a systematic reasoning process. This framework has implications for the development of artificial intelligence systems that can operate in complex and dynamic environments.
</details>
Figure 2: Theoretical formulation of the world model perspective on multimodal reasoning. (a) Observations of the same underlying world state can span multiple modalities, including verbal and visual observations, each reflecting different views or emphases. (b) Two atomic capabilities of world models are defined: world reconstruction, which infers complete structure from partial observations and enables novel view synthesis, and world simulation, which models dynamics to predict future observations. (c) Chain-of-thought reasoning includes internal world modeling, by explicitly maintaining an evolving sequence of observations, generated through either of the atomic world model capabilities.
A world model, analogous to human mental models, is then expected to support two fundamental capabilities [34], illustrated in Figure 2 b. The first is called world reconstruction. Humans are remarkably skilled at mentally reconstructing the structure of an environment from only a few partial observations [71], grounded in their prior knowledge of the world. Such mental reconstruction allows them to imagine novel views of the same underlying state, supporting skills such as mental rotation. Formally, the perception component of a world model encodes $n$ observations from limited views into an internal representation: $\hat{s}=\operatorname{enc}(o_{\phi_{1}},\dots,o_{\phi_{n}})\approx s$ . This representation approximates the true state (we set aside the debate between compact and comprehensive representations: by treating abstract (e.g., sketches) and high-fidelity observations as different view specifications, this formulation allows the internal representation to flexibly adjust to the level of detail required by the desired views), and can then be decoded to generate an unseen observation: $\hat{o}_{\phi_{n+1}}=\mathrm{dec}(\hat{s},\phi_{n+1})\approx e_{\phi_{n+1}}(s)$ , providing an internal "experience" of navigating the world. In modern generative models, including UMMs, since their latent representations are not explicitly defined, the world reconstruction capability can be realized through end-to-end novel view generation:
$$
\displaystyle p_{\theta}(o_{\phi_{n+1}}\mid o_{\phi_{1}},\dots,o_{\phi_{n}}), \tag{1}
$$
which implicitly learns the internal representations required to synthesize the new view.
The second capability is world simulation. Humans can mentally simulate how the world evolves into the future, supporting reasoning and decision-making, either purely in their minds or with external aids such as a scratchpad. Formally, this corresponds to the prediction component of a world model, which predicts the transition of the current state and action: $\hat{s}^{\prime}\sim\operatorname{pred}(\hat{s},a)$ , providing an internal "experience" of interacting with the world. Similarly, for modern generative models, this capability is more typically realized through predictions of future observations:
$$
\displaystyle p_{\theta}(o_{t+1}\mid o_{\leq t},a_{\leq t}). \tag{2}
$$
In our new evaluation suite, we deliberately curate tasks that specifically demand each capability, allowing us to independently validate its contribution to multimodal reasoning (see Section 4.1).
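As a concrete illustration, the two atomic capabilities can be sketched on the cube-stack example of Figure 2. This toy version replaces learned generative models with exact set operations over occupied cells (the function names and the brute-force reconstruction are illustrative assumptions, not part of any released codebase); real UMMs realize Eq. (1) and Eq. (2) end-to-end from pixels instead:

```python
from itertools import product

def views(state):
    """e_phi for phi in {top, front, right}: project out one axis each."""
    return {
        "top":   {(x, y) for x, y, z in state},
        "front": {(x, z) for x, y, z in state},
        "right": {(y, z) for x, y, z in state},
    }

def reconstruct(v, size=2):
    """World reconstruction (cf. Eq. 1): the maximal set of occupied
    cells consistent with all given partial views."""
    return {
        (x, y, z)
        for x, y, z in product(range(size), repeat=3)
        if (x, y) in v["top"] and (x, z) in v["front"] and (y, z) in v["right"]
    }

def simulate(state, action):
    """World simulation (cf. Eq. 2): predict the next state after an action."""
    kind, cell = action
    return state | {cell} if kind == "put" else state - {cell}

# The L-shaped stack of Figure 2.
true_state = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)}
recon = reconstruct(views(true_state))
print(recon == true_state)  # True: in this instance the views pin down the state
next_state = simulate(recon, ("put", (2, 0, 0)))
print(sorted(next_state))
```

Note that three projections do not always determine a unique stack; ambiguous cases are exactly where prior knowledge, and hence a learned world model, must fill the gap.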
### 3.3 Deliberate Reasoning with World Modeling Across Modalities
We then formalize how world-modeling capabilities within multimodal models contribute to reasoning. Given a question $Q$ and input images $I$ , the chain-of-thought reasoning process of a multimodal AI system can be expressed as a sequence of intermediate steps (or thoughts) $R=\tau_{1},\tau_{2},\dots,\tau_{H}$ , followed by the answer $A$ . Although this general formulation treats each reasoning step $\tau_{i}$ as an unconstrained, free-form operation, our world model perspective suggests that humans reason by prediction and planning, and each step inherently manipulates the underlying world observations of the problem [59, 10, 72]. We therefore refine the reasoning formulation as $\tau_{i}=(r_{i},o_{i})$ to explicitly incorporate an evolving sequence of observations:
$$
\displaystyle R=\left(r_{1},o_{1}\right),\left(r_{2},o_{2}\right),\dots,\left(r_{H},o_{H}\right), \tag{3}
$$
where $r_{i}$ denotes a logical reasoning step based on the accumulated context, typically expressed in text, and $o_{i}$ denotes the observation generated at that step (we use $i$ to index reasoning steps in order to distinguish them from the true time step $t$ of the underlying MOMDP; the two are not generally aligned, as we may include branching and backtracking in the reasoning). Specifically, the input images serve as the initial observation $o_{0}=I$ , and subsequent observations are generated from previous reasoning and observations, by invoking atomic world modeling capabilities: world reconstruction (Eq. (1)) and world simulation (Eq. (2)), where reasoning steps imply actions $a$ and view transformations $\phi$ , as illustrated in Figure 2 c.
This formulation is modality-agnostic, allowing observations—and thus world modeling—to arise across various modalities. We focus specifically on verbal and visual observations, motivated by dual-coding theory in human cognition and by the fact that UMMs are equipped to generate both. This yields several concrete CoT instantiations. Specifically, verbal world modeling produces purely verbal CoTs, with $o_{i}$ as verbal descriptions, whereas visual world modeling produces verbal-visual interleaved CoTs, with $o_{i}$ as generated images. In addition, prior work has discovered that language models can implicitly learn world models with emergent internal representations of board-game states without explicit supervision [37]. Motivated by this, we also consider implicit world modeling, in which no explicit observation is generated ( $o_{i}=\emptyset$ ). In practice, strictly distinguishing implicit from verbal world modeling can be difficult, because there are often partial descriptions of the current state in the reasoning part $r_{i}$ ; in this work, we treat verbal world modeling as explicitly expressing world states or observations in text, such as coordinates or symbolic matrices.
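The three CoT instantiations differ only in what each step's $o_i$ is. The interleaved variant can be sketched as a simple driver loop; the sketch below uses stand-in stubs (`umm_reason`, `umm_generate_image` are hypothetical placeholders, since no real UMM API is specified here), with strings standing in for generated images:

```python
def umm_reason(context):
    """Stub: next verbal reasoning step r_i given the accumulated context."""
    step = len([c for c in context if c[0] == "r"])
    return ("answer: put at (2,0,0)" if step >= 2
            else f"step {step}: simulate placing a cube")

def umm_generate_image(context, r):
    """Stub: observation o_i (a string placeholder for a generated image)."""
    return f"<image imagined after '{r}'>"

def interleaved_cot(question, images, max_steps=8):
    """Interleaved verbal-visual CoT per Eq. (3): alternate (r_i, o_i) pairs."""
    context = [("q", question)] + [("o", im) for im in images]  # o_0 = I
    for _ in range(max_steps):
        r = umm_reason(context)             # verbal step r_i
        context.append(("r", r))
        if r.startswith("answer:"):
            return r, context
        o = umm_generate_image(context, r)  # visual step o_i via Eq. (1)/(2)
        context.append(("o", o))
    return None, context

answer, trace = interleaved_cot("Match the desired back view.", ["<input views>"])
print(answer)  # answer: put at (2,0,0)
```

Purely verbal world modeling corresponds to replacing the image-generation call with a textual state description, and implicit world modeling to skipping it entirely ($o_i=\emptyset$); the surrounding loop is identical.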
### 3.4 The Visual Superiority Hypothesis
Contemporary LLMs and VLMs have achieved impressive performance in structured and abstract domains, such as mathematics and programming, largely driven by large-scale language-centric pre-training and verbal chain-of-thought post-training. Although these models have accumulated extensive verbal and symbolic knowledge, their understanding of the visual world remains limited when trained under purely verbal supervision. As a result, they continue to struggle with tasks grounded in basic physical and spatial intuition that even young children naturally master [49, 8].
Visual world modeling is therefore essential for endowing multimodal AI with complementary forms of information and knowledge. (1) In terms of informativeness, while verbal and symbolic representations capture high-level semantic abstractions, they often suffer from ambiguity and representational bottlenecks. In contrast, visual observations are more concrete and information-rich, directly encoding physical properties such as motion and spatial relationships. This provides precise, fine-grained grounding for reasoning about the complex real world, particularly in spatial and physical tasks. (2) In terms of prior knowledge, visual world knowledge is inherently complementary to symbolic knowledge. Humans and animals acquire much of this knowledge (e.g., physical interactions and spatial transformations) through perception, largely independent of language. Consequently, humans naturally represent and communicate such knowledge visually—for example, by sketching an approximate parabolic trajectory without performing explicit calculations. This suggests that different aspects of world knowledge are concentrated in different data modalities, and learning from large-scale generative modeling of visual data can thereby expand the effective knowledge landscape available for multimodal reasoning.
We next formalize and justify these insights through theoretical analysis, with formal statements and proofs provided in Appendix 7.
Informativeness. For notational convenience, we denote the question $Q$ as $r_{0}$, the input images as $o_{0}$, and the final answer as $r_{H+1}$. Prefixes of a CoT are defined as $R_{i}=(r_{0},o_{0},r_{1},o_{1},\dots,r_{i-1},o_{i-1})$ and $\tilde{R}_{i}=(r_{0},o_{0},r_{1},o_{1},\dots,r_{i-1},o_{i-1},r_{i})$. We use $\mathbb{H}(\cdot)$ and $\mathbb{I}(\cdot;\cdot)$ to denote Shannon entropy and mutual information, respectively. We first establish that the end-to-end answer error admits an upper bound that naturally decomposes into reasoning and world-modeling errors.
**Theorem 1**
*Let $p$ denote the distribution over optimal chain-of-thoughts and answers, and let $p_{\theta}$ be a learned reasoning model. Then the following inequality holds:
$$
\begin{aligned}
\operatorname{KL}(p(A\mid Q,I)\mid\mid p_{\theta}(A\mid Q,I))&\leq\operatorname{KL}(p(R,A\mid Q,I)\mid\mid p_{\theta}(R,A\mid Q,I))\\
&=\sum_{i=1}^{H+1}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(r_{i}\mid R_{i})\mid\mid p_{\theta}(r_{i}\mid R_{i}))\right]}_{\textnormal{reasoning errors}}+\sum_{i=1}^{H}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(o_{i}\mid\tilde{R}_{i})\mid\mid p_{\theta}(o_{i}\mid\tilde{R}_{i}))\right]}_{\textnormal{world-modeling errors}}.
\end{aligned} \tag{4}
$$*
This decomposition reveals a fundamental trade-off between the informativeness of world models for reasoning and the fidelity of the world model itself. In the case of implicit world modeling, where $o_{i}=\emptyset$, the world-modeling error vanishes. However, this typically comes at the cost of increased uncertainty and learning difficulty in reasoning, as all state transitions must be implicitly encoded. Empirically, world models that explicitly track the task states, serving as verbal or visual sketchpads, are generally beneficial for reasoning. We examine the reasoning component of Eq. (4) to elucidate the factors underlying these benefits.
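The chain-rule equality and the data-processing step in Eq. (4) can be checked numerically on a toy one-step CoT. The following is a hedged sketch with made-up distributions (not the paper's proof): $p$ is the "true" distribution and $q$ the learned model over a binary reasoning step $r$ and answer $a$.

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

p_r = np.array([0.7, 0.3]);  q_r = np.array([0.5, 0.5])   # p(r|Q,I), q(r|Q,I)
p_ar = np.array([[0.9, 0.1], [0.2, 0.8]])                 # p(a|r)
q_ar = np.array([[0.6, 0.4], [0.3, 0.7]])                 # q(a|r)

# Marginal answer distributions p(a), q(a)
p_a = p_r @ p_ar
q_a = q_r @ q_ar

# Joint KL decomposes exactly into a reasoning term plus an expected per-step term
joint_p = p_r[:, None] * p_ar
joint_q = q_r[:, None] * q_ar
kl_joint = kl(joint_p.ravel(), joint_q.ravel())
decomp = kl(p_r, q_r) + sum(p_r[i] * kl(p_ar[i], q_ar[i]) for i in range(2))

assert abs(kl_joint - decomp) < 1e-9     # chain-rule equality
assert kl(p_a, q_a) <= kl_joint + 1e-9   # marginal (answer) KL is upper-bounded
print(round(kl(p_a, q_a), 4), "<=", round(kl_joint, 4))
```

The same arithmetic extends to longer chains: each additional step contributes one expected-KL term to the right-hand side of Eq. (4).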
**Theorem 2**
*Let $s_{i}$ denote the latent states associated with the observations $o_{i}$ . Under appropriate assumptions, the reduction in reasoning uncertainty achieved by explicit world modeling satisfies the following properties:
1. Reasoning uncertainty does not increase: $\mathbb{H}(r_{i}|o_{0},r_{0:i-1})-\mathbb{H}(r_{i}|R_{i})=\mathbb{I}(o_{1:i-1};r_{i}|o_{0},r_{0:i-1})\geq 0.$
2. The reasoning uncertainty improvement is bounded by both (i) the information that observations provide about the underlying states and (ii) the information that the reasoning step requires about those states:
$$
\mathbb{I}(o_{1:i-1};r_{i}|o_{0},r_{0:i-1})\leq\min\left(\mathbb{I}(o_{1:i-1};s_{1:i-1}),\mathbb{I}(r_{i};s_{0:i-1},r_{0:i-1})\right). \tag{5}
$$*
The uncertainty of the target distribution is closely related to sample efficiency and learning difficulty. Consequently, the upper bound on the improvement of reasoning uncertainty (Eq. (5)) highlights another trade-off in the choice of observation modality for world modeling. The first term indicates that observations should be sufficiently informative about the underlying latent states. In contrast, the second suggests that they need only preserve the task-relevant aspects of the states required to select appropriate reasoning steps. Excessively detailed observations may be unnecessary and even detrimental, increasing world modeling errors.
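The data-processing structure behind the bound in Eq. (5) can be illustrated with a toy check. Here the conditional-independence setup ($o$ and $r$ generated independently given the latent state $s$) and all distributions are illustrative assumptions, not taken from the paper's proof:

```python
import numpy as np

def mi(pxy):
    """Mutual information I(X;Y) in nats from a joint table p(x, y)."""
    px = pxy.sum(1, keepdims=True)
    py = pxy.sum(0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
n = 3
p_s = np.full(n, 1 / n)                            # latent state prior
p_o_given_s = rng.dirichlet(np.ones(n), size=n)    # observation (rendering) channel
p_r_given_s = rng.dirichlet(np.ones(n), size=n)    # reasoning-step channel

p_so = p_s[:, None] * p_o_given_s
p_sr = p_s[:, None] * p_r_given_s
p_or = np.einsum("s,so,sr->or", p_s, p_o_given_s, p_r_given_s)  # p(o, r)

# With o <- s -> r, data processing gives I(o;r) <= min(I(o;s), I(s;r)),
# mirroring the two-sided bound of Eq. (5).
assert mi(p_or) <= min(mi(p_so), mi(p_sr)) + 1e-9
```

Intuitively, making the observation channel noisier shrinks the first term, and making the reasoning step less state-dependent shrinks the second; either way, the achievable uncertainty reduction is capped.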
Prior knowledge. Although visual world models are more informative, they are intrinsically more difficult to learn from scratch due to the high dimensionality and complexity of visual observations. Fortunately, modern AI systems are typically large-scale pre-trained, which endows them with strong prior knowledge and enables faster convergence and improved generalization during downstream post-training. As discussed earlier, humans tend to represent different aspects of world knowledge through different modalities. Consequently, for a given downstream task, the distribution shift between its transition distribution and that learned during large-scale Internet pre-training can vary substantially across modalities. The generalization bound in Theorem 6 of Appendix 7.2 suggests that this modality-dependent distribution shift is closely related to the post-training sample efficiency of the corresponding world model. This highlights the importance of acquiring broad prior knowledge across modalities during pre-training, and of leveraging the proper modality whose priors are best aligned with the downstream task.
Drawing on the above analysis, we formulate our central hypothesis regarding when and how visual generation benefits reasoning, thereby helping narrow the gap between multimodal AI and human capabilities.
The Visual Superiority Hypothesis: In multimodal reasoning tasks grounded in the physical world, visual generation as a world model yields representations that are more informative and knowledge-rich than those produced by verbal world models.
## 4 Experiment Settings
Finally, we empirically validate the insights and theoretical analyses presented above through a series of controlled experiments. In this section, we describe the evaluation tasks and model training procedures.
### 4.1 VisWorld-Eval: Task Suite for Reasoning with Visual World Modeling
While prior work has primarily designed evaluation tasks heuristically, we take a principled approach, evaluating multimodal reasoning across tasks that target specific world-model capabilities. Building on related benchmarks, we identify and curate a total of seven tasks, forming an evaluation suite tailored to assess reasoning with visual world modeling. All tasks are framed as question answering with concise, verifiable answers, and performance is measured by answer accuracy. We refer to this suite as VisWorld-Eval, and summarize it in Figure 3.
World simulation. We consider the following tasks that primarily require simulating world dynamics over time: (1) Paper folding: Adapted from SpatialViz-Bench [61], this task presents a sequence of paper folds followed by hole punching, and asks for the distribution of holes after the paper is unfolded. Successfully solving this task requires simulating the unfolding process, relying on prior knowledge of symmetry and spatial transformations that is commonly grounded in visual experience. (2) Multi-hop manipulation: Built upon CLEVR [30], this task features a scene containing objects with various shapes and colors that undergo a sequence of operations, such as addition, removal, or color changes. The final question queries properties of the resulting layouts. Since target objects of operations are often specified via relative spatial relationships, this task places strong demands on state tracking and spatial understanding. (3) Ball tracking: Adapted from RBench-V [20], this task evaluates physical dynamics simulation by requiring the model to infer the trajectory of a ball undergoing ideal specular reflections within a given scene and predict which numbered hole it will ultimately enter. In addition, we include (4) Maze [29] and (5) Sokoban [55], as these two grid-world tasks are commonly used in prior work studying visual generation for reasoning [67, 36].
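To illustrate what the ball-tracking task demands of a world model, the underlying physics admits a compact sketch. This is our own simplification (an axis-aligned box and the classic "unfolding" trick; function names are ours, not the benchmark generator's), assuming a point ball with constant upward velocity:

```python
def fold(x, w):
    """Fold an unbounded coordinate back into [0, w] (specular reflection)."""
    x %= 2 * w
    return 2 * w - x if x > w else x

def hole_entered(x0, y0, vx, vy, width, height, holes):
    """Index of the top hole the ball reaches first, under ideal specular
    reflection off the side walls; assumes vy > 0 so the ball always rises."""
    t = (height - y0) / vy              # time until the ball reaches the top
    x_top = fold(x0 + vx * t, width)    # unfold reflections into a straight line
    for i, (lo, hi) in enumerate(holes, start=1):
        if lo <= x_top <= hi:
            return i
    return None

# Unit box, ball at the center moving up-right at 45 degrees; three top holes.
holes = [(0.0, 0.3), (0.3, 0.7), (0.7, 1.0)]
print(hole_entered(0.5, 0.5, 1.0, 1.0, 1.0, 1.0, holes))  # → 3
```

A verbal CoT must reproduce exactly this kind of coordinate arithmetic step by step, whereas a visual world model can render each bounce directly.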
World reconstruction. We also evaluate tasks that emphasize reconstructing underlying world structure from partial observations: (6) Cube 3-view projection: Adapted from SpatialViz-Bench [61], this task provides an isometric view and two orthographic views of a connected cube stack, and asks about an unseen viewpoint. Solving the task requires reconstructing the full 3D structure and mentally rotating or projecting it into the queried view, a process closely aligned with human visual mental representations. (7) Real-world spatial reasoning: We focus on the positional relationship subset of MMSI-Bench [69]. Given multiple views of a realistic scene, these tasks ask about positional relationships among the cameras, objects, and regions. Successfully answering these questions requires constructing a coherent spatial mental model of the scene from limited viewpoints to support accurate spatial reasoning.
For each task, we construct SFT data by designing different CoT patterns with implicit, verbal, or visual world modeling, enabling controlled comparative evaluations. The data construction pipeline and examples across tasks are presented in Appendix 8.1.
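Conceptually, the three CoT supervision formats can be rendered from one ground-truth trajectory. The sketch below is hypothetical: the field layout and the `<image:...>` placeholder convention are our own illustration, not the paper's data format.

```python
def build_sft_example(question, steps, answer, mode):
    """steps: list of (reasoning_text, state_text, image_path) tuples."""
    parts = [question]
    for reasoning, state_text, image_path in steps:
        parts.append(reasoning)
        if mode == "verbal":
            parts.append(f"State: {state_text}")      # explicit textual state
        elif mode == "visual":
            parts.append(f"<image:{image_path}>")     # generated-image placeholder
        # mode == "implicit": no explicit observation is emitted
    parts.append(f"Answer: {answer}")
    return "\n".join(parts)

steps = [("Move the player right.", "player=(2,1), box=(3,1)", "step1.png")]
for mode in ("implicit", "verbal", "visual"):
    print(build_sft_example("Solve the Sokoban puzzle.", steps, "Right, Down", mode))
    print("---")
```

Holding the question, reasoning text, and answer fixed while varying only the observation channel is what makes the comparison across world-modeling formats controlled.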
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: VisWorld-Eval Task Suite
### Overview
The image presents a diagram showcasing the "VisWorld-Eval: Task Suite for Reasoning with Visual World Modeling". It displays seven different visual reasoning tasks, categorized under "World Simulation" and "World Reconstruction". Each task is represented by a visual example and a corresponding question with a multiple-choice answer.
### Components/Axes
The diagram is structured into two main sections: "World Simulation" (top row) and "World Reconstruction" (bottom row). Each section contains several individual task examples. Each task example includes:
* A visual representation of the task.
* A question related to the visual.
* Multiple-choice answers (A, B, C, D).
* The correct answer is indicated with "A: [answer]".
The tasks are:
1. Paper Folding
2. Multi-Hop Manipulation
3. Ball Tracking
4. Maze
5. Sokoban
6. Cube 3-View Projection
7. Real-World Spatial Reasoning
### Detailed Analysis or Content Details
**World Simulation:**
1. **Paper Folding:** Visual shows a partially unfolded paper with dotted lines indicating folds. Question: "How many cutouts are there in the unfolded paper?" Answer: A: 15
2. **Multi-Hop Manipulation:** Visual shows several colored cylinders. Question: "Starting with the initial arrangement, perform the following: 1. Place a red cylinder to the left of the black cylinder. 2. Swap the colors of the orange cylinder and the black cylinder. After these operations, what is to the left of the orange cylinder?" Answer: A. black sphere, B. white sphere, C. yellow cylinder, D. red cylinder. Correct Answer: A.
3. **Ball Tracking:** Visual shows a ball bouncing off walls. Question: "Given a red point-mass ball that moves at constant speed, reflects perfectly off solid walls, and follows the initial direction indicated by a green arrow, determine which numbered hole at the top it will enter first." Answer: A: 1
4. **Maze:** Visual shows a simple maze. Question: "Navigate the maze from the red dot to the blue X." Answer: A: (4, 5), (5, 5), (5, 4)
5. **Sokoban:** Visual shows a Sokoban-style puzzle. Question: "Guide the player to push the box onto the goal position." Answer: A: Down, Right, Down, ...
**World Reconstruction:**
6. **Cube 3-View Projection:** Visual shows three 2D projections of a cube (Front, Right, Top). Question: "How many cubes in dark violet can possibly be seen from the back view?" Answer: A: 0, B: 2, C: 3, D: 9.
7. **Real-World Spatial Reasoning:** Visual shows a room with a door. Question: "Which direction is the black door relative to me when I am taking image 2?" Answer: A. Behind, B. Left, C. Front, D. Right.
### Key Observations
The diagram presents a diverse set of visual reasoning tasks. The tasks range in complexity, from simple counting (Paper Folding) to more complex spatial reasoning (Maze, Sokoban, Real-World Spatial Reasoning). The tasks are designed to test different aspects of visual understanding and reasoning.
### Interpretation
The diagram illustrates a comprehensive task suite ("VisWorld-Eval") designed to evaluate the capabilities of AI models in visual world modeling. The tasks cover a spectrum of reasoning abilities, including physical simulation, manipulation, path planning, and spatial understanding. The inclusion of both "World Simulation" and "World Reconstruction" tasks suggests an emphasis on both predicting how the world will change and inferring the structure of the world from visual input. The multiple-choice format allows for quantifiable evaluation of model performance. The tasks are designed to be challenging, requiring models to go beyond simple pattern recognition and engage in more complex reasoning processes. The variety of tasks suggests a goal of creating a robust and generalizable benchmark for visual reasoning.
</details>
Figure 3: The VisWorld-Eval suite for assessing multimodal reasoning with visual world modeling. VisWorld-Eval comprises seven tasks spanning both synthetic and real-world domains, each designed to isolate and demand specific atomic world-model capabilities.
Table 1: Zero-shot evaluation of advanced VLMs on VisWorld-Eval. We report the average accuracy over five tasks (excluding Maze and Sokoban) and over all seven tasks.
| Models | Paper Folding | Multi-Hop Manip. | Ball Tracking | Cube 3-View | MMSI (Pos. Rel.) | Maze | Sokoban | Overall (5 tasks) | Overall (7 tasks) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | | |
| Gemini 3 Flash | 25.6 | 75.4 | 55.3 | 52.7 | 41.3 | 73.9 | 99.3 | 50.0 | 60.5 |
| Gemini 3 Pro | 27.0 | 74.5 | 44.7 | 53.3 | 49.6 | 33.5 | 90.2 | 49.8 | 53.2 |
| Seed 1.8 | 10.6 | 75.2 | 24.4 | 42.5 | 38.8 | 83.9 | 68.3 | 38.3 | 49.1 |
| GPT 5.1 | 6.4 | 73.9 | 34.8 | 44.5 | 44.8 | 0.6 | 62.8 | 40.8 | 38.2 |
| o3 | 13.5 | 68.1 | 24.7 | 37.7 | 44.4 | 0.0 | 36.0 | 37.6 | 32.0 |
| Open-Source Models | | | | | | | | | |
| Qwen3-VL-8B-Thinking [5] | 11.0 | 49.3 | 17.8 | 21.2 | 27.7 | 0.0 | 5.8 | 25.4 | 18.9 |
| BAGEL-7B-MoT [13] | 11.2 | 31.6 | 19.4 | 26.8 | 27.2 | 0.0 | 0.2 | 23.2 | 16.6 |
Evaluation of advanced VLMs. Table 1 reports the zero-shot performance of advanced VLMs on VisWorld-Eval. Overall, these models perform suboptimally, highlighting limitations of current multimodal AI systems. Among them, Gemini 3 Flash and Gemini 3 Pro remarkably outperform the other models; however, their performance remains far from satisfactory on challenging tasks like paper folding, ball tracking, cube 3-view projection, and real-world spatial reasoning.
### 4.2 Unified Multimodal Model Training and Evaluation
Evaluation protocol. To investigate the benefits of visual generation in multimodal reasoning, we evaluate post-trained UMMs, rather than the zero-shot performance of base models. To the best of our knowledge, no open-source model has been natively optimized for interleaved verbal-visual generation for reasoning. Even commercial closed-source models currently exhibit fundamental limitations in generating visual intermediate reasoning steps [38, 76]. Focusing on post-trained models, therefore, provides a more meaningful estimate of the upper bound for multimodal reasoning performance, while reducing confounding effects arising from insufficient pre-training due to limited interleaved data availability or quality.
Model training. We adopt BAGEL [13], a state-of-the-art open-source unified multimodal model, as our base model. Most experiments are conducted by supervised fine-tuning (SFT) on task-specific datasets, where verbal and visual generation in both chain-of-thought reasoning and final answers are optimized using cross-entropy and flow-matching loss. Specifically, the loss for reasoning with visual world modeling is as follows:
$$
\mathcal{L}_{\theta}(Q,I,R,A)=-\sum_{i=1}^{H+1}\sum_{j=1}^{|r_{i}|}\log p_{\theta}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)+\sum_{i=1}^{H}\mathbb{E}_{t,\epsilon}\left\|v_{\theta}(o_{i}^{t},t\mid\tilde{R}_{i})-(o_{i}-\epsilon)\right\|_{2}^{2}, \tag{6}
$$
where $o_{i}^{t}=to_{i}+(1-t)\epsilon$ are noisy observations. We emphasize that in our formulation, $r_{i}$ refers to a verbal reasoning step, not a reward. We also perform reinforcement learning from verifiable rewards (RLVR) following SFT. During RL, only the verbal generation component is optimized by GRPO [18], while visual generation is regularized toward the SFT-trained reference model via a velocity-matching penalty:
$$
\begin{aligned}
\mathcal{J}_{\theta}(Q,I)=\mathbb{E}_{o,r\sim p_{\theta_{\text{old}}}}\Bigg[&\sum_{i=1}^{H+1}\sum_{j=1}^{|r_{i}|}\min\Big(\frac{p_{\theta}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)}{p_{\theta_{\text{old}}}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)}A,\ \text{clip}\Big(\frac{p_{\theta}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)}{p_{\theta_{\text{old}}}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)},1-\varepsilon,1+\varepsilon\Big)A\Big)\\
&-\sum_{i=1}^{H}\mathbb{E}_{t,\epsilon}\left\|v_{\theta}(o_{i}^{t},t\mid\tilde{R}_{i})-v_{\theta_{\text{ref}}}(o_{i}^{t},t\mid\tilde{R}_{i})\right\|_{2}^{2}\Bigg].
\end{aligned} \tag{7}
$$
Full implementation details and hyperparameters are provided in Appendix 8.2.
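The two terms of the SFT objective in Eq. (6) can be sketched with numpy on toy tensors. Shapes, the zero-velocity stand-in for the model, and the rectified-flow convention (interpolant $o^t = t\,o + (1-t)\epsilon$ with straight-line velocity target $o-\epsilon$) are assumptions for illustration; actual training uses the BAGEL codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, targets):
    """Token-level negative log-likelihood; logits: (seq, vocab), targets: (seq,)."""
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].sum()

def flow_matching_mse(o, v_pred_fn, t, eps):
    """Flow-matching term for one visual observation o_i."""
    o_t = t * o + (1 - t) * eps        # noisy observation o_i^t
    target = o - eps                   # velocity of the straight path from eps to o
    return float(((v_pred_fn(o_t, t) - target) ** 2).sum())

logits = rng.normal(size=(5, 10))                # verbal step: 5 tokens, vocab 10
tokens = rng.integers(0, 10, size=5)
o = rng.normal(size=(4, 4))                      # toy "image" latent
eps = rng.normal(size=(4, 4))
t = 0.3
v_pred = lambda o_t, t: np.zeros_like(o_t)       # stand-in for the velocity head

loss = cross_entropy(logits, tokens) + flow_matching_mse(o, v_pred, t, eps)
print(round(loss, 3))
```

In Eq. (7), the flow-matching target is simply swapped from the ground-truth velocity to the reference model's predicted velocity, which is what anchors visual generation during RL.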
## 5 Experimental Results
In this section, we demonstrate that visual world modeling boosts multimodal reasoning through two atomic capabilities: world simulation (Section 5.1) and world reconstruction (Section 5.2). We also identify tasks in which it is unhelpful (Section 5.3), where implicit or verbal world modeling suffices, and we analyze these cases in detail. Interestingly, we reveal emergent internal representations in UMMs that support implicit world modeling on simple maze tasks.
### 5.1 Visual World Simulation Boosts Multimodal Reasoning
Main results. Figure 4 summarizes the performance of SFT-trained UMMs under different chain-of-thought formulations across all tasks. We observe that interleaved CoT with visual world modeling significantly outperforms its purely verbal counterparts on three world simulation tasks: paper folding, multi-hop manipulation, and ball tracking. These gains are attributed to both the richer expressiveness and the stronger prior knowledge afforded by the visual modality. In particular, in tasks such as multi-hop manipulation and ball tracking (the latter being especially challenging), it is difficult for models to precisely ground object coordinates and perform arithmetic operations without external tools, which exacerbates ambiguity and hallucinations in purely verbal reasoning; verbal world modeling is therefore inappropriate for these tasks and omitted. Similarly, in paper folding, although models can track the states of holes, it remains difficult to fully depict the paper contour during unfolding. Moreover, as showcased in Figures 9 and 16, the spatial transformation involved in paper unfolding critically relies on an understanding of geometric symmetry, which is more naturally learned from visual data such as images and videos.
Sample efficiency. To further demonstrate the stronger prior knowledge embedded in the visual modality, we conduct an experiment comparing the sample efficiency of verbal and visual world modeling on the paper folding task. As shown in Figure 6, reasoning with visual world modeling exhibits substantially higher sample efficiency, achieving performance comparable to verbal world modeling with less than one quarter of the SFT data.
### 5.2 Visual World Reconstruction Boosts Multimodal Reasoning
Main results. As shown in Figure 4, multimodal reasoning tasks that rely on world reconstruction capabilities also benefit substantially from visual world modeling. In the cube 3-view task, predicting a novel view of stacked cubes denoted as symbolic character matrices suffers from limited prior knowledge, whereas visually rotating objects is richly represented in pre-training on large-scale Internet videos. For MMSI tasks, fully describing a novel view of a realistic scene using text alone is similarly ill-suited, as in the previous subsection, and we also observe hallucinations in purely verbal reasoning, which lacks grounding in visual generation. Beyond the camera-object and camera-region subtasks, we do not observe consistent improvements on the other positional-relationship subtasks of MMSI-Bench, which we attribute to the current UMM's limitations in both spatial understanding during verbal reasoning and generation quality in visual world modeling. Full quantitative results and qualitative examples are provided in Appendix 9. We expect these limitations to be mitigated in future work with stronger base models.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: Accuracy Comparison of World Modeling Techniques
### Overview
This bar chart compares the accuracy of three world modeling techniques – Implicit, Verbal, and Visual – across eight task settings. Accuracy is measured on the y-axis, ranging from 0 to 80, while the x-axis represents the different tasks: Paper Folding, Multi-Hop Manipulation, Ball Tracking, Cube 3-View Projection, MMSI (Cam.-Obj.), MMSI (Cam.-Reg.), Maze, and Sokoban.
### Components/Axes
* **X-axis:** Task Name (Categorical) - Paper Folding, Multi-Hop Manip., Ball Tracking, Cube 3-View Proj., MMSI (Cam.-Obj.), MMSI (Cam.-Reg.), Maze, Sokoban.
* **Y-axis:** Accuracy (Numerical) - Scale from 0 to 80.
* **Legend:**
* Implicit World Modeling (Pink/Reddish)
* Verbal World Modeling (Green)
* Visual World Modeling (Blue)
* **Chart Title:** Not explicitly present, but the chart represents an accuracy comparison.
### Detailed Analysis
The chart consists of grouped bar plots for each task, with three bars representing the accuracy of each world modeling technique.
* **Paper Folding:**
* Implicit: Approximately 21.1
* Verbal: Approximately 27.4
* Visual: Approximately 39.2
* **Multi-Hop Manip.:**
* Implicit: Approximately 40.0
* Verbal: Not present
* Visual: Approximately 66.6
* **Ball Tracking:**
* Implicit: Approximately 40.7
* Verbal: Not present
* Visual: Approximately 57.6
* **Cube 3-View Proj.:**
* Implicit: Approximately 60.2
* Verbal: Approximately 63.7
* Visual: Approximately 76.8
* **MMSI (Cam.-Obj.):**
* Implicit: Approximately 46.5
* Verbal: Approximately 37.3
* Visual: Approximately 60.9
* **MMSI (Cam.-Reg.):**
* Implicit: Approximately 54.4
* Verbal: Not present
* Visual: Not present
* **Maze:**
* Implicit: Approximately 70.6
* Verbal: Approximately 73.1
* Visual: Approximately 77.0
* **Sokoban:**
* Implicit: Approximately 29.6
* Verbal: Approximately 36.8
* Visual: Approximately 39.3
**Trends:**
* **Visual World Modeling** generally performs the best, achieving the highest accuracy on most tasks.
* **Implicit World Modeling** shows variable performance, with some tasks showing relatively low accuracy and others showing competitive results.
* **Verbal World Modeling** is not present for all tasks, and its performance varies.
### Key Observations
* Visual World Modeling consistently outperforms the other two methods on most tasks.
* The largest performance gaps between the methods are observed in the "Multi-Hop Manip." and "Ball Tracking" tasks.
* The "MMSI (Cam.-Reg.)" task only shows data for Implicit World Modeling.
* The "Multi-Hop Manip." and "Ball Tracking" tasks do not have data for Verbal World Modeling.
### Interpretation
The data suggests that Visual World Modeling is the most effective technique for these tasks, consistently achieving higher accuracy than Implicit and Verbal World Modeling. This could be due to the ability of visual models to directly process and understand visual information, which is crucial for tasks involving spatial reasoning and object manipulation. The absence of data for Verbal World Modeling in some tasks suggests it may not be applicable or effective for those specific scenarios. The varying performance of Implicit World Modeling indicates its effectiveness is highly task-dependent. The consistent high performance of Visual World Modeling across a range of tasks suggests its robustness and generalizability. The lack of data for certain combinations (e.g., Verbal on Multi-Hop Manip.) could indicate limitations in applying those modeling techniques to those specific problems, or simply a lack of experimentation. The chart highlights the importance of selecting the appropriate world modeling technique based on the specific requirements of the task.
</details>
Figure 4: Performance of SFT-trained UMMs with different world model-based chain-of-thought formulations across seven tasks from VisWorld-Eval. Refer to Table 1 for zero-shot performance of advanced VLMs.
Effects of task difficulty. Figure 6 analyzes performance on the cube 3-view projection task across varying sizes of input cube stacks. We observe a consistent advantage of reasoning with visual world modeling over verbal world modeling across all difficulty levels. Notably, for cube stacks of size six—outside the training distribution—visual world modeling still yields approximately a $10\%$ performance improvement.
World model fidelity. Modern AI models are known to exhibit hallucinations along their reasoning trajectories, even when producing correct final answers [38]. We therefore evaluate the fidelity of world modeling in the cube 3-view projection task by comparing ground-truth views with the intermediate views generated verbally or visually during reasoning. To focus on structural correctness, we compare only the shapes of the views and ignore color information entirely. Even under this relaxed evaluation setting, Figure 6 shows that verbal world modeling exhibits strikingly low fidelity, with scores degrading to near zero. Notably, approximately half of the samples require predicting the opposite view of a given input view, a transformation that involves only horizontal mirroring. Visual world modeling, benefiting from stronger prior knowledge of such geometric transformations, captures these patterns effectively and achieves fidelity scores consistently exceeding $50\%$.
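A shape-only fidelity check of the kind described above can be sketched as follows; the symbolic matrix encoding of a view (one color character per cell, `.` for empty) is our own illustrative convention, not the paper's exact format.

```python
def shape_fidelity(pred, truth, empty="."):
    """Exact-match score on occupancy masks, ignoring per-cell colors."""
    mask = lambda grid: [[cell != empty for cell in row] for row in grid]
    return 1.0 if mask(pred) == mask(truth) else 0.0

truth = [["R", "G"], [".", "B"]]
pred  = [["B", "R"], [".", "G"]]   # wrong colors, correct silhouette
print(shape_fidelity(pred, truth))  # → 1.0
```

Averaging this score over intermediate views gives a per-method fidelity number that is deliberately lenient: a method scoring near zero even here has failed at the geometry, not merely at the coloring.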
### 5.3 Visual World Modeling is Unhelpful for Certain Tasks
Main results. (Un)surprisingly, we do not observe notable improvements on grid-world tasks, including maze and Sokoban. In the maze tasks, reasoning with implicit world modeling—without explicitly tracking coordinates—achieves the best performance with a slight advantage. These results are consistent with recent empirical findings [14]. We argue that this is also well explained by our world model perspective. In these tasks, state tracking is relatively simple, typically requiring the maintenance of only one or two two-dimensional coordinates, which can be adequately handled through verbal reasoning alone. Furthermore, in the maze task, we hypothesize that such world modeling can be implicitly encoded in the model’s hidden representations [37], which helps explain the competitive performance of verbal reasoning without explicit coordinate tracking.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Diagram: BAGEL Architecture Illustration
### Overview
The image depicts a simplified diagram of the BAGEL (presumably a model or system) architecture. It shows a layered structure labeled "Layer 1" through "Layer N" with the text "BAGEL" prominently displayed within the layers. An MLP (Multi-Layer Perceptron) is connected to the output of the layers, and a coordinate is indicated in the top-right corner. There is also a text snippet at the bottom.
### Components/Axes
* **Layers:** Layer 1, Layer 2, Layer N (representing the depth of the BAGEL model).
* **MLP:** Multi-Layer Perceptron, a component receiving input from the BAGEL layers.
* **Coordinate:** (1, 3) - likely representing a position or index within the system.
* **Text Snippet:** "... proceed until I hit a wall, at [masked] ..."
* **Visual Elements:** A bagel icon is present within the layers.
### Detailed Analysis or Content Details
The diagram shows a flow of information from the BAGEL layers to the MLP.
* **BAGEL Layers:** The layers are stacked vertically, with "Layer 1" at the bottom and "Layer N" at the top. An ellipsis ("...") indicates that there are layers between Layer 2 and Layer N.
* **MLP Connection:** An arrow originates from the top-most layer (Layer N) and connects to the MLP.
* **Coordinate:** The coordinate (1, 3) is positioned in the top-right corner.
* **Output Representation:** A red bar graph-like structure is shown above the MLP, potentially representing the output of the MLP. The graph has 5 bars of varying heights.
* **Text Snippet:** The text snippet at the bottom reads "... proceed until I hit a wall, at [masked] ...". The "[masked]" portion suggests that some information is intentionally hidden or omitted.
### Key Observations
* The diagram is highly conceptual and doesn't provide specific numerical data.
* The "BAGEL" name is visually emphasized.
* The coordinate (1, 3) might relate to the location of a specific element or feature within the BAGEL model.
* The masked text suggests a limitation or stopping point in a process.
### Interpretation
The diagram illustrates a high-level architecture where the BAGEL model processes information through multiple layers, and the output of these layers is then fed into an MLP. The MLP likely performs further processing or classification. The coordinate (1, 3) could indicate a specific location or feature within the BAGEL model that is being highlighted. The text snippet suggests a process that continues until a certain condition is met ("hitting a wall"), but the specific condition is obscured by the "[masked]" token.
The presence of the bagel icon within the layers is likely a branding element or a playful reference to the model's name. The diagram is intended to provide a conceptual understanding of the BAGEL architecture rather than a detailed technical specification. The red bar graph above the MLP could represent the output distribution or activation values of the MLP.
</details>
Figure 5: Probing implicit world models by training a set of probes, i.e., MLPs that infer the masked point coordinates during reasoning from internal representations.
Demystifying implicit world modeling. To validate this hypothesis, we probe the internal representations of models, as illustrated in Figure 5. We consider the same architecture, BAGEL, with three different sets of weights: a randomly initialized model, the pre-trained model, and the model supervised fine-tuned on CoT data in the implicit world modeling format, in which special tokens mask all explicit point coordinates during the reasoning process. For each model, we extract the hidden representations of these special tokens at each layer. We then train multilayer perceptrons (MLPs) on these representations to predict the underlying true point coordinates.
Figure 6 reports the prediction accuracy on a validation set. As expected, the randomly initialized model completely fails to internally track point states, achieving only random-guess accuracy on $5\times 5$ mazes. In contrast, the pre-trained model [13] already exhibits emergent representations that are predictive of maze states. Notably, we observe a non-monotonic trend across layers: prediction accuracy increases from lower layers (which capture low-level features) to middle layers, and then decreases toward the final layers, which are likely specialized for next-token prediction. Finally, supervised fine-tuning on domain-specific data, despite providing no explicit coordinate supervision, substantially enhances this internal predictability, achieving near-perfect accuracy. These in-depth results help explain our main experimental findings: as the model already possesses the capability for implicit world modeling, it does not necessarily benefit from explicit verbal world modeling, let alone more complex forms of visual world modeling.
<details>
<summary>x6.png Details</summary>

### Visual Description
Line chart titled "Paper Folding": accuracy (y-axis, 20-90) versus number of training samples (x-axis, 0-2500). The legend groups entries under "Modeling" (visual WM in light blue, verbal WM in red, plus a third blue line) and "Difficulty" (normal: grey circle; hard: red triangle). Visual WM rises steeply from roughly 52% accuracy at 500 samples to a plateau near 72% by 2000 samples; verbal WM stays nearly flat between roughly 24% and 28%; the third line climbs modestly from about 28% to 40% before plateauing. Visual world modeling outperforms the alternatives at every sample budget, and the hard difficulty setting yields markedly lower accuracy than the normal one.
</details>
(a) Sample efficiency.
<details>
<summary>x7.png Details</summary>

### Visual Description
Line chart titled "Cube 3-View Projection": metric value (y-axis, 0-100) versus cube stack size (x-axis, 3 to 6). Color encodes the modeling format (visual WM in blue, verbal WM in red) and line style encodes the metric (answer accuracy solid, world-model fidelity dashed). All curves decline as the stack size grows, but visual WM degrades gracefully (from roughly 92 at size 3 to 75 at size 6) while verbal WM collapses (from roughly 45 down to 5); the accuracy and fidelity curves show similar moderate declines (roughly 85 to 60 and 65 to 40, respectively). Verbal world modeling is thus far more sensitive to increasing scene complexity than visual world modeling.
</details>
(b) World model fidelity.
<details>
<summary>x8.png Details</summary>

### Visual Description
Line chart titled "Maze": state prediction accuracy (y-axis, roughly 0.2 to 1.0) versus layer index (x-axis, 0 to 28), for probes trained on three sets of weights of the same architecture: Random Init. (light red), BAGEL PT (green), and BAGEL SFT (blue). Random Init. stays flat at roughly 0.23 (chance level) across all layers. BAGEL PT rises from about 0.28 at layer 0 to a peak near 0.55 around layer 18, then declines to about 0.45 by layer 28. BAGEL SFT climbs rapidly, reaching about 0.95 by layer 16 and fluctuating between 0.92 and 0.96 thereafter. Approximate values:
| Layer Index | Random Init. | BAGEL PT | BAGEL SFT |
|---|---|---|---|
| 0 | 0.23 | 0.28 | 0.27 |
| 5 | 0.23 | 0.32 | 0.32 |
| 10 | 0.23 | 0.45 | 0.65 |
| 15 | 0.23 | 0.52 | 0.92 |
| 20 | 0.23 | 0.50 | 0.95 |
| 25 | 0.23 | 0.45 | 0.93 |
| 28 | 0.23 | 0.43 | 0.92 |
Probes on the fine-tuned model thus recover the masked states almost perfectly, probes on the pre-trained model recover them partially with a peak at middle layers, and probes on the randomly initialized model do no better than chance.
</details>
(c) Implicit world modeling.
Figure 6: Model analysis: (a) Performance of UMMs on the paper-folding task with varying numbers of SFT samples. Reasoning with visual world modeling achieves a $4\times$ improvement in sample efficiency. WM = world modeling. (b) Performance of UMMs on the cube 3-view projection task with increasing sizes of input cube stacks, evaluated using both answer accuracy and world-model fidelity. Visual world modeling demonstrates dramatically better fidelity of view synthesis. (c) Prediction accuracy of masked point coordinates in CoTs using representations extracted from different layers of different UMMs, revealing emergent internal world representations. PT = Pre-trained.
<details>
<summary>x9.png Details</summary>

### Visual Description
Bar chart comparing answer accuracy (y-axis, 0 to 80) of implicit, verbal, and visual world modeling for BAGEL (solid bars) and Qwen-VL (hatched bars) across three tasks. Approximate values:
| Task | Implicit (BAGEL / Qwen-VL) | Verbal (BAGEL / Qwen-VL) | Visual (BAGEL) |
|---|---|---|---|
| Paper Folding | 21.5 / 21.1 | 28.8 / 27.4 | 39.2 |
| Multi-Hop Manip. | 37.5 / 40.0 | 66.6 / — | — |
| Cube 3-View Proj. | 60.0 / 63.7 | 58.8 / 60.2 | 76.8 |
No visual world modeling bar appears for Qwen-VL, as a pure VLM cannot generate images. Where shown, visual world modeling with BAGEL achieves the highest accuracy on paper folding and cube 3-view projection, while Qwen-VL performs comparably to BAGEL under the implicit and verbal formats.
</details>
Figure 7: Performance of SFT-trained VLMs compared with UMMs across three tasks.
<details>
<summary>x10.png Details</summary>

### Visual Description
Three line charts (one per task: "Paper Folding", "Multi-Hop Manip.", "Cube 3-View Proj.") plotting accuracy (y-axis) against RLVR training steps (x-axis, 0 to 200) for implicit world modeling (BAGEL: solid red; Qwen-VL: dashed orange), verbal world modeling (BAGEL: solid green; Qwen-VL: dashed teal), and visual world modeling (BAGEL: solid blue). Approximate trajectories:
* Paper Folding: visual WM rises from about 42 to a plateau near 47-48; verbal WM (BAGEL) from about 32 to 38; implicit WM stays in the mid-20s for both models.
* Multi-Hop Manip.: visual WM climbs from about 72 to 74; verbal WM (BAGEL) from about 42 to 46; implicit WM remains around 26-28 (BAGEL) and 40-42 (Qwen-VL).
* Cube 3-View Proj.: visual WM rises from about 83 to 85; verbal WM (BAGEL) from about 72 to 75; implicit WM from about 64 to 70 (BAGEL) and 62 to 68 (Qwen-VL).
Across all three tasks, RLVR improves every formulation, accuracy largely plateaus within 50-100 steps, and the ordering visual > verbal > implicit world modeling persists throughout training.
</details>
Figure 8: Performance of RLVR-trained VLMs and UMMs with different world-model-based CoT formulations across three tasks.
### 5.4 Comparison with VLMs: Do UMMs Compromise Verbal Reasoning Capabilities?
One may argue that UMMs are typically trained with a stronger emphasis on visual generation [13], which could compromise verbal reasoning capabilities and bias comparisons in favor of visual world modeling. To address this concern, we compare with a pure VLM baseline, Qwen2.5-VL-7B-Instruct [6], which shares the same Qwen2.5 LLM base model as BAGEL. We fine-tune Qwen2.5-VL on the same verbal CoT datasets used in the previous subsections and evaluate it on three representative tasks: paper folding, cube 3-view projection, and multi-hop manipulation.
Results. As shown in Figure 7, the SFT performance of Qwen2.5-VL with implicit and verbal world modeling is comparable to that of BAGEL, without exhibiting significant advantages. It still lags behind BAGEL in settings that leverage visual world modeling. These results indicate that our findings arise from the inherent advantages of visual world modeling rather than from compromised verbal reasoning capabilities in UMMs.
### 5.5 RL Enhances Various CoTs, Yet Does Not Close the Gap
Reinforcement learning from verifiable rewards (RLVR) has been a major driver of recent progress in reasoning models equipped with verbal chains of thought, achieving strong performance across domains such as mathematics [18]. While Figure 4 shows a clear advantage for reasoning with visual world modeling after SFT, RLVR may further incentivize emergent reasoning behaviors that improve verbal CoTs. We thus conduct comparative RLVR experiments across different world-model-based CoT formulations on three representative tasks.
Results. Figure 8 presents the learning curves under RLVR for the different models. We observe consistent improvements during RLVR across CoT formulations; however, the performance gap persists. We also find that VLMs and UMMs generally perform similarly with verbal CoTs. These results suggest that the superiority of visual world modeling arises from inherent advantages of the approach, rather than from insufficient post-training. Notably, RL enhances reasoning with visual world modeling even though only the verbal generation components of interleaved CoTs are directly optimized. We envision that the full potential of interleaved CoTs will be further unlocked by the development of RL algorithms tailored to verbal-visual interleaved generation.
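The "verifiable" part of RLVR reduces to a programmatic check of the final answer. A minimal sketch of such a reward function follows; the answer formats (a LaTeX-style `\boxed{...}` span or a trailing "Answer:" line) are illustrative assumptions, not necessarily the paper's exact templates.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the final answer extracted from the model's
    response matches the ground truth after light normalization."""
    # Assumed formats: the answer appears inside \boxed{...} or after "Answer:".
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if m is None:
        m = re.search(r"Answer:\s*(.+)", response)
    if m is None:
        return 0.0  # no parsable answer: reward zero
    pred = m.group(1).strip().lower()
    return 1.0 if pred == ground_truth.strip().lower() else 0.0
```

Because the reward depends only on the verbal answer, it directly optimizes the verbal generation components of an interleaved CoT, which is consistent with the observation above that visual world modeling still benefits.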
## 6 Discussions
By bridging concepts from human cognition and artificial intelligence, we revisit the mechanisms underlying human reasoning and the central role that world models play. This provides a new perspective on the use of visual generation for reasoning in multimodal AI, highlighting its potential to serve as visual world models that complement the verbal world models embedded in LLMs, thereby enabling more human-like reasoning in scenarios grounded in the physical world. For the first time, this perspective is studied in a principled manner, through theoretical formulations that bridge world models and reasoning, as well as through empirical evaluations whose results are well explained by, and strongly support, the proposed insights. We hope this work helps address longstanding questions about the synergistic effects between generation and reasoning, and more broadly contributes to the development of more human-like AI that thinks and acts with multimodal world models.
<details>
<summary>x11.png Details</summary>

### Visual Description
Two showcased examples of interleaved verbal-visual chain-of-thought, each with a question, a "Model Output: <think>" reasoning trace, and <image> placeholders for generated frames. Left, "Real-World Spatial Reasoning": given continuously captured first-person pictures with overlapping frames, the model is asked in which direction the potted plant lies relative to the viewer at the moment of the last picture (options: A front right, B back left, C back right, D front left); it reasons about the camera's rotation from the change in the plant's position, suggesting the plant is to the front right. Right, "Paper Folding": asked to mentally reverse the folding of a paper with cutouts and count the cutouts in the unfolded design, the model identifies a diagonal fold of the top-left corner, applies the principle that unfolding mirrors any hole on the moving flap across the fold line, notes that the square hole lies off the flap (no mirror) while the circle and triangle holes lie on it (each mirrored), and concludes there are 1 + 2 + 2 = 5 cutouts.
</details>
Figure 9: Showcases of interleaved verbal-visual chain-of-thought reasoning, generated by post-trained UMMs, where visual generation serves as world models. <image> denotes a placeholder indicating the position of a generated image.
Limitations and future work. This work primarily focuses on spatial and physical reasoning tasks, where multimodal AI systems exhibit a pronounced performance gap relative to humans. Many other tasks proposed in the related literature can also be interpreted through our world model perspective. For example, a prominent class of benchmarks involves visual jigsaw tasks [52, 17, 38, 77], in which input image patches are cropped, masked, or shuffled. Such tasks essentially probe a form of world reconstruction capability, as corrupted images and videos are commonly treated as specific views within the world model literature [3, 7, 4]. Another active area of interest lies in STEM reasoning. Recent work [51] leverages visual generation for mathematical diagram editing, such as constructing auxiliary geometric lines. This closely resembles how humans use visual sketchpads to support math understanding and reasoning, constructing visual world models of a symbolic system. However, as symbolic representations in mathematics are largely complete, and mathematical reasoning has been extensively optimized in modern LLMs, it remains unclear whether multimodal interleaved CoT can fundamentally break through the performance limit, warranting further investigation.
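To make the world-reconstruction framing of jigsaw benchmarks concrete, the shuffle-style corruption can be sketched as follows; `shuffle_patches` is a hypothetical helper, assuming an (H, W, C) NumPy image whose sides are divisible by the grid size.

```python
import numpy as np

def shuffle_patches(image: np.ndarray, grid: int, seed: int = 0):
    """Split an (H, W, C) image into a grid x grid jigsaw, shuffle the
    patches, and return the corrupted image together with the permutation
    (the reconstruction target a model must invert)."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[i*h:(i+1)*h, j*w:(j+1)*w]
               for i in range(grid) for j in range(grid)]
    perm = np.random.default_rng(seed).permutation(grid * grid)
    # Reassemble the shuffled patches row by row.
    rows = [np.concatenate([patches[perm[r*grid + c]] for c in range(grid)],
                           axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), perm
```

Recovering `perm` from the corrupted image is one instance of the world reconstruction capability discussed above; masking and cropping replace the permutation with an inpainting target.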
We do not apply reinforcement learning to the visual generation components of verbal-visual interleaved CoTs [39]. Prior work has shown that world models themselves can be improved through RLVR [65]. As discussed in Section 5.5, developing RL algorithms specifically tailored to interleaved verbal-visual generation may further improve world-model fidelity during reasoning and incentivize the emergence of stronger, more intriguing world-modeling capabilities.
The analysis of emergent representations for implicit world modeling in Figure 6 is intriguing but preliminary. We hope this result will rekindle interest in probing approaches [37] for interpreting the latent representations learned by different models. In particular, we are interested in comparing the internal representations of VLMs and UMMs, as the latter may capture complementary aspects of world knowledge through training for multimodal generation.
Artificial intelligence is increasingly being embodied in the physical world [23]. Our work, particularly the visual superiority hypothesis, suggests that learning visual world models is essential for embodied intelligence. Visual world modeling enables embodied agents to better understand their environments, from imagining occluded regions to interpreting user intentions from an egocentric perspective, thereby supporting more reliable and natural everyday services. It also facilitates planning and decision-making by allowing agents to mentally simulate the precise outcomes of potential actions, leading to more effective interaction with the world. Rather than relying on loosely coupled modules [15] or performing only single-step reasoning [73], we envision a future direction in which flexible multimodal world modeling and reasoning, empowered by interleaved verbal-visual generation within a unified model, form core components of physical and embodied AI.
## Acknowledgements
We would like to thank Yanwei Li, Rui Yang, Ziyu Zhu, and Feng Cheng for their assistance in constructing some preliminary training and test data. We also appreciate Xinchen Zhang, Jianhua Zhu, Yifan Du, Yuezhou Ma, Xingzhuo Guo, Ningya Feng, Shangchen Miao, and many colleagues for their valuable discussion.
## References
- Agarwal et al. [2025] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.
- Alonso et al. [2024] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In NeurIPS, 2024.
- Assran et al. [2023] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023.
- Assran et al. [2025] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.
- Bai et al. [2025a] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025a.
- Bai et al. [2025b] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025b.
- Bardes et al. [2024] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.
- Cai et al. [2025] Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, et al. Has gpt-5 achieved spatial intelligence? an empirical study. arXiv preprint arXiv:2508.13142, 2025.
- Chen et al. [2025] Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722, 2025.
- Copet et al. [2025] Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025.
- Craik [1967] Kenneth James Williams Craik. The nature of explanation, volume 445. CUP Archive, 1967.
- DeepMind [2025] Google DeepMind. Genie 3: A new frontier for world models. 2025.
- Deng et al. [2025] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- Du et al. [2025] Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao, and Youbin Wu. Revisiting the necessity of lengthy chain-of-thought in vision-centric reasoning generalization. arXiv preprint arXiv:2511.22586, 2025.
- Feng et al. [2025] Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, and Jianlan Luo. Reflective planning: Vision-language models for multi-stage long-horizon robotic manipulation. arXiv preprint arXiv:2502.16707, 2025.
- Forrester [1971] Jay W Forrester. Counterintuitive behavior of social systems. Theory and decision, 2(2):109–140, 1971.
- Gu et al. [2025] Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492, 2025.
- Guo et al. [2025a] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 2025a.
- Guo et al. [2025b] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025b.
- Guo et al. [2025c] Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, et al. Rbench-v: A primary assessment for visual reasoning models with multi-modal outputs. arXiv preprint arXiv:2505.16770, 2025c.
- Guo et al. [2025d] Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, and Pheng-Ann Heng. Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671, 2025d.
- Guo et al. [2025e] Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, et al. Can we generate images with cot? let’s verify and reinforce image generation step by step. In CVPR, 2025e.
- Gupta et al. [2021] Agrim Gupta, Silvio Savarese, Surya Ganguli, and Li Fei-Fei. Embodied intelligence via learning and evolution. Nature Communications, 2021.
- Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018.
- Hafner et al. [2025] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 2025.
- Hansen et al. [2022] Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. In ICML, 2022.
- Huh et al. [2024] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In ICML, 2024.
- Hurst et al. [2024] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Ivanitskiy et al. [2023] Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, and Samy Wu Fung. A configurable library for generating and manipulating maze datasets. arXiv preprint arXiv:2309.10498, 2023.
- Johnson et al. [2017] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
- Johnson-Laird [1983] PN Johnson-Laird. Mental models: Towards a cognitive science of language, inference, and consciousness. Harvard University Press, 1983.
- Lakoff and Núñez [2000] George Lakoff and Rafael Núñez. Where mathematics comes from, volume 6. New York: Basic Books, 2000.
- Landy and Goldstone [2007] David Landy and Robert L Goldstone. How abstract is symbolic thought? Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(4):720, 2007.
- LeCun [2022] Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022.
- Li et al. [2025a] Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. Zebra-cot: A dataset for interleaved vision language reasoning. arXiv preprint arXiv:2507.16746, 2025a.
- Li et al. [2025b] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. In ICML, 2025b.
- Li et al. [2023] Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In ICLR, 2023.
- Liang et al. [2025] Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, and Furong Huang. Rover: Benchmarking reciprocal cross-modal reasoning for omnimodal generation. arXiv preprint arXiv:2511.01163, 2025.
- Liu et al. [2025] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. In NeurIPS, 2025.
- Loomis et al. [1991] JM Loomis, RL Klatzky, RG Golledge, and JG Cicinelli. Mental models, psychology of. Psychology, 14:56–89, 1991.
- Ma et al. [2025] Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In CVPR, 2025.
- Mon-Williams et al. [2025] Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, and Christopher G Lucas. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence, 2025.
- Monzel and Reuter [2024] Merlin Monzel and Martin Reuter. Where’s wanda? the influence of visual imagery vividness on visual search speed measured by means of hidden object pictures. Attention, Perception, & Psychophysics, 86(1):22–27, 2024.
- Norman [2014] Donald A Norman. Some observations on mental models. In Mental models, pages 7–14. Psychology Press, 2014.
- Paivio [1990] Allan Paivio. Mental representations: A dual coding approach. Oxford university press, 1990.
- Pan et al. [2025] Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 2020.
- Schulze Buschoff et al. [2025] Luca M Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. Visual cognition in multimodal large language models. Nature Machine Intelligence, 2025.
- Seedream et al. [2025] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.
- Shi et al. [2025a] Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, et al. Mathcanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning. arXiv preprint arXiv:2510.14958, 2025a.
- Shi et al. [2025b] Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, et al. Realunify: Do unified models truly benefit from unification? a comprehensive benchmark. arXiv preprint arXiv:2509.24897, 2025b.
- Swanson et al. [2025] Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies. Nature, 2025.
- Team [2024] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- Tong et al. [2025a] Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, et al. Game-rl: Synthesizing multimodal verifiable game data to boost vlms’ general reasoning. arXiv preprint arXiv:2505.13886, 2025a.
- Tong et al. [2025b] Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. In ICCV, 2025b.
- Trinh et al. [2024] Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024.
- Tu et al. [2025] Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. Towards conversational diagnostic artificial intelligence. Nature, 2025.
- Wang et al. [2025a] Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, et al. Vagen: Reinforcing world model reasoning for multi-turn vlm agents. In NeurIPS, 2025a.
- Wang et al. [2024a] Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, and Peter Jansen. Can language models serve as text-based world simulators? In ACL, 2024a.
- Wang et al. [2025b] Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, and Jun Wang. Spatialviz-bench: Automatically generated spatial visualization reasoning tasks for mllms. arXiv preprint arXiv:2507.07610, 2025b.
- Wang et al. [2024b] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024b.
- Wu et al. [2025a] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In CVPR, 2025a.
- Wu et al. [2024] Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models. In NeurIPS, 2024.
- Wu et al. [2025b] Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long. Rlvr-world: Training world models with reinforcement learning. In NeurIPS, 2025b.
- Xie et al. [2025] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In ICLR, 2025.
- Xu et al. [2025] Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vulić. Visual planning: Let’s think only with images. arXiv preprint arXiv:2505.11409, 2025.
- Yang et al. [2025a] Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning. arXiv preprint arXiv:2511.05491, 2025a.
- Yang et al. [2025b] Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025b.
- Yao et al. [2025] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Chi Chen, Haoyu Li, Weilin Zhao, et al. Efficient gpt-4v level multimodal large language model for deployment on edge devices. Nature Communications, 2025.
- Yin et al. [2025] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV’25, 2025.
- Zhang et al. [2025] Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025.
- Zhao et al. [2025a] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In CVPR, 2025a.
- Zhao et al. [2025b] Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. In NeurIPS, 2025b.
- Zhou et al. [2025a] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. In ICLR, 2025a.
- Zhou et al. [2025b] Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, et al. When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought. arXiv preprint arXiv:2511.02779, 2025b.
- Zou et al. [2025] Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-mmmu: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759, 2025.
## 7 Theoretical Analysis
### 7.1 Informativeness
In this section, we provide the rigorous version of our world model-based chain-of-thought formulations, and proofs for Theorem 1 and Theorem 2.
Formal problem setup and assumptions. Given a question $Q$ and input images $I$ , multimodal reasoning generates a chain-of-thought process $R$ , followed by a final answer $A$ . We explicitly formulate the reasoning process as an interleaving of logic reasoning steps and observations of the underlying MOMDP defined in Section 3.1: $R=(r_{1},o_{1}),(r_{2},o_{2}),\dots,(r_{H},o_{H})$ where $H$ denotes the (fixed) CoT length. For notation convenience, we denote the input image(s) as the initial observation $o_{0}$ .
We assume that each MOMDP observation function admits a two-stage decomposition: $e_{\phi}(s)=g_{\phi_{m}}\left(f_{\phi_{s}}(s)\right),\Phi=\Phi_{s}\times\Phi_{m}$ , where the inner modality-agnostic mapping $f_{\phi_{s}}$ (parameterized by $\phi_{s}\in\Phi_{s}$ ) extracts a slice of the underlying state $s$ , retaining only partial state information, and the outer modality-specific mapping $g_{\phi_{m}}$ (parameterized by $\phi_{m}\in\Phi_{m}$ ) renders the extracted slice into a particular observation modality.
Under this decomposition, we assume that reasoning across different modalities of observations shares a common underlying oracle reasoning process:
$$
p(Q,\bar{s}_{0},r_{1},\bar{s}_{1},\dots,r_{H},\bar{s}_{H},A)=p(Q)\left[\prod_{i=1}^{H}p(r_{i}|\bar{s}_{0:{i-1}},r_{1:i-1},Q)p(\bar{s}_{i}|\bar{s}_{0:i-1},r_{1:i},Q)\right]p(A|r_{1:H},\bar{s}_{0:H},Q),
$$
where $\bar{s}_{i}=(s_{i},{\phi_{s}}_{i})\in\mathcal{S}\times\Phi_{s}$ denotes a modality-agnostic sliced state. Each logic step $r_{i}$ is assumed to reason on sufficient sliced state information: $p(r_{i}\mid\bar{s}_{0:i-1},r_{1:i-1},Q)=p\!\left(r_{i}\mid f_{{\phi_{s}}_{0}}(s_{0}),\dots,f_{{\phi_{s}}_{i-1}}(s_{i-1}),r_{1:i-1},Q\right),$ and produces actionable outcomes that either (i) transit a previous world state $s_{j<i}$ via an implicit action $a_{i}$, giving $\bar{s}_{i}=(s_{i},{\phi_{s}}_{j})$ with $s_{i}\sim p(\cdot\mid s_{j},a_{i})$, or (ii) query the same underlying world state with a new slice ${\phi_{s}}_{i}$, yielding $\bar{s}_{i}=(s_{j},{\phi_{s}}_{i})$. The oracle reasoning process is then rendered into a specific modality via $o_{i}=g_{\phi_{m}}\!\left(f_{{\phi_{s}}_{i}}(s_{i})\right).$ Unless otherwise specified, we abuse notation and use $s_{i}$ to denote $\bar{s}_{i}=(s_{i},{\phi_{s}}_{i})$ in the remainder of our analysis.
Given the above oracle CoT generation process, we learn a model $p_{\theta}$ whose joint distribution over CoTs and answers factorizes into a reasoning component and a world-modeling component:
$$
\displaystyle p_{\theta}(R,A|Q,I)=p_{\theta}(r_{1},o_{1},r_{2},o_{2},\dots,r_{H},o_{H},r_{H+1}|r_{0},o_{0})=\prod_{i=1}^{H+1}p_{\theta}(r_{i}|R_{i})\prod_{i=1}^{H}p_{\theta}(o_{i}|\tilde{R}_{i}), \tag{8}
$$
where we denote the question as $r_{0}$ , the initial observation (input image(s)) as $o_{0}$ , and the final answer as $r_{H+1}$ . The CoT prefixes are defined as $R_{i}=(r_{0},o_{0},r_{1},o_{1},\dots,r_{i-1},o_{i-1}),\tilde{R}_{i}=(r_{0},o_{0},r_{1},o_{1},\dots,r_{i-1},o_{i-1},r_{i}).$
Proofs. We provide proofs of Theorem 1 and Theorem 2 below.
**Theorem 3 (Restatement of Theorem 1)**
*For any observation modality $m$ , the following inequality holds:
$$
\begin{aligned}
\operatorname{KL}(p(A\mid Q,I)\mid\mid p_{\theta}(A\mid Q,I))&\leq\operatorname{KL}(p(R,A\mid Q,I)\mid\mid p_{\theta}(R,A\mid Q,I))\\
&=\sum_{i=1}^{H+1}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(r_{i}|R_{i})\mid\mid p_{\theta}(r_{i}|R_{i}))\right]}_{\textnormal{reasoning errors}}+\sum_{i=1}^{H}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(o_{i}|\tilde{R}_{i})\mid\mid p_{\theta}(o_{i}|\tilde{R}_{i}))\right]}_{\textnormal{world modeling errors}}.
\end{aligned} \tag{9}
$$*
*Proof.*
The first inequality follows from the data processing inequality: marginalizing out $R$ cannot increase the KL divergence. For the equality, we apply the chain rule for KL divergence together with the CoT factorization in Eq. (8). In particular, substituting the factorizations of $p(R,A\mid Q,I)$ and $p_{\theta}(R,A\mid Q,I)$ into $\operatorname{KL}(p(R,A\mid Q,I)\,\|\,p_{\theta}(R,A\mid Q,I))$ leads to the stated decomposition. ∎
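The decomposition above can be checked numerically. The following Python sketch builds a hypothetical one-step CoT (binary reasoning step $r_1$, observation $o_1$, and answer $A$; all probability tables are invented for illustration) and verifies both the chain-rule equality and the data-processing inequality in Eq. (9):

```python
import itertools
import math

# Toy check of Theorem 3 / Eq. (9): the joint CoT KL decomposes into
# per-step reasoning and world-modeling KLs, and upper-bounds the
# answer-marginal KL via the data processing inequality.
# All distributions below are hypothetical, not from the paper.

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Conditionals: p(r1), p(o1 | r1), p(A | r1, o1); all variables binary.
p_r = [0.6, 0.4]
p_o = {0: [0.7, 0.3], 1: [0.2, 0.8]}
p_a = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
       (1, 0): [0.4, 0.6], (1, 1): [0.1, 0.9]}

q_r = [0.5, 0.5]
q_o = {0: [0.6, 0.4], 1: [0.3, 0.7]}
q_a = {(0, 0): [0.8, 0.2], (0, 1): [0.5, 0.5],
       (1, 0): [0.5, 0.5], (1, 1): [0.2, 0.8]}

def joint(pr, po, pa):
    return {(r, o, a): pr[r] * po[r][o] * pa[(r, o)][a]
            for r, o, a in itertools.product([0, 1], repeat=3)}

jp, jq = joint(p_r, p_o, p_a), joint(q_r, q_o, q_a)
states = sorted(jp)

# Joint KL over the full chain (R, A).
kl_joint = kl([jp[s] for s in states], [jq[s] for s in states])

# Stepwise decomposition: reasoning terms (r1 and A) plus the
# world-modeling term (o1), each averaged under the true prefix.
kl_steps = (kl(p_r, q_r)
            + sum(p_r[r] * kl(p_o[r], q_o[r]) for r in [0, 1])
            + sum(p_r[r] * p_o[r][o] * kl(p_a[(r, o)], q_a[(r, o)])
                  for r in [0, 1] for o in [0, 1]))

# Answer marginals.
pA = [sum(jp[(r, o, a)] for r in [0, 1] for o in [0, 1]) for a in [0, 1]]
qA = [sum(jq[(r, o, a)] for r in [0, 1] for o in [0, 1]) for a in [0, 1]]

assert abs(kl_joint - kl_steps) < 1e-12   # chain rule (the equality in Eq. 9)
assert kl(pA, qA) <= kl_joint + 1e-12     # data processing inequality
```

Marginalizing out the chain can only discard discrepancy, which is exactly why verbal-only and interleaved CoTs are compared at the level of the stepwise error terms.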
**Theorem 4 (Restatement of Theorem 2)**
*For any observation modality $m$ , the reduction in reasoning uncertainty achieved by explicit world modeling satisfies:
1. Reasoning uncertainty does not increase: $\mathbb{H}(r_{i}\mid o_{0},r_{0:i-1})-\mathbb{H}(r_{i}|R_{i})=\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})\geq 0.$
2. Uncertainty reduction is upper-bounded by both (i) the information that observations provide about the underlying states and (ii) the information that the reasoning step requires about those states:
$$
\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})\leq\min\left(\mathbb{I}(o_{1:i-1};s_{1:i-1}),\mathbb{I}(r_{i};s_{0:i-1},r_{0:i-1})\right). \tag{10}
$$*
*Proof.*
The first property follows from the definition and the non-negativity of mutual information. For the second property, denote the conditioning context as $C=(o_{0},r_{0:i-1})$ . Using the identity for ternary (interaction) mutual information, $\mathbb{I}(X;Y;Z)=\mathbb{I}(X;Y)-\mathbb{I}(X;Y\mid Z)$, we obtain
$$
\begin{aligned}
\mathbb{I}(o_{1:i-1};r_{i}\mid C)&=\mathbb{I}(o_{1:i-1};r_{i}\mid C)-\mathbb{I}(o_{1:i-1};r_{i}\mid s_{1:i-1},C)=\mathbb{I}(s_{1:i-1};o_{1:i-1};r_{i}\mid C)\\
&=\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid C)-\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid r_{i},C)\leq\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid C),
\end{aligned} \tag{11}
$$
where $\mathbb{I}(o_{1:i-1};r_{i}|s_{1:i-1},C)=0$ follows from the conditional independence $r_{i}\perp o_{1:i-1}\mid s_{1:i-1}$ . Further, since $o$ is a deterministic function of $s$ , we have:
$$
\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid C)=\mathbb{H}(o_{1:i-1}\mid C)-\mathbb{H}(o_{1:i-1}\mid s_{1:i-1},C),
$$
where $\mathbb{H}(o_{1:i-1}\mid s_{1:i-1})=\mathbb{H}(o_{1:i-1}\mid s_{1:i-1},C)=0.$ Symmetrically, we have:
$$
\mathbb{I}(o_{1:i-1};r_{i}\mid C)=\mathbb{I}(s_{1:i-1};o_{1:i-1};r_{i}\mid C)\leq\mathbb{I}(s_{1:i-1};r_{i}\mid C)=\mathbb{H}(r_{i}\mid C)-\mathbb{H}(r_{i}\mid s_{1:i-1},C),
$$
where $\mathbb{H}(r_{i}\mid s_{0:i-1},r_{0:i-1})\leq\mathbb{H}(r_{i}\mid s_{1:i-1},o_{0},r_{0:i-1})$ by the data processing inequality, since $o_{0}$ is a deterministic function of $s_{0}$ . Combining the two upper bounds proves Eq. (10). ∎
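As a numerical sanity check on Eq. (10), the sketch below constructs a hypothetical toy chain $o \leftarrow s \rightarrow r$ (the observation is a lossy deterministic function of the state, and the reasoning step depends on the state only, with a trivial conditioning context) and verifies that $\mathbb{I}(o;r)$ never exceeds either upper bound. The distributions are illustrative, not from the paper:

```python
import math

# Toy check of Eq. (10): with o a deterministic lossy function of s and
# r depending on s only, I(o; r) <= min(I(o; s), I(r; s)).
# The state prior and reasoning table are hypothetical.

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

def mutual_info(joint, a_idx, b_idx):
    """I(A;B) = H(A) + H(B) - H(A,B) from a joint dict keyed by tuples."""
    return (entropy(marginal(joint, a_idx)) + entropy(marginal(joint, b_idx))
            - entropy(marginal(joint, a_idx + b_idx)))

# State s uniform on {0,1,2,3}; the observation o = s % 2 discards one bit.
p_r_given_s = {0: [0.9, 0.1], 1: [0.7, 0.3], 2: [0.2, 0.8], 3: [0.4, 0.6]}

joint = {}
for s in range(4):
    o = s % 2                      # deterministic, partially informative slice
    for r in [0, 1]:
        joint[(s, o, r)] = 0.25 * p_r_given_s[s][r]

i_or = mutual_info(joint, (1,), (2,))   # I(o; r): uncertainty reduction
i_os = mutual_info(joint, (1,), (0,))   # I(o; s): what o reveals about s
i_rs = mutual_info(joint, (2,), (0,))   # I(r; s): what r needs from s

assert i_or <= min(i_os, i_rs) + 1e-12  # both upper bounds of Eq. (10)
```

Making the observation fully informative (e.g., an injective rendering of $s$) raises `i_os` but leaves the `i_rs` cap intact, mirroring the two-sided bound.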
**Corollary 1**
*If observations are fully informative about the underlying states, i.e., $\mathbb{H}(s_{i}\mid o_{i})=0$ for all $i$ , and the state transition dynamics are deterministic, then explicit world modeling provides no reduction in reasoning uncertainty: $\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})=0.$*
*Proof.*
By Eq. (11), we have
$$
\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})\leq\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid o_{0},r_{0:i-1})\leq\mathbb{H}(s_{1:i-1}\mid o_{0},r_{0:i-1}).
$$
Under the assumption $\mathbb{H}(s_{0}\mid o_{0})=0$ , the initial observation $o_{0}$ uniquely determines $s_{0}$ . Moreover, under deterministic state transitions, the trajectory $s_{1:i-1}$ is uniquely determined by $(s_{0},r_{1:i-1})$ . Hence,
$$
\mathbb{H}(s_{1:i-1}\mid o_{0},r_{0:i-1})\leq\mathbb{H}(s_{1:i-1}\mid s_{0},r_{1:i-1})=0.
$$
Therefore, $\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})=0$, which proves the corollary. ∎
Remarks. Corollary 1 shows that in deterministic and fully observable environments, given sufficient data and model capacity, explicit world modeling provides no additional benefit. This theoretical result is consistent with our empirical findings on the simple maze task.
### 7.2 Prior Knowledge
In this section, we first derive a generalization bound for transfer learning under distribution shift and then relate it to our perspective on prior knowledge in multimodal reasoning.
#### 7.2.1 General Transfer Learning Analysis
Problem setup. A standard transfer learning setup involves a pre-training data distribution $P$ and a fine-tuning data distribution $Q$ over samples $(x,y)\in\mathcal{X}\times\mathcal{Y}$ , and a loss function $\ell_{\theta}(x,y)\in[0,1]$ . Define the population risks $\mathcal{L}_{D}(\theta):=\mathbb{E}_{(x,y)\sim D}[\ell_{\theta}(x,y)]$ and the population minimizers $\theta_{D}^{\star}\in\arg\min_{\theta\in\Theta}\mathcal{L}_{D}(\theta)$ for $D\in\{P,Q\}$ . We assume we can obtain $\theta_{P}^{\star}$ as the pre-trained model given sufficient data and optimization. For a radius $r>0$ , we then define the fine-tuning constraint set (local neighborhood around the pre-trained model)
$$
\Theta_{r}:=\{\theta\in\Theta:\ \|\theta-\theta_{P}^{\star}\|\leq r\}.
$$
Given $n$ i.i.d. samples $S=\{(x_{i},y_{i})\}_{i=1}^{n}$ from $Q$ , the fine-tuned model $\theta_{Q}$ minimizes the empirical risk $\widehat{\mathcal{L}}_{Q}(\theta):=\frac{1}{n}\sum_{i=1}^{n}\ell_{\theta}(x_{i},y_{i})$ over $\Theta_{r}$ . Our analysis focuses on the excess risk on $Q$ : $\mathcal{E}_{Q}(\theta_{Q}):=\mathcal{L}_{Q}(\theta_{Q})-\mathcal{L}_{Q}(\theta_{Q}^{\star}).$
From distribution shift to parameter drift. We first derive how the distribution shift relates to the shift of the population minimizer.
**Lemma 1 (Uniform Loss Shift under Total Variation)**
*For any subset $\mathcal{S}\subseteq\Theta$ ,
$$
\sup_{\theta\in\mathcal{S}}\big|\mathcal{L}_{Q}(\theta)-\mathcal{L}_{P}(\theta)\big|\leq\mathrm{TV}(P,Q).
$$*
*Proof.*
Fix any $\theta\in\mathcal{S}$ and define $f_{\theta}(x,y):=\ell_{\theta}(x,y)\in[0,1]$ . By the definition of total variation and the standard inequality for bounded functions, $\big|\mathbb{E}_{Q}[f_{\theta}]-\mathbb{E}_{P}[f_{\theta}]\big|\leq\mathrm{TV}(P,Q).$ Taking the supremum over $\theta\in\mathcal{S}$ yields the claim. ∎
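Lemma 1 can be verified empirically: for any loss bounded in $[0,1]$, the gap between the two population risks never exceeds the total variation distance. The sketch below uses randomly generated toy distributions and loss tables over a small discrete sample space (all values are illustrative):

```python
import random

# Empirical check of Lemma 1: |E_Q[f] - E_P[f]| <= TV(P, Q)
# for f bounded in [0, 1], over random discrete toy instances.

random.seed(0)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

K = 8  # size of the discrete sample space
for _ in range(100):
    P = normalize([random.random() for _ in range(K)])
    Q = normalize([random.random() for _ in range(K)])
    f = [random.random() for _ in range(K)]            # loss values in [0, 1]
    tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))   # total variation
    gap = abs(sum(q * fi for q, fi in zip(Q, f)) -
              sum(p * fi for p, fi in zip(P, f)))
    assert gap <= tv + 1e-12
```

The bound is tight when the loss is an indicator of the event $\{Q>P\}$, which is the extremal choice in the definition of total variation.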
**Lemma 2 (Risk Proximity of $\theta_{Q}^{\star}$ under $P$)**
*$$
\mathcal{L}_{P}(\theta_{Q}^{\star})\leq\mathcal{L}_{P}(\theta_{P}^{\star})+2\mathrm{TV}(P,Q). \tag{12}
$$*
*Proof.*
By Lemma 1, $\mathcal{L}_{P}(\theta_{Q}^{\star})\leq\mathcal{L}_{Q}(\theta_{Q}^{\star})+\mathrm{TV}(P,Q).$ By optimality of $\theta_{Q}^{\star}$ on $Q$ , $\mathcal{L}_{Q}(\theta_{Q}^{\star})\leq\mathcal{L}_{Q}(\theta_{P}^{\star}).$ Applying Lemma 1 again, $\mathcal{L}_{Q}(\theta_{P}^{\star})\leq\mathcal{L}_{P}(\theta_{P}^{\star})+\mathrm{TV}(P,Q).$ Chaining the three inequalities proves (12). ∎
**Assumption 1 (Local Quadratic Growth / Sharpness of $\mathcal{L}_{P}$)**
*There exists $\mu>0$ such that for all $\theta$ in a neighborhood containing $\theta_{Q}^{\star}$ ,
$$
\mathcal{L}_{P}(\theta)\geq\mathcal{L}_{P}(\theta_{P}^{\star})+\frac{\mu}{2}\|\theta-\theta_{P}^{\star}\|^{2}.
$$*
**Lemma 3 (Parameter Drift Controlled by $\mathrm{TV}(P,Q)$)**
*Under Assumption 1,
$$
\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|\leq\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)}. \tag{13}
$$*
*Proof.*
By Assumption 1 with $\theta=\theta_{Q}^{\star}$ , $\mathcal{L}_{P}(\theta_{Q}^{\star})\geq\mathcal{L}_{P}(\theta_{P}^{\star})+\frac{\mu}{2}\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|^{2}.$ Rearranging, $\frac{\mu}{2}\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|^{2}\leq\mathcal{L}_{P}(\theta_{Q}^{\star})-\mathcal{L}_{P}(\theta_{P}^{\star}).$ Applying Lemma 2 yields $\frac{\mu}{2}\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|^{2}\leq 2\mathrm{TV}(P,Q),$ and hence (13). ∎
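A minimal numerical sketch of Lemma 3, assuming a hypothetical Bernoulli instance: with $x\sim\mathrm{Bernoulli}(p)$ under $P$ and $\mathrm{Bernoulli}(q)$ under $Q$ and squared loss $\ell_{\theta}(x)=(\theta-x)^{2}\in[0,1]$ for $\theta\in[0,1]$, the population minimizers are the means, the quadratic-growth constant is $\mu=2$, and $\mathrm{TV}(P,Q)=|p-q|$, so the drift bound (13) can be checked directly:

```python
import math
import random

# Hypothetical Bernoulli instance for Lemma 3.
# L_D(theta) = (theta - p_D)^2 + p_D (1 - p_D), so theta_D* = p_D,
# the growth constant is mu = 2, and TV(P, Q) = |p - q|.

random.seed(0)
mu = 2.0
for _ in range(100):
    p, q = random.random(), random.random()
    tv = abs(p - q)                 # total variation between the Bernoullis
    drift = abs(q - p)              # ||theta_Q* - theta_P*||
    assert drift <= math.sqrt(4.0 / mu * tv) + 1e-12   # Eq. (13)
```

In this instance the drift equals the shift itself, and the bound $\sqrt{2\,\mathrm{TV}}$ is loose for small shifts, reflecting that (13) only controls the worst case over all $\mu$-sharp pre-training losses.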
Control of the bias term. Recall the fine-tuning bias induced by restricting to $\Theta_{r}$ : $\varepsilon_{\mathrm{bias}}(r):=\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)-\mathcal{L}_{Q}(\theta_{Q}^{\star}).$
**Assumption 2 ($\mathcal{L}_{Q}$ is Locally Lipschitz)**
*There exists $L_{Q}>0$ such that for all $\theta,\theta^{\prime}\in\Theta_{r}$ ,
$$
|\mathcal{L}_{Q}(\theta)-\mathcal{L}_{Q}(\theta^{\prime})|\leq L_{Q}\|\theta-\theta^{\prime}\|.
$$*
**Theorem 5 (Bias Bound via Distribution Shift)**
*Under Assumption 1 and Assumption 2,
$$
\varepsilon_{\mathrm{bias}}(r)\leq L_{Q}\left(\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)}-r\right)_{+}, \tag{14}
$$
where $(x)_{+}:=\max\{x,0\}$ . In particular, if $r\geq\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)},$ then $\varepsilon_{\mathrm{bias}}(r)=0$ .*
*Proof.*
If $r\geq\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|$ , then $\theta_{Q}^{\star}\in\Theta_{r}$ and thus $\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)\leq\mathcal{L}_{Q}(\theta_{Q}^{\star})$ , implying $\varepsilon_{\mathrm{bias}}(r)=0$ . Now consider $r<\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|$ . Let $\theta_{r}$ be the projection of $\theta_{Q}^{\star}$ onto the closed ball $\Theta_{r}$ , i.e., $\theta_{r}:=\theta_{P}^{\star}+r\cdot\frac{\theta_{Q}^{\star}-\theta_{P}^{\star}}{\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|}$ . Then $\theta_{r}\in\Theta_{r}$ and $\|\theta_{r}-\theta_{Q}^{\star}\|=\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|-r.$ Therefore,
$$
\varepsilon_{\mathrm{bias}}(r)=\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)-\mathcal{L}_{Q}(\theta_{Q}^{\star})\leq\mathcal{L}_{Q}(\theta_{r})-\mathcal{L}_{Q}(\theta_{Q}^{\star})\leq L_{Q}\|\theta_{r}-\theta_{Q}^{\star}\|=L_{Q}(\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|-r).
$$
Using Lemma 3 to bound $\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|$ completes the proof of (14). ∎
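The bias bound (14) can likewise be checked on a hypothetical Bernoulli instance (squared loss, $\mu=2$, Lipschitz constant $L_{Q}=2$ on $[0,1]$, $\mathrm{TV}(P,Q)=|p-q|$), where the restricted-ball bias is available in closed form as $\varepsilon_{\mathrm{bias}}(r)=\big((|p-q|-r)_{+}\big)^{2}$:

```python
import math
import random

# Hypothetical Bernoulli instance for Theorem 5: theta_P* = p,
# theta_Q* = q, mu = 2, TV = |p - q|, and L_Q(theta) = (theta - q)^2
# + q(1 - q) is 2-Lipschitz on [0, 1]. Restricting fine-tuning to
# |theta - p| <= r gives eps_bias(r) = (max(|p - q| - r, 0))^2 exactly.

random.seed(0)
mu, lip = 2.0, 2.0
for _ in range(100):
    p, q, r = random.random(), random.random(), random.random()
    tv = abs(p - q)
    eps_bias = max(tv - r, 0.0) ** 2                      # exact bias
    bound = lip * max(math.sqrt(4.0 / mu * tv) - r, 0.0)  # Eq. (14)
    assert eps_bias <= bound + 1e-12
```

As the theorem states, choosing the radius $r\geq\sqrt{2\,\mathrm{TV}}$ drives the bound, and hence the bias, to zero in this instance.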
Fine-tuning excess risk bound. We then arrive at the final result:
**Theorem 6 (Fine-tuning Excess Risk Bound with Shift-Controlled Bias)**
*Assume Assumptions 1 and 2 and uniform convergence over $\Theta_{r}$ holds: with probability at least $1-\delta$ over samples $S$ ,
$$
\sup_{\theta\in\Theta_{r}}\big|\mathcal{L}_{Q}(\theta)-\widehat{\mathcal{L}}_{Q}(\theta)\big|\leq\varepsilon_{\mathrm{gen}}=O\!\left(\sqrt{\frac{\operatorname{Rad}_{Q,n}(\Theta_{r})+\log(1/\delta)}{n}}\right),
$$
where $\operatorname{Rad}_{Q,n}(\Theta_{r})$ is the Rademacher complexity of the function class $\{\ell_{\theta}:\theta\in\Theta_{r}\}$ with respect to $Q$ for sample size $n$ . Then with probability at least $1-\delta$ ,
$$
\mathcal{E}_{Q}(\theta_{Q})\leq 2\varepsilon_{\mathrm{gen}}+L_{Q}\left(\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)}-r\right)_{+}. \tag{15}
$$*
*Proof.*
Decompose the excess risk as $\mathcal{E}_{Q}(\theta_{Q})=\Big(\mathcal{L}_{Q}(\theta_{Q})-\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)\Big)+\varepsilon_{\mathrm{bias}}(r).$ The first term is bounded by a standard ERM argument using uniform convergence: $\mathcal{L}_{Q}(\theta_{Q})-\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)\leq 2\varepsilon_{\mathrm{gen}}.$ The second term is bounded by Theorem 5. Combining the two bounds yields (15). ∎
#### 7.2.2 Remarks on Multimodal Reasoning
Theorem 6 reveals a trade-off between modality complexity and distribution shift. This general transfer learning analysis can be instantiated in our setting of learning world models and reasoning policies. Specifically, training pairs $(x,y)$ can be instantiated as $((o_{0:i},r_{0:{i+1}}),o_{i+1})$ for world modeling and $((o_{0:i},r_{0:{i}}),r_{i})$ for reasoning, respectively. Crucially, the distribution shift between large-scale pre-training data and downstream tasks may differ substantially across modalities. For example, there are abundant visual demonstrations of paper folding on the Internet, whereas detailed verbal descriptions of folding dynamics are comparatively scarce. This suggests that downstream tasks should be formulated under the most appropriate observation modality for world modeling and reasoning—i.e., the modality that best aligns with pre-training data—in order to achieve stronger generalization at inference time and higher sample efficiency during post-training.
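The instantiation of training pairs described above can be sketched in code. This is an assumption-laden sketch: only the pair structure is specified in the text, so the exact slicing convention for the index ranges $r_{0:i+1}$ and $r_{0:i}$ is our illustrative choice.

```python
def build_training_pairs(observations, reasoning):
    """Instantiate transfer-learning pairs (x, y) from one interleaved
    trajectory o_0, r_0, o_1, r_1, ...  Sketch only: the paper's index
    ranges are read here as Python-style slices."""
    world_model, policy = [], []
    for i in range(len(observations) - 1):
        # world modeling: predict the next observation from the history
        world_model.append(((observations[:i + 1], reasoning[:i + 1]),
                            observations[i + 1]))
        # reasoning policy: predict the next reasoning step from the history
        policy.append(((observations[:i + 1], reasoning[:i]), reasoning[i]))
    return world_model, policy

wm, pol = build_training_pairs(["o0", "o1", "o2"], ["r0", "r1", "r2"])
```

Under this reading, the two pair types share inputs up to one reasoning step, which is what makes the per-modality distribution shift the deciding factor.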
## 8 Experiment Details
### 8.1 VisWorld-Eval and Training Data
In this section, we elaborate on the construction of training and test data for each task in VisWorld-Eval.
Paper folding. This task involves folding a paper grid with varying grid sizes (3–8) and folding steps (1–4). After folding, holes of different shapes—circles, triangles, stars, diamonds, and squares—are punched into the paper. The model is then asked to predict the distribution of holes after the paper is completely unfolded, including queries such as the total number of holes, the number of holes of a specific shape, or the difference in counts between shapes. All test prompts are constructed at the highest difficulty level (grid size 8 with 4 folding steps). For SFT, we generate chains of thought using rule-based templates that follow a fixed procedure: unfold the paper step-by-step and then count the resulting holes by shape. These CoTs are then rewritten with Gemini 2.5 Pro to improve clarity and logical coherence. Under visual world modeling, we interleave reasoning steps with images of partially unfolded paper states. Under verbal world modeling, we represent intermediate states using two matrices encoding grid coverage status and hole shape at each position. Under implicit world modeling, we omit the explicit state tracking present in the original CoTs.
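The fixed unfolding procedure in the rule-based templates amounts to mirroring hole coordinates across each fold line, in reverse order of the folds. A minimal sketch for one vertical fold under an illustrative coordinate convention (`unfold_vertical` is not the paper's actual generator):

```python
def unfold_vertical(holes, fold_col):
    """Undo one vertical fold. The paper's right part was folded onto
    columns [0, fold_col), so each hole (r, c) gains a mirror image at
    column 2 * fold_col - 1 - c."""
    return set(holes) | {(r, 2 * fold_col - 1 - c) for r, c in holes}

# One hole punched through a 1x2 strip folded in half unfolds into two holes.
holes = unfold_vertical({(0, 0)}, 1)
```

Counting by shape then reduces to tagging each coordinate with its punched shape and tallying after all folds are undone.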
Multi-hop manipulation. This task begins with an initial arrangement of several geometric objects (cubes, spheres, and cylinders) in various colors, rendered by Blender https://www.blender.org/. A sequence of text-based instructions is then provided, describing operations such as changing or swapping objects’ color or shape, adding new objects, or removing existing ones. To ensure these commands can be interpreted unambiguously in a 3D space, the instructions consistently use relative spatial references, with each object uniquely identified by its combined color and shape attributes—for example: "Place a purple cylinder between the black sphere and the yellow cube." The model is asked to infer the resulting spatial layout. Queries may include the total number of objects of a specific shape, the directional relationship between two objects, or which object lies in a given direction relative to a reference object. Test prompts are constructed by varying both the number of initial objects (between 3 and 6) and the frequency (between 1 and 5) of different operation types. For SFT, chain-of-thought reasoning is generated using rule-based templates that simulate the stepwise execution of instructions before answering the final query, and these CoTs are subsequently refined with Gemini 2.5 Pro.
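The stepwise execution simulated by the rule-based templates can be sketched as a tiny scene-state interpreter. The op schema and object identifiers below are hypothetical, not the paper's instruction format:

```python
def apply_ops(scene, ops):
    """Execute a sequence of scene edits for a multi-hop manipulation task.
    scene maps an identifier like 'gray sphere' to (color, shape, position)."""
    for op in ops:
        if op["op"] == "change":    # change an object's color and/or shape in place
            _, _, pos = scene.pop(op["target"])
            scene[f"{op['color']} {op['shape']}"] = (op["color"], op["shape"], pos)
        elif op["op"] == "add":     # place a new object
            scene[f"{op['color']} {op['shape']}"] = (op["color"], op["shape"], op["pos"])
        elif op["op"] == "remove":  # delete an existing object
            scene.pop(op["target"])
    return scene

scene = apply_ops(
    {"gray sphere": ("gray", "sphere", (0, 0))},
    [{"op": "change", "target": "gray sphere", "color": "rose", "shape": "cylinder"}],
)
```

Because each object is uniquely identified by its color–shape pair, the identifier itself must be rewritten on a change, mirroring why the task's instructions rely on unambiguous attribute references.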
Ball tracking. This task features a red point-mass ball that moves at constant speed, reflects elastically off solid walls, and travels in the initial direction indicated by a green arrow. The model is asked to predict which numbered hole at the top of the image the ball will enter first. We generate input images with randomized resolution, initial ball position and direction, and a random number of holes (4–8). For test prompts, we select cases in which the ball trajectory reflects off at least one wall before entering a hole. For SFT, CoTs are generated by Seed 1.6, which is asked to explain the ball dynamics between adjacent frames.
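The dynamics described above (constant speed, elastic reflection, point mass) can be sketched as a small time-stepped simulation. `track_ball` and its geometry conventions are illustrative assumptions, not the paper's renderer:

```python
def track_ball(pos, vel, width, height, holes, dt=1e-3, max_steps=1_000_000):
    """Simulate a point mass bouncing elastically in a box and return the
    1-indexed top hole it enters. holes: (x_lo, x_hi) intervals at y = height."""
    x, y = pos
    vx, vy = vel
    for _ in range(max_steps):
        x += vx * dt
        y += vy * dt
        if x <= 0 or x >= width:   # side walls: reverse the x-component
            vx = -vx
            x = min(max(x, 0.0), width)
        if y <= 0:                 # floor: reverse the y-component
            vy = -vy
            y = 0.0
        if y >= height:            # top edge: enter a hole or bounce off
            for i, (lo, hi) in enumerate(holes, start=1):
                if lo <= x <= hi:
                    return i
            vy = -vy
            y = height
    return None

holes = [(0.0, 1.0), (3.0, 4.0)]
answer = track_ball((0.5, 1.0), (0.0, 1.0), 4.0, 2.0, holes)
```

A closed-form alternative would unfold the reflections into a straight line, but the stepped version matches the frame-by-frame explanations used for CoT generation more directly.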
Sokoban. Sokoban is a classic grid-based puzzle game. We generate instances with grid sizes ranging from 6 to 10, containing a single box and a target position. Test prompts are sampled from the same distribution as the training data. To construct CoTs, we use a search algorithm to compute an optimal solution path. To avoid excessively long trajectories, we render only key intermediate steps, including: (i) the player moving toward the box, (ii) pushing the box in a direction, and (iii) changing the pushing direction. To encourage reflective behavior, we additionally augment trajectories with randomized detours that involve walking into walls, reflecting, and backtracking to rejoin the optimal path. CoTs are generated by Seed 1.6, which explains the dynamics between adjacent frames. For visual world modeling, the rendered intermediate steps are interleaved with verbal CoTs. For pure verbal world modeling, these intermediate renderings are removed. For implicit world modeling, we additionally mask all explicit coordinates in the CoTs with the special token [masked].
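The optimal solution path can be computed by breadth-first search over joint (player, box) states. A minimal sketch (the state encoding and move letters are illustrative assumptions, not the paper's solver):

```python
from collections import deque

def solve_sokoban(walls, player, box, goal):
    """BFS over (player, box) states; returns a shortest move string such
    as 'rr'. Moves can push the box but never pull it."""
    moves = {"u": (-1, 0), "d": (1, 0), "l": (0, -1), "r": (0, 1)}
    start = (player, box)
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (p, b), path = queue.popleft()
        if b == goal:
            return path
        for m, (dr, dc) in moves.items():
            np_ = (p[0] + dr, p[1] + dc)
            if np_ in walls:
                continue
            nb = b
            if np_ == b:                       # stepping into the box pushes it
                nb = (b[0] + dr, b[1] + dc)
                if nb in walls:
                    continue
            state = (np_, nb)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + m))
    return None

# A 3x6 room with a walled border: push the box two cells right onto the goal.
walls = {(r, c) for r in range(3) for c in range(6) if r in (0, 2) or c in (0, 5)}
path = solve_sokoban(walls, player=(1, 1), box=(1, 2), goal=(1, 4))
```

Randomized detours can then be spliced into such an optimal path before rendering the key frames.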
Maze. Maze is a classic grid-based puzzle task. We generate both training and test samples with a fixed grid size of $5\times 5$ . To construct CoTs, we use rule-based templates followed by rewriting for improved naturalness. Under visual world modeling, intermediate steps, rendered as points and lines, are interleaved with verbal CoTs. The settings for verbal and implicit world modeling follow the same protocol as in Sokoban, with coordinates masked as <point>[masked]</point>.
Cube 3-view projection. This task considers stacks of colored cubes arranged on grids of varying sizes (3–5), with two cube colors. The input consists of one isometric view (either front-left or front-right) and two orthographic views of the stack. The question asks for the number of cubes of a specified color visible from another orthogonal view. Both the questions and answer choices account for ambiguity caused by occlusions, leading to uncertainty in the cube count. All test prompts are constructed using uniformly random grid sizes between 3 and 5. We generate CoTs using rule-based templates: the model first constructs the queried view, marks potentially occluded cubes using a third (auxiliary) color, and then counts cubes by color. These CoTs are subsequently rewritten by Gemini 2.5 Pro for improved naturalness. Under visual world modeling, we interleave reasoning steps with an image of the queried view. Under verbal world modeling, we represent intermediate views using character matrices, where different colors are encoded by different symbols.
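The occlusion logic behind the auxiliary-color marking can be sketched as a nearest-cube projection: along the viewing axis, only the cube closest to the viewer is visible at each line of sight. A toy sketch (the `view_counts` helper and coordinate convention are illustrative assumptions):

```python
def view_counts(cubes, axis=0):
    """Project a cube stack along `axis`; the cube with the smallest
    coordinate on that axis occludes those behind it.
    cubes: dict mapping (x, y, z) -> color string."""
    visible = {}
    for pos, color in cubes.items():
        key = tuple(c for i, c in enumerate(pos) if i != axis)  # 2-D line of sight
        depth = pos[axis]
        if key not in visible or depth < visible[key][0]:
            visible[key] = (depth, color)
    counts = {}
    for _, color in visible.values():
        counts[color] = counts.get(color, 0) + 1
    return counts

# Two red cubes visible from the -x direction; the cyan cube is occluded.
counts = view_counts({(0, 0, 0): "red", (1, 0, 0): "cyan", (0, 1, 0): "red"})
```

Cubes whose color cannot be determined from the given views would be tagged with a third marker before counting, yielding the min–max ranges used in the answer choices.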
Real-world spatial reasoning. For this real-world task, we directly adopt test samples from MMSI-Bench, focusing on camera–object and camera–region positional relationship questions. We construct training prompts following a pipeline similar to Yang et al. [68]. To obtain training CoTs, we run a visual-CoT model, which uses an SFT-trained BAGEL model for novel view synthesis as a tool. The resulting visual CoTs are subsequently filtered and rewritten by Gemini 2.5 Pro.
We summarize the training and test sample counts for each task in VisWorld-Eval, along with the corresponding original or referenced benchmarks, in Table 2.
Table 2: Overview of VisWorld-Eval and corresponding training data: features, statistics, and references.
| Task | Capability | Domain | Training Samples | Test Samples | Source/Reference |
| --- | --- | --- | --- | --- | --- |
| Paper folding | Simulation | Synthetic | 2,357 | 480 | SpatialViz [61] |
| Multi-hop manipulation | Simulation | Synthetic | 2,000 | 480 | ZebraCoT [35], CLEVR [30] |
| Ball tracking | Simulation | Synthetic | 2,254 | 1,024 | RBench-V [20] |
| Maze | Simulation | Synthetic | 8,448 | 480 | maze-dataset [29] |
| Sokoban | Simulation | Synthetic | 7,715 | 480 | GameRL [55] |
| Cube 3-view projection | Reconstruction | Synthetic | 2,500 | 480 | SpatialViz [61] |
| Real-world spatial reasoning | Reconstruction | Real-world | 10,661 | 522 | MMSI-Bench [69] |
Examples of training CoTs are presented in Figures 10, 11, 12, 13, and 14.
### 8.2 Model Training
We perform supervised fine-tuning (SFT) of BAGEL based on its official repository https://github.com/ByteDance-Seed/Bagel, using 8 GPUs, and conduct reinforcement learning from verifiable rewards (RLVR) using verl https://github.com/volcengine/verl on 64 GPUs. Hyperparameters for SFT and RLVR are reported in Table 3 and Table 4, respectively.
Table 3: Hyperparameters for supervised fine-tuning UMMs.
| Hyperparameter | Value |
| --- | --- |
| Learning rate | $3\times 10^{-5}$ |
| LR Schedule | Constant |
| Optimizer | AdamW |
| Loss weight (CE:MSE) | 1:10 |
| Warm-up steps | 200 |
| Training steps | 4000 |
| Gen. resolution | (256, 1024) for paper folding, cube 3-view |
| | (240, 1024) for multi-hop manipulation |
| | (256, 512) otherwise |
| Und. resolution | (224, 980) |
| Sequence length per rank | 32K |
| Num. ranks | 8 |
Table 4: Hyperparameters for reinforcement learning UMMs.
| Hyperparameter | Value |
| --- | --- |
| Learning rate | $1\times 10^{-5}$ |
| Batch size | 128 |
| GRPO mini batch size | 32 |
| Group size | 16 |
| KL loss coefficient for visual gen. | 0.1 |
| KL loss coefficient for verbal gen. | 0.0 |
<details>
<summary>x12.png Details</summary>

### Visual Description
\n
## Diagram: Paper Folding Unfolding Process
### Overview
The image presents a step-by-step visual and textual explanation of unfolding a folded paper with cut-out shapes. The goal is to determine the final pattern of cutouts when the paper is fully unfolded and to count the total number of cutouts. The diagram uses a grid-based representation of the folded paper and arrows to illustrate the unfolding process. It also includes mathematical notation to represent the positions of the cutouts.
### Components/Axes
The diagram consists of the following components:
* **Grid:** A 3x3 grid representing the folded paper.
* **Cutout Shapes:** Represented as triangles, labeled as 'triangle_right' and 'triangle_left'.
* **Arrows:** Indicate the direction of unfolding.
* **Mathematical Notation:** Lists of coordinates representing the positions of the cutouts (e.g., [0, 0, -1], [1, 1, -1]).
* **Textual Explanation:** Step-by-step instructions describing the unfolding process.
### Detailed Analysis or Content Details
**Step 1: Initial Folded State**
The initial folded state shows the following cutout positions:
* `triangle_right` at (0, 0)
* `triangle_left` at (0, 1)
* `triangle_right` at (0, 2)
Represented mathematically as: `[[0, 0, -1], [1, 1, -1], [1, 2, -1]]`
**Step 2: Unfolding the Second Fold**
This step involves unfolding the fold where the right side of the paper was folded over to the left. The reflection occurs across the vertical line between column 0 and column 1.
* `triangle_right` at (0, 0) is reflected to (1, 1).
* `triangle_left` at (0, 1) is reflected to (2, 1).
* The original shapes remain in column 0.
The resulting cutout positions are: `[[0, 0, -1], [1, 1, -1], [1, 1, -1], [1, 2, -1], [triangle_right, triangle_left, ''], [triangle_right, triangle_left, '']]`
**Step 3: Unfolding the First Fold**
This step involves unfolding the fold made by placing the right third of the paper over the middle third.
* Shapes in the leftmost column remain unchanged.
* Shapes in the middle column are reflected across a double layer.
* `triangle_right` at (1, 1) is reflected to the new middle column, resulting in `triangle_right` at the stationary layer.
* `triangle_left` at (1, 1) is reflected to `triangle_left` in the middle column.
The resulting cutout positions are: `[[0, 0, -1], [1, 1, -1], [1, 1, -1], [1, 2, -1], [triangle_right, triangle_left, ''], [triangle_right, triangle_left, '']]`
**Final Unfolded State**
The final unfolded state shows the following cutout positions:
* `triangle_right` at (0, 0)
* `triangle_right` at (0, 1)
* `triangle_right` at (0, 2)
* `triangle_right` at (1, 0)
* `triangle_left` at (1, 1)
* `triangle_left` at (1, 2)
* `triangle_right` at (2, 0)
* `triangle_left` at (2, 1)
* `triangle_right` at (2, 2)
The total number of cutouts is 9.
### Key Observations
* Each unfolding step involves reflecting the cutout shapes across a specific axis.
* The mathematical notation provides a precise representation of the cutout positions.
* The diagram clearly illustrates the unfolding process, making it easy to follow.
* The final unfolded state reveals a symmetrical pattern of cutouts.
### Interpretation
The diagram demonstrates the concept of reflection and symmetry in the context of paper folding. The unfolding process can be seen as a series of reflections, where each fold creates a mirror image of the cutouts. The final unfolded state shows that the total number of cutouts is determined by the initial number of cutouts and the number of folds. The diagram effectively visualizes this process, making it easier to understand the underlying mathematical principles. The use of both visual and textual explanations enhances the clarity and accessibility of the information. The diagram is a good example of how geometric transformations can be used to create complex patterns from simple shapes.
</details>
Figure 10: Examples of chain-of-thought SFT data for the paper folding task, under visual world modeling (left) and verbal world modeling (right).
<details>
<summary>x13.png Details</summary>

### Visual Description
\n
## Textual Document: Ball Tracking & Multi-Hop Manipulation - Problem & Solution
### Overview
The image presents two distinct textual blocks: one detailing a "Ball Tracking" problem with a provided solution, and another describing a "Multi-Hop Manipulation" problem with its solution. Both sections include a question, a response, and supporting reasoning. The layout is a two-column format.
### Components/Axes
The image does not contain charts, diagrams, or axes. It consists entirely of text.
### Detailed Analysis or Content Details
**Left Column: Ball Tracking**
* **Question:**
1. There is a red ball and a green arrow; the arrow indicates the ball’s initial direction of motion.
2. At the image numbered from left to right starting at 1;
3. The black boundaries of the image represent solid walls. When the ball hits a wall, it undergoes ideal reflection;
4. The red ball is considered a point mass—its size is negligible;
5. The ball reflects according to the following rules:
* The component of velocity perpendicular to the wall is reversed in direction;
* The component of velocity parallel to the wall remains unchanged;
* The ball moves at a constant speed in an ideal environment until it enters one of the holes.
6. Your task is: Estimate which hole the red ball will enter during its motion. Please reason step by step, and put your final answer within box([ ]).
* **Response:**
1. Firstly, analyze the original image: The image shows a rectangular area with three horizontal holes at the top, numbered 1, 2, 3 from left to right. A red dot with a downward green arrow indicates the ball’s initial position and direction of motion. The black lines are solid walls, and the ball undergoes ideal reflection (perpendicular velocity component reverses, parallel component stays the same).
2. Track the ball’s motion: The ball moves downward in the direction of the green arrow. To determine where it will exit, consider the following:
* The ball will first hit the bottom wall.
* Upon hitting the bottom wall, the vertical component of its velocity will reverse, sending it upwards.
* It will then hit the top wall.
* Upon hitting the top wall, the vertical component of its velocity will reverse again, sending it downwards.
* This process repeats.
3. Estimate the exit hole: Due to the repeated reflections, the ball will follow a zigzag path. The initial downward direction and the reflections suggest the ball will likely enter hole 2.
4. Final Answer: box([2])
* **Additional Text:** "FM-AI" is present in the bottom-left corner.
**Right Column: Multi-Hop Manipulation**
* **Question:** You are given an initial arrangement of objects viewed from an oblique front perspective. In this view, objects closer to the camera are considered 'front' and the 'left' and 'right' sides of the scene correspond to the left and right of the frame.
Perform the following operations in sequence:
1. Change the object directly in front of the yellow cuboid into a rose cylinder.
2. Place a gray cylinder behind and to the left of the object that is directly behind the rose cylinder.
3. Place a gray sphere to the left of the rose cylinder.
After completing all operations, what is the object to the right of the blue cylinder?
A. gray cylinder, B. gray cuboid, C. red cylinder, D. blue cylinder.
* **Response:** `<think> First, I will execute the operations step-by-step, starting from the initial scene which contains five objects: a blue cylinder, a gray sphere, a gray cuboid, a red cylinder, and a yellow cuboid. The first step is to change the object closest to the front of the yellow cuboid into a rose cylinder. In the initial image, this object is the gray sphere. Accordingly, I will change the gray sphere to a rose cylinder.</image>`
* **Additional Text:** "FM-AI" is present in the bottom-right corner.
### Key Observations
Both sections demonstrate a problem-solving approach. The "Ball Tracking" section involves spatial reasoning and physics principles (reflection). The "Multi-Hop Manipulation" section requires following a sequence of instructions to modify a scene and identify an object based on its relative position. The presence of "FM-AI" in both corners suggests a common origin or context, potentially related to an AI model or framework.
### Interpretation
The image showcases a testing or training scenario for an AI model capable of reasoning and problem-solving. The "Ball Tracking" problem assesses the model's understanding of physics and spatial dynamics. The "Multi-Hop Manipulation" problem tests the model's ability to follow instructions, maintain state, and perform relational reasoning. The inclusion of a "think" step in the "Multi-Hop Manipulation" response suggests the model is designed to articulate its reasoning process. The consistent use of "FM-AI" indicates this is likely a component of a larger system developed by that entity. The problems are designed to be solvable through logical deduction rather than requiring external knowledge. The format (question, response, reasoning) is typical of educational or assessment materials.
</details>
Figure 11: Examples of chain-of-thought SFT data for the ball tracking and multi-hop manipulation task.
<details>
<summary>x14.png Details</summary>

### Visual Description
\n
## Text Extraction & Analysis: Maze & Sokoban Problem Sets with Responses
### Overview
The image presents two distinct problem sets with corresponding responses: a maze navigation task and a Sokoban puzzle. Both problems are presented with a question and a detailed, step-by-step response. The maze problem involves finding a path from a start to an end point, represented by coordinates. The Sokoban problem involves moving boxes in a grid to designated goal locations. The responses detail the reasoning and steps taken to solve each problem.
### Components/Axes
The image is divided into two main sections, each containing a question and a response. Each section is further structured with text blocks. The Sokoban section includes a grid representation of the puzzle. The maze section includes a visual representation of the maze.
### Detailed Analysis or Content Details
**Maze Problem:**
* **Question:** "You are provided with an image of a maze. A red dot marks the starting position, and a blue X marks the goal. Your objective is to navigate the maze step-by-step from the start to the goal. Return your final path as a list of waypoints in the following format: `<point1> <point2> ...`"
* **Response:**
* "I will now solve the maze step by step. I can see the starting point, the red dot, is at `<point=165 83>` located in the upper-left area of the maze. My destination, the blue X, is at `<point=83 410>` located in the lower-left corner of the maze."
* **Step 1:** "I begin at the starting point, `<point=165 83>`. Looking around, the path upwards is blocked by the top boundary of the maze, and the path to the left is blocked by a wall. This leaves me with two possible directions: right or down. After observing the maze layout, I've decided to move right. The path downwards leads into a small, U-shaped enclosure which is clearly a dead end. The path to the right, however, opens into a long corridor that seems to wind through a large portion of the maze, making it the only viable option for making significant progress. I will proceed right until I hit the wall at `<point=246 83>`."
* **Step 2:** "Since there are no new choices, I will continue along the only available path. From my new position at `<point=246 83>`, the path to the left leads back the way I just came, and the path downward is blocked. Only one option is available: follow the path to the right until I reach `<point=283 83>`."
* **Step 3:** "From `<point=283 83>`, I can go down. I will go down to `<point=283 165>`."
* **Step 4:** "From `<point=283 165>`, I can go left. I will go left to `<point=246 165>`."
* **Step 5:** "From `<point=246 165>`, I can go down. I will go down to `<point=246 248>`."
* **Step 6:** "From `<point=246 248>`, I can go left. I will go left to `<point=165 248>`."
* **Step 7:** "From `<point=165 248>`, I can go down. I will go down to `<point=165 331>`."
* **Step 8:** "From `<point=165 331>`, I can go left. I will go left to `<point=83 331>`."
* **Step 9:** "From `<point=83 331>`, I can go down. I will go down to `<point=83 410>`."
* **Final Path:** `<point=165 83> <point=246 83> <point=283 83> <point=283 165> <point=246 165> <point=246 248> <point=165 248> <point=165 331> <point=83 331> <point=83 410>`
**Sokoban Problem:**
* **Question:** "This is a Sokoban task. A player is in a small room with walls, an empty floor, one box, and one goal with green marks. Your objective is to guide the player to move the box onto the goal position. The player can move 'up', 'down', 'left', or 'right' and can only push the box, not pull it. The player cannot walk through walls or the box."
* **Response:**
* "Let's begin by analyzing the initial setup. In the starting scene, the player character is positioned at (3,1) on the grid. The box is located at (1,2), and the goal (marked with green X) is at position (1,4). The room contains walls (the orange brick tiles) and floor tiles (the lighter, dotted ones). My objective is to push the box from (1,2) onto the goal at (1,4)."
* "To accomplish this, I first need to position myself so I can push the box towards the goal. The box is currently in row 1, column 2, and the goal is located two columns to the right at (1,4). Since the goal is to the right of the box, I need to approach the box from the left side in order to push it right. Currently, I am at position (3,1). To get to its left side (push position (1,1)), I need to move twice left from (3,1), then twice up. Therefore I am adjacent to the box on its left side."
* **Final Sequence:** "bblluu" (where 'b' = back, 'l' = left, 'u' = up)
* "Now, I will push the box to the right by moving right. This will move the box from (1,2) to (1,3). Then, I will move right again to push the box onto the goal at (1,4). Therefore, the final sequence of moves is 'rr'."
* **Complete Sequence:** "bblluurr"
### Key Observations
* The maze solution provides a precise sequence of coordinates, demonstrating a methodical approach to pathfinding.
* The Sokoban solution breaks down the problem into smaller steps: positioning the player, then pushing the box.
* Both responses emphasize a step-by-step reasoning process.
* The Sokoban grid is a 5x5 grid, with coordinates ranging from 1 to 5.
### Interpretation
The image showcases the ability of a language model to solve spatial reasoning problems and articulate the solution process in a clear, step-by-step manner. The maze solution demonstrates the model's capacity to interpret a visual representation and translate it into a sequence of coordinates. The Sokoban solution highlights the model's understanding of game mechanics and its ability to plan a series of actions to achieve a specific goal. The detailed explanations provided in the responses suggest a level of reasoning beyond simple pattern matching. The use of coordinate systems and directional commands indicates a structured approach to problem-solving. The responses are not just providing answers, but also explaining *how* those answers were derived. This is a key indicator of a more sophisticated understanding of the problems presented.
</details>
Figure 12: Examples of chain-of-thought SFT data for the maze and sokoban task.
<details>
<summary>x15.png Details</summary>

### Visual Description
\n
## Diagram: Cube 3-View Projection & Red Cube Count Determination
### Overview
The image presents a problem involving a cube stack composed of red and light cyan cubes. The task is to determine the number of red cubes visible from the left view, given isometric views from the front, top, and right. The image details the reasoning process to solve this problem, starting with an initial attempt at representing the cube stack and refining it to account for hidden cubes and uncertainty. The solution is presented with a matrix representation of the cube stack.
### Components/Axes
The image consists of:
* **Cube Views:** Isometric projections of the cube stack from the front, top, and right.
* **Question Text:** A textual problem statement asking for the number of red cubes visible from the left.
* **Response Text:** A step-by-step explanation of the solution process.
* **Matrix Representations:** Several 4x4 matrices representing the cube stack, with 'X' denoting a red cube, 'O' denoting a light cyan cube, '*' denoting an uncertain cube, and '|' representing the cube boundaries.
* **Answer Options:** A list of possible answers (A. 4, B. 6, C. 5, D. 6, E. 5, F. 4).
### Detailed Analysis or Content Details
**Initial Attempt:**
The initial representation of the cube stack is a 4x4 matrix:
```
| X | X | | X |
|---|---|---|---|
| X | O | | X |
|---|---|---|---|
| X | O | | X |
|---|---|---|---|
| X | X | X | X |
```
**Refined Representation:**
The refined representation introduces '*' to denote uncertain cubes:
```
| X | X | | X |
|---|---|---|---|
| X | O | | X |
|---|---|---|---|
| X | O | | X |
|---|---|---|---|
| * | * | X | X |
```
**Final Corrected Representation:**
The final corrected 4x4 matrix is:
```
| X | X | | X |
|---|---|---|---|
| X | O | | X |
|---|---|---|---|
| X | O | | X |
|---|---|---|---|
| * | * | X | X |
```
**Solution:**
Based on the final matrix, the number of definitively red cubes is 2 (marked with 'X'). There are 2 cubes whose color is uncertain (marked with '*'). Since each of these uncertain cubes could be red, the total number of red cubes visible from the left could range from a minimum of 2 (if both uncertain cubes are light cyan) to a maximum of 4 (if both uncertain cubes are red).
The answer is therefore 4.
### Key Observations
* The problem requires spatial reasoning and the ability to visualize a 3D structure from 2D projections.
* The solution process involves iterative refinement of the cube stack representation to account for hidden cubes and uncertainty.
* The use of a matrix representation simplifies the visualization and analysis of the cube stack.
* The final answer is not a single value but a range, reflecting the uncertainty in the color of some cubes.
### Interpretation
The image demonstrates a problem-solving approach to a spatial reasoning task. The initial attempt to represent the cube stack is incomplete, as it doesn't account for hidden cubes. The refinement process, using the '*' symbol to denote uncertainty, is crucial for arriving at a correct solution. The matrix representation is a powerful tool for visualizing and analyzing the cube stack, allowing for a systematic approach to counting the red cubes. The final answer of 4 acknowledges the uncertainty in the color of some cubes, highlighting the importance of considering all possibilities. The problem showcases the application of logical deduction and spatial visualization skills. The use of the "think" notation suggests a thought process being documented, similar to a "rubber duck debugging" approach. The problem is designed to test the ability to infer information from limited data and to handle ambiguity. The solution is presented in a clear and concise manner, making it easy to follow the reasoning process. The problem is a good example of a visual-spatial reasoning question that could be used in an aptitude test or interview.
</details>
Figure 13: Examples of chain-of-thought SFT data for the cube 3-view projection task, under visual world modeling (left) and verbal world modeling (right).
<details>
<summary>x16.png Details</summary>

### Visual Description
\n
## Textual Document: Spatial Reasoning Problem Set
### Overview
The image presents two spatial reasoning problems with associated "Response" sections demonstrating a thought process for solving them. The problems are presented in a question-and-answer format, with accompanying images (labeled "Image 1" and "Image 2") used as visual references. The "Response" sections include a "think" section outlining the reasoning steps.
### Components/Axes
The image is structured as follows:
* **Header:** "Real-World Spatial Reasoning"
* **Problem 1:** Question, "Response" section with "think" sub-section, and "Image 1"
* **Problem 2:** Question, "Response" section with "think" sub-section, and "Image 2"
* The image contains text in English.
### Detailed Analysis or Content Details
**Problem 1:**
* **Question:** "If the round clock mounted on the dark wall is north of the heater below the windows, and you stand at the heater, in which direction will you find the white door in the room's background?"
* **Options:** A. north B. southwest C. northwest D. east.
* **Response:**
* **think:** "My goal is to determine the direction of the white door from the heater."
* "First, let's establish the reference frame. The problem states that the round clock on the dark wall is north of the heater. This means if I stand at the heater and look towards the clock, I am facing North."
* "Let's examine the provided images."
* "Image 1 shows the heater, a bookshelf, and a TV on a media console."
* "Image 2 shows the heater (radiator under the window on the right), the clock on a dark gray wall section, a dining table, and the same TV and media console."
* "The heater, my starting point, is visible in Image 2. The clock is also visible in Image 2, allowing me to establish the North direction from my position at the heater. The white door, however, is not in this view. It is visible in Image 1."
* "The TV and media console are common elements in both images, indicating that the scenes are in the same room. From Image 2, the TV area is to the left of the heater. In Image 1, the door is to the left of the TV area. This suggests the door is further to the left from the perspective of Image 2."
* "To find the door from my position at the heater (viewpoint of Image 2), I need to turn my head to the left. I'll estimate a 45-degree turn to the left ("front left" action) from the viewpoint of the second image to see if the door comes into view."
* **Image 1:** Shows a room with a heater, bookshelf, TV, and a white door visible in the background.
* **Image 2:** Shows a room with a heater, clock, dining table, TV, and media console.
**Problem 2:**
* **Question:** "When the table lamp (in frame 2) is south of the pillow (in frame 2), what is the position of the TV (in frame 1) in relation to the combined pillow and table lamp (in frame 2)?"
* **Options:** A. northeast, B. southwest, C. northwest, D. west
* **Response:**
* **think:** "To solve this spatial reasoning problem, I first need to establish a clear view of all the relevant objects: the table lamp and pillow from the second image, and the TV from the first image. The initial frames show these objects from different perspectives. Frame 1 shows the TV, while Frame 2 shows the pillow and the table lamp. To get a comprehensive view, a good strategy would be to move backward from the position of the second image. This should widen the field of view and hopefully bring the TV into sight. Let's move backward from the second image."
* "With this new perspective providing a combined view, we can now proceed with the reasoning."
* "1. Identify the pillow (in frame 2): It is a fluffy object on the bed, near the dark headboard."
* "2. Identify the table lamp (in frame 2): White, with a translucent lampshade, positioned to the right of the pillow."
* "3. Establish the south direction: The table lamp is south of the pillow, meaning the pillow is north of the table lamp."
* "4. From the question statement, we know the TV is in frame 1. The TV is on a media console. Based on the combined view, the TV is to the left and slightly behind the combined pillow and table lamp."
* "5. Determine the relative position: From the combined perspective, the TV is located to the northeast of the combined pillow and table lamp."
* **Image 1:** Shows a room with a TV and media console.
* **Image 2:** Shows a bed with a pillow and a table lamp.
### Key Observations
* Both problems rely on visual reasoning and the ability to integrate information from multiple images.
* The "think" sections demonstrate a step-by-step approach to problem-solving, emphasizing establishing a reference frame and identifying key relationships between objects.
* The problems involve determining relative positions (direction, northeast, southwest, etc.).
### Interpretation
The document showcases a method for teaching and assessing spatial reasoning skills. The problems are designed to test the ability to mentally manipulate objects in space and understand their relationships from different viewpoints. The "think" sections are crucial, as they reveal the cognitive processes involved in solving these types of problems. The use of images is essential for providing the necessary visual information. The problems are not about factual recall but about applying logical reasoning to visual data. The consistent structure of question, response, and image provides a clear framework for learning and evaluation. The problems are designed to be solved by carefully analyzing the images and applying spatial concepts.
</details>
Figure 14: Examples of chain-of-thought SFT data for the real-world spatial reasoning task.
The Qwen-VL baselines are trained using LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory) for supervised fine-tuning (SFT) and verl for reinforcement learning from verifiable rewards (RLVR).
### 8.3 Analytic Experiments
Sample efficiency. For Figure 6, we randomly subsample either 500 or 1000 training examples. The resulting models are evaluated under two settings: (i) a hard setting with the maximum difficulty (grid size 8 and 4 folding steps, default in VisWorld-Eval), and (ii) an in-distribution setting (denoted as Normal in the figure) with randomly sampled grid sizes (3–8) and folding steps (1–4).
Task difficulties and world model fidelity. For Figure 6, we generate test samples with varying cube-stack sizes (3–6), where size 6 is out-of-distribution relative to the training data. To assess world-model fidelity, we compare the generated views against the ground-truth views: for verbal world modeling, we use string pattern matching; for visual world modeling, we use Gemini 3 Pro to compare images. Since accurately inferring colors becomes particularly challenging at larger stack sizes, we evaluate only the shapes of the views and ignore color information. We also find that after SFT, overall accuracy can be bottlenecked by verbal subskills (e.g., counting holes), so we report the accuracy of RL-trained models in Figure 6. Conversely, RL can degrade verbal world-modeling capabilities, producing symbolic matrices in invalid formats, so we report the world-model fidelity of SFT-trained models.
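As a concrete illustration of the shape-only matching for verbal world models, the check might look like the sketch below. The symbolic-matrix format assumed here (whitespace-separated color tokens per row, with `.` marking empty cells) is a placeholder for illustration, not the paper's exact implementation:

```python
def shape_mask(view: str) -> list[list[bool]]:
    """Reduce a symbolic view (rows of color tokens, '.' = empty cell)
    to a binary occupancy mask, discarding all color information."""
    return [[token != "." for token in row.split()]
            for row in view.strip().splitlines()]

def verbal_fidelity(generated: str, ground_truth: str) -> bool:
    """Count a generated view as faithful if its shape (occupancy
    pattern) matches the ground truth, ignoring colors."""
    try:
        return shape_mask(generated) == shape_mask(ground_truth)
    except Exception:
        return False  # an unparsable format counts as unfaithful
```

Under this convention, a generated view that places every cube correctly but miscolors some of them still counts as faithful, consistent with evaluating only the shapes of the views.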
Implicit world modeling. For Figure 6, we perform supervised fine-tuning (SFT) of BAGEL on CoTs with implicit world modeling, in which all explicit point coordinates are replaced by the placeholder token sequence <point>masked<point>. After training, we extract the hidden representations at the position of the masked placeholder token from each transformer layer. We then split the extracted representations from different CoTs into training and validation sets with an 8:2 ratio and train a two-layer MLP (hidden size 4096) to predict the ground-truth point coordinates. Since all samples are $5\times 5$ mazes, we formulate coordinate prediction as two 5-way classification tasks (for $x$ and $y$, respectively). We compute classification accuracy for each coordinate and report the average of the two.
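A minimal sketch of such a two-headed probe, written in plain NumPy with synthetic features standing in for the extracted hidden representations (the hidden size, learning rate, and data below are illustrative placeholders, not the paper's configuration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class CoordProbe:
    """Two-layer MLP probe with two 5-way classification heads,
    one for the x coordinate and one for the y coordinate."""

    def __init__(self, d_in, hidden=4096, n_classes=5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 1 / np.sqrt(d_in), (d_in, hidden))
        self.b1 = np.zeros(hidden)
        self.Wx = rng.normal(0, 1 / np.sqrt(hidden), (hidden, n_classes))
        self.Wy = rng.normal(0, 1 / np.sqrt(hidden), (hidden, n_classes))
        self.bx = np.zeros(n_classes)
        self.by = np.zeros(n_classes)

    def forward(self, H):
        self.h = np.maximum(0, H @ self.W1 + self.b1)  # ReLU hidden layer
        return (softmax(self.h @ self.Wx + self.bx),
                softmax(self.h @ self.Wy + self.by))

    def step(self, H, yx, yy, lr=0.1):
        """One full-batch gradient step on the mean cross-entropy
        of both heads."""
        n = len(H)
        px, py = self.forward(H)
        gx, gy = px.copy(), py.copy()          # dL/dlogits = (p - onehot)/n
        gx[np.arange(n), yx] -= 1; gx /= n
        gy[np.arange(n), yy] -= 1; gy /= n
        gh = gx @ self.Wx.T + gy @ self.Wy.T   # backprop into hidden layer
        gh[self.h <= 0] = 0                    # ReLU backward
        self.Wx -= lr * self.h.T @ gx; self.bx -= lr * gx.sum(0)
        self.Wy -= lr * self.h.T @ gy; self.by -= lr * gy.sum(0)
        self.W1 -= lr * H.T @ gh;      self.b1 -= lr * gh.sum(0)

    def accuracy(self, H, yx, yy):
        """Average of per-coordinate classification accuracies."""
        px, py = self.forward(H)
        return (np.mean(px.argmax(1) == yx)
                + np.mean(py.argmax(1) == yy)) / 2
```

In the paper's setting, `H` would hold the hidden states extracted at the masked-token position from a given transformer layer, with `hidden=4096` and labels given by the ground-truth $x$ and $y$ coordinates of each $5\times 5$ maze.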
## 9 Extended Experimental Results
### 9.1 Full Results on MMSI-Bench
We report all scores on positional relationship tasks of MMSI-Bench in Table 5.
Table 5: Full results of SFT-trained UMMs on MMSI-Bench positional relationship tasks.
| | MMSI-Bench (Positional Relationship) | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Models | Cam.–Cam. | Obj.–Obj. | Reg.–Reg. | Cam.–Obj. | Obj.–Reg. | Cam.–Reg. | Overall |
| Implicit WM | 33.1 | 31.2 | 31.8 | 46.5 | 29.1 | 37.3 | 34.8 |
| Visual WM | 29.6 | 29.5 | 31.6 | 60.9 | 25.8 | 54.4 | 38.4 |
### 9.2 Additional Qualitative Evaluation
We provide additional qualitative evaluation of trained UMMs’ reasoning, particularly failure cases.
Real-world spatial reasoning. As shown in Figure 15a, reasoning with implicit world modeling is prone to hallucinations. In contrast, visual generation (Figure 15b) yields more faithful world models, but still suffers from insufficient quality, including blurring and corrupted details. Moreover, we find that current VLMs and UMMs continue to exhibit limited understanding of positions and directions across different viewpoints. We expect that stronger base models and better-curated post-training data will enable more effective use of visual world models for spatial reasoning in future work.
Paper folding. As illustrated in Figure 16, verbal reasoning about geometric symmetry is prone to hallucinations, leading to inaccurate verbal world modeling. In contrast, visual world models, benefiting from stronger prior knowledge, generate correct intermediate unfolding steps even in the presence of erroneous verbal reasoning.
Cube 3-view projection. As shown in Figure 17, visual world models are able to approximately generate novel views of cube stacks even in the challenging out-of-distribution setting with an unseen stack size of 6, indicating strong prior knowledge of spatial transformations. Nevertheless, overall task performance remains limited by subtle shape-generation errors (Figure 17b,d) and inaccurate color inference (Figure 17c). We expect these issues to be alleviated through improved post-training and stronger base models.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Real-World Spatial Reasoning: Question Set with Visuals
### Overview
The image presents a spatial reasoning problem with two parts (a) and (b). Each part includes a question, a set of multiple-choice answers, and a series of images illustrating a walkthrough of the reasoning process. The problem involves determining relative positions and directions within a room based on visual cues. The text is primarily in English.
### Components/Axes
The image is structured into two main sections, (a) and (b), each with the following components:
* **Question:** A textual question related to spatial reasoning.
* **Options:** A list of multiple-choice answers (A, B, C, D, E, F, G).
* **Model Output (Implicit World Modeling/Visual World Modeling):** Text describing the thought process used to solve the problem.
* **Image Sequence:** A series of four images showing different viewpoints within a room, used to illustrate the reasoning steps.
### Detailed Analysis or Content Details
**Part (a):**
* **Question:** "The images are taken continuously from a first-person perspective. At the moment of the last image, in which direction is the black chair relative to you?"
* **Options:** A: Front right, B: Back left, C: Front left, D: Back right.
* **Model Output (Implicit World Modeling):** Describes the process of analyzing initial images to build a mental map of the room and determine the chair's direction.
* **Image 1:** Shows a black chair in front of a white wall with a large painting.
* **Image 2:** Shows a bed with a black headboard against a white wall.
* **Image 3:** Shows a view turning slightly to the left.
* **Image 4:** Shows a desk and a window. The black chair is not visible.
* **Final Answer:** D: Back right.
**Part (b):**
* **Question:** "When you enter the room through the white door, which way should you go to reach the TV area?"
* **Options:** A: Go straight, B: Go left, C: Cannot be determined, D: Go right.
* **Model Output (Visual World Modeling):** Describes the process of understanding the room's layout and determining the direction to the TV area.
* **Image 1:** Shows a view through a white door.
* **Image 2:** Shows a wider view, with the white door still visible and the TV area to the left.
* **Image 3:** Shows a view turning slightly to the left.
* **Image 4:** Shows a comprehensive view in which both the door and the TV area are visible.
* **Final Answer:** B: Go left.
**Text Transcription (Model Output - Part a):**
"Model Output (Implicit World Modeling) <think>: To solve this, I need to determine the direction of the black chair relative to the camera’s position in the last image.
First, I’ll analyze the two initial images to build a mental map of the room.
Image 1 shows a black chair in front of a white wall with a large painting. To the right of the painting, there’s a dark hallway.
Image 2 shows a bed with a black headboard against a white wall. To the left of the bed, there’s a dark door.
The hallway seen on the right in image 1 appears to be the same hallway that starts on the right side of the bed in image 2. This suggests the two images are taken from different positions in the same room, looking in roughly opposite directions.
The question asks for the chair’s direction relative to the camera at the moment of the last image. The last image is image 2, which shows the bed. This means the camera’s starting position is near the bed, looking towards it.
To find the chair, I need to explore the surroundings from this starting position. I’ll simulate turning the camera to see what’s around the bed. I’ll start by turning left from the perspective of image 2.
This new view shows a desk and a window. The black chair is not here. This means the chair is not to the left of the bed.
Now, I’ll turn to the right. This view shows the chair (image 1) behind the bed and a dark hallway of opposite direction.
Final Answer: (D) Back right"
**Text Transcription (Model Output - Part b):**
"Model Output (Visual World Modeling) <think>: My goal is to determine the direction from the white door to the TV area. The first image shows a white door in the background, and the second image shows a TV area. To understand their relative positions, I need to explore the space.
Starting from the first image, I will move forward to get a better view of the room.
After moving forward, the view is wider. The white door is still visible, and the TV area seems to be to the left. To confirm this, I will turn slightly to the left from this new position.
After turning left, the TV area is now more clearly visible. To get a final, comprehensive view that includes both the door and the TV area, I will move backward from this current position.
Now, I can confidently determine that the TV area is to the left of the door.
Final Answer: (B) Go left"
### Key Observations
* The problem relies on understanding spatial relationships and performing mental rotations.
* The images are presented as a sequence to simulate a person moving through the room.
* The "Model Output" provides a step-by-step explanation of the reasoning process.
* The answers are based on interpreting visual cues and building a mental model of the environment.
### Interpretation
The image demonstrates a method for solving spatial reasoning problems by combining visual information with a step-by-step thought process. The "Model Output" acts as a cognitive trace, showing how a person might analyze the images and arrive at the correct answer. The use of multiple viewpoints and a sequential presentation of images is crucial for building a comprehensive understanding of the room's layout. The problem highlights the importance of mental rotation and the ability to infer spatial relationships from visual cues. The inclusion of both questions and answers, along with the reasoning process, makes this a valuable example for studying human spatial cognition and problem-solving strategies. The problem is designed to test the ability to perform "common sense" reasoning about the physical world.
</details>
Figure 15: Showcases of reasoning generated by post-trained UMMs in the real-world spatial reasoning task. We highlight hallucinations or incorrect reasoning steps in red.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Diagram: Paper Folding Analysis
### Overview
The image presents a visual and textual analysis of a paper folding problem. The task is to mentally reverse the folding process of a paper with cutouts to reconstruct the final unfolded design and calculate the number of triangles. The image contains diagrams illustrating the folding steps, along with accompanying text explaining the reasoning and solution. It also includes model output in a list format.
### Components/Axes
The image is structured into several sections:
1. **Problem Statement:** A textual question defining the task.
2. **Visual Steps:** A series of diagrams showing the paper folding process, with each step numbered from 1 to 6.
3. **Textual Explanation:** Detailed descriptions accompanying each visual step, explaining the folding and unfolding logic.
4. **Model Output (Visual World Modeling):** A list representing the cutting pattern on the folded paper.
5. **Model Output (Verbal World Modeling):** A textual explanation of the cutting pattern.
6. **Final Answer:** The calculated number of triangles.
There are no explicit axes or scales in this diagram. The "axes" are the sequential steps of the folding process, visually represented and explained.
### Detailed Analysis or Content Details
**1. Problem Statement:**
"Question: Analyze the image showing a folded paper with cutouts. Mentally reverse the folding process to reconstruct the final unfolded design, then provide your answer to: calculate the number of triangle, let minus the number of triangle: right."
**2. Visual Steps & Textual Explanation:**
* **Step 1:** The paper is folded in half vertically. The text states: "First, we will reverse the fourth and final fold. This was a vertical fold where the right half of the paper was folded over to the left half. To undo this, we unfold it, we open the paper from left to right. The vertical crease acts as a line of symmetry."
* **Step 2:** The text describes the existing cuts and the stationary portion of the paper. "In this case, the existing cutouts – the square and the triangle – are located on the right-hand portion of the paper, which remains stationary. The flap being unfolded is blank. Therefore, as we unfold the first fold, the existing cutouts are mirrored. The square and triangle simply remain in place, leaving the newly revealed left portion of the paper as a solid, uncut surface."
* **Step 3:** The paper is folded in half again, this time horizontally. "Next, we will reverse the third fold. This was a horizontal fold where the top half of the paper was folded down to the bottom half. To undo this, we unfold it, we open the paper from top to bottom. The horizontal crease acts as a line of symmetry."
* **Step 4:** The text explains the mirroring effect of the unfolding. "As we unfold the second fold, the existing cutouts are mirrored again. The square and triangle remain in their original positions, while the newly revealed top portion of the paper is a solid, uncut surface."
* **Step 5:** The paper is folded diagonally. "Then, we will reverse the second fold. This was a diagonal fold where the top-right corner of the paper was folded down to the bottom-left corner. To undo this, we unfold it, we open the paper along the diagonal. The diagonal crease acts as a line of symmetry."
* **Step 6:** The final unfolding step. "Finally, we will reverse the first fold. This was a diagonal fold where the top-left corner of the paper was folded down to the bottom-right corner. To undo this, we unfold it, we open the paper along the diagonal. The diagonal crease acts as a line of symmetry."
**3. Model Output (Visual World Modeling):**
```
[[-1, -1, -1],
[-1, 1, -1],
[-1, -1, 1]]
[['square', ''],
['triangle', 'left']]
```
**4. Model Output (Verbal World Modeling):**
"Let's analyze the shapes on the folded paper:
- The square cutout is on the stationary left side, so it remains in its original position.
- The left-pointing triangle is on the folded-over section. As we unfold, this cutout is mirrored across the vertical crease. A reflection across a vertical line flips a shape horizontally, so the left-pointing triangle becomes a right-pointing triangle on the right side.
- The right-pointing triangle is on the stationary right side, and left-pointing triangle is on the folded section."
**5. Final Answer:**
"Therefore, the final unfolded design will consist of a square and two triangles.
The number of triangle is 2.
Let minus the number of triangle: right.
So the answer is 2."
### Key Observations
* The solution relies heavily on understanding the symmetry created by each fold and how unfolding mirrors the cutouts.
* The model output provides a symbolic representation of the cutting pattern.
* The textual explanations are crucial for following the logic of the unfolding process.
* The final answer is explicitly stated as 2.
### Interpretation
The diagram demonstrates a problem-solving approach to spatial reasoning. By systematically reversing the folding steps, one can reconstruct the original unfolded design. The use of symmetry is a key concept in this process. The model output attempts to formalize the cutting pattern, but its format is somewhat abstract. The diagram effectively illustrates how a complex shape can be created through a series of simple folding and cutting operations. The final answer of 2 triangles is derived from careful consideration of the mirrored cutouts after each unfolding step. The phrase "Let minus the number of triangle: right" is somewhat ambiguous but seems to confirm the answer is a positive value. The entire process is a demonstration of visual-spatial intelligence and logical deduction.
</details>
Figure 16: Showcases of reasoning generated by post-trained UMMs in the paper folding task. We highlight hallucinations or incorrect reasoning steps in red, but also mark correctly generated visual unfolding intermediate steps with green borders.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Cube 3-View Projection: Problem Set
### Overview
The image presents a set of four cube projection problems (labeled a, b, c, and d). Each problem displays three 2D views of a cube stack and asks a question about the number of visible cubes of a specific color in a fourth view (front, top, or right). Each problem also includes a "Model Output" section that describes the reasoning process and provides the answer. A "Ground-truth" indicator is present in problem (b).
### Components/Axes
Each sub-image (a, b, c, d) is structured as follows:
* **Title:** "Cube 3-View Projection" followed by a letter (a, b, c, or d).
* **Cube Views:** Three 2D views of a cube are displayed horizontally: "front", "top", and "left".
* **Question:** A text question asking for the number of visible cubes of a specific color in a specified view.
* **Model Output:** A text block describing the reasoning and the answer.
* **Answer Options:** A list of possible answers (A, B, C, D, etc.).
* **Ground Truth:** (Only in b) A small green checkmark with the label "Ground-truth".
### Detailed Analysis or Content Details
**Problem (a):**
* **Cube Colors:** Yellow and other colors (not specified).
* **Question:** "A cube stack is made of equal-sized small cubes, mostly in seashell with a few yellow ones. The image shows its isometric view from front-left, top view, and left view from left to right. How many cubes in yellow can possibly be seen from the front view?"
* **Answer Options:** A. 2, B. All three, C. 4, D. 1.
* **Model Output:** "To determine the number of yellow cubes visible from the front, we first need to construct the front view of the stack. By synthesizing the provided isometric, top, and left views, we can generate the required front view."
* **Model Output (cont.):** "Examining the generated front view, we can directly count the number of visible yellow cubes. There are 2 yellow cubes visible. Therefore, the correct answer is 2, which corresponds to option A."
* **Answer:** A
**Problem (b):**
* **Cube Colors:** Blue and palegreen.
* **Question:** "A cube stack is made of equal-sized small cubes, mostly in palegreen with a few blue ones. The image shows its isometric view from front-right, front view, and right view from left to right. How many cubes in blue can possibly be seen from the top view?"
* **Answer Options:** A. 4, B. 5, C. 3, D. 0.
* **Model Output:** "To determine the number of blue cubes visible from the top, we first need to construct the top view of the stack. By synthesizing the information from the given front and right views, we can generate the top view as shown below."
* **Model Output (cont.):** "Examining the generated top view, we can count the number of visible blue cubes. There is exactly 1 blue cube visible. However, we must also consider the possibility of occluded cubes whose colors might be unknown from the provided front and right views."
* **Answer:** C
**Problem (c):**
* **Cube Colors:** Darkviolet and palegreen.
* **Question:** "A cube stack is made of equal-sized small cubes, mostly in palegreen with a few darkviolet ones. The image shows its isometric view from front-left, top view, and left view from left to right. How many cubes in darkviolet can possibly be seen from the right view?"
* **Answer Options:** A. 3, B. All three, C. 1, D. 2.
* **Model Output:** "To determine the number of darkviolet cubes visible from the right, we first need to construct the right view of the stack. The colors of some front, top, and left view might be unknown from the provided isometric view."
* **Model Output (cont.):** "Examining the generated right view, we can directly count the number of visible darkviolet cubes. There are 3 darkviolet cubes visible. Therefore, the correct answer is 3, which corresponds to option A."
* **Answer:** A
**Problem (d):**
* **Cube Colors:** Green and palegreen.
* **Question:** "A cube stack is made of equal-sized small cubes, mostly in seashell with a few green ones. The image shows its isometric view from front-left, top view, and left view from left to right. How many cubes in green can possibly be seen from the front view?"
* **Answer Options:** A. 2, B. All three, C. 1, D. 0.
* **Model Output:** "To determine the number of green cubes visible from the front, we first need to construct the front view of the stack. By synthesizing the provided isometric, top, and left views, we can generate the required front view."
* **Model Output (cont.):** "Examining the generated front view, we can directly count the number of visible green cubes. There is exactly 1 green cube visible. Therefore, the correct answer is 1, which corresponds to option C."
* **Answer:** C
### Key Observations
* Each problem follows the same structure: 3 views, a question, a model output explaining the reasoning, and answer options.
* The "Model Output" consistently describes a process of constructing the missing view based on the provided views.
* The answers to the four problems are A, C, A, and C; with only four problems, no meaningful pattern can be inferred.
* Problem (b) includes a "Ground-truth" indicator, suggesting a validation process.
### Interpretation
The image demonstrates a problem-solving approach to spatial reasoning and 3D visualization. The task requires the ability to mentally rotate and synthesize 2D projections of a 3D object to determine the number of visible elements of a specific color. The "Model Output" sections highlight a logical, step-by-step reasoning process. The inclusion of a "Ground-truth" indicator in problem (b) suggests that the solutions are being evaluated against a known correct answer, potentially for training or evaluating an AI model. The problems are designed to test the ability to infer hidden information and account for occluded elements. The consistent structure of the problems suggests a standardized testing or training methodology. The problems are testing the ability to perform mental rotation and spatial reasoning.
</details>
Figure 17: Showcases of reasoning generated by post-trained UMMs in the cube 3-view projection task. We mark correct and incorrect generated cube views with green and red borders, respectively. For incorrect generations, the corresponding ground-truth views are provided for reference (note that these are shown only for readers and are never provided to the models during reasoning).