# Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
**Authors**: Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long
> wujialong0229@gmail.com, mingsheng@tsinghua.edu.cn, zhangxiaoying.xy@bytedance.com
¹Tsinghua University, ²ByteDance Seed. *Work done at ByteDance Seed. †Corresponding authors.
(January 27, 2026)
Abstract
Humans construct internal models of the world and reason by manipulating the concepts within these models. Recent advances in artificial intelligence (AI), particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Current systems, which rely predominantly on verbal reasoning as their primary information-processing pathway, have achieved expert-level performance in formal and abstract domains such as mathematics and programming. However, they still lag far behind humans in domains such as physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though a clear consensus on their benefits has not yet been reached. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks, particularly those grounded in the physical world, visual generation more naturally serves as a world model, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of deliberate CoT reasoning and analyze distinctions among different forms of world models in terms of both informativeness and knowledge. Empirically, we identify and design tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Through controlled experiments on a state-of-the-art UMM, we show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling.
Conversely, it offers no clear advantage for tasks that do not require explicit visual modeling. Together, these insights and findings clarify the applicability and potential of multimodal world modeling and reasoning for more powerful, human-like multimodal AI. We publicly release our evaluation suite to facilitate further research.
Project Lead: Jialong Wu. Correspondence: Mingsheng Long, Xiaoying Zhang.
Project Page: https://thuml.github.io/Reasoning-Visual-World
1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Composite Image: World Models in Human Minds
### Overview
The image presents a conceptual overview of how world models are represented in human minds, contrasting verbal/symbolic and visual/imagery knowledge. It further illustrates reasoning processes using both verbal and visual world modeling in multimodal AI, providing examples of mathematical reasoning, travel planning, everyday activity planning, and real-world spatial reasoning.
### Components/Axes
**Top Section: World Models in Human Minds**
* **Title:** World Models in Human Minds
* **Left Sub-section:** World Model: Mental Model of the World
* Features a person thinking, a cloud containing a hierarchical diagram connected to a globe, and an arrow labeled "Feedback" pointing from the globe back to the person. The word "Approximate" is also present.
* **Right Sub-section:** Dual Representations of World Knowledge
* **Sub-titles:** Verbal/Symbolic Knowledge, Visual/Imagery Knowledge
* **Verbal/Symbolic Knowledge:** A graph with axes labeled 'y' and 'x'. A parabolic curve is plotted, represented by the equation y = ax² + bx + c. The equation F = ma is also present. Text "Dislike in Daily Life" is below the graph.
* **Visual/Imagery Knowledge:** A basketball court with a basketball moving through the air. A play button is superimposed on the image. Text "Prefer in Daily Life" is below the image.
**Middle Section: Reasoning with Verbal World Modeling in Multimodal AI**
* **Title:** Reasoning with Verbal World Modeling in Multimodal AI
* **Left Sub-section:** Mathematical Reasoning
* Presents a mathematical question and a step-by-step solution.
* **Question:** If a>1, then the sum of the real solutions of √(a - √(a+x)) = x is equal to
* **Response:** <think>
* To solve the equation √(a - √(a+x)) = x, let's start by squaring both...
* (√(a - √(a+x)))² = x² => a - √(a+x) = x²
* Rearrange to isolate the inner square root term:
* (a-x²)² = a+x => a² - 2ax² + (x²)² = a + x => x⁴ - 2ax² - x + (a² - a) = 0
* **Puzzle Solving:**
* <think>From S at [452, 59], the only available move is West to [359, 59] ......
* **Middle Sub-section:** Travel Planning
* Presents a travel planning task.
* **Task:** Plan a trip with a budget of $1,700.
* <think>STATE: Initial Budget: $1700, Spent: $0. Day 1: transportation: PENDING...Available: Flight F3573659: $474
* **ACTION:** Plan day 1 transportation. Select Flight F3573659.
* **NEXT STATE:** After this action, you've spent $474, leaving $1226 from your $1700 budget...
* Buttons labeled "State/Observation" and "Action" are present.
* **Right Sub-section:** Everyday Activity Planning
* **Goal:** Cooking tomato and eggs
* Presents a sequence of images showing eggs being cooked in a pan.
* <think>... ACTION: Cook the eggs in the pan STATE: This action changes the state of the eggs from liquid to a partially cooked state. The eggs are now partially cooked and are being transformed into curds. The heat from the pan cooks the eggs, and the stirring action ensures even cooking...
**Bottom Section: Reasoning with Visual World Modeling in Multimodal AI**
* **Title:** Reasoning with Visual World Modeling in Multimodal AI
* **Left Sub-section:** Real-World Spatial Reasoning
* **Question:** When you took the photo in Figure 1, where was the iron refrigerator located relative to you?
* Presents two images of a room.
* **Middle Sub-section:**
* <think>First, let's analyze the images. [...] It's not visible in that initial view, so I need to change my perspective.
* Presents an image of a room.
* **Right Sub-section:**
* The refrigerator is not visible in this 90-degree left turn view, [...] To be thorough, I will also check the view to the right.
* Presents an image of a room.
* **Rightmost Sub-section:**
* [...] My initial turn was 90 degrees left, but the refrigerator isn't at that exact angle. Let's try a smaller turn to the left.
* Presents an image of a room.
### Detailed Analysis or Content Details
* **Mathematical Reasoning:** The mathematical problem involves solving an equation with nested square roots. The solution attempts to isolate the square root terms and simplify the equation.
* **Travel Planning:** The travel planning task involves budgeting and selecting transportation options. The initial budget is $1700, and the first action involves planning day 1 transportation, costing $474.
* **Everyday Activity Planning:** The everyday activity planning task involves cooking tomato and eggs. The description focuses on the state changes of the eggs during the cooking process.
* **Real-World Spatial Reasoning:** The real-world spatial reasoning task involves determining the location of a refrigerator based on a series of images. The AI agent analyzes the images and adjusts its perspective to locate the refrigerator.
### Key Observations
* The image contrasts verbal/symbolic and visual/imagery knowledge representations.
* The examples demonstrate how AI can reason using both verbal and visual information.
* The real-world spatial reasoning example highlights the importance of perspective and visual analysis in problem-solving.
### Interpretation
The image illustrates the concept of world models in human minds and how AI can leverage both verbal and visual information to reason and solve problems. The examples demonstrate the potential of multimodal AI in various domains, including mathematical reasoning, travel planning, everyday activity planning, and real-world spatial reasoning. The contrast between verbal/symbolic and visual/imagery knowledge highlights the importance of integrating different modalities for effective AI systems. The real-world spatial reasoning example showcases the ability of AI to analyze visual information and adjust its perspective to solve spatial problems, mimicking human cognitive processes.
</details>
Figure 1: Overview of a world-model perspective on multimodal reasoning. (a) Humans construct mental models of the world, representing information and knowledge through two complementary channels–verbal and visual–to support reasoning, planning, and decision-making. (b) Recent advances in large language models (LLMs) and vision language models (VLMs) largely rely on verbal chain-of-thought reasoning, leveraging primarily verbal and symbolic world knowledge. (c) Unified multimodal models (UMMs) open a new paradigm by using visual generation for visual world modeling, advancing more human-like reasoning on tasks grounded in the physical world. Examples of reasoning with verbal world modeling are adapted from Guo et al. [18], Du et al. [14], Chen et al. [9], Zhang et al. [72].
Humans construct internal mental models of the external world that represent objects and concepts, along with their relationships, structures, and operational mechanisms [11, 16]. These models support reasoning and decision-making by enabling mental simulation, allowing individuals to anticipate the outcome of actions without actually taking them [40]. For example, if a glass of water is spilled on the table, people can rapidly mentally simulate the ensuing events: the water falling downward, spreading across the surface, and potentially dripping onto the floor. Such predictions lead them to quickly move valuable items away or reach for a towel. Beyond physical systems, mental models also extend to any domain where relational structures can be simulated, such as mathematics and logic [31, 32], making them fundamental to how humans understand and interact with all aspects of the world.
Cross-disciplinary researchers in philosophy, psychology, cognitive science, and related fields have a long history of developing computational models of human mental models [44]. Among them, artificial intelligence (AI) shares a core ambition of building machines that reason like people. Although debates remain, recent breakthroughs, especially in large language models (LLMs) and chain-of-thought (CoT) reasoning, have made a substantial step towards approximating human reasoning grounded in mental models of the world, often referred to as world models [24, 34] in the AI literature. During chain-of-thought reasoning, LLMs explore, reflect, and backtrack within the structured solution space, guided by world knowledge acquired through large-scale pre-training. These capabilities have already driven progress in diverse domains, including programming [18], mathematics [57, 18], scientific discovery [53], clinical medicine [58], and robotics [42].
Such reasoning capabilities have also been extended to multimodal AI systems, particularly vision language models (VLMs) [28, 6, 19, 70]. These systems typically incorporate visual inputs by aligning visual representations with the embedding space of LLMs, resulting in reasoning that remains primarily constrained to a linguistic space. In contrast, human mental models operate over multiple forms of mental representations. Dual-coding theory [45] suggests that the mind processes information through two complementary codes: verbal and imagery (particularly visual) representations. These pathways can function independently but often collaborate to support reasoning. Indeed, visual imagery has been shown to have advantages over words in memory encoding and retrieval [33]; and individuals with aphantasia, who lack the ability to visualize mental imagery, exhibit worse performance on tasks such as visual search [43]. This evidence from psychology and cognitive science therefore suggests that the absence of a dedicated visual information pathway may explain why current multimodal AI systems excel in formal and abstract domains dominated by verbal world knowledge, yet continue to fall far short of human performance on tasks involving physical and spatial reasoning [49, 8], which fundamentally depend on visual world modeling.
Next-generation multimodal AI systems are evolving to be built upon unified multimodal models (UMMs) [54, 63, 62, 13], which seamlessly integrate both verbal and visual generation capabilities. The newly introduced visual generation component offers the potential to explicitly realize visual world modeling, a critical element of multimodal world models in human-like reasoning that current systems largely lack. This naturally makes us ponder: Can current UMMs truly leverage their visual generation capability to enhance reasoning and thereby narrow the performance gap between multimodal AI and humans? A growing body of preliminary research [36, 77, 38, 76, 17] has begun exploring this question from different perspectives. However, the findings so far remain inconclusive. Reported empirical results are mixed, showing no consistent trends that visual generation reliably improves reasoning performance. Moreover, the evaluation tasks used in current studies are designed heuristically, lacking a principled basis for understanding when and how visual generation can meaningfully contribute to multimodal reasoning.
In this paper, we present the first principled study of when and how visual generation benefits reasoning from a world-model perspective (see Figure 1), making both theoretical and empirical contributions.
Theoretically, we rigorously bridge the concepts of world models and reasoning. (1) World model formulations: We formulate multimodal world models to approximate the underlying multi-observable Markov decision processes (MOMDP) of tasks, and define two fundamental capabilities of world models, namely world reconstruction and world simulation. (2) World model-based reasoning: To realize world models for reasoning, we introduce three reasoning formulations. Two rely solely on verbal CoTs through implicit or verbal world modeling, while the third interleaves verbal and visual CoTs that explicitly incorporate visual generation as a form of visual world modeling. (3) The visual superiority hypothesis: Under this framework, we analyze the distinctions among different world models, highlighting the richer informativeness and complementary prior knowledge afforded by visual world modeling. These insights motivate our central hypothesis that visual world modeling is superior for certain tasks, particularly those grounded in the physical world.
Empirically, we validate these insights through a series of controlled experiments. (4) The VisWorld-Eval suite: We identify and design tasks that specifically isolate and demand each atomic world model capability, forming a new evaluation suite to facilitate future research. This suite, VisWorld-Eval, collects seven tasks spanning both synthetic and real-world domains. (5) Empirical evaluation: Experiments with a state-of-the-art UMM [13] on VisWorld-Eval reveal findings consistent with our insights and theoretical analysis. In tasks where verbal world modeling suffers from representational bottlenecks or insufficient prior knowledge, interleaved CoT delivers substantial performance improvements. By contrast, it offers no clear advantages in tasks such as mazes and Sokoban, whose simple states do not require explicit visual world modeling. We further conduct dedicated analyses, including evidence revealing emergent implicit world modeling in the maze task.
We hope this work provides early evidence for the central role of multimodal world models in general-purpose AI, in which complementary verbal and visual knowledge emerge from generative modeling across modalities, with the latter being especially valuable for bringing human-like intelligence into the physical world.
2 Related Work
World models. The field of world models is rapidly evolving, yet remains far from reaching consensus on definitions or methodologies. Although psychology and cognitive science suggest that human mental models rely on compact representations that discard irrelevant details, how to scale approaches capable of learning such abstract representations [48, 26, 34] to arbitrary domains and modalities is still unclear. Consequently, most current techniques preserve complete information of observations, either through reconstructable latent representations [24, 25] or directly at the level of raw data. Prominent examples include modern video generation world models [12, 1, 2, 64], which capture concrete pixel-level dynamics. In contrast, language inherently provides a higher level of abstraction, making it more similar to human mental representations [60, 65, 59, 71, 9]. This motivates the promise of unified multimodal models that generate both language and visuals as a new direction for building more human-like world models.
Unified multimodal models. Multimodal understanding [28, 6, 19] and visual generation [47, 50] have long developed in isolation. Recently, there has been growing interest in integrating these two capabilities into a single unified model. This can be straightforwardly achieved by forwarding the representations of vision language models to an external visual generation module [56, 46]. A more unified approach is to model both language and visual modalities within a single backbone. While language is predominantly modeled through autoregressive next-token prediction, the design space of visual modalities spans a wide spectrum, from discrete tokenization with autoregressive [62, 54, 63] or masked modeling [66, 22], to continuous tokenization with diffusion or flow-based modeling [75, 41, 13]. Among these efforts, BAGEL [13] is one of the most widely adopted open-source models achieving state-of-the-art performance. Despite substantial progress in building unified multimodal models (UMMs), existing evaluations still primarily assess their understanding and generation capabilities separately. One widely recognized advantage of UMMs lies in leveraging reasoning abilities to handle complex instructions, enhancing visual generation or editing [74, 21]. Yet when and how visual generation, in turn, enhances reasoning remains insufficiently explored, lacking solid empirical evidence and community consensus.
Benchmarking visual generation for reasoning. This paper contributes to a growing line of research on visual generation for reasoning. RealUnify [52] and Uni-MMMU [77] design tasks in which generation is expected to enhance reasoning, but report mixed results without revealing clear trends regarding the benefits of visual generation. ROVER [38] reveals fundamental limitations of current models in generating meaningful visual reasoning steps, often resulting in minimal or even negative gains in final accuracy. In contrast, MIRA [76] conducts a sanity test by providing manually annotated visual cues, thereby bypassing the evaluation of visual world modeling capability. While the aforementioned works evaluate zero-shot performance, ThinkMorph [17] fine-tunes UMMs to reveal emergent reasoning behaviors but restricts each CoT to a single intermediate image, thereby not fully exploiting the potential of interleaved CoT. Our work distinguishes itself through a world-model perspective that enables a principled investigation, allowing us to both demonstrate and systematically explain when visual generation yields positive gains and when it does not.
3 A World Model Perspective on Multimodal Reasoning
Inspired by the aforementioned connections between human cognition and artificial intelligence, we formalize our world-model perspective on multimodal reasoning (see Figure 2) in this section.
3.1 Formulation: Multiple Observations of the World
Without loss of generality, the world of a specific task can be formulated as a multi-observable Markov decision process (MOMDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},p,\Phi,\mathcal{O}_{\phi},e_{\phi})$ , where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $p$ the transition function, $\Phi$ the parameter space of observation functions, $\mathcal{O}_{\phi}$ the observation space, and $e_{\phi}$ the observation function. Each $s\in\mathcal{S}$ represents the underlying state of the world, which is typically hidden and not directly observable. Instead, it can be perceived through different instantiations of observations (hereafter also referred to as views) [27], given by $o=e_{\phi}(s)\in\mathcal{O}_{\phi}$ , parameterized by $\phi\in\Phi$ . As illustrated in Figure 2 a, such views can span multiple modalities—for example, visual observations corresponding to different camera poses, or verbal descriptions expressed with different emphases or styles. When an action $a\in\mathcal{A}$ is applied to the current state, the world transitions according to the dynamics $s^{\prime}\sim p(s^{\prime}|s,a)$ and yields new observations.
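The MOMDP formulation can be made concrete with a toy example. Below is a minimal Python sketch using the cube-stack world of Figure 2 as an illustration; the names (`transition`, `front_view`, `top_view`) and the restriction to deterministic add/remove actions are our own simplifications, not part of the paper's formalism.

```python
from typing import FrozenSet, Tuple

# Minimal sketch of the MOMDP M = (S, A, p, Phi, O_phi, e_phi), with the
# cube-stack world of Figure 2 as a toy example. All names are illustrative.

State = FrozenSet[Tuple[int, int, int]]    # hidden state s: occupied unit-cube cells
Action = Tuple[str, Tuple[int, int, int]]  # e.g. ("add", (2, 0, 0))

def transition(s: State, a: Action) -> State:
    """Deterministic dynamics p(s' | s, a): add or remove one unit cube."""
    op, cell = a
    cells = set(s)
    if op == "add":
        cells.add(cell)
    else:
        cells.discard(cell)
    return frozenset(cells)

def front_view(s: State) -> FrozenSet[Tuple[int, int]]:
    """Observation function e_phi for phi = 'front': project onto the (x, z) plane."""
    return frozenset((x, z) for (x, y, z) in s)

def top_view(s: State) -> FrozenSet[Tuple[int, int]]:
    """e_phi for phi = 'top': project onto the (x, y) plane."""
    return frozenset((x, y) for (x, y, z) in s)

# Different parameters phi yield different partial views o = e_phi(s) of the
# same hidden state; neither projection alone determines s.
s = frozenset({(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)})
s_next = transition(s, ("add", (2, 0, 0)))
```

Note that `front_view(s)` and `top_view(s)` here produce the same 2D set even though they project different axes, which is exactly the partial-observability the formulation captures: recovering $s$ requires combining multiple views.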
3.2 Atomic Capabilities of World Models
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Multiple Observations and World Models
### Overview
The image presents a series of diagrams illustrating different aspects of world observation, reconstruction, and simulation using cube stacks. It covers multiple observation types, Markov decision processes, atomic capabilities of world models, and chain-of-thought formulations.
### Components/Axes
**Section 1: Multiple Observations of the World**
* **Verbal Observations:** An arrow points from the text "Verbal Observations" to a 3D cube stack.
* **Visual Observations:** An arrow points from the text "Visual Observations" to a 2D representation of the same cube stack.
* **Cube Stack Description 1:** "A stack of cubes with an L-shaped front view and an inverted L-shaped right view." This is accompanied by 2D projections of the cube stack.
* **Cube Stack Description 2:** "A stack of unit cubes positioned at (0,0,0), (1,0,0), (0,1,0), and (0,0,1)." This is also accompanied by 2D projections of the cube stack.
* **Multi-Observable Markov Decision Process:**
* Observations: Oφ1, Oφ2, Oφ3, O'φ1, O'φ2, O'φ3
* States: S, S'
* Action: a
**Section 2: Atomic Capabilities of World Models**
* **World Reconstruction:**
* Top view: 2D projection of a cube stack.
* Front view: 2D projection of a cube stack.
* Right view: 2D projection of a cube stack.
* World Model: A pink box labeled "World Model".
* Front-right view: 3D projection of a cube stack.
* Back view: 2D projection of a cube stack.
* Coordinates: (0,0,0), (1,0,0), (0,1,0), (0,0,1).
* **World Simulation:**
* 3D projection of a cube stack.
* World Model: A pink box labeled "World Model".
* 3D projection of a cube stack.
* Coordinates: (0,0,0), (1,0,0), (0,1,0), (0,0,1), (2,0,0).
**Section 3: World Model-Based Chain-of-Thought Formulations**
* **Question:** "Given the three views of a cube stack [Top, Front, Right], how can we modify the stack to match the desired back view? [Back view]"
* **World Reconstruction:**
* Top view: 2D projection of a cube stack.
* Front view: 2D projection of a cube stack.
* Right view: 2D projection of a cube stack.
* Reconstruct the full structure: 3D projection of a cube stack.
* Imagine the back view: 2D projection of a cube stack.
* Get the answer: Put at (2,0,0): 2D projection of a cube stack.
* **World Simulation:**
* Try put a new cube: 3D projection of a cube stack.
* Wait, retry another choice: 2D projection of a cube stack.
* Imagine the back view: 2D projection of a cube stack.
### Detailed Analysis or ### Content Details
**Section 1:**
* The "Verbal Observations" and "Visual Observations" both refer to the same cube stack, suggesting two different ways of perceiving the same object.
* The "Multi-Observable Markov Decision Process" illustrates a state transition model with observations, states, and actions.
**Section 2:**
* "World Reconstruction" shows how different views of an object can be used to create a world model and then infer the back view.
* "World Simulation" shows how a world model can be used to simulate different configurations of the object.
**Section 3:**
* The "World Model-Based Chain-of-Thought Formulations" section presents a problem-solving approach using world models. It involves reconstructing the full structure from three views, imagining the back view, and then either putting a new cube or retrying another choice.
### Key Observations
* The diagrams use 2D and 3D projections to represent cube stacks.
* The "World Model" is a central component in both reconstruction and simulation.
* The chain-of-thought formulation involves iterative steps of reconstruction, simulation, and decision-making.
### Interpretation
The image illustrates the concept of building and using world models for object understanding and manipulation. It demonstrates how different observations can be integrated into a coherent model, and how this model can be used for tasks such as reconstructing hidden views or simulating the effects of actions. The chain-of-thought formulation highlights the iterative and reasoning-based nature of problem-solving using world models. The diagrams suggest a system that can perceive an object from multiple viewpoints, create an internal representation of it, and then use that representation to reason about its properties and how it can be modified.
</details>
Figure 2: Theoretical formulation of the world model perspective on multimodal reasoning. (a) Observations of the same underlying world state can span multiple modalities, including verbal and visual observations, each reflecting different views or emphases. (b) Two atomic capabilities of world models are defined: world reconstruction, which infers complete structure from partial observations and enables novel view synthesis, and world simulation, which models dynamics to predict future observations. (c) Chain-of-thought reasoning includes internal world modeling, by explicitly maintaining an evolving sequence of observations, generated through either of the atomic world model capabilities.
A world model, analogous to human mental models, is then expected to support two fundamental capabilities [34], illustrated in Figure 2 b. The first is called world reconstruction. Humans are remarkably skilled at mentally reconstructing the structure of an environment from only a few partial observations [71], grounded in their prior knowledge of the world. Such mental reconstruction allows them to imagine novel views of the same underlying state, supporting skills such as mental rotation. Formally, the perception component of a world model encodes $n$ observations from limited views into an internal representation: $\hat{s}=\operatorname{enc}(o_{\phi_{1}},...,o_{\phi_{n}})\approx s$ . This representation approximates the true state and can then be decoded to generate an unseen observation: $\hat{o}_{\phi_{n+1}}=\mathrm{dec}(\hat{s},\phi_{n+1})\approx e_{\phi_{n+1}}(s)$ , providing an internal "experience" of navigating the world. (We set aside the debate between compact and comprehensive representations: by treating abstract observations, e.g., sketches, and high-fidelity observations as different view specifications, this formulation allows the internal representation to flexibly adjust to the level of detail required by the desired views.) In modern generative models, including UMMs, whose latent representations are not explicitly defined, the world reconstruction capability can be realized through end-to-end novel view generation:
$$
\displaystyle p_{\theta}(o_{\phi_{n+1}}\mid o_{\phi_{1}},\dots,o_{\phi_{n}}), \tag{1}
$$
which implicitly learns the internal representations required to synthesize the new view.
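Because a UMM exposes no explicit latent state, Eq. (1) reduces in practice to conditional generation over a context of observations. The sketch below illustrates this interface; `UnifiedModel` and its `generate` method are hypothetical stand-ins for a model API, not an actual library.

```python
from typing import List, Protocol

class UnifiedModel(Protocol):
    """Hypothetical interface for a unified multimodal model (illustrative only)."""
    def generate(self, context: List[dict], target_modality: str) -> dict: ...

def reconstruct_view(model: UnifiedModel,
                     observations: List[dict],
                     target_view: str) -> dict:
    """Sample o_{phi_{n+1}} ~ p_theta(. | o_{phi_1}, ..., o_{phi_n}), as in Eq. (1).

    No explicit state s-hat is ever materialized; the representation needed
    to render the unseen view is learned end-to-end inside the model.
    """
    context = observations + [
        {"type": "text", "content": f"Render the {target_view} view."}
    ]
    return model.generate(context, target_modality="image")
```

The view parameter $\phi_{n+1}$ is conveyed here as a text instruction appended to the context, matching how UMMs condition generation on interleaved inputs.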
The second capability is world simulation. Humans can mentally simulate how the world evolves into the future, supporting reasoning and decision-making, either purely in their minds or with external aids such as a scratchpad. Formally, this corresponds to the prediction component of a world model, which predicts the next state from the current state and action: $\hat{s}^{\prime}\sim\operatorname{pred}(\hat{s},a)$ , providing an internal "experience" of interacting with the world. Similarly, for modern generative models, this capability is more typically realized through prediction of future observations:
$$
\displaystyle p_{\theta}(o_{t+1}\mid o_{\leq t},a_{\leq t}). \tag{2}
$$
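Eq. (2) can be rolled out autoregressively to simulate an entire trajectory. A minimal sketch, where `predict_next` is a stand-in for a learned model $p_{\theta}(o_{t+1}\mid o_{\leq t},a_{\leq t})$ and the toy dynamics are ours:

```python
from typing import Callable, List

def rollout(predict_next: Callable[[List, List], object],
            o0, actions: List) -> List:
    """Simulate o_0, o_1, ..., o_T under a fixed action sequence, per Eq. (2)."""
    observations = [o0]
    executed = []
    for a in actions:
        executed.append(a)
        # Condition on the full history, matching p_theta(o_{t+1} | o_{<=t}, a_{<=t}).
        observations.append(predict_next(observations, executed))
    return observations

# Toy dynamics for illustration: observations are positions on a line and
# actions are displacements applied to the most recent observation.
trajectory = rollout(lambda obs, acts: obs[-1] + acts[-1], 0, [1, 2, 3])
```

In a UMM, `predict_next` would be an image (or text) generation call conditioned on the interleaved history, so the same loop covers both verbal and visual world simulation.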
In our new evaluation suite, we deliberately curate tasks that specifically demand each capability, allowing us to independently validate its contribution to multimodal reasoning (see Section 4.1).
3.3 Deliberate Reasoning with World Modeling Across Modalities
We then formalize how world-modeling capabilities within multimodal models contribute to reasoning. Given a question $Q$ and input images $I$ , the chain-of-thought reasoning process of a multimodal AI system can be expressed as a sequence of intermediate steps (or thoughts) $R=\tau_{1},\tau_{2},...,\tau_{H}$ , followed by the answer $A$ . Although this general formulation treats each reasoning step $\tau_{i}$ as an unconstrained, free-form operation, our world model perspective suggests that humans reason by prediction and planning, and each step inherently manipulates the underlying world observations of the problem [59, 10, 72]. We therefore refine the reasoning formulation as $\tau_{i}=(r_{i},o_{i})$ to explicitly incorporate an evolving sequence of observations:
$$
\displaystyle R=\left(r_{1},o_{1}\right),\left(r_{2},o_{2}\right),\dots,\left(r_{H},o_{H}\right), \tag{3}
$$
where $r_{i}$ denotes a logical reasoning step based on the accumulated context, typically expressed in text, and $o_{i}$ denotes the observation generated at that step. (We use $i$ to index reasoning steps in order to distinguish them from the true time step $t$ of the underlying MOMDP; the two are not generally aligned, as we may include branching and backtracking in the reasoning.) Specifically, the input images serve as the initial observation $o_{0}=I$ , and subsequent observations are generated from previous reasoning and observations, by invoking atomic world modeling capabilities: world reconstruction (Eq. (1)) and world simulation (Eq. (2)), where reasoning steps imply actions $a$ and view transformations $\phi$ , as illustrated in Figure 2 c.
This formulation is modality-agnostic, allowing observations—and thus world modeling—to arise across various modalities. We focus specifically on verbal and visual observations, motivated by dual-coding theory in human cognition and by the fact that UMMs are equipped to generate both. This yields several concrete CoT instantiations. Specifically, verbal world modeling produces purely verbal CoTs, with $o_{i}$ as verbal descriptions, whereas visual world modeling produces verbal-visual interleaved CoTs, with $o_{i}$ as generated images. In addition, prior work has discovered that language models can implicitly learn world models with emergent internal representations of board-game states without explicit supervision [37]. Motivated by this, we also consider implicit world modeling, in which no explicit observation is generated ($o_{i}=\emptyset$). (In practice, strictly distinguishing implicit from verbal world modeling can be difficult, because the reasoning part $r_{i}$ often contains partial descriptions of the current state. In this work, we treat verbal world modeling as explicitly expressing world states or observations in text, such as coordinates or symbolic matrices.)
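The reasoning process of Eq. (3) can be sketched as a simple control loop in which each step $\tau_i=(r_i,o_i)$ pairs a verbal reasoning step with an optionally generated observation. The `think` and `world_model` callables below are hypothetical stand-ins for a UMM's text and image generation; a step that requests no observation corresponds to implicit world modeling ($o_i=\emptyset$).

```python
from typing import Callable, List, Optional, Tuple

def interleaved_cot(think: Callable[[List], Tuple[str, Optional[str]]],
                    world_model: Callable[[List, str], object],
                    question: str, image: object,
                    max_steps: int = 8) -> List[Tuple]:
    """Run R = (r_1, o_1), ..., (r_H, o_H) with o_0 = I, as in Eq. (3)."""
    trace: List[Tuple] = [("<question>", question), ("<o_0>", image)]
    for _ in range(max_steps):
        # Verbal step r_i; may request an observation (a view phi or action a).
        r_i, request = think(trace)
        # Invoke world reconstruction/simulation only when requested;
        # o_i = None corresponds to implicit world modeling.
        o_i = world_model(trace, request) if request else None
        trace.append((r_i, o_i))
        if r_i.startswith("ANSWER"):
            break
    return trace
```

Whether `world_model` returns text or an image determines which CoT instantiation results: verbal descriptions yield purely verbal CoTs, generated images yield interleaved CoTs, under the same control flow.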
3.4 The Visual Superiority Hypothesis
Contemporary LLMs and VLMs have achieved impressive performance in structured and abstract domains, such as mathematics and programming, largely driven by large-scale language-centric pre-training and verbal chain-of-thought post-training. Although these models have accumulated extensive verbal and symbolic knowledge, their understanding of the visual world remains limited when trained under purely verbal supervision. As a result, they continue to struggle with tasks grounded in basic physical and spatial intuition that even young children naturally master [49, 8].
Visual world modeling is therefore essential for endowing multimodal AI with complementary forms of information and knowledge. (1) In terms of informativeness, while verbal and symbolic representations capture high-level semantic abstractions, they often suffer from ambiguity and representational bottlenecks. In contrast, visual observations are more concrete and information-rich, directly encoding physical properties such as motion and spatial relationships. This provides precise, fine-grained grounding for reasoning about the complex real world, particularly in spatial and physical tasks. (2) In terms of prior knowledge, visual world knowledge is inherently complementary to symbolic knowledge. Humans and animals acquire much of this knowledge (e.g., physical interactions and spatial transformations) through perception, largely independent of language. Consequently, humans naturally represent and communicate such knowledge visually—for example, by sketching an approximate parabolic trajectory without performing explicit calculations. This suggests that different aspects of world knowledge are concentrated in different data modalities, and learning from large-scale generative modeling of visual data can thereby expand the effective knowledge landscape available for multimodal reasoning.
We next formalize and justify these insights through theoretical analysis, with formal statements and proofs provided in Appendix 7.
Informativeness. For notational convenience, we denote the question $Q$ as $r_{0}$ , the input images as $o_{0}$ , and the final answer as $r_{H+1}$ . Prefixes of a CoT are defined as $R_{i}=(r_{0},o_{0},r_{1},o_{1},...,r_{i-1},o_{i-1}),\tilde{R}_{i}=(r_{0},o_{0},r_{1},o_{1},...,r_{i-1},o_{i-1},r_{i})$ . We use $\mathbb{H}(·)$ and $\mathbb{I}(·;·)$ to denote Shannon entropy and mutual information, respectively. We first establish that the end-to-end answer error admits an upper bound that naturally decomposes into reasoning and world-modeling errors.
**Theorem 1**
*Let $p$ denote the distribution over optimal chain-of-thoughts and answers, and let $p_{\theta}$ be a learned reasoning model. Then the following inequality holds:
$$
\operatorname{KL}(p(A\mid Q,I)\mid\mid p_{\theta}(A\mid Q,I))\leq\operatorname{KL}(p(R,A\mid Q,I)\mid\mid p_{\theta}(R,A\mid Q,I))=\sum_{i=1}^{H+1}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(r_{i}\mid R_{i})\mid\mid p_{\theta}(r_{i}\mid R_{i}))\right]}_{\textnormal{reasoning errors}}+\sum_{i=1}^{H}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(o_{i}\mid\tilde{R}_{i})\mid\mid p_{\theta}(o_{i}\mid\tilde{R}_{i}))\right]}_{\textnormal{world-modeling errors}}. \tag{4}
$$*
This decomposition reveals a fundamental trade-off between the informativeness of world models for reasoning and the fidelity of the world model itself. In the case of implicit world modeling, where $o_{i}=\emptyset$, the world-modeling error vanishes. However, this typically comes at the cost of increased uncertainty and learning difficulty in reasoning, as all state transitions must be implicitly encoded. Empirically, world models that explicitly track the task states, serving as verbal or visual sketchpads, are generally beneficial for reasoning. We next examine the reasoning component of Eq. (4) to elucidate the factors underlying these benefits.
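The structure of Theorem 1 can be verified numerically on a toy model: the answer-level KL is bounded by the joint KL, which in turn decomposes exactly into per-step reasoning and world-modeling errors via the chain rule. The sketch below uses random joint distributions over a single reasoning step $r_1$, observation $o_1$, and answer $A$ as synthetic stand-ins for the "optimal" and learned models.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / x.sum()

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Random "optimal" and "learned" joints p(r1, o1, A) and q(r1, o1, A).
p = normalize(rng.random((3, 3, 3)))
q = normalize(rng.random((3, 3, 3)))

# Answer-level error vs. joint (chain-of-thought-level) error.
kl_answer = kl(p.sum(axis=(0, 1)), q.sum(axis=(0, 1)))
kl_joint = kl(p, q)

# Chain rule: KL(joint) = KL(r1) + E[KL(o1|r1)] + E[KL(A|r1,o1)],
# i.e., a reasoning error, a world-modeling error, and an answer error.
p_r, q_r = p.sum(axis=(1, 2)), q.sum(axis=(1, 2))
term_r = kl(p_r, q_r)
term_o = float(np.sum(p.sum(axis=2) *
                      np.log((p.sum(axis=2) / p_r[:, None]) /
                             (q.sum(axis=2) / q_r[:, None]))))
term_a = float(np.sum(p * np.log((p / p.sum(axis=2, keepdims=True)) /
                                 (q / q.sum(axis=2, keepdims=True)))))

assert kl_answer <= kl_joint + 1e-8  # marginalization can only reduce KL
assert abs(kl_joint - (term_r + term_o + term_a)) < 1e-8
```

The first assertion instantiates the inequality in Eq. (4); the second checks the exact decomposition of the joint KL.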
**Theorem 2**
*Let $s_{i}$ denote the latent states associated with the observations $o_{i}$ . Under appropriate assumptions, the reduction in reasoning uncertainty achieved by explicit world modeling satisfies the following properties:
1. Reasoning uncertainty does not increase: $\mathbb{H}(r_{i}\mid o_{0},r_{0:i-1})-\mathbb{H}(r_{i}\mid R_{i})=\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})\geq 0.$
2. The reasoning uncertainty improvement is bounded by both (i) the information that observations provide about the underlying states and (ii) the information that the reasoning step requires about those states:
$$
\mathbb{I}(o_{1:i-1};r_{i}|o_{0},r_{0:i-1})\leq\min\left(\mathbb{I}(o_{1:i-1};s_{1:i-1}),\mathbb{I}(r_{i};s_{0:i-1},r_{0:i-1})\right). \tag{5}
$$*
The uncertainty of the target distribution is closely related to sample efficiency and learning difficulty. Consequently, the upper bound on the improvement of reasoning uncertainty (Eq. (5)) highlights another trade-off in the choice of observation modality for world modeling. The first term indicates that observations should be sufficiently informative about the underlying latent states. In contrast, the second suggests that they need only preserve the task-relevant aspects of the states required to select appropriate reasoning steps. Excessively detailed observations may be unnecessary and even detrimental, increasing world modeling errors.
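The data-processing structure behind Eq. (5) can likewise be checked on a toy discrete model in which the observation $o$ and the reasoning step $r$ are conditionally independent given the latent state $s$; the distributions below are random stand-ins for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def row_normalize(x):
    return x / x.sum(axis=1, keepdims=True)

def mutual_info(joint):
    """Mutual information of a 2D joint distribution (in nats)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

# Markov structure o <- s -> r: observation and reasoning step both
# depend on the latent state only.
p_s = rng.random(4); p_s /= p_s.sum()
p_o_given_s = row_normalize(rng.random((4, 4)))
p_r_given_s = row_normalize(rng.random((4, 4)))
p_sor = p_s[:, None, None] * p_o_given_s[:, :, None] * p_r_given_s[:, None, :]

I_or = mutual_info(p_sor.sum(axis=0))  # I(o; r): usable uncertainty reduction
I_so = mutual_info(p_sor.sum(axis=2))  # I(s; o): what observations reveal
I_sr = mutual_info(p_sor.sum(axis=1))  # I(s; r): what reasoning requires

# The improvement is bounded by both sides, mirroring Eq. (5).
assert I_or <= min(I_so, I_sr) + 1e-9
```

The bound is tight only when the observation channel preserves exactly the task-relevant state information, which is the trade-off discussed above.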
Prior knowledge. Although visual world models are more informative, they are intrinsically more difficult to learn from scratch due to the high dimensionality and complexity of visual observations. Fortunately, modern AI systems are typically large-scale pre-trained, which endows them with strong prior knowledge and enables faster convergence and improved generalization during downstream post-training. As discussed earlier, humans tend to represent different aspects of world knowledge through different modalities. Consequently, for a given downstream task, the distribution shift between its transition distribution and that learned during large-scale Internet pre-training can vary substantially across modalities. The generalization bound in Theorem 6 of Appendix 7.2 suggests that this modality-dependent distribution shift is closely related to the post-training sample efficiency of the corresponding world model. This highlights the importance of acquiring broad prior knowledge across modalities during pre-training, and of leveraging the proper modality whose priors are best aligned with the downstream task.
Drawing on the above analysis, we formulate our central hypothesis regarding when and how visual generation benefits reasoning, thereby helping narrow the gap between multimodal AI and human capabilities.
The Visual Superiority Hypothesis: In multimodal reasoning tasks grounded in the physical world, visual generation as a world model yields representations that are more informative and knowledge-rich than those produced by verbal world models.
4 Experiment Settings
Finally, we empirically validate the insights and theoretical analyses presented above through a series of controlled experiments. In this section, we describe the evaluation tasks and model training procedures.
4.1 VisWorld-Eval: Task Suite for Reasoning with Visual World Modeling
While prior work has primarily designed evaluation tasks heuristically, we take a principled approach, evaluating multimodal reasoning across tasks designed to target specific world-model capabilities. Building on related benchmarks, we identify and curate a total of seven tasks, forming an evaluation suite tailored to assess reasoning with visual world modeling. All tasks are framed as question answering with concise, verifiable answers, and performance is measured by answer accuracy. We refer to this suite as VisWorld-Eval, and summarize it in Figure 3.
World simulation. We consider the following tasks that primarily require simulating world dynamics over time: (1) Paper folding: Adapted from SpatialViz-Bench [61], this task presents a sequence of paper folds followed by hole punching, and asks for the distribution of holes after the paper is unfolded. Successfully solving this task requires simulating the unfolding process, relying on prior knowledge of symmetry and spatial transformations that is commonly grounded in visual experience. (2) Multi-hop manipulation: Built upon CLEVR [30], this task features a scene containing objects with various shapes and colors that undergo a sequence of operations, such as addition, removal, or color changes. The final question queries properties of the resulting layout. Since the target objects of operations are often specified via relative spatial relationships, this task places strong demands on state tracking and spatial understanding. (3) Ball tracking: Adapted from RBench-V [20], this task evaluates physical dynamics simulation by requiring the model to infer the trajectory of a ball undergoing ideal specular reflections within a given scene and predict which numbered hole it will ultimately enter. In addition, we include (4) Maze [29] and (5) Sokoban [55], as these two grid-world tasks are commonly used in prior work studying visual generation for reasoning [67, 36].
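As a concrete illustration of the dynamics the ball-tracking task demands, a minimal simulator for a diagonally moving point ball with ideal specular reflections might look as follows; the grid size, start position, and exit convention are illustrative assumptions, not the benchmark's actual configuration.

```python
# Minimal sketch of ball-tracking dynamics: a point ball moving diagonally
# on an integer grid, reflecting specularly off the side walls, until it
# first reaches the top edge (where the numbered holes sit).

def first_top_hit(x, y, dx, dy, width, height):
    assert dy > 0, "the ball must move upward to reach the top edge"
    while y < height:
        nx = x + dx
        if nx < 0 or nx > width:  # specular reflection: flip horizontal velocity
            dx = -dx
            nx = x + dx
        x, y = nx, y + dy
    return x  # horizontal exit position, i.e., which top hole is entered

# Ball starts at (1, 0) heading up-right in a 6-wide, 4-tall box.
exit_x = first_top_hit(1, 0, 1, 1, width=6, height=4)
```

Verbal reasoning must carry out exactly this coordinate arithmetic step by step, whereas a visual world model can depict each reflected segment directly.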
World reconstruction. We also evaluate tasks that emphasize reconstructing underlying world structure from partial observations: (6) Cube 3-view projection: Adapted from SpatialViz-Bench [61], this task provides an isometric view and two orthographic views of a connected cube stack, and asks about an unseen viewpoint. Solving the task requires reconstructing the full 3D structure and mentally rotating or projecting it into the queried view, a process closely aligned with human visual mental representations. (7) Real-world spatial reasoning: We focus on the positional relationship subset of MMSI-Bench [69]. Given multiple views of a realistic scene, these tasks ask about positional relationships among the cameras, objects, and regions. Successfully answering these questions requires constructing a coherent spatial mental model of the scene from limited viewpoints to support accurate spatial reasoning.
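The projection structure underlying the cube 3-view task can be sketched with a boolean voxel grid, where each orthographic view collapses one axis; the particular cube stack and axis conventions below are illustrative assumptions.

```python
import numpy as np

# Illustrative cube stack as a boolean voxel grid with axes (x, y, z).
stack = np.zeros((3, 3, 3), dtype=bool)
stack[0, 0, :2] = True   # a column of two cubes
stack[1, 0, 0] = True
stack[1, 1, 0] = True

# Orthographic views: project (logical OR) along one axis each.
front_view = stack.any(axis=1)  # (x, z) silhouette
right_view = stack.any(axis=0)  # (y, z) silhouette
top_view   = stack.any(axis=2)  # (x, y) silhouette

# The queried back view is the horizontal mirror of the front view, the
# kind of transformation that symbolic character matrices represent poorly.
back_view = front_view[::-1, :]
```

Answering a query about an unseen viewpoint amounts to reconstructing `stack` from the given views and projecting it along the queried axis.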
For each task, we construct SFT data by designing different CoT patterns with implicit, verbal, or visual world modeling, enabling controlled comparative evaluations. Data construction pipeline and examples across tasks are presented in Appendix 8.1.
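For illustration, one plausible schema for a single interleaved SFT sample might look as follows; the field names and image paths are hypothetical, not the paper's actual data format (the real pipeline is described in Appendix 8.1).

```python
# Hypothetical schema for one interleaved SFT sample with visual world
# modeling; field names and image paths are illustrative only.
sample = {
    "question": "After unfolding, how many holes does the paper have?",
    "input_images": ["fold_sequence.png"],  # the initial observation o_0 = I
    "chain_of_thought": [
        {"type": "text",  "content": "Undo the last fold; holes mirror across the crease."},
        {"type": "image", "content": "unfold_step1.png"},  # generated o_1
        {"type": "text",  "content": "Undo the first fold in the same way."},
        {"type": "image", "content": "unfold_step2.png"},  # generated o_2
    ],
    "answer": "8",
}

# The verbal variant replaces each generated image with a textual state
# description; the implicit variant drops the observation steps entirely.
verbal_cot = [step if step["type"] == "text" else
              {"type": "text", "content": "[textual description of the state]"}
              for step in sample["chain_of_thought"]]
implicit_cot = [step for step in sample["chain_of_thought"]
                if step["type"] == "text"]
```

Keeping the reasoning steps fixed across the three variants is what enables the controlled comparison of world-modeling formats.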
<details>
<summary>x3.png Details</summary>

### Visual Description
## Task Suite: Reasoning with Visual World Modeling
### Overview
The image presents a suite of visual reasoning tasks, categorized into "World Simulation" and "World Reconstruction." Each task involves a visual scenario, a question, and a set of possible answers. The tasks range from paper folding and ball tracking to maze navigation, cube projection, and real-world spatial reasoning.
### Components/Axes
**Header:**
* Title: "VisWorld-Eval: Task Suite for Reasoning with Visual World Modeling"
**Categories:**
* World Simulation: Contains Paper Folding, Multi-Hop Manipulation, Ball Tracking, Maze, and Sokoban tasks.
* World Reconstruction: Contains Cube 3-View Projection and Real-World Spatial Reasoning tasks.
**Task Components (General Structure):**
* Visual Scenario: An image or diagram depicting the task.
* Question (Q): A textual question related to the visual scenario.
* Answer Options (A): A set of possible answers, labeled A, B, C, D.
### Detailed Analysis
**World Simulation Tasks:**
* **Paper Folding:**
* Visual: A sequence of images showing a paper being folded and cut.
* Question: "How many cutouts are there in the unfolded paper?"
* Answer: "A: 15"
* **Multi-Hop Manipulation:**
* Visual: An image showing colored cylinders and spheres.
* Question: "Starting with the initial arrangement, perform the following: 1. Place a red cylinder to the left of the black cylinder. 2. Swap the colors of the orange cylinder and the black cylinder. After these operations, what is to the left of the orange cylinder?"
* Answer: "A. black sphere, B. white sphere, C. yellow cylinder, D. red cylinder."
* **Ball Tracking:**
* Visual: A top-down view of a rectangular area with numbered holes along the top edge and a red ball inside. A green arrow indicates the initial direction.
* Question: "Given a red point-mass ball that moves at constant speed, reflects perfectly off solid walls, and follows the initial direction indicated by an green arrow, determine which numbered hole at the top it will enter first."
* Answer: "A: 1"
* **Maze:**
* Visual: A simple maze with a red dot at the start and a blue X at the end.
* Question: "Navigate the maze from the red dot to the blue X."
* Answer: "A: (4, 5), (5, 5), (5, 4) ..."
* **Sokoban:**
* Visual: A Sokoban puzzle with a grid, a box, and a goal position marked with an "X".
* Question: "Guide the player to push the box onto the goal position."
* Answer: "A: Down, Right, Down, ..."
**World Reconstruction Tasks:**
* **Cube 3-View Projection:**
* Visual: Three views (Front-right, Right, Top) of a cube structure, with some cubes colored dark violet.
* Question: "How many cubes in dark violet can possibly be seen from the back view?"
* Answer: "A. 0, B. 2, C. 3, D. 9."
* **Real-World Spatial Reasoning:**
* Visual: Two images of an interior space, including a black door.
* Question: "Which direction is the black door relative to me when I am taking Image 2?"
* Answer: "A. Behind, B. Left, C. Front, D. Right"
### Key Observations
* The tasks cover a range of visual reasoning skills, including spatial reasoning, object manipulation, and path planning.
* Each task presents a clear question and a set of possible answers.
* The visual scenarios vary in complexity, from simple diagrams to real-world images.
### Interpretation
The "VisWorld-Eval" task suite is designed to assess a system's ability to reason about visual information and solve problems in simulated and real-world environments. The tasks require a combination of visual perception, spatial reasoning, and logical inference. The suite could be used to evaluate the performance of AI models on tasks that require understanding and interacting with the visual world. The variety of tasks ensures a comprehensive evaluation of visual reasoning capabilities.
</details>
Figure 3: The VisWorld-Eval suite for assessing multimodal reasoning with visual world modeling. VisWorld-Eval comprises seven tasks spanning both synthetic and real-world domains, each designed to isolate and demand specific atomic world-model capabilities.
Table 1: Zero-shot evaluation of advanced VLMs on VisWorld-Eval. We report the average accuracy over five tasks (excluding Maze and Sokoban) and over all seven tasks.
| Models | Paper Folding | Multi-Hop Manip. | Ball Tracking | Cube 3-View | MMSI (Pos. Rel.) | Maze | Sokoban | Overall (5 tasks) | Overall (7 tasks) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | | |
| Gemini 3 Flash | 25.6 | 75.4 | 55.3 | 52.7 | 41.3 | 73.9 | 99.3 | 50.0 | 60.5 |
| Gemini 3 Pro | 27.0 | 74.5 | 44.7 | 53.3 | 49.6 | 33.5 | 90.2 | 49.8 | 53.2 |
| Seed 1.8 | 10.6 | 75.2 | 24.4 | 42.5 | 38.8 | 83.9 | 68.3 | 38.3 | 49.1 |
| GPT 5.1 | 6.4 | 73.9 | 34.8 | 44.5 | 44.8 | 0.6 | 62.8 | 40.8 | 38.2 |
| o3 | 13.5 | 68.1 | 24.7 | 37.7 | 44.4 | 0.0 | 36.0 | 37.6 | 32.0 |
| Open-Source Models | | | | | | | | | |
| Qwen3-VL-8B-Thinking [5] | 11.0 | 49.3 | 17.8 | 21.2 | 27.7 | 0.0 | 5.8 | 25.4 | 18.9 |
| BAGEL-7B-MoT [13] | 11.2 | 31.6 | 19.4 | 26.8 | 27.2 | 0.0 | 0.2 | 23.2 | 16.6 |
Evaluation of advanced VLMs. Table 1 reports the zero-shot performance of advanced VLMs on VisWorld-Eval. Overall, these models perform suboptimally, highlighting limitations of current multimodal AI systems. Among them, Gemini 3 Flash and Gemini 3 Pro remarkably outperform the other models; however, their performance remains far from satisfactory on challenging tasks like paper folding, ball tracking, cube 3-view projection, and real-world spatial reasoning.
4.2 Unified Multimodal Model Training and Evaluation
Evaluation protocol. To investigate the benefits of visual generation in multimodal reasoning, we evaluate post-trained UMMs, rather than the zero-shot performance of base models. To the best of our knowledge, no open-source model has been natively optimized for interleaved verbal-visual generation for reasoning. Even commercial closed-source models currently exhibit fundamental limitations in generating visual intermediate reasoning steps [38, 76]. Focusing on post-trained models, therefore, provides a more meaningful estimate of the upper bound for multimodal reasoning performance, while reducing confounding effects arising from insufficient pre-training due to limited interleaved data availability or quality.
Model training. We adopt BAGEL [13], a state-of-the-art open-source unified multimodal model, as our base model. Most experiments are conducted by supervised fine-tuning (SFT) on task-specific datasets, where verbal and visual generation in both chain-of-thought reasoning and final answers are optimized using cross-entropy and flow-matching loss. Specifically, the loss for reasoning with visual world modeling is as follows:
$$
\mathcal{L}_{\theta}(Q,I,R,A)=-\sum_{i=1}^{H+1}\sum_{j=1}^{|r_{i}|}\log p_{\theta}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)+\sum_{i=1}^{H}\mathbb{E}_{t,\epsilon}\left\|v_{\theta}(o_{i}^{t},t\mid\tilde{R}_{i})-(\epsilon-o_{i})\right\|_{2}^{2}, \tag{6}
$$
where $o_{i}^{t}=to_{i}+(1-t)\epsilon$ are noisy observations. We emphasize that in our formulation, $r_{i}$ refers to a verbal reasoning step, instead of a reward. We also perform reinforcement learning from verifiable rewards (RLVR) following SFT. During RL, only the verbal generation component is optimized by GRPO [18], while visual generation is regularized via the KL-divergence with respect to the SFT-trained reference model:
$$
\mathcal{J}_{\theta}(Q,I)=\mathbb{E}_{o,r\sim p_{\theta_{\text{old}}}}\Bigg[\sum_{i=1}^{H+1}\sum_{j=1}^{|r_{i}|}\min\Big(\frac{p_{\theta}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)}{p_{\theta_{\text{old}}}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)}A,\ \text{clip}\Big(\frac{p_{\theta}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)}{p_{\theta_{\text{old}}}\left(r_{i,j}\mid r_{i,<j},R_{i}\right)},1-\varepsilon,1+\varepsilon\Big)A\Big)-\sum_{i=1}^{H}\mathbb{E}_{t,\epsilon}\left\|v_{\theta}(o_{i}^{t},t\mid\tilde{R}_{i})-v_{\theta_{\text{ref}}}(o_{i}^{t},t\mid\tilde{R}_{i})\right\|_{2}^{2}\Bigg]. \tag{7}
$$
Full implementation details and hyperparameters are provided in Appendix 8.2.
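To make the clipped surrogate in Eq. (7) concrete, the sketch below implements its per-token verbal term with group-relative advantages computed from verifiable 0/1 rewards; the values and shapes are illustrative, and the flow-matching regularizer on visual generation is omitted.

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Clipped per-token surrogate, summed over tokens (maximized in RL)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.minimum(ratio * advantage, clipped * advantage).sum())

# Group-relative advantages from verifiable 0/1 rewards over 4 rollouts.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Token log-probs of one rollout under the new and old policies.
logp_old = np.log(np.array([0.40, 0.40]))
logp_new = np.log(np.array([0.48, 0.52]))
obj = grpo_surrogate(logp_new, logp_old, adv[0])
```

Clipping caps the per-token probability ratio (here the second token's ratio of 1.3 is clipped to 1.2), preventing large policy updates from any single rollout.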
5 Experimental Results
In this section, we demonstrate that visual world modeling boosts multimodal reasoning through two atomic capabilities: world simulation (Section 5.1) and world reconstruction (Section 5.2). We also identify tasks in which it is unhelpful (Section 5.3), where implicit or verbal world modeling is sufficient, and analyze these cases in detail. Interestingly, we reveal emergent internal representations in UMMs that support implicit world modeling on simple maze tasks.
5.1 Visual World Simulation Boosts Multimodal Reasoning
Main results. Figure 4 summarizes the performance of SFT-trained UMMs under different chain-of-thought formulations across all tasks. We observe that interleaved CoT with visual world modeling significantly outperforms its purely verbal counterparts on three world simulation tasks: paper folding, multi-hop manipulation, and ball tracking. These gains are attributed to both the richer expressiveness and the stronger prior knowledge afforded by the visual modality. In particular, in tasks such as multi-hop manipulation and ball tracking (with the latter being especially challenging), it is difficult for models to precisely ground object coordinates and perform arithmetic operations without external tools, which exacerbates ambiguity and hallucinations in purely verbal reasoning; verbal world modeling is therefore inappropriate and omitted for these tasks. Similarly, in paper folding, although models can track the states of holes, it remains difficult to fully depict the paper contour during unfolding. Moreover, as showcased in Figures 9 and 16, the spatial transformation involved in paper unfolding critically relies on an understanding of geometric symmetry, which can be more naturally learned from visual data such as images and videos.
Sample efficiency. To further demonstrate the stronger prior knowledge embedded in the visual modality, we conduct an experiment comparing the sample efficiency of verbal and visual world modeling on the paper folding task. As shown in Figure 6, reasoning with visual world modeling exhibits substantially higher sample efficiency, achieving performance comparable to verbal world modeling while using less than a quarter of the SFT data.
5.2 Visual World Reconstruction Boosts Multimodal Reasoning
Main results. As shown in Figure 4, multimodal reasoning tasks that rely on world reconstruction capabilities also benefit substantially from visual world modeling. In the cube 3-view task, predicting a novel view of stacked cubes represented as symbolic character matrices suffers from limited prior knowledge, whereas visually rotating objects is an experience richly covered by pre-training on large-scale Internet videos. For MMSI tasks, fully describing a novel view of a realistic scene using text alone is similarly ill-suited, as in the previous subsection, and we also observe hallucinations in purely verbal reasoning, which lacks grounding in visual generation. We do not observe consistent improvements on positional-relationship subtasks in MMSI-Bench other than camera-object and camera-region, which we attribute to current UMMs' limitations in both spatial understanding during verbal reasoning and generation quality in visual world modeling. Full quantitative results and qualitative examples are provided in Appendix 9. We expect these limitations to be mitigated in future work with stronger base models.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: Accuracy of World Modeling Techniques
### Overview
The image is a bar chart comparing the accuracy of three different world modeling techniques (Implicit, Verbal, and Visual) across various tasks. The chart displays accuracy percentages on the y-axis and task names on the x-axis.
### Components/Axes
* **Y-axis:** "Accuracy", with a numerical scale from 0 to 80 in increments of 10.
* **X-axis:** Categorical axis representing different tasks: Paper Folding, Multi-Hop Manip., Ball Tracking, Cube 3-View Proj., MMSI (Cam.-Obj.), MMSI (Cam.-Reg.), Maze, Sokoban.
* **Legend:** Located at the top of the chart, it identifies the color-coding for each world modeling technique:
* Pink: Implicit World Modeling
* Green: Verbal World Modeling
* Blue: Visual World Modeling
### Detailed Analysis
Here's a breakdown of the accuracy for each task and modeling technique:
* **Paper Folding:**
* Implicit World Modeling (Pink): 21.1
* Verbal World Modeling (Green): 27.4
* Visual World Modeling (Blue): 39.2
* **Multi-Hop Manip.:**
* Implicit World Modeling (Pink): 40.0
* Verbal World Modeling (Green): Not present
* Visual World Modeling (Blue): 66.6
* **Ball Tracking:**
* Implicit World Modeling (Pink): 40.7
* Verbal World Modeling (Green): Not present
* Visual World Modeling (Blue): 57.6
* **Cube 3-View Proj.:**
* Implicit World Modeling (Pink): 63.7
* Verbal World Modeling (Green): 60.2
* Visual World Modeling (Blue): 76.8
* **MMSI (Cam.-Obj.):**
* Implicit World Modeling (Pink): 46.5
* Verbal World Modeling (Green): Not present
* Visual World Modeling (Blue): 60.9
* **MMSI (Cam.-Reg.):**
* Implicit World Modeling (Pink): 37.3
* Verbal World Modeling (Green): Not present
* Visual World Modeling (Blue): 54.4
* **Maze:**
* Implicit World Modeling (Pink): 77.0
* Verbal World Modeling (Green): 73.1
* Visual World Modeling (Blue): 70.6
* **Sokoban:**
* Implicit World Modeling (Pink): 29.6
* Verbal World Modeling (Green): 36.8
* Visual World Modeling (Blue): 39.3
### Key Observations
* Visual World Modeling generally shows higher accuracy compared to Implicit and Verbal World Modeling across most tasks.
* Verbal World Modeling data is missing for several tasks (Multi-Hop Manip., Ball Tracking, MMSI (Cam.-Obj.), MMSI (Cam.-Reg.)).
* The Maze task shows relatively high accuracy for all three modeling techniques.
* The Sokoban task shows the lowest accuracy across all techniques.
### Interpretation
The bar chart provides a comparative analysis of the accuracy of different world modeling techniques in various tasks. The data suggests that Visual World Modeling is often more effective, possibly due to its ability to directly process visual information relevant to the tasks. The absence of Verbal World Modeling data for some tasks could indicate limitations or inapplicability of this technique in those specific scenarios. The Maze task's high accuracy across all techniques might suggest it's a relatively easier task, while the low accuracy in Sokoban could indicate its complexity or the need for more sophisticated modeling approaches.
</details>
Figure 4: Performance of SFT-trained UMMs with different world model-based chain-of-thought formulations across seven tasks from VisWorld-Eval. Refer to Table 1 for zero-shot performance of advanced VLMs.
Effects of task difficulties. Figure 6 analyzes performance on the cube 3-view projection task across varying sizes of input cube stacks. We observe a consistent advantage of reasoning with visual world modeling over verbal world modeling across all difficulty levels. Notably, for cube stacks of size six—out of the training distribution—visual world modeling still yields approximately a $10\%$ performance improvement.
World model fidelity. Modern AI models are known to exhibit hallucinations along their reasoning trajectories, even when producing correct final answers [38]. We therefore evaluate the fidelity of world modeling in the cube 3-view projection task by comparing ground-truth views with the intermediate views generated verbally or visually during reasoning. To focus on structural correctness, we compare only the shapes of the views and completely ignore color information. Even under this relaxed evaluation setting, Figure 6 shows that verbal world modeling exhibits dramatically low fidelity, with scores degrading to near zero. Notably, approximately half of the samples require predicting the opposite view of a given input view, a transformation that only involves horizontal mirroring. Visual world modeling, benefiting from stronger prior knowledge of such geometric transformations, captures these patterns effectively and achieves fidelity scores consistently exceeding $50\%$ .
5.3 Visual World Modeling is Unhelpful for Certain Tasks
Main results. (Un)surprisingly, we do not observe notable improvements on grid-world tasks, including maze and Sokoban. In the maze tasks, reasoning with implicit world modeling—without explicitly tracking coordinates—achieves the best performance with a slight advantage. These results are consistent with recent empirical findings [14]. We argue that this is also well explained by our world model perspective. In these tasks, state tracking is relatively simple, typically requiring the maintenance of only one or two two-dimensional coordinates, which can be adequately handled through verbal reasoning alone. Furthermore, in the maze task, we hypothesize that such world modeling can be implicitly encoded in the model’s hidden representations [37], which helps explain the competitive performance of verbal reasoning without explicit coordinate tracking.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Diagram: BAGEL Model Architecture
### Overview
The image presents a diagram illustrating the architecture of a model named "BAGEL." It shows the flow of information through different layers of the model, culminating in an output processed by an MLP (Multi-Layer Perceptron). The diagram also includes a text snippet indicating the context of the model's operation.
### Components/Axes
* **BAGEL:** The name of the model, prominently displayed. A logo of a donut is to the left of the name.
* **Layers:** The model consists of multiple layers, labeled as "Layer 1," "Layer 2," and "Layer N."
* **MLP:** A green rounded rectangle labeled "MLP" (Multi-Layer Perceptron), representing a processing unit.
* **Coordinate:** Text at the top-right corner, "Coordinate: (1,3)" in red.
* **Bar Graph:** A small bar graph above the MLP, with several bars of varying heights, enclosed in square brackets. The bars are pink.
* **Text Snippet:** A text snippet at the bottom: "... proceed until I hit a wall, at [masked] ...". The word "[masked]" is in red.
* **Arrows:** Arrows indicate the flow of information between layers and to the MLP.
### Detailed Analysis
* **Layer Structure:** The layers are arranged vertically, with "Layer 1" at the bottom and "Layer N" at the top. An arrow points upwards from Layer 1. An arrow points upwards from Layer 2 to Layer N, with three dots in the middle.
* **MLP Connection:** A blue line connects "Layer 1" to the "MLP." Another blue line connects the "MLP" to "Layer 2."
* **Output:** An arrow points upwards from the "MLP" to the bar graph.
* **Bar Graph Details:** The bar graph consists of approximately 7 bars. The heights of the bars decrease from left to right.
* **Coordinate:** The coordinate (1,3) is written in red.
### Key Observations
* The diagram highlights a multi-layered architecture with connections between layers and an MLP.
* The bar graph likely represents the output distribution or some other relevant metric.
* The text snippet provides context, suggesting a sequence processing task where the model operates until a specific condition is met.
### Interpretation
The diagram illustrates the architecture and flow of information within the BAGEL model. The model processes input through multiple layers, with the output of Layer 1 and Layer 2 being fed into an MLP. The MLP then generates an output, represented by the bar graph, which could be a probability distribution or some other relevant metric. The text snippet suggests that the model is used in a sequence processing task, where it operates until a specific condition (hitting a wall) is met, and a certain element is masked. The coordinate (1,3) likely refers to a specific location or index within the model's processing.
</details>
Figure 5: Probing implicit world models, by training a set of probes, i.e., MLPs which infer the masked point coordinates during reasoning from internal representations.
Demystifying implicit world modeling. To validate this hypothesis, we probe the internal representations of models, as illustrated in Figure 5. We consider the same architecture, BAGEL, with three different sets of weights: a randomly initialized model, the pre-trained model, and the model supervised fine-tuned on CoT data in the implicit world modeling format, in which special tokens mask all explicit point coordinates during the reasoning process. For each model, we extract the hidden representations of these special tokens at each layer. We then train multilayer perceptrons (MLPs) on these representations to predict the underlying true point coordinates.
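The probing procedure can be sketched as follows: hidden states at the masked-coordinate tokens are classified into the true maze cells by a small probe trained with gradient descent. Here, synthetic class-clustered features stand in for BAGEL's actual hidden representations, and a single-layer softmax probe (with bias) stands in for the MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_probe(H, labels, n_classes, lr=0.5, steps=300):
    """Softmax probe trained by gradient descent; returns predictions."""
    H = np.hstack([H, np.ones((len(H), 1))])  # append a bias feature
    W = np.zeros((H.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = H @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * H.T @ (p - onehot) / len(H)
    return (H @ W).argmax(axis=1)

# Synthetic "hidden states": one cluster per cell of a 5x5 maze (25 classes),
# mimicking a layer whose representations encode the current point coordinate.
n, d, n_cells = 500, 16, 25
labels = rng.integers(0, n_cells, n)
cell_means = rng.standard_normal((n_cells, d))
H = cell_means[labels] + 0.1 * rng.standard_normal((n, d))

probe_acc = (train_probe(H, labels, n_cells) == labels).mean()
# Features uncorrelated with the maze state would yield near-chance
# accuracy (~1/25), as with the randomly initialized model.
```

High probe accuracy on a held-out set is then taken as evidence that the layer internally tracks the maze state, which is the logic behind Figure 5.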
Figure 6 reports the prediction accuracy on a validation set. As expected, the randomly initialized model completely fails to internally track point states, achieving only random-guess accuracy on $5\times 5$ mazes. In contrast, the pre-trained model [13] already exhibits emergent representations that are predictive of maze states. Notably, we observe a non-monotonic trend across layers: prediction accuracy increases from lower layers (which capture low-level features) to middle layers, and then decreases toward the final layers, which are likely specialized for next-token prediction. Finally, supervised fine-tuning on domain-specific data, despite providing no explicit coordinate supervision, substantially enhances this internal predictability, achieving near-perfect accuracy. These in-depth results help explain our main experimental findings: as the model already possesses the capability for implicit world modeling, it does not necessarily benefit from explicit verbal world modeling, let alone more complex forms of visual world modeling.
<details>
<summary>x6.png Details</summary>

### Visual Description
Line chart "Paper Folding": accuracy (y-axis, 20-90) vs. number of SFT training samples (x-axis, 0-2500). Legend: Visual WM (blue) vs. Verbal WM (pink); Normal difficulty (circle markers) vs. Hard (triangle markers). Visual WM improves steadily with more samples: on Normal, ~52% at 500, ~65% at 1000, and ~72% at 2300 samples; on Hard, ~28%, ~35%, and ~39% at the same points. Verbal WM is shown only at 2300 samples: ~54% (Normal) and ~27% (Hard). Visual WM thus matches the full-data Verbal WM results with roughly a quarter of the training samples.
</details>
(a) Sample efficiency.
<details>
<summary>x7.png Details</summary>

### Visual Description
Line chart "Cube 3-View Projection": metric value (y-axis, 0-100) vs. cube stack size (x-axis, 3-6). Legend: Visual WM (blue) vs. Verbal WM (pink); answer accuracy (solid lines) vs. WM fidelity (dashed lines). All curves decrease as the stack size grows. Accuracy: Visual WM drops from ~90 (size 3) to ~72 (size 6); Verbal WM from ~74 to ~60. Fidelity: Visual WM drops from ~94 to ~55, while Verbal WM collapses from ~52 to ~0. Visual WM outperforms Verbal WM at every stack size, and the gap in world-model fidelity widens dramatically as the task becomes more complex.
</details>
(b) World model fidelity.
<details>
<summary>x8.png Details</summary>

### Visual Description
Line chart "Maze": state prediction accuracy (y-axis, 0.2-1.0) vs. layer index (x-axis, 0-25) for three sets of weights: Random Init. (red), BAGEL PT (green), and BAGEL SFT (blue). Random Init. stays flat at ~0.22 (chance level) across all layers. BAGEL PT rises from ~0.25 to a peak of ~0.55 around layer 15, then declines to ~0.45. BAGEL SFT rises sharply to a peak of ~0.98 around layer 20 before dipping to ~0.93 at the final layer. Both trained models peak in middle-to-late layers and decline toward the output layers.
</details>
(c) Implicit world modeling.
Figure 6: Model analysis: (a) Performance of UMMs on the paper-folding task with varying numbers of SFT samples. Reasoning with visual world modeling achieves a $4\times$ improvement in sample efficiency. WM = world modeling. (b) Performance of UMMs on the cube 3-view projection task with increasing sizes of input cube stacks, evaluated using both answer accuracy and world-model fidelity. Visual world modeling demonstrates dramatically better fidelity of view synthesis. (c) Prediction accuracy of masked point coordinates in CoTs using representations extracted from different layers of different UMMs, revealing emergent internal world representations. PT = Pre-trained.
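The cube 3-view projection task in (b) asks for the front, side, and top views of a cube stack. A minimal sketch of the underlying geometry, under the illustrative assumption that a stack is a 3D boolean occupancy grid (the axis conventions are ours, not the paper's):

```python
# Orthographic 3-view projection of a voxelized cube stack (assumed encoding:
# boolean occupancy grid; axis-to-view mapping is an illustrative convention).
import numpy as np

def three_views(voxels: np.ndarray):
    """Silhouettes of a (depth, width, height) voxel grid along each axis."""
    front = voxels.any(axis=0)  # collapse depth
    side = voxels.any(axis=1)   # collapse width
    top = voxels.any(axis=2)    # collapse height
    return front, side, top

# A 2x2 bottom layer of cubes with a single cube stacked on one corner.
stack = np.zeros((2, 2, 2), dtype=bool)
stack[:, :, 0] = True   # full bottom layer
stack[0, 0, 1] = True   # one cube on top
front, side, top = three_views(stack)
```

Evaluating world-model fidelity then amounts to comparing such ground-truth views against the views a model generates during its CoT.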
<details>
<summary>x9.png Details</summary>

### Visual Description
Bar chart of accuracy (y-axis, 0-80) on three tasks (x-axis: Paper Folding, Multi-Hop Manip., Cube 3-View Proj.) for five configurations: Implicit World Modeling (BAGEL, pink; Qwen-VL, red-striped), Verbal World Modeling (BAGEL, green; Qwen-VL, green-striped), and Visual World Modeling (BAGEL, blue). Paper Folding: 21.1, 21.5, 27.4, 28.8, and 39.2, respectively. Multi-Hop Manip.: 40.0 and 37.5 for implicit, no verbal bars, and 66.6 for visual. Cube 3-View Proj.: 63.7, 60.0, 60.2, 58.8, and 76.8. Visual world modeling with BAGEL attains the highest accuracy on every task, while BAGEL and Qwen-VL perform comparably under implicit and verbal world modeling.
</details>
Figure 7: Performance of SFT-trained VLMs compared with UMMs across three tasks.
<details>
<summary>x10.png Details</summary>

### Visual Description
Three line charts of accuracy vs. RLVR training steps (x-axis, 0-200) for Paper Folding (y: 20-45), Multi-Hop Manip. (y: 40-75), and Cube 3-View Proj. (y: 60-85). Legend: Implicit World Modeling (BAGEL, pink; Qwen-VL, red dashed), Verbal World Modeling (BAGEL, green; Qwen-VL, green dashed), Visual World Modeling (BAGEL, blue). All curves improve early and then plateau. Paper Folding: Visual WM climbs from ~39% to ~46%; Verbal WM variants fluctuate around 26-35%; Implicit WM variants reach ~26-27%. Multi-Hop Manip. (no verbal curves): Visual WM rises from ~67% to ~72-74%; Implicit WM variants stay around 38-45%. Cube 3-View Proj.: Visual WM rises from ~77% to ~84%; Verbal and Implicit WM variants converge to ~66-70%. Visual world modeling (BAGEL) remains the strongest throughout, and RLVR does not close the gap between formulations.
</details>
Figure 8: Performance of RLVR-trained VLMs and UMMs with different world-model-based CoT formulations across three tasks.
5.4 Comparison with VLMs: Do UMMs Compromise Verbal Reasoning Capabilities?
One may argue that UMMs are typically trained with a stronger emphasis on visual generation [13], which could compromise verbal reasoning capabilities and bias comparisons in favor of visual world modeling. To address this concern, we compare against a pure VLM baseline, Qwen2.5-VL-7B-Instruct [6], which shares the same Qwen2.5 LLM base model with BAGEL. We fine-tune Qwen2.5-VL on the same verbal CoT datasets used in the previous subsections and evaluate it on three representative tasks: paper folding, cube 3-view projection, and multi-hop manipulation.
Results. As shown in Figure 7, the SFT performance of Qwen2.5-VL with implicit and verbal world modeling is comparable to that of BAGEL, without exhibiting significant advantages. It still lags behind BAGEL in settings that leverage visual world modeling. These results indicate that our findings arise from the inherent advantages of visual world modeling rather than from compromised verbal reasoning capabilities in UMMs.
5.5 RL Enhances Various CoTs, Yet Does Not Close the Gap
Reinforcement learning from verifiable rewards (RLVR) has been a major driver of recent progress in reasoning models equipped with verbal chains of thought, achieving strong performance across domains such as mathematics [18]. While Figure 4 shows a clear advantage of reasoning with visual world modeling after SFT, RLVR may further incentivize emergent reasoning behaviors that improve verbal CoTs. We thus conduct comparative RLVR experiments across different world-model-based CoT formulations on three representative tasks.
Results. Figure 8 presents the learning curves under RLVR for different models. We observe consistent improvements during RLVR across CoT formulations; however, the performance gap persists. We also find that VLMs and UMMs generally perform similarly with verbal CoTs. These results suggest that the superiority of visual world modeling arises from its inherent advantages rather than from insufficient post-training. Notably, RL enhances reasoning with visual world modeling even though only the verbal generation components of interleaved CoTs are directly optimized. We envision that the full potential of interleaved CoTs will be further unlocked by RL algorithms tailored for verbal-visual interleaved generation.
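The "verifiable" part of RLVR can be made concrete with a minimal reward function: the rollout's final answer is extracted and checked against ground truth. The answer-tag format and extraction regex below are our assumptions for illustration, not the paper's specification:

```python
# Minimal sketch of a verifiable reward for RLVR: exact-match check on the
# final answer extracted from a CoT rollout. The <answer>...</answer> tag
# convention is a hypothetical format assumed for this sketch.
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 iff the tagged final answer exactly matches the reference."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # malformed output earns no reward
    answer = match.group(1).strip().lower()
    return float(answer == ground_truth.strip().lower())
```

A reward like this is computed per rollout and scores only the verbal answer; in the setup above, the generated images inside the interleaved CoT receive no direct optimization signal.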
6 Discussions
By bridging concepts from human cognition and artificial intelligence, we revisit the mechanisms underlying human reasoning and the central role that world models play. This provides a new perspective on the use of visual generation for reasoning in multimodal AI, highlighting its potential to serve as visual world models that complement the verbal world models embedded in LLMs, thereby enabling more human-like reasoning in scenarios grounded in the physical world. For the first time, this perspective is studied in a principled manner, through theoretical formulations that bridge world models and reasoning, as well as through empirical evaluations whose results are well explained by and strongly support the proposed insights. We hope this work helps address longstanding questions about the synergistic effects between generation and reasoning, and more broadly contributes to the development of more human-like AI that thinks and acts with multimodal world models.
<details>
<summary>x11.png Details</summary>

### Visual Description
Two showcased reasoning traces. Left, real-world spatial reasoning: given two continuously taken first-person photos that overlap at the frame, the model must determine the direction of a potted plant relative to the camera at the moment of the last picture (options: A front right, B back left, C back right, D front left). The model output mentally aligns and rotates the two views (the first shows the potted plant by a window; the second shows a wall with framed pictures, a cabinet, and a lamp) before concluding that the correct option is A: front right. Right, paper folding: given an image of a folded paper with cutouts, the model mentally reverses the diagonal and horizontal folds; the key principle is that each unfold creates a mirror image of the holes on the moving flap, using the fold line as the axis of symmetry. The model concludes that the unfolded paper contains 2 cutouts. Both traces interleave step-by-step verbal explanation with generated intermediate images.
</details>
Figure 9: Showcases of interleaved verbal-visual chain-of-thought reasoning, generated by post-trained UMMs, where visual generation serves as world models. <image> denotes a placeholder indicating the position of a generated image.
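The unfolding principle used in the paper-folding showcase (each unfold mirrors the holes on the moving flap across the fold line) can be sketched directly. The representation below (holes as points on a unit square, axis-aligned fold lines) is an illustrative assumption:

```python
# Hedged sketch of the paper-unfolding principle: each unfold reflects every
# hole across the fold line and keeps both copies. Axis-aligned folds and a
# point-set representation of holes are simplifying assumptions.
def unfold(holes, axis, position):
    """Reflect holes across a fold line, keeping originals and mirror images.

    axis: 'x' mirrors across the vertical line x=position,
          'y' mirrors across the horizontal line y=position.
    """
    mirrored = set()
    for (x, y) in holes:
        if axis == 'x':
            mirrored.add((2 * position - x, y))
        else:
            mirrored.add((x, 2 * position - y))
    return set(holes) | mirrored

# One hole punched after two folds: undoing both folds yields up to 4 cutouts.
holes = {(0.25, 0.25)}
holes = unfold(holes, 'x', 0.5)  # undo the vertical fold
holes = unfold(holes, 'y', 0.5)  # undo the horizontal fold
```

Holes lying exactly on a fold line map to themselves, which is why the count can be fewer than a doubling per unfold.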
Limitations and future work. This work primarily focuses on spatial and physical reasoning tasks, where multimodal AI systems exhibit a pronounced performance gap relative to humans. Many other tasks proposed in the related literature can also be interpreted through our world model perspective. For example, a prominent class of benchmarks involves visual jigsaw tasks [52, 17, 38, 77], in which input image patches are cropped, masked, or shuffled. Such tasks essentially probe a form of world reconstruction capability, as corrupted images and videos are commonly treated as specific views within the world model literature [3, 7, 4]. Another active area of interest lies in STEM reasoning. Recent work [51] leverages visual generation for mathematical diagram editing, such as constructing auxiliary geometric lines. This closely resembles how humans use visual sketchpads to support math understanding and reasoning, constructing visual world models of a symbolic system. However, as symbolic representations in mathematics are largely complete, and mathematical reasoning has been extensively optimized in modern LLMs, it remains unclear whether multimodal interleaved CoT can fundamentally break through the performance limit, warranting further investigation.
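A visual-jigsaw instance of the kind mentioned above can be constructed by cropping an image into patches and shuffling them, with the permutation serving as the reconstruction target. Grid and patch sizes below are arbitrary assumptions for illustration:

```python
# Illustrative construction of a visual-jigsaw instance: split an image into
# a grid of patches, shuffle them, and keep the permutation as the target.
import numpy as np

def make_jigsaw(image: np.ndarray, grid: int, rng: np.random.Generator):
    """Split an (H, W) image into grid x grid patches and shuffle them."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(grid) for j in range(grid)]
    perm = rng.permutation(grid * grid)
    return [patches[k] for k in perm], perm

def reassemble(shuffled, perm, grid):
    """Invert the permutation to recover the original image."""
    order = np.argsort(perm)  # position of each original patch in `shuffled`
    rows = [np.concatenate([shuffled[k] for k in order[r * grid:(r + 1) * grid]],
                           axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)
shuffled, perm = make_jigsaw(img, 2, rng)
restored = reassemble(shuffled, perm, 2)  # equals img
```

Solving such a task amounts to predicting `perm` from the shuffled patches, i.e., reconstructing a corrupted view of the world.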
We do not apply reinforcement learning to the visual generation components of verbal–visual interleaved CoTs [39]. Prior work has shown that world models themselves can be improved through RLVR [65]. As discussed in Section 5.5, developing RL algorithms specifically tailored to interleaved verbal–visual generation may further improve world-model fidelity during reasoning and incentivize the emergence of stronger and intriguing world-modeling capabilities.
The analysis of emergent representations for implicit world modeling in Figure 6 is intriguing but preliminary. We hope this result will rekindle interest in probing approaches [37] for interpreting the latent representations learned by different models. In particular, we are interested in comparing the internal representations of VLMs and UMMs, as the latter may capture complementary aspects of world knowledge through training for multimodal generation.
Artificial intelligence is increasingly being embodied in the physical world [23]. Our work, particularly the visual superiority hypothesis, suggests that learning visual world models is therefore essential for embodied intelligence. Visual world modeling enables embodied agents to better understand their environments, from imagining occluded regions to interpreting user intentions from an egocentric perspective, thereby supporting more reliable and natural everyday services. It also facilitates planning and decision-making by allowing agents to mentally simulate the precise outcomes of potential actions, leading to more effective interaction with the world. Rather than relying on loosely coupled modules [15] or performing only single-step reasoning [73], we envision a future direction in which flexible multimodal world modeling and reasoning, empowered by interleaved verbal-visual generation within a unified model, form core components of physical and embodied AI.
Acknowledgements
We would like to thank Yanwei Li, Rui Yang, Ziyu Zhu, and Feng Cheng for their assistance in constructing some preliminary training and test data. We also appreciate Xinchen Zhang, Jianhua Zhu, Yifan Du, Yuezhou Ma, Xingzhuo Guo, Ningya Feng, Shangchen Miao, and many colleagues for their valuable discussion.
References
- Agarwal et al. [2025] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.
- Alonso et al. [2024] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In NeurIPS, 2024.
- Assran et al. [2023] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023.
- Assran et al. [2025] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.
- Bai et al. [2025a] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025a.
- Bai et al. [2025b] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025b.
- Bardes et al. [2024] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.
- Cai et al. [2025] Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, et al. Has gpt-5 achieved spatial intelligence? an empirical study. arXiv preprint arXiv:2508.13142, 2025.
- Chen et al. [2025] Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722, 2025.
- Copet et al. [2025] Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025.
- Craik [1967] Kenneth James Williams Craik. The nature of explanation, volume 445. CUP Archive, 1967.
- DeepMind [2025] Google DeepMind. Genie 3: A new frontier for world models. 2025.
- Deng et al. [2025] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- Du et al. [2025] Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao, and Youbin Wu. Revisiting the necessity of lengthy chain-of-thought in vision-centric reasoning generalization. arXiv preprint arXiv:2511.22586, 2025.
- Feng et al. [2025] Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, and Jianlan Luo. Reflective planning: Vision-language models for multi-stage long-horizon robotic manipulation. arXiv preprint arXiv:2502.16707, 2025.
- Forrester [1971] Jay W Forrester. Counterintuitive behavior of social systems. Theory and decision, 2(2):109–140, 1971.
- Gu et al. [2025] Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492, 2025.
- Guo et al. [2025a] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 2025a.
- Guo et al. [2025b] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025b.
- Guo et al. [2025c] Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, et al. Rbench-v: A primary assessment for visual reasoning models with multi-modal outputs. arXiv preprint arXiv:2505.16770, 2025c.
- Guo et al. [2025d] Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, and Pheng-Ann Heng. Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671, 2025d.
- Guo et al. [2025e] Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, et al. Can we generate images with cot? let’s verify and reinforce image generation step by step. In CVPR, 2025e.
- Gupta et al. [2021] Agrim Gupta, Silvio Savarese, Surya Ganguli, and Li Fei-Fei. Embodied intelligence via learning and evolution. Nature Communications, 2021.
- Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018.
- Hafner et al. [2025] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 2025.
- Hansen et al. [2022] Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. In ICML, 2022.
- Huh et al. [2024] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In ICML, 2024.
- Hurst et al. [2024] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Ivanitskiy et al. [2023] Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, and Samy Wu Fung. A configurable library for generating and manipulating maze datasets. arXiv preprint arXiv:2309.10498, 2023.
- Johnson et al. [2017] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
- Johnson-Laird [1983] PN Johnson-Laird. Mental models: Towards a cognitive science of language, inference, and consciousness. Harvard University Press, 1983.
- Lakoff and Núñez [2000] George Lakoff and Rafael Núñez. Where mathematics comes from, volume 6. New York: Basic Books, 2000.
- Landy and Goldstone [2007] David Landy and Robert L Goldstone. How abstract is symbolic thought? Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(4):720, 2007.
- LeCun [2022] Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022.
- Li et al. [2025a] Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. Zebra-cot: A dataset for interleaved vision language reasoning. arXiv preprint arXiv:2507.16746, 2025a.
- Li et al. [2025b] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. In ICML, 2025b.
- Li et al. [2023] Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In ICLR, 2023.
- Liang et al. [2025] Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, and Furong Huang. Rover: Benchmarking reciprocal cross-modal reasoning for omnimodal generation. arXiv preprint arXiv:2511.01163, 2025.
- Liu et al. [2025] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. In NeurIPS, 2025.
- Loomis et al. [1991] JM Loomis, RL Klatzky, RG Golledge, and JG Cicinelli. Mental models, psychology of. Psychology, 14:56–89, 1991.
- Ma et al. [2025] Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In CVPR, 2025.
- Mon-Williams et al. [2025] Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, and Christopher G Lucas. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence, 2025.
- Monzel and Reuter [2024] Merlin Monzel and Martin Reuter. Where’s wanda? the influence of visual imagery vividness on visual search speed measured by means of hidden object pictures. Attention, Perception, & Psychophysics, 86(1):22–27, 2024.
- Norman [2014] Donald A Norman. Some observations on mental models. In Mental models, pages 7–14. Psychology Press, 2014.
- Paivio [1990] Allan Paivio. Mental representations: A dual coding approach. Oxford university press, 1990.
- Pan et al. [2025] Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 2020.
- Schulze Buschoff et al. [2025] Luca M Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. Visual cognition in multimodal large language models. Nature Machine Intelligence, 2025.
- Seedream et al. [2025] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.
- Shi et al. [2025a] Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, et al. Mathcanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning. arXiv preprint arXiv:2510.14958, 2025a.
- Shi et al. [2025b] Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, et al. Realunify: Do unified models truly benefit from unification? a comprehensive benchmark. arXiv preprint arXiv:2509.24897, 2025b.
- Swanson et al. [2025] Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies. Nature, 2025.
- Team [2024] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- Tong et al. [2025a] Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, et al. Game-rl: Synthesizing multimodal verifiable game data to boost vlms’ general reasoning. arXiv preprint arXiv:2505.13886, 2025a.
- Tong et al. [2025b] Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. In ICCV, 2025b.
- Trinh et al. [2024] Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024.
- Tu et al. [2025] Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. Towards conversational diagnostic artificial intelligence. Nature, 2025.
- Wang et al. [2025a] Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, et al. Vagen: Reinforcing world model reasoning for multi-turn vlm agents. In NeurIPS, 2025a.
- Wang et al. [2024a] Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, and Peter Jansen. Can language models serve as text-based world simulators? In ACL, 2024a.
- Wang et al. [2025b] Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, and Jun Wang. Spatialviz-bench: Automatically generated spatial visualization reasoning tasks for mllms. arXiv preprint arXiv:2507.07610, 2025b.
- Wang et al. [2024b] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024b.
- Wu et al. [2025a] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In CVPR, 2025a.
- Wu et al. [2024] Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models. In NeurIPS, 2024.
- Wu et al. [2025b] Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long. Rlvr-world: Training world models with reinforcement learning. In NeurIPS, 2025b.
- Xie et al. [2025] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In ICLR, 2025.
- Xu et al. [2025] Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vulić. Visual planning: Let’s think only with images. arXiv preprint arXiv:2505.11409, 2025.
- Yang et al. [2025a] Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning. arXiv preprint arXiv:2511.05491, 2025a.
- Yang et al. [2025b] Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025b.
- Yao et al. [2025] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Chi Chen, Haoyu Li, Weilin Zhao, et al. Efficient gpt-4v level multimodal large language model for deployment on edge devices. Nature Communications, 2025.
- Yin et al. [2025] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV’25, 2025.
- Zhang et al. [2025] Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025.
- Zhao et al. [2025a] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In CVPR, 2025a.
- Zhao et al. [2025b] Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. In NeurIPS, 2025b.
- Zhou et al. [2025a] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. In ICLR, 2025a.
- Zhou et al. [2025b] Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, et al. When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought. arXiv preprint arXiv:2511.02779, 2025b.
- Zou et al. [2025] Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-mmmu: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759, 2025.
7 Theoretical Analysis
7.1 Informativeness
In this section, we provide the rigorous version of our world model-based chain-of-thought formulations, and proofs for Theorem 1 and Theorem 2.
Formal problem setup and assumptions. Given a question $Q$ and input images $I$ , multimodal reasoning generates a chain-of-thought process $R$ , followed by a final answer $A$ . We explicitly formulate the reasoning process as an interleaving of logic reasoning steps and observations of the underlying MOMDP defined in Section 3.1: $R=(r_{1},o_{1}),(r_{2},o_{2}),...,(r_{H},o_{H})$ where $H$ denotes the (fixed) CoT length. For notation convenience, we denote the input image(s) as the initial observation $o_{0}$ .
We assume that each MOMDP observation function admits a two-stage decomposition: $e_{\phi}(s)=g_{\phi_{m}}\left(f_{\phi_{s}}(s)\right),\Phi=\Phi_{s}×\Phi_{m}$ , where the inner modality-agnostic mapping $f_{\phi_{s}}$ (parameterized by $\phi_{s}∈\Phi_{s}$ ) extracts a slice of the underlying state $s$ , retaining only partial state information, and the outer modality-specific mapping $g_{\phi_{m}}$ (parameterized by $\phi_{m}∈\Phi_{m}$ ) renders the extracted slice into a particular observation modality.
Under this decomposition, we assume that reasoning across different modalities of observations shares a common underlying oracle reasoning process:
$$
p(Q,\bar{s}_{0},r_{1},\bar{s}_{1},\dots,r_{H},\bar{s}_{H},A)=p(Q)\left[\prod_{i=1}^{H}p(r_{i}|\bar{s}_{0:{i-1}},r_{1:i-1},Q)p(\bar{s}_{i}|\bar{s}_{0:i-1},r_{1:i},Q)\right]p(A|r_{1:H},\bar{s}_{0:H},Q),
$$
where $\bar{s}_{i}=(s_{i},{\phi_{s}}_{i})∈\mathcal{S}×\Phi_{s}$ denotes a modality-agnostic sliced state. Each logic step $r_{i}$ is assumed to reason on sufficient sliced state information, $p(r_{i}\mid\bar{s}_{0:i-1},r_{1:i-1},Q)=p\!\left(r_{i}\mid f_{{\phi_{s}}_{0}}(s_{0}),\dots,f_{{\phi_{s}}_{i-1}}(s_{i-1}),r_{1:i-1},Q\right)$, and to produce actionable outcomes that either (i) transition a previous world state $s_{j<i}$ via an implicit action $a_{i}$, i.e., $\bar{s}_{i}=(s_{i},{\phi_{s}}_{j})$ with $s_{i}\sim p(\cdot\mid s_{j},a_{i})$, or (ii) query the same underlying world state with a new slice ${\phi_{s}}_{i}$, yielding $\bar{s}_{i}=(s_{j},{\phi_{s}}_{i})$. The oracle reasoning process is then rendered into a specific modality via $o_{i}=g_{\phi_{m}}\!\left(f_{{\phi_{s}}_{i}}(s_{i})\right)$. Unless otherwise specified, we abuse notation and use $s_{i}$ to denote $\bar{s}_{i}=(s_{i},{\phi_{s}}_{i})$ in the remainder of our analysis.
Given the above oracle CoT generation process, we learn a model $p_{\theta}$ whose joint distribution over CoTs and answers factorizes into a reasoning component and a world-modeling component:
$$
\displaystyle p_{\theta}(R,A|Q,I)=p_{\theta}(r_{1},o_{1},r_{2},o_{2},\dots,r_{H},o_{H},r_{H+1}|r_{0},o_{0})=\prod_{i=1}^{H+1}p_{\theta}(r_{i}|R_{i})\prod_{i=1}^{H}p_{\theta}(o_{i}|\tilde{R}_{i}), \tag{8}
$$
where we denote the question as $r_{0}$ , the initial observation (input image(s)) as $o_{0}$ , and the final answer as $r_{H+1}$ . The CoT prefixes are defined as $R_{i}=(r_{0},o_{0},r_{1},o_{1},...,r_{i-1},o_{i-1}),\tilde{R}_{i}=(r_{0},o_{0},r_{1},o_{1},...,r_{i-1},o_{i-1},r_{i}).$
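The alternating structure of this factorization can be sketched as a simple generation loop. This is our own schematic illustration, not the paper's implementation; `sample_r` and `sample_o` are placeholders standing in for the UMM's text and image generation heads.

```python
# Schematic sketch of the interleaved CoT factorization in Eq. (8):
# reasoning steps r_i and observations o_i are generated alternately,
# each conditioned on the full prefix generated so far.
def interleaved_cot(r0, o0, sample_r, sample_o, H):
    prefix = [r0, o0]                     # question r_0 and input image o_0
    for _ in range(H):
        prefix.append(sample_r(prefix))   # p_theta(r_i | R_i)
        prefix.append(sample_o(prefix))   # p_theta(o_i | R~_i)
    prefix.append(sample_r(prefix))       # final answer r_{H+1}
    return prefix

# With H interleaved steps, the trace holds 2 + 2H + 1 elements:
assert len(interleaved_cot("Q", "I", lambda p: "r", lambda p: "o", 3)) == 9
```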
Proofs. We provide proofs of Theorem 1 and Theorem 2 below.
**Theorem 3 (Restatement of Theorem 1)**
*For any observation modality $m$ , the following inequality holds:
$$
\begin{aligned}
\operatorname{KL}(p(A\mid Q,I)\,\|\,p_{\theta}(A\mid Q,I))&\leq\operatorname{KL}(p(R,A\mid Q,I)\,\|\,p_{\theta}(R,A\mid Q,I))\\
&=\sum_{i=1}^{H+1}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(r_{i}\mid R_{i})\,\|\,p_{\theta}(r_{i}\mid R_{i}))\right]}_{\textnormal{reasoning errors}}+\sum_{i=1}^{H}\underbrace{\mathbb{E}_{p}\left[\operatorname{KL}(p(o_{i}\mid\tilde{R}_{i})\,\|\,p_{\theta}(o_{i}\mid\tilde{R}_{i}))\right]}_{\textnormal{world modeling errors}}.
\end{aligned} \tag{9}
$$*
*Proof.*
The first inequality follows from the data processing inequality: marginalizing out $R$ cannot increase the KL divergence. For the equality, we apply the chain rule for KL divergence together with the CoT factorization in Eq. (8). In particular, substituting the factorizations of $p(R,A\mid Q,I)$ and $p_{\theta}(R,A\mid Q,I)$ into $\operatorname{KL}(p(R,A\mid Q,I)\,\|\,p_{\theta}(R,A\mid Q,I))$ leads to the stated decomposition. ∎
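The data-processing step can be verified numerically on small discrete distributions. The following is our own illustrative check (not part of the paper): marginalizing out the CoT variable $R$ never increases the KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((4, 3)); p /= p.sum()   # "true" joint over (R, A)
q = rng.random((4, 3)); q /= q.sum()   # "model" joint over (R, A)

def kl(a, b):
    """KL divergence between two discrete distributions (flattened)."""
    a, b = a.ravel(), b.ravel()
    return float(np.sum(a * np.log(a / b)))

joint_kl = kl(p, q)                    # KL(p(R, A) || q(R, A))
marginal_kl = kl(p.sum(0), q.sum(0))   # KL(p(A) || q(A)) after marginalizing R

# Marginalization cannot increase KL (data processing inequality):
assert marginal_kl <= joint_kl + 1e-12
```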
**Theorem 4 (Restatement of Theorem 2)**
*For any observation modality $m$ , the reduction in reasoning uncertainty achieved by explicit world modeling satisfies:
1. Reasoning uncertainty does not increase: $\mathbb{H}(r_{i}\mid o_{0},r_{0:i-1})-\mathbb{H}(r_{i}\mid R_{i})=\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})≥ 0.$
2. Uncertainty reduction is upper-bounded by both (i) the information that observations provide about the underlying states and (ii) the information that the reasoning step requires about those states:
$$
\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})\leq\min\left(\mathbb{I}(o_{1:i-1};s_{1:i-1}),\mathbb{I}(r_{i};s_{0:i-1},r_{0:i-1})\right). \tag{10}
$$*
*Proof.*
The first property follows from the definition and the non-negativity of mutual information. For the second property, denote the conditioning context as $C=(o_{0},r_{0:i-1})$. Using the interaction-information identity $\mathbb{I}(X;Y;Z)=\mathbb{I}(X;Y)-\mathbb{I}(X;Y\mid Z)$, we obtain
$$
\begin{aligned}
\mathbb{I}(o_{1:i-1};r_{i}\mid C)&=\mathbb{I}(o_{1:i-1};r_{i}\mid C)-\mathbb{I}(o_{1:i-1};r_{i}\mid s_{1:i-1},C)=\mathbb{I}(s_{1:i-1};o_{1:i-1};r_{i}\mid C)\\
&=\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid C)-\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid r_{i},C)\leq\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid C),
\end{aligned} \tag{11}
$$
where $\mathbb{I}(o_{1:i-1};r_{i}|s_{1:i-1},C)=0$ follows from the conditional independence $r_{i}\perp o_{1:i-1}\mid s_{1:i-1}$. Further, since $o$ is a deterministic function of $s$, we have:
$$
\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid C)=\mathbb{H}(o_{1:i-1}\mid C)-\mathbb{H}(o_{1:i-1}\mid s_{1:i-1},C),
$$
where $\mathbb{H}(o_{1:i-1}\mid s_{1:i-1})=\mathbb{H}(o_{1:i-1}\mid s_{1:i-1},C)=0.$ Symmetrically, we have:
$$
\mathbb{I}(o_{1:i-1};r_{i}\mid C)=\mathbb{I}(s_{1:i-1};o_{1:i-1};r_{i}\mid C)\leq\mathbb{I}(s_{1:i-1};r_{i}\mid C)=\mathbb{H}(r_{i}\mid C)-\mathbb{H}(r_{i}\mid s_{1:i-1},C),
$$
where $\mathbb{H}(r_{i}\mid s_{0:i-1},r_{0:i-1})≤\mathbb{H}(r_{i}\mid s_{1:i-1},o_{0},r_{0:i-1})$ by the data processing inequality, since $o_{0}$ is a function of $s_{0}$. Combining the two upper bounds proves Eq. (10). ∎
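The interaction-information identity used in the proof can be checked numerically: $\mathbb{I}(X;Y)-\mathbb{I}(X;Y\mid Z)$ is symmetric in its three arguments. The following spot-check on a random discrete joint is our own illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((3, 4, 2)); p /= p.sum()      # joint over (X, Y, Z)

def H(*axes):
    """Entropy (in nats) of the marginal of p over the given axes (0=X, 1=Y, 2=Z)."""
    drop = tuple(i for i in range(3) if i not in axes)
    m = (p.sum(axis=drop) if drop else p).ravel()
    return -float(np.sum(m * np.log(m)))

I_xy         = H(0) + H(1) - H(0, 1)                  # I(X;Y)
I_xy_given_z = H(0, 2) + H(1, 2) - H(0, 1, 2) - H(2)  # I(X;Y|Z)
I_xz         = H(0) + H(2) - H(0, 2)                  # I(X;Z)
I_xz_given_y = H(0, 1) + H(1, 2) - H(0, 1, 2) - H(1)  # I(X;Z|Y)

# Both ways of computing I(X;Y;Z) agree up to float error:
assert abs((I_xy - I_xy_given_z) - (I_xz - I_xz_given_y)) < 1e-12
```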
**Corollary 1**
*If observations are fully informative about the underlying states, i.e., $\mathbb{H}(s_{i}\mid o_{i})=0$ for all $i$ , and the state transition dynamics are deterministic, then explicit world modeling provides no reduction in reasoning uncertainty: $\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})=0.$*
*Proof.*
By Eq. (11), we have
$$
\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})≤\mathbb{I}(o_{1:i-1};s_{1:i-1}\mid o_{0},r_{0:i-1})≤\mathbb{H}(s_{1:i-1}\mid o_{0},r_{0:i-1}).
$$
Under the assumption $\mathbb{H}(s_{0}\mid o_{0})=0$ , the initial observation $o_{0}$ uniquely determines $s_{0}$ . Moreover, under deterministic state transitions, the trajectory $s_{1:i-1}$ is uniquely determined by $(s_{0},r_{1:i-1})$ . Hence,
$$
\mathbb{H}(s_{1:i-1}\mid o_{0},r_{0:i-1})=\mathbb{H}(s_{1:i-1}\mid s_{0},r_{0:i-1})=0.
$$
Therefore, $\mathbb{I}(o_{1:i-1};r_{i}\mid o_{0},r_{0:i-1})=0,$ which proves the corollary. ∎
Remarks. Corollary 1 shows that in deterministic and fully observable environments, given sufficient data and model capacity, explicit world modeling provides no additional benefit. This theoretical result is consistent with our empirical findings on the simple maze task.
7.2 Prior Knowledge
In this section, we first derive a generalization bound for transfer learning under distribution shift, and relate it to our perspective on prior knowledge in multimodal reasoning.
7.2.1 General Transfer Learning Analysis
Problem setup. A standard transfer learning setup involves a pre-training data distribution $P$ and a fine-tuning data distribution $Q$ over samples $(x,y)∈\mathcal{X}×\mathcal{Y}$, together with a loss function $\ell_{\theta}(x,y)∈[0,1]$. Define the population risks $\mathcal{L}_{D}(\theta):=\mathbb{E}_{(x,y)\sim D}[\ell_{\theta}(x,y)]$ for $D∈\{P,Q\}$, and the population minimizers $\theta_{D}^{\star}∈\arg\min_{\theta∈\Theta}\mathcal{L}_{D}(\theta)$ for $D∈\{P,Q\}$. We assume the pre-trained model attains $\theta_{P}^{\star}$ given sufficient data and optimization. For a radius $r>0$, we then define the fine-tuning constraint set (a local neighborhood around the pre-trained model)
$$
\Theta_{r}:=\{\theta\in\Theta:\ \|\theta-\theta_{P}^{\star}\|\leq r\}.
$$
Given $n$ i.i.d. samples from $Q$, $S=\{(x_{i},y_{i})\}_{i=1}^{n}$ with $(x_{i},y_{i})\sim Q$, the fine-tuned model $\theta_{Q}$ minimizes the empirical risk $\widehat{\mathcal{L}}_{Q}(\theta):=\frac{1}{n}\sum_{i=1}^{n}\ell_{\theta}(x_{i},y_{i})$ over $\Theta_{r}$. Our analysis focuses on the excess risk on $Q$: $\mathcal{E}_{Q}(\theta_{Q}):=\mathcal{L}_{Q}(\theta_{Q})-\mathcal{L}_{Q}(\theta_{Q}^{\star}).$
From distribution shift to parameter drift. We first derive how the distribution shift relates to the shift of the population minimizer.
**Lemma 1 (Uniform Loss Shift under Total Variation)**
*For any subset $\mathcal{S}\subseteq\Theta$ ,
$$
\sup_{\theta\in\mathcal{S}}\big|\mathcal{L}_{Q}(\theta)-\mathcal{L}_{P}(\theta)\big|\leq\mathrm{TV}(P,Q).
$$*
*Proof.*
Fix any $\theta∈\mathcal{S}$ and define $f_{\theta}(x,y):=\ell_{\theta}(x,y)∈[0,1]$. By the definition of total variation and the standard inequality for bounded functions, $\big|\mathbb{E}_{Q}[f_{\theta}]-\mathbb{E}_{P}[f_{\theta}]\big|≤\mathrm{TV}(P,Q).$ Taking the supremum over $\theta∈\mathcal{S}$ yields the claim. ∎
**Lemma 2 (Risk Proximity of $\theta_{Q}^{\star}$ under $P$)**
*$$
\mathcal{L}_{P}(\theta_{Q}^{\star})\leq\mathcal{L}_{P}(\theta_{P}^{\star})+2\mathrm{TV}(P,Q). \tag{12}
$$*
*Proof.*
By Lemma 1, $\mathcal{L}_{P}(\theta_{Q}^{\star})≤\mathcal{L}_{Q}(\theta_{Q}^{\star})+\mathrm{TV}(P,Q).$ By optimality of $\theta_{Q}^{\star}$ on $Q$ , $\mathcal{L}_{Q}(\theta_{Q}^{\star})≤\mathcal{L}_{Q}(\theta_{P}^{\star}).$ Applying Lemma 1 again, $\mathcal{L}_{Q}(\theta_{P}^{\star})≤\mathcal{L}_{P}(\theta_{P}^{\star})+\mathrm{TV}(P,Q).$ Chaining the three inequalities proves (12). ∎
**Assumption 1 (Local Quadratic Growth / Sharpness of $\mathcal{L}_{P}$)**
*There exists $\mu>0$ such that for all $\theta$ in a neighborhood containing $\theta_{Q}^{\star}$ ,
$$
\mathcal{L}_{P}(\theta)\geq\mathcal{L}_{P}(\theta_{P}^{\star})+\frac{\mu}{2}\|\theta-\theta_{P}^{\star}\|^{2}.
$$*
**Lemma 3 (Parameter Drift Controlled by $\mathrm{TV}(P,Q)$)**
*Under Assumption 1,
$$
\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|\leq\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)}. \tag{13}
$$*
*Proof.*
By Assumption 1 with $\theta=\theta_{Q}^{\star}$ , $\mathcal{L}_{P}(\theta_{Q}^{\star})≥\mathcal{L}_{P}(\theta_{P}^{\star})+\frac{\mu}{2}\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|^{2}.$ Rearranging, $\frac{\mu}{2}\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|^{2}≤\mathcal{L}_{P}(\theta_{Q}^{\star})-\mathcal{L}_{P}(\theta_{P}^{\star}).$ Applying Lemma 2 yields $\frac{\mu}{2}\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|^{2}≤ 2\mathrm{TV}(P,Q),$ and hence (13). ∎
Control of the bias term. Recall the fine-tuning bias induced by restricting to $\Theta_{r}$ : $\varepsilon_{\mathrm{bias}}(r):=\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)-\mathcal{L}_{Q}(\theta_{Q}^{\star}).$
**Assumption 2 ($\mathcal{L}_{Q}$ is Locally Lipschitz)**
*There exists $L_{Q}>0$ such that for all $\theta,\theta^{\prime}∈\Theta_{r}$ ,
$$
|\mathcal{L}_{Q}(\theta)-\mathcal{L}_{Q}(\theta^{\prime})|\leq L_{Q}\|\theta-\theta^{\prime}\|.
$$*
**Theorem 5 (Bias Bound via Distribution Shift)**
*Under Assumption 1 and Assumption 2,
$$
\varepsilon_{\mathrm{bias}}(r)\leq L_{Q}\left(\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)}-r\right)_{+}, \tag{14}
$$
where $(x)_{+}:=\max\{x,0\}$ . In particular, if $r≥\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)},$ then $\varepsilon_{\mathrm{bias}}(r)=0$ .*
*Proof.*
If $r≥\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|$, then $\theta_{Q}^{\star}∈\Theta_{r}$ and thus $\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)≤\mathcal{L}_{Q}(\theta_{Q}^{\star})$, implying $\varepsilon_{\mathrm{bias}}(r)=0$. Now consider $r<\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|$. Let $\theta_{r}$ be the projection of $\theta_{Q}^{\star}$ onto the closed ball $\Theta_{r}$, i.e., $\theta_{r}:=\theta_{P}^{\star}+r·\frac{\theta_{Q}^{\star}-\theta_{P}^{\star}}{\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|}$. Then $\theta_{r}∈\Theta_{r}$ and $\|\theta_{r}-\theta_{Q}^{\star}\|=\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|-r.$ Therefore,
$$
\varepsilon_{\mathrm{bias}}(r)=\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)-\mathcal{L}_{Q}(\theta_{Q}^{\star})\leq\mathcal{L}_{Q}(\theta_{r})-\mathcal{L}_{Q}(\theta_{Q}^{\star})\leq L_{Q}\|\theta_{r}-\theta_{Q}^{\star}\|=L_{Q}(\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|-r).
$$
Using Lemma 3 to bound $\|\theta_{Q}^{\star}-\theta_{P}^{\star}\|$ completes the proof of (14). ∎
Fine-tuning excess risk bound. We then arrive at the final result:
**Theorem 6 (Fine-tuning Excess Risk Bound with Shift-Controlled Bias)**
*Suppose Assumptions 1 and 2 hold, and that uniform convergence holds over $\Theta_{r}$: with probability at least $1-\delta$ over samples $S$,
$$
\sup_{\theta\in\Theta_{r}}\big|\mathcal{L}_{Q}(\theta)-\widehat{\mathcal{L}}_{Q}(\theta)\big|\leq\varepsilon_{\mathrm{gen}}=O\!\left(\sqrt{\frac{\operatorname{Rad}_{Q,n}(\Theta_{r})+\log(1/\delta)}{n}}\right),
$$
where $\operatorname{Rad}_{Q,n}(\Theta_{r})$ is the Rademacher complexity of the function class $\{\ell_{\theta}:\theta∈\Theta_{r}\}$ with respect to $Q$ for sample size $n$. Then with probability at least $1-\delta$,
$$
\mathcal{E}_{Q}(\theta_{Q})\leq 2\varepsilon_{\mathrm{gen}}+L_{Q}\left(\sqrt{\frac{4}{\mu}\,\mathrm{TV}(P,Q)}-r\right)_{+}. \tag{15}
$$*
*Proof.*
Decompose the excess risk as $\mathcal{E}_{Q}(\theta_{Q})=\Big(\mathcal{L}_{Q}(\theta_{Q})-\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)\Big)+\varepsilon_{\mathrm{bias}}(r).$ The first term is bounded by a standard ERM argument using uniform convergence: $\mathcal{L}_{Q}(\theta_{Q})-\inf_{\theta\in\Theta_{r}}\mathcal{L}_{Q}(\theta)≤ 2\varepsilon_{\mathrm{gen}}.$ The second term is bounded by Theorem 5. Combining the two bounds yields (15). ∎
7.2.2 Remarks on Multimodal Reasoning
Theorem 6 reveals a trade-off between modality complexity and distribution shift. This general transfer learning analysis can be instantiated in our setting of learning world models and reasoning policies. Specifically, training pairs $(x,y)$ can be instantiated as $((o_{0:i},r_{0:{i+1}}),o_{i+1})$ for world modeling and $((o_{0:i},r_{0:{i}}),r_{i})$ for reasoning, respectively. Crucially, the distribution shift between large-scale pre-training data and downstream tasks may differ substantially across modalities. For example, there are abundant visual demonstrations of paper folding on the Internet, whereas detailed verbal descriptions of folding dynamics are comparatively scarce. This suggests that downstream tasks should be formulated under the most appropriate observation modality for world modeling and reasoning—i.e., the modality that best aligns with pre-training data—in order to achieve stronger generalization at inference time and higher sample efficiency during post-training.
8 Experiment Details
8.1 VisWorld-Eval and Training Data
In this section, we elaborate on the construction of training and test data for each task in VisWorld-Eval.
Paper folding. This task involves folding a paper grid with varying grid sizes (3–8) and folding steps (1–4). After folding, holes of different shapes—circles, triangles, stars, diamonds, and squares—are punched into the paper. The model is then asked to predict the distribution of holes after the paper is completely unfolded, including queries such as the total number of holes, the number of holes of a specific shape, or the difference in counts between shapes. All test prompts are constructed at the highest difficulty level (grid size 8 with 4 folding steps). For SFT, we generate chains of thought using rule-based templates that follow a fixed procedure: unfold the paper step-by-step and then count the resulting holes by shape. These CoTs are then rewritten with Gemini 2.5 Pro to improve clarity and logical coherence. Under visual world modeling, we interleave reasoning steps with images of partially unfolded paper states. Under verbal world modeling, we represent intermediate states using two matrices encoding grid coverage status and hole shape at each position. Under implicit world modeling, we remove the explicit state tracking from the original CoTs.
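The step-by-step unfolding procedure above can be sketched in a few lines. This is our own illustration under assumed encodings (holes as grid coordinates, folds as axis-aligned reflection lines), not the paper's generator: each unfold step reflects every existing hole across the fold line, so holes double with each step.

```python
# Minimal sketch of unfold-and-count: undo folds last-to-first, reflecting
# every hole across each fold line.
def unfold(holes, folds, size):
    """holes: set of (x, y); folds: list of (axis, line) in folding order."""
    for axis, line in reversed(folds):
        mirrored = set(holes)
        for x, y in holes:
            if axis == "v":                     # vertical fold line x = line
                mirrored.add((2 * line - x, y))
            else:                               # horizontal fold line y = line
                mirrored.add((x, 2 * line - y))
        holes = mirrored
    return holes

# One vertical fold of an 8-wide grid at x = 4, one punched hole at (5, 2):
assert unfold({(5, 2)}, [("v", 4)], 8) == {(5, 2), (3, 2)}
```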
Multi-hop manipulation. This task begins with an initial arrangement of several geometric objects (cubes, spheres, and cylinders) in various colors, rendered with Blender (https://www.blender.org/). A sequence of text-based instructions is then provided, describing operations such as changing or swapping objects’ color or shape, adding new objects, or removing existing ones. To ensure these commands can be interpreted unambiguously in a 3D space, the instructions consistently use relative spatial references, with each object uniquely identified by its combined color and shape attributes—for example: "Place a purple cylinder between the black sphere and the yellow cube." The model is asked to infer the resulting spatial layout. Queries may include the total number of objects of a specific shape, the directional relationship between two objects, or which object lies in a given direction relative to a reference object. Test prompts are constructed by varying both the number of initial objects (between 3 and 6) and the frequency (between 1 and 5) of different operation types. For SFT, chain-of-thought reasoning is generated using rule-based templates that simulate the stepwise execution of instructions before answering the final query, and these CoTs are subsequently refined with Gemini 2.5 Pro.
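The stepwise execution that the rule-based templates simulate amounts to tracking a symbolic scene state across instructions. The sketch below is our own simplification (2-D positions, objects keyed by color and shape), not the paper's pipeline; the actual scenes are 3-D Blender renders.

```python
# Objects are keyed by (color, shape) and carry positions; each instruction
# updates this symbolic state before the final query is answered.
state = {("black", "sphere"): (0.0, 0.0), ("yellow", "cube"): (2.0, 0.0)}

def place_between(state, new_obj, ref_a, ref_b):
    """Execute e.g. 'Place a purple cylinder between the black sphere and
    the yellow cube' by averaging the two reference positions."""
    (xa, ya), (xb, yb) = state[ref_a], state[ref_b]
    state[new_obj] = ((xa + xb) / 2, (ya + yb) / 2)

place_between(state, ("purple", "cylinder"),
              ("black", "sphere"), ("yellow", "cube"))
assert state[("purple", "cylinder")] == (1.0, 0.0)
```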
Ball tracking. This task features a red point-mass ball that moves at constant speed, reflects elastically off solid walls, and travels in the initial direction indicated by a green arrow. The model is asked to predict which numbered hole at the top of the image the ball will enter first. We generate input images with randomized resolution, initial ball position and direction, and a random number of holes (4–8). For test prompts, we select cases in which the ball trajectory reflects off at least one wall before entering a hole. For SFT, CoTs are generated by Seed 1.6, which is asked to explain the ball dynamics between adjacent frames.
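The task's dynamics can be sketched with a simple fixed-step simulation. This is our own minimal model under assumed dynamics (a point ball with upward velocity, elastic side-wall reflections), not the paper's renderer.

```python
# Step a point ball with elastic side-wall reflections until it reaches
# the top edge; return its x position there (requires vy > 0).
def track_ball(x, y, vx, vy, width, height, dt=0.01):
    while y < height:
        x += vx * dt
        y += vy * dt
        if x < 0 or x > width:            # elastic reflection off a side wall
            vx = -vx
            x = min(max(x, 0.0), width)
    return x                              # exit position along the top edge

# Launched at 45 degrees from (1, 0) in a 2-wide, 3-tall box, the ball
# bounces once off the right wall and exits near x = 0.
assert abs(track_ball(1.0, 0.0, 1.0, 1.0, 2.0, 3.0)) < 0.2
```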
Sokoban. Sokoban is a classic grid-based puzzle game. We generate instances with grid sizes ranging from 6 to 10, containing a single box and a target position. Test prompts are sampled from the same distribution as the training data. To construct CoTs, we use a search algorithm to compute an optimal solution path. To avoid excessively long trajectories, we render only key intermediate steps, including: (i) the player moving toward the box, (ii) pushing the box in a direction, and (iii) changing the pushing direction. To encourage reflective behavior, we additionally augment trajectories with randomized detours that involve walking into walls, reflecting, and backtracking to rejoin the optimal path. CoTs are generated by Seed 1.6, which explains the dynamics between adjacent frames. For visual world modeling, the rendered intermediate steps are interleaved with verbal CoTs. For pure verbal world modeling, these intermediate renderings are removed. For implicit world modeling, we additionally mask all explicit coordinates in the CoTs with the special token [masked].
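For the single-box setting, an optimal solution path can be found by breadth-first search over joint (player, box) states. The following is a hedged sketch with an assumed grid encoding (wall cells as a set of coordinates), illustrating the kind of search used, not the paper's exact implementation.

```python
from collections import deque

def solve_sokoban(walls, player, box, goal):
    """BFS over (player, box) states for a single-box instance;
    returns the length of an optimal move sequence, or None."""
    start = (player, box)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        ((px, py), (bx, by)), d = queue.popleft()
        if (bx, by) == goal:
            return d
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            npos, nbox = (px + dx, py + dy), (bx, by)
            if npos in walls:
                continue
            if npos == nbox:                   # moving into the box pushes it
                nbox = (bx + dx, by + dy)
                if nbox in walls:
                    continue
            state = (npos, nbox)
            if state not in seen:
                seen.add(state)
                queue.append((state, d + 1))
    return None

# 1-D corridor: the player at x=0 pushes the box from x=1 to the goal at x=3.
walls = {(-1, 0), (4, 0)} | {(x, y) for x in range(-1, 5) for y in (-1, 1)}
assert solve_sokoban(walls, (0, 0), (1, 0), (3, 0)) == 2
```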
Maze. Maze is a classic grid-based puzzle task. We generate both training and test samples with a fixed grid size of $5 \times 5$. To construct CoTs, we use rule-based templates followed by rewriting for improved naturalness. Under visual world modeling, rendered intermediate steps (points and lines) are interleaved with verbal CoTs. The settings for verbal and implicit world modeling follow the same protocol as in Sokoban, with coordinates masked as <point>[masked]</point>.
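The coordinate-masking step for implicit world modeling reduces to a simple substitution over the CoT text. A sketch, assuming the `<point>x y</point>` format from the maze CoT examples (`mask_points` is a hypothetical helper name):

```python
import re

def mask_points(cot: str) -> str:
    """Replace every explicit coordinate in a CoT with the [masked]
    placeholder used for implicit world modeling."""
    return re.sub(r"<point>[^<]*</point>", "<point>[masked]</point>", cot)

cot = "Move right to <point>246 83</point>, then down to <point>246 246</point>."
print(mask_points(cot))
# prints: Move right to <point>[masked]</point>, then down to <point>[masked]</point>.
```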
Cube 3-view projection. This task considers stacks of colored cubes arranged on grids of varying sizes (3–5), with two cube colors. The input consists of one isometric view (either front-left or front-right) and two orthographic views of the stack. The question asks for the number of cubes of a specified color visible from another orthogonal view. Both the questions and answer choices account for ambiguity caused by occlusions, leading to uncertainty in the cube count. All test prompts are constructed using uniformly random grid sizes between 3 and 5. We generate CoTs using rule-based templates: the model first constructs the queried view, marks potentially occluded cubes using a third (auxiliary) color, and then counts cubes by color. These CoTs are subsequently rewritten by Gemini 2.5 Pro for improved naturalness. Under visual world modeling, we interleave reasoning steps with an image of the queried view. Under verbal world modeling, we represent intermediate views using character matrices, where different colors are encoded by different symbols.
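The occlusion-aware counting at the end of these CoTs reduces to tallying symbols in the character matrix. A sketch using the legend from the examples ('O' for red, 'X' for the primary color, '*' for occluded cubes of uncertain color); the function name and sample matrix are illustrative:

```python
def count_red_range(view, red="O", uncertain="*"):
    """Given a character-matrix view, return the (min, max) possible red-cube
    counts: confirmed reds, plus occluded cells that could also be red."""
    confirmed = sum(row.count(red) for row in view)
    hidden = sum(row.count(uncertain) for row in view)
    return confirmed, confirmed + hidden

# A 4x4 left view with 2 confirmed red cubes and 2 occluded positions:
# between 2 and 4 red cubes are possible, so the answer "4" is the maximum.
view = ["XX*X",
        "XOXX",
        "O*XX",
        "XXXX"]
print(count_red_range(view))  # prints (2, 4)
```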
Real-world spatial reasoning. For this real-world task, we directly adopt test samples from MMSI-Bench, focusing on camera–object and camera–region positional relationship questions. We construct training prompts following a pipeline similar to Yang et al. [68]. To obtain training CoTs, we run a visual-CoT model, which uses an SFT-trained BAGEL model for novel view synthesis as a tool. The resulting visual CoTs are subsequently filtered and rewritten by Gemini 2.5 Pro.
We summarize the training and test sample counts for each task in VisWorld-Eval, along with the corresponding original or referenced benchmarks, in Table 2.
Table 2: Overview of VisWorld-Eval and corresponding training data: features, statistics, and references.
| Task | Capability | Domain | Training Samples | Test Samples | Source/Reference |
| --- | --- | --- | --- | --- | --- |
| Paper folding | Simulation | Synthetic | 2,357 | 480 | SpatialViz [61] |
| Multi-hop manipulation | Simulation | Synthetic | 2,000 | 480 | ZebraCoT [35], CLEVR [30] |
| Ball tracking | Simulation | Synthetic | 2,254 | 1,024 | RBench-V [20] |
| Maze | Simulation | Synthetic | 8,448 | 480 | maze-dataset [29] |
| Sokoban | Simulation | Synthetic | 7,715 | 480 | GameRL [55] |
| Cube 3-view projection | Reconstruction | Synthetic | 2,500 | 480 | SpatialViz [61] |
| Real-world spatial reasoning | Reconstruction | Real-world | 10,661 | 522 | MMSI-Bench [69] |
Examples of training CoTs are presented in Figures 10–14.
8.2 Model Training
We perform supervised fine-tuning (SFT) of BAGEL based on its official repository https://github.com/ByteDance-Seed/Bagel, using 8 GPUs, and conduct reinforcement learning from verifiable rewards (RLVR) using verl https://github.com/volcengine/verl on 64 GPUs. Hyperparameters for SFT and RLVR are reported in Table 3 and Table 4, respectively.
Table 3: Hyperparameters for supervised fine-tuning UMMs.
| Hyperparameter | Value |
| --- | --- |
| Learning rate | $3 \times 10^{-5}$ |
| LR Schedule | Constant |
| Optimizer | AdamW |
| Loss weight (CE:MSE) | 1:10 |
| Warm-up steps | 200 |
| Training steps | 4000 |
| Gen. resolution | (256, 1024) for paper folding, cube 3-view; (240, 1024) for multi-hop manipulation; (256, 512) otherwise |
| Und. resolution | (224, 980) |
| Sequence length per rank | 32K |
| Num. ranks | 8 |
Table 4: Hyperparameters for reinforcement learning UMMs.
| Hyperparameter | Value |
| --- | --- |
| Learning rate | $1 \times 10^{-5}$ |
| Batch size | 128 |
| GRPO mini batch size | 32 |
| Group size | 16 |
| KL loss coefficient for visual gen. | 0.1 |
| KL loss coefficient for verbal gen. | 0.0 |
<details>
<summary>x12.png Details</summary>

### Visual Description
## Paper Folding: Unfolding Process and Cutout Analysis
### Overview
The image presents a step-by-step explanation of unfolding a folded paper with cutouts. It aims to determine the final pattern and count the number of 'triangle_left' cutouts. The explanation includes visual representations of the folding and unfolding process, along with code-like representations of the cutout patterns at each stage.
### Components/Axes
* **Title:** Paper Folding
* **Question:** Asks to unfold the paper step-by-step and determine the number of 'triangle_left' cutouts.
* **Response:** Provides a step-by-step explanation with images and code-like representations.
* **Images:** Show the paper folding and unfolding process.
* **Code-like Representations:** Arrays representing the cutout patterns at each stage. The arrays contain numbers (0, 1, -1) and strings ('triangle_right', 'triangle_left', '').
### Detailed Analysis or ### Content Details
**Initial State (Top-Rightmost Image):**
* **Image:** Shows a folded paper with cutouts.
* **Array Representation:**
```
[[0, -1, -1],
[1, -1, -1],
[1, -1, -1]]
```
* **String Representation:**
```
[['', '', ''],
['triangle_right', '', ''],
['triangle_left', '', '']]
```
**First Unfolding Step:**
* **Description:** Reverses the second fold (vertical fold).
* **Reflection Axis:** Vertical line between column 0 and column 1.
* **Array Representation:**
```
[[0, 0, -1],
[1, 1, -1],
[1, 1, -1]]
```
* **String Representation:**
```
[['', '', ''],
['triangle_right', 'triangle_left', ''],
['triangle_left', 'triangle_right', '']]
```
**Second Unfolding Step:**
* **Description:** Reverses the first fold.
* **Reflection Axis:** Fold line between the middle and right columns.
* **Array Representation:**
```
[[0, 0, 0],
[1, 1, 1],
[1, 1, 1]]
```
* **String Representation:**
```
[['', '', ''],
['triangle_right', 'triangle_left', 'triangle_right'],
['triangle_left', 'triangle_right', 'triangle_left']]
```
**Final Answer:**
* The number of 'triangle_left' cutouts is 3.
### Key Observations
* The code-like representations use numbers and strings to describe the cutout patterns.
* The unfolding process involves mirroring shapes across vertical axes.
* The 'triangle_right' and 'triangle_left' shapes are flipped during the unfolding process.
### Interpretation
The image provides a clear and systematic explanation of how to unfold a folded paper with cutouts. The combination of visual representations and code-like representations makes it easy to follow the unfolding process and understand how the cutout patterns change at each step. The final answer of 3 'triangle_left' cutouts is derived logically from the unfolding process. The use of arrays and strings to represent the cutout patterns allows for a more precise and detailed analysis of the unfolding process.
</details>
Figure 10: Examples of chain-of-thought SFT data for the paper folding task, under visual world modeling (left) and verbal world modeling (right).
<details>
<summary>x13.png Details</summary>

### Visual Description
## Ball Tracking and Multi-Hop Manipulation
### Overview
The image presents two distinct problem-solving scenarios: "Ball Tracking" and "Multi-Hop Manipulation." The "Ball Tracking" section involves predicting the trajectory of a ball reflecting off walls to enter a numbered hole. The "Multi-Hop Manipulation" section describes a sequence of object manipulations in a 3D scene and asks for the final configuration.
### Components/Axes
**Ball Tracking:**
* **Question:** Poses the problem of predicting which numbered hole the ball will enter.
* **Given Information:**
* A red ball and a green arrow indicating the ball's initial direction.
* Three rectangular holes numbered 1, 2, and 3 from left to right.
* Black boundaries representing solid walls with ideal reflection.
* The ball is considered a point mass.
* Reflection rules: perpendicular velocity component reverses, parallel component remains unchanged, constant speed.
* **Response:** Provides a step-by-step analysis of the ball's trajectory, including:
* Initial analysis of the image.
* Tracking the ball's initial motion.
* Description of the ball's movement after hitting the bottom wall.
* Tracking the upward motion after reflection.
* Determining the target hole.
* **Diagrams:** Two diagrams showing the ball's trajectory:
* First diagram: Ball moving downward and hitting the bottom wall.
* Second diagram: Ball moving upward towards hole 2 after reflection.
**Multi-Hop Manipulation:**
* **Question:** Describes an initial arrangement of objects in a 3D scene and asks for the object to the right of the blue cylinder after a series of manipulations.
* **Given Information:**
* Objects are viewed from an oblique front perspective.
* Objects closer to the camera are considered 'front'.
* 'Left' and 'right' sides correspond to the frame's left and right.
* **Operations:**
1. Change the object directly in front of the yellow cuboid into a rose cylinder.
2. Place a gray cylinder behind and to the left of the object directly behind the rose cylinder.
3. Place a gray sphere to the left of the rose cylinder.
* **Response:** Provides a step-by-step execution of the operations, including:
* Initial scene description with five objects: blue cylinder, gray sphere, gray cuboid, red cylinder, and yellow cuboid.
* Changing the gray sphere to a rose cylinder.
* Placing a gray cylinder to the left-back of the yellow cuboid.
* Adding a gray sphere to the left of the rose cylinder.
* **Diagrams:** Four diagrams showing the scene after each manipulation step.
### Detailed Analysis or ### Content Details
**Ball Tracking:**
* The initial position of the red ball is in the top-left quadrant of the rectangular area.
* The green arrow indicates a downward and slightly rightward direction.
* The ball's trajectory involves a single reflection off the bottom wall.
* The final answer, \boxed{2}, indicates that the ball will enter hole number 2.
**Multi-Hop Manipulation:**
* The initial scene contains five objects with distinct colors and shapes.
* The first operation transforms the gray sphere into a rose cylinder.
* The second operation places a gray cylinder behind and to the left of the yellow cuboid.
* The third operation adds a gray sphere to the left of the rose cylinder.
* The final answer, A, indicates that the object to the right of the blue cylinder is a gray cylinder.
### Key Observations
**Ball Tracking:**
* The problem relies on understanding basic physics principles of reflection.
* The diagrams visually represent the ball's trajectory.
* The step-by-step reasoning leads to the correct answer.
**Multi-Hop Manipulation:**
* The problem requires spatial reasoning and following instructions carefully.
* The diagrams illustrate the changes in the scene after each operation.
* The step-by-step execution ensures the correct final configuration.
### Interpretation
**Ball Tracking:**
The "Ball Tracking" problem demonstrates the application of physics principles to predict the motion of an object. It highlights the importance of understanding reflection rules and visualizing trajectories. The problem is designed to test logical reasoning and problem-solving skills.
**Multi-Hop Manipulation:**
The "Multi-Hop Manipulation" problem assesses spatial reasoning and the ability to follow a sequence of instructions. It emphasizes the importance of visualizing 3D scenes and understanding relative positions of objects. The problem is designed to test cognitive skills related to spatial awareness and sequential processing.
</details>
Figure 11: Examples of chain-of-thought SFT data for the ball tracking and multi-hop manipulation task.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Maze and Sokoban Puzzle Solutions
### Overview
The image presents solutions to two different puzzles: a maze and a Sokoban puzzle. The maze solution involves finding a path from a red dot to a blue 'X', while the Sokoban solution involves pushing a box onto a goal position in a grid-based environment.
### Components/Axes
**Maze:**
* **Starting Point:** Marked by a red dot. Located in the upper-left area of the maze.
* **Goal:** Marked by a blue 'X'. Located in the lower-left corner of the maze.
* **Maze Structure:** Consists of paths and walls.
**Sokoban:**
* **Player:** Represented by a character. Starts at position (3,1).
* **Box:** Starts at position (1,2).
* **Goal:** Marked with green 'X's. Located at position (1,4).
* **Grid:** A 5x6 grid (0-4 on x-axis, 0-5 on y-axis) representing the room.
* **Walls:** Represented by orange brick tiles.
* **Floor:** Represented by lighter, dotted tiles.
### Detailed Analysis or ### Content Details
**Maze Solution:**
The solution is presented as a series of steps, with the path described textually and visually. The path is also given as a list of waypoints:
* <point>165 83</point>
* <point>246 83</point>
* <point>246 246</point>
* <point>165 246</point>
* <point>165 328</point>
* <point>83 328</point>
* <point>83 410</point>
**Sokoban Solution:**
The solution is presented in two steps:
* **Step 1:** The player moves up twice from (3,1) to (1,1) to get to the left side of the box.
* **Step 2:** The player pushes the box right twice from (1,2) to (1,4), landing on the goal position.
The final solution is given as a sequence of moves: \boxed{up, up, right, right}.
### Key Observations
* The maze solution is described in detail, with each step explained.
* The Sokoban solution is concise, focusing on the key moves required to solve the puzzle.
* The images accompanying the solutions provide visual confirmation of the steps taken.
### Interpretation
The image demonstrates the solutions to two different types of puzzles. The maze solution highlights the process of navigating a complex path, while the Sokoban solution demonstrates strategic planning to manipulate objects in a constrained environment. The combination of textual descriptions and visual aids makes the solutions easy to understand.
</details>
Figure 12: Examples of chain-of-thought SFT data for the maze and sokoban task.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Cube 3-View Projection and Left View Construction
### Overview
The image presents a problem involving a cube stack made of lightcyan and red cubes. It shows the isometric view of the cube stack from front-right, top, and right perspectives. The task is to determine how many red cubes can be seen from the left view. The image then provides a step-by-step construction of the left view, accounting for hidden cubes and color uncertainties, ultimately arriving at the answer.
### Components/Axes
* **Title:** Cube 3-View Projection
* **Views:**
* front_right (isometric view of the cube stack)
* top (top view of the cube stack)
* right (right view of the cube stack)
* **Question:** A text block posing the problem and providing multiple-choice answers.
* **Response:** A step-by-step explanation and construction of the left view.
* **Cube Colors:** lightcyan and red.
* **Character Matrix Legend:** 'O' represents a red cube, 'X' represents a lightcyan cube, and '*' represents a cube of uncertain color.
### Detailed Analysis or ### Content Details
1. **Problem Statement:**
* A cube stack is made of equal-sized small cubes, mostly in lightcyan with a few red ones.
* The image shows its isometric view from front-right, top view, and right view from left to right.
* Question: How many cubes in red can possibly be seen from the left view?
* Options: A. 4. B. 6. C. 5. D. 0.
2. **Initial Response:**
* Construct the left view based on the provided front-right, top, and right views.
* The initial construction shows the cubes whose colors are known for certain.
* A 4x4 grid shows the initial left view with one red cube ('O') and the rest lightcyan ('X').
3. **Accounting for Hidden Cubes:**
* The initial view is incomplete as it doesn't account for cubes that might be hidden from all given perspectives.
* Regenerate the left view, marking the positions of these potentially hidden cubes in white.
* A 3x3 grid shows the left view with one red cube and two white cubes.
4. **Complete Left View:**
* The complete left view shows 2 cubes that are confirmed to be red.
* It also identifies 2 positions (marked in white) where cubes could exist but whose colors are unknown.
* Since these hidden cubes could also be red, the total number of red cubes visible from the left could range from a minimum of 2 to a maximum of 4.
* Therefore, the highest possible number of red cubes is 4.
5. **Character Matrix Refinement:**
* The initial representation is incomplete because some cubes that are visible from the left are occluded in the given views, making their color uncertain.
* Refine the matrix, using 'O' for red cubes, 'X' for lightcyan cubes, and '*' for cubes of an uncertain color.
* A 4x4 character matrix is presented:
</details>
Figure 13: Examples of chain-of-thought SFT data for the cube 3-view projection task, under visual world modeling (left) and verbal world modeling (right).
<details>
<summary>x16.png Details</summary>

### Visual Description
## Chart/Diagram Type: Spatial Reasoning Problems
### Overview
The image presents two spatial reasoning problems, each with a question, multiple-choice options, and a detailed response explaining the solution. Each problem involves visualizing objects in a room from different perspectives and determining their relative positions. The image includes textual descriptions and reasoning, along with references to "Image 1" and "Image 2" which are photographs depicting the room from different angles.
### Components/Axes
**Problem 1 (Left Side):**
* **Question:** If the round clock mounted on the dark wall is to the north of the heater below the windows, and you stand at the heater, in which direction will you find the white door in the room's background?
* **Options:** A. north B. southwest C. northwest D. east.
* **Response:** A detailed explanation of the reasoning process, including establishing a reference frame, examining provided images, simulating a 45-degree turn, and concluding that the white door is in the northwest direction.
* **Answer:** C (northwest)
**Problem 2 (Right Side):**
* **Question:** When table lamp (in frame 2) is south of pillow (in frame 2), what is the position of tv (in frame 1) in relation to pillow (in frame 2)?
* **Options:** A. northeast, B. southwest, C. northwest, D. west
* **Response:** A detailed explanation of the reasoning process, including establishing a clear view of relevant objects, relating the TV's position to the pillow's position, and confirming the relative position based on north-south and east-west axes.
* **Answer:** B (southwest)
### Detailed Analysis or Content Details
**Problem 1 Analysis:**
1. **Reference Frame:** The clock is north of the heater.
2. **Image 1:** Shows the white door, a bookshelf, and a TV on a media console.
3. **Image 2:** Shows the heater, the clock, and the same TV and media console.
4. **Spatial Reasoning:** The solver simulates a 45-degree turn to the left from the heater's position to locate the white door.
5. **Conclusion:** The white door is in the northwest direction.
**Problem 2 Analysis:**
1. **Object Identification:** Pillow and table lamp in Frame 2, TV in Frame 1.
2. **Spatial Relationship:** The table lamp is south of the pillow.
3. **TV Position:** The TV is on a wall that is south and west of the pillow's location.
4. **Conclusion:** The TV is southwest of the pillow.
### Key Observations
* Both problems involve spatial reasoning and require the solver to visualize objects from different perspectives.
* The responses provide step-by-step explanations of the reasoning process.
* The problems rely on the relative positions of objects within a room.
### Interpretation
The image demonstrates how spatial reasoning can be used to solve problems involving the relative positions of objects. The problems highlight the importance of establishing a reference frame, visualizing objects from different perspectives, and using logical deduction to arrive at a solution. The detailed explanations provided in the responses offer insights into the cognitive processes involved in spatial reasoning. The use of images (though not directly present in the provided text) is crucial for visualizing the scene and understanding the spatial relationships between the objects.
</details>
Figure 14: Examples of chain-of-thought SFT data for the real-world spatial reasoning task.
The Qwen-VL baselines are trained using LLaMA-Factory https://github.com/hiyouga/LLaMA-Factory for supervised fine-tuning (SFT) and verl for reinforcement learning from verifiable rewards (RLVR).
8.3 Analytic Experiments
Sample efficiency. For Figure 6, we randomly subsample either 500 or 1000 training examples. The resulting models are evaluated under two settings: (i) a hard setting with the maximum difficulty (grid size 8 and 4 folding steps, default in VisWorld-Eval), and (ii) an in-distribution setting (denoted as Normal in the figure) with randomly sampled grid sizes (3–8) and folding steps (1–4).
Task difficulties and world model fidelity. For Figure 6, we generate test samples with varying cube-stack sizes (3–6), where size 6 is out-of-distribution relative to the training data. To assess world-model fidelity, we compare the generated views with the ground-truth views: for verbal world modeling, we use string pattern matching; for visual world modeling, we use Gemini 3 Pro to compare images. Since accurately inferring colors becomes particularly challenging at larger stack sizes, we evaluate only the shapes of the views and ignore color information. We also find that overall accuracy can be bottlenecked by verbal subskills (e.g., counting holes) after SFT, so we report the accuracy of RL-trained models in Figure 6. Conversely, RL can degrade verbal world-modeling capabilities, leading to invalidly formatted symbolic matrices, so we report the world-model fidelity of SFT-trained models.
Implicit world modeling. For Figure 6, we perform supervised fine-tuning (SFT) of BAGEL on CoTs with implicit world modeling, in which all explicit point coordinates are replaced by the placeholder token sequence <point>[masked]</point>. After training, we extract the hidden representations at the position of the masked token from each transformer layer. We then split the extracted representations from different CoTs into training and validation sets with an 8:2 ratio and train a two-layer MLP (hidden size 4096) to predict the ground-truth point coordinates. Since all samples are $5 \times 5$ mazes, we formulate coordinate prediction as two 5-way classification tasks (for $x$ and $y$, respectively). We compute classification accuracy for each coordinate and report the average of the two.
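The probing setup can be sketched as follows, matching the stated architecture (two-layer MLP, hidden size 4096, two 5-way heads, averaged accuracy); the class names are illustrative and the cross-entropy training loop is omitted:

```python
import torch
import torch.nn as nn

class CoordProbe(nn.Module):
    """Two-layer MLP probe with two 5-way heads for (x, y);
    `dim` is the transformer hidden size."""
    def __init__(self, dim, hidden=4096, grid=5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.head_x = nn.Linear(hidden, grid)
        self.head_y = nn.Linear(hidden, grid)

    def forward(self, h):
        z = self.trunk(h)
        return self.head_x(z), self.head_y(z)

def probe_accuracy(model, feats, coords):
    """Average of x- and y-classification accuracy, as reported."""
    with torch.no_grad():
        logits_x, logits_y = model(feats)
        acc_x = (logits_x.argmax(-1) == coords[:, 0]).float().mean()
        acc_y = (logits_y.argmax(-1) == coords[:, 1]).float().mean()
    return ((acc_x + acc_y) / 2).item()
```

One probe of this form is trained per transformer layer on the extracted representations.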
9 Extended Experimental Results
9.1 Full Results on MMSI-Bench
We report all scores on positional relationship tasks of MMSI-Bench in Table 5.
Table 5: Full results of SFT-trained UMMs on MMSI-Bench positional relationship tasks.
| | MMSI-Bench (Positional Relationship) | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Models | Cam.–Cam. | Obj.–Obj. | Reg.–Reg. | Cam.–Obj. | Obj.–Reg. | Cam.–Reg. | Overall |
| Implicit WM | 33.1 | 31.2 | 31.8 | 46.5 | 29.1 | 37.3 | 34.8 |
| Visual WM | 29.6 | 29.5 | 31.6 | 60.9 | 25.8 | 54.4 | 38.4 |
9.2 Additional Qualitative Evaluation
We provide additional qualitative evaluation of trained UMMs’ reasoning, particularly failure cases.
Real-world spatial reasoning. As shown in Figure 15 a, reasoning with implicit world modeling is prone to hallucinations. In contrast, visual generation (Figure 15 b) yields more faithful world models, but still suffers from insufficient quality, including blurring and corrupted details. Moreover, we find that current VLMs and UMMs continue to exhibit limited understanding of positions and directions across different viewpoints. We expect that stronger base models and better-curated post-training data will enable more effective use of visual world models for spatial reasoning in future work.
Paper folding. As illustrated in Figure 16, verbal reasoning about geometric symmetry is prone to hallucinations, leading to inaccurate verbal world modeling. In contrast, visual world models, benefiting from stronger prior knowledge, generate correct intermediate unfolding steps even in the presence of erroneous verbal reasoning.
Cube 3-view projection. As shown in Figure 17, visual world models are able to approximately generate novel views of cube stacks even in the challenging out-of-distribution setting with an unseen stack size of 6, indicating strong prior knowledge of spatial transformations. Nevertheless, overall task performance remains limited by subtle shape-generation errors (Figure 17 b,d) and inaccurate color inference (Figure 17 c). We expect these issues to be alleviated through improved post-training and stronger base models.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Chart/Diagram Type: Real-World Spatial Reasoning
### Overview
The image presents two spatial reasoning problems, each with a question, a series of images, and a model's reasoning process to arrive at an answer. Part (a) involves determining the location of a black chair relative to the viewer, while part (b) involves determining the direction to the TV area from the entrance.
### Components/Axes
**Part (a):**
* **Title:** Real-World Spatial Reasoning (a)
* **Question:** The images are taken continuously from a first-person perspective. At the moment of the last image, in which direction is the black chair relative to you?
* **Options:** A: Front right, B: Back left, C: Front left, D: Back right
* **Images:** Two images showing different views of a room. Image 1 shows a black chair in front of a white wall with a large painting and a dark hallway to the right of the painting. Image 2 shows a bed with a black headboard against a white wall and a dark door to the left of the bed.
* **Model Output (Implicit World Modeling):** A textual explanation of the reasoning process.
**Part (b):**
* **Title:** Real-World Spatial Reasoning (b)
* **Question:** When you enter the room through the white door, which way should you go to reach the TV area?
* **Options:** A: Go straight, B: Go left, C: Cannot be determined, D: Go right
* **Images:** Three images showing different views of a room. The first image shows a white door in the background. The second image shows a TV area. The third image shows both the white door and the TV area.
* **Model Output (Visual World Modeling):** A textual explanation of the reasoning process.
### Detailed Analysis or ### Content Details
**Part (a):**
* The model analyzes the two images to build a mental map of the room.
* It identifies the hallway in Image 1 as the same hallway on the right side of the bed in Image 2.
* The model determines that the camera's starting position is near the bed, looking towards it.
* It simulates turning the camera to see what's around the bed, starting by turning left from the perspective of Image 2.
* The model concludes that the black chair is in the "back-right" direction relative to the camera's position.
* The correct answer is D: Back right.
**Part (b):**
* The model identifies the white door in the first image and the TV area in the second image.
* It determines that the TV area is positioned to the left of the white door.
* The model confirms this by noting that moving left from the door's position aligns with the TV area's location.
* The correct answer is B: Go left.
### Key Observations
* Both parts involve spatial reasoning from a first-person perspective.
* The model uses visual information from the images to build a mental map of the room.
* The model uses logical reasoning to determine the correct answer.
### Interpretation
The image demonstrates how a model can use visual information and logical reasoning to solve spatial reasoning problems. The model's ability to build a mental map of the room and simulate movement within the environment is crucial to its success. The problems highlight the importance of understanding spatial relationships and perspective in real-world scenarios.
</details>
Figure 15: Showcases of reasoning generated by post-trained UMMs in the real-world spatial reasoning task. We highlight hallucinations or incorrect reasoning steps in red.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Diagram: Paper Folding
### Overview
The image presents a step-by-step guide to mentally reverse a paper folding process with cutouts to determine the final unfolded design. It includes visual and verbal modeling outputs, illustrating the unfolding process and the resulting shapes.
### Components/Axes
* **Title:** Paper Folding
* **Question:** Analyze the image showing a folded paper with cutouts. Mentally reverse the folding process to reconstruct the final unfolded design, then provide your answer to: calculate the number of triangle_left minus the number of triangle_right.
* **Model Output (Visual World Modeling):** Shows the unfolding process visually.
* **Model Output (Verbal World Modeling):** Describes the unfolding process verbally with code-like representations.
* **Shapes:** Square, left-pointing triangle, right-pointing triangle, diamond, downward-pointing triangle.
* **Folding Steps:** Four folds are reversed and described.
### Detailed Analysis or ### Content Details
**Visual World Modeling:**
* **Initial Folded State:** A square with a small diagonal cutout in the top-right corner.
* **First Unfolding:** The left side is unfolded, revealing no new shapes.
* **Second Unfolding:** The top half is unfolded, mirroring the shapes on the bottom half to the top half.
* **Third Unfolding:** The left section is unfolded, creating a complete square from the diagonal cutout.
* **Final Configuration:** Shows the final arrangement of squares and triangles.
**Verbal World Modeling:**
* **Initial State:** `[[-1, -1, -1], [-1, 1, -1], [-1, 1, -1]]` represents the initial folded state.
* `[['', '', ''], ['', 'diamond', ''], ['', 'triangle_left', '']]` describes the cutouts.
* **First Unfolding:** The square and left-pointing triangle remain, and a new right-pointing triangle is added.
* `[[-1, -1, -1], [-1, 1, 1], [-1, 1, 1]]`
* `[['', '', ''], ['', 'diamond', 'diamond'], ['', 'triangle_left', 'triangle_right']]`
* **Second Unfolding:** A downward-pointing triangle appears on the top half.
* `[[-1, 1, 1], [-1, 1, 1], [-1, 1, 1]]`
* `[['', 'triangle_down', 'triangle_right'], ['', 'diamond', 'diamond'], ['', 'triangle_left', 'triangle_right']]`
### Key Observations
* The visual and verbal models complement each other in explaining the unfolding process.
* The verbal model uses a code-like representation to describe the shapes and their positions.
* The unfolding process involves mirroring shapes across fold lines.
* The diagonal cutout transforms into a complete square after unfolding.
### Interpretation
The image provides a detailed explanation of how to mentally reverse a paper folding process with cutouts. The visual model offers a clear step-by-step illustration, while the verbal model provides a more abstract representation using code-like structures. The combination of these two models allows for a comprehensive understanding of the unfolding process and the resulting shapes. The question posed at the beginning encourages the viewer to actively engage with the process and calculate the difference between the number of left-pointing and right-pointing triangles in the final unfolded design.
</details>
Figure 16: Showcases of reasoning generated by post-trained UMMs in the paper folding task. We highlight hallucinations or incorrect reasoning steps in red, but also mark correctly generated visual unfolding intermediate steps with green borders.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Cube 3-View Projection
### Overview
The image presents four separate cube projection problems (a, b, c, d), each involving a stack of equal-sized cubes, most of a primary color with a few of a secondary color. For each problem, an isometric view and two orthographic views are provided. The task is to determine the maximum number of cubes of the secondary color visible from a specified view. The image includes the problem statement, possible answers, a model's reasoning process, and a "ground-truth" view for comparison.
### Components/Axes
Each sub-problem (a, b, c, d) contains the following elements:
* **Isometric View:** A 3D representation of the cube stack.
* **Orthographic Views:** Two 2D projections of the cube stack from different viewpoints (e.g., front, top, left, right).
* **Question:** A textual description of the problem, specifying the primary and secondary colors, the given views, and the target view.
* **Answer Options:** Multiple-choice options for the number of visible secondary-colored cubes.
* **Model Output:** A textual explanation of the model's reasoning process, including the generation of a specific view.
* **Generated View:** A 2D projection of the cube stack generated by the model, highlighted with a red border.
* **Ground-truth View:** A 2D projection of the cube stack representing the correct view, enclosed in a dashed border.
### Detailed Analysis
**Sub-problem (a):**
* **Colors:** Primary color is seashell, secondary color is yellow.
* **Views:** Isometric (front-left), top, and left views are given.
* **Question:** How many yellow cubes can possibly be seen from the front view?
* **Answer Options:** A. 2, B. All three other options are possible, C. 4, D. 1.
* **Model Output:** The model generates the front view.
* **Generated View:** A 2D projection showing the front view with yellow cubes.
* **Ground-truth View:** Not explicitly labeled, but implied to be the same as the generated view.
* **Model's Answer:** 2, corresponding to option A.
**Sub-problem (b):**
* **Colors:** Primary color is palegreen, secondary color is blue.
* **Views:** Isometric (front-right), front, and right views are given.
* **Question:** How many blue cubes can possibly be seen from the top view?
* **Answer Options:** A. 4, B. 5, C. 3, D. 0.
* **Model Output:** The model generates the top view.
* **Generated View:** A 2D projection showing the top view with blue cubes.
* **Ground-truth View:** A 2D projection of the top view.
* **Model's Answer:** 3, corresponding to option C.
**Sub-problem (c):**
* **Colors:** Primary color is palegreen, secondary color is darkviolet.
* **Views:** Isometric (front-left), top, and left views are given.
* **Question:** How many darkviolet cubes can possibly be seen from the right view?
* **Answer Options:** A. 3, B. All three other options are possible, C. 1, D. 2.
* **Model Output:** The model generates the right view, accounting for uncertainty in occluded cubes.
* **Generated View:** A 2D projection showing the right view with darkviolet and gray (uncertain) cubes.
* **Ground-truth View:** A 2D projection of the right view.
* **Model's Answer:** All three options (1, 2, and 3) are possible, corresponding to option B.
**Sub-problem (d):**
* **Colors:** Primary color is seashell, secondary color is green.
* **Views:** Isometric (front-left), left, and top views are given.
* **Question:** How many green cubes can possibly be seen from the front view?
* **Answer Options:** A. All three other options are possible, B. 0, C. 4, D. 2.
* **Model Output:** The model generates the front view.
* **Generated View:** A 2D projection showing the front view with green cubes.
* **Ground-truth View:** A 2D projection of the front view.
* **Model's Answer:** 2, corresponding to option D.
### Key Observations
* Each sub-problem presents a different configuration of cubes and viewpoints.
* The model attempts to reconstruct a specific view based on the given views.
* In sub-problem (c), the model acknowledges uncertainty due to occlusion.
* The "ground-truth" view serves as a reference for evaluating the model's generated view.
### Interpretation
The image demonstrates a model's ability to reason about 3D cube arrangements and project them into 2D views. The model uses the given views to infer the arrangement of cubes and generate the requested view. The inclusion of "ground-truth" views allows for a comparison of the model's performance against the correct solution. The uncertainty handling in sub-problem (c) highlights the model's awareness of potential ambiguities due to occlusion. The problems vary in difficulty, requiring the model to synthesize information from different viewpoints and handle uncertainty.
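The orthographic projection at the heart of this task can be sketched compactly: for each column along the viewing axis, the cube nearest the viewer determines the visible color, which is how occlusion (as in sub-problem (c)) arises. This is a hypothetical illustration under assumed conventions (a dense color grid with `''` for empty cells), not the paper's implementation.

```python
# Hypothetical sketch of the orthographic projection the cube task
# requires: scanning away from the viewer, the first occupied cell in
# each column determines the visible color; cubes behind it are occluded.

def front_view(colors):
    """colors[x][y][z] is a color string, or '' for an empty cell;
    y = 0 is the layer closest to the viewer. Returns view[x][z] with
    the first non-empty cell found scanning away from the viewer."""
    X, Y, Z = len(colors), len(colors[0]), len(colors[0][0])
    view = [["" for _ in range(Z)] for _ in range(X)]
    for x in range(X):
        for z in range(Z):
            for y in range(Y):
                if colors[x][y][z]:
                    view[x][z] = colors[x][y][z]
                    break
    return view

def count_color(view, color):
    """Number of cells of a given color visible in a 2D view."""
    return sum(cell == color for row in view for cell in row)

# 2 x 2 x 1 stack: a yellow cube directly behind a seashell cube is
# occluded, so only one yellow cube is visible from the front.
colors = [[["seashell"], ["yellow"]],   # x = 0: seashell in front, yellow behind
          [["yellow"], [""]]]           # x = 1: yellow in front, nothing behind
assert count_color(front_view(colors), "yellow") == 1
```

The "possibly be seen" phrasing in the questions then amounts to enumerating the color assignments of occluded cells that are consistent with the given views and taking the range of resulting counts, which is exactly the uncertainty the model surfaces in sub-problem (c).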
</details>
Figure 17: Showcases of reasoning generated by post-trained UMMs in the cube 3-view projection task. We mark correct and incorrect generated cube views with green and red borders, respectively. For incorrect generations, the corresponding ground-truth views are provided for reference (note that these are shown only for readers and are never provided to the models during reasoning).