# From Building Blocks to Planning: Multi-Step Spatial Reasoning in LLMs with Reinforcement Learning
## Abstract
Spatial reasoning in large language models (LLMs) has gained increasing attention due to applications in navigation and planning. Despite strong general language capabilities, LLMs still struggle with spatial transformations and multi-step planning in structured environments. We propose a two-stage approach that decomposes spatial reasoning into atomic building blocks and their composition. First, we apply supervised fine-tuning on elementary spatial transformations, such as rotation, translation, and scaling, to equip the model with basic spatial physics. We then freeze this physics-aware model and train lightweight LoRA adapters within the GRPO framework, in a closed-loop manner, to learn policies that compose these building blocks for multi-step planning in puzzle-based environments. To support this pipeline, we synthesize an ASCII-art dataset and construct a corresponding ASCII-based reinforcement learning environment. Our method consistently outperforms baselines, including the generic backbone, the physics-aware model, and end-to-end RL models, under both Dynamic environments with explicit state updates and Static environments where the model must rely on its internal state across steps. In addition, the proposed approach converges faster and exhibits more stable training than end-to-end reinforcement learning from scratch. Finally, we analyze attention patterns to assess whether fine-tuning induces meaningful improvements in spatial understanding.
## 1 Introduction
Spatial reasoning and understanding denote the capability to reason about the spatial relationships of objects and about transformations in an environment. This includes understanding relative positions, orientations, distances, and the effects of actions that modify spatial configurations. With the recent advances of LLMs (Minaee et al., 2025) and VLMs (Li et al., 2025) across a wide variety of tasks, such as math reasoning (Ahn et al., 2024) and vision–language understanding (Wu et al., 2024a), their spatial reasoning capabilities have gained more attention, with applications such as robotics (Kong et al., 2025) and language navigation tasks (Zhang et al., 2024). However, these abilities have not yet been widely explored (Wu et al., 2024b). Spatial reasoning tasks span several domains, ranging from linguistic and natural-language scenarios (Li et al., 2024) to puzzle-based settings (Noever & Burdick, 2021) such as mazes (Einarsson, 2025), Rubik’s Cube (Ding et al., 2023), and Sokoban (Todd et al., 2023), where the model must understand the spatial relationships among objects to solve the task.

In puzzle-based settings, language models are empowered by several techniques. One category uses an external module as a solver, with the LLM acting as an action recommender or reasoning-candidate generator. The external module can be heuristic-based, such as BFS or DFS in Tree of Thoughts (ToT) (Yao et al., 2023), or learning-based, such as XoT, which leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) (Ding et al., 2023), or Q-learning (Deng et al., 2025). Another category of approaches focuses on the model itself. One direction changes the model’s behavior through prompting, such as Visualization-of-Thought (VoT) (Wu et al., 2024b), which elicits spatial reasoning in LLMs by visualizing their reasoning traces and guiding subsequent reasoning steps.
Another approach, presented in Dao & Vu (2025), is inspired by DeepSeek-R1 (DeepSeek-AI et al., 2025) and is based on supervised fine-tuning the model on solution traces, and then applying GRPO to further refine reasoning steps on the same task and structure.
In this work, we focus on spatial reasoning in a puzzle-based setting, where an agent must transform an initial spatial configuration into a target configuration through a sequence of discrete actions. We propose a novel approach that decomposes spatial understanding into a set of building blocks, consisting of atomic transformations such as rotating a shape by $90^{\circ}$ or translating it one grid cell upward. We first apply supervised fine-tuning to enable the model to learn these basic physical transformations. After this stage, the physics-aware model is kept frozen, and reinforcement learning is applied on top of it by introducing lightweight adapter layers that learn a policy for composing these building blocks as primitives to reach a target spatial configuration from a given starting point. An overview of our approach is shown in Figure 1.
For the supervised fine-tuning stage, we synthesize a dataset of 12k tasks spanning three transformation categories (translation, rotation, and scaling), which is used to fine-tune the Qwen2.5-1.5B-Instruct model (Qwen Team, 2024). In the subsequent reinforcement learning stage, the physics-aware model is embedded directly in the reinforcement learning loop within a multi-step, compositional environment. Reinforcement learning is applied via GRPO to optimize lightweight LoRA adapter layers (Hu et al., 2021) on top of the frozen backbone, enabling the model to learn policies over sequences of atomic spatial operations through repeated interaction with the environment. We evaluate our approach against several baselines, including the generic Qwen2.5-1.5B-Instruct model, the physics-aware model trained only with supervised fine-tuning, and a Qwen model trained directly with GRPO reinforcement learning. All models are tested on unseen spatial reasoning tasks under two settings: one where the environment map is updated after each action, and one where the map remains fixed. Our results show that the proposed method achieves higher rewards than all baselines across both settings and converges faster during reinforcement learning. Although the physics-aware model trained only with supervised fine-tuning underperforms the generic backbone on the final task, it provides a stronger prior for reinforcement learning, leading to substantially better performance after GRPO optimization. Finally, we conduct ablation studies on attention layers to analyze whether the learned improvements reflect genuine spatial understanding.
Figure 1: Overview of the proposed training framework. The pipeline follows a two-stage learning approach. In the SFT Phase, the base model (Qwen-Instruct) is fine-tuned on the Building Block Dataset to acquire atomic spatial priors (translation, scaling, rotation), resulting in the intermediate Qwen-Physics model. In the RL Phase, we employ GRPO in a closed-loop setting with LoRA adapters. The model is trained to master multi-step spatial reasoning and planning, yielding the final Qwen-PhysRL model.
## 2 Related Work and Background
### 2.1 Spatial Reasoning in LLMs
Spatial reasoning in large language models has been studied across a range of domains. One line of research focuses on textual inputs, where spatial semantics are embedded in text and models are required to reason about relative spatial relations, such as left of or right of, expressed through language (Mirzaee & Kordjamshidi, 2022). A further step is explored in natural language navigation tasks, where, given a sequence of textual instructions, the model must maintain an implicit spatial state and track its position over multiple steps. In these settings, spatial understanding is inferred from sequential language instructions rather than from explicit spatial representations such as symbolic layouts (Wu et al., 2024b). Recent efforts explore symbolic environments, particularly ASCII-art representations, which bridge the gap between text and image modalities in applications such as level generation (Todd et al., 2023). Results show that large language models are capable of recognizing concepts depicted in ASCII art given textual inputs. However, they still exhibit notable limitations and remain far behind human performance in transformation tasks such as translation, rotation, and robustness to noise (Bayani, 2024), as well as in shape recognition (Wang et al., 2024; Jia et al., 2025). Prior work also indicates that supervised fine-tuning can improve model accuracy in these settings (Jia et al., 2025).
### 2.2 Attention Mechanisms in Transformer-Based LLMs
Since the introduction of the Transformer architecture (Vaswani et al., 2017), it has been widely adopted across a broad range of applications. Most notably, it serves as the dominant foundation for large language models (LLMs), enabling efficient natural language processing (NLP) tasks such as understanding and generating long sequences of text (Devlin et al., 2019; Brown et al., 2020). The key innovation underlying these advances is the self-attention mechanism (Akbari et al., 2021), which models long-range dependencies among elements in sequential data and enables effective extraction, processing, and generation of structured outputs. Most contemporary LLMs adopt a decoder-only Transformer architecture, which is more suitable for autoregressive text generation. These models process a sequence of tokens through multiple architecturally similar decoder layers, followed by a final linear projection that maps hidden states to the vocabulary space. Each decoder layer consists of several core components, including self-attention, a feed-forward multilayer perceptron (MLP), and residual connections with normalization. With the exception of a small number of element-wise operations (e.g., softmax and dropout), matrix multiplication dominates the computational workload of these components.
During inference, the token sequence is represented as a matrix of hidden state vectors with dimensionality $H$ , corresponding to the model’s hidden size. These hidden states are multiplied by a set of learned linear projections in the self-attention module, namely $W_{Q}$ (query), $W_{K}$ (key), $W_{V}$ (value), and $W_{O}$ (output), as well as by projection matrices in the MLP component, such as $W_{\text{up}}$ , $W_{\text{down}}$ , and $W_{\text{gate}}$ . Among these operations, the multiplication $QK^{\mathsf{T}}$ , which produces the attention score matrix, plays a key role in our analysis. Here, $Q$ and $K$ are obtained by projecting the input hidden states using $W_{Q}$ and $W_{K}$ , respectively. The resulting attention score matrix captures semantic and syntactic dependencies between tokens in the sequence by quantifying the relative influence of each token on every other token. In later sections, we examine how this attention distribution changes under our proposed method and analyze its effect on how tokens are weighted during spatial reasoning tasks.
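The score computation described above can be made concrete with a small sketch. The following NumPy code computes $\operatorname{softmax}(QK^{\mathsf{T}}/\sqrt{d})$ for a single causally masked head; the dimensions, random inputs, and single-head simplification are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def attention_scores(hidden, W_Q, W_K):
    """Attention score matrix softmax(Q K^T / sqrt(d)) for one causal
    head, given hidden states of shape (seq_len, H)."""
    Q = hidden @ W_Q                       # (seq_len, d)
    K = hidden @ W_K                       # (seq_len, d)
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    # causal mask: each token attends only to itself and earlier tokens
    mask = np.triu(np.ones_like(logits, dtype=bool), k=1)
    logits = np.where(mask, -np.inf, logits)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H, d, T = 8, 4, 5                          # toy hidden size, head dim, length
scores = attention_scores(rng.normal(size=(T, H)),
                          rng.normal(size=(H, d)),
                          rng.normal(size=(H, d)))
print(scores.shape)                        # (5, 5); each row sums to 1
```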
## 3 Methodology
### 3.1 Problem Formulation
We frame the spatial configuration space using three properties: the rotational state of the shape, its translational position on a discrete grid, and its scale. Based on this formulation, a shape state is represented as a tuple $S=(r,p,s)$ , where $r$ denotes the orientation, $p$ the spatial position, and $s$ the size of the shape. Correspondingly, the action space is defined as a finite symbolic set $\mathcal{A}=\mathcal{A}_{\text{rot}}\cup\mathcal{A}_{\text{trans}}\cup\mathcal{A}_{\text{scale}}$ , aligned with these three properties. The rotation action set is given by $\mathcal{A}_{\text{rot}}=\{90^{\circ}\ \text{CCW},45^{\circ}\ \text{CCW},180^{\circ}\ \text{CCW},0^{\circ}\}$ , the scaling set by $\mathcal{A}_{\text{scale}}=\{2\times,\tfrac{1}{2}\times,1\}$ , and the translation set by $\mathcal{A}_{\text{trans}}=\{\text{right},\text{left},\text{up},\text{down}\}$ . This discrete action formulation mitigates known limitations of language models when operating in continuous spaces (Szot et al., 2024), while still providing sufficient flexibility to modify each component of the state. Each task involves two shapes: a controlled shape B with state $S_{B}$ and a fixed target shape A with state $S_{A}=S_{\text{target}}$ . The task begins from an initial state $S_{B}^{(0)}=(r_{0},p_{0},s_{0})$ , and the agent applies a sequence of actions $a_{1:T}=\{a_{1},\ldots,a_{T}\}$ with the objective of reaching the target configuration $S_{\text{target}}=(r_{g},p_{g},s_{g})$ . The environment dynamics are deterministic, with state transitions defined as $S_{B}^{(t+1)}=\mathcal{T}(S_{B}^{(t)},a_{t})$ , resulting in a closed-loop rollout where the observation at step $t$ is given by $O_{t}=(S_{\text{target}},S_{B}^{(t)},H_{t})$ , with the action history $H_{t}=(a_{1},\ldots,a_{t-1})$ . The objective is to reach the target spatial configuration in the fewest possible steps. An episode is considered successful once the intersection-over-union (IoU) between the current shape and the target shape exceeds a predefined threshold $\tau$ .
Accordingly, the task objective can be expressed in terms of the minimal timestep $t^{\star}$ at which the following condition is satisfied:
$$
t^{\star}=\min\left\{\,t\geq 0\;\middle|\;\operatorname{IoU}\big(S_{B}^{(t)},S_{A}\big)\geq\tau\,\right\}. \tag{1}
$$
All actions are admissible at each timestep under a bounded map domain $\mathcal{P}\subset\mathbb{Z}^{2}$ . For any action $a\in\mathcal{A}$ with induced displacement $\Delta(a)$ , the next position is given by $\tilde{p}=p+\Delta(a)$ when $\tilde{p}\in\mathcal{P}$ , and equals $p$ otherwise.
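A minimal sketch of these deterministic dynamics, including the boundary rule $\tilde{p}=p+\Delta(a)$ when $\tilde{p}\in\mathcal{P}$, is given below. The grid size, the action-name strings, and the tuple encoding of $S=(r,p,s)$ are illustrative assumptions, not the paper's exact implementation.

```python
# Deterministic single-shape dynamics T(S, a); grid bounds and action
# names are illustrative assumptions.
GRID = range(0, 10)  # bounded map domain P, one coordinate axis

# action -> (rotation delta in degrees CCW, translation delta, scale factor)
ACTIONS = {
    "right": (0, (1, 0), 1.0),  "left":  (0, (-1, 0), 1.0),
    "up":    (0, (0, 1), 1.0),  "down":  (0, (0, -1), 1.0),
    "rot90": (90, (0, 0), 1.0), "rot45": (45, (0, 0), 1.0),
    "rot180": (180, (0, 0), 1.0),
    "double": (0, (0, 0), 2.0), "half": (0, (0, 0), 0.5),
}

def step(state, action):
    """One transition S^(t+1) = T(S^(t), a_t) for a state (r, (x, y), s)."""
    r, (x, y), s = state
    dr, (dx, dy), ds = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if nx not in GRID or ny not in GRID:
        nx, ny = x, y  # out-of-bounds move: position unchanged
    return ((r + dr) % 360, (nx, ny), s * ds)

blocked = step((0, (9, 0), 1.0), "right")   # at the right boundary
rotated = step((0, (2, 2), 1.0), "rot90")
```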
### 3.2 Proposed Method
In the first stage, we perform supervised fine-tuning on atomic building-block transformations. We define a building block as a single-step transformation for which the distance between the start configuration $S_{\text{start}}$ and the target configuration $S_{\text{target}}$ satisfies $\operatorname{dist}(S_{\text{start}},S_{\text{target}})=1$ , where the distance is defined as the sum of differences in position, scale, and orientation, i.e., $\Delta p+\Delta s+\Delta r$ . Under this formulation, each training example corresponds to exactly one atomic action from the predefined action set $\mathcal{A}$ . This process yields a physics-aware policy $\pi_{\text{phys}}$ , defined as
$$
\pi_{\text{phys}}:=\pi_{\theta^{*}}(a\mid S_{\text{start}},S_{\text{target}},k), \tag{2}
$$
where $k\in\{\text{rot},\text{trans},\text{scale}\}$ denotes the transformation type label. This policy captures the local physics of the environment by learning to correctly execute the corresponding atomic transformation. In the second stage, we learn a compositional policy on top of the frozen physics-aware model. The policy $\pi_{\phi}$ is parameterized by the frozen base weights $\theta^{*}$ obtained from SFT and a set of learnable Low-Rank Adaptation (LoRA) parameters $\phi$ . We apply LoRA to a predefined set of transformer modules $\mathcal{M}=\{W_{Q},W_{K},W_{V},W_{O},W_{\text{gate}},W_{\text{up}},W_{\text{down}}\}$ . For each layer $l\in\{1,\ldots,L\}$ and each module $W\in\mathcal{M}$ , the forward pass is modified as
$$
h^{(l)}=(W_{\text{phys}}+\Delta W_{l})\,h^{(l-1)}=(W_{\text{phys}}+B_{l}A_{l})\,h^{(l-1)}, \tag{3}
$$
where $W_{\text{phys}}$ denotes the frozen base weights and $A_{l},B_{l}$ are the learnable low-rank matrices comprising $\phi$ . The adapter parameters $\phi$ are optimized using GRPO in a closed-loop reinforcement learning setting, while the base parameters $\theta^{*}$ remain fixed.
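Equation (3) can be checked numerically with a small sketch. The sizes below are toy values, and the zero initialization of $B$ follows the standard LoRA recipe so that training starts exactly from the frozen model; the paper itself uses rank $r=64$ over the modules in $\mathcal{M}$.

```python
import numpy as np

# Numeric sketch of Eq. (3): the frozen weight W_phys is augmented by a
# trainable low-rank update B A with rank r << H. Toy sizes, not the
# paper's; B is zero-initialised per the standard LoRA recipe.
rng = np.random.default_rng(1)
H, r = 16, 4
W_phys = rng.normal(size=(H, H))   # frozen base weights (never updated)
A = rng.normal(size=(r, H))        # trainable low-rank factor
B = np.zeros((H, r))               # trainable; zero so B @ A = 0 at init

def lora_forward(h_prev):
    # forward pass of one adapted module: (W_phys + B A) h
    return (W_phys + B @ A) @ h_prev

h = rng.normal(size=(H,))
# at initialisation the adapted layer reproduces the frozen model exactly
print(np.allclose(lora_forward(h), W_phys @ h))  # True
```

In practice this corresponds to attaching adapters to the listed projection matrices ($W_{Q}$, $W_{K}$, $W_{V}$, $W_{O}$, $W_{\text{gate}}$, $W_{\text{up}}$, $W_{\text{down}}$) in every layer, e.g. via a parameter-efficient fine-tuning library, while gradients flow only through $A_{l}$ and $B_{l}$.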
As a result, the learned policy $\pi_{\phi}$ operates in a closed-loop manner, where at each timestep $t$ the model generates a textual output $y_{t}$ conditioned on the current observation $o_{t}$ .
$$
y_{t}\sim\pi_{\phi}(\cdot\mid o_{t}),\qquad a_{t}=g(y_{t})\in\mathcal{A}, \tag{4}
$$
where $g(\cdot)$ is a deterministic parser that maps the generated text to a discrete atomic action label.
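A sketch of such a parser is shown below; it follows the `<answer>`-tag convention described in Section 4.3, while the specific action labels are assumptions for illustration.

```python
import re

# Illustrative action vocabulary; the real label set matches A.
ACTIONS = {"right", "left", "up", "down",
           "rot90", "rot45", "rot180", "double", "half"}

def parse_action(text):
    """Deterministic parser g(.): extract the action label from the
    model's <answer>...</answer> span, or None if unparseable."""
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    if m is None:
        return None
    label = m.group(1).strip().lower()
    return label if label in ACTIONS else None

a = parse_action("I will move the shape.\n<answer> Up </answer>")
```

Returning `None` for unparseable generations lets the environment treat malformed outputs as invalid steps rather than crashing the rollout.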
## 4 Experiments
### 4.1 Experimental Setup
For the spatial reasoning task, all models are built on top of the Qwen2.5-1.5B-Instruct backbone. In our experiments, we evaluate the performance of several variants to isolate the contribution of each training stage. These include: (i) Qwen-Instruct, the generic pretrained model without task-specific training; (ii) Qwen-Physics, a supervised fine-tuned model trained only on atomic building-block transformations; (iii) Qwen-DirectRL, a model trained end-to-end using GRPO directly on the base model; (iv) Qwen-PhysRL, our proposed two-stage method combining frozen atomic execution with reinforcement learning–based composition; and (v) Random Policy ( $\pi_{rnd}$ ), a random action policy serving as an unbiased lower-bound baseline. All model variants share the same tokenizer, namely the standard tokenizer provided with the Qwen2.5-1.5B-Instruct model.
### 4.2 Dataset Construction
For supervised fine-tuning, we construct a synthetic dataset consisting of 12k unique samples, approximately uniformly distributed across the action set $\mathcal{A}$ . Each sample is generated by randomly initializing two of the three shape properties and modifying the remaining property according to a single atomic action. All samples are generated programmatically by applying deterministic atomic transformations to randomly sampled initial configurations, ensuring full reproducibility and eliminating the need for manual annotation. All spatial configurations are represented using an ASCII-art domain, where both the current shape and the target shape are encoded as text-based grids. In this representation, the relative spatial relationship between the current and target shapes is conveyed implicitly through their layouts (Appendix § A). Shape boundaries are denoted using the character #, while rows are separated using * and newline (\n) characters to preserve the two-dimensional structure. This textual spatial encoding follows conventions inspired by prior work on symbolic spatial reasoning with language models (Todd et al., 2023).
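A minimal generator for this encoding might look as follows; the grid dimensions and the background fill character are assumptions, since the text only fixes `#` for shape cells and `*` plus newline as row separators.

```python
# Hedged sketch of the ASCII-art encoding: '#' marks shape cells, rows
# are joined with '*' + newline. The '.' background fill is an assumption.
def render(grid_w, grid_h, cells, fill="."):
    rows = []
    for y in range(grid_h):
        row = "".join("#" if (x, y) in cells else fill
                      for x in range(grid_w))
        rows.append(row)
    return "*\n".join(rows)

# a 2x2 square shape placed at position (1, 1) on a 5x4 grid
square = {(1, 1), (2, 1), (1, 2), (2, 2)}
art = render(5, 4, square)
print(art)
```

Rendering both the current and target shapes this way lets their relative position be read off implicitly from the two grids, as described above.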
### 4.3 Evaluation Metrics
A deterministic parser $g(\cdot)$ is used to identify the <answer> tags in the model output and extract the corresponding discrete action. Our primary evaluation metric is the cumulative episode reward. For an episode of length $T$ , we define
$$
R_{\text{total}}=R_{\text{correctness}}-0.1\,T-\sum_{t=1}^{T}R_{\text{rep}}^{(t)}+R_{\text{success}}, \tag{5}
$$
where $R_{\text{correctness}}$ reflects agreement between the agent’s predicted action and a predefined ground-truth sequence $\mathrm{GT}=\{a_{1},\ldots,a_{T}\}$ of heuristic greedy atomic actions that locally reduce the distance to the target configuration. A per-step penalty of $0.1$ encourages shorter solutions, $R_{\text{rep}}^{(t)}$ penalizes repetitive behaviors, and $R_{\text{success}}$ is granted upon task completion. We additionally log step-wise rewards $\{r_{t}\}_{t=1}^{T}$ to support fine-grained analysis of policy failures and divergence points.
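Combining Eq. (5) with the constants given in Section 5 (success bonus $+2$, step penalty $0.1$, repetition penalty $0.2$, per-category normalization of correctness), a hedged sketch of the episode reward follows. The per-step matching rule against GT and the category map are assumptions, not the paper's exact implementation.

```python
from collections import Counter

STEP_PENALTY, REP_PENALTY, SUCCESS_BONUS = 0.1, 0.2, 2.0

def episode_reward(actions, gt, op_type, success):
    """Sketch of Eq. (5). actions: agent's actions; gt: greedy
    ground-truth actions; op_type: action -> category (rot/trans/scale);
    success: whether the IoU threshold was reached."""
    # correctness: each category's reward mass is normalised to 1 and
    # split evenly over that category's ground-truth actions
    n_per_cat = Counter(op_type[a] for a in gt)
    correctness = sum(1.0 / n_per_cat[op_type[a]]
                      for a, g in zip(actions, gt) if a == g)
    # repetition: penalise each use of an action beyond its second
    counts = Counter(actions)
    rep = sum(REP_PENALTY * (c - 2) for c in counts.values() if c > 2)
    total = correctness - STEP_PENALTY * len(actions) - rep
    return total + (SUCCESS_BONUS if success else 0.0)

OP = {"up": "trans", "left": "trans", "right": "trans",
      "rot90": "rot", "double": "scale"}
# a perfect 3-step episode: correctness 3, step penalty 0.3, bonus +2
r = episode_reward(["double", "up", "rot90"],
                   ["double", "up", "rot90"], OP, True)
```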
## 5 Results and Analysis
For GRPO training of the physics-aware fine-tuned model, all experiments were conducted on a single NVIDIA A100 GPU with 80 GB of memory. We use the Adam optimizer with a learning rate of $1\times 10^{-5}$ and apply LoRA adaptation with rank $r=64$ to the specified transformer modules, while keeping the base model frozen. Training is performed with a batch size of 64 trajectories, with puzzles randomly generated at each environment reset. Action sampling during training follows a temperature-scheduling strategy to stabilize learning: the temperature is linearly annealed from an initial value of 1.4 to a lower bound of 0.7 over training iterations. During evaluation, sampling is disabled and actions are selected greedily to obtain the best deterministic solution from the model.

In both training and evaluation, the maximum distance between the start and target configurations is limited to 5, and the episode horizon is set to $T_{\max}=5$ . A success bonus of $+2$ is assigned upon task completion, while a per-step penalty of $-0.1$ encourages shorter solutions. To discourage repetitive behavior, an additional repetition penalty of $-0.2$ is applied when the same atomic action is selected more than twice within an episode. Finally, the correctness reward is normalized to 1 across all operation types, regardless of the number of ground-truth atomic actions available for a given operation, in order to prevent bias toward specific transformation categories. For instance, in a scenario involving three translations, one rotation, and one scaling, the maximum episode reward is computed as $3\times\tfrac{1}{3}+1+1-3\times 0.1=2.7$ . Adding the success bonus $R_{\text{success}}=2$ yields a total reward of 4.7.
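The temperature schedule described above can be sketched as a simple linear interpolation; the clamping behavior past the final iteration is an implementation assumption.

```python
def temperature(step, total_steps, t_start=1.4, t_end=0.7):
    """Linear temperature annealing for GRPO sampling: start hot (1.4)
    for exploration, cool to the 0.7 lower bound; clamped thereafter."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return t_start + frac * (t_end - t_start)

t0 = temperature(0, 180)      # 1.4 at the first iteration
tm = temperature(90, 180)     # roughly midway between the bounds
tT = temperature(179, 180)    # annealed down to 0.7
```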
Figure 2: GRPO training reward trajectories for Qwen-PhysRL and Qwen-DirectRL, illustrating improved stability and faster convergence when using a frozen physics-aware prior.
| Model | Dynamic | Static |
| --- | --- | --- |
| Qwen-Instruct | $0.070$ | $0.004$ |
| Qwen-Physics | $-0.068$ | $-0.120$ |
| Qwen-DirectRL ( $r=64$ ) | $1.626$ | $-0.216$ |
| Qwen-PhysRL ( $r=64$ , Ours) | $\mathbf{2.457}$ | $\mathbf{1.717}$ |
Table 1: Performance comparison across models on 100 unseen random scenarios under Dynamic state updates and Static settings.
#### Performance Analysis.
The average total reward achieved by different model variants is reported in Table 1. We evaluate all models under two settings. In the Dynamic setting, after each action, the environment updates the shape’s spatial configuration map $S_{B}$ , mirroring the training procedure and generating the next prompt accordingly. This setup allows the model to focus on predicting the correct next action without requiring explicit memorization of prior state changes. In contrast, in the Static setting, the initial maps remain fixed, and the prompt includes only the sequence of previously selected actions. Successful performance in this scenario, therefore, requires the model to internally track and reason about the cumulative effects of past actions. The results demonstrate that our proposed Qwen-PhysRL achieves near-maximum performance in the Dynamic setting, with an average reward of $2.457$ , indicating that the learned RL policy reliably identifies the correct sequence of transformations when external state updates are provided. Importantly, Qwen-PhysRL also achieves an average reward of $1.717$ in the Static setting, substantially outperforming all baseline models. In contrast, Qwen-DirectRL exhibits moderate performance in the Dynamic setting but fails in the Static setting, suggesting that reinforcement learning alone is insufficient to equip the model with robust internal spatial understanding and reasoning. Finally, non-RL models perform poorly in both settings, reflecting limited planning capability. Reinforcement learning bridges this gap by teaching the model how to utilize its building-block knowledge, enabling the composition of atomic operations into coherent multi-step policies.
| Step | Qwen-PhysRL (Dynamic) | Qwen-PhysRL (Static) | $\pi_{rnd}$ |
| --- | --- | --- | --- |
| 1 | 0.900 | 0.759 | 0.250 |
| 2 | 1.374 | 1.143 | 0.463 |
| 3 | 1.914 | 1.469 | 0.641 |
| 4 | 2.340 | 1.689 | 0.790 |
| 5 | 2.457 | 1.717 | 0.912 |
Figure 3: Left: Step-by-step cumulative reward comparison between Qwen-PhysRL (Dynamic and Static settings) and random policy ( $\pi_{rnd}$ ) on tasks requiring 3 translations, 1 rotation, and 1 scaling. Right: Action-reward trajectory illustrating the optimal path (blue: scale $\rightarrow$ tr $\rightarrow$ rot $\rightarrow$ tr $\rightarrow$ tr), which achieves the maximum reward of 2.7, contrasted with alternative action sequences (gray). The trajectory obtained under the Static setting is shown in red, where the model initially follows the optimal prefix (scale $\rightarrow$ tr $\rightarrow$ tr $\rightarrow$ tr) but then diverges and loses track of the remaining steps, failing to complete the full plan.
#### Per-Step Reward Analysis.
To analyze step-by-step reward accumulation during evaluation, we focus on a smaller subset of test samples in which the distance between the start and target configurations is exactly five. Specifically, we select instances that require three translations, one rotation, and one scaling operation to reach the goal. This restriction allows us to study reward trajectories under a fixed action budget and comparable difficulty. Figure 3 reports the average cumulative reward per step for three cases: Qwen-PhysRL evaluated in the Dynamic environment, Qwen-PhysRL evaluated in the Static setting, and a random policy $\pi_{\text{rnd}}$ . Figure 3 also includes a tree representation of the maximum achievable cumulative reward at each step for all valid sequences of ground-truth actions ( $GT$ ). To keep the tree visualization compact and interpretable, we fix the first action in the tree to scale, which is the most frequently selected initial operation by the learned policy. The optimal action sequence identified by the Qwen-PhysRL policy in the Dynamic setting (scale $\rightarrow$ translation $\rightarrow$ rotation $\rightarrow$ translation $\rightarrow$ translation) is highlighted as the blue path in the tree and is taken in approximately 33 % of the evaluated trajectories. The corresponding average cumulative reward closely tracks the maximum achievable reward along this path, indicating that the model consistently follows near-optimal action sequences when the environment explicitly updates the state after each step. Two additional high-probability paths, (scale $\rightarrow$ rotation $\rightarrow$ translation $\rightarrow$ translation $\rightarrow$ translation) and (scale $\rightarrow$ translation $\rightarrow$ translation $\rightarrow$ rotation $\rightarrow$ translation), are taken in 27 % and 19 % of cases, respectively. 
In contrast, under the Static setting, the average cumulative reward initially follows the red path in the tree, corresponding to the sequence (scale $\rightarrow$ translation $\rightarrow$ translation $\rightarrow$ translation). This behavior aligns with our empirical observation that, without explicit state updates, the model gradually loses track of earlier actions, diverges from the optimal plan, and ultimately fails to produce a decisive fifth action. As a result, reward accumulation stagnates at later steps. Compared to the random policy $\pi_{\text{rnd}}$ (see Appendix § B for its expected reward analysis), Qwen-PhysRL selects substantially more effective actions at each step in both the Dynamic and Static settings. The only exception occurs at the final step in the Static case, where the learned policy yields a smaller average reward increase than the random policy: having already diverged from the optimal trajectory, it gains little at this step, whereas the random policy occasionally lands on a higher-reward branch by chance. We note that, because the action space is relatively small and the reward distribution is partly biased toward positive values, the random policy attains relatively high rewards in this setting. Even under these favorable conditions, Qwen-PhysRL consistently outperforms the random baseline by a wide margin.
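Appendix B derives the random policy's expected reward analytically; the idea can also be illustrated with a toy Monte Carlo estimate. The action set, step reward, and matching rule below are simplifying assumptions for illustration only, not the paper's environment or reward function.

```python
import random

# Toy Monte Carlo estimate of a uniform random policy's cumulative reward.
# ACTIONS, step_reward, and the "reward only if the action is still needed"
# rule are illustrative assumptions, not the paper's exact environment.

ACTIONS = ["translate", "rotate", "scale", "noop"]  # hypothetical action set

def rollout(needed, rng, step_reward=0.6, steps=5):
    """One episode: reward accrues only for a correct, still-pending action."""
    remaining = list(needed)
    total = 0.0
    for _ in range(steps):
        action = rng.choice(ACTIONS)
        if action in remaining:
            remaining.remove(action)
            total += step_reward
    return total

def estimate(n=10_000, seed=0):
    rng = random.Random(seed)
    # Task from the per-step analysis: 3 translations, 1 rotation, 1 scaling.
    needed = ["translate"] * 3 + ["rotate", "scale"]
    return sum(rollout(needed, rng) for _ in range(n)) / n
```

Averaging many such rollouts gives a baseline curve analogous to the $\pi_{\text{rnd}}$ column, against which a learned policy's per-step gains can be compared.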
Figure 4: Token-level attention distribution across layers for Qwen-Physics and Qwen-Instruct.
#### GRPO Convergence and Stability.
As shown in Figure 2, Qwen-PhysRL exhibits substantially faster convergence and higher final rewards than Qwen-DirectRL under identical GRPO training configurations, highlighting the importance of a physics-aware initialization for RL. Table 1 provides additional insight: the generic Qwen-Instruct backbone used in Qwen-DirectRL achieves higher rewards than the Qwen-Physics backbone used in Qwen-PhysRL during evaluation, indicating that general pretrained knowledge can partially compensate for the absence of physics-specific fine-tuning. However, despite this head start, Qwen-Instruct fails to improve effectively under GRPO, exhibiting slower adaptation and early convergence to suboptimal policies. In contrast, the proposed two-stage approach consistently achieves superior performance across both Dynamic and Static settings.
### 5.1 Ablation Study
(a) Attention divergence heatmap
(b) Relative attention share across layers
Figure 5: Layer-wise attention score analysis showing (a) attention divergence of Qwen-Physics from the backbone Qwen-Instruct model across layers, and (b) relative attention allocation between system prompt and map tokens.
To examine the effect of our initial building-block fine-tuning on the spatial understanding of the target LLM $\mathcal{F}$ , we analyze how attention is distributed across prompt tokens at different layers of the model by studying the attention score matrices produced during evaluation. We prompt the model with a fixed system prompt $P$ , concatenated with a start–target map sequence $M$ , to generate an answer sequence $A$ , i.e., $\mathcal{F}(P\!\cdot\!M)=A$ , where $\cdot$ denotes string concatenation. To obtain attention scores between prompt tokens and the newly generated output tokens, we concatenate the generated answer to the original prompt and perform a single forward pass over the full sequence, $\mathcal{F}(P\!\cdot\!M\!\cdot\!A)$ . This pass yields the complete set of attention score matrices, including those corresponding to output tokens. For each decoder layer $l\in\{1,\dots,L\}$ and attention head $h\in\{1,\dots,H\}$ , the model computes a raw attention score matrix as $Q_{h,l}K_{h,l}^{\mathsf{T}}$ . We aggregate these matrices across heads by averaging to obtain a per-layer attention score matrix,
$$
\text{AttScore}_{l}=\frac{1}{H}\sum_{h=1}^{H}Q_{h,l}K_{h,l}^{\mathsf{T}}.
$$
We then focus on the rows corresponding to the answer tokens, $\text{AttScore}_{l}[|P\!\cdot\!M|:,:]$ , where each row represents the attention assigned to all prompt tokens during the generation of a specific output token. In the following, we summarize key observations derived from analyzing these attention patterns across layers.
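The aggregation above can be sketched in a few lines. This is a minimal reconstruction of the stated computation, with illustrative tensor shapes ($H$ heads, $T$ tokens, $d$ head dimension); the actual extraction from the model's forward pass is not shown.

```python
import numpy as np

# Minimal sketch of the per-layer attention aggregation described above:
# raw scores Q_h K_h^T per head, averaged over heads, then the rows for the
# generated answer tokens. Shapes are illustrative.

def layer_att_score(Q, K):
    """Q, K: (H, T, d) per-head matrices -> AttScore_l of shape (T, T)."""
    scores = Q @ K.transpose(0, 2, 1)   # Q_h K_h^T for each head: (H, T, T)
    return scores.mean(axis=0)          # mean over heads: (T, T)

def answer_rows(att_score, prompt_len):
    """AttScore_l[|P.M|:, :] -- attention each answer token pays to all
    earlier tokens in the concatenated sequence P.M.A."""
    return att_score[prompt_len:, :]
```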
#### Observation 1.
Fine-tuning induces the most pronounced changes in attention distributions within the middle layers of the model, as confirmed by Figure 5(a). Among the 28 layers, layers 16 and 20 exhibit the highest KL divergence between Qwen-Physics and the generic Qwen-Instruct backbone, while the early (layer 0) and final (layer 27) layers show substantially smaller deviations. This indicates that the adaptation introduced by fine-tuning is concentrated in intermediate layers rather than being uniformly distributed across the network.
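One way to quantify this layer-wise deviation is to normalize each model's answer-token attention rows into distributions over prompt tokens and average the per-row KL divergence. The sketch below is a plausible instantiation under that assumption; the paper's exact normalization and aggregation choices may differ.

```python
import numpy as np

# Hedged sketch: per-layer KL divergence between two models' attention
# patterns. Row-wise softmax normalization is an assumption for illustration.

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors of equal length."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def layer_divergence(att_phys, att_instruct):
    """Mean per-row KL between two (T_out, T_in) answer-token score blocks."""
    def rows_to_probs(a):
        e = np.exp(a - a.max(axis=1, keepdims=True))  # row-wise softmax
        return e / e.sum(axis=1, keepdims=True)
    P, Q = rows_to_probs(att_phys), rows_to_probs(att_instruct)
    return float(np.mean([kl_divergence(p, q) for p, q in zip(P, Q)]))
```

Computing this quantity for every layer yields a profile in which, per Observation 1, middle layers (e.g., 16 and 20) stand out while the first and last layers remain close to the backbone.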
#### Observation 2.
Qwen-Physics assigns greater attention to spatially informative ASCII tokens, such as # and *, than Qwen-Instruct does. As shown in Figure 4, the increased attention to these tokens is most prominent in the same middle layers identified in Observation 1. This suggests that the layers undergoing the largest adaptation are also those most responsible for encoding spatial structure, enabling the model to better focus on salient regions of the map $M$ during spatial reasoning.
#### Observation 3.
Across most layers, Qwen-Physics consistently allocates more attention to the system prompt $P$ than to the map region, except in the earliest layers. Notably, the system prompt contains approximately $1.8\times$ more tokens than the map region $M$ , yet Figure 5(b) shows an attention gap that is substantially larger than what token count alone would suggest. This gap narrows in the middle layers, which, as shown in Observations 1 and 2, are more actively involved in spatial reasoning. This behavior is consistent with prior findings that textual context can dominate attention allocation in spatial reasoning and vision–language models (Chen et al., 2025; Wang et al., 2025). The earliest layers, such as layer 0, allocate attention more evenly across different parts of the input, reflecting a coarse, global skimming behavior, as illustrated in both Figure 5(b) and Figure 4.
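The relative allocation underlying Figure 5(b) amounts to splitting each layer's answer-token attention mass between the system-prompt and map token spans and normalizing. The bookkeeping below is an illustrative assumption about how token boundaries are handled.

```python
import numpy as np

# Sketch of the relative attention-share computation: sum the attention mass
# that answer tokens place on system-prompt tokens vs. map tokens, then
# normalize. Column-span bookkeeping is an illustrative assumption.

def attention_share(att_rows, prompt_len, map_len):
    """att_rows: (T_out, T_in) answer-token attention. Columns
    [0, prompt_len) hold system-prompt tokens P; the next map_len
    columns hold map tokens M. Returns (share_P, share_M)."""
    sys_mass = att_rows[:, :prompt_len].sum()
    map_mass = att_rows[:, prompt_len:prompt_len + map_len].sum()
    total = sys_mass + map_mass
    return sys_mass / total, map_mass / total
```

Comparing `share_P` against the token-count baseline `prompt_len / (prompt_len + map_len)` makes explicit whether the system prompt receives more attention than its length alone would predict.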
## 6 Conclusion
This paper presents a two-stage training pipeline that enables spatial understanding, reasoning, and planning capabilities in large language models. Our approach first builds a foundation of atomic spatial relations through supervised fine-tuning and then augments it with multi-step planning ability using a closed-loop GRPO-based reinforcement learning stage. This combination equips the model with both task-relevant physical knowledge and a learned policy that effectively composes these primitives to solve more complex spatial planning problems, without relying solely on reinforcement learning to discover representations and planning strategies simultaneously. Experimental results demonstrate that models trained under this framework exhibit a newfound understanding of spatial properties and planning, substantially outperforming generic LLM baselines on challenging spatial planning tasks. Moreover, compared to end-to-end reinforcement learning applied to an unadapted backbone model, our pipeline converges faster and exhibits more stable training under identical settings. Through ablation studies and attention analysis, we further identify the decoder layers most affected by training and show that our approach systematically shifts attention toward task-critical tokens, providing evidence of more structured internal reasoning. While our evaluation focuses on a specific class of spatial planning problems, we believe that the underlying principle of learning reusable building blocks and composing them through reinforcement learning is broadly applicable. We view this work as a step toward more modular and interpretable approaches for teaching complex reasoning skills to language models, and we leave the exploration of additional tasks and modalities to future work.
## References
- Ahn et al. (2024) Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges, 2024. URL https://arxiv.org/abs/2402.00157.
- Akbari et al. (2021) Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA, 2021. Curran Associates Inc. ISBN 9781713845393.
- Bayani (2024) David Bayani. Testing the depth of chatgpt’s comprehension via cross-modal tasks based on ascii-art: Gpt3.5’s abilities in regard to recognizing and generating ascii-art are not totally lacking, 2024. URL https://arxiv.org/abs/2307.16806.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Chen et al. (2025) Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas, 2025. URL https://arxiv.org/abs/2503.01773.
- Dao & Vu (2025) Alan Dao and Dinh Bach Vu. Alphamaze: Enhancing large language models’ spatial intelligence via grpo, 2025. URL https://arxiv.org/abs/2502.14669.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- Deng et al. (2025) Hourui Deng, Hongjie Zhang, Jie Ou, and Chaosheng Feng. Can llm be a good path planner based on prompt engineering? mitigating the hallucination for path planning. In Proceedings of the International Conference on Intelligent Computing (ICIC), 2025.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.
- Ding et al. (2023) Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Ming-Jie Ma, Wei Zhang, Si Qin, S. Rajmohan, Qingwei Lin, and Dongmei Zhang. Everything of thoughts: Defying the law of penrose triangle for thought generation. ArXiv, abs/2311.04254, 2023.
- Einarsson (2025) Hafsteinn Einarsson. Mazeeval: A benchmark for testing sequential decision-making in language models, 2025. URL https://arxiv.org/abs/2507.20395.
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685.
- Jia et al. (2025) Qi Jia, Xiang Yue, Shanshan Huang, Ziheng Qin, Yizhu Liu, Bill Yuchen Lin, and Yang You. Visual perception in text strings, 2025. URL https://openreview.net/forum?id=etToTig9Fp.
- Kong et al. (2025) Yangzhe Kong, Daeun Song, Jing Liang, Dinesh Manocha, Ziyu Yao, and Xuesu Xiao. Autospatial: Visual-language reasoning for social robot navigation through efficient spatial reasoning learning, 2025. URL https://arxiv.org/abs/2503.07557.
- Li et al. (2024) Fangjun Li, David C. Hogg, and Anthony G. Cohn. Advancing spatial reasoning in large language models: an in-depth evaluation and enhancement using the stepgame benchmark. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’24/IAAI’24/EAAI’24. AAAI Press, 2024. ISBN 978-1-57735-887-9. doi: 10.1609/aaai.v38i17.29811. URL https://doi.org/10.1609/aaai.v38i17.29811.
- Li et al. (2025) Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges, 2025. URL https://arxiv.org/abs/2501.02189.
- Minaee et al. (2025) Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2025. URL https://arxiv.org/abs/2402.06196.
- Mirzaee & Kordjamshidi (2022) Roshanak Mirzaee and Parisa Kordjamshidi. Transfer learning with synthetic corpora for spatial role labeling and reasoning, 2022. URL https://arxiv.org/abs/2210.16952.
- Noever & Burdick (2021) David Noever and Ryerson Burdick. Puzzle solving without search or human knowledge: An unnatural language approach. CoRR, abs/2109.02797, 2021. URL https://arxiv.org/abs/2109.02797.
- Qwen Team (2024) Qwen Team. Qwen2.5-1.5b-instruct. https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct, 2024. Hugging Face model card, accessed 2025-08-10.
- Szot et al. (2024) Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, and Alexander Toshev. Grounding multimodal large language models in actions, 2024. URL https://arxiv.org/abs/2406.07904.
- Todd et al. (2023) Graham Todd, Sam Earle, Muhammad Umair Nasir, Michael Cerny Green, and Julian Togelius. Level generation through large language models. In Proceedings of the 18th International Conference on the Foundations of Digital Games, FDG 2023, pp. 1–8. ACM, April 2023. doi: 10.1145/3582437.3587211. URL http://dx.doi.org/10.1145/3582437.3587211.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
- Wang et al. (2024) Hong Wang, Xuan Luo, Weizhi Wang, and Xifeng Yan. Bot or human? detecting chatgpt imposters with a single question, 2024. URL https://arxiv.org/abs/2305.06424.
- Wang et al. (2025) Zhaochen Wang, Bryan Hooi, Yiwei Wang, Ming-Hsuan Yang, Zi Huang, and Yujun Cai. Text speaks louder than vision: Ascii art reveals textual biases in vision-language models, 2025. URL https://arxiv.org/abs/2504.01589.
- Wu et al. (2024a) Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, and Jifeng Dai. Visionllm v2: an end-to-end generalist multimodal large language model for hundreds of vision-language tasks. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024a. Curran Associates Inc. ISBN 9798331314385.
- Wu et al. (2024b) Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models, 2024b. URL https://arxiv.org/abs/2404.03622.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601.
- Zhang et al. (2024) Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, and Parisa Kordjamshidi. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models, 2024. URL https://arxiv.org/abs/2407.07035.
## Appendix A Environment Setup Example
System prompt for Dynamic Setting
```
You are an expert at analyzing ASCII art shapes. Two shapes are provided: Shape A (the target) and Shape B (the current state), separated by '% $ % $ % $ %'. Each line of the shapes starts and ends with a star (*) character. Your goal is to transform Shape B into Shape A by analyzing one type of transformation at a time. You may analyze rotation, translation, or scaling --- but try to analyze DIFFERENT types rather than repeating. You have already analyzed: {analyzed}
IMPORTANT: Choose a transformation type you haven't analyzed yet if possible.

TASK 1 - ROTATION: If analyzing rotation: Determine what rotation is needed to transform Shape B into Shape A. Classify using exactly one of these labels:
- 'no_rotation': The shapes have the same orientation (0 rotation needed)
- 'quarter_rotation': Shape B needs a 90 degrees rotation to match Shape A
- 'slight_rotation': Shape B needs a small rotation (<90 degrees) to match Shape A

TASK 2 - TRANSLATION: If analyzing translation: Determine how Shape B must be moved to match Shape A's position. Classify using exactly one of these labels:
- 'no_translation': Shapes are already in the same position
- 'up': Shape B must be moved up to match Shape A
- 'down': Shape B must be moved down to match Shape A
- 'left': Shape B must be moved left to match Shape A
- 'right': Shape B must be moved right to match Shape A

TASK 3 - SCALING: If analyzing scaling: Determine the size adjustment needed to transform Shape B to match Shape A. Classify using exactly one of these labels:
- 'no_scaling': Both shapes have the same size
- 'double_size': Shape B is half size and must be enlarged to match Shape A
- 'half_size': Shape B is twice as large and must be shrunk to match Shape A

INSTRUCTIONS:
1. Identify which transformation type would be most useful to analyze now
2. Carefully compare both ASCII art shapes
3. Determine the transformation needed to convert Shape B into Shape A
4. Respond with the appropriate label inside <answer></answer> tags
```
System prompt for Static Setting
```
You are an expert at analyzing ASCII art shapes. Two shapes are provided:
Shape A (the target) and Shape B (the current state), separated by '% $ % $ % $ %'.
Each line of the shapes starts and ends with a star (*) character.
Your goal is to transform Shape B into Shape A by analyzing one type of
transformation at a time. You may analyze rotation, translation, or scaling
--- but try to analyze DIFFERENT types rather than repeating.

CRITICAL INSTRUCTION: THE MAP IS NOT UPDATING
The Shape B shown below is the **STATIC INITIAL STATE**. It does **NOT**
reflect the moves you have already made. You must RELY ON YOUR MEMORY of the
following actions you have already performed:
HISTORY OF ACTIONS: [{analyzed}]

To solve this:
1. Look at the Initial Shape B.
2. Mentally apply the 'HISTORY OF ACTIONS' to it to imagine the *current* state.
3. Determine the NEXT step needed from that imagined state.
4. Choose a transformation type you haven't analyzed yet if possible.

TASK 1 - ROTATION: If analyzing rotation: Determine what rotation is needed
to transform Shape B into Shape A. Classify using exactly one of these labels:
- 'no_rotation': The shapes have the same orientation (0 rotation needed)
- 'quarter_rotation': Shape B needs a 90 degrees rotation to match Shape A
- 'slight_rotation': Shape B needs a small rotation (<90 degrees) to match Shape A

TASK 2 - TRANSLATION: If analyzing translation: Determine how Shape B must be
moved to match Shape A's position. Classify using exactly one of these labels:
- 'no_translation': Shapes are already in the same position
- 'up': Shape B must be moved up to match Shape A
- 'down': Shape B must be moved down to match Shape A
- 'left': Shape B must be moved left to match Shape A
- 'right': Shape B must be moved right to match Shape A

TASK 3 - SCALING: If analyzing scaling: Determine the size adjustment needed
to transform Shape B to match Shape A. Classify using exactly one of these labels:
- 'no_scaling': Both shapes have the same size
- 'double_size': Shape B is half size and must be enlarged to match Shape A
- 'half_size': Shape B is twice as large and must be shrunk to match Shape A

INSTRUCTIONS:
1. Identify which transformation type would be most useful to analyze now
2. Carefully compare both ASCII art shapes
3. Determine the transformation needed to convert Shape B into Shape A
4. Respond with the appropriate label inside <answer></answer> tags
```
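The only field that varies across steps in both prompts is the `{analyzed}` action history. A minimal sketch of how this per-step templating might be done (the template strings here are abbreviated stand-ins for the full prompts above, and `build_prompt` is a hypothetical helper, not the paper's code):

```python
# Abbreviated stand-ins for the full Dynamic/Static system prompts;
# only the {analyzed} slot changes between environment steps.
DYNAMIC_TEMPLATE = (
    "You have already analyzed: {analyzed}\n"
    "IMPORTANT: Choose a transformation type you haven't analyzed yet if possible."
)
STATIC_TEMPLATE = (
    "HISTORY OF ACTIONS: [{analyzed}]\n"
    "Mentally apply the history to the initial Shape B to imagine the current state."
)

def build_prompt(setting: str, history: list[str]) -> str:
    """Fill the {analyzed} slot with the comma-separated action history."""
    template = DYNAMIC_TEMPLATE if setting == "dynamic" else STATIC_TEMPLATE
    analyzed = ", ".join(history) if history else "nothing yet"
    return template.format(analyzed=analyzed)

print(build_prompt("static", ["quarter_rotation", "right"]))
```

In the Dynamic setting the observation (Shape B) is re-rendered after each action, whereas in the Static setting only this history string carries information about past moves.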
Sample Environment
```
TARGET (Shape A):
[star-bordered ASCII grid drawing the target shape with '#' characters;
the original whitespace alignment was lost in extraction]

% $ % $ % $ %

CURRENT (Shape B):
[star-bordered ASCII grid drawing the initial shape, at half the target's
size and rotated 90 degrees relative to it]
```
Model Responses (Dynamic setting with State Updates)
$\star$ Ground-Truth: double size, quarter rotation, right, down, down.
$\times$ Qwen-Instruct: slight rotation, slight rotation, slight rotation, slight rotation, slight rotation.
$\times$ Qwen-Physics: slight rotation, quarter rotation, slight rotation, slight rotation, slight rotation.
$\times$ Qwen-DirectRL: double size, quarter rotation, down, down, up.
$\checkmark$ Qwen-PhysRL (ours): double size, quarter rotation, right, down, down.
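The atomic building blocks behind these action sequences can be sketched on a grid represented as a list of equal-length strings. The following is an illustrative implementation (not the paper's code), assuming shifts move the drawing by one cell and scaling duplicates cells:

```python
def quarter_rotation(grid):
    """Rotate the grid 90 degrees clockwise."""
    return ["".join(row[c] for row in reversed(grid)) for c in range(len(grid[0]))]

def translate(grid, direction):
    """Shift the drawing one cell, padding the vacated side with spaces."""
    h, w = len(grid), len(grid[0])
    blank = " " * w
    if direction == "down":
        return [blank] + grid[:-1]
    if direction == "up":
        return grid[1:] + [blank]
    if direction == "right":
        return [" " + row[:-1] for row in grid]
    if direction == "left":
        return [row[1:] + " " for row in grid]
    raise ValueError(direction)

def double_size(grid):
    """Scale up 2x by duplicating every cell horizontally and every row vertically."""
    out = []
    for row in grid:
        wide = "".join(ch * 2 for ch in row)
        out += [wide, wide]
    return out
```

Under these conventions, applying the ground-truth sequence (double size, then a quarter rotation, then the three unit translations) step by step is what maps Shape B onto Shape A in the Dynamic setting.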
## Appendix B Random Policy Reward Calculation
To establish a baseline for the random policy $\pi_{\text{rnd}}$ , we analytically derive the expected reward $E[R_{k}]$ at step $k$ . An action $a$ yields a positive reward only if its usage count $N_{a}$ over the preceding $k\!-\!1$ steps remains strictly below its ground-truth quota $C_{a}$ . We model $N_{a}$ as a binomial random variable, $N_{a}\sim\mathrm{Binomial}(k-1,1/|\mathcal{A}|)$ , reflecting uniform random action selection. The expected reward at step $k$ is therefore given by
$$
E[R_{k}]=\sum_{a\in GT}r(a)\,\mathbb{P}(N_{a}<C_{a})\;-\;\lambda\!\left(1-\sum_{a\in GT}\mathbb{P}(N_{a}<C_{a})\right), \tag{6}
$$
where $GT$ denotes the set of ground-truth required actions, $r(a)$ is the reward associated with a valid action, and $\lambda$ is the penalty incurred for an invalid action. The probability term expands as the cumulative binomial sum
$$
\mathbb{P}(N_{a}<C_{a})=\sum_{i=0}^{C_{a}-1}\binom{k-1}{i}\left(\frac{1}{|\mathcal{A}|}\right)^{i}\left(1-\frac{1}{|\mathcal{A}|}\right)^{k-1-i}. \tag{7}
$$
For the subset of samples considered in our evaluation, the ground-truth action quotas consist of two occurrences of one translation direction ( $C=2$ ), one occurrence of another translation direction ( $C=1$ ), and a single occurrence each for rotation and scaling ( $C=1$ ). The expected cumulative reward up to step $K$ then follows directly as $\sum_{j=1}^{K}E[R_{j}]$ .
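Equations (6) and (7) can be evaluated numerically with the standard-library binomial coefficient. The sketch below uses illustrative values $r(a)=1$ and $\lambda=1$ (the paper's actual reward magnitudes may differ) and takes $|\mathcal{A}|=11$, the total number of labels across the three tasks in the system prompts:

```python
from math import comb

def p_below_quota(k, C, n_actions):
    """P(N_a < C_a) with N_a ~ Binomial(k-1, 1/|A|), i.e. Eq. (7)."""
    p = 1.0 / n_actions
    return sum(comb(k - 1, i) * p**i * (1 - p) ** (k - 1 - i) for i in range(C))

def expected_reward(k, quotas, n_actions, r=1.0, lam=1.0):
    """E[R_k] as written in Eq. (6), summing over the ground-truth actions."""
    probs = [p_below_quota(k, C, n_actions) for C in quotas]
    return r * sum(probs) - lam * (1.0 - sum(probs))

# Quotas for the evaluated subset: one translation twice, another once,
# rotation once, scaling once.
quotas = [2, 1, 1, 1]
cumulative = sum(expected_reward(k, quotas, n_actions=11) for k in range(1, 6))
```

At $k=1$ every quota probability equals 1 (no prior draws), and each $\mathbb{P}(N_a<C_a)$ decays as $k$ grows, so the per-step expected reward of the random policy decreases monotonically.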