# Localizing and Correcting Errors for LLM-based Planners
**Authors**: Aditya Kumar, William Cohen
## Abstract
Large language models (LLMs) have demonstrated strong reasoning capabilities on math and coding, but frequently fail on symbolic classical planning tasks. Our studies, as well as prior work, show that LLM-generated plans routinely violate domain constraints given in their instructions (e.g., walking through walls). To address this failure, we propose iteratively augmenting instructions with Localized In-Context Learning (L-ICL) demonstrations: targeted corrections for specific failing steps. Specifically, L-ICL identifies the first constraint violation in a trace and injects a minimal input-output example giving the correct behavior for the failing step. L-ICL is far more effective than explicit instructions, traditional ICL (which adds complete problem-solving trajectories), and many other baselines. For example, on an 8 $\times$ 8 gridworld, L-ICL produces valid plans 89% of the time with only 60 training examples, compared to 59% for the best baseline, an absolute gain of 30 points. L-ICL also shows dramatic improvements in other domains (gridworld navigation, mazes, Sokoban, and BlocksWorld), and on several LLM architectures.
## 1 Introduction
Large language models (LLMs) and agentic systems reason and plan effectively in domains such as mathematics, coding, and question answering (Khattab et al., 2023; Yao et al., 2023a), suggesting that modern LLMs possess strong general planning capabilities. However, studies on classical planning benchmarks reveal a more nuanced picture: LLMs frequently fail, even on simple planning tasks that symbolic planners solve easily (Valmeekam et al., 2023; Stechly et al., 2024). Past researchers have analyzed plans produced by LLMs such as SearchFormer (Lehnert et al., 2024), which are fine-tuned to generate structured reasoning chains that can be parsed, and shown that LLMs frequently violate domain constraints given in their instructions (Stechly et al., 2024). For example, LLMs might propose plans that walk through a wall in a maze, or pick up a block when the robot's gripper is already occupied.
Table 1: Performance on an 8 $\times$ 8 two-room gridworld using DeepSeek V3. Paths start in one room and end in the other. Valid plans never leave the grid or cross walls; Successful plans reach their goals; and Optimal plans are successful and use the minimum number of steps. L-ICL[ $m$ ] denotes our method trained on $m$ examples, with the corresponding character count of L-ICL examples provided. All experiments are provided with an ASCII representation of the grid.
| Method | %Valid | %Successful | %Optimal |
| --- | --- | --- | --- |
| Zero-Shot | 16 | 0 | 0 |
| RAG-ICL [10k chars] | 20 | 6 | 6 |
| RAG-ICL [20k chars] | 21 | 9 | 9 |
| ReAct | 48 | 41 | 37 |
| Self-Consistency ( $k{=}5$ ) | 59 | 45 | 43 |
| Self-Refine ( $k{=}5$ ) | 51 | 44 | 38 |
| PTP / L-ICL [ $m=0$ ] | 40 | 33 | 28 |
| L-ICL [ours, $m=60$ , 2k chars] | 89 | 89 | 77 |

PTP, introduced in (Cohen and Cohen, 2024), prompts LLMs with partially specified programs; L-ICL with $m{=}0$ training examples reduces to PTP.
Table 1 demonstrates this on a very simple 8 $\times$ 8 two-room gridworld navigation task. Despite receiving complete information about the domain (grid layout and obstacles), no baseline method produces valid plans even 60% of the time. Agentic and test-time-scaling approaches perform better, but still produce many invalid plans. We conjecture that LLMs cannot build valid plans for this task because they fail to consistently access the necessary domain-specific knowledge in the prompt. This hypothesis is consistent with the failure of LLMs in these domains, and with their success in math and coding, where the necessary knowledge is general, and hence learnable in pre-training or fine-tuning.
In-context learning (ICL) is a natural remedy. However, complete solution trajectories demonstrate that plans work, not why individual steps are valid, leaving constraints implicit. As Table 1 shows, even 20,000 characters of retrieved trajectories (RAG-ICL, which retrieves demonstrations for tasks with similar start and end goals) yield only 9% success. The rules must still be inferred, and inference fails.
L-ICL escapes this trap by letting failures reveal which constraints need explicit specification. Rather than full trajectories, we augment prompts with localized examples that demonstrate correct behavior on individual steps where models err. We call this approach Localized In-Context Learning (L-ICL). This approach achieves higher performance with much less context: 2,000 characters of targeted corrections outperform 20,000 characters of trajectories. Generating L-ICL examples requires analyzing and correcting reasoning traces at training time, which we enable by prompting models to produce structured reasoning traces, and then correcting the traces with a symbolic planner. Thus, L-ICL might be viewed as distilling domain knowledge from a symbolic system into an LLM.
Figure 1 summarizes our approach, which builds on Program Trace Prompting (PTP) (Cohen and Cohen, 2024). PTP recasts reasoning as producing a "program trace" for a partially specified program. A PTP prompt includes, for each type of reasoning "step", documentation (but not code) for a corresponding subroutine, along with (optional) example inputs and outputs. For instance, a gridworld navigation task might include a subroutine get_applicable_actions(cell) that returns the set of obstacle-free cells adjacent to the input cell. Because no executable code is provided in PTP, just documentation, the LLM must infer how to perform the reasoning step: e.g., in gridworld navigation, the LLM must infer which moves are valid for a task. PTP's prompting scheme provides a natural insertion point for localized corrections: when a subroutine call fails, we locally augment that subroutine's documentation by adding a new input/output example. The input/output examples use Python's doctest syntax, a format well-represented in LLM training data and thus readily understandable by code-trained LLMs.
Figure 1: Overview of L-ICL. The prompt template follows PTP: it includes documentation for each subroutine but no executable code. Prompting an LLM produces a trace that follows the format of the $k$ provided example traces. The trace is parsed to find the first failing step, and the failing input is passed to an oracle that returns the correct output. This yields a localized example (e.g., $x{=}\texttt{(5,4)}$ , $y{=}\texttt{['move\_east','move\_west']}$ ) that is inserted into the subroutine's documentation. This process iterates over training instances to accumulate examples in a failure-driven manner.
Given a planning task, we first prompt the LLM to generate a trace using the PTP format. We then analyze this trace programmatically to identify the first failing step, i.e., the first subroutine call whose output violates domain constraints. An oracle (a symbolic simulator or verifier) provides the correct output for that input, yielding a localized correction. This correction is then inserted into the prompt. For instance, if the LLM's first invalid move is from cell $(3,4)$ , L-ICL will add to the prompt an example showing that get_applicable_actions((3,4)) should return ['move_north', 'move_south']. This localized correction directly addresses the failure, and of course can also be generalized by the LLM to other similar cases.
This process iterates over multiple training instances, accumulating a bank of targeted examples that progressively refine the modelβs understanding of domain constraints. Crucially, the oracle is required only during training.
Experimentally, prompt augmentation with L-ICL dramatically reduces domain violations, and thus improves LLM planning performance across multiple domains. Beyond the results of Table 1 and other gridworld tasks, we evaluate on classical planning benchmarks like BlocksWorld and Sokoban, seeing similar gains. L-ICL is also remarkably sample-efficient: peak performance is typically achieved with only 30–60 training examples. L-ICL works on multiple LLM architectures (DeepSeek V3, DeepSeek V3.1, Claude Haiku 4.5, Claude Sonnet 4.5), and learned constraints can transfer across problem sizes (see Appendix B).
To summarize our contributions: (1) Using the PTP variant of semi-structured reasoning, we precisely measure constraint violation rates in LLM-generated plans across multiple planning domains, revealing that such violations are the dominant failure mode. (2) We introduce L-ICL, a method that improves planning validity through localized, failure-driven corrections, and show that targeted examples outperform retrieval of complete trajectories even when the latter uses 10 $\times$ more context. (3) We demonstrate consistent improvements across multiple planning domains and four LLM architectures. (4) We release our benchmark suite and code to facilitate future research on LLM planning.
## 2 Related Work
### 2.1 LLM Planning: Capabilities and Limitations
The planning capabilities of LLMs remain contested. One line of work reports strong performance on some planning tasks when LLMs are augmented with appropriate scaffolding: e.g., Tree of Thoughts achieves 74% on Game of 24 versus 4% for chain-of-thought (Yao et al., 2023a), RAP-MCTS reaches 100% on Blocksworld instances requiring 6 or fewer steps (Hao et al., 2023), and ReAct improves interactive decision-making by 34% over baselines (Yao et al., 2023b). However, systematic evaluation on classical planning benchmarks reveals persistent failures. Valmeekam et al. (2023) show GPT-4 achieves only 12% success on International Planning Competition (IPC) domains; and Stechly et al. (2024) demonstrate that chain-of-thought improvements are brittle and fail to generalize beyond surface patterns. The LLM-Modulo framework (Kambhampati et al., 2024) argues that LLMs function as approximate knowledge sources rather than autonomous planners, achieving strong results only when paired with external verifiers. Kaesberg et al. (2025) also documented that LLMs are challenged by 2D navigation tasks, similar to ones we study here. Most recently, Shojaee et al. (2025) identify a βcomplexity collapseβ phenomenon: reasoning modelsβ performance degrades sharply beyond certain problem complexities, with accuracy dropping to zero on harder instances even when token budgets remain available.
We follow Stechly et al. (2024) in working to diagnose why LLMs violate constraints using structured reasoning chains; however, we work with PTP as a prompting scheme, rather than models fine-tuned to produce structured reasoning chains, allowing us to consider more kinds of models, and more powerful ones. With L-ICL, we also propose a practical method to reduce these violations. Our work confirms that constraint violations are a common failure mode, and shows that targeted corrections outperform both agentic scaffolding and retrieval-based ICL approaches.
### 2.2 Approaches to Improve LLM Reasoning
Prior work addresses LLM reasoning limitations through three main strategies: structured output formats, test-time compute scaling, and in-context learning.

**Structured Reasoning.** Chain-of-thought prompting (Wei et al., 2022) improves performance by eliciting intermediate steps, though explanations may be unfaithful to actual computation (Turpin et al., 2023). PTP (Cohen and Cohen, 2024) offers interpretable traces: prompts specify subroutine signatures without implementations, and the LLM produces structured outputs that can be parsed and verified (Leng et al., 2025). We build on PTP because its explicit subroutine structure provides natural insertion points for localized corrections.

**Test-Time Compute.** Several methods improve reasoning by expending more computation at inference. Self-Consistency (Wang et al., 2023) aggregates multiple sampled paths via majority voting; Tree of Thoughts (Yao et al., 2023a) explores branching reasoning trajectories; and Self-Refine (Madaan et al., 2023) iteratively improves outputs through self-critique. Tool-augmented approaches interleave reasoning with execution: Program of Thoughts (Chen et al., 2022), PAL (Gao et al., 2023), and Chain of Code (Li et al., 2023) generate executable code, while ReAct (Yao et al., 2023b) interleaves reasoning with tool calls. These methods require multiple LLM calls or external tools at inference. Critically, Stechly et al. (2025) show that LLM self-verification is unreliable, making self-critique ineffective for planning.

**In-Context Learning.** ICL enables task adaptation through examples (Brown et al., 2020), with effectiveness depending on example selection (Liu et al., 2022) and format (Min et al., 2022). For planning, a natural approach is retrieving complete solution trajectories (RAG-ICL). However, we find this ineffective: 20,000 characters of retrieved trajectories yield only 9% success on our gridworld benchmark.
Complete trajectories demonstrate that solutions work but leave implicit why individual steps are valid. L-ICL addresses this by providing localized input-output pairs that directly encode constraints. Table 2 summarizes how L-ICL relates to prior approaches.
Table 2: Comparison of L-ICL with related approaches. L-ICL uniquely combines example-based training with localized feedback while requiring only single-pass inference.
| Method | Examples | Inference Calls | Oracle Feedback |
| --- | --- | --- | --- |
| Self-Refine | none | many | none |
| Tree of Thoughts | none | many | none |
| Self-Consistency | none | many | none |
| ReAct | none | many | none |
| ReAct + oracle f/b | none | many | yes |
| Fine-tuning | trajectory | one | none |
| RAG-ICL | trajectory | one | none |
| L-ICL (ours) | one step | one | train only |
## 3 Method
We first describe Program Trace Prompting (PTP), the structured reasoning framework underlying our approach. We then introduce Localized In-Context Learning (L-ICL), our method for iteratively injecting domain constraints into the prompt. Finally, we describe our experimental domains and evaluation setup.
### 3.1 Background: Program Trace Prompting
Program Trace Prompting (PTP) (Cohen and Cohen, 2024) recasts reasoning as producing an execution trace for a partially specified program. A PTP prompt contains documentation for each subroutine (function name, typed arguments, return type, and a natural language description of its purpose), a small number of example traces showing how subroutines are called, and the query problem to solve. Crucially, subroutine implementations are withheld; the LLM must infer correct behavior from context. For planning tasks, we define subroutines corresponding to planning primitives. For instance, a gridworld navigation task includes a subroutine that returns applicable actions from a given state (those that stay in bounds and avoid walls), a subroutine that returns the resulting state after executing an action, and a subroutine that checks whether the current state satisfies the goal. The LLM generates a trace by repeatedly invoking these subroutines, producing outputs consistent with the documentation and examples. Because the trace follows a predictable structure, we can parse it programmatically and verify each step against a ground-truth oracle. This explicit subroutine structure provides natural insertion points for corrections: when a specific subroutine call fails, we can augment that subroutineβs documentation without modifying the rest of the prompt. Full subroutine specifications for each domain appear in Appendix E.
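As an illustrative sketch (the subroutine name matches the gridworld example in the text, but the exact spec format is ours, not the released PTP prompt), a PTP-style subroutine entry is just a signature and a docstring with optional doctest-format examples; the implementation is withheld:

```python
def get_applicable_actions(cell):
    """Return the in-bounds, obstacle-free moves applicable from `cell`.

    >>> get_applicable_actions((1, 2))
    ['move_north', 'move_east']
    """
    raise NotImplementedError  # PTP withholds all implementations

# Only the signature and docstring reach the prompt; appending another
# doctest line to this docstring is the insertion point L-ICL exploits.
spec_text = get_applicable_actions.__doc__
```

Because the LLM sees only `spec_text`, it must infer the subroutine's behavior from the documentation and any accumulated examples.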
### 3.2 Localized In-Context Learning (L-ICL)
The key insight behind L-ICL is that domain constraints can be taught more effectively through targeted examples than through complete solution trajectories. When an LLM violates a constraint (e.g., proposing to move through a wall), traditional approaches either reject the entire plan or provide feedback on the final outcome. L-ICL instead identifies the precise point of failure and injects a minimal correction for that specific subroutine call.

**First Failure Identification.** Given an LLM-generated trace, we parse each subroutine call and verify its output against an oracle. Let $c_{1},c_{2},\ldots,c_{n}$ denote the sequence of subroutine calls in the trace. We identify the first failing call $c_{i^{*}}$ such that the LLM's output differs from the oracle's:
$$
i^{*}=\min\{i:\text{LLM}(c_{i})\neq\text{Oracle}(c_{i})\}
$$
Focusing on the first failure is deliberate. Planning errors cascade: an invalid move at step $k$ renders all subsequent state representations incorrect, making later "errors" artifacts of the initial mistake rather than independent failures. Correcting the root cause addresses multiple downstream errors simultaneously.

**Localized Correction.** For the failing call $c_{i^{*}}$ with input $x$ and incorrect output $\hat{y}$ , we query the oracle to obtain the correct output $y^{*}=\text{Oracle}(x)$ . This yields a correction tuple $(f,x,y^{*})$ where $f$ is the subroutine name. We format this correction as a doctest-style example and insert it into the documentation for subroutine $f$ , augmenting the original description with an additional input-output pair. This format, drawn from Python's widely used doctest convention, is well-represented in LLM training data. Appendix E.3 provides concrete examples of the correction format.

**Iterative Accumulation.** L-ICL iterates over a set of training problems $\{P_{1},P_{2},\ldots,P_{m}\}$ . For each problem, we generate a trace using the current prompt, identify the first failing subroutine call (if any), and add the corresponding correction to the prompt. Corrections accumulate across training problems, progressively "hardening" the prompt to avoid constraint violations. Algorithm 1 provides pseudocode. L-ICL converges quickly: we see diminishing returns after only 30–60 training examples on our benchmark tasks (see Section 4).
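To make the correction format concrete, here is a hypothetical helper (not from the paper's released code) that renders a correction tuple $(f,x,y^{*})$ as a doctest-style example:

```python
def format_correction(f, x, y_star):
    """Render a correction tuple (f, x, y*) as a doctest-style example.

    Hypothetical helper: the exact rendering L-ICL uses may differ.
    """
    return f">>> {f}({x!r})\n{y_star!r}"

# The failing gridworld call from Section 1, rendered for insertion
# into get_applicable_actions's documentation block.
doc = format_correction("get_applicable_actions", (3, 4),
                        ["move_north", "move_south"])
```

The resulting two-line string mimics an interactive Python session, which is exactly what the doctest convention (and hence pre-training data) looks like.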
Algorithm 1 Localized In-Context Learning (L-ICL)
Require: Base prompt $\mathcal{P}_{0}$ with PTP structure, training problems $\{P_{1},\ldots,P_{m}\}$ , oracle $\mathcal{O}$
Ensure: Augmented prompt $\mathcal{P}$
$\mathcal{P}\leftarrow\mathcal{P}_{0}$
$\mathcal{C}\leftarrow\emptyset$ $\triangleright$ Correction set
for $j=1$ to $m$ do
$\tau\leftarrow\textsc{GenerateTrace}(\mathcal{P}_{0},P_{j})$
$\{c_{1},\ldots,c_{n}\}\leftarrow\textsc{ParseCalls}(\tau)$
for $i=1$ to $n$ do
$(f,x,\hat{y})\leftarrow c_{i}$
$y^{*}\leftarrow\mathcal{O}(f,x)$
if $\hat{y}\neq y^{*}$ then
$\mathcal{C}\leftarrow\mathcal{C}\cup\{(f,x,y^{*})\}$ $\triangleright$ Record first failure
break
end if
end for
end for
$\mathcal{P}\leftarrow\textsc{InsertCorrections}(\mathcal{P}_{0},\mathcal{C})$ $\triangleright$ Batch update
return $\mathcal{P}$
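Algorithm 1 can be sketched as a short Python loop. This is a minimal illustration, not the released implementation; all argument names are ours, and `oracle(f, x)` stands in for the symbolic simulator or verifier:

```python
def l_icl_train(base_prompt, problems, oracle,
                generate_trace, parse_calls, insert_corrections):
    """Accumulate first-failure corrections over the training problems."""
    corrections = []
    for problem in problems:
        trace = generate_trace(base_prompt, problem)
        for f, x, y_hat in parse_calls(trace):
            y_star = oracle(f, x)
            if y_hat != y_star:
                corrections.append((f, x, y_star))  # record first failure
                break  # later errors may just be cascades of this one
    return insert_corrections(base_prompt, corrections)  # batch update
```

Note that, as in Algorithm 1, every trace is generated from the base prompt and corrections are inserted in a single batch update at the end.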
### 3.3 Experimental Domains
We design our experimental domains as a progressive ablation study that isolates different facets of planning difficulty. Starting from simple navigation, we incrementally add complexity along several axes: spatial structure, action diversity, state tracking requirements, and strategic reasoning. Table 3 summarizes how each domain isolates specific challenges.
Table 3: Progressive ablation across experimental domains. Each domain adds complexity along one or more axes while controlling others.
| Domain | Spatial Structure | Actions | Objects Tracked | Irreversible States |
| --- | --- | --- | --- | --- |
| 8 $\times$ 8 Grid | Simple | 4 | 1 | No |
| 10 $\times$ 10 Maze | Complex | 4 | 1 | No |
| Sokoban Grid | Complex | 4 | 1 | No |
| Full Sokoban | Complex | 8 | 3 | Yes |
| BlocksWorld | None | 2 | 5 | No |
The 8 $\times$ 8 Two-Room Gridworld is our simplest setting, testing basic spatial reasoning: an agent must navigate between two rooms connected by a single doorway. The 10 $\times$ 10 Maze increases spatial complexity with narrow corridors and dead ends, requiring longer plans (typically 15–25 steps versus 8–12 for the gridworld). Full Sokoban introduces the critical challenge of multi-object state tracking (an agent and a box), where the agent must coordinate its position with multiple box positions, and where certain pushes lead to irreversible trap states. Sokoban-Style Gridworld ablates Sokoban by removing pushable boxes while keeping the spatial layout and action semantics, isolating the effect of richer environment structure. Finally, BlocksWorld differs qualitatively from navigation: every object (block) is dynamic, constraints depend on relational configurations rather than spatial positions, and we provide an algorithmic sketch to test whether L-ICL can improve adherence to prescribed planning strategies. Full domain specifications appear in Appendix C.
### 3.4 Baselines and Metrics
We compare L-ICL against several approaches spanning prompting strategies, agentic methods, and retrieval.

**Zero-Shot.** The LLM receives the problem description and instructions with no in-context examples, measuring baseline capability without demonstrations.

**RAG-ICL.** We retrieve complete CoT-formatted solution trajectories for similar problems based on start/goal similarity, and evaluate at 10k and 20k character budgets.

**ReAct.** The LLM is instructed to interleave reasoning and action selection in its output, following the prompt format specified in Appendix F.2. We evaluate a prompt-only version and an oracle-augmented version that queries a verifier during planning.

**Self-Consistency.** Majority voting over $k{=}5$ reasoning paths sampled at temperature 0.7.

**Self-Refine.** The LLM generates a solution, then critiques and refines it based on its own feedback, for $k{=}5$ iterations.

**Tree-of-Thoughts.** The LLM explores a tree of intermediate steps, evaluating and pruning branches (prompt-only, no external search).

Crucially, ReAct (Oracle) queries the verifier at test time for each proposed action, while L-ICL uses the oracle only during training. At inference, L-ICL requires a single forward pass with no external dependencies. For L-ICL, we report results with different numbers of training examples $m$ (denoted L-ICL[ $m$ ]) to assess sample efficiency.
We evaluate plans along three axes that form a natural hierarchy. A plan is valid if it violates no domain constraints (e.g., no wall collisions). A plan is successful if it is valid and reaches the goal state. A plan is optimal if it is successful and uses the minimum number of steps. Hence, a large valid-to-success gap indicates the model follows rules but fails to reach goals, and a large success-to-optimal gap indicates inefficient but functional plans.
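The three metrics form a strict hierarchy (optimal implies successful implies valid) and can be scored mechanically. A minimal sketch, with helper names of our own choosing:

```python
def score_plan(plan, step_is_valid, reaches_goal, optimal_length):
    """Score one plan on the valid / successful / optimal hierarchy.

    `step_is_valid`, `reaches_goal`, and `optimal_length` are assumed to be
    supplied by a domain-specific simulator; names here are illustrative.
    """
    valid = all(step_is_valid(step) for step in plan)
    successful = valid and reaches_goal(plan)          # success requires validity
    optimal = successful and len(plan) == optimal_length
    return {"valid": valid, "successful": successful, "optimal": optimal}
```

By construction, a plan can never score successful without scoring valid, matching the hierarchy described above.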
### 3.5 Experimental Setup
Our primary experiments use DeepSeek V3 (DeepSeek-AI, 2024), with additional evaluation on DeepSeek V3.1, Claude Haiku 4.5, and Claude Sonnet 4.5 (Anthropic, 2025) to assess cross-architecture generalization. For each domain, we generate 100 test problems with random start and goal configurations. Training problems for L-ICL are drawn from a disjoint pool of 250 instances. For domains other than BlocksWorld, prompts use a textual state representation, as suggested in Figure 1, and unless stated otherwise, use an ASCII representation of the grid. Oracles are domain-specific: simple simulators for gridworlds and mazes, and the Fast Downward planner (Helmert, 2006) and tools like the K-Star Planner (Katz and Lee, 2023; Lee et al., 2023) for Sokoban and BlocksWorld. We use temperature 1 for optimal model performance (DeepSeek-AI, 2024) unless stated otherwise. L-ICL is trained on up to 240 examples.
## 4 Results
We evaluate L-ICL across our domain suite, demonstrating that localized corrections dramatically improve constraint adherence while remaining sample-efficient. We ask four key questions about L-ICL: (1) Does it learn domain constraints? (2) Is it more efficient than retrieval-based ICL? (3) Does it require explicit spatial representations? (4) Does it generalize across LLM architectures?
### 4.1 L-ICL Learns Domain Constraints
Table 4 presents our main results across all domains. L-ICL consistently outperforms all baselines, often by substantial margins. Beyond raw performance gains, the pattern of results across our progressive domain suite reveals which aspects of planning L-ICL addresses effectively.
Table 4: Main results across all domains. We report %(V)alid and %(S)uccessful. All baselines receive ASCII grid representations. L-ICL[ $m$ ] denotes training on $m$ examples. Best results in bold, second-best underlined. $\dagger$ ReAct (Oracle f/b) receives oracle feedback at inference time. $\ddagger$ L-ICL (no grid) methods are handicapped: they receive no ASCII grid, and rely purely on L-ICL to infer structure.
| | 8 $\times$ 8 Grid | 10 $\times$ 10 Maze | Sokoban Grid | Full Sokoban | BlocksWorld | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Method | V | S | V | S | V | S | V | S | V | S |
| Zero-Shot | 16 | 0 | 3 | 0 | 15 | 0 | 1 | 0 | 10 | 10 |
| RAG-ICL (10k chars) | 20 | 6 | 7 | 1 | 17 | 4 | 31 | 11 | 25 | 25 |
| RAG-ICL (20k chars) | 21 | 9 | 7 | 4 | 25 | 10 | 36 | 15 | 32 | 32 |
| ReAct (Prompt-Only) | 48 | 41 | 6 | 5 | 19 | 12 | 1 | 0 | 46 | 45 |
| Self-Consistency ( $k{=}5$ ) | 59 | 45 | 3 | 3 | 10 | 5 | 2 | 1 | 31 | 31 |
| Self-Refine ( $k{=}5$ ) | 51 | 44 | 3 | 1 | 13 | 8 | 0 | 0 | 49 | 49 |
| ToT (Prompt-Only) | 33 | 12 | 1 | 0 | 3 | 2 | 0 | 0 | 50 | 40 |
| ReAct (Oracle f/b) $\dagger$ | 55 | 45 | 6 | 5 | 21 | 13 | 3 | 0 | 51 | 51 |
| L-ICL[ $m{=}0$ ] (ours) | 40 | 33 | 20 | 16 | 21 | 17 | 19 | 13 | 50 | 48 |
| L-ICL[ $m{=}60$ ] (ours) | 89 | 89 | 40 | 21 | 63 | 49 | 46 | 20 | 68 | 66 |
| L-ICL[ $m{=}0$ ] $\ddagger$ (ours) | 19 | 12 | 7 | 6 | 10 | 8 | 12 | 9 | 50 | 48 |
| L-ICL[ $m{=}60$ ] $\ddagger$ (ours) | 73 | 63 | 57 | 27 | 62 | 44 | 42 | 14 | 68 | 66 |
**8 $\times$ 8 Gridworld.** The complete failure of zero-shot prompting (0% success) on this simple two-room task is striking: the model receives full information about walls, start, and goal, yet fails completely. This reveals that the bottleneck is not knowledge but application. L-ICL achieves 63% success, demonstrating that localized corrections bridge this gap. Figure 2 shows rapid improvement in the first 30 examples, with continued gains for $\approx$ 160 examples before plateauing.

**10 $\times$ 10 Maze.** The maze's narrow corridors and longer optimal paths (15–25 steps) challenge all methods. L-ICL reaches 27% success where baselines achieve at most 5%. Notably, valid rates reach 57%, indicating that most L-ICL plans respect maze constraints even when they fail to reach the goal. This valid-to-success gap suggests that constraint satisfaction and goal-directed search are separable challenges; L-ICL addresses the former effectively.

**Sokoban Grid.** Despite adopting Sokoban's richer spatial structure, this domain (without pushable boxes) yields results intermediate between the prior domains: L-ICL achieves 49% success versus 13% for the best baseline. The similarity suggests that spatial complexity, not action vocabulary, dominates difficulty in navigation tasks.

**Full Sokoban.** Introducing pushable boxes causes the sharpest performance degradation across all methods. L-ICL improves success from 13% to only 20%, yet increases valid action rates from 19% to 46%. This dissociation isolates multi-object state tracking as a distinct challenge: L-ICL teaches which pushes are legal, but coordinating agent and box positions toward the goal requires capabilities beyond constraint satisfaction, further analyzed in Appendix A.

**BlocksWorld.** This domain differs qualitatively: constraints are relational ("block A is on block B") rather than spatial, and every object is dynamic.
L-ICL still improves success from 48% to 66%, demonstrating that localized corrections generalize beyond navigation.
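For concreteness, the Valid and Success metrics contrasted throughout these results can be sketched as a step-by-step plan check. The move names, wall set, and coordinates below are illustrative assumptions, not the paper's actual evaluation code.

```python
# Sketch: computing Valid and Success for a gridworld plan.
# Action names and the wall representation are illustrative assumptions.

DELTAS = {
    "move_north": (0, 1),
    "move_south": (0, -1),
    "move_east": (1, 0),
    "move_west": (-1, 0),
}

def evaluate_plan(plan, start, goal, walls, size=8):
    """Walk a plan step by step.

    Returns (valid, success): valid is True if no step leaves the grid
    or enters a wall; success additionally requires ending at the goal.
    """
    x, y = start
    for action in plan:
        dx, dy = DELTAS[action]
        nx, ny = x + dx, y + dy
        if not (0 <= nx < size and 0 <= ny < size) or (nx, ny) in walls:
            return False, False  # first violation invalidates the plan
        x, y = nx, ny
    return True, (x, y) == goal

# A plan can be valid yet unsuccessful, which is exactly the
# valid-to-success gap discussed above.
valid, success = evaluate_plan(
    ["move_east", "move_east"], start=(0, 0), goal=(3, 0), walls={(1, 1)}
)
```

Under this decomposition, the maze and Sokoban results above correspond to plans where `valid` holds far more often than `success`.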
<details>
<summary>graphs/misc/8x8_nogrid_success_optimal_combined.png Details</summary>

### Visual Description
## Line Chart: 8x8 Gridworld: Success vs Optimal Rate
### Overview
This line chart compares the success rate and optimal rate in an 8x8 Gridworld environment as a function of the number of training examples. The chart displays two main lines representing the success and optimal rates, along with shaded areas indicating the lower and upper confidence intervals (L-ICL). Two dashed lines represent the baseline success and optimal rates.
### Components/Axes
* **Title:** 8x8 Gridworld: Success vs Optimal Rate
* **X-axis:** Training Examples (Scale: 0 to 240, increments of 30)
* **Y-axis:** Rate (%) (Scale: 0 to 90, increments of 10)
* **Legend:** Located at the bottom-center of the chart.
* Best Baseline Success (Self-Consistency) - Dashed Orange Line
* Best Baseline Optimal (Self-Consistency) - Dashed Blue Line
* L-ICL Success - Blue Line with Shaded Area
* L-ICL Optimal - Orange Line with Shaded Area
### Detailed Analysis
The chart shows the following trends and data points:
* **Best Baseline Success (Self-Consistency):** This is a horizontal dashed orange line. It remains relatively constant at approximately 44% throughout the range of training examples.
* **Best Baseline Optimal (Self-Consistency):** This is a horizontal dashed blue line. It remains relatively constant at approximately 42% throughout the range of training examples.
* **L-ICL Success (Blue Line):** This line starts at approximately 44% at 0 training examples. It decreases to around 30% at 30 training examples, then increases, reaching a peak of approximately 75% at 150 training examples. It then fluctuates, ending at approximately 72% at 240 training examples.
* **L-ICL Optimal (Orange Line):** This line starts at approximately 42% at 0 training examples. It decreases sharply to around 25% at 30 training examples, then increases, reaching a peak of approximately 72% at 150 training examples. It then fluctuates, ending at approximately 70% at 240 training examples.
* **L-ICL Success Shaded Area:** The shaded area around the blue line represents the lower and upper confidence intervals. The width of the shaded area varies, indicating the uncertainty in the success rate.
* **L-ICL Optimal Shaded Area:** The shaded area around the orange line represents the lower and upper confidence intervals. The width of the shaded area varies, indicating the uncertainty in the optimal rate.
Here's a more detailed breakdown of approximate values at specific training example points:
| Training Examples | L-ICL Success (%) | L-ICL Optimal (%) |
|---|---|---|
| 0 | 44 | 42 |
| 30 | 30 | 25 |
| 60 | 58 | 52 |
| 90 | 62 | 58 |
| 120 | 68 | 64 |
| 150 | 75 | 72 |
| 180 | 70 | 66 |
| 210 | 73 | 68 |
| 240 | 72 | 70 |
### Key Observations
* Both the success and optimal rates initially decrease with a small number of training examples (0-30).
* Both rates increase significantly between 30 and 150 training examples, suggesting a learning phase.
* After 150 training examples, the rates fluctuate but generally remain high.
* The success rate (blue line) is consistently slightly higher than the optimal rate (orange line) after approximately 60 training examples.
* The confidence intervals (shaded areas) are wider at the beginning and end of the training period, indicating greater uncertainty.
### Interpretation
The data suggests that the agent's performance (both success and optimal rates) in the 8x8 Gridworld environment improves with more training examples. The initial decrease in performance may be due to the agent exploring the environment and learning the basic dynamics. The subsequent increase indicates that the agent is learning to navigate and achieve its goals more effectively. The fact that the success rate is consistently higher than the optimal rate after a certain point suggests that the agent is not only finding optimal solutions but also succeeding in other, potentially suboptimal, ways. The confidence intervals provide a measure of the reliability of the results, indicating that the performance is more consistent with a larger number of training examples. The baseline rates are relatively low, indicating that L-ICL provides a significant improvement over the self-consistency baseline. The fluctuations in the rates after 150 training examples could be due to the complexity of the environment or the stochastic nature of the learning process.
</details>
Figure 2: 8 $\times$ 8 Gridworld learning curves. Success and Optimal rates vs. training examples. L-ICL (without being given the ASCII grid) improves rapidly in the first 30β60 examples, substantially outperforming all baselines, which are given access to the ASCII grid (horizontal line shows best baseline).
### 4.2 L-ICL Is More Efficient Than Retrieval-Based ICL
A key advantage of L-ICL is sample efficiency: localized corrections convey more information per token than complete solution trajectories. Figure 3 compares L-ICL and RAG-ICL as a function of context size. RAG-ICL with 20,000 characters of retrieved trajectories achieves 16% success; L-ICL matches this performance with approximately 5,000 characters and reaches 63% success with 7,000 characters. At matched context size, L-ICL outperforms RAG-ICL by more than 40 percentage points. This efficiency stems from the compression achieved by localized examples. A complete trajectory demonstrates that a solution works but leaves implicit why individual steps are valid. A local example like `get_applicable_actions((3,4)) -> ['move_north', 'move_south']` directly encodes that eastward movement from (3,4) is blocked.
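As a sketch of the kind of localized demonstration described here, a compact input-output pair can be rendered directly from a domain helper. The wall layout, grid size, and helper names below are assumptions for illustration; they are not the paper's actual domain definitions.

```python
# Sketch: rendering a localized demonstration for a failing step.
# WALLS, SIZE, and the action vocabulary are illustrative assumptions.

WALLS = {(4, 4)}  # a wall cell east of (3, 4)
SIZE = 8
DELTAS = {"move_north": (0, 1), "move_south": (0, -1),
          "move_east": (1, 0), "move_west": (-1, 0)}

def get_applicable_actions(pos):
    """Actions that keep the agent on the grid and out of walls."""
    x, y = pos
    return [a for a, (dx, dy) in DELTAS.items()
            if 0 <= x + dx < SIZE and 0 <= y + dy < SIZE
            and (x + dx, y + dy) not in WALLS]

def localized_example(pos):
    """Render one input-output pair for the failing step."""
    return f"get_applicable_actions({pos}) -> {get_applicable_actions(pos)}"
```

A single such line, injected into the prompt, encodes a specific blocked movement in a few dozen characters, versus thousands of characters for a full retrieved trajectory.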
<details>
<summary>graphs/efficiency/8x8_grid_nogrid_efficiency.png Details</summary>

### Visual Description
## Line Chart: 8x8 Gridworld: Sample Efficiency
### Overview
This line chart depicts the relationship between Context Size (chars) and Success Rate (%) for two different models: RAG-CoT and L-ICL, in an 8x8 Gridworld environment. The chart aims to demonstrate the sample efficiency of each model as context size increases.
### Components/Axes
* **Title:** 8x8 Gridworld: Sample Efficiency (top-center)
* **X-axis:** Context Size (chars). Scale ranges from 0 to 20k, with markers at 0, 5k, 10k, 15k, and 20k. (bottom-center)
* **Y-axis:** Success Rate (%). Scale ranges from 0 to 90, with markers at 10, 20, 30, 40, 50, 60, 70, 80, and 90. (left-center)
* **Legend:** Located at the top-right corner.
* RAG-CoT (orange line)
* L-ICL (blue line)
### Detailed Analysis
**RAG-CoT (Orange Line):**
The orange line representing RAG-CoT shows a generally increasing trend, but with significant fluctuations.
* At 0 chars context size, the success rate is approximately 11%.
* At 5k chars, the success rate increases to approximately 26%.
* At 10k chars, the success rate is around 28%.
* At 15k chars, the success rate is approximately 24%.
* At 20k chars, the success rate is approximately 32%.
**L-ICL (Blue Line):**
The blue line representing L-ICL demonstrates a more pronounced initial increase, followed by oscillations and a slight decline.
* At 0 chars context size, the success rate is approximately 12%.
* At 5k chars, the success rate jumps to approximately 61%.
* At 10k chars, the success rate decreases to approximately 58%.
* At 15k chars, the success rate peaks at approximately 79%.
* At 20k chars, the success rate decreases to approximately 72%.
### Key Observations
* L-ICL consistently outperforms RAG-CoT across all context sizes.
* L-ICL exhibits a rapid improvement in success rate between 0 and 5k chars.
* Both models show fluctuations in success rate as context size increases, suggesting that simply increasing context size does not guarantee improved performance.
* RAG-CoT's performance is relatively stable, but lower than L-ICL.
* L-ICL reaches its peak performance at 15k chars, then slightly declines.
### Interpretation
The data suggests that L-ICL is more sample efficient than RAG-CoT in the 8x8 Gridworld environment. L-ICL benefits significantly from even a small increase in context size (0 to 5k chars), while RAG-CoT shows a more gradual and less dramatic improvement. The oscillations in both lines indicate that the relationship between context size and success rate is not linear and may be influenced by other factors. The decline in L-ICL's performance at 20k chars could indicate a point of diminishing returns or potential overfitting to the training data. The difference in performance between the two models could be attributed to differences in their architectures or training methodologies. The chart highlights the importance of context size in language model performance, but also emphasizes that context size is not the sole determinant of success. Further investigation is needed to understand the underlying reasons for the observed fluctuations and the decline in L-ICL's performance at higher context sizes.
</details>
Figure 3: Sample efficiency: L-ICL vs. RAG-ICL. Success rate vs. context size (characters) on 8 $\times$ 8 Gridworld. L-ICL achieves higher performance with substantially less context.
### 4.3 L-ICL Does Not Need Full Domain Knowledge
In Table 4, for the tasks aside from BlocksWorld, all prompting schemes use an ASCII grid visualization of the gridworld to be explored (preliminary experiments suggested this representation was most effective for these tasks). Since L-ICL learns to correct domain violations, a natural question is whether the ASCII grid is actually necessary: can L-ICL learn the domain from examples alone?
Figure 4 shows the learning curve for L-ICL on the 10 $\times$ 10 Maze task with and without the ASCII visualization of the grid. The visualization accelerates learning early on (21% at $m{=}30$ with grid vs. 15% without), but peak performance is comparable (39% vs. 37%). Thus, L-ICL does not require visual scaffolding, although the grid provides useful inductive bias during early training. However, to obtain the full benefit of such scaffolding, the LLM requires some L-ICL training, with more complex domains needing more examples. Thus, the 8 $\times$ 8 grid benefits almost immediately, whereas all harder domains display the benefit of the scaffolded version over the non-scaffolded version only later in training, as seen in the figure.
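For concreteness, the ASCII scaffolding ablated here might be rendered along the following lines. The symbols (`#` for walls, `S` start, `G` goal, `.` free) and the row ordering are assumptions, since the paper's exact format is not shown.

```python
# Sketch: rendering an ASCII grid like the one used as scaffolding.
# Symbols and orientation are illustrative assumptions.

def render_grid(size, walls, start, goal):
    """Return an ASCII picture of the grid, top row first."""
    rows = []
    for y in range(size - 1, -1, -1):  # highest y printed first
        row = []
        for x in range(size):
            if (x, y) in walls:
                row.append("#")
            elif (x, y) == start:
                row.append("S")
            elif (x, y) == goal:
                row.append("G")
            else:
                row.append(".")
        rows.append("".join(row))
    return "\n".join(rows)
```

A prompt with this scaffolding simply prepends the rendered string; the ablation removes it and relies on accumulated localized examples instead.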
<details>
<summary>graphs/misc/10x10_maze_grid_ablation.png Details</summary>

### Visual Description
## Line Chart: 10x10 Maze: Grid Ablation
### Overview
This line chart visualizes the performance of different models (Best Baseline, L-ICL) with and without a grid, as measured by Success Rate (%) against the number of Training Examples. The chart appears to be evaluating the impact of using a grid representation in a 10x10 maze solving task.
### Components/Axes
* **Title:** 10x10 Maze: Grid Ablation
* **X-axis:** Training Examples (ranging from 0 to 240, with markers at 0, 30, 60, 90, 120, 150, 180, 210, and 240)
* **Y-axis:** Success Rate (%) (ranging from 0 to 50, with markers at 0, 10, 20, 30, 40, and 50)
* **Legend:** Located at the bottom-right of the chart.
* Best Baseline (With Grid) - Dashed Blue Line
* Best Baseline (No Grid) - Dashed Orange Line
* L-ICL (With Grid) - Solid Blue Line
* L-ICL (No Grid) - Solid Orange Line
### Detailed Analysis
Let's analyze each line individually, noting trends and approximate data points.
* **Best Baseline (With Grid) - Dashed Blue Line:** This line starts at approximately 5% at 0 Training Examples and increases steadily to around 18% at 30 Training Examples. It continues to rise, reaching approximately 25% at 60 Training Examples, then plateaus around 25-30% between 90 and 180 Training Examples. Finally, it increases to approximately 35% at 210 Training Examples and remains around 35% at 240 Training Examples.
* **Best Baseline (No Grid) - Dashed Orange Line:** This line begins at approximately 5% at 0 Training Examples and decreases to around 3% at 30 Training Examples. It then rises to approximately 15% at 60 Training Examples, before decreasing to around 10% at 90 Training Examples. It remains relatively flat around 10-12% between 120 and 240 Training Examples.
* **L-ICL (With Grid) - Solid Blue Line:** This line starts at approximately 15% at 0 Training Examples and increases rapidly to around 35% at 30 Training Examples. It peaks at approximately 42% at 150 Training Examples, then decreases slightly to around 38% at 180 Training Examples, and remains around 38% at 210 and 240 Training Examples.
* **L-ICL (No Grid) - Solid Orange Line:** This line begins at approximately 8% at 0 Training Examples and increases to around 20% at 30 Training Examples. It reaches a peak of approximately 32% at 150 Training Examples, then decreases to around 30% at 180 Training Examples, and remains around 30-34% at 210 and 240 Training Examples.
### Key Observations
* The L-ICL model consistently outperforms the Best Baseline model, both with and without the grid.
* The grid appears to significantly improve the performance of the L-ICL model, especially at lower training example counts. The gap between L-ICL (With Grid) and L-ICL (No Grid) is substantial initially.
* The Best Baseline model shows minimal improvement from using the grid.
* All models exhibit diminishing returns in performance as the number of training examples increases beyond 150.
* The Best Baseline (No Grid) performs poorly, remaining below 15% success rate throughout the experiment.
### Interpretation
The data suggests that the L-ICL model is more effective at learning to solve the 10x10 maze task, and that the use of a grid representation significantly enhances its performance. The grid likely provides a more structured input format that the L-ICL model can leverage. The Best Baseline model, however, does not benefit as much from the grid, indicating that its learning process is less sensitive to the input representation. The plateauing of performance at higher training example counts suggests that the models are reaching their capacity to learn from the given data or that the task itself has a limited complexity. The consistently low performance of the Best Baseline (No Grid) indicates that it struggles to generalize from the training data without a structured input. This experiment demonstrates the importance of both model architecture (L-ICL vs. Best Baseline) and input representation (With Grid vs. No Grid) in achieving high performance on maze-solving tasks.
</details>
Figure 4: Grid representation ablation on 10 $\times$ 10 Maze. The ASCII grid accelerates early learning but does not change peak performance. Without L-ICL, the grid provides little benefit.
### 4.4 L-ICL Works On Many LLM Architectures
To assess whether L-ICLβs benefits are architecture-specific, we evaluate on three additional models: DeepSeek V3.1, Claude 4.5 Haiku, and Claude Sonnet 4.5. Figure 5 shows results on the 10 $\times$ 10 Maze. All models improve substantially with L-ICL. Claude Sonnet 4.5 shows the strongest gains (10% to 74%), followed by DeepSeek V3.1 (2% to 47%) and Claude 4.5 Haiku (1% to 39%). The relative ordering changes with training: at $m{=}0$ models are comparable, but by $m{=}120$ Claude Sonnet 4.5 leads substantially. This suggests stronger models leverage accumulated corrections more effectively, though all models benefit.
<details>
<summary>graphs/misc/llm_ablation_success.png Details</summary>

### Visual Description
## Line Chart: 10x10 Maze: L-ICL Performance Across LLMs
### Overview
This line chart displays the success rate (%) of several Large Language Models (LLMs) on a 10x10 maze task, as a function of the number of training examples provided. The chart compares the performance of DeepSeek V3, DeepSeek V3.1, Claude Haiku 4.5, and Claude Sonnet 4.5. The x-axis represents the number of training examples, and the y-axis represents the success rate. Shaded areas around each line indicate confidence intervals or standard deviations.
### Components/Axes
* **Title:** 10x10 Maze: L-ICL Performance Across LLMs
* **X-axis Label:** Training Examples (ranging from 0 to 240, with increments of 30)
* **Y-axis Label:** Success Rate (%) (ranging from 0 to 90, with increments of 10)
* **Legend:** Located at the bottom of the chart, identifying each line by LLM name and color.
* DeepSeek V3 (Blue)
* DeepSeek V3.1 (Orange)
* Claude Haiku 4.5 (Green)
* Claude Sonnet 4.5 (Red)
### Detailed Analysis
The chart shows the success rate of each LLM as a function of training examples.
* **DeepSeek V3 (Blue):** The line starts at approximately 10% at 0 training examples, rises sharply to around 30% at 60 training examples, plateaus around 30-35% between 60 and 180 training examples, and then declines slightly to approximately 30% at 240 training examples.
* **DeepSeek V3.1 (Orange):** The line begins at approximately 10% at 0 training examples, increases steadily to around 40% at 90 training examples, fluctuates between 35% and 45% from 90 to 210 training examples, and then decreases to approximately 35% at 240 training examples. The shaded area around this line is quite large, indicating high variability.
* **Claude Haiku 4.5 (Green):** The line starts at approximately 10% at 0 training examples, rises to around 25% at 60 training examples, plateaus around 30-35% between 60 and 180 training examples, and then increases to approximately 40% at 240 training examples.
* **Claude Sonnet 4.5 (Red):** The line begins at approximately 10% at 0 training examples, increases rapidly to around 70% at 90 training examples, reaches a peak of approximately 75% at 120 training examples, and then declines gradually to approximately 65% at 240 training examples.
### Key Observations
* Claude Sonnet 4.5 consistently outperforms the other LLMs, especially in the range of 0-120 training examples.
* DeepSeek V3 and Claude Haiku 4.5 show similar performance, with relatively stable success rates after an initial increase.
* DeepSeek V3.1 exhibits the highest variability in performance, as indicated by the large shaded area around its line.
* All LLMs show an initial increase in success rate with more training examples, but the rate of improvement diminishes as the number of examples increases.
* After a certain point, increasing the number of training examples does not necessarily lead to a higher success rate, and in some cases, can even lead to a decrease.
### Interpretation
The data suggests that the choice of LLM significantly impacts performance on the 10x10 maze task. Claude Sonnet 4.5 demonstrates superior learning capabilities, achieving a high success rate with a relatively small number of training examples. The diminishing returns observed with increasing training examples indicate that the LLMs may reach a point of saturation, where additional examples do not provide significant improvements. The variability in DeepSeek V3.1's performance suggests that its learning process may be more sensitive to the specific training data or initialization conditions. The chart highlights the importance of selecting an appropriate LLM and optimizing the training data to maximize performance on a given task. The initial rapid increase in success rate for all models suggests that even a small amount of task-specific training can significantly improve performance. The subsequent plateau and decline in some cases suggest that overfitting or the limitations of the model architecture may be factors.
</details>
Figure 5: L-ICL across LLM architectures. Success rate on 10 $\times$ 10 Maze for four models. All improve substantially; Claude Sonnet 4.5 shows the largest gains (10% $\to$ 74%).
### 4.5 Summary of Findings
(1) L-ICL dramatically improves constraint adherence, achieving consistently higher success rates than baselines across all domains. (2) L-ICL is sample-efficient: 30β90 training examples typically suffice, and L-ICL outperforms RAG-ICL while using 4 $\times$ less context. (3) Explicit spatial representations are not required: ASCII grids accelerate early learning but do not change peak performance. (4) L-ICL generalizes across architectures: four LLMs from different families all benefit substantially. (5) Multi-object tracking and strategic planning remain challenging: the valid-to-success gap in Sokoban and BlocksWorld indicates that localized corrections address constraint violations but do not fully solve long-horizon coordination (see Appendix A).
## 5 Discussion
Our experiments demonstrate that L-ICL consistently improves LLM planning performance, often by substantial margins. Beyond raw performance gains, these results support a specific conceptual interpretation that clarifies both what L-ICL achieves and where challenges remain.
### 5.1 L-ICL as In-Context Unit Testing
In software engineering, unit testing is a means of "hardening" code subroutines (i.e., making them more reliable and predictable), and it is considered good practice to use unit tests even when end-to-end tests exist. ICL demonstrations instruct a model as to desired behavior, rather than confirming that it has a desired behavior; modulo this important difference, however, L-ICL demonstrations are analogous to unit tests, and traditional ICL demonstrations are analogous to end-to-end tests. L-ICL demonstrations can be viewed as a technique for "hardening" individual reasoning steps, in that they make an LLM's instruction-following behavior more reliable and consistent.
Full-trajectory demonstrations are more like end-to-end tests; in software engineering, such tests play a different role from unit tests, confirming that individual modules interact correctly. In LLM terms, they encourage process correctness, and only incidentally encourage step correctness. In a planning task, an invalid plan may contain many correctly performed steps and only a single invalid one, so adding a full-trajectory demonstration is at best an inefficient way to improve performance, in terms of useful information per prompt token, relative to accumulating local demonstrations in a failure-driven way.
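The failure-driven accumulation described above can be sketched as a simple training loop. The `llm_plan` and `oracle` interfaces below are hypothetical stand-ins for the model call and the symbolic verifier; they are not the paper's implementation.

```python
# Sketch: the failure-driven accumulation loop behind L-ICL.
# `llm_plan(problem, demos)` and the `oracle` interface are
# hypothetical assumptions for illustration.

def l_icl_train(problems, llm_plan, oracle, demos=None):
    """For each training problem, find the first constraint violation
    in the model's plan and add a minimal corrective input-output
    example to the accumulated demonstrations."""
    demos = list(demos or [])
    for problem in problems:
        plan = llm_plan(problem, demos)
        # oracle.trace yields (action, state-before-action) pairs
        for action, state in oracle.trace(problem, plan):
            if action not in oracle.applicable(state):
                # inject the correct behavior for the failing step only
                demos.append(
                    f"get_applicable_actions({state}) -> "
                    f"{oracle.applicable(state)}"
                )
                break  # localize: stop at the first violation
    return demos
```

Each iteration adds at most one short "unit test" to the prompt, so the context grows with the number of distinct failures rather than with full trajectory length.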
### 5.2 Qualitative Evidence: From Guessing to Navigation
Figure 6 provides visual evidence of L-ICL's effect. At $m{=}0$ , the model proposes moves without regard for walls, quickly entering invalid states. By $m{=}60$ , it produces a coherent start-to-goal path respecting all walls. Crucially, this improvement occurs without the model ever seeing the ASCII grid. The doctests encode constraints implicitly through input-output pairs, and the model learns to satisfy them. This demonstrates that L-ICL induces a transferable constraint prior rather than memorizing specific layouts.
<details>
<summary>graphs/misc/maze_pictures/0_final.png Details</summary>

### Visual Description
## Diagram: Maze/Pathfinding Problem
### Overview
The image depicts a grid-based maze or pathfinding problem. The maze consists of black blocks obstructing movement on a white grid. A starting point 'S' and a goal point 'G' are marked. A dashed red line indicates a possible path from 'S' to 'G'.
### Components/Axes
- **Grid:** A square grid forms the background. The grid lines are faint and appear to be 10x10 or similar.
- **Black Blocks:** Dark gray/black rectangular blocks represent obstacles within the grid.
- **Start Point (S):** A green circle labeled 'S' marks the beginning of the path. Located in the bottom-left quadrant.
- **Goal Point (G):** A red circle labeled 'G' marks the destination. Located in the top-right quadrant.
- **Path:** A dashed red line connects the 'S' and 'G' points, illustrating a potential solution path.
### Detailed Analysis or Content Details
The maze is approximately 10 units wide and 10 units high. The black blocks are arranged in a complex pattern, creating a challenging path.
The path indicated by the dashed red line appears to follow these approximate coordinates (assuming the bottom-left corner is (0,0)):
- Start (S): Approximately (1,1)
- Intermediate points: (2,1), (2,2), (3,2), (4,2), (4,3), (5,3), (6,3), (6,4), (7,4), (8,4), (8,5), (7,5), (7,6), (8,6), (9,6)
- Goal (G): Approximately (9,8)
The path is not necessarily the shortest path, but a feasible route through the maze.
### Key Observations
- The maze is relatively dense with obstacles, requiring careful navigation.
- The path shown is not a straight line, indicating the need to maneuver around the blocks.
- The 'S' and 'G' points are positioned in opposite corners of the grid, increasing the path length.
- The dashed line path is not perfectly aligned with the grid lines, suggesting it's a visual aid rather than a precise coordinate-based path.
### Interpretation
This diagram represents a classic pathfinding problem, commonly encountered in computer science and robotics. The goal is to find a sequence of moves from the start point to the goal point, avoiding the obstacles. The dashed red line suggests a solution has been found, or is being proposed. The complexity of the maze indicates a non-trivial problem, potentially requiring algorithms like A* search, Dijkstra's algorithm, or similar pathfinding techniques to solve efficiently. The diagram could be used to illustrate the concept of pathfinding, demonstrate the effectiveness of a particular algorithm, or serve as a test case for pathfinding implementations. The absence of numerical data beyond the grid structure suggests the focus is on the visual representation of the problem and a potential solution, rather than quantitative analysis of path length or efficiency.
</details>
<details>
<summary>graphs/misc/maze_pictures/60_final.png Details</summary>

### Visual Description
## Diagram: Pathfinding Grid
### Overview
The image depicts a grid-based diagram representing a pathfinding problem. The grid is composed of black and white squares, with a path highlighted in blue connecting a start point 'S' to a goal point 'G'. The grid appears to represent an obstacle course, where black squares are obstacles and white squares are traversable spaces.
### Components/Axes
The diagram consists of:
* **Grid:** A square grid of approximately 10x10 cells.
* **Obstacles:** Black squares representing impassable areas.
* **Path:** A blue line indicating a possible route from 'S' to 'G'.
* **Start Point (S):** A green circle marking the beginning of the path. Located in the bottom-left quadrant of the grid.
* **Goal Point (G):** A red circle marking the end of the path. Located in the top-right quadrant of the grid.
* **Grid Lines:** Faint gray lines defining the grid cells.
### Detailed Analysis or Content Details
The blue path starts at the green circle 'S', winds through the open corridors between the black obstacle blocks, and terminates at the red circle 'G'.
### Key Observations
* The path is relatively long and winding, suggesting a complex obstacle arrangement.
* The path avoids all black squares, demonstrating a successful navigation of the obstacles.
* The path does not appear to be the shortest possible path, indicating that the pathfinding algorithm may not be optimal.
* The grid is symmetrical in some areas, but the obstacle placement introduces asymmetry.
### Interpretation
This diagram illustrates a classic pathfinding problem, likely used to demonstrate or test algorithms like A*, Dijkstra's algorithm, or Breadth-First Search. The 'S' and 'G' points represent the starting and ending locations, respectively, and the black squares represent obstacles that must be avoided. The blue line represents a solution path found by a pathfinding algorithm. The length and complexity of the path suggest that the environment is challenging, and the algorithm may not have found the most efficient route. The diagram could be used for educational purposes, algorithm testing, or as a visual representation of a robotic navigation scenario. The placement of the start and end points, and the arrangement of obstacles, likely influence the performance and characteristics of the pathfinding algorithm.
</details>
Figure 6: From blind guessing to structured navigation. Two rollouts on the same held-out maze as training examples $m$ increase. At $m{=}0$ (left), the model ignores walls entirely. By $m{=}60$ (right), the model produces a valid trajectory without ever seeing the grid representation, demonstrating that L-ICL induces transferable constraint knowledge.
### 5.3 Limitations and Scope
One limitation is that L-ICL requires an oracle that can verify constraint satisfaction and provide correct outputs during training; however, this oracle is needed only during training. At test time, L-ICL requires a single forward pass with no external dependencies, distinguishing it from methods such as ReAct with oracle feedback, which require verification at inference. Extending to domains without formal specifications may require weaker supervision (learned verifiers, stronger models) that could introduce noise.
A second limitation of this work is that we have only addressed one problem for LLM planners: their difficulty in correctly applying domain knowledge. LLM planners also struggle with strategic reasoning, i.e., performing valid actions in a way that quickly reaches the goal. While L-ICL excels at improving validity, this does not always lead to good strategic reasoning, as shown by the valid-to-success gap in Sokoban (46% valid, 20% success). We leave to future work the question of whether localized corrections, or some extension of them, can also correct strategic failures, which seem to require multi-step lookahead, or whether L-ICL must be combined with complementary approaches such as search or value functions.
A third limitation of this paper is that we consider only formally-describable planning benchmarks from the LLM planning literature. Transfer to open-ended natural-language tasks is not studied.
## 6 Conclusion
We began with a puzzle: LLMs receive complete specifications of domain constraints yet routinely violate them. For example, stating that an agent cannot walk through walls is insufficient, because models do not consistently apply that information at test time. L-ICL addresses this issue in a simple way: when a constraint is violated, we add a minimal input-output example correcting that error, thereby placing additional emphasis on the precise knowledge that was not applied. These minimal corrections accumulate during training, progressively distilling behavioral knowledge from an oracle symbolic system into the prompt. The improvement is remarkable: on an 8 $\times$ 8 gridworld where zero-shot prompting achieves 0% success, L-ICL reaches 89% with only 60 training examples, and L-ICL consistently outperforms other baselines across domains.
One key finding is that demonstration structure matters more than quantity. L-ICL achieves higher performance with 2,000 characters of targeted corrections than RAG-ICL achieves with 20,000 characters of complete trajectories. Complete solutions demonstrate that a plan works; localized examples demonstrate why individual steps are valid. This compression explains L-ICLβs sample efficiency and suggests a broader principle: LLM reliability can be improved by making implicit knowledge explicit at the point of application. This also reduces prompt engineering burden: rather than exhaustively specifying every constraint upfront, practitioners can let L-ICL discover them through failure-driven corrections.
L-ICL does not solve planning. The valid-to-success gap in Sokoban shows that respecting domain constraints is necessary but not sufficient; strategic reasoning remains challenging in this domain. We view this not as a limitation but as a clarification of scope. L-ICL provides a procedural hardening layer: a reliable foundation of constraint-satisfying primitives on which higher-level reasoning can build. Just as unit tests do not write the program but ensure its components behave correctly, L-ICL does not plan but ensures that proposed actions respect domain physics. We hope this decomposition proves useful for future work on LLM reasoning systems.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
## References
- Anthropic (2025) Claude 4.5 model family. https://www.anthropic.com/claude. Sonnet 4.5 released September 2025; Haiku 4.5 released October 2025.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901.
- W. Chen, X. Ma, X. Wang, and W. W. Cohen (2022) Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588.
- C. A. Cohen and W. W. Cohen (2024) Watch your steps: observable and modular chains of thought. arXiv preprint arXiv:2409.15359.
- DeepSeek-AI (2024) DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
- G. Francés, M. Ramirez, and collaborators (2018) Tarski: an AI planning modeling framework. GitHub. https://github.com/aig-upf/tarski.
- L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig (2023) PAL: program-aided language models. In International Conference on Machine Learning, pp. 10764–10799.
- S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu (2023) Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8154–8173.
- M. Helmert (2006) The Fast Downward planning system. Journal of Artificial Intelligence Research, Vol. 26, pp. 191–246.
- R. Howey, D. Long, and M. Fox (2004) VAL: automatic plan validation, continuous effects and mixed initiative planning using PDDL. In 16th IEEE International Conference on Tools with Artificial Intelligence, pp. 294–301.
- L. B. Kaesberg, J. P. Wahle, T. Ruas, and B. Gipp (2025) SPaRC: a spatial pathfinding reasoning challenge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, pp. 10359–10390.
- S. Kambhampati, K. Valmeekam, L. Guan, M. Verma, K. Stechly, S. Bhambri, L. Saldyt, and A. Murthy (2024) Position: LLMs can't plan, but can help planning in LLM-modulo frameworks. In Proceedings of the 41st International Conference on Machine Learning, ICML'24.
- M. Katz and J. Lee (2023) K* search over orbit space for top-k planning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023).
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts (2023) DSPy: compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
- J. Lee, M. Katz, and S. Sohrabi (2023) On K* search for top-k planning. In Symposium on Combinatorial Search.
- L. Lehnert, S. Sukhbaatar, D. Su, Q. Zheng, P. McVay, M. Rabbat, and Y. Tian (2024) Beyond A*: better planning with transformers via search dynamics bootstrapping. arXiv preprint arXiv:2402.14083.
- J. Leng, C. A. Cohen, Z. Zhang, C. Xiong, and W. W. Cohen (2025) Semi-structured LLM reasoners can be rigorously audited. arXiv preprint arXiv:2505.24217.
- C. Li, J. Liang, A. Zeng, X. Chen, K. Hausman, D. Sadigh, S. Levine, L. Fei-Fei, F. Xia, and B. Ichter (2023) Chain of code: reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474.
- J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen (2022) What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Dublin, Ireland and Online, pp. 100–114.
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark (2023) Self-refine: iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, Vol. 36.
- S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer (2022) Rethinking the role of demonstrations: what makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048–11064.
- P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar (2025) The illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity. In Advances in Neural Information Processing Systems, Vol. 38.
- K. Stechly, K. Valmeekam, and S. Kambhampati (2024) Chain of thoughtlessness? An analysis of CoT in planning. In Advances in Neural Information Processing Systems, Vol. 37.
- K. Stechly, K. Valmeekam, and S. Kambhampati (2025) On the self-verification limitations of large language models on reasoning and planning tasks. In The Thirteenth International Conference on Learning Representations.
- M. Turpin, J. Michael, E. Perez, and S. R. Bowman (2023) Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. In Thirty-seventh Conference on Neural Information Processing Systems.
- K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati (2023) On the planning abilities of large language models: a critical investigation. In Advances in Neural Information Processing Systems, Vol. 36.
- X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou (2022) Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35, pp. 24824–24837.
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan (2023a) Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, Vol. 36.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023b) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations.
## Appendix A Analysis: The Valid-to-Success Gap
While L-ICL dramatically improves constraint adherence, a gap often remains between validity and success. This gap is most pronounced in Full Sokoban, where L-ICL achieves 46% valid plans but only 20% success (Table 4). Understanding this gap illuminates both L-ICL's strengths and its limitations.
### A.1 Trap Rate Analysis
In Sokoban, certain states are traps: configurations from which the goal is unreachable regardless of future actions (e.g., a box pushed into a corner). We measure the adjusted trap rate: among valid plans, what fraction enters a trap state?
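As a minimal sketch (not the paper's evaluation code), the adjusted trap rate can be computed from per-plan flags as follows; the `Plan` record and its field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    valid: bool     # every action respected domain constraints
    hit_trap: bool  # trajectory entered an unsolvable state

def adjusted_trap_rate(plans):
    """Fraction of *valid* plans that enter a trap state."""
    valid_plans = [p for p in plans if p.valid]
    if not valid_plans:
        return 0.0
    return sum(p.hit_trap for p in valid_plans) / len(valid_plans)

# Example: 3 valid plans, 1 of which hits a trap -> 1/3.
plans = [Plan(True, False), Plan(True, True), Plan(False, False), Plan(True, False)]
print(adjusted_trap_rate(plans))
```

Conditioning on validity is the key design choice: invalid plans are excluded so the metric isolates strategic failures from constraint violations.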
Figure 7 shows that L-ICL reduces trap rates. On Sokoban Grid, the adjusted trap rate drops from 50% at $m{=}0$ to 10% at $m{=}210$ . This indicates that L-ICL teaches not only immediate constraint satisfaction but also some degree of trap avoidance.
However, the absolute trap rate remains non-negligible, and the valid-to-success gap persists. We hypothesize that trap avoidance requires multi-step lookahead that localized corrections cannot fully provide. A correction like "pushing box B east from (3,4) is valid" does not encode that this push leads to an unsolvable configuration three moves later. Addressing this limitation may require complementary approaches such as search or learned value functions.
<details>
<summary>graphs/misc/sokoban_trap_rate_adjusted.png Details</summary>

Line chart titled "Sokoban Gridworld: Adjusted Trap Rate". X-axis: training examples (0 to 240); y-axis: adjusted trap rate (%, 0 to 80). Two series with variance bands: "With Grid" (blue) and "Without Grid" (orange). Both start near 50% at 0 training examples, drop sharply by 30 examples, then fluctuate between roughly 10% and 30%, ending near 10–17% at 240 examples. The "Without Grid" band is generally wider, indicating greater variance.
</details>
Figure 7: Trap rate decreases with L-ICL. Adjusted trap rate (fraction of valid plans entering unsolvable states) on Sokoban Grid. L-ICL reduces trap rates from 50% to 10%, indicating partial learning of strategic constraints.
### A.2 Multi-Object State Tracking
Comparing Sokoban Grid (no boxes) to Full Sokoban reveals the cost of multi-object tracking. With identical spatial layouts, Sokoban Grid achieves 49% success while Full Sokoban reaches only 20%. The difference lies in state complexity: Full Sokoban requires tracking the agent position and all box positions, with constraints that depend on their joint configuration.
This difficulty is also evident in BlocksWorld, where every object is dynamic. L-ICL improves BlocksWorld success from 48% to 66%, but a gap remains between validity (68%) and success. The pattern suggests that relational constraint learning, while improved by L-ICL, remains more challenging than spatial constraint learning.
### A.3 Decomposing Planning Difficulty
The valid-to-success gap reveals a clean decomposition of planning difficulty:
1. Constraint satisfaction: Generating actions that respect domain physics. L-ICL addresses this effectively across all domains.
2. Strategic selection: Among valid actions, choosing those that lead toward the goal without entering traps. This requires multi-step reasoning that localized corrections do not directly provide.
This decomposition suggests a practical architecture: use L-ICL to harden constraint satisfaction, then layer strategic reasoning (search, learned policies, or hierarchical planning) on top. The hardened base ensures that any action proposed by the strategic layer is physically valid, separating concerns and simplifying both components.
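A hypothetical sketch of this architecture, with a strategic layer proposing candidate actions and a hardened constraint layer filtering them (all names here are illustrative, not the paper's implementation):

```python
def harden(propose_actions, is_valid_action):
    """Wrap a strategic proposer so only constraint-satisfying actions pass."""
    def step(state):
        for action in propose_actions(state):    # candidates, best first
            if is_valid_action(state, action):   # hardened constraint layer
                return action
        return None  # no valid candidate: signal the strategic layer to replan
    return step

# Toy example: agent at column x on an 8-wide row; eastward moves past the
# boundary are invalid.
proposer = lambda x: ["move_east"]                      # strategic layer: head east
valid = lambda x, a: not (a == "move_east" and x >= 8)  # constraint layer: stay in bounds
step = harden(proposer, valid)
print(step(3))  # "move_east"
print(step(8))  # None
```

The separation of concerns mirrors the decomposition above: the proposer encodes goal-directedness, while the validity check encodes domain physics.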
Table 5 summarizes the valid-to-success gaps across domains, highlighting where strategic failures dominate.
Table 5: Valid-to-success gap analysis across domains with L-ICL[ $m{=}60$ ]. Larger gaps indicate that constraint satisfaction alone is insufficient; strategic reasoning is the bottleneck.
| Domain | Valid (%) | Success (%) | Gap (pp) |
| --- | --- | --- | --- |
| 8 $\times$ 8 Grid | 89 | 89 | 0 |
| 10 $\times$ 10 Maze | 57 | 27 | 30 |
| Sokoban Grid | 63 | 49 | 14 |
| Full Sokoban | 46 | 20 | 26 |
| BlocksWorld | 68 | 66 | 2 |
The 8 $\times$ 8 gridworld shows no gap: once constraints are satisfied, the simple structure makes goal-reaching straightforward. The 10 $\times$ 10 maze and Full Sokoban show the largest gaps, reflecting the strategic complexity of navigating dead ends and avoiding irreversible trap states, respectively. BlocksWorld shows a small gap, suggesting that while relational constraints are harder to learn, once learned they suffice for task completion in our 5-block instances.
## Appendix B Out-of-Distribution Generalization
<details>
<summary>graphs/domains/gridworld-10x10-sokoban_grid.png Details</summary>

A 10 $\times$ 10 grid with coordinates from (1,1) at the bottom-left to (10,10) at the top-right; wall cells are rendered dark blue and open cells white.
</details>
(a) 10 $\times$ 10 maze (training distribution)
<details>
<summary>graphs/domains/gridworld-15x15-sokoban_grid.png Details</summary>

A 15 $\times$ 15 grid of cells; wall cells are rendered dark blue and open cells white.
</details>
(b) 15 $\times$ 15 maze (OOD evaluation)
Figure 8: Out-of-distribution generalization setup. L-ICL corrections are accumulated on 10 $\times$ 10 mazes (left) and evaluated on 15 $\times$ 15 mazes (right). The larger mazes contain positions not seen during training, yet corrections transfer substantially, despite the penalty for boundary violations differing.
A key question for any learning-based approach is whether acquired knowledge transfers beyond the training distribution. For L-ICL, this translates to: do corrections learned on smaller problem instances improve performance on larger, unseen instances? We investigate this by training L-ICL on 10 $\times$ 10 mazes and evaluating on 15 $\times$ 15 mazes, shown in Figure 8.
### B.1 Experimental Setup
We accumulate L-ICL corrections using the standard training procedure on 10 $\times$ 10 maze instances (Section 3.5). We then evaluate the resulting prompts on a held-out test set of 100 15 $\times$ 15 mazes. The larger mazes are generated using the same procedural algorithm (randomized depth-first search) with proportionally scaled wall density, but contain positions and path structures never seen during training.
### B.2 Results
Figure 9 shows that L-ICL corrections provide substantial transfer to larger instances. At $m{=}0$ (no corrections), the 15 $\times$ 15 maze achieves only 9% success, comparable to the 10 $\times$ 10 baseline without corrections. With corrections accumulated from 10 $\times$ 10 training instances, 15 $\times$ 15 success improves to 49% at $m{=}120$ , representing a 5 $\times$ improvement over the no-correction baseline.
<details>
<summary>graphs/misc/ood_10x10_to_15x15.png Details</summary>

Line chart titled "OOD Generalization: 10x10 -> 15x15 Transfer". X-axis: training examples (0 to 240); y-axis: success rate (%, 0 to 70). Two series with confidence bands: 10x10 in-distribution (blue) rises steeply and plateaus around 58–64% after roughly 150 examples; 15x15 OOD transfer (orange) rises from roughly 10%, peaks near 48% at 90 examples, then fluctuates between roughly 30% and 40%. The OOD band is wider, indicating greater variability.
</details>
Figure 9: Out-of-distribution transfer: 10 $\times$ 10 $\to$ 15 $\times$ 15. Corrections learned on 10 $\times$ 10 mazes transfer to larger instances, improving success from 9% to 49%. A gap remains compared to in-distribution performance (57% on 10 $\times$ 10), but transfer is substantial.
Table 6 summarizes the transfer results at key checkpoints.
Table 6: Out-of-distribution generalization: corrections trained on 10 $\times$ 10 mazes evaluated on 15 $\times$ 15 mazes. We report success rate (%) and compare to in-distribution 10 $\times$ 10 performance.
| Training Examples | 10 $\times$ 10 (in-dist.) | 15 $\times$ 15 (OOD) |
| --- | --- | --- |
| $m=0$ | 16 | 9 |
| $m=30$ | 21 | 18 |
| $m=60$ | 27 | 31 |
| $m=120$ | 57 | 49 |
### B.3 Why Does Transfer Work?
The transfer is notable because 15 $\times$ 15 mazes contain positions (e.g., $(12,14)$ ) and wall configurations that never appear in 10 $\times$ 10 training instances. We hypothesize that corrections transfer because they encode constraint types rather than specific positions.
Consider a correction like:

```
>>> get_applicable_actions((3, 4))
['move_north', 'move_south']
```
While this example specifies position $(3,4)$ , it implicitly teaches a general principle: when east and west are blocked (by walls or boundaries), only north and south are valid. The LLM can generalize this pattern to novel positions in larger grids.
This interpretation is supported by the observation that transfer improves with more corrections ( $m$ ). Early corrections address common constraint patterns (boundary violations, simple wall configurations); as $m$ increases, rarer patterns are covered, and the accumulated examples provide a richer specification that generalizes more robustly.
### B.4 Transfer Gap Analysis
While transfer is substantial, a gap remains between in-distribution and OOD performance (57% vs. 49% at $m{=}120$ ). We identify two contributing factors:
1. Unseen spatial configurations: Larger mazes contain junction types and corridor patterns that may not appear in smaller instances. Some constraint violations specific to these configurations are not addressed by 10 $\times$ 10 training.
2. Longer planning horizons: 15 $\times$ 15 mazes require longer plans, providing more opportunities for errors to accumulate. Even with improved per-step validity, the probability of completing an error-free trajectory decreases with plan length.
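The second factor can be made concrete with a simple calculation (under an illustrative independence assumption, not a measurement from our experiments): if each step of a plan is valid with probability $p$, then an $L$-step plan is error-free with probability

$$\Pr[\text{error-free plan}] = p^{L},$$

which decays exponentially in plan length. For instance, $p = 0.98$ gives $0.98^{20} \approx 0.67$ for a 20-step plan but only $0.98^{40} \approx 0.45$ for a 40-step plan.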
These findings suggest that for maximum OOD performance, practitioners should either (a) train on a mixture of problem sizes, or (b) accept a modest performance gap when deploying to larger instances than those seen during training.
### B.5 Cross-Domain Transfer
We also conducted preliminary experiments on cross-domain transfer: using corrections from one domain (e.g., 8 $\times$ 8 gridworld) to improve another (e.g., 10 $\times$ 10 maze). Results were mixed: corrections for basic movement constraints (boundary checking) transferred, but domain-specific spatial structures (two-room vs. maze corridors) did not. This suggests that L-ICL learns a combination of general procedural knowledge and domain-specific constraint instantiations, with only the former transferring across domains.
## Appendix C Domain Specifications
This appendix provides detailed specifications of the experimental domains used in our evaluation. For each domain, we describe the state representation, action space, constraints, and goal conditions.
### C.1 8 $\times$ 8 Two-Room Gridworld
State Space.
The state consists of the agent's $(x,y)$ position on an 8 $\times$ 8 grid. Coordinates range from $(1,1)$ at the bottom-left to $(8,8)$ at the top-right.
Environment Structure.
The grid is divided into two rooms by a vertical wall running through column 5, with a single doorway allowing passage between rooms (doorway position varies by instance). Start positions are randomly sampled from one room, and goal positions from the other, ensuring all paths must traverse the doorway.
Action Space.
Four actions: move_north $(+y)$ , move_south $(-y)$ , move_east $(+x)$ , and move_west $(-x)$ .
Constraints.
An action is valid if and only if:
1. The resulting position remains within grid bounds.
1. The movement does not cross a wall segment.
Goal Condition.
The agent's position equals the goal position.
Optimal Solution.
The shortest path between start and goal, computed via breadth-first search. Optimal paths typically require 8β12 steps.
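A minimal sketch of the validity check specified above, assuming walls are stored as blocked transitions between adjacent cells (the representation and helper are our own, not the paper's code):

```python
ACTIONS = {
    "move_north": (0, 1),
    "move_south": (0, -1),
    "move_east": (1, 0),
    "move_west": (-1, 0),
}

def get_applicable_actions(pos, walls, size=8):
    """Return actions whose result stays in bounds and crosses no wall.

    `walls` is a set of frozensets {cell_a, cell_b}, each marking a blocked
    transition between two adjacent cells.
    """
    x, y = pos
    applicable = []
    for name, (dx, dy) in ACTIONS.items():
        nx, ny = x + dx, y + dy
        in_bounds = 1 <= nx <= size and 1 <= ny <= size
        blocked = frozenset({pos, (nx, ny)}) in walls
        if in_bounds and not blocked:
            applicable.append(name)
    return applicable

# A wall segment between (4, 3) and (5, 3): moving east from (4, 3) is blocked.
walls = {frozenset({(4, 3), (5, 3)})}
print(get_applicable_actions((4, 3), walls))  # ['move_north', 'move_south', 'move_west']
```

Both constraint types from the specification appear as separate checks, which is also how a symbolic oracle can report which one a proposed action violated.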
<details>
<summary>graphs/domains/gridworld-8x8_grid.png Details</summary>

An 8 $\times$ 8 grid with axes labeled 1 to 8; wall cells are rendered as dark blue blocks on a white background.
</details>
Figure 10: Example 8 $\times$ 8 two-room gridworld instance. Walls are shown as filled cells.
### C.2 10 $\times$ 10 Maze
State Space.
The state consists of the agent's $(x,y)$ position on a 10 $\times$ 10 grid. Coordinates range from $(1,1)$ to $(10,10)$ .
Environment Structure.
Mazes are procedurally generated using a randomized depth-first search algorithm, producing a spanning tree of corridors with exactly one path between any two open cells. This ensures unique shortest paths and creates narrow corridors with dead ends that require backtracking if the agent makes suboptimal choices.
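Randomized depth-first search maze generation can be sketched as below. This is an illustrative sketch, not the authors' generator: the function name and the edge-set representation of open passages are our own.

```python
import random

def generate_maze(width, height, seed=None):
    """Carve a spanning tree of corridors via randomized depth-first search.
    Returns the open passages as frozenset({cell, neighbor}) edges; since the
    result is a spanning tree, any two cells are joined by exactly one path."""
    rng = random.Random(seed)
    passages, visited = set(), {(1, 1)}
    stack = [(1, 1)]
    while stack:
        x, y = stack[-1]
        unvisited = [(x + dx, y + dy)
                     for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0))
                     if 1 <= x + dx <= width and 1 <= y + dy <= height
                     and (x + dx, y + dy) not in visited]
        if unvisited:
            nxt = rng.choice(unvisited)   # carve a passage to a random neighbor
            passages.add(frozenset({(x, y), nxt}))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()                   # dead end: backtrack
    return passages
```

A spanning tree over $W \times H$ cells always has $WH - 1$ passages, which is why the text's uniqueness claim (exactly one path between any two open cells) holds by construction.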
Action Space.
Four actions: move_north, move_south, move_east, move_west.
Constraints.
Identical to the 8 $\times$ 8 gridworld: actions must keep the agent in bounds and cannot cross walls.
Goal Condition.
The agent's position equals the goal position.
Optimal Solution.
The unique shortest path through the maze.
Figure 11: Example 10 $\times$ 10 maze instance. The maze structure creates narrow corridors and dead ends, requiring longer plans than the two-room gridworld.
### C.3 Sokoban-Style Gridworld
State Space.
The state consists of the agent's $(x,y)$ position on a grid that uses Sokoban-style layouts. Coordinates are 1-indexed.
Environment Structure.
We use grid layouts from standard Sokoban benchmarks but remove all pushable boxes. The layouts retain walls, open floor cells, and the spatial structure of Sokoban puzzles, including irregular room shapes and narrow passages. This domain serves as an ablation to isolate the effect of Sokobanβs spatial complexity from the challenge of multi-object state tracking.
Action Space.
Four actions: move_north, move_south, move_east, move_west.
Constraints.
Actions must keep the agent within the walkable floor area and cannot cross walls.
Goal Condition.
The agent reaches a designated goal cell.
Figure 12: Example Sokoban-style gridworld instance. The layout is derived from a Sokoban puzzle but contains no pushable boxes, isolating spatial navigation from object manipulation.
### C.4 Full Sokoban
State Space.
The state consists of:
- The agent's $(x,y)$ position (1-indexed).
- The box position $(x,y)$ .
Our instances use 1 box.
Environment Structure.
Standard Sokoban puzzle layouts from established benchmarks, including walls, floor cells, and designated target locations where boxes must be placed.
Action Space.
Eight actions:
- Movement (move_north, move_south, move_east, move_west): move the agent one cell in the specified direction if the destination is empty floor.
- Pushing (push_north, push_south, push_east, push_west): move the agent into a cell containing a box, pushing the box one cell further in the same direction.
Constraints.
An action is valid if and only if:
1. Movement: The destination cell is within bounds, is not a wall, and does not contain a box.
1. Pushing: The cell adjacent to the agent contains a box, and the cell beyond the box (in the push direction) is within bounds, is not a wall, and does not contain another box.
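The two validity rules above can be sketched as a single check for our single-box instances. The function name, grid encoding, and wall representation are illustrative assumptions, not the paper's implementation.

```python
DIRS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def is_valid(action, agent, box, walls, size=10):
    """Check move_*/push_* preconditions from Section C.4 (one box)."""
    kind, _, direction = action.partition("_")   # e.g. "push_east"
    dx, dy = DIRS[direction]
    dest = (agent[0] + dx, agent[1] + dy)
    in_bounds = lambda p: 1 <= p[0] <= size and 1 <= p[1] <= size
    if kind == "move":
        # Destination must be in bounds, not a wall, and not occupied by the box.
        return in_bounds(dest) and dest not in walls and dest != box
    if kind == "push":
        # Agent steps into the box cell; the cell beyond must be free.
        beyond = (dest[0] + dx, dest[1] + dy)
        return dest == box and in_bounds(beyond) and beyond not in walls
    return False
```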
Irreversibility.
Unlike navigation domains, Sokoban contains trap states: configurations from which the goal is unreachable. Common traps include:
- Pushing a box into a corner (cannot be retrieved).
- Pushing a box against a wall such that it cannot reach any target.
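The corner trap admits a simple static test: a box not on a target with one vertical and one horizontal neighbor blocked can never be pushed again. This is a standard Sokoban deadlock check, sketched here under our own naming assumptions rather than taken from the paper.

```python
def is_corner_deadlock(box, walls, targets, size=10):
    """True if the box sits in a corner (one vertical and one horizontal
    side blocked) while not on a target, so the goal is unreachable."""
    if box in targets:
        return False
    blocked = lambda x, y: ((x, y) in walls
                            or not (1 <= x <= size and 1 <= y <= size))
    x, y = box
    vertical = blocked(x, y + 1) or blocked(x, y - 1)
    horizontal = blocked(x + 1, y) or blocked(x - 1, y)
    return vertical and horizontal
```

Note that this only detects corner traps; the wall-trap case in the second bullet requires reasoning about reachable targets along the wall and is not covered by this sketch.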
Goal Condition.
The box occupies the designated target position.
Figure 13: Example Sokoban instance. The agent must push the box onto the target location without creating deadlocks.
### C.5 BlocksWorld
State Space.
The state consists of a configuration of $n$ uniquely labeled blocks (we use $n=5$ in our experiments). Each block is either:
- On the table, or
- On top of exactly one other block.
A block is clear if no other block is on top of it. The table has unlimited capacity.
Action Space.
Three actions, described in natural language:
1. Move block from block to block (move-b-to-b): Pick up a block that is currently sitting on top of another block and place it onto a third block. This requires that the block being moved has nothing on top of it (is clear) and that the destination block also has nothing on top of it (is clear). After the move, the block that was underneath the moved block becomes clear.
1. Move block from block to table (move-b-to-t): Pick up a block that is currently sitting on top of another block and place it on the table. This requires that the block being moved has nothing on top of it (is clear). After the move, the block that was underneath becomes clear, and the moved block is now on the table.
1. Move block from table to block (move-t-to-b): Pick up a block that is currently on the table and place it onto another block. This requires that both the block being moved and the destination block have nothing on top of them (are clear). After the move, the destination block is no longer clear.
Constraints.
The preconditions for each action are:
- move-b-to-b( $b_{m}$ , $b_{f}$ , $b_{t}$ ): Block $b_{m}$ is clear, block $b_{t}$ is clear, $b_{m}$ is currently on $b_{f}$ , and $b_{m}\neq b_{t}$ .
- move-b-to-t( $b_{m}$ , $b_{f}$ ): Block $b_{m}$ is clear and $b_{m}$ is currently on $b_{f}$ .
- move-t-to-b( $b_{m}$ , $b_{t}$ ): Block $b_{m}$ is clear, block $b_{t}$ is clear, $b_{m}$ is currently on the table, and $b_{m}\neq b_{t}$ .
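The three precondition lists above can be sketched as one checker. We represent a state as a mapping from each block to what it rests on (the string "table" or another block); this encoding and the function names are our own illustrative choices.

```python
def is_clear(state, block):
    """A block is clear if no other block rests on it."""
    return all(support != block for support in state.values())

def check_preconditions(state, action, *args):
    """Preconditions for the three BlocksWorld actions (Section C.5)."""
    if action == "move-b-to-b":
        bm, bf, bt = args
        return (is_clear(state, bm) and is_clear(state, bt)
                and state[bm] == bf and bm != bt)
    if action == "move-b-to-t":
        bm, bf = args
        # bf != "table" is implied by "from block" in the action name.
        return is_clear(state, bm) and state[bm] == bf and bf != "table"
    if action == "move-t-to-b":
        bm, bt = args
        return (is_clear(state, bm) and is_clear(state, bt)
                and state[bm] == "table" and bm != bt)
    return False
```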
Goal Condition.
The block configuration matches a target specification, typically given as a set of on( $b_{1}$ , $b_{2}$ ) predicates describing which blocks must be stacked on which.
Differences from Navigation Domains.
BlocksWorld differs qualitatively from the grid-based domains:
- No spatial structure: Constraints are purely relational ("block A is on block B") rather than geometric.
- All objects are dynamic: Every block can be moved, unlike navigation where only the agent moves.
- Algorithmic solutions: We additionally provide an algorithmic sketch (the Universal Blocksworld Algorithm (Stechly et al., 2024)) to test whether L-ICL can improve adherence to prescribed planning strategies.
## Appendix D Baseline Method Implementations
This appendix provides detailed specifications of all baseline methods evaluated in our experiments. All baselines operate on the same task: given a problem description with start position, goal position, walls, and (optionally) deadzones, produce an action sequence to navigate from start to goal. We organize baselines into two categories: prompt-only methods that rely solely on LLM reasoning, and oracle methods that receive feedback from a ground-truth simulator.
### D.1 Prompt-Only Baselines
#### D.1.1 Zero-Shot Chain-of-Thought (Zero-Shot CoT)
The simplest baseline provides the model with task instructions, and asks it to reason step-by-step to produce a navigation plan.
Implementation.
The prompt includes: (1) a task description explaining gridworld navigation, valid actions, and movement constraints; (2) an ASCII representation of the problem (if applicable); (3) the query problem with start/goal coordinates; and (4) output format instructions requiring **Final Action Sequence:** action1, action2, .... We use temperature 1.0 for all experiments unless otherwise noted.
#### D.1.2 RAG-CoT (Retrieval-Augmented Chain-of-Thought)
This baseline extends Zero-Shot CoT with dynamic example selection based on similarity to the query problem. We retrieve the most relevant training examples within a character budget (10,000 or 20,000 characters in our experiments).
Similarity Metric.
We compute similarity based on Manhattan distance between start-goal pairs:
$$
\text{similarity}(q,c)=\frac{1}{1+|d_{q}-d_{c}|} \tag{1}
$$
where $d=|g_{x}-s_{x}|+|g_{y}-s_{y}|$ is the Manhattan distance from start $s$ to goal $g$ . This metric prefers training examples with similar navigation distances, under the assumption that problems with similar start-to-goal distances share structural similarities.
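Equation (1) is direct to implement. A minimal sketch, where a problem is represented as a (start, goal) coordinate pair (our own convention):

```python
def manhattan(start, goal):
    """Manhattan distance |g_x - s_x| + |g_y - s_y| from the definition of d."""
    return abs(goal[0] - start[0]) + abs(goal[1] - start[1])

def similarity(query, candidate):
    """Equation (1): problems with closer start-to-goal distances score higher;
    identical distances give the maximum similarity of 1."""
    dq = manhattan(*query)       # query = (start, goal)
    dc = manhattan(*candidate)
    return 1.0 / (1 + abs(dq - dc))
```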
Retrieval Modes.
We evaluate three retrieval strategies:
- Strict: Add examples until the budget would be exceeded (conservative).
- Generous: Add examples until the budget is just crossed (permissive).
- Fixed: Include fixed examples plus retrieved examples up to the remaining budget.
Our main experiments use the generous mode.
#### D.1.3 Self-Consistency
Self-Consistency (Wang et al., 2023) generates multiple independent reasoning trajectories and selects the final answer via majority voting.
Implementation.
We sample $k=5$ independent CoT traces using temperature 1.0 for diversity. Each sample uses the same prompt with an annotation indicating the sample number (e.g., "Sample 3/5"). We parse action sequences from each sample, count votes for each unique plan (exact sequence match), and select the plan with the highest vote count. For tie-breaking, we use an additional LLM call to evaluate candidates based on their self-critique annotations.
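The voting step can be sketched as follows. This is an illustrative sketch: on a tie we simply return all tied plans, standing in for the paper's extra LLM tie-break call.

```python
from collections import Counter

def majority_vote(plans):
    """Select the most frequent plan under exact sequence match.
    Returns a single plan, or the list of tied plans for a separate
    tie-break step (the paper uses an additional LLM call)."""
    votes = Counter(tuple(p) for p in plans)
    top = max(votes.values())
    winners = [list(p) for p, v in votes.items() if v == top]
    return winners[0] if len(winners) == 1 else winners
```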
Self-Critique.
Each sample includes a self-critique section where the model evaluates its own reasoning, providing confidence estimates and noting potential issues. This information is used only for tie-breaking.
#### D.1.4 Self-Refine
Self-Refine (Madaan et al., 2023) allows the model to iteratively review and improve its own solutions without external feedback.
Implementation.
The model generates an initial attempt, then receives up to $N=5$ refinement opportunities. In each refinement round, the model sees its previous response and is instructed to check for potential mistakes: boundary violations, wall collisions, goal reachability, deadzone avoidance, and path optimality. The model may either provide a corrected plan or explicitly state "**No further refinement needed.**"
Termination Conditions.
Refinement stops when: (1) the model explicitly states satisfaction, or (2) the maximum number of refinement steps is reached.
Key Distinction.
Unlike oracle baselines, Self-Refine receives no external feedback about plan validity. The model must introspect on its own reasoning, which prior work has shown to be unreliable for planning tasks (Stechly et al., 2025).
#### D.1.5 ReAct (Prompt-Only)
ReAct (Yao et al., 2023b) interleaves reasoning and action selection in a textual trace format.
Implementation.
The model alternates between Thought: steps (reasoning about current state and next move) and Action: steps (single movement action). All reasoning and actions are generated in a single LLM call; no external tool execution occurs. The prompt includes guidelines to keep moves consistent with the grid layout, avoid illegal steps, provide reasoning before each action, and end with an explicit final action sequence.
Trace Format.
```
Thought: [analyze current state and next move]
Action: move-direction
Thought: [continue reasoning]
Action: move-direction
...
Final Thought: [summarize path to goal]
**Final Action Sequence:** action1, action2, ...
```
#### D.1.6 Tree-of-Thoughts (Prompt-Only)
Tree-of-Thoughts (ToT) (Yao et al., 2023a) explores multiple reasoning paths through iterative expansion and scoring.
Implementation.
We use a prompt-only variant with breadth-first tree search. Parameters: breadth $b=5$ (nodes per level), depth $d=3$ (expansion rounds), max step actions $m=8$ (actions per candidate). At each depth level, the model generates candidate continuations as JSON, including a thought description, proposed actions, confidence score (0–100), terminal flag, and optional final plan. We keep the top- $k$ nodes by confidence (beam search) and finalize by selecting the best terminal node or completing the top non-terminal node.
Scoring.
Without oracle access, scoring uses only LLM self-assessed confidence and plan length:
$$
\text{score}=(\text{confidence},-\text{plan\_length}) \tag{2}
$$
Higher confidence is preferred; shorter plans are preferred as a tiebreaker.
### D.2 Oracle Baselines
Oracle baselines have access to a ground-truth environment simulator that provides feedback on plan validity. This represents an upper bound on what prompt-only methods could achieve with perfect self-verification.
#### D.2.1 ReAct (+Oracle f/b)
This two-step approach allows the model to receive targeted feedback about errors and produce a corrected plan.
Implementation.
Step 1: Generate an initial CoT plan. Step 2: If errors are detected by the oracle, provide specific feedback and request correction. Maximum 2 LLM calls per problem. We use temperature 0.3 for more deterministic outputs in the correction step.
Feedback Types.
The oracle provides two types of feedback:
1. Invalid Move: "Your plan has an ERROR at step $N$ . The action 'move-X' at position $(x,y)$ is INVALID because it would move into a wall or out of bounds."
1. Incomplete Path: "Your plan is INCOMPLETE. After executing all $N$ actions, you ended at position $(x,y)$ but did not reach the goal."
Key Distinction.
ReAct (+Oracle f/b) queries the verifier at test time for each proposed plan, while L-ICL uses the oracle only during training. At inference, L-ICL requires a single forward pass with no external dependencies.
### D.3 Evaluation Infrastructure
Action Parsing.
All baselines use the same action sequence parser that handles multiple output formats. We search for explicit patterns (e.g., **Final Action Sequence:**), fall back to lines with comma-separated actions, and as a last resort extract all move-* actions from the response. Actions are normalized to canonical form (e.g., "north" $\to$ "move-north").
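The marker-then-fallback parsing strategy can be sketched as below. This is a simplified illustration of the described behavior, not the authors' parser; it collapses the comma-separated fallback and the move-* extraction into one regex pass.

```python
import re

CANONICAL = {"north": "move-north", "south": "move-south",
             "east": "move-east", "west": "move-west"}

def parse_actions(response):
    """Extract an action sequence: prefer the text after an explicit
    'Final Action Sequence:' marker, otherwise scan the whole response;
    directions are normalized to canonical move-<dir> form."""
    match = re.search(r"Final Action Sequence:\s*(.+)", response)
    text = match.group(1) if match else response
    tokens = re.findall(r"move[-_ ]?(north|south|east|west)", text, re.IGNORECASE)
    return [CANONICAL[t.lower()] for t in tokens]
```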
Plan Validation.
Plan validity is determined by simulating the action sequence:
1. Parse problem to extract start/goal positions and obstacles.
1. Execute actions sequentially, checking bounds and wall collisions.
1. Verify goal is reached.
1. Compare plan length to BFS optimal.
### D.4 Summary of Baseline Characteristics
Table 7 summarizes the key characteristics distinguishing each baseline.
Table 7: Summary of baseline characteristics. βExamplesβ indicates whether the method uses in-context examples; βLLM Callsβ indicates calls per problem; βToolsβ indicates whether external tools are used.
| Method | Examples | LLM Calls | Tools |
| --- | --- | --- | --- |
| Zero-Shot CoT | No | 1 | None |
| RAG-CoT | Retrieved | 1 | None |
| Self-Consistency | No | $k$ | None |
| Self-Refine | No | up to $N{+}1$ | None |
| ReAct (prompt-only) | No | 1 | None |
| Tree-of-Thoughts | No | $O(b\cdot d)$ | None |
| ReAct (+Oracle f/b) | No | $\leq 2$ | Oracle |
## Appendix E Implementation Details
This appendix provides detailed implementation specifications for L-ICL, including the system architecture, correction generation process, and evaluation pipeline. We provide sufficient detail for reproducing our experiments.
### E.1 System Architecture
Our implementation consists of four main components that work together to execute the L-ICL training loop:
1. Partial Program Generator: Constructs PTP-style prompts with subroutine documentation and accumulated corrections formatted as input-output examples.
1. LLM Interface: Sends prompts to language models and parses structured traces from responses.
1. Evaluation Engine: Validates generated plans using external tools and step-by-step simulation, identifying the first point of failure.
1. Correction Accumulator: Extracts corrections from evaluation mismatches and injects them into subsequent prompts.
Figure 14 illustrates the data flow between these components during L-ICL training.
Partial Program Generator
LLM Interface
Evaluation Engine
Correction Accumulator prompt trace corrections update
Figure 14: System architecture for L-ICL training. The loop iterates over training problems, accumulating corrections that progressively refine the prompt.
### E.2 Subroutine Specifications by Domain
Each domain defines a set of planning primitives that the LLM must "implement" through trace generation. We describe the subroutines for each domain, including their signatures, semantics, and the constraints they encode.
#### E.2.1 Subroutines
We use the following subroutines in our experiments:
State Extraction.
- extract_initial_state(problem) $\to$ State: Parses the problem description to extract the agentβs starting position and environment structure.
- extract_goal(problem) $\to$ State: Parses the goal specification from the problem.
Action Generation.
- get_applicable_actions(state, goal) $\to$ Set[Action]: Returns the set of actions that can be legally executed from the current state. For navigation, this filters the four cardinal directions to exclude moves that would exit the grid or collide with walls.
- get_optimal_actions(state, goal) $\to$ Set[Action]: Returns the subset of applicable actions that lie on an optimal path to the goal. This is computed using shortest-path algorithms or a planner. For BlocksWorld, we replace this with get_recommended_actions(state, goal) $\to$ Set[Action], which returns the set of actions prescribed by the Universal Blocksworld Algorithm.
State Transition and Goal Test.
- apply_action(state, action) $\to$ State: Returns the state resulting from executing the action. For navigation, this updates the agent's coordinates.
- at_goal(state, goal) $\to$ bool: Returns whether the current state satisfies the goal condition.
### E.3 Correction Format and Integration
L-ICL corrections are formatted as doctest-style input-output examples that are injected into subroutine documentation. This format is well-represented in LLM training data, facilitating generalization.
Correction Structure.
Each correction consists of three components:
1. Function identifier: Which subroutine the correction applies to.
1. Input: The arguments that triggered the mismatch.
1. Correct output: The oracle-provided ground truth.
Example Correction.
Consider an LLM that incorrectly proposes moving east from position $(3,4)$ when a wall blocks that direction. The evaluation detects that move_east is not in the set of applicable actions. L-ICL generates a correction:
```
>>> get_applicable_actions(state=(3,4), walls={(3,5)})
{'move_north', 'move_south', 'move_west'}
```
This correction is inserted into the documentation for get_applicable_actions, providing an explicit example that eastward movement from $(3,4)$ is invalid.
Correction Accumulation.
Corrections accumulate across training problems. When a new correction duplicates an existing one (same function and inputs), we retain only one copy. This prevents prompt bloat while ensuring coverage of diverse failure cases. The accumulated corrections are batch-inserted into the prompt template before each evaluation iteration.
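The accumulation-with-deduplication step can be sketched as a dictionary keyed by (function, inputs). This is an illustrative sketch under our own representation: inputs and expected outputs are stored as pre-rendered strings.

```python
def add_correction(corrections, function, inputs, output):
    """Store a correction keyed by (function, inputs); a duplicate mismatch
    (same function, same inputs) is kept only once to limit prompt bloat."""
    corrections.setdefault((function, inputs), output)
    return corrections

def format_doctests(corrections):
    """Render accumulated corrections as doctest lines for batch insertion
    into the subroutine documentation."""
    lines = []
    for (function, inputs), output in sorted(corrections.items()):
        lines += [f">>> {function}({inputs})", output]
    return "\n".join(lines)
```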
Additional details are in Section F.3.
### E.4 Evaluation Pipeline
Plan evaluation proceeds through multiple validation stages, each providing increasingly detailed feedback.
#### E.4.1 External Plan Validation
We use the VAL validator (Howey et al., 2004), the standard tool for PDDL plan validation, to verify plan correctness. Given a domain specification, problem instance, and proposed action sequence, VAL checks:
- Each action's preconditions are satisfied when it is executed.
- The final state after executing all actions satisfies the goal.
VAL provides a binary validity judgment and, for invalid plans, identifies the first action whose preconditions fail.
#### E.4.2 Optimality Verification
To assess plan quality, we compute optimal solutions using the Fast Downward planning system (Helmert, 2006). Fast Downward is a state-of-the-art classical planner that guarantees optimal solutions when configured with admissible heuristics. We use the A* search algorithm with the LM-cut heuristic.
For each problem, we:
1. Run Fast Downward to obtain the optimal plan length.
1. Compare the LLM's plan length against this baseline.
1. Mark plans as optimal if lengths match and the plan is valid.
#### E.4.3 Step-by-Step Simulation
Beyond end-to-end validation, we simulate plan execution step-by-step using proxy implementations of each subroutine. This enables:
1. First-failure identification: We identify the exact step where the LLM's trace first diverges from ground truth, enabling localized correction generation.
1. Fine-grained error categorization: We distinguish between:
- Applicability errors: Proposing an action not in the applicable set.
- Optimality errors: Proposing an applicable but suboptimal action.
1. Correction generation: For each error type, we generate the corresponding correction by querying the oracle for the correct output.
Algorithm 2 provides pseudocode for the step-by-step evaluation procedure.
Algorithm 2 Step-by-Step Plan Evaluation
Require: Domain $\mathcal{D}$ , problem $P$ , predicted actions $[a_{1},\ldots,a_{n}]$ , oracle $\mathcal{O}$
Ensure: Evaluation result with corrections
$s\leftarrow\mathcal{O}.\text{extract\_initial\_state}(P)$
$g\leftarrow\mathcal{O}.\text{extract\_goal}(P)$
corrections $\leftarrow[]$
first_invalid $\leftarrow$ null
first_suboptimal $\leftarrow$ null
for $i=1$ to $n$ do
$A_{\text{applicable}}\leftarrow\mathcal{O}.\text{get\_applicable\_actions}(s,g)$
$A_{\text{optimal}}\leftarrow\mathcal{O}.\text{get\_optimal\_actions}(s,g)$
if $a_{i}\notin A_{\text{applicable}}$ and first_invalid is null then
first_invalid $\leftarrow i$
corrections.append $(\text{``get\_applicable\_actions''},(s,g),A_{\text{applicable}})$
break $\triangleright$ Stop at first invalid action
else if $a_{i}\notin A_{\text{optimal}}$ and first_suboptimal is null then
first_suboptimal $\leftarrow i$
corrections.append $(\text{``get\_optimal\_actions''},(s,g),A_{\text{optimal}})$
end if
$s\leftarrow\mathcal{O}.\text{apply\_action}(s,a_{i})$
end for
goal_reached $\leftarrow\mathcal{O}.\text{at\_goal}(s,g)$
return {first_invalid, first_suboptimal, corrections, goal_reached}
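Algorithm 2 translates directly to Python. The sketch below assumes an oracle object exposing the five subroutines of Section E.2 (the interface is ours, inferred from the pseudocode, not the authors' code):

```python
def evaluate_plan(oracle, problem, actions):
    """Algorithm 2: stop at the first inapplicable action, record the first
    suboptimal one, and emit oracle-backed corrections for both."""
    s = oracle.extract_initial_state(problem)
    g = oracle.extract_goal(problem)
    corrections, first_invalid, first_suboptimal = [], None, None
    for i, a in enumerate(actions, start=1):
        applicable = oracle.get_applicable_actions(s, g)
        optimal = oracle.get_optimal_actions(s, g)
        if a not in applicable:
            first_invalid = i
            corrections.append(("get_applicable_actions", (s, g), applicable))
            break  # stop at first invalid action; state is not advanced
        if a not in optimal and first_suboptimal is None:
            first_suboptimal = i
            corrections.append(("get_optimal_actions", (s, g), optimal))
        s = oracle.apply_action(s, a)
    return {"first_invalid": first_invalid,
            "first_suboptimal": first_suboptimal,
            "corrections": corrections,
            "goal_reached": oracle.at_goal(s, g)}
```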
### E.5 Oracle Implementation
The oracle provides ground-truth outputs for each subroutine. We implement oracles using a combination of external planning tools and logic-based simulation.
Planning Tools.
We use Fast Downward (Helmert, 2006) for optimal plan computation and action applicability. For domains requiring multiple optimal plans (to compute optimal action sets), we additionally use the K* planner (Katz and Lee, 2023), which enumerates the top- $k$ shortest plans.
State Simulation.
Action effects are computed using the Tarski planning library (FrancΓ©s et al., 2018), which provides PDDL parsing and grounded action simulation. Given a PDDL domain and problem, Tarski computes:
- The set of ground actions applicable in any state.
- The successor state resulting from applying an action.
- Whether a state satisfies a goal formula.
Optimality Computation.
Computing optimal actions (those on some optimal path) requires enumerating multiple optimal plans. We use K* to generate all plans of optimal length, then take the union of first actions across these plans. For efficiency, we cache optimal action sets for frequently-queried states.
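For unit-cost navigation domains, the same optimal action set can be obtained without plan enumeration: an action is on some shortest path exactly when it moves one step closer to the goal under BFS distances. This is a BFS stand-in for the K*-based computation described above, with our own function names:

```python
from collections import deque

def optimal_actions(state, goal, neighbors):
    """Actions lying on some shortest path: those whose successor is exactly
    one step closer to the goal. `neighbors(cell)` yields (action, next_cell)
    pairs over the move graph."""
    # BFS distances from the goal (moves are reversible in navigation domains).
    dist = {goal: 0}
    frontier = deque([goal])
    while frontier:
        cell = frontier.popleft()
        for _, nxt in neighbors(cell):
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                frontier.append(nxt)
    if state not in dist:
        return set()  # goal unreachable from this state
    return {a for a, nxt in neighbors(state)
            if nxt in dist and dist[nxt] == dist[state] - 1}
```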
### E.6 Prompt Construction
The final prompt sent to the LLM consists of four components assembled in sequence:
1. Task description: Natural language explanation of the planning domain, valid actions, and objective.
1. Subroutine documentation: For each subroutine, we include:
- Function signature with typed arguments and return type.
- Natural language description of the subroutine's purpose.
- Accumulated L-ICL corrections as doctest examples.
1. Example traces: A small number ( $k=2$ – $3$ ) of complete reasoning traces showing how subroutines are invoked to solve example problems.
1. Query problem: The problem instance to solve, formatted consistently with the examples, followed by instructions to produce a trace.
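The four-component assembly can be sketched as plain string concatenation. The data layout below (a mapping from subroutine name to signature, description, and correction doctests) is our own illustrative choice:

```python
def build_prompt(task_description, subroutines, example_traces, query):
    """Assemble the four prompt components of Section E.6 in order.
    `subroutines` maps name -> (signature, description, doctest corrections).
    Implementations are withheld with '...' in the PTP style."""
    parts = [task_description]
    for name, (signature, description, corrections) in subroutines.items():
        doc = [f"def {signature}:", f'    """{description}']
        doc += [f"    {line}" for c in corrections for line in c.splitlines()]
        doc.append('    """')
        doc.append("    ...")  # implementation withheld
        parts.append("\n".join(doc))
    parts.extend(example_traces)
    parts.append(query)
    return "\n\n".join(parts)
```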
State Representation.
For grid-based domains, we evaluate two state representations:
- Textual: Positions as coordinates (e.g., βagent at (3,4)β) with walls listed explicitly.
- ASCII: Visual grid representation where walls are marked characters and open cells are spaces.
Our ablation (Section 4.3) shows that L-ICL achieves comparable peak performance with either representation, though ASCII grids accelerate early learning.
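A minimal sketch of the ASCII rendering (borders and coordinate labels used in the paper's prompts are omitted for brevity; names are illustrative):

```python
def render_ascii(size, walls):
    """Render a square grid with '#' for walls and '.' for open cells.

    Rows are emitted from y = size down to y = 1 so that north points up,
    matching the grid layouts shown in the prompts."""
    rows = []
    for y in range(size, 0, -1):
        cells = ["#" if (x, y) in walls else "." for x in range(1, size + 1)]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join(rows)
```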
### E.7 Experimental Infrastructure
Hardware.
Experiments were conducted on a Linux workstation with 32GB RAM. External planning tools (Fast Downward, VAL) were run locally. LLM inference was performed via API calls.
LLM Services.
We evaluated models through their respective APIs:
- DeepSeek V3 and V3.1 via the DeepSeek API.
- Claude Haiku 4.5 and Claude Sonnet 4.5 via the Anthropic API.
Hyperparameters.
Unless otherwise specified:
- Temperature: 1.0 (following DeepSeek recommendations).
- Maximum generation length: 32000 tokens.
- Training examples per iteration: 1 (single problem per L-ICL update).
- Total training problems: up to 240.
- Thinking tokens for Sonnet 4.5: 10k.
- Thinking tokens for Haiku 4.5: 5k.
Timeout Handling.
Fast Downward was given a 60-second timeout per problem. Problems exceeding this limit were marked as having unknown optimal cost and excluded from optimality statistics (but included in validity statistics if the validator succeeded).
## Appendix F Representative Prompts
This appendix provides representative prompts used in all experiments. We organize prompts into two categories: (1) L-ICL prompts based on Program Trace Prompting (PTP), and (2) baseline method prompts used for comparison approaches. All prompts use template variables (denoted with curly braces, e.g., {partial_program}) that are replaced with problem-specific content at runtime.
### F.1 L-ICL Prompts (Program Trace Prompting)
L-ICL prompts follow the Program Trace Prompting (PTP) format, where the LLM is asked to predict the output of a partially specified program. The key insight is that by withholding subroutine implementations (replacing them with "..." markers), the LLM must infer correct behavior from documentation and accumulated examples.
#### F.1.1 Base L-ICL Prompt (No Domain Visualization)
This is the minimal L-ICL prompt used for gridworld and Sokoban navigation tasks when no ASCII grid visualization is provided. The LLM must infer spatial constraints purely from accumulated L-ICL corrections.
````
Consider the program fragment below. This program fragment is
incomplete, with key parts of the implementation hidden by
replacing them with "..." markers.

PROGRAM:
```python
{partial_program}
```

QUESTION: Predict what the output of the program above will be,
given the input shown below.

Respond with the FULL program output, and ONLY the expected
program output: you will be PENALIZED if you introduce any
additional explanatory text.

```
>>> {task_name}({input_str})
```
````
Template Variables.
- {partial_program}: The PTP-style program with subroutine signatures, documentation, doctest examples (including L-ICL corrections), and "..." placeholders for implementations.
- {task_name}: The function name to invoke (e.g., solve_gridworld).
- {input_str}: The problem specification as a string (e.g., start position, goal position, wall locations).
#### F.1.2 L-ICL Prompt with ASCII Grid Visualization
When ASCII grid visualization is enabled, the prompt includes a visual representation of the environment. This provides spatial scaffolding that accelerates early learning, though L-ICL achieves comparable peak performance without it.
````
Consider the program fragment below. This program fragment is
incomplete, with key parts of the implementation hidden by
replacing them with "..." markers.

IMPORTANT: You are an agent navigating a {grid_size} gridworld.
The grid has {num_walls} walls that block movement.

**Grid Layout:**
```
     1   2   3   4   5   6   7   8   9  10
   +---+---+---+---+---+---+---+---+---+---+
10 | . | . | . | . | . | . | . | . | . | . |
   +---+---+---+---+---+---+---+---+---+---+
 9 | . | . | # | # | # | # | # | # | # | . |
   +---+---+---+---+---+---+---+---+---+---+
 8 | . | # | . | # | . | # | . | . | . | . |
   +---+---+---+---+---+---+---+---+---+---+
 7 | . | # | . | . | . | # | . | # | # | . |
   +---+---+---+---+---+---+---+---+---+---+
 6 | . | # | . | # | . | . | . | # | . | . |
   +---+---+---+---+---+---+---+---+---+---+
 5 | . | . | . | # | . | # | . | # | . | . |
   +---+---+---+---+---+---+---+---+---+---+
 4 | . | # | . | # | . | # | . | . | # | . |
   +---+---+---+---+---+---+---+---+---+---+
 3 | . | # | . | . | . | # | . | # | . | . |
   +---+---+---+---+---+---+---+---+---+---+
 2 | . | # | . | # | . | . | . | # | . | . |
   +---+---+---+---+---+---+---+---+---+---+
 1 | . | . | . | . | . | . | . | . | . | . |
   +---+---+---+---+---+---+---+---+---+---+
```

PROGRAM:
```python
{partial_program}
```

QUESTION: Predict what the output of the program above will be,
given the input shown below.

Respond with the FULL program output, and ONLY the expected
program output: you will be PENALIZED if you introduce any
additional explanatory text.

```
>>> {task_name}({input_str})
```
````
Grid Symbols.
- `.`: Open cell (traversable)
- `#`: Wall (impassable)
- `$`: Box (in Sokoban)
#### F.1.3 L-ICL BlocksWorld Prompt with UBW Algorithm
For BlocksWorld, we additionally provide algorithmic guidance based on the Universal Blocks World (UBW) algorithm (Stechly et al., 2024). This tests whether L-ICL can improve adherence to prescribed planning strategies beyond simple constraint satisfaction.
````
Consider the program fragment below. This program fragment
implements the Universal Blocks World (UBW) algorithm, which is
a systematic two-phase approach for solving blocks world planning
problems. The implementation is incomplete, with key parts
replaced by "..." markers.

UNIVERSAL BLOCKS WORLD ALGORITHM OVERVIEW:
The UBW algorithm works in two distinct phases to efficiently
solve any blocks world configuration:

PHASE 1: STRATEGIC UNSTACKING
- Unstack ALL blocks that are stacked on top of others
- Work from top to bottom, unstacking clear blocks first
- Move incorrectly positioned blocks to the table

PHASE 2: SYSTEMATIC REASSEMBLY
- Build goal configurations from bottom up
- Process blocks in dependency order (place supporting blocks
  before supported blocks)
- Only place a block when its target is ready (clear and in
  final position)
- Ensure structural integrity throughout construction

KEY HEURISTICS FOR IMPLEMENTATION:

1. STATE ANALYSIS:
   - Parse predicates into on(), on-table(), and clear()
     relationships
   - Build dependency graphs: what should be on what
   - Identify bottom blocks (blocks that should be on table
     in goal)

2. UNSTACKING STRATEGY:
   - Check each on(X, Y) relationship in current state
   - If (X, Y) is NOT in goal relationships, consider
     unstacking X
   - Only unstack if X is clear (no blocks on top)
   - Priority: unstack blocks that block other necessary moves

3. REASSEMBLY STRATEGY:
   - For each goal on(X, Y), check if X can be placed on Y
   - X must be: clear AND on-table
   - Y must be: clear AND in its final position
   - Y is in final position if: Y should be on table OR Y is
     already correctly placed on its target

4. ACTION SELECTION LOGIC:
```
For unstacking: if on(X, Y) in current state AND clear(X):
    return move-b-to-t(X, Y)
For assembly: if goal requires on(X, Y) AND
              can_place_block(X, Y):
    return move-t-to-b(X, Y)
```

5. CORRECTNESS VERIFICATION:
   - Always verify preconditions before suggesting actions
   - Check that actions don't break existing correct
     configurations
   - Ensure goal-directed progress in every move during
     assembly phase

DETAILED TRACE GUIDANCE:
When implementing the UBW algorithm, provide step-by-step
reasoning inside reasoning() calls if required, which serve as
your scratchpad.
1. State the current configuration clearly
2. Identify which phase you're in (unstacking vs assembly)
3. Explain WHY each action is chosen based on UBW principles
4. Show how the action advances toward the goal
5. Verify preconditions are satisfied
6. Update state representation after each action

PROGRAM:
```python
{partial_program}
```

QUESTION: Predict what the output of the program above will be,
given the input shown below.

IMPLEMENTATION REQUIREMENTS:
- Follow the UBW algorithm phases strictly
- Provide detailed reasoning for each action selection
- Show state analysis and dependency tracking
- Explain how each move contributes to the overall strategy
- Demonstrate understanding of when to unstack vs when to build
- Verify that all actions follow UBW heuristics

Respond with the FULL program output, including detailed
algorithmic traces that demonstrate proper UBW implementation.
Your trace should show:
- Clear identification of current phase (unstacking / assembly)
- Specific reasoning for each action choice
- State updates and goal progress tracking
- Verification that actions follow UBW principles

Under no circumstances should you skip steps in the program
output. You CAN decide to go back and choose different actions
if you feel that you have made a mistake, but the FINAL PLAN
must show the COMPLETE CORRECT PATH ONLY.

```
>>> {task_name}({input_str})
```
````
#### F.1.4 Example Partial Program Structure
The {partial_program} template variable is replaced with a PTP-style program containing subroutine documentation and accumulated L-ICL corrections. Below is a representative example for gridworld navigation:
```python
import collections
from typing import Dict, List, Set, Tuple, Union, Optional, Any, FrozenSet

PlanningState = Any
Action = Any

@traced
def extract_problem(input_str: str) -> str:
    """Extract a standardized problem description from input."""
    ...

@traced
def extract_initial_state(problem_str: str) -> PlanningState:
    """Extract the initial state from a problem description."""
    ...

@traced
def extract_goal(problem_str: str) -> PlanningState:
    """Extract the goal from a problem description."""
    ...

@traced
def at_goal(state: PlanningState, goal: PlanningState) -> bool:
    """Check if current state satisfies goal conditions."""
    ...

@traced
def get_applicable_actions(state: PlanningState, goal: PlanningState) -> Set[Action]:
    """Get all applicable actions in the current state."""
    ...

@traced
def get_optimal_actions(state: PlanningState, applicable_actions: List[Action],
                        goal: PlanningState) -> Set[Action]:
    """Get actions that are part of an optimal plan."""
    ...

@traced
def apply_action(state: PlanningState, action: Action, goal: PlanningState) -> PlanningState:
    """Apply an action to a state, returning the resulting state."""
    ...

def pddl_grid(input_str: str):
    """Solve a planning problem described in input_str.

    This function processes a planning problem description by:
    1. Extracting the initial state and goal
    2. Iteratively applying actions until the goal is reached
    3. Returning the sequence of actions as a plan
>>> pddl_grid('(define (problem gw-task-351)\n (:domain gridworld-10x10)\n (:init (at c9-5))\n (:goal (at c5-10))\n)\n')
Calling extract_problem('(define (problem gw-task-351)\n (:domain gridworld-10x10)\n (:init (at c9-5))\n (:goal (at c5-10))\n)\n')...
... extract_problem returned 'gridworld-10x10'
Calling extract_initial_state('(define (problem gw-task-351)\n (:domain gridworld-10x10)\n (:init (at c9-5))\n (:goal (at c5-10))\n)\n')...
... extract_initial_state returned (9, 5)
Calling extract_goal('(define (problem gw-task-351)\n (:domain gridworld-10x10)\n (:init (at c9-5))\n (:goal (at c5-10))\n)\n')...
... extract_goal returned (5, 10)
Calling at_goal((9, 5), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((9, 5), (5, 10))...
... get_applicable_actions returned ['move-north', 'move-east']
Calling get_optimal_actions((9, 5), ['move-north', 'move-east'], (5, 10))...
... get_optimal_actions returned ['move-north', 'move-east']
Calling apply_action((9, 5), 'move-north', (5, 10))...
... apply_action returned (9, 6)
Calling at_goal((9, 6), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((9, 6), (5, 10))...
... get_applicable_actions returned ['move-south', 'move-east']
Calling get_optimal_actions((9, 6), ['move-south', 'move-east'], (5, 10))...
... get_optimal_actions returned ['move-east']
Calling apply_action((9, 6), 'move-east', (5, 10))...
... apply_action returned (10, 6)
Calling at_goal((10, 6), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((10, 6), (5, 10))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-west']
Calling get_optimal_actions((10, 6), ['move-north', 'move-south', 'move-west'], (5, 10))...
... get_optimal_actions returned ['move-north']
Calling apply_action((10, 6), 'move-north', (5, 10))...
... apply_action returned (10, 7)
Calling at_goal((10, 7), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((10, 7), (5, 10))...
... get_applicable_actions returned ['move-north', 'move-south']
Calling get_optimal_actions((10, 7), ['move-north', 'move-south'], (5, 10))...
... get_optimal_actions returned ['move-north']
Calling apply_action((10, 7), 'move-north', (5, 10))...
... apply_action returned (10, 8)
Calling at_goal((10, 8), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((10, 8), (5, 10))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-west']
Calling get_optimal_actions((10, 8), ['move-north', 'move-south', 'move-west'], (5, 10))...
... get_optimal_actions returned ['move-north']
Calling apply_action((10, 8), 'move-north', (5, 10))...
... apply_action returned (10, 9)
Calling at_goal((10, 9), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((10, 9), (5, 10))...
... get_applicable_actions returned ['move-north', 'move-south']
Calling get_optimal_actions((10, 9), ['move-north', 'move-south'], (5, 10))...
... get_optimal_actions returned ['move-north']
Calling apply_action((10, 9), 'move-north', (5, 10))...
... apply_action returned (10, 10)
Calling at_goal((10, 10), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((10, 10), (5, 10))...
... get_applicable_actions returned ['move-south', 'move-west']
Calling get_optimal_actions((10, 10), ['move-south', 'move-west'], (5, 10))...
... get_optimal_actions returned ['move-west']
Calling apply_action((10, 10), 'move-west', (5, 10))...
... apply_action returned (9, 10)
Calling at_goal((9, 10), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((9, 10), (5, 10))...
... get_applicable_actions returned ['move-east', 'move-west']
Calling get_optimal_actions((9, 10), ['move-east', 'move-west'], (5, 10))...
... get_optimal_actions returned ['move-west']
Calling apply_action((9, 10), 'move-west', (5, 10))...
... apply_action returned (8, 10)
Calling at_goal((8, 10), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((8, 10), (5, 10))...
... get_applicable_actions returned ['move-east', 'move-west']
Calling get_optimal_actions((8, 10), ['move-east', 'move-west'], (5, 10))...
... get_optimal_actions returned ['move-west']
Calling apply_action((8, 10), 'move-west', (5, 10))...
... apply_action returned (7, 10)
Calling at_goal((7, 10), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((7, 10), (5, 10))...
... get_applicable_actions returned ['move-east', 'move-west']
Calling get_optimal_actions((7, 10), ['move-east', 'move-west'], (5, 10))...
... get_optimal_actions returned ['move-west']
Calling apply_action((7, 10), 'move-west', (5, 10))...
... apply_action returned (6, 10)
Calling at_goal((6, 10), (5, 10))...
... at_goal returned False
Calling get_applicable_actions((6, 10), (5, 10))...
... get_applicable_actions returned ['move-east', 'move-west']
Calling get_optimal_actions((6, 10), ['move-east', 'move-west'], (5, 10))...
... get_optimal_actions returned ['move-west']
Calling apply_action((6, 10), 'move-west', (5, 10))...
... apply_action returned (5, 10)
Calling at_goal((5, 10), (5, 10))...
... at_goal returned True
Final answer: move-north move-east move-north move-north move-north move-north move-west move-west move-west move-west move-west
['move-north', 'move-east', 'move-north', 'move-north', 'move-north', 'move-north', 'move-west', 'move-west', 'move-west', 'move-west', 'move-west']
>>> pddl_grid('(define (problem gw-task-352)\n (:domain gridworld-10x10)\n (:init (at c9-3))\n (:goal (at c7-7))\n)\n')
Calling extract_problem('(define (problem gw-task-352)\n (:domain gridworld-10x10)\n (:init (at c9-3))\n (:goal (at c7-7))\n)\n')...
... extract_problem returned 'gridworld-10x10'
Calling extract_initial_state('(define (problem gw-task-352)\n (:domain gridworld-10x10)\n (:init (at c9-3))\n (:goal (at c7-7))\n)\n')...
... extract_initial_state returned (9, 3)
Calling extract_goal('(define (problem gw-task-352)\n (:domain gridworld-10x10)\n (:init (at c9-3))\n (:goal (at c7-7))\n)\n')...
... extract_goal returned (7, 7)
Calling at_goal((9, 3), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((9, 3), (7, 7))...
... get_applicable_actions returned ['move-south', 'move-east']
Calling get_optimal_actions((9, 3), ['move-south', 'move-east'], (7, 7))...
... get_optimal_actions returned ['move-south', 'move-east']
Calling apply_action((9, 3), 'move-south', (7, 7))...
... apply_action returned (9, 2)
Calling at_goal((9, 2), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((9, 2), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-east']
Calling get_optimal_actions((9, 2), ['move-north', 'move-south', 'move-east'], (7, 7))...
... get_optimal_actions returned ['move-south']
Calling apply_action((9, 2), 'move-south', (7, 7))...
... apply_action returned (9, 1)
Calling at_goal((9, 1), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((9, 1), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-east', 'move-west']
Calling get_optimal_actions((9, 1), ['move-north', 'move-east', 'move-west'], (7, 7))...
... get_optimal_actions returned ['move-west']
Calling apply_action((9, 1), 'move-west', (7, 7))...
... apply_action returned (8, 1)
Calling at_goal((8, 1), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((8, 1), (7, 7))...
... get_applicable_actions returned ['move-east', 'move-west']
Calling get_optimal_actions((8, 1), ['move-east', 'move-west'], (7, 7))...
... get_optimal_actions returned ['move-west']
Calling apply_action((8, 1), 'move-west', (7, 7))...
... apply_action returned (7, 1)
Calling at_goal((7, 1), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((7, 1), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-east', 'move-west']
Calling get_optimal_actions((7, 1), ['move-north', 'move-east', 'move-west'], (7, 7))...
... get_optimal_actions returned ['move-north']
Calling apply_action((7, 1), 'move-north', (7, 7))...
... apply_action returned (7, 2)
Calling at_goal((7, 2), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((7, 2), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-west']
Calling get_optimal_actions((7, 2), ['move-north', 'move-south', 'move-west'], (7, 7))...
... get_optimal_actions returned ['move-north']
Calling apply_action((7, 2), 'move-north', (7, 7))...
... apply_action returned (7, 3)
Calling at_goal((7, 3), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((7, 3), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-south']
Calling get_optimal_actions((7, 3), ['move-north', 'move-south'], (7, 7))...
... get_optimal_actions returned ['move-north']
Calling apply_action((7, 3), 'move-north', (7, 7))...
... apply_action returned (7, 4)
Calling at_goal((7, 4), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((7, 4), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-east']
Calling get_optimal_actions((7, 4), ['move-north', 'move-south', 'move-east'], (7, 7))...
... get_optimal_actions returned ['move-north']
Calling apply_action((7, 4), 'move-north', (7, 7))...
... apply_action returned (7, 5)
Calling at_goal((7, 5), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((7, 5), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-south']
Calling get_optimal_actions((7, 5), ['move-north', 'move-south'], (7, 7))...
... get_optimal_actions returned ['move-north']
Calling apply_action((7, 5), 'move-north', (7, 7))...
... apply_action returned (7, 6)
Calling at_goal((7, 6), (7, 7))...
... at_goal returned False
Calling get_applicable_actions((7, 6), (7, 7))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-west']
Calling get_optimal_actions((7, 6), ['move-north', 'move-south', 'move-west'], (7, 7))...
... get_optimal_actions returned ['move-north']
Calling apply_action((7, 6), 'move-north', (7, 7))...
... apply_action returned (7, 7)
Calling at_goal((7, 7), (7, 7))...
... at_goal returned True
Final answer: move-south move-south move-west move-west move-north move-north move-north move-north move-north move-north
['move-south', 'move-south', 'move-west', 'move-west', 'move-north', 'move-north', 'move-north', 'move-north', 'move-north', 'move-north']
>>> pddl_grid('(define (problem gw-task-353)\n (:domain gridworld-10x10)\n (:init (at c7-2))\n (:goal (at c2-5))\n)\n')
Calling extract_problem('(define (problem gw-task-353)\n (:domain gridworld-10x10)\n (:init (at c7-2))\n (:goal (at c2-5))\n)\n')...
... extract_problem returned 'gridworld-10x10'
Calling extract_initial_state('(define (problem gw-task-353)\n (:domain gridworld-10x10)\n (:init (at c7-2))\n (:goal (at c2-5))\n)\n')...
... extract_initial_state returned (7, 2)
Calling extract_goal('(define (problem gw-task-353)\n (:domain gridworld-10x10)\n (:init (at c7-2))\n (:goal (at c2-5))\n)\n')...
... extract_goal returned (2, 5)
Calling at_goal((7, 2), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((7, 2), (2, 5))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-west']
Calling get_optimal_actions((7, 2), ['move-north', 'move-south', 'move-west'], (2, 5))...
... get_optimal_actions returned ['move-west']
Calling apply_action((7, 2), 'move-west', (2, 5))...
... apply_action returned (6, 2)
Calling at_goal((6, 2), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((6, 2), (2, 5))...
... get_applicable_actions returned ['move-south', 'move-east', 'move-west']
Calling get_optimal_actions((6, 2), ['move-south', 'move-east', 'move-west'], (2, 5))...
... get_optimal_actions returned ['move-west']
Calling apply_action((6, 2), 'move-west', (2, 5))...
... apply_action returned (5, 2)
Calling at_goal((5, 2), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((5, 2), (2, 5))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-east']
Calling get_optimal_actions((5, 2), ['move-north', 'move-south', 'move-east'], (2, 5))...
... get_optimal_actions returned ['move-north']
Calling apply_action((5, 2), 'move-north', (2, 5))...
... apply_action returned (5, 3)
Calling at_goal((5, 3), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((5, 3), (2, 5))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-west']
Calling get_optimal_actions((5, 3), ['move-north', 'move-south', 'move-west'], (2, 5))...
... get_optimal_actions returned ['move-west']
Calling apply_action((5, 3), 'move-west', (2, 5))...
... apply_action returned (4, 3)
Calling at_goal((4, 3), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((4, 3), (2, 5))...
... get_applicable_actions returned ['move-east', 'move-west']
Calling get_optimal_actions((4, 3), ['move-east', 'move-west'], (2, 5))...
... get_optimal_actions returned ['move-west']
Calling apply_action((4, 3), 'move-west', (2, 5))...
... apply_action returned (3, 3)
Calling at_goal((3, 3), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((3, 3), (2, 5))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-east']
Calling get_optimal_actions((3, 3), ['move-north', 'move-south', 'move-east'], (2, 5))...
... get_optimal_actions returned ['move-north']
Calling apply_action((3, 3), 'move-north', (2, 5))...
... apply_action returned (3, 4)
Calling at_goal((3, 4), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((3, 4), (2, 5))...
... get_applicable_actions returned ['move-north', 'move-south']
Calling get_optimal_actions((3, 4), ['move-north', 'move-south'], (2, 5))...
... get_optimal_actions returned ['move-north']
Calling apply_action((3, 4), 'move-north', (2, 5))...
... apply_action returned (3, 5)
Calling at_goal((3, 5), (2, 5))...
... at_goal returned False
Calling get_applicable_actions((3, 5), (2, 5))...
... get_applicable_actions returned ['move-north', 'move-south', 'move-west']
Calling get_optimal_actions((3, 5), ['move-north', 'move-south', 'move-west'], (2, 5))...
... get_optimal_actions returned ['move-west']
Calling apply_action((3, 5), 'move-west', (2, 5))...
... apply_action returned (2, 5)
Calling at_goal((2, 5), (2, 5))...
... at_goal returned True
Final answer: move-west move-west move-north move-west move-west move-north move-north move-west
['move-west', 'move-west', 'move-north', 'move-west', 'move-west', 'move-north', 'move-north', 'move-west']
"""
...
### F.2 Baseline Method Prompts
The following prompts are used for baseline comparison methods. All baselines receive the same problem information but use different reasoning frameworks.
#### F.2.1 Zero-Shot CoT / RAG-CoT Prompt
The base prompt used for Zero-Shot Chain-of-Thought and RAG-CoT baselines. For RAG-CoT, dynamically retrieved examples are inserted in the {examples} section.
```
You are an expert at navigating gridworld environments. You will
solve navigation problems where an agent must find the optimal
path from a start position to a goal position while avoiding
walls and obstacles.

IMPORTANT:
You are an agent navigating a {grid_size} gridworld.
The grid has {num_walls} walls that block movement.

**Grid Layout:**
{ascii_grid}

# Task Description
In each problem, you are given:
- A gridworld of specific dimensions
- A start position (row, column)
- A goal position (row, column)
- Wall locations that block movement

Your task is to find the shortest path from start to goal using
these actions:
- **move-north**: Move one cell north (increase row by 1)
- **move-south**: Move one cell south (decrease row by 1)
- **move-east**: Move one cell east (increase column by 1)
- **move-west**: Move one cell west (decrease column by 1)

# Solution Strategy
For each problem, follow this reasoning process:
1. **Analyze the Grid**: Identify the start position, goal
   position, and obstacles
2. **Plan the Route**: Determine if a direct path exists or if
   you need to navigate around obstacles
3. **Step-by-Step Reasoning**: For each move, explain why it
   brings you closer to the goal
4. **Verify the Path**: Ensure the path is valid and optimal

# Example Problems
{examples}

# Problem to Solve
Start: {start_position}
Goal: {goal_position}
{deadzone_warning}

Please solve this problem step-by-step and provide your answer.

**Your Solution:**
First, provide your step-by-step reasoning:
1. Identify the start position, goal position, and any obstacles
2. Reason through each step of your path
3. Verify your path is valid and optimal

Then, provide your final answer EXACTLY in this format:
**Final Action Sequence:** move-direction1, move-direction2, ...

IMPORTANT: You MUST include the line starting with
"Final Action Sequence:" followed by your comma-separated list
of actions.
```
#### F.2.2 Self-Consistency Prompt
Self-Consistency uses the same base prompt as CoT, with a sample annotation appended to each independent call:
```
{base_cot_prompt}

<!-- Self-Consistency Sample {k}/{total}: Treat this run
independently and produce a complete plan -->
```
Each sample is generated with temperature $>0$ for diversity, and the final answer is selected by majority vote over the $k$ samples.
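A minimal sketch of the voting step (assuming each plan is a list of action strings; the helper name is illustrative):

```python
from collections import Counter

def self_consistency_vote(sampled_plans):
    """Pick the most frequent complete plan across k independent samples.

    Plans are hashed as tuples; ties break toward the earliest-seen plan,
    since Counter.most_common is stable over insertion order."""
    counts = Counter(tuple(plan) for plan in sampled_plans)
    best_plan, _ = counts.most_common(1)[0]
    return list(best_plan)
```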
#### F.2.3 Self-Refine Refinement Prompt
After the initial attempt, if refinement rounds remain, the model receives its previous response with reflection instructions:
```
{base_cot_prompt}

### Self-Refinement Attempt {attempt_number}
You previously produced the following reasoning and plan:
{previous_response}
Proposed action sequence: {previous_actions}

Carefully re-read the task description and your earlier steps.
Without running code or simulations, check for potential
mistakes:
- Did any move leave the grid or pass through a wall?
- Does the sequence actually reach the goal cell?
- Is there a shorter valid route?

If issues are found, explain them briefly and provide a
corrected plan. If you believe the plan is correct and needs
no further refinement, explicitly state:
"**No further refinement needed.**" and then restate the
action sequence.

Always finish with a line of the form:
**Final Action Sequence:** move-*, move-*, ...

Refined solution:
```
#### F.2.4 ReAct (Prompt-Only) Prompt
ReAct uses an alternating Thought/Action trace format:
```
You are an expert gridworld planner. Solve using a ReAct-style
trace.

IMPORTANT:
You are an agent navigating a {grid_size} gridworld.
The grid has {num_walls} walls that block movement.

**Grid Layout:**
{ascii_grid}

## Valid Actions
- **move-north**: Move one cell up (increase y by 1)
- **move-south**: Move one cell down (decrease y by 1)
- **move-east**: Move one cell right (increase x by 1)
- **move-west**: Move one cell left (decrease x by 1)

## Movement Constraints
- You cannot move through walls
- You cannot move outside the grid boundaries
- Each action moves exactly one cell

# Example
Start: (2,1), Goal: (5,4)
Thought: I am at (2,1) and need to reach (5,4). I should move
north and east while checking for obstacles.
Action: move-north
Thought: Now at (2,2). Continue moving toward the goal.
Action: move-north
...
Final Thought: Reached the goal at (5,4).
**Final Action Sequence:** move-north, move-north, move-east, ...

# Problem to Solve
Start: {start_position}
Goal: {goal_position}

Guidelines:
- Alternate between "Thought:" and "Action:"
- Keep moves consistent with grid layout
- Avoid illegal steps (walls, boundaries)
- End with "Final Thought:" and "**Final Action Sequence:**"
```
#### F.2.5 Tree-of-Thoughts Expansion Prompt
ToT uses a structured expansion prompt requesting JSON-formatted candidates:
```
{reference_examples}
Gridworld planning problem:
Start: {start_position}
Goal: {goal_position}
Current depth: {depth}/{max_depth}
Actions chosen so far: {action_prefix}
Thoughts considered so far:
{thought_history}
Generate up to 5 candidate expansions as JSON. Each must include:
- "thought": a short description of the idea
- "proposed_actions": list of up to 8 moves continuing the plan
- "confidence": integer 0-100 for promise of success
- "is_terminal": true if the plan should stop after these actions
- "final_plan": optional full action list if terminal
Moves must stay within bounds and avoid walls.
Return ONLY the JSON array; no commentary.
```
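Because the model is asked to return a bare JSON array, a small validator can drop malformed candidates before they enter the search tree. The sketch below is illustrative rather than the paper's implementation; the field names follow the prompt above, and the limits of 5 candidates and 8 moves mirror its stated bounds:

```python
import json

MOVES = {"move-north", "move-south", "move-east", "move-west"}

# Illustrative validator for the JSON expansions requested above; field
# names follow the prompt, everything else is an assumption.
def parse_expansions(raw, max_candidates=5, max_actions=8):
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(candidates, list):
        return []
    valid = []
    for cand in candidates:
        if not isinstance(cand, dict):
            continue
        actions = cand.get("proposed_actions", [])
        if (isinstance(actions, list) and len(actions) <= max_actions
                and all(a in MOVES for a in actions)):
            valid.append(cand)
    return valid[:max_candidates]
```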
#### F.2.6 ReAct (+Oracle) Feedback Prompt
When the oracle detects errors, it provides specific feedback:
```
{original_prompt}
---
Your previous response:
{previous_response}
---
ORACLE FEEDBACK: Your plan has a PROBLEM at step {step_number}.
The action '{failed_action}' at position {position} is INVALID
because {reason}.
Please find an alternative path that avoids this issue.
**Corrected Final Action Sequence:**
```
Feedback Types.
- Invalid Move: "The action 'move-X' at position $(x,y)$ is INVALID because it would move into a wall or out of bounds."
- Deadzone Entry: "The action 'move-X' leads to position $(x,y)$ which is a DEADZONE. You should avoid deadzones."
- Incomplete Path: "Your plan is INCOMPLETE. After executing all actions, you ended at $(x,y)$ but did not reach the goal."
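These feedback strings can be produced mechanically from the simulator's verdict. A minimal sketch, assuming the oracle reports a failure kind together with the offending action and position:

```python
# Minimal sketch of rendering the three oracle feedback types above;
# the (kind, action, position) interface is an assumption.
def build_feedback(kind, action, position):
    x, y = position
    if kind == "invalid_move":
        return (f"The action '{action}' at position ({x},{y}) is INVALID "
                "because it would move into a wall or out of bounds.")
    if kind == "deadzone":
        return (f"The action '{action}' leads to position ({x},{y}) which "
                "is a DEADZONE. You should avoid deadzones.")
    if kind == "incomplete":
        return ("Your plan is INCOMPLETE. After executing all actions, "
                f"you ended at ({x},{y}) but did not reach the goal.")
    raise ValueError(f"unknown feedback kind: {kind}")
```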
### F.3 L-ICL Correction Format
L-ICL corrections are formatted as doctest-style input-output examples inserted into subroutine documentation. This format leverages Python's doctest convention, which is well-represented in LLM training data.
Correction Structure.
```
>>> {function_name}({input_args})
{correct_output}
```
Example Corrections by Subroutine.
Applicability Correction (when the LLM proposes an invalid action):
```
>>> get_applicable_actions(state=(3, 4), goal=(7, 8))
{'move_north', 'move_south', 'move_west'}
```
Optimality Correction (when the LLM proposes a suboptimal action):
```
>>> get_optimal_actions(state=(5, 2), goal=(8, 7))
{'move_north', 'move_east'}
```
BlocksWorld Action Correction:
```
>>> get_recommended_actions(
...     state={'on': [('A', 'B')], 'on-table': ['B', 'C'],
...            'clear': ['A', 'C']},
...     goal={'on': [('B', 'C')], 'on-table': ['A', 'C'],
...           'clear': ['A', 'B']}
... )
{'move-b-to-t(A, B)'}
```
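Rendering a correction into this doctest format only requires the failing step's inputs and the oracle's answer. The formatter below is a hypothetical sketch (argument pretty-printing is simplified to `repr`, and set-valued outputs may print in any order):

```python
# Hypothetical formatter for L-ICL corrections; the real argument
# formatting may differ (e.g., stable ordering of set outputs).
def format_correction(function_name, input_args, correct_output):
    # Render keyword arguments in insertion order, e.g. state=(3, 4)
    args = ", ".join(f"{k}={v!r}" for k, v in input_args.items())
    return f">>> {function_name}({args})\n{correct_output!r}"
```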
### F.4 Action Parsing Patterns
All methods use the same action sequence parser that handles multiple output formats:
```
PATTERNS = [
    r"\*\*Final Action Sequence:\*\*\s*(.+?)(?:\n|$)",
    r"Final Action Sequence:\s*(.+?)(?:\n|$)",
    r"\*\*Action Sequence:\*\*\s*(.+?)(?:\n|$)",
    r"Action Sequence:\s*(.+?)(?:\n|$)",
    r"Optimal path:\s*(.+?)(?:\n|$)",
    r"Plan:\s*(.+?)(?:\n|$)",
]
VALID_ACTIONS = {
    "move-north", "move-south", "move-east", "move-west",
    "move_north", "move_south", "move_east", "move_west",
    "push-north", "push-south", "push-east", "push-west",
}
# Normalize action format
ACTION_ALIASES = {
    "north": "move-north", "south": "move-south",
    "east": "move-east", "west": "move-west",
    "up": "move-north", "down": "move-south",
    "right": "move-east", "left": "move-west",
}
```
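Putting the tables above to use, a minimal sketch of the shared parser tries each header pattern in order, splits the matched tail on commas, and normalizes aliases before validating. The pattern list is abbreviated here, and hyphen/underscore variants are collapsed to hyphens:

```python
import re

# Abbreviated versions of the Section F.4 tables, for a self-contained
# sketch of the shared action-sequence parser.
PATTERNS = [
    r"\*\*Final Action Sequence:\*\*\s*(.+?)(?:\n|$)",
    r"Final Action Sequence:\s*(.+?)(?:\n|$)",
    r"Plan:\s*(.+?)(?:\n|$)",
]
VALID_ACTIONS = {"move-north", "move-south", "move-east", "move-west",
                 "push-north", "push-south", "push-east", "push-west"}
ACTION_ALIASES = {"north": "move-north", "south": "move-south",
                  "east": "move-east", "west": "move-west",
                  "up": "move-north", "down": "move-south",
                  "right": "move-east", "left": "move-west"}

def parse_actions(response):
    """Extract a normalized action list from a model response."""
    for pattern in PATTERNS:
        m = re.search(pattern, response)
        if m:
            # Lowercase, collapse underscores to hyphens, resolve aliases
            tokens = [t.strip().lower().replace("_", "-")
                      for t in m.group(1).split(",")]
            tokens = [ACTION_ALIASES.get(t, t) for t in tokens]
            return [t for t in tokens if t in VALID_ACTIONS]
    return []
```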
### F.5 Prompt Variations by Domain
8 $\times$ 8 Two-Room Gridworld.
Uses the standard L-ICL prompt with ASCII grid showing two rooms separated by a wall with a doorway.
10 $\times$ 10 Maze.
Uses the L-ICL prompt with ASCII grid showing procedurally generated maze corridors. Wall density is higher, creating narrow passages.
Sokoban-Style Gridworld.
Uses the standard L-ICL prompt with Sokoban-style ASCII layouts but no pushable boxes.
Full Sokoban.
Extends the action space to include push actions:
```
Valid actions:
- Movement: move_north, move_south, move_east, move_west
- Pushing: push_north, push_south, push_east, push_west
```
BlocksWorld.
Uses the UBW algorithm prompt (Section F.1.3) with relational state representation instead of spatial coordinates.
### F.6 Hyperparameter Settings by Method
Table 8 summarizes the key hyperparameters used for each prompting method.
Table 8: Hyperparameter settings for each prompting method.
| Method | Temperature | Max tokens | Samples / iterations |
| --- | --- | --- | --- |
| Zero-Shot CoT | 1.0 | 32,000 | 1 |
| RAG-CoT | 1.0 | 32,000 | 1 |
| Self-Consistency | 1.0 | 32,000 | $k=5$ |
| Self-Refine | 1.0 | 32,000 | $N=5$ |
| ReAct (Prompt) | 1.0 | 32,000 | 1 |
| ToT (Prompt) | 1.0 | 32,000 | $b=5,d=3$ |
| ReAct (+Oracle) | 0.3 | 32,000 | 1–2 |
| L-ICL | 1.0 | 32,000 | 1 |
L-ICL Training Configuration.
- Training examples: up to 240 problems
- Corrections per problem: 1 (first failure only) in Sokoban and BlocksWorld; up to 2 in the gridworld domains (the first validity correction, plus the first optimality correction when one exists)
- Correction accumulation: batch update after 10 training examples
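The training configuration above can be summarized as a loop: solve each training problem with the current correction set, collect a localized correction at the first failure, and fold pending corrections into the prompt in batches of 10. The sketch below is a schematic of this schedule; the `solve` and `find_first_failure` interfaces are assumptions, and for simplicity it collects one correction per problem:

```python
# Schematic of the L-ICL training schedule described above. `solve` runs
# the planner with the current correction set; `find_first_failure`
# returns a doctest-style correction for the first violated step, or
# None if the plan is valid. Both interfaces are assumptions.
def train_l_icl(problems, solve, find_first_failure, batch_size=10):
    corrections, pending = [], []
    for problem in problems:
        plan = solve(problem, corrections)      # prompt includes corrections
        correction = find_first_failure(problem, plan)
        if correction is not None:
            pending.append(correction)
        if len(pending) >= batch_size:          # batch update every 10
            corrections.extend(pending)
            pending.clear()
    corrections.extend(pending)                 # flush the remainder
    return corrections
```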