# seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
**Authors**:
- M.R. Ramezanali$^{\star}$ (Salesforce AI; Palo Alto, CA)
- M. Vazifeh$^{\star}$ (Capital One, MIT; Cambridge, MA)
- P. Santi (MIT; Cambridge, MA)

> $\star$ denotes equal contribution.
## Abstract
We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, quantifying how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio between supporting and distracting facts about the environment. Our evaluations on state-of-the-art LLMs reveal a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, seqBench's fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in this paper alongside its generation methodology and evaluation metrics. We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed for future evolution to keep pace with advancing models, the seqBench datasets are publicly released to spur deeper scientific inquiry into LLM reasoning, aiming to establish a clearer understanding of their true potential and current boundaries for robust real-world application.
Large Language Models (LLMs) have shown remarkable performance (Vaswani et al., 2017; Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022; Du et al., 2021; Fedus et al., 2022; Zoph et al., 2022) on a wide range of tasks and benchmarks spanning diverse human-like capabilities; however, these successes can obscure fundamental limitations in sequential reasoning that still persist. Arguably, reasoning captures a purer form of intelligence, going beyond mere pattern matching or fact memorization, and is thus a critical capability to understand and enhance in AI systems. Recent studies show that state-of-the-art LLMs (OpenAI, 2025; Google DeepMind, 2025; Meta AI, 2025; Mistral AI, 2024; Anthropic, 2025) excel at complex benchmarks, yet stumble on simple commonsense inferences trivial for an adult human (Nezhurina et al., 2025; Han et al., 2024; Sharma, 2024; Berglund et al., 2024; Yang et al., 2019). Most existing benchmarks saturate quickly, leaving little room for fine-grained attribution studies that systematically probe LLM failure modes. Consequently, a robust understanding of why and under what circumstances these models fail, especially on problems requiring sequential reasoning, remains elusive.
This gap, we argue, stems from the lack of evaluation benchmarks allowing systematic, multi-dimensional control over key independent factors that influence a task's overall reasoning difficulty. Most benchmarks (Cobbe et al., 2021; Hendrycks et al., 2021; Srivastava et al., 2023; Weston et al., 2015; Clark et al., 2018; Dua et al., 2019; Rein et al., 2023), despite their evaluation merits, often do not support a systematic variation of crucial complexity dimensions. This makes it difficult to isolate the specific conditions under which reasoning in LLMs falters. For instance, it is often not quantitatively possible to discern whether a failure is due to the length of the required reasoning chain, the necessity to revise intermediate conclusions, or the density of distracting information. While prompting strategies like chain-of-thought (CoT) and model scaling have boosted aggregate performance, they often obscure sharp performance cliffs that can emerge when these underlying complexity dimensions are varied independently (Wei et al., 2023; Kojima et al., 2022). Without such systematic control, disentangling inherent architectural limitations from those addressable via scaling (model size, data, or compute), fine-tuning, or prompting techniques is challenging. A fine-grained understanding of these performance boundaries is crucial for developing more robust and reliable reasoning systems.
To complement recent efforts (Sprague et al., 2024; Tyagi et al., 2024; Kuratov et al., 2024; Tang and Kejriwal, 2025; Mirzaee et al., 2021; Tikhonov, 2024; Mirzaee and Kordjamshidi, 2022; Shi et al., 2022) in evaluating reasoning, and to address the need for more controlled analysis, we introduce seqBench, a tunable benchmark designed explicitly to probe and analyze sequential reasoning capabilities in language models. The dataset comprises synthetic yet linguistically grounded pathfinding task configurations on two-dimensional grids. Solving each problem requires sequential inference over relevant and distracting structured facts. Each instance is automatically verifiable and parameterized by controllable factors that directly address the previously identified gaps: (1) logical depth (total number of actions in the ground-truth solution, reflecting the length of the reasoning chain); (2) backtracking count (number of locked-door detours on the optimal path, requiring revision of tentative solution paths); and (3) noise ratio (proportion of distracting vs. supporting facts, testing robustness to irrelevant information). Performance against these dimensions can be quantified with fine-grained metrics (e.g., via progress ratio as we define here). We observe that beyond a certain logical depth, Pass@1 success collapses to near zero for all models (see Figure 1). These features enable precise attribution studies of model failure modes, offering insights into the brittle boundaries of current LLM generalization.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Performance Comparison Chart: AI Model Success Rate vs. Number of Actions
### Overview
The image contains two line charts comparing the performance of nine different AI models. The charts plot the "Success Rate" of each model against the "Number of Actions (L)" it is required to perform. The top chart uses a linear scale for the y-axis, while the bottom chart uses a logarithmic scale for the y-axis to better visualize performance differences at lower success rates. The data suggests an exponential decay in success rate as the number of actions increases for all models.
### Components/Axes
* **X-Axis (Both Plots):** Labeled "Number of Actions (L)". The scale runs from 0 to 300, with major tick marks at 0, 50, 100, 150, 200, 250, and 300.
* **Y-Axis (Top Plot):** Labeled "Success Rate". The scale is linear, running from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Y-Axis (Bottom Plot):** Labeled "Success Rate (Log Scale)". The scale is logarithmic (base 10), running from 10⁻³ (0.001) to 10⁰ (1.0), with major tick marks at 10⁻³, 10⁻², 10⁻¹, and 10⁰.
* **Legend (Top-Right of Top Plot):** A box containing the model names, their corresponding line colors and markers, and the parameters for their fitted exponential decay curves. The fit formula is given at the top of the legend box as: `Fit: ~ exp(-L / L₀)`.
* **gemini-2.5-flash-preview-04-17:** Red line with circle markers. (Fit: L₀ = 85.7)
* **gemini-2.0-flash:** Green line with circle markers. (Fit: L₀ = 40.2)
* **Llama-4-Maverick-17B-128E-Instruct-FP8:** Gray line with circle markers. (Fit: L₀ = 16.7)
* **Llama-3.3-70B-Instruct-Turbo:** Pink line with circle markers. (Fit: L₀ = 10.2)
* **gemma-2-27b-it:** Purple line with circle markers. (Fit: L₀ = 8.1)
* **Qwen2.5-Coder-32B-Instruct:** Orange line with circle markers. (Fit: L₀ = 4.8)
* **Qwen2.5-7B-Instruct-Turbo:** Blue line with circle markers. (Fit: L₀ = 4.0)
* **Llama-3.2-3B-Instruct-Turbo:** Brown line with circle markers. (Fit: L₀ = 1.6)
### Detailed Analysis
**Trend Verification:** All data series show a downward trend, with success rate decreasing as the number of actions (L) increases. The decay appears exponential, as indicated by the fitted dashed lines.
**Data Points & Model Performance (Approximate values from visual inspection):**
1. **gemini-2.5-flash-preview-04-17 (Red):**
* **Trend:** Slowest decay, highest overall performance.
* **Points (Top Plot):** L=0, Success≈1.0; L=50, Success≈0.5; L=100, Success≈0.25; L=200, Success≈0.1; L=300, Success≈0.05.
* **Points (Bottom Plot):** Confirms the trend, showing a near-linear decline on the log scale.
2. **gemini-2.0-flash (Green):**
* **Trend:** Second-best performance, decay faster than gemini-2.5-flash.
* **Points (Top Plot):** L=0, Success≈0.9; L=50, Success≈0.15; L=100, Success≈0.05; L=200, Success≈0.01.
* **Points (Bottom Plot):** Shows a steeper slope than the red line on the log scale.
3. **Llama-4-Maverick-17B-128E-Instruct-FP8 (Gray):**
* **Trend:** Moderate decay, performance falls between gemini-2.0-flash and Llama-3.3-70B.
* **Points (Top Plot):** L=0, Success≈0.85; L=50, Success≈0.1; L=100, Success≈0.02.
* **Points (Bottom Plot):** Line ends near L=100, Success≈10⁻².
4. **Llama-3.3-70B-Instruct-Turbo (Pink):**
* **Trend:** Faster decay than the gray line.
* **Points (Top Plot):** L=0, Success≈0.8; L=50, Success≈0.05; L=100, Success≈0.01.
* **Points (Bottom Plot):** Line ends near L=75, Success≈10⁻².
5. **gemma-2-27b-it (Purple):**
* **Trend:** Rapid decay.
* **Points (Top Plot):** L=0, Success≈0.7; L=25, Success≈0.1; L=50, Success≈0.02.
* **Points (Bottom Plot):** Line ends near L=50, Success≈10⁻².
6. **Qwen2.5-Coder-32B-Instruct (Orange):**
* **Trend:** Very rapid decay.
* **Points (Top Plot):** L=0, Success≈0.65; L=25, Success≈0.05.
* **Points (Bottom Plot):** Line ends near L=25, Success≈10⁻².
7. **Qwen2.5-7B-Instruct-Turbo (Blue):**
* **Trend:** Extremely rapid decay.
* **Points (Top Plot):** L=0, Success≈0.5; L=10, Success≈0.05.
* **Points (Bottom Plot):** Line ends near L=15, Success≈10⁻².
8. **Llama-3.2-3B-Instruct-Turbo (Brown):**
* **Trend:** Fastest decay, lowest overall performance.
* **Points (Top Plot):** L=0, Success≈0.4; L=10, Success≈0.02.
* **Points (Bottom Plot):** Line ends near L=10, Success≈10⁻².
### Key Observations
1. **Clear Performance Hierarchy:** There is a distinct and consistent ordering of model performance across the entire range of actions. `gemini-2.5-flash-preview-04-17` is the top performer, followed by `gemini-2.0-flash`, then the Llama and other models in descending order.
2. **Exponential Decay Fit:** The dashed lines representing the exponential fit `~ exp(-L / L₀)` align closely with the data points for each model, confirming this as a good model for the performance drop-off. The `L₀` parameter (decay constant) quantifies this: a higher `L₀` indicates slower decay (e.g., 85.7 for the top model vs. 1.6 for the bottom model).
3. **Log Scale Revelation:** The bottom logarithmic plot clearly shows that while all models start at different success rates, their decay *rates* (the slopes of the lines on the log plot) also differ significantly. The top models have shallower slopes.
4. **Performance Convergence at High L:** On the linear plot, all models' success rates converge towards zero as the number of actions approaches 300, though the top model maintains a small but non-zero rate.
### Interpretation
This data demonstrates a strong inverse relationship between task length (number of sequential actions) and the reliability of AI models in completing them successfully. The exponential decay suggests that each additional action introduces a roughly constant probability of failure, compounding over the sequence.
The significant differences in the `L₀` values (ranging from 85.7 to 1.6) highlight vast disparities in model capability for multi-step reasoning or planning. The `gemini-2.5-flash-preview-04-17` model is approximately 50 times more resilient to increasing task length than the `Llama-3.2-3B-Instruct-Turbo` model, as indicated by their `L₀` ratio.
**Practical Implication:** For applications requiring long sequences of actions (e.g., complex problem-solving, multi-turn dialogue, autonomous agent tasks), model selection is critical. Using a model with a low `L₀` value would lead to a very high likelihood of failure for all but the shortest tasks. The chart provides a quantitative basis for choosing a model based on the expected task length and required reliability. The near-perfect fit to an exponential model also allows for predicting success rates for action counts not explicitly tested.
</details>
Figure 1: Performance collapse of various models with increasing logical depth $L$ for a pathfinding task ($N,M=40$, $\mathcal{B}=2$ keys, Noise Ratio $\mathcal{N}=0.0$). Success rates (Pass@1) are shown on linear (top panel) and logarithmic (bottom panel) y-axes, averaged from 5 runs/problem across 40 problems per unit $L$-bin. All evaluations used Temperature=1.0 and top-p=0.95 (Gemini-2.5-flash: "auto" thinking). The displayed fits employ a Weighted Least Squares (WLS) method (Carroll and Ruppert, 2017) on log-success rates. Weights are derived from inverse squared residuals of a preliminary Ordinary Least Squares (OLS) fit. (In the supplementary section, Figure 16 shows that a similar pattern is observed in recently released OpenAI models.)
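The two-stage fitting scheme described in the caption can be sketched as follows. This is a minimal illustration on synthetic data; `fit_decay_constant` is our own helper name, not from the released code, and the clamp on near-zero residuals is an added numerical safeguard.

```python
import numpy as np

def fit_decay_constant(L, success):
    """Fit success ~ exp(-L / L0) on log-success rates: a preliminary OLS
    fit, then a WLS refit weighted by the inverse squared OLS residuals."""
    L = np.asarray(L, dtype=float)
    y = np.log(np.asarray(success, dtype=float))   # log-linear: y = -L/L0 + c
    X = np.column_stack([L, np.ones_like(L)])
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # stage 1: OLS
    resid = y - X @ beta_ols
    w = 1.0 / np.maximum(resid ** 2, 1e-12)            # stage 2: WLS weights
    sw = np.sqrt(w)
    beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return -1.0 / beta_wls[0]                          # slope = -1/L0

# Synthetic sanity check: data generated with L0 = 50 should be recovered.
rng = np.random.default_rng(0)
L = np.arange(1, 201, 5)
success = np.exp(-L / 50.0 + rng.normal(0.0, 0.05, L.size))
L0_hat = fit_decay_constant(L, success)
```

On this noiseless-in-expectation synthetic series, the recovered decay constant lands close to the true $L_0 = 50$.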
Furthermore, the seqBench benchmark is built upon a scalable data generation framework, allowing it to evolve alongside increasingly capable models to support both model training and evaluation. Through evaluations on popular LLMs, we reveal that top-performing LLMs exhibit steep, universal performance declines as any of the three complexity dimensions increases, while remaining comparatively robust to fact shuffling, which leaves the underlying logical structure unchanged.
#### Contributions.
Our main contributions are:
1. seqBench: A Tunable Benchmark for Sequential Reasoning. We introduce an open-source framework for generating pathfinding tasks with fine-grained, orthogonal control over logical depth, backtracking steps, and noise ratio. We also evaluate secondary factors such as fact ordering (shuffle ratio; see the supplementary material for details).
2. Comprehensive LLM Attribution Study. Using seqBench, we demonstrate the significant impact of these controlled complexities on LLM performance, revealing sharp performance cliffs in state-of-the-art models even when search complexity is minimal.
The seqBench dataset is publicly available at https://huggingface.co/datasets/emnlp-submission/seqBench under the CC BY 4.0 license to facilitate benchmarking.
<details>
<summary>figs/llama4_deepdive.png Details</summary>

### Visual Description
## Line Graphs: Model Performance Metrics vs. Number of Actions
### Overview
The image contains two vertically stacked line charts sharing the same x-axis ("Number of actions"). The top chart plots the "Success rate" of a specific AI model against the number of actions, accompanied by an exponential decay fit. The bottom chart plots three related performance metrics ("Precision", "Recall", "Progress ratio") against the number of actions, with error bars indicating variability.
### Components/Axes
**Top Chart:**
* **X-axis:** Label: "Number of actions". Scale: Linear, from 0 to 300, with major ticks at 0, 50, 100, 150, 200, 250, 300.
* **Y-axis:** Label: "Success rate". Scale: Linear, from 0.0 to 0.6 (approx. 0.7 at top), with major ticks at 0.0, 0.2, 0.4, 0.6.
* **Legend (Top-right corner):**
* Blue line with circle markers: "Llama-4-Maverick-17B-128E-Instruct-FP8"
* Orange dashed line: "∝ exp(−L/L₀), L₀ = 16.7"
**Bottom Chart:**
* **X-axis:** Label: "Number of actions". Scale: Linear, from 0 to 400, with major ticks at 0, 100, 200, 300, 400.
* **Y-axis:** No explicit label, but values range from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Legend (Top-right corner):**
* Blue line with circle markers and vertical error bars: "Precision"
* Orange line with circle markers and vertical error bars: "Recall"
* Green line with circle markers and vertical error bars: "Progress ratio"
### Detailed Analysis
**Top Chart - Success Rate:**
* **Trend Verification:** The blue data series ("Llama-4-Maverick...") shows a steep, concave-upward decay. It starts high and decreases rapidly, then asymptotically approaches zero. The orange dashed line (exponential fit) follows this trend very closely.
* **Data Points (Approximate):**
* At ~10 actions: Success rate ≈ 0.63
* At ~20 actions: Success rate ≈ 0.26
* At ~30 actions: Success rate ≈ 0.14
* At ~40 actions: Success rate ≈ 0.09
* At ~50 actions: Success rate ≈ 0.06
* At ~60 actions: Success rate ≈ 0.04
* At ~100 actions: Success rate ≈ 0.02
* From ~150 to 300 actions: Success rate is very close to 0.0, with data points hovering just above the axis.
**Bottom Chart - Precision, Recall, Progress Ratio:**
* **Trend Verification:**
* **Precision (Blue):** Starts high (~0.9) and remains relatively stable, showing a very slight downward trend with large error bars.
* **Recall (Orange):** Starts moderately high (~0.8) and shows a clear, steady downward trend.
* **Progress Ratio (Green):** Starts moderately high (~0.75) and shows the steepest decline of the three metrics.
* **Data Points & Error Bars (Approximate):**
* **Precision (Blue):**
* ~10 actions: Mean ≈ 0.90, Error bar range ≈ 0.85 to 0.95
* ~50 actions: Mean ≈ 0.92, Error bar range ≈ 0.88 to 0.96
* ~100 actions: Mean ≈ 0.91, Error bar range ≈ 0.84 to 0.98
* ~200 actions: Mean ≈ 0.87, Error bar range ≈ 0.76 to 0.98
* ~300 actions: Mean ≈ 0.87, Error bar range ≈ 0.79 to 0.95
* **Recall (Orange):**
* ~10 actions: Mean ≈ 0.79, Error bar range ≈ 0.68 to 0.90
* ~50 actions: Mean ≈ 0.62, Error bar range ≈ 0.40 to 0.84
* ~100 actions: Mean ≈ 0.54, Error bar range ≈ 0.18 to 0.90
* ~200 actions: Mean ≈ 0.38, Error bar range ≈ 0.16 to 0.60
* ~300 actions: Mean ≈ 0.28, Error bar range ≈ 0.10 to 0.46
* **Progress Ratio (Green):**
* ~10 actions: Mean ≈ 0.74, Error bar range ≈ 0.22 to 1.00 (very large)
* ~50 actions: Mean ≈ 0.26, Error bar range ≈ 0.02 to 0.50
* ~100 actions: Mean ≈ 0.11, Error bar range ≈ 0.02 to 0.20
* ~200 actions: Mean ≈ 0.09, Error bar range ≈ 0.02 to 0.16
* ~300 actions: Mean ≈ 0.04, Error bar range ≈ 0.01 to 0.08
### Key Observations
1. **Strong Exponential Decay:** The success rate of the "Llama-4-Maverick" model decays exponentially with the number of actions, with a characteristic length scale (L₀) of 16.7 actions. The fit is excellent.
2. **Divergent Metric Trends:** While the model's **Precision** remains high and stable (though with high variance) as actions increase, its **Recall** and **Progress Ratio** degrade significantly. The Progress Ratio degrades the fastest.
3. **Increasing Variability:** The error bars for all three metrics in the bottom chart are substantial, particularly for Recall and Progress Ratio at lower action counts, indicating high variance in model performance across different trials or tasks.
4. **Performance Plateau:** All metrics, especially Success Rate and Progress Ratio, appear to plateau near zero after approximately 150-200 actions, suggesting a functional limit to the model's effective operational range in this context.
### Interpretation
This data demonstrates a critical limitation in the evaluated AI model's performance on sequential or multi-step tasks. The exponential decay in success rate indicates that the probability of completing a task successfully diminishes rapidly with each additional action required.
The divergence between Precision and Recall is particularly insightful. The model maintains high **Precision** (when it claims to have completed a step or identified something, it is often correct), but its **Recall** plummets (it misses an increasing number of required steps or relevant items as the task length grows). This suggests the model becomes increasingly "conservative" or "forgetful" in longer action sequences: it may avoid making incorrect predictions, but at the cost of failing to complete necessary actions.
The **Progress Ratio**, which likely measures the proportion of actions that meaningfully advance the task goal, decays fastest. This implies that in longer sequences, a growing fraction of the model's actions are either redundant, corrective, or non-productive.
**In summary:** The model is reliable for short action sequences but suffers from a severe "horizon problem." Its ability to maintain goal-directed behavior and recall necessary information degrades exponentially with task length, even while the correctness of its individual, isolated predictions remains relatively stable. This highlights a fundamental challenge in scaling such models to complex, long-horizon problems. The provided exponential fit (L₀ = 16.7) offers a quantitative benchmark for this limitation.
</details>
Figure 2: On the left: the Llama-4-Maverick-17B-128E-Instruct model's performance (Pass@1 success rate) versus the number of actions in the ground-truth path of the pathfinding problems ($N,M=40$, $\mathcal{B}=2$ keys, Noise Ratio $\mathcal{N}=0.0$). The Pass@1 success rate across 5 runs per problem is averaged over problem instances sampled from action-count bins of width 1. On the right: the mean progress ratio across all problems, together with mean precision and recall, highlights the model's gradually increasing struggle to complete the path. Temperature is set to 1.0 and top-p to 0.95 in all runs.
## 1 Methods
### 1.1 Dataset Generation
The seqBench dataset consists of spatial pathfinding tasks. Task instance generation (Algorithm 1; see Appendix A for details) is predicated on precise, independent control of the three key complexity dimensions introduced earlier: Logical Depth ($L$), Backtracking Count ($\mathcal{B}$), and Noise Ratio ($\mathcal{N}$). This allows the creation of instances with specific values for these parameters, enabling targeted studies of their impact on LLM reasoning.
Task instances are produced in a multi-stage process. Initially, the primary generation parameters are specified: maze dimensions ($N,M$), target backtracks ($\mathcal{B}_{\text{target}}$), and target noise ratio ($\mathcal{N}_{\text{target}}$). An acyclic maze graph ($M_{g}$) is formed on an $N\times M$ grid using Kruskal's algorithm (Kleinberg and Tardos, 2006). Our "Rewind Construction" method (Algorithm 1) then embeds $\mathcal{B}_{\text{target}}$ backtracking maneuvers by working backward from a goal to strategically place keys and locked doors, yielding the instance's actual backtracking count $\mathcal{B}$. Finally, a natural language fact list ($\mathcal{F}$) is derived from the maze, and distracting facts are added according to $\mathcal{N}_{\text{target}}$ to achieve the final noise ratio $\mathcal{N}$. The logical depth $L$ (optimal path length) emerges from these generative steps, influenced by $N$, $M$, $\mathcal{B}_{\text{target}}$, and construction stochasticity. While $L$ is not a direct input to the generation algorithm, the process is designed to yield a wide spectrum of logical depths. Each generated instance is then precisely annotated with its emergent $L$ value, alongside its effective $\mathcal{B}$ and $\mathcal{N}$ values. This annotation effectively makes $L$ a key, selectable parameter for users of the seqBench dataset, enabling them to choose or filter tasks by their desired logical depth. Our rewind construction method guarantees task solvability. The full seqBench benchmark is constructed by systematically applying this instance generation process (detailed in Algorithm 1) across a wide range of initial parameters, including varied grid sizes (e.g., $N\in\{5..50\}$, $M\approx N$) and target backtracks ($\mathcal{B}_{\text{target}}\in\{0..7\}$), yielding a large and diverse data pool.
For each $(N,M,\mathcal{B}_{\text{target}})$ configuration, multiple unique base mazes are generated, to which different noise ratios (e.g., $\mathcal{N}_{\text{target}}\in\{0..1\}$) are subsequently applied. It is important to note that the algorithm constrains backtracking complexity to a simple dependency chain. In this setting, retrieving the key for each locked door involves at most one backtracking step to pick up its corresponding key, without requiring the unlocking of additional doors along the optimal path. Combined with the uniform random placement of keys, this design ensures a well-balanced distribution of backtracking difficulty across the generated instances for each logical depth $L$. Nevertheless, the same backward-in-time construction can be extended to generate tasks with higher backtracking complexity: for example, doors that require multiple keys, or intermediate doors that must be unlocked en route to other keys. Such extensions would introduce richer tree-structured dependency graphs and allow seqBench to probe model performance under more complex long-horizon reasoning regimes. The creation of this comprehensive data pool was computationally efficient, requiring approximately an hour of computation on a standard laptop while using minimal memory. The publicly released benchmark comprises a substantial collection of these generated instances, each annotated with its specific emergent logical depth $L$, effective backtracking count $\mathcal{B}$, and noise ratio $\mathcal{N}$. This rich annotation is key, enabling researchers to readily select or filter task subsets by these dimensions for targeted studies (e.g., as done for Figure 1, where instances were sampled into $L$-bins with other parameters fixed).
For the experiments presented in this paper, specific subsets were drawn from this benchmark pool, often involving further filtering or parameter adjustments tailored to the objectives of each study; precise details for each experiment are provided in the relevant sections and figure captions. Full details on path derivation, fact compilation, and overall dataset generation parameters are provided in the Appendix A.
Input: Grid $N\times M$, Target backtracks $\mathcal{B}$
Output: Maze graph $M_{g}$, Locked doors $\mathcal{D}_{L}$, Key info $\mathcal{K}_{I}$, Path skeleton $\Pi_{S}$

$M_{g}\leftarrow$ Acyclic graph on grid (Kruskal's);
$x\leftarrow C_{goal}\leftarrow$ Random goal cell in $M_{g}$;
$\mathcal{D}_{L},\mathcal{K}_{I}\leftarrow\emptyset,\emptyset$; $b\leftarrow 0$;
$\Pi_{S}\leftarrow[(C_{goal},\text{GOAL})]$;
while $b<\mathcal{B}$ do
  $c_{key}\leftarrow$ Random cell in $M_{g}$ accessible from $x$ (path avoids $\mathcal{D}_{L}$ for this step);
  $\pi_{seg}\leftarrow$ Unique path in $M_{g}$ from $x$ to $c_{key}$;
  if $\exists e\in\pi_{seg}$ such that $e\notin\mathcal{D}_{L}$ then
    $d\leftarrow$ Randomly select such an edge $e$;
    $\mathcal{D}_{L}\leftarrow\mathcal{D}_{L}\cup\{d\}$;
    $K_{id}\leftarrow$ New unique key ID;
    $\mathcal{K}_{I}[K_{id}]\leftarrow\{\text{opens}:d,\text{loc}:c_{key}\}$;
    $\Pi_{S}$.prepend($(c_{key},\text{PICKUP }K_{id})$, $(d,\text{UNLOCK }K_{id})$, $(\pi_{seg},\text{MOVE})$);
    $x\leftarrow c_{key}$; $b\leftarrow b+1$;
  else
    break
  end if
end while
$\Pi_{S}$.prepend($(x,\text{START})$);
return $M_{g},\mathcal{D}_{L},\mathcal{K}_{I},\Pi_{S}$;

Algorithm 1: Rewind Construction of Path Skeleton
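A runnable Python sketch of this construction follows. It is our own simplified rendition, not the released generator: the function names, the retry cap when sampling $c_{key}$, and the key-naming scheme are all illustrative choices, and the path skeleton is kept as plain tuples.

```python
import random
from collections import defaultdict

def kruskal_maze(n, m, rng):
    """Random acyclic maze (spanning tree) on an n x m grid, Kruskal-style."""
    parent = {(r, c): (r, c) for r in range(n) for c in range(m)}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edges = [((r, c), (r + dr, c + dc))
             for r in range(n) for c in range(m)
             for dr, dc in ((0, 1), (1, 0))
             if r + dr < n and c + dc < m]
    rng.shuffle(edges)
    adj = defaultdict(list)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components: keep it in the tree
            parent[ru] = rv
            adj[u].append(v)
            adj[v].append(u)
    return adj

def tree_path(adj, src, dst):
    """Unique path between two cells of the tree (iterative DFS)."""
    prev, stack = {src: None}, [src]
    while stack:
        u = stack.pop()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                stack.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def rewind_construct(n, m, b_target, seed=0):
    """Rewind construction: walk backward from the goal, placing one
    key/locked-door pair per backtracking maneuver."""
    rng = random.Random(seed)
    adj = kruskal_maze(n, m, rng)
    cells = sorted(adj)
    x = goal = rng.choice(cells)
    doors, keys = set(), {}
    skeleton = [(goal, "GOAL")]
    b = 0
    while b < b_target:
        # Sample a key cell whose (unique) path from x avoids existing doors.
        for _ in range(200):
            c_key = rng.choice(cells)
            seg = tree_path(adj, x, c_key)
            seg_edges = [frozenset(e) for e in zip(seg, seg[1:])]
            if seg_edges and all(e not in doors for e in seg_edges):
                break
        else:
            break  # could not place another backtrack; stop early
        d = rng.choice(seg_edges)  # lock a random edge on the segment
        doors.add(d)
        kid = f"key_{b}"
        keys[kid] = {"opens": d, "loc": c_key}
        skeleton = [(c_key, f"PICKUP {kid}"), (d, f"UNLOCK {kid}"),
                    (tuple(seg), "MOVE")] + skeleton
        x = c_key
        b += 1
    return adj, doors, keys, [(x, "START")] + skeleton
```

Because the maze is a spanning tree, every cell pair has exactly one connecting path, which is what makes the emergent logical depth $L$ well defined and the task solvable by construction.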
### 1.2 Prompt Construction and Model Configuration
Our evaluation uses a standardized prompt template with four components: (i) task instructions and action schema, (ii) three few-shot examples of increasing complexity (simple navigation, single-key, and multi-key backtracking), (iii) optional reasoning guidance, and (iv) the problemâs natural-language facts. All models are queried using temperature $T{=}1.0$ , nucleus sampling $p{=}0.95$ , and maximum allowed setting in terms of output token limits on a per model basis. For each instance, we compute 5 independent runs to establish robust performance statistics. The complete prompt structure, shown in Figure 6, is provided in the Appendix B.
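The four-part template above can be sketched as a small builder function. The instruction wording and few-shot examples below are illustrative placeholders only; the actual template is given in Figure 6 and Appendix B.

```python
def build_prompt(facts, guidance=None):
    """Assemble the four-part prompt described in Section 1.2.
    All literal strings here are placeholder text, not the released template."""
    parts = [
        # (i) task instructions and action schema
        "You must navigate a maze. Output one action per line: "
        "MOVE <direction>, PICKUP <key>, or UNLOCK <key>.",
        # (ii) three few-shot examples of increasing complexity
        "Example 1 (simple navigation): ...",
        "Example 2 (single key): ...",
        "Example 3 (multi-key backtracking): ...",
    ]
    if guidance is not None:
        parts.append(guidance)  # (iii) optional reasoning guidance
    parts.append("Facts:\n" + "\n".join(facts))  # (iv) problem facts
    return "\n\n".join(parts)
```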
### 1.3 Evaluation Metrics
To analyze not just success but also how models fail, we employ several complementary metrics. Success Rate (Pass@1) measures the proportion of runs where the predicted action sequence exactly matches the ground truth. The Progress Ratio (Tyagi et al., 2024), calculated as $k/n$ (where $n$ is the total ground-truth actions and $k$ is the number correctly executed before the first error), pinpoints the breakdown position in reasoning. We also use Precision and Recall. Precision is the proportion of predicted actions that are correct, while Recall is the proportion of ground-truth actions that were correctly predicted. Low precision indicates hallucinated actions, while low recall signifies missed necessary actions. Additionally, we visualize error locations via a Violation Map. This multi-faceted approach reveals each model's effective "reasoning horizon": the maximum sequence length it can reliably traverse. Further details on all metrics and visualizations are provided in the supplementary material.
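These per-run metrics can be sketched as follows. This is our reconstruction from the definitions above; the paper's exact matching rules (e.g., how duplicate actions are credited toward precision and recall) may differ, so the multiset treatment here is an assumption.

```python
from collections import Counter

def sequence_metrics(pred, truth):
    """Per-run metrics: Pass@1 success (exact sequence match), progress
    ratio k/n (k = correct actions before the first error), and
    precision/recall computed over action multisets (an assumption)."""
    k = 0
    for p, t in zip(pred, truth):
        if p != t:
            break
        k += 1
    overlap = sum((Counter(pred) & Counter(truth)).values())
    return {
        "success": pred == truth,
        "progress_ratio": k / len(truth) if truth else 0.0,
        "precision": overlap / len(pred) if pred else 0.0,
        "recall": overlap / len(truth) if truth else 0.0,
    }

truth = ["MOVE N", "MOVE E", "PICKUP key_0", "UNLOCK key_0", "MOVE E"]
pred = ["MOVE N", "MOVE E", "MOVE E"]  # stops early, skips the key
m = sequence_metrics(pred, truth)
```

In this toy run the progress ratio is 0.4 (the model breaks at the third action) while precision stays at 1.0 and recall drops to 0.6, mirroring the high-precision/low-recall failure pattern seen in Figure 2.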
## 2 Benchmarking Results
<details>
<summary>figs/fig_vs_backtracking_fixed_L_shuffle1.0_noise0.0.png Details</summary>

### Visual Description
## Line Charts: Performance Metrics vs. Backtracking Steps
### Overview
Visual description (summarized): three horizontally aligned line charts sharing an x-axis of "Number of backtracking steps" (0 to 5), with panels for progress ratio mean, success rate, and number of output tokens, comparing five models: Llama-4-Maverick-17B-128E-Instruct-FP8, Qwen2.5-Coder-32B-Instruct, Llama-3.1-Nemotron-70B-Instruct-HF, Gemini-2.0-Flash, and Gemini-2.5-Flash-preview-04-17. Progress ratio and success rate decline for all models as backtracking steps increase; Gemini-2.5-Flash-preview remains the top performer throughout while using the fewest tokens, whereas Llama-4-Maverick uses the most tokens, and token counts generally rise with backtracking demands.
</details>
Figure 3: Performance as a function of the number of required backtracking steps, operationalized via the number of locked doors with distributed keys along the optimal path. Holding all other complexity factors constant, all models exhibit a clear decline in both progress ratio and success rate as backtracking demands increase. Additionally, we report the corresponding rise in output token counts per model, highlighting the increased reasoning burden associated with longer dependency chains. Fixed experimental parameters in this figure are the same as those in Figure 1 (for each point, 100 problems are sampled from $L=[40,60]$).
### 2.1 Evaluated Models
We evaluate a diverse set of transformer-based LLMs across different model families and parameter scales. Our analysis includes Gemini models (2.5-flash-preview, 2.0-flash), Meta's Llama family (4-Maverick-17B, 3.3-70B, 3.2-3B), Google's Gemma-2-27b, and Alibaba's Qwen models (2.5-Coder-32B, 2.5-7B). [Note: GPT-5 was released during the preparation of this paper's final version. Our analysis shows that this model exhibits the same performance degradation, as shown in Figure 16.] Access to some open-weight models and benchmarking infrastructure was facilitated by platforms such as Together AI (https://www.together.ai/) and Google AI Studio (https://aistudio.google.com/). Problem instances for varying logical depths ($L$) were generated by sampling 40 problems for each $L$, using a fixed maze size of $40\times 40$ and 2 keys, unless otherwise specified for specific experiments (e.g., when varying the number of keys for backtracking analysis). All models were evaluated using the standardized prompt template (see Figure 6), the inference settings detailed in Section 1.2, and a common response-parsing methodology. For each task instance, we perform 5 independent runs to establish robust performance statistics, primarily analyzing Pass@1 success rates.
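The Pass@1 aggregation over the 5 independent runs can be sketched as follows. The data layout (one list of booleans per task instance, one entry per run) is our own illustrative assumption, not the paper's evaluation harness:

```python
# Minimal sketch of Pass@1 aggregation over repeated runs (hypothetical
# data layout: one list of booleans per task instance, one entry per run).
def pass_at_1(runs_per_instance):
    """Mean per-instance success probability under a single draw.

    With n independent runs and c successes, the Pass@1 estimate for an
    instance is simply c / n; the benchmark score averages over instances.
    """
    per_instance = [sum(runs) / len(runs) for runs in runs_per_instance]
    return sum(per_instance) / len(per_instance)

# Example: 3 instances, 5 runs each.
results = [
    [True, True, False, True, True],      # 4/5 successes
    [False, False, False, False, False],  # 0/5
    [True, False, False, False, False],   # 1/5
]
print(pass_at_1(results))  # ≈ 0.333
```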
### 2.2 Universal Performance Collapse with Increasing Logical Depth
A central finding of our study is the universal collapse in reasoning performance observed across all evaluated LLMs when confronted with tasks requiring increasing sequential inference steps. As illustrated in Figure 1, Pass@1 success rates exhibit a consistent and sharp exponential decay as the ground-truth path length ($L$) increases, rapidly approaching zero past a model-specific point. To quantify and compare this decay, we fit an exponential curve $P(L)=\exp(-L/L_{0})$ to the success rates, deriving a characteristic path length $L_{0}$. This $L_{0}$ value, the path length at which performance drops by a factor of $e^{-1}$, serves as a robust metric for each model's sequential reasoning horizon. Plotting success rates on a semi-logarithmic (log-y) scale against $L$ reveals an approximately linear decay trend across the evaluated regime. This log-linear relationship suggests that errors may accumulate with a degree of independence at each reasoning step, eventually overwhelming the model's capacity for coherent inference. The observed $L_{0}$ values vary widely, from 85.7 for Gemini-2.5-Flash down to 1.6 for Llama-3.2-3B (Figure 1), underscoring a fundamental bottleneck in current transformer architectures for extended multi-step reasoning.
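Fitting $P(L)=\exp(-L/L_{0})$ amounts to a log-linear regression. A minimal sketch in Python using NumPy; the clipping of zero success rates before taking logs is our own choice, not necessarily the paper's fitting procedure:

```python
import numpy as np

# Sketch of the exponential-decay fit P(L) = exp(-L / L0).
# Taking logs gives ln P = -(1 / L0) * L, so L0 is recovered from the
# slope of a linear fit of ln P against L.
def fit_L0(lengths, success_rates, eps=1e-6):
    L = np.asarray(lengths, dtype=float)
    logp = np.log(np.clip(success_rates, eps, 1.0))  # clip zeros before log
    slope, _ = np.polyfit(L, logp, 1)                # allow an intercept for robustness
    return -1.0 / slope

# Synthetic check: data generated with L0 = 50 should be recovered closely.
L = np.arange(10, 101, 10)
p = np.exp(-L / 50.0)
print(round(fit_L0(L, p), 1))  # 50.0
```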
### 2.3 Impact of Independently Controlled Complexity Dimensions
Beyond the universal impact of logical depth ($L$) discussed in Section 2.2, our benchmark's ability to independently vary key complexity dimensions allows for targeted analysis of their distinct impacts on LLM reasoning performance. We highlight the effects of noise, backtracking, and fact ordering, primarily focusing on Pass@1 success rates, mean progress ratios, and response token counts.
<details>
<summary>figs/fig_vary_noise_fixed_L_keys2_shuffle1.0.png Details</summary>

Visual description (summarized): three horizontally arranged line charts sharing an x-axis of "Noise ratio" (0.00 to 1.00), with panels for mean progress ratio, mean success rate (pass@1), and CoT tokens, comparing Llama-4-Maverick-17B-128E-Instruct-FP8 and Gemini-2.5-Flash-preview-04-17. Both models decline in progress ratio and success rate as noise increases; Gemini outperforms Llama at every noise level (success rate falling from ~0.62 at zero noise to ~0.04 at noise 1.0, versus near zero for Llama beyond noise 0.25) while using roughly 4-5x fewer CoT tokens, whose count stays stable for Gemini and decreases slightly for Llama.
</details>
Figure 4: Performance as a function of contextual noise for the Gemini 2.5 Flash and Llama-4 Maverick-17B-128E-Instruct models. As noise increases through the inclusion of distracting or irrelevant facts, both models exhibit a clear and consistent decline in performance. Fixed experimental parameters in this figure are the same as those in Figure 1 (for each point, 100 problems are sampled from $L=[40,60]$ with 2 keys).
#### Impact of Backtracking Requirements.
Increasing the number of required backtracking steps (operationalized via key-door mechanisms) also leads to a clear and significant decline in Pass@1 success rates and mean progress ratios across all evaluated models, as shown in Figure 3. Gemini 2.5 Flash-preview maintains the highest performance but still exhibits a notable drop as the backtracking count increases from 0 to 5. This decline in reasoning accuracy is generally accompanied by an increase in, or sustained high level of, the mean number of response tokens (Figure 3, right panel). For example, models like Llama-4 Maverick and Gemini 2.5 Flash-preview show a clear upward trend or maintain high token counts as backtracking complexity rises, reflecting the increased reasoning effort or path length articulated by the models when managing more complex sequential dependencies.
#### Sensitivity to Noise Ratio.
Model performance is highly sensitive to the noise ratio, i.e., the proportion of distracting versus supporting facts. As demonstrated in Figure 4 for Gemini 2.5 Flash and Llama-4 Maverick, increasing the proportion of irrelevant facts consistently and significantly degrades both Pass@1 success rates and mean progress ratios. For instance, Gemini 2.5 Flash's Pass@1 success rate drops from over 0.7 at zero noise to approximately 0.2 at a noise ratio of 1.0. Llama-4 Maverick, starting from lower performance, also shows a consistent decline. Interestingly, for these two models, the number of CoT (output) tokens remains relatively stable despite the increasing noise and degrading performance (Figure 4, right panel), suggesting that models do not necessarily "work harder" (in terms of output length) when faced with more distractors; rather, their accuracy simply suffers.
#### Fact Ordering (Shuffle Ratio).
In contrast to the strong effects of noise and backtracking, the shuffle ratio (the entropy of fact presentation order) within the prompt appears to play a secondary role when varied in isolation. Our experiments, exemplified by the performance of Gemini 2.5 Flash and Llama-4 Maverick (see Appendix C, Figure 14 for details), show that complete shuffling of facts (randomizing their presentation order without adding or removing any information) has a minimal impact on Pass@1 success rates and mean progress ratios. Output token counts also remain stable. This suggests relative robustness to presentation order as long as all necessary information is present and distinguishable. However, as detailed in the supplementary material, when high noise and high shuffle co-occur, the combined effect can be more detrimental than either factor alone, though noise remains the dominant degrading factor.
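To make the noise and shuffle knobs concrete, the following is a purely illustrative sketch of how a fact list could be composed at a given noise ratio and shuffle ratio. The function name, arguments, and mixing scheme are our own assumptions, not seqBench's generation code:

```python
import random

# Hypothetical illustration only: names and mixing scheme are ours,
# not seqBench's generation API.
def compose_facts(supporting, distractor_pool, noise_ratio, shuffle_ratio, seed=0):
    """Mix distracting facts into supporting facts at `noise_ratio`
    (distracting : supporting), then shuffle a `shuffle_ratio` fraction
    of positions to raise the entropy of presentation order."""
    rng = random.Random(seed)
    n_noise = round(noise_ratio * len(supporting))
    facts = list(supporting) + rng.sample(list(distractor_pool), n_noise)
    n_shuffled = round(shuffle_ratio * len(facts))
    idx = sorted(rng.sample(range(len(facts)), n_shuffled))
    moved = [facts[i] for i in idx]
    rng.shuffle(moved)
    for i, f in zip(idx, moved):
        facts[i] = f
    return facts
```

With `shuffle_ratio=0.0` the supporting facts keep their original order; with `1.0` the presentation order is fully randomized while the information content is unchanged.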
### 2.4 Characterizing Key Failure Modes and Error Patterns
#### A Key Failure Mode: Omission of Critical Steps.
Beyond simply taking illegal shortcuts, detailed analysis reveals that LLMs often fail by omitting critical sub-goals necessary for task completion. Figure 2 (bottom panel) provides a quantitative view for Llama-4 Maverick (Meta AI, 2025), showing that while precision generally remains high (models infrequently hallucinate non-existent rooms or facts), recall and progress ratio plummet with increasing path length ($L$). This indicates that models predominantly fail by missing necessary actions or entire crucial sub-sequences. As a qualitative example, even capable models like Gemini-2.5-Flash can neglect essential detours, such as collecting a required key, thereby violating sequential dependencies and rendering the task unsolvable (illustrative examples are provided in Appendix B.4; see Figures 8 and 9). This pattern highlights a fundamental breakdown in robust multi-step planning and execution.
#### Path-Length Dependent First Errors: The Burden of Anticipated Complexity.
The propensity for models to make critical errors is not uniformly distributed across the reasoning process, nor is it solely a feature of late-stage reasoning fatigue. Examining the distribution of steps at which the first constraint violations occur reveals a counterintuitive pattern: as the total required path length ($L$) of a problem increases, models tend to fail more frequently even at the earliest steps of the reasoning chain. This leftward shift in the first-error distribution, also observed under increasing noise (Appendix B.4; Figures 10 and 11), contradicts a simple cumulative error model in which each step carries a fixed, independent failure probability. Instead, an error at an early step (e.g., step 5) becomes substantially more likely when the model is attempting to solve an 80-step problem rather than a 20-step problem. This suggests that the overall anticipated complexity of the full problem influences reasoning quality from the very outset, indicating a struggle with global planning or with maintaining coherence over longer horizons, rather than just an accumulation of local errors. This phenomenon may help explain why prompting techniques that decompose long problems into smaller, manageable sub-problems often succeed.
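The first-error statistic above can be computed from per-step validity checks. A minimal sketch, assuming each run yields a sequence of per-step legality flags (a hypothetical representation; the paper's actual validator is not reproduced here):

```python
from collections import defaultdict

# Sketch of the first-error statistic: locate the first constraint
# violation in each run, then bucket the resulting distribution by the
# required path length L. `step_ok` is a hypothetical per-step legality
# flag produced by some solution validator.
def first_error_step(step_ok):
    for i, ok in enumerate(step_ok):
        if not ok:
            return i
    return None  # no violation in this run

def first_error_histogram(runs):
    """runs: iterable of (L, step_ok) pairs -> {L: list of first-error steps}."""
    hist = defaultdict(list)
    for L, step_ok in runs:
        step = first_error_step(step_ok)
        if step is not None:
            hist[L].append(step)
    return dict(hist)
```

A leftward shift would then appear as the mass of `hist[L]` concentrating at smaller step indices for larger $L$.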
### 2.5 Disparity: Information Retention vs. Reasoning Capacity
On seqBench tasks, this disparity is quantitatively striking. While modern LLMs boast million-token contexts, their effective sequential reasoning depth typically remains on the order of hundreds of actions (Figure 1). This functional limit, even at several hundred actions (e.g., 300 actions, each like ("move_to", "A12") being 5-7 tokens, totaling 1.5k-2.1k tokens), still consumes a minute fraction of their nominal context. Consequently, the ratio of context capacity to reasoning tokens often spans from several hundred-fold (e.g., 500:1 for 300 actions consuming 2k tokens within a 1M context) to potentially higher values given fewer limiting actions or larger model contexts. This striking gap suggests that while transformers can store and retrieve vast information, their ability to reliably chain it for coherent, multi-step inference appears surprisingly constrained.
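The back-of-the-envelope estimate above can be reproduced directly; the action count and tokens-per-action figures are those quoted in the text:

```python
# Worked version of the back-of-the-envelope estimate in the text.
actions = 300                 # effective sequential reasoning depth (actions)
tokens_low = actions * 5      # each action, e.g. ("move_to", "A12"), is ~5-7 tokens
tokens_high = actions * 7
context = 1_000_000           # nominal context window of a long-context LLM

print(tokens_low, tokens_high)                        # 1500 2100
print(context // tokens_high, context // tokens_low)  # 476 666
```

Both bounds land in the several-hundred-fold range described in the text.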
### 2.6 Challenging the Conventional Performance Hierarchy
While metrics like average $L_{0}$ provide a general ranking of model capabilities, our fine-grained analysis reveals instances that challenge a simple linear performance hierarchy. Scatter plots of progress ratios across different models on identical tasks (see Appendix C Figure 13) show intriguing cases where models with lower overall $L_{0}$ values (i.e., typically weaker models) occasionally solve specific complex problems perfectly, while models with higher average $L_{0}$ values fail on those same instances. These performance inversions suggest that sequential reasoning failures may not solely stem from insufficient scale (parameters or general training) but could also arise from more nuanced reasoning limitations.
## 3 Related Work
Recent advancements in benchmarks evaluating sequential reasoning capabilities of LLMs have illuminated various strengths and limitations across different dimensions of complexity. These benchmarks typically differ in how they isolate and quantify reasoning challenges, such as logical deduction, retrieval difficulty, combinatorial complexity, and sensitivity to irrelevant information. ZebraLogic (Lin et al., 2025), for instance, targets formal deductive inference through logic-grid puzzles framed as constraint-satisfaction problems (csp, 2008). While valuable for probing deduction, its core methodology leads to a search space that grows factorially with puzzle size (Sempolinski, 2009). This makes it challenging to disentangle intrinsic reasoning failures from the sheer combinatorial complexity of the search. As the ZebraLogic authors themselves acknowledge: "solving ZebraLogic puzzles for large instances may become intractable… the required number of reasoning tokens may increase exponentially with the size of the puzzle." This inherent characteristic means that for larger puzzles, performance is primarily dictated by the manageability of the search space rather than the limits of sequential reasoning depth. GridPuzzle (Tyagi et al., 2024) complements this by providing a detailed error taxonomy for grid puzzles, focusing on what kinds of reasoning mistakes LLMs make. However, like ZebraLogic, it doesn't offer independent control over key complexity dimensions such as logical depth, backtracking needs, or noise, separate from the puzzle's inherent search complexity.
Other benchmarks conflate reasoning with different cognitive demands. BABILong (Kuratov et al., 2024) tests models on extremely long contexts (up to 50M tokens), primarily assessing the ability to retrieve "needles" (facts) from a "haystack" (distracting text that does not contribute to solving the task). While valuable for evaluating long-context processing, this design makes it hard to disentangle retrieval failures from reasoning breakdowns, as performance is often dictated by finding the relevant information rather than reasoning over it. MuSR (Sprague et al., 2024) embeds reasoning tasks within lengthy narratives (e.g., murder mysteries), mixing information extraction challenges with complex, domain-specific reasoning structures. This realism obscures which specific aspectâextraction or reasoning depthâcauses model failures. Dyna-bAbI (Tamari et al., 2021) offers a dynamic framework for compositional generalization but focuses on qualitative combinations rather than systematically varying quantitative complexity metrics needed to find precise failure points.
Spatial reasoning benchmarks, while relevant, also target different aspects. GRASP (Tang and Kejriwal, 2025) assesses practical spatial planning efficiency (like obstacle avoidance) in 2D grids, a different skill than the abstract sequential reasoning seqBench isolates. SPARTQA (Mirzaee et al., 2021) focuses on specialized spatial relational complexity (transitivity, symmetry) using coupled dimensions, preventing independent analysis of factors like path length. SpaRTUN (Mirzaee and Kordjamshidi, 2022) uses synthetic data primarily for transfer learning in Spatial Question Answering (SQA), aiming to improve model performance rather than serve as a diagnostic tool with controllable complexity. Similarly, StepGame (Shi et al., 2022) demonstrates performance decay with more reasoning steps in SQA but lacks the fine-grained, orthogonal controls over distinct complexity factors provided by seqBench.
In contrast, seqBench takes a targeted diagnostic approach. By deliberately simplifying the spatial environment to minimize search complexity, it isolates sequential reasoning. Its core contribution lies in the independent, fine-grained control over (1) logical depth (the number of sequential actions required to solve the task), (2) backtracking count (the number of backtracking steps along the optimal path), and (3) noise ratio (the ratio of supporting to distracting facts). This orthogonal parameterization allows us to precisely pinpoint when and why sequential reasoning capabilities degrade, revealing fundamental performance cliffs even when search and retrieval demands are trivial. seqBench thus offers a complementary tool for understanding the specific limitations of sequential inference in LLMs.
## 4 Limitations
While seqBench offers precise control over key reasoning complexities, our study has limitations that open avenues for future research:
1. Generalizability and Task Design Fidelity: Our current findings are rooted in synthetic spatial pathfinding tasks. While this allows for controlled experimentation, future work must extend seqBench's methodology to more diverse reasoning domains (e.g., mathematical proofs) and incorporate greater linguistic diversity (e.g., ambiguity) to assess the broader applicability of the observed performance collapse (quantified by $L_{0}$) and failure patterns. Moreover, this work did not investigate whether similar failure modes arise when the problem is also presented visually (e.g., as maze images). Multimodal capabilities could influence spatial reasoning outcomes, and we have already extended the benchmark by releasing maze-image generation code alongside the HuggingFace dataset. This dataset can also be used to help train multimodal reasoning models.
2. Model Scope and Understanding Deeper Failure Dynamics: Our current evaluation, while covering diverse public models, should be expanded to a wider array of LLMs, including recent proprietary and newer open-source variants (e.g., the GPT, Claude, and DeepSeek series), to rigorously assess the universality of our findings on the characteristic length $L_{0}$ and failure patterns. Furthermore, while seqBench effectively characterizes how reasoning performance degrades with logical depth (i.e., by determining $L_{0}$), two complementary research thrusts are crucial for understanding why. First, systematic investigation is needed to disentangle how $L_{0}$ is influenced by factors such as model architecture, scale (parameters, training data, compute), fine-tuning strategies, and inference-time computation (e.g., chain-of-thought depth). Second, deeper analysis is required to explain the precise mechanisms underlying the observed exponential performance collapse characterized by $L_{0}$ and to account for other non-trivial error patterns, such as path-length-dependent first errors. Additionally, the evaluation presented here does not consider how agentic systems capable of tool use perform as the reasoning complexity is tuned across various dimensions. Exploring such setups, where the LLM can externalize sub-problems, invoke tools, or backtrack programmatically, could provide valuable insight into whether the same exponential failure modes persist. In particular, one can define sequential problems where the degree of backtracking or sequential tool use is systematically varied, and test whether similar performance drops emerge as the dependency chain grows. We highlight this as a promising direction for future research.
3. Impact of Prompting: Our current study employed standardized prompts and inference settings. A crucial next step is a robust sensitivity analysis to determine how the overall decay behavior is influenced by different prompting strategies (e.g., zero-shot vs. few-shot, decomposition techniques), varied decoding parameters (temperature, top-p), and interactive mechanisms such as self-verification or self-correction. Investigating the potential of these techniques to mitigate the observed sequential inference failures, particularly given seqBench's minimal search complexity, remains a key avenue for future research.
Addressing these points by leveraging frameworks like seqBench will be vital for developing LLMs with more robust and generalizable sequential reasoning capabilities, and for understanding their fundamental performance limits.
## 5 Conclusion
We introduced seqBench, a novel benchmark framework designed for the precise attribution of sequential reasoning failures in Large Language Models. seqBench's core strength lies in its fine-grained, independent control over fundamental complexity dimensions (most notably logical depth ($L$), backtracking requirements, and noise ratio), its provision of automatically verifiable solutions, and, critically, its minimization of confounding factors such as search complexity. This design allows seqBench to isolate and rigorously evaluate the sequential inference capabilities of LLMs, enabling automatic quantification of fine-grained performance metrics (such as progress ratio) and providing a clear lens into mechanisms often obscured in other benchmarks. The framework's inherent scalability and open-source nature position it as a durable tool for assessing and driving progress in current and future generations of models, ultimately aiming to enhance their utility for complex, real-world problems that often span multiple domains. Our comprehensive evaluations using seqBench reveal that reasoning accuracy consistently collapses exponentially with increasing logical depth across a diverse range of state-of-the-art LLMs. This collapse is characterized by a model-specific parameter $L_{0}$ (Section 2.2), indicating an inherent architectural bottleneck in maintaining coherent multi-step inference. By offering this precise analysis, in alignment with the goal of advancing NLP's reach and fostering its responsible application in other fields, seqBench provides a valuable resource. It encourages a shift beyond aggregate benchmark scores towards a more nuanced understanding of model capabilities, an essential step for rigorously assessing the true impact and potential risks of applying LLMs in new domains.
The insights gleaned from seqBench can inform both NLP developers in building more robust models and experts in other disciplines in setting realistic expectations and co-designing NLP solutions that are genuinely fit for purpose. Targeted improvements, guided by such fundamental understanding, are key to enhancing the robustness of sequential reasoning, making LLMs more reliable partners in interdisciplinary endeavors. Future work should leverage these insights to develop models that can overcome the observed performance cliffs and extend their effective reasoning horizons, thereby unlocking their transformative potential in diverse interdisciplinary applications, such as navigating complex scientific literature, supporting intricate legal analysis, or enabling robust multi-step planning in critical autonomous systems. Focusing on commonsense reasoning is paramount for NLP to achieve transformative societal impact, moving beyond incremental improvements to genuine breakthroughs.
## References
- csp (2008) 2008. Rina Dechter, Constraint Processing, Morgan Kaufmann (2003), ISBN 1-55860-890-7; Francesca Rossi, Peter van Beek, and Toby Walsh, editors, Handbook of Constraint Programming, Elsevier (2006), ISBN 978-0-444-52726-4. Computer Science Review, 2:123–130.
- Anthropic (2025) Anthropic. 2025. Claude 3.7 sonnet. https://www.anthropic.com/news/claude-3-7-sonnet.
- Berglund et al. (2024) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2024. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". Preprint, arXiv:2309.12288.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Carroll and Ruppert (2017) Raymond J Carroll and David Ruppert. 2017. Transformation and weighting in regression. Chapman and Hall/CRC.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. Preprint, arXiv:1803.05457.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
- Du et al. (2021) Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, and 8 others. 2021. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Preprint, arXiv:1903.00161.
- Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.
- Google DeepMind (2025) Google DeepMind. 2025. Gemini 2.5 pro experimental. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/.
- Han et al. (2024) Pengrui Han, Peiyang Song, Haofei Yu, and Jiaxuan You. 2024. In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models. Preprint, arXiv:2409.15454.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. Training compute-optimal large language models. Preprint, arXiv:2203.15556.
- Kleinberg and Tardos (2006) Jon Kleinberg and Eva Tardos. 2006. Algorithm Design. Pearson/Addison-Wesley, Boston.
- Kojima et al. (2022) Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
- Kuratov et al. (2024) Yury Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. 2024. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554.
- Lieber et al. (2021) Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. https://www.ai21.com/blog/jurassic-1-technical-details-and-evaluation. White Paper.
- Lin et al. (2025) Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. 2025. ZebraLogic: On the scaling limits of LLMs for logical reasoning. Preprint, arXiv:2502.01100.
- Meta AI (2025) Meta AI. 2025. Llama 4: Open and efficient multimodal language models. https://github.com/meta-llama/llama-models.
- Mirzaee et al. (2021) Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. 2021. SpartQA: A textual question answering benchmark for spatial reasoning. Preprint, arXiv:2104.05832.
- Mirzaee and Kordjamshidi (2022) Roshanak Mirzaee and Parisa Kordjamshidi. 2022. Transfer learning with synthetic corpora for spatial role labeling and reasoning. Preprint, arXiv:2210.16952.
- Mistral AI (2024) Mistral AI. 2024. Mistral large 2. https://mistral.ai/news/mistral-large-2407.
- Nezhurina et al. (2025) Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. 2025. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. Preprint, arXiv:2406.02061.
- OpenAI (2025) OpenAI. 2025. OpenAI GPT-5, o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, https://openai.com/index/introducing-gpt-5/. After the GPT-5 release, the paper's supplementary material (appendix) was revised with a new figure showing that GPT-5 exhibits the same failure pattern observed in this paper.
- Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Matthias Rauh, Po-Sen Huang, and 58 others. 2021. Scaling language models: Methods, analysis & insights from training Gopher. Preprint, arXiv:2112.11446.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. GPQA: A graduate-level Google-proof Q&A benchmark. Preprint, arXiv:2311.12022.
- Sempolinski (2009) Peter Sempolinski. 2009. Automatic solutions of logic puzzles.
- Sharma (2024) Manasi Sharma. 2024. Exploring and improving the spatial reasoning abilities of large language models. In I Can't Believe It's Not Better Workshop: Failure Modes in the Age of Foundation Models.
- Shi et al. (2022) Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. 2022. StepGame: A new benchmark for robust multi-hop spatial reasoning in texts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11321–11329.
- Smith et al. (2022) Samuel Smith, Mostofa Patwary, Brian Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhenhao Liu, Shrimai Prabhumoye, Georgios Zerveas, Vikas Korthikanti, Eric Zhang, Rewon Child, Reza Yazdani Aminabadi, Jared Bernauer, Xia Song, Mohammad Shoeybi, Yuxin He, Michael Houston, Shishir Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. Preprint, arXiv:2201.11990.
- Sprague et al. (2024) Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. 2024. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. Preprint, arXiv:2310.16049.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, and 432 others. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Preprint, arXiv:2206.04615.
- Tamari et al. (2021) Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson Liu, Reut Tsarfaty, and Dafna Shahaf. 2021. Dyna-bAbI: Unlocking bAbI's potential with dynamic synthetic benchmarking. Preprint, arXiv:2112.00086.
- Tang and Kejriwal (2025) Zhisheng Tang and Mayank Kejriwal. 2025. GRASP: A grid-based benchmark for evaluating commonsense spatial reasoning. Preprint, arXiv:2407.01892.
- Thoppilan et al. (2022) Rami Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yi Du, Yanping Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Max Krikun, Dmitry Lepikhin, James Qin, and 38 others. 2022. LaMDA: Language models for dialog applications. Technical report, Google Research. arXiv preprint.
- Tikhonov (2024) Alexey Tikhonov. 2024. Plugh: A benchmark for spatial understanding and reasoning in large language models. Preprint, arXiv:2408.04648.
- Tyagi et al. (2024) Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, and Chitta Baral. 2024. Step-by-step reasoning to solve grid puzzles: Where do llms falter? Preprint, arXiv:2407.14790.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
- Weston et al. (2015) Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards ai-complete question answering: A set of prerequisite toy tasks. Preprint, arXiv:1502.05698.
- Yang et al. (2019) Kaiyu Yang, Olga Russakovsky, and Jia Deng. 2019. SpatialSense: An adversarially crowdsourced benchmark for spatial relation recognition. In International Conference on Computer Vision (ICCV).
- Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. ST-MoE: Designing stable and transferable sparse expert models. Preprint, arXiv:2202.08906.
## Appendices
## Appendix A Dataset Generation Details
The seqBench benchmark generates pathfinding tasks by systematically controlling several complexity dimensions. As described in Section 1 (main paper), Algorithm 1 is central to this process. This appendix provides further details on the generation phases, natural language encoding of tasks, and specific dataset parameters.
### A.1 Generation Phases
The generation process, guided by Algorithm 1, involves three main phases:
1. Base Maze Construction: An initial $N\times M$ grid is populated, and an acyclic maze graph ( $M_{g}$ ) is formed using Kruskal's algorithm (Kleinberg and Tardos, 2006). This ensures a simply connected environment in which a unique path exists between any two cells if all internal "walls" (potential door locations) were open. The overall process results in maze instances like the one visualized in Figure 5.
1. Rewind Construction for Path Skeleton and Key/Door Placement: This phase implements the "Rewind Construction" (Algorithm 1 in the main paper). Starting from a randomly selected goal cell ( $C_{goal}$ ), the algorithm works backward to define a solvable path skeleton ( $\Pi_{S}$ ). It iteratively:
   1. Selects a cell $c_{key}$ that would be a preceding point on a path towards the current cell $x$ (initially $C_{goal}$ ).
   1. Identifies the unique path segment $\pi_{seg}$ in $M_{g}$ from $x$ to $c_{key}$ .
   1. Randomly selects an edge $d$ on this segment $\pi_{seg}$ to become a locked door. This edge $d$ is added to the set of locked doors $\mathcal{D}_{L}$ .
   1. Conceptually places a new unique key $K_{id}$ at $c_{key}$ and stores its information (which door it opens, its location) in $\mathcal{K}_{I}$ .
   1. Prepends the conceptual steps (moving along $\pi_{seg}$ , unlocking door $d$ with $K_{id}$ , picking up $K_{id}$ at $c_{key}$ ), in reverse logical order, to the path skeleton $\Pi_{S}$ .
   1. Updates the current cell $x$ to $c_{key}$ and repeats until the target number of backtracks ( $\mathcal{B}$ ) is achieved or no valid placements remain.
This backward construction ensures solvability and controlled backtracking complexity. The final agent starting position is the cell $x$ at the end of this phase.
1. Fact Compilation and Noise Injection: Based on the final maze structure ( $M_{g},\mathcal{D}_{L},\mathcal{K}_{I}$ ), a set of natural language facts $\mathcal{F}$ is compiled. This includes facts describing room connections, key locations, and door states. Distracting facts are then introduced based on the target noise ratio $\mathcal{N}$ . These distractors might describe non-existent connections, spurious keys, or misleading adjacencies, chosen to be plausible yet incorrect.
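The rewind loop described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions (a plain adjacency-list representation of the acyclic maze graph and a `tree_path` BFS helper, both hypothetical names), not the released generator; noise injection is omitted.

```python
import random
from collections import deque

def tree_path(adj, src, dst):
    """Unique path between two cells of an acyclic maze graph (BFS)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]  # src ... dst

def rewind_construct(adj, goal, n_backtracks, rng=random):
    """Rewind Construction sketch: starting at the goal, repeatedly lock a
    door on the path to a key cell chosen further upstream, then continue
    rewinding from that key cell.  Returns (start, locked_doors, keys)."""
    locked, keys = set(), {}
    x = goal
    for key_id in range(1, n_backtracks + 1):
        candidates = [c for c in adj if c not in (x, goal)]
        rng.shuffle(candidates)
        for c_key in candidates:
            seg = tree_path(adj, x, c_key)
            free = [frozenset(e) for e in zip(seg, seg[1:])
                    if frozenset(e) not in locked]
            if free:
                door = rng.choice(free)
                locked.add(door)               # this edge becomes a locked door
                keys[key_id] = (c_key, door)   # key for `door` placed at c_key
                x = c_key                      # continue rewinding from the key cell
                break
        else:
            break  # no valid placement remains
    return x, locked, keys
```

Because each locked door lies on the segment between the key cell and the previous position, replaying the construction forward guarantees a solvable instance with the requested number of backtracks (when placements exist).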
Figure 5: Example visualization of a $6\times 6$ seqBench maze instance. Red rectangles denote locked doors, dashed lines indicate the locations of keys corresponding to those doors, and triangles mark the start (upward-pointing) and goal (downward-pointing) positions. This illustrates the spatial nature of the tasks.
### A.2 Natural Language Encoding
Each task instance is translated into a set of atomic natural language facts. We use a consistent templating approach:
- Room Connections: "Room A1 and B1 are connected by an open door."
- Locked Connections: "Room C3 and D3 are connected by a closed and locked door."
- Key Requirements: "The locked door between C3 and D3 requires key 5." (Key IDs are simple integers).
- Key Placements: "Key 5 is in room E4." (Room IDs use spreadsheet-like notation, e.g., A1, B2).
- Starting Position: "Bob is in room A2."
- Goal Position: "Alice is in room D5."
The full set of facts for a given problem constitutes its description.
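A minimal templating function in this style might look as follows (the function and argument names are illustrative; the released generator may order and interleave facts differently):

```python
def encode_facts(open_edges, locked_doors, key_rooms, start, goal):
    """Render one maze instance as atomic natural-language facts,
    following the templates listed above."""
    facts = [f"Room {a} and {b} are connected by an open door."
             for a, b in open_edges]
    for (a, b), key_id in locked_doors.items():
        facts.append(f"Room {a} and {b} are connected by a closed and locked door.")
        facts.append(f"The locked door between {a} and {b} requires key {key_id}.")
    for key_id, room in key_rooms.items():
        facts.append(f"Key {key_id} is in room {room}.")
    facts.append(f"Bob is in room {start}.")
    facts.append(f"Alice is in room {goal}.")
    return facts
```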
### A.3 Dataset Parameters and Scope
The seqBench dataset was generated using the following parameter ranges based on the generation configuration:
- Grid Sizes ( $N\times M$ ): $N$ and $M$ range from 5 to 50, with $M=N$ for all configurations (e.g., [5,5], [6,6], …, [50,50]).
- Target Backtracking Steps ( $\mathcal{B}$ ): Values from 0 to 7. This controls the number of key-door mechanisms deliberately placed on the optimal path.
- Noise Ratio ( $\mathcal{N}$ ): Values from $0.0$ (no distracting facts) to $1.0$ (equal number of supporting and distracting facts), typically in increments of $0.2$ .
- Instances per Configuration: For each primary configuration, defined by a specific grid size ( $N,M$ ) and a specific target backtracking step count ( $\mathcal{B}\in\{0..7\}$ ), 400 unique base maze instances were generated.
- Logical Depth ( $L$ ): As an emergent property, $L$ varies. Experiments typically select problems from these generated instances that fall into specific $L$ bins (e.g., $L\in[10,11),[11,12),\ldots$ ).
This generation pipeline, leveraging the described parameter ranges and variations, can produce a vast and diverse set of problem instances. The publicly released seqBench dataset, used for the analyses in this paper (see main paper for access link), comprises 7,079 such curated instances. This collection offers a rich resource for studying the combined effects of the controlled complexity dimensions.
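As a rough illustration, the configuration grid described above can be enumerated directly; the counts below follow the stated parameter ranges (the released 7,079 instances are a curated subset, not this full grid):

```python
from itertools import product

grid_sizes = [(n, n) for n in range(5, 51)]            # N = M, 5..50
backtracks = list(range(8))                            # B in {0..7}
noise_ratios = [round(0.2 * i, 1) for i in range(6)]   # 0.0, 0.2, ..., 1.0

# 400 unique base mazes per (grid size, backtrack count) configuration
primary_configs = list(product(grid_sizes, backtracks))
total_base_instances = 400 * len(primary_configs)
```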
## Appendix B Prompt Design and Model Configuration Details
This appendix provides the complete details of the prompt structure and model configurations used for evaluating LLMs on the seqBench benchmark. The overall prompt, illustrated in Figure 6, concatenates four main components which are detailed below.
<details>
<summary>figs/prompt_template.png Details</summary>

### Visual Description
## Diagram: Prompt Template for Maze Navigation Problem
### Overview
The image displays a structured "Prompt Template" designed to instruct an AI agent on solving a maze navigation problem. The template is organized into four distinct, color-coded sections, each with a specific role in defining the task, providing examples, offering reasoning guidance, and presenting the problem facts. The overall purpose is to guide an agent to find the shortest path for "Bob" to rescue "Alice" in a maze of connected rooms with locked doors and keys.
### Components/Axes
The diagram is segmented into four primary rectangular boxes, each with a vertical label on its left or right side.
1. **Task Description (Left, Pink Box):**
* **Label:** "Task Description" (vertical text on the left).
* **Content:** Defines the agent's role, the core task, the structure of the maze description, valid actions, key constraints, and the required output format.
2. **Examples (Bottom-Left, Green Box):**
* **Label:** "Examples" (vertical text on the left).
* **Content:** Provides a complete example input (FACTS) and the corresponding correct output (a Python list of action tuples).
3. **Reasoning Guidance (Top-Right, Light Blue Box):**
* **Label:** "Reasoning Guidance" (vertical text on the right).
* **Content:** Lists a step-by-step procedure for the agent to follow to complete the task, including strategies for handling complexity.
4. **Problem Facts (Bottom-Right, Yellow Box):**
* **Label:** "Problem Facts" (vertical text on the right).
* **Content:** Presents the specific maze configuration for the current problem, including room connections, door states, key locations, and the start/end points.
### Detailed Analysis / Content Details
**1. Task Description Box Content:**
* **Agent Role:** "You are a problem solving agent that thinks carefully step by step based on provided facts and follows instructions closely."
* **TASK:** "Help Bob navigate through a maze of connected rooms to rescue Alice. Bob starts in a specified room and needs to find the optimal path to reach Alice's location, following the maze's rules about room connections and door locks."
* **MAZE DESCRIPTION CONTAINS:**
1. Room connections (open or locked/closed doors).
2. Door information (open or locked).
3. Key information (location and which doors they unlock).
4. Starting location (Bob's start).
5. Target location (Alice's location).
* **Valid actions:** `start`, `move_to`, `pick_up_key`, `use_key`, `unlock_and_open_door_to`, `rescue`.
* **Action & parameter syntax:**
* Room IDs: Column-Row (e.g., 'A1').
* Key IDs: positive integers (e.g., '1').
* `start`/`move_to`: room ID.
* `pick_up_key`/`use_key`: key ID.
* `unlock_and_open_door_to`: room ID.
* `rescue`: 'Alice'.
* **KEY CONSTRAINTS:**
1. Each move must be between adjacent and connected rooms.
2. Keys must be picked up before use.
3. Locked doors require use of their specific key to unlock.
4. Optimal path minimizes actions/distance.
5. `use_key` action always comes right before `unlock_and_open_door_to`.
6. If the response is missing any intermediate action it is invalid.
* **IMPORTANT:** Use only provided IDs.
* **OUTPUT FORMAT REQUIREMENT:** "Your solution must be formatted as a Python list of tuples representing each action in chronological order."
* Example format: `[('start', 'A1'), ('move_to', 'B1'), ('pick_up_key', '3'), ('use_key', '3'), ('unlock_and_open_door_to', 'C1'), ('rescue', 'Alice')]`
**2. Examples Box Content:**
* **INPUT (FACTS):**
* Room C4 and C3 are connected by an open door.
* Room C3 and D3 are connected by an open door.
* Room D5 and E5 are connected by an open door.
* Room A2 and A1 are connected by an open door.
* Room A3 and B3 are connected by an open door.
* Room A1 and B1 are connected by an open door.
* Room A4 and A3 are connected by an open door.
* Room E5 and E4 are connected by an open door.
* Room D4 and D3 are connected by an open door.
* Room A5 and B5 are connected by an open door.
* Room D4 and E4 are connected by an open door.
* Bob is in room D5.
* Alice is in room C4.
* **OUTPUT:**
`[('start', 'D5'), ('move_to', 'E5'), ('move_to', 'E4'), ('move_to', 'D4'), ('move_to', 'D3'), ('move_to', 'C3'), ('move_to', 'C4'), ('rescue', 'Alice')]`
**3. Reasoning Guidance Box Content:**
* **TO COMPLETE THIS TASK FOLLOW THESE STEPS:**
1. Find the shortest path from Bob to Alice.
2. Identify any locked doors on this path.
3. For each locked door, find its required key.
4. Plan key collection order to ensure you have each key before reaching its door.
5. Track all actions while following the rules.
6. Avoid unnecessary steps that increase the total path length.
* **IF THE PATH SEEMS COMPLEX:**
* Break it into smaller segments.
* Solve each segment separately.
* Combine the solutions while maintaining optimality.
* **Remember** to think step by step and verify each move.
* **Proceed** to provide your solution as a list of tuples in chronological order.
**4. Problem Facts Box Content (The Current Problem):**
* **FACTS:**
* Room A6 and A5 are connected by an open door.
* Room A6 and B6 are connected by an open door.
* Room B6 and C6 are connected by an open door.
* Room C6 and D6 are connected by an open door.
* Room C5 and C4 are connected by an open door.
* Room C4 and D4 are connected by an open door.
* Room D6 and D5 are connected by a closed and locked door.
* The locked door between D6 and D5 requires key 10.
* Key 10 is in room A5.
* Room D6 and E6 are connected by an open door.
* Room D5 and D4 are connected by an open door.
* Room E6 and F6 are connected by an open door.
* Room A4 and A3 are connected by an open door.
* Bob is in room F6.
* Alice is in room C5.
* **YOUR SOLUTION:** [This section is blank, intended for the agent's output.]
### Key Observations
* **Structured Input:** The template enforces a strict separation between the general task rules (Task Description), illustrative examples, problem-solving heuristics (Reasoning Guidance), and the specific problem instance (Problem Facts).
* **Critical Constraint:** The `use_key` action must immediately precede the `unlock_and_open_door_to` action for the same key/door pair.
* **Optimality Requirement:** The solution must be the shortest path, minimizing both actions and distance.
* **Output Rigor:** The solution must be a complete, chronological list of action tuples with no missing intermediate steps.
* **Problem Specifics:** In the given problem, Bob starts at F6 and Alice is at C5. The primary obstacle is a locked door between D6 and D5, which requires Key 10 located in room A5. The agent must plan a route from F6 to A5 to collect the key, then to the D6-D5 door to unlock it, and finally to C5.
### Interpretation
This prompt template is a sophisticated framework for testing or deploying an AI agent's planning and reasoning capabilities in a constrained, symbolic environment. It moves beyond a simple problem statement by embedding:
1. **Meta-Cognition:** The "Reasoning Guidance" section explicitly instructs the agent on *how* to think (break down problems, verify steps), promoting robust chain-of-thought reasoning.
2. **Few-Shot Learning:** The "Examples" section provides a clear demonstration of the expected input-output mapping, reducing ambiguity in the output format.
3. **Formalized Action Space:** By defining precise action syntax and constraints, it creates a verifiable and deterministic environment for the agent's operations.
4. **Hierarchical Problem Decomposition:** The structure encourages the agent to first understand the general rules, then see an example, then adopt a solving strategy, and finally apply all of this to the specific facts. This mirrors effective human problem-solving pedagogy.
The template's design suggests it is intended for evaluating or training AI systems on tasks requiring spatial reasoning, sequential planning, and strict adherence to procedural rules. The inclusion of locked doors and keys adds a layer of dependency management, making the problem non-trivial and requiring look-ahead planning.
</details>
Figure 6: The complete prompt structure passed to the LLMs. This includes: Component 1 (System Instructions and Task Definition), one of the three Few-Shot Examples (Component 2, specifically a simple navigation task), Component 3 (Reasoning Guidance), and an illustration of where the Problem Instance Facts (Component 4) are inserted. For clarity and completeness, the full verbatim text for all three few-shot examples (Component 2) is provided in Figure 7.
### B.1 Overall Prompt Components
The prompt presented to the LLMs consists of the following components:
1. System Instructions and Task Definition (Component 1): Outlines the agentâs task, the structure of the maze description, valid actions and their syntax, key operational constraints, and the required output format.
1. Few-Shot Examples (Component 2): Three examples are provided to illustrate the task, ranging in complexity. One of these examples (a simple navigation task) is detailed in Figure 6. The verbatim text for all three examples is provided in Figure 7 for completeness.
1. Reasoning Guidance and Self-Assessment (Component 3): Offers step-by-step algorithmic tips for solving the task and requests the model to provide a self-assessment of its confidence and the perceived difficulty of the instance.
1. Problem Instance Facts (Component 4): The specific natural language facts describing the current maze configuration for the task instance. As illustrated in Figure 6, these facts are appended after the preceding components and are followed by the line "YOUR SOLUTION:" to prompt the model. These facts are generated using the templates described in Appendix A.
1. Example 1 (Simple Navigation): This example, as shown in Figure 6, involves navigating a maze with only open doors.
EXAMPLE:
INPUT:
Maze Structure: Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by an open door. Room D5 and E5 are connected by an open door. Room A2 and A1 are connected by an open door. Room A3 and B3 are connected by an open door. Room A1 and B1 are connected by an open door. Room A4 and A3 are connected by an open door. Room E5 and E4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Room D4 and E4 are connected by an open door. Bob is in room D5. Alice is in room C4.
OUTPUT:
Solution: [('start', 'D5'), ('move_to', 'E5'), ('move_to', 'E4'), ('move_to', 'D4'), ('move_to', 'D3'), ('move_to', 'C3'), ('move_to', 'C4'), ('rescue', 'Alice')]
1. Example 2 (Single-Key Backtracking): This example introduces a single locked door and a corresponding key.
EXAMPLE:
INPUT:
Maze Structure: Room A1 and A2 are connected by an open door. Room A2 and B2 are connected by an open door. Room B1 and B2 are connected by an open door. Room B1 and C1 are connected by an open door. Room C1 and C2 are connected by a closed and locked door. Door between C1 and C2 requires key 1. Key 1 is in room A2. Bob is in room A1. Alice is in room C2.
OUTPUT:
Solution: [('start', 'A1'), ('move_to', 'A2'), ('pick_up_key', '1'), ('move_to', 'B2'), ('move_to', 'B1'), ('move_to', 'C1'), ('use_key', '1'), ('unlock_and_open_door_to', 'C2'), ('move_to', 'C2'), ('rescue', 'Alice')]
1. Example 3 (Multi-Key Backtracking): This example presents a more complex scenario with multiple locked doors and keys, requiring more extensive backtracking.
```
EXAMPLE:
INPUT:
Maze Structure: Room B5 and B4 are connected by a closed and locked door. The locked door between B5 and B4 requires key 3. Key 3 is in room B5. Room B5 and C5 are connected by a closed and locked door. The locked door between B5 and C5 requires key 16. Key 16 is in room C5. Room B4 and C4 are connected by an open door. Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by a closed and locked door. The locked door between C3 and D3 requires key 10. Key 10 is in room C4. Room D5 and D4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Bob is in room C5. Alice is in room D5.
OUTPUT:
Solution: [('start', 'C5'), ('pick_up_key', '16'), ('use_key', '16'), ('unlock_and_open_door_to', 'B5'), ('move_to', 'B5'), ('pick_up_key', '3'), ('use_key', '3'), ('unlock_and_open_door_to', 'B4'), ('move_to', 'B4'), ('move_to', 'C4'), ('pick_up_key', '10'), ('move_to', 'C3'), ('use_key', '10'), ('unlock_and_open_door_to', 'D3'), ('move_to', 'D3'), ('move_to', 'D4'), ('move_to', 'D5'), ('rescue', 'Alice')]
```
Figure 7: Few-shot examples provided to guide the LLMs in the maze-solving task, demonstrating three scenarios of increasing complexity: simple navigation, single-key backtracking, and multi-key backtracking.
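A practical note on the output format above: once the quotes are normalized, the `Solution:` line is valid Python literal syntax, so a model reply can be checked mechanically. A minimal parsing sketch (the function name and the bracket-extraction heuristic are illustrative, not the benchmark's actual harness):

```python
# Sketch: extract the (action, argument) tuples from a model reply of the form
# "Solution: [('start', 'A1'), ('move_to', 'A2'), ...]".
import ast

def parse_solution(text: str) -> list[tuple[str, str]]:
    """Parse the bracketed action list; relies on the list being a Python literal."""
    start = text.index("[")          # first bracket opens the action list
    end = text.rindex("]") + 1       # last bracket closes it
    actions = ast.literal_eval(text[start:end])
    return [(str(act), str(arg)) for act, arg in actions]

reply = "Solution: [('start', 'A1'), ('move_to', 'A2'), ('pick_up_key', '1')]"
print(parse_solution(reply))
# -> [('start', 'A1'), ('move_to', 'A2'), ('pick_up_key', '1')]
```

`ast.literal_eval` is preferable to `eval` here because it only accepts literal structures, so malformed or adversarial model output raises an exception instead of executing code.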
### B.2 Evaluation Metrics and Error Analysis Details
This section provides further details on specific aspects of our evaluation metrics and observed error categories, complementing the overview of metrics in Section 1 of the main paper and the discussion of failure modes in Section 2 of the main paper.
#### Observed Violation Categories
Failures in model solutions on seqBench tasks can be categorized into several types. Understanding these categories is crucial for interpreting model performance and failure modes. Key types of violations observed include:
- Adjacency errors (e.g., attempting to move between unconnected rooms).
- Locked door errors (e.g., navigating through locked doors without the correct key or without unlocking them).
- Key usage errors (e.g., attempting to use keys not yet collected, or using the wrong key for a door).
- Path inefficiency (e.g., taking unnecessary detours or redundant actions; while not always a hard violation that stops progress, this contributes to solutions not matching the optimal path and thus failing Pass@1).
- Missed critical actions (e.g., failing to pick up a necessary key or unlock a required door). This is a key failure mode discussed in the main paper (Section 2.4) and is often reflected in metrics like low recall or a low progress ratio if the omission occurs early and prevents further correct steps.
Identifying these distinct categories of errors provides a more granular understanding of why models fail on sequential reasoning tasks and helps in the interpretation of aggregate performance metrics reported in the main paper.
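The violation categories above can be detected automatically by replaying a proposed action list against the maze description. The sketch below assumes a simplified world model (`adjacent` as unordered room pairs, `locks` mapping a door to its required key, `keys` mapping a key id to its room); the names and encoding are illustrative, not the benchmark's actual checker.

```python
# Sketch: replay an action list and report the first violation, if any.

def first_violation(actions, adjacent, locks, keys):
    """Return (step, category) of the first violation, or None if the path is legal."""
    pos = actions[0][1]                       # ('start', room)
    held, unlocked = set(), set()
    for step, (act, arg) in enumerate(actions[1:], start=1):
        if act == "pick_up_key":
            if keys.get(arg) != pos:
                return step, "key_usage"      # key is not in the current room
            held.add(arg)
        elif act == "use_key":
            if arg not in held:
                return step, "key_usage"      # key was never collected
        elif act == "unlock_and_open_door_to":
            door = frozenset((pos, arg))
            if locks.get(door) not in held:
                return step, "locked_door"    # wrong key, or key not held
            unlocked.add(door)
        elif act == "move_to":
            door = frozenset((pos, arg))
            if door not in adjacent:
                return step, "adjacency"      # rooms are not connected
            if door in locks and door not in unlocked:
                return step, "locked_door"    # moved through a still-locked door
            pos = arg
    return None

# Maze from Example 2 above, in this illustrative encoding.
adjacent = {frozenset(p) for p in
            [("A1", "A2"), ("A2", "B2"), ("B1", "B2"), ("B1", "C1"), ("C1", "C2")]}
locks = {frozenset(("C1", "C2")): "1"}
keys = {"1": "A2"}

bad = [("start", "A1"), ("move_to", "C2")]    # A1 and C2 are not connected
print(first_violation(bad, adjacent, locks, keys))
# -> (1, 'adjacency')
```

Path inefficiency is deliberately not flagged here: an inefficient path can still be violation-free, which is why it is scored via Pass@1 against the optimal solution rather than as a hard violation.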
### B.3 Violation Map: Qualitative Examples of Model Failures
This section provides qualitative examples of characteristic model failures to illustrate common error types. These examples visually support the discussion of failure modes in the main paper (Section 2.4, "A Key Failure Mode: Omission of Critical Steps"). Figure 8 illustrates a significant error by Gemini-2.5-Flash on a complex task, where the model generates an illegal path, bypassing necessary steps and locked doors. This exemplifies a breakdown in multi-step planning. Additionally, Figure 9 shows another common "adjacency error," where a model attempts to jump between unconnected rooms. This type of error reveals a critical lapse in grounding its generated actions within the spatial adjacencies explicitly stated by the task's input facts.
<details>
<summary>figs/goodexample4040.png Details</summary>

### Visual Description
## Diagram Comparison: Optimal Path vs. Model Path
### Overview
The image displays two side-by-side diagrams on a white background, each titled and depicting a pathfinding or navigation scenario on an identical underlying grid. The left diagram is titled "Optimal Path," and the right is titled "Model Path." Both diagrams show a complex lattice of black dots connected by short black lines, forming a grid or network. Overlaid on this grid are colored paths and dashed reference lines. The primary purpose is to visually compare a theoretically optimal route with a route generated by a model.
### Components/Axes
* **Titles:** Centered at the top of each respective diagram.
* Left: "Optimal Path"
* Right: "Model Path"
* **Base Grid:** A uniform lattice of small black dots (nodes) connected by short, thin black lines (edges), forming a dense, maze-like network across the entire area of each diagram.
* **Paths (Primary Data Series):**
* **Optimal Path (Left Diagram):** A thick, continuous **yellow** line tracing a complex, winding route through the grid.
* **Model Path (Right Diagram):** a thick, continuous **purple** line tracing a different, somewhat more direct but still irregular route through the identical grid.
* **Reference Lines (Secondary Elements):**
* An **orange dashed line** runs diagonally from the upper-left quadrant towards the center in both diagrams. Its position and angle appear identical in both.
* A **blue dashed line** runs diagonally from the lower-left quadrant towards the center-right in both diagrams. Its position and angle also appear identical in both.
* **Markers:**
* A small, solid **black triangle** is present in both diagrams. In the "Optimal Path" diagram, it is located near the center-left. In the "Model Path" diagram, it is located near the center-right. This likely marks a start point, end point, or key waypoint.
* A small, solid **black circle** is present at the terminus of the blue dashed line in both diagrams.
### Detailed Analysis
**Spatial Grounding & Path Description:**
1. **Optimal Path (Yellow):**
* **Trend:** The path is highly non-linear and exploratory. It begins (or passes through) the area near the black triangle in the center-left. It initially moves upward and right, then executes a large, looping detour to the far left side of the grid. It then winds back towards the center, makes another significant excursion to the lower-right quadrant, and finally terminates near the black circle at the end of the blue dashed line. The path frequently doubles back and makes sharp turns, suggesting it is navigating around obstacles or following a cost-minimizing algorithm that values path quality over directness.
* **Key Segments:** Notable features include a long vertical segment on the far left, a dense cluster of turns in the upper-middle area, and a final approach to the endpoint from the south.
2. **Model Path (Purple):**
* **Trend:** The path is more direct than the yellow path but still contains significant deviations. It appears to start (or pass through) the black triangle, now located in the center-right. It moves generally leftward and downward, then curves upward to approach the same black circle endpoint. While it avoids the extreme detours of the optimal path, it is not a straight line; it has several bends and a notable "S" curve in its middle section.
* **Key Segments:** The path has a smoother overall trajectory but clearly does not follow the shortest geometric line between its apparent start and end points.
**Cross-Reference of Reference Lines:**
The orange and blue dashed lines are static reference elements, identical in placement across both diagrams. They do not interact with the paths directly but provide fixed spatial anchors for comparison. The yellow optimal path crosses the orange dashed line twice and the blue dashed line once. The purple model path crosses the orange dashed line once and terminates at the blue dashed line's endpoint.
### Key Observations
1. **Path Divergence:** The most striking observation is the significant difference in route choice between the "Optimal" and "Model" paths, despite operating in the same environment (grid). The optimal path is far more circuitous.
2. **Grid Complexity:** The underlying black dot-and-line grid is highly complex and irregular, resembling a road network or a state-space graph with many possible connections. This complexity explains why even the "optimal" path is not simple.
3. **Shared Landmarks:** The black triangle and circle serve as common reference points, but their relative positions to the paths differ. The triangle is near the start of the yellow path but near the middle of the purple path's trajectory. Both paths share the same final endpoint (black circle).
4. **Visual Efficiency:** The model path (purple) appears to cover less total distance and has fewer sharp turns than the optimal path (yellow), suggesting the model may be optimizing for a different metric (e.g., smoothness, predictability) or has learned an approximate policy.
### Interpretation
This image is a classic visualization used in fields like robotics, reinforcement learning, or algorithmic path planning. It demonstrates the difference between a ground-truth, computationally derived optimal solution and the solution produced by a trained model (e.g., a neural network).
* **What the Data Suggests:** The "Optimal Path" likely represents the result of an exhaustive search algorithm (like A* or Dijkstra's) that guarantees the shortest or least-cost path through the complex grid, albeit at high computational cost. Its winding nature indicates the grid contains many high-cost areas or obstacles that must be circumvented. The "Model Path" represents a learned policy's attempt to navigate the same space. Its more direct but imperfect route suggests the model has generalized a "good enough" strategy that balances efficiency with computational speed, but it has not fully replicated the optimal solution's nuanced navigation of the cost landscape.
* **Relationship Between Elements:** The static dashed lines and markers provide a fixed frame of reference, allowing the viewer to easily see how each path relates to the same spatial landmarks. The identical grid confirms the comparison is fair.
* **Notable Anomalies/Insights:** The fact that the model's path is *less* tortuous than the optimal path is counterintuitive and highly significant. It implies that the model's objective function or training data may not perfectly align with the true optimality criteria used to generate the yellow path. Alternatively, the model might be avoiding areas that are technically optimal but risky or unstable in a way not captured by the grid's cost structure. This visualization effectively highlights the "sim-to-real" or "theory-to-practice" gap in learned navigation systems.
</details>
Figure 8: Illustrative failure case for Gemini-2.5-Flash on a 40x40 task with 2 locked doors on the optimal path. Left: Optimal path (yellow). Right: Model's generated path showing an illegal adjacency jump (red arrow), bypassing multiple rooms and a locked door, despite only supporting facts being provided. This highlights a breakdown in multi-step planning.
<details>
<summary>figs/mistakev2.png Details</summary>

### Visual Description
## Diagram Comparison: Optimal Path vs. Model Path
### Overview
The image displays a side-by-side comparison of two pathfinding results on an identical grid-based environment. The left panel is titled "Optimal Path," and the right panel is titled "Model Path." Both diagrams visualize a path from a starting point to an endpoint, navigating through a network of nodes and edges. The comparison highlights the differences between a theoretically optimal route and the route generated by a computational model.
### Components/Axes
* **Grid Structure:** Both diagrams are built on a uniform grid of black circular nodes connected by thin, light-blue lines representing possible edges or pathways.
* **Start Point:** A black, downward-pointing triangle (▼) located in the lower-left quadrant of the grid.
* **End Point:** A black, upward-pointing triangle (▲) located in the lower-right quadrant of the grid.
* **Waypoints/Targets:** Two black circular nodes are highlighted with red rectangular outlines in both diagrams. One is in the upper-left quadrant, and the other is in the lower-left quadrant, near the start point.
* **Path Lines:**
* **Optimal Path (Left Panel):** A thick, solid yellow line traces the optimal route.
* **Model Path (Right Panel):** A thick, solid purple line traces the model-generated route.
* **Direct Line (Both Panels):** A dashed orange line connects the two red-boxed waypoints directly, serving as a reference for the shortest possible distance between them.
* **Legend/Labels:** The titles "Optimal Path" and "Model Path" are the primary labels. No explicit color legend is present, but the path colors (yellow vs. purple) are consistent with the panel titles.
### Detailed Analysis
**Optimal Path (Left Panel):**
* **Trajectory:** The yellow path starts at the lower waypoint (▼), moves directly upward to the upper waypoint (red box), then proceeds in a generally southeast direction with efficient, right-angle turns to reach the endpoint (▲).
* **Key Segment:** The path from the lower waypoint to the upper waypoint is a straight vertical line, matching the most direct route.
* **Efficiency:** The path appears to minimize distance, taking direct routes between nodes with minimal backtracking or unnecessary turns.
**Model Path (Right Panel):**
* **Trajectory:** The purple path starts at the lower waypoint (▼). Instead of going directly up, it first moves right, then up, then left, creating a detour before reaching the upper waypoint (red box). From there, it proceeds to the endpoint (▲) with a route that is similar but not identical to the optimal path.
* **Key Deviation:** The most significant difference is in the segment between the two red-boxed waypoints. The model's path (purple) takes a circuitous, three-segment route (right, up, left) to cover the vertical distance that the optimal path (yellow) covers in one straight segment. This is highlighted by the red rectangle.
* **Inefficiency:** The model's path contains clear inefficiencies, particularly the initial detour and some less direct routing in the latter half compared to the optimal path.
### Key Observations
1. **Path Divergence:** The primary divergence occurs at the very beginning of the route between the two highlighted waypoints. The model fails to identify the direct vertical connection.
2. **Structural Similarity:** After the initial deviation, the model's path follows a broadly similar topological structure to the optimal path, suggesting it understands the general goal but not the most efficient sequence of moves.
3. **Grid Utilization:** Both paths exclusively use the grid's nodes and edges, moving only horizontally or vertically between adjacent nodes.
4. **Visual Emphasis:** The red rectangles and the orange dashed "direct line" are deliberate visual cues drawing the viewer's attention to the specific segment where the model's performance is most suboptimal.
### Interpretation
This diagram is a diagnostic visualization, likely from a machine learning or robotics context, evaluating a pathfinding model's performance against a ground-truth optimal solution.
* **What it Demonstrates:** The model has learned the general task of navigating from a start to an end point via specified waypoints but has not learned the optimal policy. Its failure to take the direct vertical path suggests a potential flaw in its training, reward function, or state evaluation: it may be overvaluing certain movements or lack a precise understanding of the grid's connectivity.
* **Relationship Between Elements:** The side-by-side comparison is the core analytical tool. The identical grid and waypoints create a controlled experiment. The orange dashed line acts as a visual benchmark for the ideal connection between key points, making the model's deviation immediately apparent.
* **Anomalies and Implications:** The model's detour is a clear anomaly. In a real-world application (e.g., robot navigation, game AI, logistics), this inefficiency would translate to wasted time, energy, or resources. The diagram argues that while the model is functional, it is not yet optimal and requires refinement to match the efficiency of the known best solution. The investigation would focus on why the model "chose" the longer path: was it due to a local minimum in its planning algorithm, a bias in its training data, or an incomplete map of available edges?
</details>
Figure 9: Illustrative failure case of an "adjacency error" in model-generated pathfinding on a 20x20 task with 2 locked doors on the optimal path. The left panel displays the optimal path (yellow) to the target (triangle). The right panel shows a suboptimal path (purple) generated by the model. This example highlights a common error where, after a sequence of actions (in this scenario, following a key acquisition), the model fails to navigate through valid connections. Instead, it attempts to "jump" directly between two unconnected rooms. This violation of room adjacency constraints is a key challenge in model performance.
### B.4 Quantitative Analysis of Error Patterns
To understand how and when models begin to fail within a reasoning sequence, we analyze the distribution of the first violation step. We record the time step at which the initial violation occurs in a model's generated path. Aggregating this step-indexed data across multiple instances allows us to create temporal distributions of errors. These distributions help determine whether errors tend to cluster early in the reasoning process (potentially indicating issues with initial planning or understanding of the overall problem complexity) or accumulate later (suggesting difficulties in maintaining long chains of inference or context). This analysis complements the discussion in the main paper (Section 2.4, "Path-Length Dependent First Errors: The Burden of Anticipated Complexity").
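The aggregation just described can be sketched in a few lines, assuming each failed instance has already been reduced to a (logical depth, first-violation step) pair; the function and variable names are illustrative.

```python
# Sketch: group first-violation steps by logical depth L and normalize
# each group into an empirical distribution over failure steps.
from collections import defaultdict

def error_step_distributions(records):
    """records: iterable of (L, first_violation_step) -> {L: {step: frequency}}."""
    by_depth = defaultdict(list)
    for depth, step in records:
        by_depth[depth].append(step)
    dists = {}
    for depth, steps in by_depth.items():
        total = len(steps)
        hist = defaultdict(int)
        for s in steps:
            hist[s] += 1
        dists[depth] = {s: n / total for s, n in sorted(hist.items())}
    return dists

records = [(20, 5), (20, 5), (20, 12), (60, 3), (60, 7)]
print(error_step_distributions(records))
```

Plotting each per-depth distribution on a shared step axis yields figures of the kind shown in Figure 10.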
Figure 10 shows how the distribution of these first-error positions shifts with the overall problem complexity, represented by logical depth ( $L$ ). As detailed in the main paper, an increase in $L$ tends to cause errors to occur earlier in the reasoning chain.
<details>
<summary>figs/failure_step_dist_vs_L.png Details</summary>

### Visual Description
## [Series of Step Plots]: Solution Steps vs. Max Progress Step
### Overview
The image displays a vertical series of eight individual horizontal line plots. Each plot is labeled with a specific "Solution steps" count and contains a single vertical line marker. The plots share a common x-axis at the bottom, labeled "max progress step," which ranges from 0 to 300. The visualization demonstrates a direct, one-to-one correspondence between the labeled solution step count and the position of the vertical marker on the shared progress axis.
### Components/Axes
* **Plot Labels (Top-Left of each subplot):** Each of the eight subplots has a text label in the format "Solution steps: [Number]". The numbers are, from top to bottom: 20, 60, 100, 140, 180, 220, 260, 300.
* **X-Axis (Bottom of the entire figure):**
* **Label:** "max progress step"
* **Scale:** Linear, from 0 to 300.
* **Major Tick Marks & Numerical Labels:** 0, 50, 100, 150, 200, 250, 300.
* **Plot Content:** Each subplot consists of a horizontal baseline and a single, thin vertical line extending upward from it. There are no other data series, legends, or color codes.
### Detailed Analysis
Each subplot contains one data point, represented by a vertical line. The x-position of this line corresponds precisely to the number given in the subplot's label.
1. **Solution steps: 20:** Vertical line is positioned at approximately x = 20 on the shared axis.
2. **Solution steps: 60:** Vertical line is positioned at approximately x = 60.
3. **Solution steps: 100:** Vertical line is positioned at approximately x = 100.
4. **Solution steps: 140:** Vertical line is positioned at approximately x = 140.
5. **Solution steps: 180:** Vertical line is positioned at approximately x = 180.
6. **Solution steps: 220:** Vertical line is positioned at approximately x = 220.
7. **Solution steps: 260:** Vertical line is positioned at approximately x = 260.
8. **Solution steps: 300:** Vertical line is positioned at approximately x = 300.
**Trend Verification:** For every data series (each subplot), the visual trend is a single, static vertical line. The position of this line shifts rightward along the x-axis in direct proportion to the increasing "Solution steps" value in the label.
### Key Observations
* **Perfect Correlation:** The "max progress step" value is identical to the "Solution steps" count for every plotted instance.
* **Visual Simplicity:** The chart uses a minimal design with no extraneous elements, focusing solely on the relationship between the two named variables.
* **Consistent Scaling:** All subplots are aligned to the same x-axis scale, allowing for direct visual comparison of the marker positions.
* **No Variance:** Each plot shows only a single outcome; there is no distribution, error, or multiple trials depicted.
### Interpretation
This visualization is a direct graphical proof or demonstration of a linear, identity relationship. It suggests that in the context of the system or process being measured, the "max progress step" achieved is exactly equal to the number of "Solution steps" allocated or executed.
The chart likely serves to validate a model, algorithm, or process where progress is measured in discrete steps. The perfect alignment indicates no overhead, delay, or regression: the system utilizes every allocated step to make exactly one unit of progress. The absence of any deviation from this pattern across a wide range of step counts (from 20 to 300) strongly reinforces the reliability and predictability of this relationship. It answers the question: "If I run for N solution steps, what is the maximum progress I can expect?" The answer, as shown, is precisely N.
</details>
Figure 10: Distribution of first-violation steps for Gemini-2.5-Flash across varying logical depths ( $L$ ). As $L$ (total required path length) increases, the distribution of first errors tends to shift leftward, indicating that models are more likely to fail at earlier steps in longer problems. This suggests that anticipated global complexity impacts reasoning from the outset. Experimental parameters in this figure are the same as those in Figure 1.
Similarly, Figure 11 illustrates how the introduction of contextual noise (distracting facts) affects the point of failure. Increased noise also tends to precipitate earlier errors in the reasoning sequence, as discussed in the main paper in relation to sensitivity to noise (Section 2.3) and its impact on error patterns (Section 2.4).
<details>
<summary>figs/gemini-progress-ratio-vs-noise.png Details</summary>

### Visual Description
## Histogram Series: Progress Ratio Distribution by Noise Ratio
### Overview
The image displays a series of six vertically stacked histograms, each representing the distribution of a "progress ratio" for a different "Noise ratio" value. The visualization demonstrates how increasing noise affects the distribution of progress outcomes.
### Components/Axes
* **X-Axis (Common to all plots):** Labeled **"progress ratio"**. It is a continuous scale from **0.0 to 1.0**, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Y-Axis (Implied):** Each histogram has an implicit vertical axis representing frequency or count. The height of the bars indicates the relative number of observations at each progress ratio value. There are no numerical labels on the y-axes.
* **Plot Labels:** Each of the six histograms is labeled in its top-left corner with its corresponding **"Noise ratio"**:
* Noise ratio: 0.0 (top plot, dark grey bars)
* Noise ratio: 0.2 (second plot, dark grey bars)
* Noise ratio: 0.4 (third plot, brownish-grey bars)
* Noise ratio: 0.6 (fourth plot, light brown bars)
* Noise ratio: 0.8 (fifth plot, light salmon bars)
* Noise ratio: 1.0 (bottom plot, light pink bars)
* **Color Coding:** The bar color shifts from a dark grey for low noise ratios to a light pink for the highest noise ratio (1.0), providing a visual cue for the increasing noise level.
### Detailed Analysis
The histograms show a clear and systematic change in the distribution of the progress ratio as the noise ratio increases.
1. **Noise ratio: 0.0 & 0.2:**
* **Trend:** The distribution is heavily right-skewed, with a single, very tall, narrow peak located at or extremely close to a progress ratio of **1.0**.
* **Data Points:** The vast majority of observations are concentrated at 1.0. There are a few, very short bars scattered at lower progress ratios (approximately between 0.0 and 0.3), but their frequency is negligible compared to the peak at 1.0.
2. **Noise ratio: 0.4:**
* **Trend:** The dominant peak at **1.0** remains but is slightly shorter than in the previous plots. The frequency of observations at lower progress ratios (0.0 to 0.4) has visibly increased.
* **Data Points:** The distribution is still strongly right-skewed, but the "tail" of lower progress values is becoming more populated.
3. **Noise ratio: 0.6:**
* **Trend:** A significant shift occurs. The peak at **1.0** is now much shorter. The distribution has become more spread out, with a notable increase in the frequency of progress ratios across the entire lower range (0.0 to 0.6).
* **Data Points:** While a mode still exists near 1.0, the data is now distributed across a wide spectrum of lower values, indicating much higher variability in outcomes.
4. **Noise ratio: 0.8:**
* **Trend:** The peak at **1.0** is further diminished. The distribution appears more uniform or multi-modal across the lower half of the scale (0.0 to 0.5).
* **Data Points:** Observations are scattered across many progress ratio values with relatively similar, low frequencies. The concentration at high progress is largely gone.
5. **Noise ratio: 1.0:**
* **Trend:** The peak at **1.0** is the smallest of all plots. The distribution is the most dispersed, with bars of low but relatively consistent height spread from 0.0 to about 0.6.
* **Data Points:** There is no strong concentration at any single value. The data suggests that with maximum noise, achieving a high progress ratio becomes rare, and outcomes are highly unpredictable and generally low.
### Key Observations
* **Inverse Relationship:** There is a clear inverse relationship between the noise ratio and the concentration of data at a high progress ratio (1.0).
* **Peak Attenuation:** The height of the primary peak at progress ratio = 1.0 decreases monotonically as the noise ratio increases from 0.0 to 1.0.
* **Distribution Spread:** The variance (spread) of the progress ratio distribution increases dramatically with higher noise. Low noise yields precise, high outcomes; high noise yields scattered, generally lower outcomes.
* **Color Consistency:** The color of the bars in each subplot consistently matches the color of its corresponding "Noise ratio" label, confirming the grouping.
### Interpretation
This visualization powerfully demonstrates the detrimental effect of noise on a process's ability to achieve a target outcome (progress ratio of 1.0).
* **What the data suggests:** In a noise-free environment (ratio 0.0), the process is highly reliable, consistently achieving near-perfect progress. As noise is introduced, the process becomes less reliable. First, it occasionally fails to reach full progress (noise 0.2-0.4). Then, with moderate to high noise (0.6-0.8), successful outcomes become the exception rather than the rule. At maximum noise (1.0), the process is essentially randomized, with progress ratios scattered across the lower range and almost never reaching the target.
* **How elements relate:** The "Noise ratio" is the independent variable being manipulated. The "progress ratio" is the dependent variable being measured. The histograms show the causal relationship: increasing the former degrades the distribution of the latter.
* **Notable anomalies/trends:** The most striking trend is the **phase shift** between noise ratios 0.4 and 0.6, where the system transitions from being "mostly successful" to "mostly unsuccessful." This could indicate a critical threshold of noise beyond which system performance collapses. The absence of any data points above 1.0 or below 0.0 defines the bounded nature of the progress metric.
</details>
Figure 11: Impact of increasing noise ratio on the distribution of failure steps for Gemini 2.5 Flash. As noise (proportion of distracting facts) increases, failures tend to occur earlier in the reasoning chain. This reflects increased difficulty in isolating relevant information and maintaining focus. Fixed experimental parameters in this figure are the same as those in Figure 1.
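The histograms above plot a bounded progress ratio in [0, 1]. A minimal sketch of one plausible definition, assuming the first-violation step and the optimal solution length are known (the exact metric definition is given in the main paper's metrics section):

```python
# Sketch: progress ratio as the fraction of the optimal path completed
# before the first violation; a violation-free solution scores 1.0.
# This is one plausible reading of the metric, not the benchmark's exact code.

def progress_ratio(first_violation_step, optimal_len):
    """first_violation_step is None when no violation occurred."""
    if first_violation_step is None:
        return 1.0
    return first_violation_step / optimal_len

print(progress_ratio(None, 40))   # -> 1.0  (clean solution)
print(progress_ratio(10, 40))     # -> 0.25 (failed a quarter of the way in)
```

Under this reading, the leftward spread of the histograms with increasing noise directly reflects earlier first violations.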
## Appendix C Supplementary Figures
This appendix provides supplementary figures that offer further visual support for analyses presented in the main paper. These figures illustrate the impact of various complexity dimensions and provide comparative views of model performance, elaborating on points made throughout Section 2 (Benchmarking Results) of the main paper.
Figure 12 details the performance of Llama-4 Maverick-17B-128E-Instruct under varying levels of noise and fact shuffling. This supports the discussion in the main paper (Section 2.3) on how these factors, especially in combination, affect success rates, with noise being the dominant factor.
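The dashed reference curves in Figure 12 are exponential decays of the form exp(-x/L). A decay constant can be recovered from empirical success rates with a log-linear least-squares fit; the sketch below uses approximate values read off the no-noise curve and is purely illustrative.

```python
# Sketch: estimate L in success_rate ~ exp(-x/L) by fitting a line to
# ln(success_rate) versus x, as the log-scale panel suggests.
import math

def fit_decay_constant(xs, rates):
    """Least-squares slope of ln(rate) vs x; returns L = -1/slope."""
    logs = [math.log(r) for r in rates]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(logs) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, logs)) / \
            sum((x - mx) ** 2 for x in xs)
    return -1.0 / slope

# Approximate read-offs from the no-noise (blue) curve described below.
xs = [5, 15, 25, 35, 45, 55, 65]
rates = [0.98, 0.68, 0.42, 0.30, 0.19, 0.15, 0.10]
print(round(fit_decay_constant(xs, rates), 1))
```

On these rough read-offs the fit lands in the mid-20s, broadly consistent with the L = 24 reference curve overlaid on the no-noise data.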
<details>
<summary>figs/single_model_vs_steps_count_varied_noise_shuffle_Llama-4-Maverick-17B-128E-Instruct-FP8.png Details</summary>

### Visual Description
## Line Chart: Success Rate vs. Number of Actions Under Varying Noise and Shuffle Conditions
### Overview
The image displays two line charts side-by-side, presenting the same dataset with different y-axis scales. Both charts plot "success rate" against "number of actions" for four experimental conditions and two theoretical exponential decay models. The left chart uses a linear y-axis, while the right chart uses a logarithmic y-axis (base 10), which linearizes exponential decay trends.
### Components/Axes
* **X-Axis (Both Plots):** Label: "number of actions". Scale: Linear, ranging from approximately 5 to 70. Major tick marks are at 10, 20, 30, 40, 50, 60, 70.
* **Y-Axis (Left Plot):** Label: "success rate". Scale: Linear, ranging from 0.0 to 1.0. Major tick marks are at 0.2, 0.4, 0.6, 0.8, 1.0.
* **Y-Axis (Right Plot):** Label: "success rate". Scale: Logarithmic (base 10), ranging from 10⁻² (0.01) to 10⁰ (1.0). Major tick marks are at 10⁻², 10⁻¹, 10⁰.
* **Legend (Identical for both plots, positioned in the top-right quadrant):**
* **Blue line with circle markers:** `noise = 0, shuffle = 0`
* **Orange line with circle markers:** `noise = 0, shuffle = 0.5`
* **Green line with circle markers:** `noise = 0.2, shuffle = 0`
* **Red line with circle markers:** `noise = 0.2, shuffle = 0.5`
* **Purple dashed line:** `∝ exp(−x/L), L = 24`
* **Brown dashed line:** `∝ exp(−x/L), L = 14`
### Detailed Analysis
**Data Series Trends & Approximate Points:**
All series show a decaying trend: success rate decreases as the number of actions increases.
1. **Blue Line (noise=0, shuffle=0):**
* **Trend:** Highest success rate across all action counts. Decays the slowest.
* **Approximate Points (Left Plot):** (5, ~0.98), (15, ~0.68), (25, ~0.42), (35, ~0.30), (45, ~0.19), (55, ~0.15), (65, ~0.10).
2. **Orange Line (noise=0, shuffle=0.5):**
* **Trend:** Very closely follows the blue line, indicating shuffle=0.5 has minimal effect when noise=0.
* **Approximate Points (Left Plot):** (5, ~0.97), (15, ~0.67), (25, ~0.42), (35, ~0.28), (45, ~0.18), (55, ~0.14), (65, ~0.09).
3. **Green Line (noise=0.2, shuffle=0):**
* **Trend:** Significantly lower success rate than the no-noise conditions (blue/orange). Decays faster.
* **Approximate Points (Left Plot):** (5, ~0.90), (15, ~0.49), (25, ~0.32), (35, ~0.17), (45, ~0.10), (55, ~0.08), (65, ~0.06).
4. **Red Line (noise=0.2, shuffle=0.5):**
* **Trend:** Lowest success rate of all conditions. The combination of noise and shuffle degrades performance the most.
* **Approximate Points (Left Plot):** (5, ~0.88), (15, ~0.42), (25, ~0.24), (35, ~0.12), (45, ~0.07), (55, ~0.05), (65, ~0.04).
5. **Purple Dashed Line (Model: L=24):**
* **Trend:** Represents an exponential decay with characteristic length L=24. It closely fits the blue and orange (no noise) data series.
* **Approximate Points (Right Plot, Log Scale):** Starts at 1.0 for x=0. At x=24, value is ~0.37 (1/e). At x=48, value is ~0.14.
6. **Brown Dashed Line (Model: L=14):**
* **Trend:** Represents a faster exponential decay with L=14. It closely fits the red (noise=0.2, shuffle=0.5) data series.
* **Approximate Points (Right Plot, Log Scale):** Starts at 1.0 for x=0. At x=14, value is ~0.37. At x=28, value is ~0.14.
### Key Observations
1. **Dominant Effect of Noise:** The primary factor reducing success rate is the introduction of noise (noise=0.2). The green and red lines are substantially below the blue and orange lines.
2. **Minor Effect of Shuffle:** When noise is zero, shuffle=0.5 (orange) has negligible impact compared to shuffle=0 (blue). When noise is present, shuffle=0.5 (red) causes a further, noticeable degradation compared to shuffle=0 (green).
3. **Exponential Decay Fit:** The success rate decays exponentially with the number of actions. The no-noise conditions are well-modeled by a decay constant Lâ24, while the worst-case condition (noise+shuffle) is better modeled by Lâ14.
4. **Logarithmic Scale Clarity:** The right plot (log scale) makes the exponential nature of the decay visually apparent as straight lines and allows for clearer differentiation of the data points at higher action counts where values are small.
### Interpretation
This data demonstrates the robustness of a process (likely a sequential decision-making or planning algorithm) to perturbations. The "success rate" measures the probability of completing a task correctly within a given number of actions.
* **Under ideal conditions (no noise, no shuffle),** the process is highly reliable for short sequences but degrades predictably (exponentially) as the sequence length (number of actions) increases, with a characteristic decay length of about 24 actions.
* **Introducing noise (0.2) severely impacts performance,** cutting the effective reliable sequence length roughly in half (Lâ14 for the worst case). This suggests the process is sensitive to inaccurate information or stochasticity in its environment or execution.
* **Shuffling (likely reordering of actions or sub-tasks) has a minimal standalone effect** but exacerbates the negative impact of noise. This implies that the process's structure or ordering is important for maintaining robustness when operating in noisy conditions.
* The exponential decay model provides a simple, quantitative way to compare the robustness of the system under different conditions via the parameter `L`. A higher `L` indicates greater robustness to increasing task length.
**In summary, the charts quantify how noise dramatically reduces the effective operational length of a sequential process, while shuffling acts as a secondary stressor that compounds noise's effect.**
</details>
Figure 12: Pass@1 success rate for Llama-4 Maverick-17B-128E-Instruct versus solution length ($L$) under different noise and shuffle ratios. Left: linear scale. Right: log-linear scale. Performance degrades with increased noise but is less affected by shuffle ratios. Fixed experimental parameters in this figure are the same as those in Figure 1.
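The characteristic decay length $L$ used for the dashed fits in Figure 12 can be recovered with a short log-linear regression. The sketch below is illustrative (function and variable names are our own, not from the benchmark code) and fixes the intercept at success(0) = 1:

```python
import math

def fit_decay_length(actions, success):
    """Fit success ~ exp(-x / L) by least squares on log(success),
    with the intercept fixed at success(0) = 1. Returns L."""
    pairs = [(x, math.log(y)) for x, y in zip(actions, success) if y > 0]
    # Regress log(y) = -x / L  =>  slope = sum(x * log y) / sum(x^2)
    slope = sum(x * ly for x, ly in pairs) / sum(x * x for x, _ in pairs)
    return -1.0 / slope

# Synthetic check: data drawn exactly from exp(-x/24) recovers L = 24.
xs = list(range(5, 70, 10))
ys = [math.exp(-x / 24.0) for x in xs]
print(round(fit_decay_length(xs, ys), 1))  # → 24.0
```

On a log-linear axis (right panel of Figure 12) this model is a straight line with slope $-1/L$, which is why the two dashed fits appear linear there.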
To illustrate the performance consistency and disparities across different models, as detailed in Section 2.6, Figure 13 presents scatter and density plots of mean progress ratios. These plots clearly demonstrate that model performance hierarchies are not strictly linear. They reveal âperformance inversionsââinstances, also noted in Section 2.6, where models with typically lower overall performance (e.g., lower average $L_{0}$ ) occasionally solve specific complex problems that models with higher average $L_{0}$ values fail on.
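The performance inversions described above can be counted directly from per-instance progress ratios. A minimal sketch (names and data are hypothetical), where an inversion means the nominally weaker model achieves a strictly higher progress ratio on an instance:

```python
def count_inversions(strong, weak, margin=0.0):
    """Count task instances where the nominally weaker model's progress
    ratio exceeds the stronger model's by more than `margin`."""
    return sum(1 for s, w in zip(strong, weak) if w - s > margin)

# Toy per-instance progress ratios for two models.
model_a = [0.9, 0.1, 0.5, 0.0]   # higher average L0
model_b = [0.2, 0.4, 0.5, 0.3]   # lower average L0
print(count_inversions(model_a, model_b))  # → 2
```

In the density plots of Figure 13, these inversions are exactly the mass lying above the dashed $y=x$ diagonal.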
<details>
<summary>figs/progress_vs_progress.png Details</summary>

### Visual Description
## [Contour Plot Grid]: AI Model Performance Comparison via Progress Ratio
### Overview
The image displays a 2x3 grid of contour plots (density heatmaps) comparing the performance of four different AI models. Each plot is a pairwise comparison, with the x-axis and y-axis both representing a "progress ratio" metric ranging from 0 to 1. The plots visualize the joint distribution of performance scores for two models across a set of tasks or benchmarks. A dashed diagonal line (y=x) in each plot serves as a reference for equal performance.
### Components/Axes
* **Plot Structure:** Six individual plots arranged in two rows and three columns.
* **Axes:** All plots share identical axes.
* **X-axis Label:** `progress ratio` (visible on bottom row plots).
* **Y-axis Label:** `progress ratio` (visible on leftmost column plots).
* **Axis Scale:** Linear scale from 0.0 to 1.0, with major tick marks at 0.2, 0.4, 0.6, and 0.8.
* **Titles:** Each plot has a title specifying the two models being compared in the format "x: [Model A] vs y: [Model B]".
* **Top Row (Left to Right):**
1. `x: DeepSeek-R1 vs y: gemini-2.0-flash`
2. `x: DeepSeek-R1 vs y: gemini-2.5-flash-preview-04-17`
3. `x: gemini-2.0-flash vs y: gemini-2.5-flash-preview-04-17`
* **Bottom Row (Left to Right):**
1. `x: DeepSeek-R1 vs y: Llama-4-Maverick-17B-128E-Instruct-FP8`
2. `x: gemini-2.0-flash vs y: Llama-4-Maverick-17B-128E-Instruct-FP8`
3. `x: gemini-2.5-flash-preview-04-17 vs y: Llama-4-Maverick-17B-128E-Instruct-FP8`
* **Visual Encoding:**
* **Color Gradient:** Represents data point density. The scale transitions from bright yellow (highest density) through green and teal to dark purple (lowest density). White areas indicate regions with no or negligible data points.
* **Diagonal Line:** A dashed gray line from (0,0) to (1,1) in each plot. Points above this line indicate the y-axis model has a higher progress ratio; points below indicate the x-axis model is higher.
* **Data Points:** Small, semi-transparent gray dots are scattered across the plots, representing individual data points underlying the density contours.
### Detailed Analysis
**Trend Verification & Spatial Grounding:**
1. **Top-Left (DeepSeek-R1 vs. gemini-2.0-flash):**
* **Trend:** The highest density (yellow) is concentrated in the bottom-left corner (progress ratios < 0.2 for both models). The density contours extend further along the x-axis (DeepSeek-R1) than the y-axis (gemini-2.0-flash). A significant white region exists in the upper-right quadrant.
* **Interpretation:** For most tasks, both models show low progress. However, when progress is made, DeepSeek-R1 tends to achieve a higher ratio than gemini-2.0-flash, as evidenced by the density mass lying predominantly below the diagonal line.
2. **Top-Middle (DeepSeek-R1 vs. gemini-2.5-flash-preview-04-17):**
* **Trend:** Similar high-density cluster at the origin. The white region is more pronounced and extends closer to the diagonal line compared to the first plot. The density contours show a more balanced spread around the diagonal in the mid-range (0.2-0.6).
* **Interpretation:** The performance gap between DeepSeek-R1 and gemini-2.5-flash-preview is narrower than with gemini-2.0-flash. There are more instances where both models achieve moderate to high progress ratios simultaneously.
3. **Top-Right (gemini-2.0-flash vs. gemini-2.5-flash-preview-04-17):**
* **Trend:** The density distribution is more symmetric around the diagonal line. The yellow core is at the origin, but the contours spread more evenly into the plot area. The white region is smaller and located in the top-right corner.
* **Interpretation:** These two Gemini models have highly correlated performance. Their progress ratios are similar across the majority of tasks, with gemini-2.5-flash-preview showing a slight edge in some regions (density slightly favors the area above the diagonal).
4. **Bottom-Left (DeepSeek-R1 vs. Llama-4-Maverick-17B-128E-Instruct-FP8):**
* **Trend:** The density is heavily skewed below the diagonal line. The yellow region is elongated along the x-axis. A large white area occupies the upper half of the plot.
* **Interpretation:** DeepSeek-R1 consistently outperforms Llama-4-Maverick on this metric. There are very few tasks where Llama-4-Maverick achieves a higher progress ratio than DeepSeek-R1.
5. **Bottom-Middle (gemini-2.0-flash vs. Llama-4-Maverick-17B-128E-Instruct-FP8):**
* **Trend:** Density is concentrated below the diagonal, but with a more substantial spread above it compared to the previous plot. The contours show a "ridge" of moderate density extending along the diagonal.
* **Interpretation:** gemini-2.0-flash generally performs better than Llama-4-Maverick, but the performance difference is less extreme than with DeepSeek-R1. There is a subset of tasks where their performance is comparable.
6. **Bottom-Right (gemini-2.5-flash-preview-04-17 vs. Llama-4-Maverick-17B-128E-Instruct-FP8):**
* **Trend:** The density distribution is the most balanced relative to the diagonal among the bottom row plots. While the core is at the origin, significant density exists both above and below the line in the 0.0-0.4 range.
* **Interpretation:** The performance of gemini-2.5-flash-preview and Llama-4-Maverick is the most competitive of the comparisons against Llama. Neither model shows a dominant advantage across all tasks.
### Key Observations
* **Universal Low-Progress Cluster:** All six plots show the highest data density (yellow) clustered near (0,0). This indicates that for a large portion of the evaluated tasks, all models struggle and achieve very low progress ratios.
* **Model Hierarchy:** A clear performance hierarchy emerges from the visual patterns:
1. **DeepSeek-R1** appears to be the strongest model, consistently lying above (outperforming) others.
2. **gemini-2.5-flash-preview-04-17** is the next strongest, showing balanced or slightly superior performance against gemini-2.0-flash and Llama-4-Maverick.
3. **gemini-2.0-flash** is generally outperformed by the above two.
4. **Llama-4-Maverick-17B-128E-Instruct-FP8** is the weakest in these comparisons, frequently lying below the diagonal.
* **Performance Correlation:** Models from the same family (the two Gemini models) show the most correlated performance (plot is most symmetric around the diagonal). Comparisons between different families (e.g., DeepSeek vs. Gemini, any vs. Llama) show more asymmetric distributions.
### Interpretation
This visualization provides a nuanced, task-level view of relative model capabilities beyond simple average scores. The "progress ratio" likely measures success or improvement on specific problems.
* **What the Data Suggests:** The data demonstrates that model superiority is not absolute but depends on the task distribution. While a clear ranking exists (DeepSeek-R1 > gemini-2.5-flash > gemini-2.0-flash > Llama-4-Maverick), there is significant overlap and task-specific variation. The large low-progress cluster for all models highlights a common set of challenging problems where current AI capabilities plateau.
* **Relationship Between Elements:** The pairwise plots collectively build a comparative landscape. By holding one model constant across a row (e.g., Llama-4-Maverick on the bottom), we can see how other models compare against a common baseline. The diagonal line is the critical reference, transforming a density plot into a direct comparison tool.
* **Notable Anomalies/Outliers:** The white regions are notable. They represent combinations of progress ratios that are rarely or never observed. For example, in the DeepSeek-R1 vs. Llama-4-Maverick plot, the large white area in the top-left (high y, low x) signifies that it's extremely rare for Llama-4-Maverick to significantly outperform DeepSeek-R1 on a task. The scattered gray dots represent these rare outlier events.
</details>
Figure 13: Scatter and density plots of progress ratios per task instance, comparing model pairs on identical pathfinding task instances to illustrate performance agreement and disparities. Notably, Gemini-2.5-Flash, for example, often succeeds on instances where other models achieve near-zero progress. Data from experiments in Figure 1 (main paper).
Figure 14 isolates the impact of shuffle ratio on model performance when other factors like noise are controlled. This visualization corresponds to the findings discussed in the main paper (Section 2.3, "Fact Ordering (Shuffle Ratio)") that simple reordering of facts has a minimal impact on the performance of the evaluated models under low-noise conditions.
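One way a fractional shuffle ratio can be realized is to reorder a randomly chosen fraction of the fact positions while leaving the rest in place. The sketch below is illustrative only (the benchmark's actual shuffling procedure may differ): `ratio = 0` preserves the original ordering and `ratio = 1` is a full shuffle.

```python
import random

def partial_shuffle(facts, ratio, seed=0):
    """Shuffle a `ratio` fraction of positions; the rest keep their order."""
    rng = random.Random(seed)
    indices = list(range(len(facts)))
    chosen = rng.sample(indices, round(ratio * len(facts)))
    values = [facts[i] for i in chosen]
    rng.shuffle(values)
    out = list(facts)
    for i, v in zip(sorted(chosen), values):
        out[i] = v
    return out

facts = [f"fact_{i}" for i in range(10)]
print(partial_shuffle(facts, 0.0) == facts)  # → True
```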
Figure 15 isolates the impact of adding more examples to the instruction prompt, showing a clear improvement once more than a single example is included, compared to using none or only one.
Figure 16, added in this revised version of the supplementary section, shows that even the most recent SOTA models released by OpenAI suffer from the same performance drop observed in the main paper.
<details>
<summary>figs/fig_vs_shuffle_fixed_L_keys2_noise0.2.png Details</summary>

### Visual Description
## Line Charts: AI Model Performance vs. Shuffle Ratio
### Overview
The image displays three horizontally aligned line charts comparing the performance and output characteristics of two AI models across varying "shuffle ratios." The charts share a common x-axis but measure different metrics on their respective y-axes. The models compared are identified in a legend located in the top-left corner of the first chart.
### Components/Axes
* **Legend (Top-Left of First Chart):**
* Blue line with circle markers: `(Llama-4-Maverick-17B-128E-Instruct-FP8)`
* Orange line with circle markers: `(gemini-2.5-flash-preview-04-17)`
* **Common X-Axis (All Charts):**
* Label: `shuffle ratio`
* Scale: Linear, from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Chart 1 (Left) Y-Axis:**
* Label: `mean progress ratio`
* Scale: Linear, from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Chart 2 (Center) Y-Axis:**
* Label: `mean success rate (Pass@1)`
* Scale: Linear, from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Chart 3 (Right) Y-Axis:**
* Label: `CoT tokens`
* Scale: Linear, from 400 to 1600, with major ticks at 400, 600, 800, 1000, 1200, 1400, 1600.
### Detailed Analysis
**Chart 1: Mean Progress Ratio**
* **Trend Verification:** The orange line (gemini) shows a slight upward trend from 0.0 to 0.4, then plateaus. The blue line (Llama) shows a very slight downward trend overall.
* **Data Points (Approximate):**
* **gemini-2.5-flash-preview-04-17 (Orange):** Starts at ~0.64 (0.0), rises to ~0.69 (0.4), then stabilizes around ~0.66 (0.6, 0.8, 1.0).
* **Llama-4-Maverick-17B-128E-Instruct-FP8 (Blue):** Starts at ~0.22 (0.0), dips to ~0.19 (0.2), and remains relatively flat between ~0.18 and ~0.20 for the remaining ratios.
**Chart 2: Mean Success Rate (Pass@1)**
* **Trend Verification:** The orange line (gemini) shows a gentle arc, peaking at 0.4. The blue line (Llama) is consistently near zero with minimal variation.
* **Data Points (Approximate):**
* **gemini-2.5-flash-preview-04-17 (Orange):** Starts at ~0.50 (0.0), peaks at ~0.56 (0.4), and ends at ~0.50 (1.0).
* **Llama-4-Maverick-17B-128E-Instruct-FP8 (Blue):** Remains very close to 0.0 across all shuffle ratios, with values estimated between 0.01 and 0.02.
**Chart 3: CoT Tokens**
* **Trend Verification:** The blue line (Llama) shows a gradual upward trend, peaking at 0.8. The orange line (gemini) is essentially flat.
* **Data Points (Approximate):**
* **Llama-4-Maverick-17B-128E-Instruct-FP8 (Blue):** Starts at ~1600 (0.0), increases to a peak of ~1680 (0.8), and ends at ~1630 (1.0).
* **gemini-2.5-flash-preview-04-17 (Orange):** Remains stable around ~350 tokens across all shuffle ratios.
### Key Observations
1. **Performance Disparity:** The gemini model consistently outperforms the Llama model on both `mean progress ratio` and `mean success rate` metrics by a significant margin.
2. **Output Length Disparity:** The Llama model generates substantially more Chain-of-Thought (CoT) tokens (approx. 4.5x more) than the gemini model, regardless of the shuffle ratio.
3. **Sensitivity to Shuffle Ratio:**
* The gemini model's performance (progress and success) shows a slight optimal point around a shuffle ratio of 0.4.
* The Llama model's metrics are largely insensitive to the shuffle ratio, showing only minor fluctuations.
* The Llama model's CoT token count shows a slight positive correlation with shuffle ratio, peaking at 0.8.
4. **Anomaly:** The Llama model exhibits a very low success rate (~0.01) despite a moderate progress ratio (~0.2). This suggests it may make partial progress on tasks but rarely completes them successfully under the tested conditions.
### Interpretation
This data suggests a fundamental trade-off or difference in operational strategy between the two models. The **gemini-2.5-flash-preview-04-17** model appears to be more efficient and effective for the task being measured: it achieves higher success and progress rates while using far fewer reasoning tokens. Its performance is subtly optimized at a mid-range shuffle ratio (0.4).
In contrast, the **Llama-4-Maverick-17B-128E-Instruct-FP8** model is less successful but more verbose, generating lengthy reasoning chains that do not translate into task completion. Its performance is largely unaffected by the shuffling of input data (as represented by the shuffle ratio), indicating a different, possibly less adaptive, processing mechanism for this specific task. The charts collectively highlight that higher token expenditure does not correlate with better performance in this comparison; in fact, the inverse is observed.
</details>
Figure 14: Impact of shuffle ratio on Pass@1 success rate. Varying the degree of mixing (shuffle) between supporting and distracting facts shows minimal impact on performance for Gemini 2.5 Flash and Llama-4 Maverick, suggesting robustness to fact order when noise is controlled. The generation and sampling of maze instances for these tasks follow the same methodology detailed for experiments in the main paper (Figures 3 and 4).
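Pass@1, as reported throughout these figures, is the per-task fraction of successful sampled completions, averaged over tasks. A minimal sketch (data layout is our own assumption, one boolean outcome per sampled completion):

```python
def pass_at_1(results):
    """Mean Pass@1 across tasks, where `results` maps each task to a
    list of boolean outcomes (one per sampled completion)."""
    per_task = [sum(r) / len(r) for r in results.values()]
    return sum(per_task) / len(per_task)

outcomes = {
    "task_a": [True, False, True, False],   # per-task Pass@1 = 0.5
    "task_b": [False, False, False, False], # per-task Pass@1 = 0.0
}
print(pass_at_1(outcomes))  # → 0.25
```

With a single sample per task this reduces to the plain empirical success fraction.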
<details>
<summary>figs/maze_ablation_analysis.png Details</summary>

### Visual Description
## Line Chart: Success Rate vs. Number of Actions for Llama-4-Maverick-17B-128E-Instruct-FP8
### Overview
This line chart illustrates the performance degradation of the "Llama-4-Maverick-17B-128E-Instruct-FP8" model as task complexity increases. It plots the "Success rate" against the "Number of actions" required, comparing five different prompting strategies. The overall trend shows a sharp, non-linear decline in success rate for all methods as the number of actions grows.
### Components/Axes
* **Chart Title (Top-Left):** "Llama-4-Maverick-17B-128E-Instruct-FP8"
* **X-Axis (Bottom):** Label: "Number of actions". Scale: Linear, with major tick marks at 0, 50, 100, 150, and 200.
* **Y-Axis (Left):** Label: "Success rate". Scale: Linear, with major tick marks at 0.0, 0.2, 0.4, and 0.6.
* **Legend (Top-Right):** Contains five entries, each with a unique color and marker symbol:
1. Green line with circle markers: `5_shots_and_guided_CoT`
2. Purple line with diamond markers: `3_shots_and_guided_CoT`
3. Orange line with upward-pointing triangle markers: `3_shot_unguided`
4. Red line with downward-pointing triangle markers: `1_shot_and_guided_CoT`
5. Blue line with square markers: `zero_shot_and_guided_CoT`
### Detailed Analysis
All five data series exhibit a similar trend: a steep, convex decline in success rate from a low number of actions (approximately 10-20) to around 50 actions, followed by a much shallower decline that asymptotically approaches zero.
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **5_shots_and_guided_CoT (Green, Circles):**
* **Trend:** Consistently the highest-performing method across all action counts. Starts highest, declines steeply, but maintains a lead.
* **Points:** At ~10 actions: ~0.68. At ~20 actions: ~0.45. At ~30 actions: ~0.26. At ~40 actions: ~0.18. At 50 actions: ~0.09. At 100 actions: ~0.04. At 150+ actions: ~0.00.
2. **3_shots_and_guided_CoT (Purple, Diamonds):**
* **Trend:** Second-highest performance initially, closely follows the green line but slightly below it.
* **Points:** At ~10 actions: ~0.67. At ~20 actions: ~0.46. At ~30 actions: ~0.25. At ~40 actions: ~0.14. At 50 actions: ~0.12. At 100 actions: ~0.01. At 150+ actions: ~0.00.
3. **3_shot_unguided (Orange, Up-Triangles):**
* **Trend:** Starts very high (near the top), but its decline is slightly steeper than the guided 3-shot method after ~20 actions.
* **Points:** At ~10 actions: ~0.69. At ~20 actions: ~0.44. At ~30 actions: ~0.24. At ~40 actions: ~0.13. At 50 actions: ~0.11. At 100 actions: ~0.00. At 150+ actions: ~0.00.
4. **1_shot_and_guided_CoT (Red, Down-Triangles):**
* **Trend:** Performance is consistently below the 3-shot and 5-shot methods. Its decline is parallel to the others.
* **Points:** At ~10 actions: ~0.59. At ~20 actions: ~0.38. At ~30 actions: ~0.20. At ~40 actions: ~0.14. At 50 actions: ~0.06. At 100 actions: ~0.01. At 150+ actions: ~0.00.
5. **zero_shot_and_guided_CoT (Blue, Squares):**
* **Trend:** The lowest-performing method at every measured point. Shows the most severe initial drop.
* **Points:** At ~10 actions: ~0.58. At ~20 actions: ~0.38. At ~30 actions: ~0.18. At ~40 actions: ~0.11. At 50 actions: ~0.05. At 100 actions: ~0.00. At 150+ actions: ~0.00.
### Key Observations
1. **Universal Degradation:** Success rate for all prompting strategies collapses as the number of actions increases beyond a very small number (~10-20).
2. **Performance Hierarchy:** A clear and consistent hierarchy exists: `5_shots_and_guided_CoT` > `3_shots_and_guided_CoT` ≈ `3_shot_unguided` > `1_shot_and_guided_CoT` > `zero_shot_and_guided_CoT`. More examples ("shots") generally correlate with better performance.
3. **Guidance vs. Unguided:** For the 3-shot case, the guided CoT (`3_shots_and_guided_CoT`) and unguided (`3_shot_unguided`) methods perform very similarly, with the guided version showing a slight advantage at higher action counts (e.g., at 50 actions).
4. **Convergence to Zero:** By 100 actions, all methods have a success rate at or very near zero. The lines converge and flatten along the x-axis from 100 to 200 actions.
5. **Steep Initial Drop:** The most significant performance loss occurs between approximately 10 and 50 actions, where success rates drop by 80-90% of their initial value.
### Interpretation
This chart demonstrates a fundamental limitation in the model's ability to maintain coherent, successful performance over extended sequential reasoning or multi-step tasks. The "Number of actions" likely represents steps in a plan, tool-use sequences, or reasoning chains.
* **What the data suggests:** The model's reliability is highly sensitive to task length. Even with advanced prompting techniques like few-shot examples and guided Chain-of-Thought (CoT), its capacity to execute long action sequences successfully diminishes rapidly. The benefit of adding more examples (shots) is clear but does not prevent the eventual collapse.
* **Relationship between elements:** The legend defines the independent variable (prompting strategy), while the axes show the relationship between task length (actions) and outcome (success). The tight clustering of lines indicates that while prompting strategy matters, the underlying model's constraint with long horizons is the dominant factor.
* **Notable anomalies/outliers:** There are no major outliers; the data follows a very smooth, predictable decay curve for all series. The most notable finding is the **lack of a plateau**âperformance does not stabilize at a low level but continues to degrade towards zero, indicating a complete failure mode for long action sequences.
* **Broader reading:** The uniform decay across all prompting strategies points to a limit in the model's effective planning horizon rather than a prompt-engineering deficiency. The steeply falling curves suggest that current LLMs, even capable ones, are not yet reliable agents for complex, long-horizon tasks without significant external scaffolding or error correction.
</details>
Figure 15: The impact of including different numbers of reference examples in the prompt as part of in-context learning. Increasing the number of examples leads to slight improvements in performance. The experimental parameters used here are the same as those in Figure 1.
<details>
<summary>figs/model_comparison_openai.png Details</summary>

### Visual Description
## Line Chart: Success Rate vs. Number of Actions for AI Models
### Overview
The image is a line chart comparing the performance of four different AI models. It plots the "Success rate" (y-axis) against the "Number of actions" (x-axis), showing how each model's performance degrades as the task complexity (number of actions) increases. All data series show a downward trend.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Number of actions"
* **Scale:** Linear, ranging from 0 to 300.
* **Major Ticks:** 0, 50, 100, 150, 200, 250, 300.
* **Y-Axis:**
* **Label:** "Success rate"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Major Ticks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Legend:** Located in the top-right corner of the chart area. It contains four entries, each with a colored line segment and a circular marker:
1. **Blue line with circle marker:** "GPT-5"
2. **Orange line with circle marker:** "OSS-120B"
3. **Green line with circle marker:** "OSS-20B"
4. **Red line with circle marker:** "Llama-4-Maverick"
* **Grid:** A light gray grid is present, aligned with the major ticks on both axes.
### Detailed Analysis
The chart displays four distinct data series, each representing a model's success rate at different action counts. The following data points are approximate, read from the chart's grid.
**1. GPT-5 (Blue Line)**
* **Trend:** The line slopes downward consistently from left to right, indicating a steady decrease in success rate as the number of actions increases. It maintains the highest success rate among all models at every data point.
* **Approximate Data Points:**
* 0 actions: ~1.00
* ~25 actions: ~0.99
* ~35 actions: ~0.96
* 50 actions: ~0.85
* 100 actions: ~0.62
* 140 actions: ~0.51
* 180 actions: ~0.25
* 220 actions: ~0.18
* 260 actions: ~0.17
* 300 actions: ~0.08
**2. OSS-120B (Orange Line)**
* **Trend:** The line slopes downward, starting very high but declining more steeply than GPT-5. It crosses below the 0.5 success rate mark between 50 and 100 actions.
* **Approximate Data Points:**
* 0 actions: ~0.97
* ~25 actions: ~0.95
* ~35 actions: ~0.91
* 50 actions: ~0.72
* 100 actions: ~0.23
* 140 actions: ~0.04
* 180 actions: ~0.02
* 220 actions: ~0.00 (appears to be at or near zero)
* 260 actions: ~0.00
* 300 actions: ~0.00
**3. OSS-20B (Green Line)**
* **Trend:** The line shows a very steep initial decline, dropping below a 0.5 success rate before 50 actions. It approaches zero success rate by 100 actions.
* **Approximate Data Points:**
* 0 actions: ~0.88
* ~20 actions: ~0.74
* ~35 actions: ~0.53
* 50 actions: ~0.31
* 100 actions: ~0.02
* 140 actions: ~0.00
* 180 actions: ~0.00
* 220 actions: ~0.00
* 260 actions: ~0.00
* 300 actions: ~0.00
**4. Llama-4-Maverick (Red Line)**
* **Trend:** The line exhibits the most severe and rapid decline. It starts at the lowest initial success rate and plummets to near-zero performance by 50 actions.
* **Approximate Data Points:**
* 0 actions: ~0.65
* ~20 actions: ~0.39
* ~35 actions: ~0.18
* 50 actions: ~0.05
* 100 actions: ~0.01
* 140 actions: ~0.00
* 180 actions: ~0.00
* 220 actions: ~0.00
* 260 actions: ~0.00
* 300 actions: ~0.00
### Key Observations
1. **Universal Negative Correlation:** All four models demonstrate a clear negative correlation between the number of actions and success rate. Performance universally degrades with increased task length/complexity.
2. **Performance Hierarchy:** A consistent performance hierarchy is maintained across the entire range: GPT-5 > OSS-120B > OSS-20B > Llama-4-Maverick.
3. **Divergence in Decay Rates:** The models differ significantly in how quickly their performance decays. GPT-5 has the most gradual slope, while Llama-4-Maverick has the steepest.
4. **Convergence to Zero:** Three of the four models (OSS-120B, OSS-20B, Llama-4-Maverick) reach a success rate at or near zero by 100-150 actions. GPT-5 is the only model that maintains a measurable, albeit low, success rate (≈0.08) at 300 actions.
5. **Initial Performance Gap:** There is a significant spread in initial success rates (at 0 actions), ranging from ~0.65 (Llama-4-Maverick) to ~1.00 (GPT-5).
### Interpretation
This chart likely illustrates the results of a benchmark evaluating AI models on sequential decision-making or multi-step reasoning tasks. The "Number of actions" represents the length or complexity of the task sequence required for completion.
* **What the data suggests:** The data strongly suggests that maintaining performance over long action sequences is a major challenge for current AI models. The ability to handle extended context or maintain coherence over many steps appears to be a key differentiator between models, with GPT-5 showing significantly greater robustness.
* **Relationship between elements:** The x-axis (complexity) is the independent variable causing the change in the y-axis (performance). The different colored lines represent different model architectures or sizes, isolating the variable of model capability. The steepness of each line is a direct visual measure of that model's "contextual robustness" or "planning horizon."
* **Notable patterns and anomalies:**
* The most striking pattern is the **exponential-like decay** for OSS-20B and Llama-4-Maverick, suggesting a critical failure point is reached relatively early in the action sequence.
* The **near-perfect initial performance** of GPT-5 and OSS-120B at 0 actions indicates they can solve the base task flawlessly, but the challenge lies entirely in scaling that success.
* The **plateauing of GPT-5's curve** between 220-260 actions (≈0.18 to ≈0.17) before a final drop is a minor anomaly that could indicate a subset of tasks solvable within that action range or a measurement artifact.
In summary, the chart provides a clear, quantitative comparison showing that while all models struggle with longer tasks, there is a substantial performance gap, with larger or more advanced models (like GPT-5) demonstrating a markedly superior ability to sustain performance as task complexity grows.
</details>
Figure 16: This figure shows that the recent closed (GPT-5) and open-source (OSS-20B/120B) models released by OpenAI also follow the same universal failure pattern highlighted in this paper. The data and experimental settings are the same as those used in Figure 1 of the main paper. Llama-4-Maverick, which also appears in Figure 1, is included as the benchmark reference.