# seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
**Authors**:
- M.R. Ramezanali⋆ (Salesforce AI, Palo Alto, CA 94301; mramezanali@salesforce.com)
- M. Vazifeh⋆ (Capital One, MIT, Cambridge, MA 02143; mvazifeh@mit.edu)
- P. Santi (MIT, Cambridge, MA 02143; psanti@mit.edu)

> ⋆ denotes equal contribution.
Abstract
We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over key complexity dimensions. seqBench allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, quantifying how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio between supporting and distracting facts about the environment. Our evaluations on state-of-the-art LLMs reveal a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, seqBench's fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in this paper alongside its generation methodology and evaluation metrics. We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed for future evolution to keep pace with advancing models, the seqBench datasets are publicly released to spur deeper scientific inquiry into LLM reasoning, aiming to establish a clearer understanding of their true potential and current boundaries for robust real-world application.
Large Language Models (LLMs) have shown remarkable performance (Vaswani et al., 2017; Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022; Du et al., 2021; Fedus et al., 2022; Zoph et al., 2022) on a wide range of tasks and benchmarks spanning diverse human-like capabilities; however, these successes can obscure fundamental limitations in sequential reasoning that still persist. Arguably, reasoning captures a purer form of intelligence, going beyond mere pattern matching or fact memorization, and is thus a critical capability to understand and enhance in AI systems. Recent studies show that state-of-the-art LLMs (OpenAI, 2025; Google DeepMind, 2025; Meta AI, 2025; Mistral AI, 2024; Anthropic, 2025) excel at complex benchmarks, yet stumble on simple commonsense inferences that are trivial for an adult human (Nezhurina et al., 2025; Han et al., 2024; Sharma, 2024; Berglund et al., 2024; Yang et al., 2019). Most existing benchmarks saturate quickly, leaving little room for fine-grained attribution studies that systematically probe LLM failure modes. Consequently, a robust understanding of why and under what circumstances these models fail, especially on problems requiring sequential reasoning, remains elusive.
This gap, we argue, stems from the lack of evaluation benchmarks allowing systematic, multi-dimensional control over key independent factors that influence a task's overall reasoning difficulty. Most benchmarks (Cobbe et al., 2021; Hendrycks et al., 2021; Srivastava et al., 2023; Weston et al., 2015; Clark et al., 2018; Dua et al., 2019; Rein et al., 2023), despite their evaluation merits, often do not support a systematic variation of crucial complexity dimensions. This makes it difficult to isolate the specific conditions under which reasoning in LLMs falters. For instance, discerning whether a failure is due to the length of the required reasoning chain, the necessity to revise intermediate conclusions, or the density of distracting information is often not quantitatively possible. While prompting strategies like chain-of-thought (CoT) and model scaling have boosted aggregate performance, they often obscure sharp performance cliffs that can emerge when these underlying complexity dimensions are varied independently (Wei et al., 2023; Kojima et al., 2022). Without such systematic control, disentangling inherent architectural limitations from those addressable via scaling (model size, data, or compute), fine-tuning, or prompting techniques is challenging. A fine-grained understanding of these performance boundaries is crucial for developing more robust and reliable reasoning systems.
To complement recent efforts (Sprague et al., 2024; Tyagi et al., 2024; Kuratov et al., 2024; Tang and Kejriwal, 2025; Mirzaee et al., 2021; Tikhonov, 2024; Mirzaee and Kordjamshidi, 2022; Shi et al., 2022) in evaluating reasoning, and to address the need for more controlled analysis, we introduce seqBench, a tunable benchmark designed explicitly to probe and analyze sequential reasoning capabilities in language models. The dataset comprises synthetic yet linguistically grounded pathfinding task configurations on two-dimensional grids. Solving each problem requires sequential inference over relevant and distracting structured facts. Each instance is automatically verifiable and parameterized by controllable factors that directly address the previously identified gaps: (1) logical depth (total number of actions in the ground-truth solution, reflecting the length of the reasoning chain); (2) backtracking count (number of locked-door detours on the optimal path, requiring revision of tentative solution paths); and (3) noise ratio (proportion of distracting vs. supporting facts, testing robustness to irrelevant information). Performance against these dimensions can be quantified with fine-grained metrics (e.g., via progress ratio as we define here). We observe that beyond a certain logical depth, Pass@1 success collapses to near zero for all models (see Figure 1). These features enable precise attribution studies of model failure modes, offering insights into the brittle boundaries of current LLM generalization.
[Figure 1 image (x1.png): Pass@1 success rate versus number of actions $L$ on linear and logarithmic scales, with per-model exponential fits $\propto \exp(-L/L_{0})$; fitted $L_{0}$ values range from 85.7 (Gemini 2.5 Flash) down to 1.6 (Llama-3.2-3B).]
Figure 1: Performance collapse of various models with increasing logical depth $L$ for a pathfinding task ($N=M=40$, $\mathcal{B}=2$ keys, noise ratio $\mathcal{N}=0.0$). Success rates (Pass@1) are shown on linear (top panel) and logarithmic (bottom panel) y-axes, averaged from 5 runs per problem across 40 problems per unit $L$-bin. All evaluations used temperature=1.0 and top-p=0.95 (Gemini-2.5-flash: 'auto' thinking). The displayed fits employ a Weighted Least Squares (WLS) method (Carroll and Ruppert, 2017) on log success rates; weights are derived from the inverse squared residuals of a preliminary Ordinary Least Squares (OLS) fit. (Figure 16 in the supplementary material shows that a similar pattern is observed in recently released OpenAI models.)
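For concreteness, the following is a minimal sketch of the two-stage fit described in the caption, under assumptions: binned success rates are given, zero-success bins are dropped before taking logarithms, and the `statsmodels` API is used. The exact binning and weighting details of the released analysis may differ.

```python
# Minimal sketch of the L0 fit: OLS on log success rates, then WLS with
# weights from the inverse squared OLS residuals (assumptions noted above).
import numpy as np
import statsmodels.api as sm

def fit_L0(L_bins: np.ndarray, success_rates: np.ndarray) -> float:
    """Fit P(L) ~ exp(-L / L0) on log success rates and return L0."""
    mask = success_rates > 0                    # log undefined for zero-success bins
    X = sm.add_constant(L_bins[mask].astype(float))
    y = np.log(success_rates[mask])

    ols = sm.OLS(y, X).fit()                    # preliminary OLS fit
    weights = 1.0 / np.maximum(ols.resid ** 2, 1e-8)
    wls = sm.WLS(y, X, weights=weights).fit()   # weighted refit

    slope = wls.params[1]                       # log P(L) ~ c - L / L0
    return -1.0 / slope
```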
Furthermore, the seqBench benchmark is built upon a scalable data generation framework, allowing it to evolve alongside increasingly capable models and to support both model training and evaluation. Through evaluations on popular LLMs, we reveal that top-performing LLMs exhibit steep, universal declines as any of the three complexity dimensions increases, while remaining comparatively robust to fact shuffling, despite the underlying logical structure being unchanged.
Contributions.
Our main contributions are:
1. seqBench: A Tunable Benchmark for Sequential Reasoning. We introduce an open-source framework for generating pathfinding tasks with fine-grained, orthogonal control over logical depth, backtracking steps, and noise ratio. We also evaluate secondary factors like fact ordering (shuffle ratio; see the supplementary material for details).
2. Comprehensive LLM Attribution Study. Using seqBench, we demonstrate the significant impact of these controlled complexities on LLM performance, revealing sharp performance cliffs in state-of-the-art models even when search complexity is minimal.
The seqBench dataset is publicly available at https://huggingface.co/datasets/emnlp-submission/seqBench under the CC BY 4.0 license to facilitate benchmarking.
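For illustration, a hedged sketch of how one might load the released dataset and filter instances by the annotated complexity dimensions; the split name and column names (`logical_depth`, `backtracking_count`, `noise_ratio`) are assumptions for this sketch, not the dataset's documented schema.

```python
# Hypothetical loading/filtering sketch; split and column names are assumptions.
from datasets import load_dataset

ds = load_dataset("emnlp-submission/seqBench", split="train")

# Select a fixed-complexity slice, e.g. logical depth 40-60, two backtracks, no noise.
subset = ds.filter(
    lambda ex: 40 <= ex["logical_depth"] <= 60
    and ex["backtracking_count"] == 2
    and ex["noise_ratio"] == 0.0
)
print(len(subset), "instances selected")
```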
[Figure 2 image (figs/llama4_deepdive.png): Llama-4 Maverick success rate versus number of actions with an exponential fit ($L_{0}=16.7$), plus mean precision, recall, and progress ratio versus number of actions.]
Figure 2: Top: the Llama-4 Maverick-17B-128E-Instruct model's performance (Pass@1 success rate) versus the number of actions in the ground-truth path of the pathfinding problems ($N=M=40$, $\mathcal{B}=2$ keys, noise ratio $\mathcal{N}=0.0$). The Pass@1 success rate across 5 runs per problem is averaged over problem instances sampled from action-count bins of width 1. Bottom: the mean progress ratio across all problems, together with mean precision and recall, highlighting the models' gradually increasing struggle to complete the path. Temperature is set to 1.0 and top-p to 0.95 in all runs.
1 Methods
1.1 Dataset Generation
The seqBench dataset consists of spatial pathfinding tasks. Task instance generation, detailed below (Algorithm 1; see Appendix A for details), is predicated on precise, independent control of the three key complexity dimensions introduced earlier: Logical Depth ($L$), Backtracking Count ($\mathcal{B}$), and Noise Ratio ($\mathcal{N}$). This allows the creation of instances with specific values for these parameters, enabling targeted studies of their impact on LLM reasoning.
Task instances are produced in a multi-stage process. Initially, the primary generation parameters are specified: maze dimensions ($N, M$), target backtracks ($\mathcal{B}_{\text{target}}$), and target noise ratio ($\mathcal{N}_{\text{target}}$). An acyclic maze graph ($M_{g}$) is formed on an $N \times M$ grid using Kruskal's algorithm (Kleinberg and Tardos, 2006). Our "Rewind Construction" method (Algorithm 1) then embeds $\mathcal{B}_{\text{target}}$ backtracking maneuvers by working backward from a goal to strategically place keys and locked doors, yielding the instance's actual backtracking count $\mathcal{B}$. Finally, a natural language fact list ($\mathcal{F}$) is derived from the maze, and distracting facts are added according to $\mathcal{N}_{\text{target}}$ to achieve the final noise ratio $\mathcal{N}$. The logical depth $L$ (optimal path length) emerges from these generative steps, influenced by $N$, $M$, $\mathcal{B}_{\text{target}}$, and construction stochasticity. While $L$ is not a direct input to the generation algorithm, the process is designed to yield a wide spectrum of logical depths. Each generated instance is then precisely annotated with its emergent $L$ value, alongside its effective $\mathcal{B}$ and $\mathcal{N}$ values. This annotation effectively makes $L$ a key, selectable parameter for users of the seqBench dataset, enabling them to choose or filter tasks by their desired logical depth. Our rewind construction method guarantees task solvability.

The full seqBench benchmark is constructed by systematically applying this instance generation process (detailed in Algorithm 1) across a wide range of initial parameters. This includes varied grid sizes (e.g., $N \in \{5, \dots, 50\}$, $M \approx N$) and target backtracks ($\mathcal{B}_{\text{target}} \in \{0, \dots, 7\}$), yielding a large and diverse data pool. For each $(N, M, \mathcal{B}_{\text{target}})$ configuration, multiple unique base mazes are generated, to which different noise ratios (e.g., $\mathcal{N}_{\text{target}} \in [0, 1]$) are subsequently applied.

It is important to note that the algorithm constrains backtracking complexity to a simple dependency chain: each locked door requires at most one backtracking step to pick up its corresponding key, without requiring the unlocking of additional doors along the optimal path. Combined with the uniform random placement of keys, this design ensures a well-balanced distribution of backtracking difficulty across the generated instances for each logical depth $L$. Nevertheless, the same backward-in-time construction can be extended to generate tasks with higher backtracking complexity, for example doors that require multiple keys, or intermediate doors that must be unlocked en route to other keys. Such extensions would introduce richer tree-structured dependency graphs and allow seqBench to probe model performance under more complex long-horizon reasoning regimes.

The creation of this comprehensive data pool was computationally efficient, requiring approximately an hour of computation on a standard laptop while using minimal memory. The publicly released benchmark comprises a substantial collection of these generated instances, each annotated with its specific emergent logical depth $L$, effective backtracking count $\mathcal{B}$, and noise ratio $\mathcal{N}$.
This rich annotation is key, enabling researchers to readily select or filter task subsets by these dimensions for targeted studies (e.g., as done for Figure 1, where instances were sampled into $L$-bins with other parameters fixed). For the experiments presented in this paper, specific subsets were drawn from this benchmark pool, often involving further filtering or parameter adjustments tailored to the objectives of each study; precise details for each experiment are provided in the relevant sections and figure captions. Full details on path derivation, fact compilation, and overall dataset generation parameters are provided in Appendix A.
Input: Grid $N \times M$, target backtracks $\mathcal{B}$
Output: Maze graph $M_{g}$, locked doors $\mathcal{D}_{L}$, key info $\mathcal{K}_{I}$, path skeleton $\Pi_{S}$

$M_{g} \leftarrow$ acyclic graph on grid (Kruskal's);
$x \leftarrow C_{goal} \leftarrow$ random goal cell in $M_{g}$;
$\mathcal{D}_{L}, \mathcal{K}_{I} \leftarrow \emptyset, \emptyset$; $b \leftarrow 0$;
$\Pi_{S} \leftarrow [(C_{goal}, \text{GOAL})]$;
while $b < \mathcal{B}$ do
  $c_{key} \leftarrow$ random cell in $M_{g}$ accessible from $x$ (path avoids $\mathcal{D}_{L}$ for this step);
  $\pi_{seg} \leftarrow$ unique path in $M_{g}$ from $x$ to $c_{key}$;
  if $\exists\, e \in \pi_{seg}$ such that $e \notin \mathcal{D}_{L}$ then
    $d \leftarrow$ randomly select such an edge $e$;
    $\mathcal{D}_{L} \leftarrow \mathcal{D}_{L} \cup \{d\}$;
    $K_{id} \leftarrow$ new unique key ID;
    $\mathcal{K}_{I}[K_{id}] \leftarrow \{\text{opens}: d, \text{loc}: c_{key}\}$;
    $\Pi_{S}$.prepend($(c_{key}, \text{PICKUP } K_{id})$, $(d, \text{UNLOCK } K_{id})$, $(\pi_{seg}, \text{MOVE})$);
    $x \leftarrow c_{key}$; $b \leftarrow b + 1$;
  else
    break;
  end if
end while
$\Pi_{S}$.prepend($(x, \text{START})$);
return $M_{g}, \mathcal{D}_{L}, \mathcal{K}_{I}, \Pi_{S}$;
Algorithm 1: Rewind Construction of Path Skeleton
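To make Algorithm 1 concrete, the following is a minimal Python sketch of the rewind construction, assuming a networkx spanning-tree maze over grid cells; the function and variable names are illustrative, and this is not the released seqBench implementation.

```python
# Minimal sketch of the rewind construction (Algorithm 1); names are illustrative.
import random
import networkx as nx

def rewind_construction(maze: nx.Graph, target_backtracks: int, seed: int = 0):
    """Walk backward from a random goal, placing keys and locked doors."""
    rng = random.Random(seed)
    goal = rng.choice(sorted(maze.nodes))
    x, b = goal, 0
    locked_doors, key_info = set(), {}
    skeleton = [(goal, "GOAL")]

    def norm(u, v):  # canonical tuple for an undirected edge
        return (u, v) if u <= v else (v, u)

    while b < target_backtracks:
        # Cells reachable from x without crossing already-locked doors.
        open_maze = maze.copy()
        open_maze.remove_edges_from(locked_doors)
        reachable = sorted(nx.node_connected_component(open_maze, x) - {x})
        if not reachable:
            break
        c_key = rng.choice(reachable)
        path = nx.shortest_path(open_maze, x, c_key)   # unique path in a tree
        segment = [norm(u, v) for u, v in zip(path, path[1:])]
        candidates = [e for e in segment if e not in locked_doors]
        if not candidates:
            break
        d = rng.choice(candidates)                     # this edge becomes a locked door
        locked_doors.add(d)
        key_id = f"key_{b}"
        key_info[key_id] = {"opens": d, "loc": c_key}
        # Prepend in the same order as Algorithm 1.
        skeleton = [(c_key, f"PICKUP {key_id}"),
                    (d, f"UNLOCK {key_id}"),
                    (path, "MOVE")] + skeleton
        x, b = c_key, b + 1

    skeleton = [(x, "START")] + skeleton
    return locked_doors, key_info, skeleton

# Example: a 5x5 grid maze built from a random spanning tree (Kruskal-style).
grid = nx.grid_2d_graph(5, 5)
for u, v in grid.edges:
    grid.edges[u, v]["weight"] = random.random()
maze = nx.minimum_spanning_tree(grid)                 # acyclic maze graph
doors, keys, path_skeleton = rewind_construction(maze, target_backtracks=2)
```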
1.2 Prompt Construction and Model Configuration
Our evaluation uses a standardized prompt template with four components: (i) task instructions and action schema, (ii) three few-shot examples of increasing complexity (simple navigation, single-key, and multi-key backtracking), (iii) optional reasoning guidance, and (iv) the problem's natural-language facts. All models are queried with temperature $T{=}1.0$, nucleus sampling $p{=}0.95$, and the maximum output-token limit allowed for each model. For each instance, we perform 5 independent runs to establish robust performance statistics. The complete prompt structure is shown in Figure 6 in Appendix B.
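The sketch below illustrates how the four components could be assembled into a single prompt string; the actual instruction wording, few-shot examples, and guidance text are those given in Figure 6 (Appendix B), so the strings here are placeholders only.

```python
# Schematic prompt assembly for Section 1.2; all string contents are placeholders.
def build_prompt(facts: list[str], include_guidance: bool = True) -> str:
    instructions = "TASK INSTRUCTIONS AND ACTION SCHEMA (see Figure 6)"
    few_shot = [
        "EXAMPLE 1: simple navigation (placeholder)",
        "EXAMPLE 2: single-key backtracking (placeholder)",
        "EXAMPLE 3: multi-key backtracking (placeholder)",
    ]
    guidance = "OPTIONAL REASONING GUIDANCE (placeholder)" if include_guidance else ""
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    parts = [instructions, *few_shot, guidance, "FACTS:\n" + fact_block]
    return "\n\n".join(p for p in parts if p)
```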
1.3 Evaluation Metrics
To analyze not just success but also how models fail, we employ several complementary metrics. Success Rate (Pass@1) measures the proportion of runs where the predicted action sequence exactly matches the ground truth. The Progress Ratio (Tyagi et al., 2024), calculated as $k/n$ (where $n$ is the total ground-truth actions and $k$ is the number correctly executed before the first error), pinpoints the breakdown position in reasoning. We also use Precision and Recall. Precision is the proportion of predicted actions that are correct, while Recall is the proportion of ground-truth actions that were correctly predicted. Low precision indicates hallucinated actions, while low recall signifies missed necessary actions. Additionally, we visualize error locations via a Violation Map. This multi-faceted approach reveals each model’s effective "reasoning horizon"—the maximum sequence length it can reliably traverse. Further details on all metrics and visualizations are provided in the supplementary material.
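The snippet below gives one possible operationalization of these metrics for a single run; the released evaluator may normalize or match actions differently (e.g., positionally rather than as multisets), so treat this as a minimal sketch rather than the reference implementation.

```python
# Hedged sketch of the Section 1.3 metrics for one run.
from collections import Counter

def evaluate_run(predicted: list[str], ground_truth: list[str]) -> dict:
    """Pass@1 (exact match), progress ratio, precision, and recall."""
    exact_match = predicted == ground_truth

    # Progress ratio k/n: ground-truth actions correctly executed before the first error.
    k = 0
    for pred, gold in zip(predicted, ground_truth):
        if pred != gold:
            break
        k += 1
    progress_ratio = k / len(ground_truth) if ground_truth else 0.0

    # Precision: fraction of predicted actions present in the ground truth;
    # Recall: fraction of ground-truth actions recovered by the prediction.
    overlap = sum((Counter(predicted) & Counter(ground_truth)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(ground_truth) if ground_truth else 0.0

    return {"pass@1": exact_match, "progress_ratio": progress_ratio,
            "precision": precision, "recall": recall}
```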
2 Benchmarking Results
[Figure 3 image (figs/fig_vs_backtracking_fixed_L_shuffle1.0_noise0.0.png): mean progress ratio, success rate, and output token count versus number of backtracking steps (0-5) for four models.]
Figure 3: Performance as a function of the number of required backtracking steps, operationalized via the number of locked doors with distributed keys along the optimal path. Holding all other complexity factors constant, all models exhibit a clear decline in both progress ratio and success rate as backtracking demands increase. We also report the corresponding rise in output token counts per model, highlighting the increased reasoning burden associated with longer dependency chains. Fixed experimental parameters are the same as in Figure 1 (for each point, 100 problems are sampled from $L \in [40, 60]$).
2.1 Evaluated Models
We evaluate a diverse set of transformer-based LLMs across different model families and parameter scales. Our analysis includes Gemini models (2.5-flash-preview, 2.0-flash), Meta’s Llama family (4-Maverick-17B, 3.3-70B, 3.2-3B), Google’s Gemma-2-27b, and Alibaba’s Qwen models (2.5-Coder-32B, 2.5-7B). [Note: GPT-5 was released during the preparation of this paper’s final version. Our analysis shows that this model exhibits the same performance degradation, as shown in Figure 16.] Access to some open-weight models and benchmarking infrastructure was facilitated by platforms such as Together AI (https://www.together.ai/) and Google AI Studio (https://aistudio.google.com/). Problem instances for varying logical depths ($L$) were generated by sampling 40 problems for each $L$, using a fixed maze size of $40 \times 40$ and 2 keys, unless otherwise specified for specific experiments (e.g., when varying the number of keys for backtracking analysis). All models were evaluated using the standardized prompt template (see Figure 6), the inference settings detailed in Section 1.2, and a common response parsing methodology. For each task instance, we perform 5 independent runs to establish robust performance statistics, primarily analyzing Pass@1 success rates.
2.2 Universal Performance Collapse with Increasing Logical Depth
A central finding of our study is the universal collapse in reasoning performance observed across all evaluated LLMs when confronted with tasks requiring increasing sequential inference steps. As illustrated in Figure 1, Pass@1 success rates exhibit a consistent and sharp exponential decay as the ground-truth path length ( $L$ ) increases. Performance rapidly approaches near-zero past a model-specific point in this decay. To quantify and compare this exponential decay, we fit an exponential decay curve $P(L)=\exp(-L/L_{0})$ to the success rates, deriving a characteristic path length $L_{0}$ . This $L_{0}$ value, representing the path length at which performance drops by a factor of $e^{-1}$ , serves as a robust metric for each model’s sequential reasoning horizon. Plotting success rates on a semi-logarithmic (log-y) scale against $L$ reveals an approximately linear decay trend across the evaluated regime. This log-linear relationship suggests that errors may accumulate with a degree of independence at each reasoning step, eventually overwhelming the model’s capacity for coherent inference. The observed $L_{0}$ values vary significantly, from 85.7 for Gemini-2.5-Flash down to 1.6 for Llama-3.2-3B (Figure 1), underscoring a fundamental bottleneck in current transformer architectures for extended multi-step reasoning.
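To make the independent-error intuition concrete: if each action were predicted correctly with a fixed probability $p$, independently of its position in the chain, then

$$P(L) = p^{L} = \exp\!\left(-\frac{L}{L_{0}}\right), \qquad L_{0} = -\frac{1}{\ln p},$$

so, for example, $L_{0} = 85.7$ corresponds to a per-action success probability of $p = e^{-1/85.7} \approx 0.988$, while $L_{0} = 1.6$ corresponds to $p \approx 0.54$. The path-length-dependent first errors discussed in Section 2.4 indicate where this simple model breaks down.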
2.3 Impact of Independently Controlled Complexity Dimensions
Beyond the universal impact of logical depth ( $L$ ) discussed in Section 2.2, our benchmark’s ability to independently vary key complexity dimensions allows for targeted analysis of their distinct impacts on LLM reasoning performance. We highlight the effects of noise, backtracking, and fact ordering, primarily focusing on Pass@1 success rates, mean progress ratios, and response token counts.
[Figure 4 image (figs/fig_vary_noise_fixed_L_keys2_shuffle1.0.png): mean progress ratio, Pass@1 success rate, and CoT token count versus noise ratio (0.0-1.0) for Llama-4 Maverick and Gemini 2.5 Flash.]
Figure 4: Performance as a function of contextual noise for the Gemini 2.5 Flash and Llama-4 Maverick-17B-128E-Instruct models. As noise increases through the inclusion of distracting or irrelevant facts, both models exhibit a clear and consistent decline in performance. Fixed experimental parameters are the same as in Figure 1 (for each point, 100 problems are sampled from $L \in [40, 60]$ with the number of keys equal to 2).
Impact of Backtracking Requirements.
Increasing the number of required backtracking steps—operationalized via key-door mechanisms—also leads to a clear and significant decline in Pass@1 success rates and mean progress ratios across all evaluated models as shown in Figure 3. Gemini 2.5 Flash-preview maintains the highest performance but still exhibits a notable drop as backtracking count increases from 0 to 5. This decline in reasoning accuracy is generally accompanied by an increase or sustained high level in the mean number of response tokens (Figure 3, right panel). For example, models like Llama-4 Maverick and Gemini 2.5 Flash-preview show a clear upward trend or maintain high token counts as backtracking complexity rises, reflecting the increased reasoning effort or path length articulated by the models when managing more complex sequential dependencies.
Sensitivity to Noise Ratio.
Model performance is highly sensitive to the noise ratio—the proportion of distracting versus supporting facts. As demonstrated in Figure 4 for Gemini 2.5 Flash and Llama-4 Maverick, increasing the proportion of irrelevant facts consistently and significantly degrades both Pass@1 success rates and mean progress ratios. For instance, Gemini 2.5 Flash’s Pass@1 success rate drops from over 0.7 at zero noise to approximately 0.2 at a noise ratio of 1.0. Llama-4 Maverick, starting with lower performance, also shows a consistent decline. Interestingly, for these two models, the number of CoT (output) tokens remains relatively stable despite the increasing noise and degrading performance (Figure 4, right panel), suggesting that models do not necessarily "work harder" (in terms of output length) when faced with more distractors, but their accuracy suffers.
Fact Ordering (Shuffle Ratio).
In contrast to the strong effects of noise and backtracking, the shuffle ratio (entropy of fact presentation order) within the prompt appears to play a secondary role when varied in isolation. Our experiments, exemplified by the performance of Gemini 2.5 Flash and Llama-4 Maverick (see Appendix C, Figure 14 for details), show that complete shuffling of facts (randomizing their presentation order without adding or removing any information) has a minimal impact on Pass@1 success rates and mean progress ratios. Output token counts also remain stable. This suggests a relative robustness to presentation order as long as all necessary information is present and distinguishable. However, as detailed in the supplementary material, when high noise and high shuffle co-occur, the combined effect can be more detrimental than either factor alone, though noise remains the dominant degrading factor.
2.4 Characterizing Key Failure Modes and Error Patterns
A Key Failure Mode: Omission of Critical Steps.
Beyond simply taking illegal shortcuts, detailed analysis reveals that LLMs often fail by omitting critical sub-goals necessary for task completion. Figure 2 (bottom panel) provides a quantitative view for Llama-4 Maverick (Meta AI, 2025), showing that while precision generally remains high (models infrequently hallucinate non-existent rooms or facts), recall and progress ratio plummet with increasing path length ($L$). This indicates that models predominantly fail by missing necessary actions or entire crucial sub-sequences. For a qualitative example, even capable models like Gemini-2.5-Flash can neglect essential detours, such as collecting a required key, thereby violating sequential dependencies and rendering the task unsolvable (illustrative examples are provided in Appendix B.4; see Figures 8 and 9). This pattern highlights a fundamental breakdown in robust multi-step planning and execution.
Path-Length Dependent First Errors: The Burden of Anticipated Complexity.
The propensity for models to make critical errors is not uniformly distributed across the reasoning process, nor is it solely a feature of late-stage reasoning fatigue. Examining the distribution of steps at which the first constraint violations occur reveals a counterintuitive pattern: as the total required path length ($L$) of a problem increases, models tend to fail more frequently even at the earliest steps of the reasoning chain. This leftward shift in the first-error distribution, also observed under increasing noise (Appendix B.4; Figures 10 and 11), contradicts a simple cumulative error model where each step carries a fixed, independent failure probability. Instead, an error at an early step (e.g., step 5) becomes substantially more likely when the model is attempting to solve an 80-step problem versus a 20-step problem. This suggests that the overall anticipated complexity of the full problem influences reasoning quality from the very outset, indicating a struggle with global planning or maintaining coherence over longer horizons, rather than just an accumulation of local errors. This phenomenon may help explain why prompting techniques that decompose long problems into smaller, manageable sub-problems often succeed.
2.5 Disparity: Information Retention vs. Reasoning Capacity
On seqBench tasks, the disparity between information retention and reasoning capacity is quantitatively striking. While modern LLMs boast million-token contexts, their effective sequential reasoning depth typically remains on the order of hundreds of actions (Figure 1). This functional limit, even at several hundred actions (e.g., 300 actions, each like (’move_to’, ’A12’) spanning 5-7 tokens, totaling 1.5k-2.1k tokens), still consumes a minute fraction of the nominal context. Consequently, the ratio of context capacity to reasoning tokens often spans from several hundred-fold (e.g., 500:1 for 300 actions consuming 2k tokens within a 1M-token context) to even higher values for models with shorter reasoning horizons or larger contexts. This striking gap suggests that while transformers can store and retrieve vast information, their ability to reliably chain it for coherent, multi-step inference appears surprisingly constrained.
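As a back-of-the-envelope restatement of this ratio (assuming roughly 6 tokens per action and a 1M-token context window):

$$\frac{\text{context capacity}}{\text{reasoning tokens}} \approx \frac{10^{6}\ \text{tokens}}{300\ \text{actions} \times 6\ \text{tokens/action}} \approx 5.6 \times 10^{2}.$$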
2.6 Challenging the Conventional Performance Hierarchy
While metrics like average $L_{0}$ provide a general ranking of model capabilities, our fine-grained analysis reveals instances that challenge a simple linear performance hierarchy. Scatter plots of progress ratios across different models on identical tasks (see Appendix C Figure 13) show intriguing cases where models with lower overall $L_{0}$ values (i.e., typically weaker models) occasionally solve specific complex problems perfectly, while models with higher average $L_{0}$ values fail on those same instances. These performance inversions suggest that sequential reasoning failures may not solely stem from insufficient scale (parameters or general training) but could also arise from more nuanced reasoning limitations.
3 Related Work
Recent advancements in benchmarks evaluating sequential reasoning capabilities of LLMs have illuminated various strengths and limitations across different dimensions of complexity. These benchmarks typically differ in how they isolate and quantify reasoning challenges, such as logical deduction, retrieval difficulty, combinatorial complexity, and sensitivity to irrelevant information. ZebraLogic (Lin et al., 2025), for instance, targets formal deductive inference through logic-grid puzzles framed as constraint-satisfaction problems (csp, 2008). While valuable for probing deduction, its core methodology leads to a search space that grows factorially with puzzle size (Sempolinski, 2009). This makes it challenging to disentangle intrinsic reasoning failures from the sheer combinatorial complexity of the search. As the ZebraLogic authors themselves acknowledge: “solving ZebraLogic puzzles for large instances may become intractable… the required number of reasoning tokens may increase exponentially with the size of the puzzle.” This inherent characteristic means that for larger puzzles, performance is primarily dictated by the manageability of the search space rather than the limits of sequential reasoning depth. GridPuzzle (Tyagi et al., 2024) complements this by providing a detailed error taxonomy for grid puzzles, focusing on what kinds of reasoning mistakes LLMs make. However, like ZebraLogic, it doesn’t offer independent control over key complexity dimensions such as logical depth, backtracking needs, or noise, separate from the puzzle’s inherent search complexity.
Other benchmarks conflate reasoning with different cognitive demands. BABILong (Kuratov et al., 2024) tests models on extremely long contexts (up to 50M tokens), primarily assessing the ability to retrieve "needles" (facts) from a "haystack" (distracting text that does not contribute to solving the task). While valuable for evaluating long-context processing, this design makes it hard to disentangle retrieval failures from reasoning breakdowns, as performance is often dictated by finding the relevant information rather than reasoning over it. MuSR (Sprague et al., 2024) embeds reasoning tasks within lengthy narratives (e.g., murder mysteries), mixing information extraction challenges with complex, domain-specific reasoning structures. This realism obscures which specific aspect—extraction or reasoning depth—causes model failures. Dyna-bAbI (Tamari et al., 2021) offers a dynamic framework for compositional generalization but focuses on qualitative combinations rather than systematically varying quantitative complexity metrics needed to find precise failure points.
Spatial reasoning benchmarks, while relevant, also target different aspects. GRASP (Tang and Kejriwal, 2025) assesses practical spatial planning efficiency (like obstacle avoidance) in 2D grids, a different skill than the abstract sequential reasoning seqBench isolates. SPARTQA (Mirzaee et al., 2021) focuses on specialized spatial relational complexity (transitivity, symmetry) using coupled dimensions, preventing independent analysis of factors like path length. SpaRTUN (Mirzaee and Kordjamshidi, 2022) uses synthetic data primarily for transfer learning in Spatial Question Answering (SQA), aiming to improve model performance rather than serve as a diagnostic tool with controllable complexity. Similarly, StepGame (Shi et al., 2022) demonstrates performance decay with more reasoning steps in SQA but lacks the fine-grained, orthogonal controls over distinct complexity factors provided by seqBench.
In contrast, seqBench takes a targeted diagnostic approach. By deliberately simplifying the spatial environment to minimize search complexity, it isolates sequential reasoning. Its core contribution lies in the independent, fine-grained control over (1) logical depth (the number of sequential actions required to solve the task), (2) backtracking count (the number of backtracking steps along the optimal path), and (3) noise ratio (the ratio of supporting to distracting facts). This orthogonal parameterization allows us to precisely pinpoint when and why sequential reasoning capabilities degrade, revealing fundamental performance cliffs even when search and retrieval demands are trivial. seqBench thus offers a complementary tool for understanding the specific limitations of sequential inference in LLMs.
4 Limitations
While seqBench offers precise control over key reasoning complexities, our study has limitations that open avenues for future research:
1. Generalizability and Task Design Fidelity: Our current findings are rooted in synthetic spatial pathfinding tasks. While this allows for controlled experimentation, future work must extend seqBench's methodology to more diverse reasoning domains (e.g., mathematical proofs) and incorporate greater linguistic diversity (e.g., ambiguity) to assess the broader applicability of the observed phenomena of performance collapse (quantified by $L_{0}$) and failure patterns. Moreover, this work did not investigate whether similar failure modes arise when the problem is also presented visually (e.g., as maze images). Multimodal capabilities could influence spatial reasoning outcomes, and we have already extended the benchmark by releasing maze image generation code alongside the HuggingFace dataset. This dataset can also be used to help train multimodal reasoning models.
2. Model Scope and Understanding Deeper Failure Dynamics: Our current evaluation, while covering diverse public models, should be expanded to a wider array of LLMs, including recent proprietary and newer open-source variants (e.g., GPT, Claude, DeepSeek series), to rigorously assess the universality of our findings on the characteristic length $L_{0}$ and failure patterns. Furthermore, while seqBench effectively characterizes how reasoning performance degrades with logical depth (i.e., by determining $L_{0}$), two complementary research thrusts are crucial for understanding why. First, systematic investigation is needed to disentangle how $L_{0}$ is influenced by factors such as model architecture, scale (parameters, training data, compute), fine-tuning strategies, and inference-time computation (e.g., chain-of-thought depth). Second, deeper analysis is required to explain the precise mechanisms underlying the observed exponential performance collapse characterized by $L_{0}$ and to account for other non-trivial error patterns, such as path-length dependent first errors. Additionally, the evaluation presented here does not consider how agentic systems capable of tool use perform as the reasoning complexity is tuned across various dimensions. Exploring such setups, where the LLM can externalize sub-problems, invoke tools, or backtrack programmatically, could provide valuable insights into whether the same exponential failure modes persist. In particular, one can define sequential problems where the degree of backtracking or sequential tool use can be systematically varied, and test whether similar performance drops emerge as the dependency chain grows. We highlight this as a promising direction for future research.
3. Impact of Prompting: Our current study employed standardized prompts and inference settings. A crucial next step is a robust sensitivity analysis to determine how the overall decay behavior is influenced by different prompting strategies (e.g., zero-shot vs. few-shot, decomposition techniques), varied decoding parameters (temperature, top-p), and interactive mechanisms such as self-verification or self-correction. Investigating the potential of these techniques to mitigate the observed sequential inference failures, particularly given seqBench's minimal search complexity, remains a key avenue for future research.
Addressing these points by leveraging frameworks like seqBench will be vital for developing LLMs with more robust and generalizable sequential reasoning capabilities, and for understanding their fundamental performance limits.
5 Conclusion
We introduced seqBench, a novel benchmark framework designed for the precise attribution of sequential reasoning failures in Large Language Models. seqBench's core strength lies in its fine-grained, independent control over fundamental complexity dimensions (most notably logical depth $L$, backtracking requirements, and noise ratio), its provision of automatically verifiable solutions, and its minimization of confounding factors such as search complexity. This design allows seqBench to isolate and rigorously evaluate the sequential inference capabilities of LLMs, enabling the automatic quantification of fine-grained performance metrics (such as the progress ratio) and providing a clear lens into mechanisms often obscured in other benchmarks. The framework's inherent scalability and open-source nature position it as a durable tool for assessing and driving progress in current and future generations of models, ultimately aiming to enhance their utility for complex, real-world problems that often span multiple domains.

Our comprehensive evaluations using seqBench reveal that reasoning accuracy consistently collapses exponentially with increasing logical depth across a diverse range of state-of-the-art LLMs. This collapse is characterized by a model-specific parameter $L_{0}$ (Section 2.2), indicating an inherent architectural bottleneck in maintaining coherent multi-step inference.

By offering this precise analysis, seqBench provides a valuable resource in line with the goal of advancing NLP's reach and fostering its responsible application in other fields. It encourages a shift beyond aggregate benchmark scores towards a more nuanced understanding of model capabilities, an essential step for rigorously assessing the true impact and potential risks of applying LLMs in new domains. The insights gleaned from seqBench can inform both NLP developers in building more robust models and experts in other disciplines in setting realistic expectations and co-designing NLP solutions that are genuinely fit for purpose. Targeted improvements, guided by such fundamental understanding, are key to enhancing the robustness of sequential reasoning, making LLMs more reliable partners in interdisciplinary endeavors.

Future work should leverage these insights to develop models that can overcome the observed performance cliffs and extend their effective reasoning horizons, thereby unlocking their transformative potential in diverse interdisciplinary applications, such as navigating complex scientific literature, supporting intricate legal analysis, or enabling robust multi-step planning in critical autonomous systems. Focusing on commonsense reasoning is paramount for NLP to achieve transformative societal impact, moving beyond incremental improvements to genuine breakthroughs.
References
- csp (2008) 2008. Rina Dechter, Constraint Processing, Morgan Kaufmann (2003), ISBN 1-55860-890-7; Francesca Rossi, Peter van Beek, and Toby Walsh, editors, Handbook of Constraint Programming, Elsevier (2006), ISBN 978-0-444-52726-4. Computer Science Review, 2:123–130.
- Anthropic (2025) Anthropic. 2025. Claude 3.7 sonnet. https://www.anthropic.com/news/claude-3-7-sonnet.
- Berglund et al. (2024) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2024. The reversal curse: Llms trained on "a is b" fail to learn "b is a". Preprint, arXiv:2309.12288.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Carroll and Ruppert (2017) Raymond J Carroll and David Ruppert. 2017. Transformation and weighting in regression. Chapman and Hall/CRC.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. Preprint, arXiv:1803.05457.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
- Du et al. (2021) Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, and 8 others. 2021. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Preprint, arXiv:1903.00161.
- Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.
- Google DeepMind (2025) Google DeepMind. 2025. Gemini 2.5 pro experimental. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/.
- Han et al. (2024) Pengrui Han, Peiyang Song, Haofei Yu, and Jiaxuan You. 2024. In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models. Preprint, arXiv:2409.15454.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. Training compute-optimal large language models. Preprint, arXiv:2203.15556.
- Kleinberg and Tardos (2006) Jon Kleinberg and Eva Tardos. 2006. Algorithm Design. Pearson/Addison-Wesley, Boston.
- Kojima et al. (2022) Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
- Kuratov et al. (2024) Yury Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. 2024. Babilong: Testing the limits of llms with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554.
- Lieber et al. (2021) Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. https://www.ai21.com/blog/jurassic-1-technical-details-and-evaluation. White Paper.
- Lin et al. (2025) Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. 2025. Zebralogic: On the scaling limits of llms for logical reasoning. Preprint, arXiv:2502.01100.
- Meta AI (2025) Meta AI. 2025. Llama 4: Open and efficient multimodal language models. https://github.com/meta-llama/llama-models.
- Mirzaee et al. (2021) Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjmashidi. 2021. Spartqa: : A textual question answering benchmark for spatial reasoning. Preprint, arXiv:2104.05832.
- Mirzaee and Kordjamshidi (2022) Roshanak Mirzaee and Parisa Kordjamshidi. 2022. Transfer learning with synthetic corpora for spatial role labeling and reasoning. Preprint, arXiv:2210.16952.
- Mistral AI (2024) Mistral AI. 2024. Mistral large 2. https://mistral.ai/news/mistral-large-2407.
- Nezhurina et al. (2025) Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. 2025. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. Preprint, arXiv:2406.02061.
- OpenAI (2025) OpenAI. 2025. OpenAI GPT-5, o3, and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, https://openai.com/index/introducing-gpt-5/. (The paper's supplementary material was revised after the GPT-5 release with a new figure showing that GPT-5 exhibits the same failure pattern observed in this paper.)
- Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Matthias Rauh, Po-Sen Huang, and 58 others. 2021. Scaling language models: Methods, analysis & insights from training Gopher. Preprint, arXiv:2112.11446.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. Preprint, arXiv:2311.12022.
- Sempolinski (2009) Peter Sempolinski. 2009. Automatic solutions of logic puzzles.
- Sharma (2024) Manasi Sharma. 2024. Exploring and improving the spatial reasoning abilities of large language models. In I Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation Models.
- Shi et al. (2022) Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. 2022. Stepgame: A new benchmark for robust multi-hop spatial reasoning in texts. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 11321–11329.
- Smith et al. (2022) Samuel Smith, Mostofa Patwary, Brian Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhenhao Liu, Shrimai Prabhumoye, Georgios Zerveas, Vikas Korthikanti, Eric Zhang, Rewon Child, Reza Yazdani Aminabadi, Jared Bernauer, Xia Song, Mohammad Shoeybi, Yuxin He, Michael Houston, Shishir Tiwary, and Bryan Catanzaro. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. Preprint, arXiv:2201.11990.
- Sprague et al. (2024) Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. 2024. Musr: Testing the limits of chain-of-thought with multistep soft reasoning. Preprint, arXiv:2310.16049.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, and 432 others. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Preprint, arXiv:2206.04615.
- Tamari et al. (2021) Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson Liu, Reut Tsarfaty, and Dafna Shahaf. 2021. Dyna-babi: unlocking babi’s potential with dynamic synthetic benchmarking. Preprint, arXiv:2112.00086.
- Tang and Kejriwal (2025) Zhisheng Tang and Mayank Kejriwal. 2025. Grasp: A grid-based benchmark for evaluating commonsense spatial reasoning. Preprint, arXiv:2407.01892.
- Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yi Du, Yanping Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Max Krikun, Dmitry Lepikhin, James Qin, and 38 others. 2022. Lamda: Language models for dialog applications. arXiv preprint; technical report, Google Research.
- Tikhonov (2024) Alexey Tikhonov. 2024. Plugh: A benchmark for spatial understanding and reasoning in large language models. Preprint, arXiv:2408.04648.
- Tyagi et al. (2024) Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, and Chitta Baral. 2024. Step-by-step reasoning to solve grid puzzles: Where do llms falter? Preprint, arXiv:2407.14790.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
- Weston et al. (2015) Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards ai-complete question answering: A set of prerequisite toy tasks. Preprint, arXiv:1502.05698.
- Yang et al. (2019) Kaiyu Yang, Olga Russakovsky, and Jia Deng. 2019. SpatialSense: An adversarially crowdsourced benchmark for spatial relation recognition. In International Conference on Computer Vision (ICCV).
- Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. St-moe: Designing stable and transferable sparse expert models. Preprint, arXiv:2202.08906.
Appendices
Appendix A Dataset Generation Details
The seqBench benchmark generates pathfinding tasks by systematically controlling several complexity dimensions. As described in Section 1 (main paper), Algorithm 1 is central to this process. This appendix provides further details on the generation phases, natural language encoding of tasks, and specific dataset parameters.
A.1 Generation Phases
The generation process, guided by Algorithm 1, involves three main phases:
1. Base Maze Construction: An initial $N \times M$ grid is populated, and an acyclic maze graph ($M_{g}$) is formed using Kruskal’s algorithm (Kleinberg and Tardos, 2006). This ensures a simply connected environment: with all doors open, a unique path exists between any two cells, and each opening between adjacent cells is a potential door location. The overall process results in maze instances like the one visualized in Figure 5.
2. Rewind Construction for Path Skeleton and Key/Door Placement: This phase implements the "Rewind Construction" (Algorithm 1 in the main paper). Starting from a randomly selected goal cell ($C_{goal}$), the algorithm works backward to define a solvable path skeleton ($\Pi_{S}$); a minimal code sketch of Phases 1 and 2 is given after this list. It iteratively:
   1. Selects a cell $c_{key}$ that would be a preceding point on a path towards the current cell $x$ (initially $C_{goal}$).
   2. Identifies the unique path segment $\pi_{seg}$ in $M_{g}$ from $x$ to $c_{key}$.
   3. Randomly selects an edge $d$ on this segment $\pi_{seg}$ to become a locked door; this edge $d$ is added to the set of locked doors $\mathcal{D}_{L}$.
   4. Conceptually places a new unique key $K_{id}$ at $c_{key}$ and stores its information (which door it opens, its location) in $\mathcal{K}_{I}$.
   5. Prepends the conceptual steps (moving along $\pi_{seg}$, unlocking door $d$ with $K_{id}$, picking up $K_{id}$ at $c_{key}$), in reverse logical order, to the path skeleton $\Pi_{S}$.
   6. Updates the current cell $x$ to $c_{key}$ and repeats until the target number of backtracks ($\mathcal{B}$) is achieved or no valid placements remain.
   This backward construction ensures solvability and controlled backtracking complexity. The final agent starting position is the cell $x$ at the end of this phase.
3. Fact Compilation and Noise Injection: Based on the final maze structure ($M_{g}, \mathcal{D}_{L}, \mathcal{K}_{I}$), a set of natural language facts $\mathcal{F}$ is compiled. This includes facts describing room connections, key locations, and door states. Distracting facts are then introduced based on the target noise ratio $\mathcal{N}$. These distractors might describe non-existent connections, spurious keys, or misleading adjacencies, chosen to be plausible yet incorrect.
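The following is a minimal Python sketch of Phases 1 and 2, not the reference implementation: it builds the spanning-tree maze with Kruskal's algorithm (via a simple union-find) and then applies the rewind placement of locked doors and keys. Helper names (`build_maze_tree`, `tree_path`, `rewind_construction`) are illustrative, and the key-cell selection here is deliberately unconstrained, whereas Algorithm 1 restricts it to cells preceding the current position.

```python
import random
from collections import defaultdict

def build_maze_tree(n, m, rng):
    """Phase 1: Kruskal's algorithm on an n x m grid -> spanning tree (acyclic maze graph)."""
    cells = [(r, c) for r in range(n) for c in range(m)]
    parent = {cell: cell for cell in cells}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    edges = [((r, c), (r + dr, c + dc)) for r, c in cells
             for dr, dc in ((0, 1), (1, 0)) if r + dr < n and c + dc < m]
    rng.shuffle(edges)                      # random order stands in for random weights
    tree = defaultdict(set)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                        # keep the edge only if it joins two components
            parent[ru] = rv
            tree[u].add(v)
            tree[v].add(u)
    return tree

def tree_path(tree, src, dst):
    """The unique path between two cells of the spanning tree (iterative DFS)."""
    stack, seen = [(src, [src])], {src}
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    raise ValueError("cells are not connected")

def rewind_construction(tree, goal, backtracks, rng):
    """Phase 2: work backward from the goal, placing one locked door and key per backtrack."""
    locked_doors, key_info, skeleton = set(), {}, []
    current = goal
    for key_id in range(1, backtracks + 1):
        key_cell = rng.choice([c for c in tree if c != current])  # unconstrained here
        segment = tree_path(tree, current, key_cell)
        if len(segment) < 2:
            continue
        i = rng.randrange(len(segment) - 1)
        door = (segment[i], segment[i + 1])        # an edge on the segment becomes a locked door
        locked_doors.add(door)
        key_info[key_id] = {"opens": door, "located_at": key_cell}
        # Prepend this leg (in reverse logical order) to the path skeleton.
        skeleton.insert(0, {"from": key_cell, "to": current, "unlock": door, "with_key": key_id})
        current = key_cell                          # the agent must now start even earlier
    return current, locked_doors, key_info, skeleton

rng = random.Random(0)
maze = build_maze_tree(6, 6, rng)
start, doors, keys, plan = rewind_construction(maze, goal=(5, 5), backtracks=2, rng=rng)
```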
![Figure 5](figs/compath_viz.png)
Figure 5: Example visualization of a $6 \times 6$ seqBench maze instance. Red rectangles denote locked doors, dashed lines indicate the locations of keys corresponding to those doors, and triangles mark the start (upward-pointing) and goal (downward-pointing) positions. This illustrates the spatial nature of the tasks.
A.2 Natural Language Encoding
Each task instance is translated into a set of atomic natural language facts. We use a consistent templating approach:
- Room Connections: "Room A1 and B1 are connected by an open door."
- Locked Connections: "Room C3 and D3 are connected by a closed and locked door."
- Key Requirements: "The locked door between C3 and D3 requires key 5." (Key IDs are simple integers).
- Key Placements: "Key 5 is in room E4." (Room IDs use spreadsheet-like notation, e.g., A1, B2).
- Starting Position: "Bob is in room A2."
- Goal Position: "Alice is in room D5."
The full set of facts for a given problem constitutes its description.
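A small sketch of this templating, together with a simplified noise-injection step at a target ratio $\mathcal{N}$, is given below. Function names are illustrative, and the distractor generator is a naive stand-in for the plausibility-aware sampling described in Appendix A.1.

```python
import random

def connection_facts(a, b, key_id=None):
    """Render the door templates above; a locked connection yields two sentences."""
    if key_id is None:
        return [f"Room {a} and {b} are connected by an open door."]
    return [f"Room {a} and {b} are connected by a closed and locked door.",
            f"The locked door between {a} and {b} requires key {key_id}."]

def compile_facts(open_doors, locked_doors, key_locations, start, goal):
    facts = [s for a, b in open_doors for s in connection_facts(a, b)]
    facts += [s for (a, b), k in locked_doors.items() for s in connection_facts(a, b, k)]
    facts += [f"Key {k} is in room {room}." for k, room in key_locations.items()]
    facts += [f"Bob is in room {start}.", f"Alice is in room {goal}."]
    return facts

def inject_noise(facts, noise_ratio, rooms, rng):
    """Append distractors until #distracting / #supporting reaches the target noise ratio."""
    distractors = []
    while len(distractors) < int(round(noise_ratio * len(facts))):
        a, b = rng.sample(rooms, 2)                 # naive: a real generator filters for plausibility
        distractors.extend(connection_facts(a, b))  # e.g., a spurious open connection
    return facts + distractors

rng = random.Random(1)
facts = compile_facts(open_doors=[("A1", "A2"), ("A2", "B2")],
                      locked_doors={("C1", "C2"): 1},
                      key_locations={1: "A2"}, start="A1", goal="C2")
noisy = inject_noise(facts, noise_ratio=0.2, rooms=["A1", "A2", "B1", "B2", "C1", "C2"], rng=rng)
```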
A.3 Dataset Parameters and Scope
The seqBench dataset was generated using the following parameter ranges from the generation configuration (an illustrative enumeration of the resulting sweep follows this list):
- Grid Sizes ($N \times M$): $N$ and $M$ range from 5 to 50 (e.g., $[5,5], [6,6], \ldots, [50,50]$), with $M = N$ for all configurations.
- Target Backtracking Steps ($\mathcal{B}$): Values from 0 to 7. This controls the number of key-door mechanisms deliberately placed on the optimal path.
- Noise Ratio ($\mathcal{N}$): Values from $0.0$ (no distracting facts) to $1.0$ (equal numbers of supporting and distracting facts), typically in increments of $0.2$.
- Instances per Configuration: For each primary configuration, defined by a specific grid size ($N, M$) and a specific target backtracking step count ($\mathcal{B} \in \{0, \ldots, 7\}$), 400 unique base maze instances were generated.
- Logical Depth ($L$): $L$ is an emergent property of each instance rather than a directly set parameter. Experiments typically select problems from these generated instances that fall into specific $L$ bins (e.g., $L \in [10,11), [11,12), \ldots$).
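The sketch below enumerates the primary configurations implied by these ranges; variable names are illustrative, and the printed totals refer to generated base instances before curation, not to the released dataset.

```python
from itertools import product

grid_sizes = [(n, n) for n in range(5, 51)]      # N = M from 5 to 50
backtracks = range(0, 8)                         # target B in {0, ..., 7}
noise_ratios = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]    # typical noise-ratio increments
instances_per_config = 400                       # base mazes per (grid size, B) pair

configs = list(product(grid_sizes, backtracks))  # primary configurations (noise applied downstream)
print(len(configs), "primary configurations,",
      len(configs) * instances_per_config, "base maze instances before curation")
```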
This generation pipeline, leveraging the described parameter ranges and variations, can produce a vast and diverse set of problem instances. The publicly released seqBench dataset, used for the analyses in this paper (see main paper for access link), comprises 7,079 such curated instances. This collection offers a rich resource for studying the combined effects of the controlled complexity dimensions.
Appendix B Prompt Design and Model Configuration Details
This appendix provides the complete details of the prompt structure and model configurations used for evaluating LLMs on the seqBench benchmark. The overall prompt, illustrated in Figure 6, concatenates four main components which are detailed below.
![Figure 6](figs/prompt_template.png)
Figure 6: The complete prompt structure passed to the LLMs. This includes: Component 1 (System Instructions and Task Definition), one of the three Few-Shot Examples (Component 2, specifically a simple navigation task), Component 3 (Reasoning Guidance), and an illustration of where the Problem Instance Facts (Component 4) are inserted. For clarity and completeness, the full verbatim text for all three few-shot examples (Component 2) is provided in Figure 7.
B.1 Overall Prompt Components
The prompt presented to the LLMs consists of the following components:
1. System Instructions and Task Definition (Component 1): Outlines the agent’s task, the structure of the maze description, valid actions and their syntax, key operational constraints, and the required output format.
2. Few-Shot Examples (Component 2): Three examples are provided to illustrate the task, ranging in complexity. One of these examples (a simple navigation task) is detailed in Figure 6. The verbatim text for all three examples is provided in Figure 7 for completeness.
3. Reasoning Guidance and Self-Assessment (Component 3): Offers step-by-step algorithmic tips for solving the task and requests the model to provide a self-assessment of its confidence and the perceived difficulty of the instance.
4. Problem Instance Facts (Component 4): The specific natural language facts describing the current maze configuration for the task instance. As illustrated in Figure 6, these facts are appended after the preceding components and are followed by the line "YOUR SOLUTION:" to prompt the model (a minimal assembly sketch appears after Figure 7 below). These facts are generated using the templates described in Appendix A.
1. Example 1 (Simple Navigation): This example, as shown in Figure 6, involves navigating a maze with only open doors.
EXAMPLE:
INPUT:
Maze Structure: Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by an open door. Room D5 and E5 are connected by an open door. Room A2 and A1 are connected by an open door. Room A3 and B3 are connected by an open door. Room A1 and B1 are connected by an open door. Room A4 and A3 are connected by an open door. Room E5 and E4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Room D4 and E4 are connected by an open door. Bob is in room D5. Alice is in room C4.
OUTPUT:
Solution: [('start', 'D5'), ('move_to', 'E5'), ('move_to', 'E4'), ('move_to', 'D4'), ('move_to', 'D3'), ('move_to', 'C3'), ('move_to', 'C4'), ('rescue', 'Alice')]
2. Example 2 (Single-Key Backtracking): This example introduces a single locked door and a corresponding key.
EXAMPLE:
INPUT:
Maze Structure: Room A1 and A2 are connected by an open door. Room A2 and B2 are connected by an open door. Room B1 and B2 are connected by an open door. Room B1 and C1 are connected by an open door. Room C1 and C2 are connected by a closed and locked door. Door between C1 and C2 requires key 1. Key 1 is in room A2. Bob is in room A1. Alice is in room C2.
OUTPUT:
Solution: [('start', 'A1'), ('move_to', 'A2'), ('pick_up_key', '1'), ('move_to', 'B2'), ('move_to', 'B1'), ('move_to', 'C1'), ('use_key', '1'), ('unlock_and_open_door_to', 'C2'), ('move_to', 'C2'), ('rescue', 'Alice')]
3. Example 3 (Multi-Key Backtracking): This example presents a more complex scenario with multiple locked doors and keys, requiring more extensive backtracking.
EXAMPLE:
INPUT:
Maze Structure: Room B5 and B4 are connected by a closed and locked door. The locked door between B5 and B4 requires key 3. Key 3 is in room B5. Room B5 and C5 are connected by a closed and locked door. The locked door between B5 and C5 requires key 16. Key 16 is in room C5. Room B4 and C4 are connected by an open door. Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by a closed and locked door. The locked door between C3 and D3 requires key 10. Key 10 is in room C4. Room D5 and D4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Bob is in room C5. Alice is in room D5.
OUTPUT:
Solution: [('start', 'C5'), ('pick_up_key', '16'), ('use_key', '16'), ('unlock_and_open_door_to', 'B5'), ('move_to', 'B5'), ('pick_up_key', '3'), ('use_key', '3'), ('unlock_and_open_door_to', 'B4'), ('move_to', 'B4'), ('move_to', 'C4'), ('pick_up_key', '10'), ('move_to', 'C3'), ('use_key', '10'), ('unlock_and_open_door_to', 'D3'), ('move_to', 'D3'), ('move_to', 'D4'), ('move_to', 'D5'), ('rescue', 'Alice')]
Figure 7: Few-shot examples provided to guide the LLMs in the maze-solving task, demonstrating simple navigation, single-key backtracking, and multi-key backtracking scenarios of increasing complexity.
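As a concrete illustration of how the four components of Section B.1 are combined, the following is a minimal assembly sketch. The component texts are placeholders supplied by the caller; only the trailing "YOUR SOLUTION:" cue is taken from the description above.

```python
def build_prompt(system_instructions, few_shot_examples, reasoning_guidance, instance_facts):
    """Concatenate the four prompt components for one task instance (illustrative layout)."""
    parts = [
        system_instructions,               # Component 1: task definition, actions, constraints
        "\n\n".join(few_shot_examples),    # Component 2: the three examples of Figure 7
        reasoning_guidance,                # Component 3: step-by-step tips and self-assessment request
        " ".join(instance_facts),          # Component 4: natural language maze facts
        "YOUR SOLUTION:",                  # cue for the model to emit its action list
    ]
    return "\n\n".join(parts)
```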
B.2 Evaluation Metrics and Error Analysis Details
This section provides further details on specific aspects of our evaluation metrics and observed error categories, complementing the overview of metrics in Section 1 of the main paper and the discussion of failure modes in Section 2 of the main paper.
Observed Violation Categories.
Failures in model solutions on seqBench tasks can be categorized into several types. Understanding these categories is crucial for interpreting model performance and failure modes. Key types of violations observed include:
- Adjacency errors (e.g., attempting to move between unconnected rooms).
- Locked door errors (e.g., navigating through locked doors without the correct key or without unlocking them).
- Key usage errors (e.g., attempting to use keys not yet collected, or using the wrong key for a door).
- Path inefficiency (e.g., taking unnecessary detours or redundant actions; while not always a hard violation that stops progress, this contributes to solutions not matching the optimal path and thus failing Pass@1).
- Missed critical actions (e.g., failing to pick up a necessary key or unlock a required door). This is a key failure mode discussed in the main paper (Section 2.4) and is often reflected in metrics like low recall or a low progress ratio if the omission occurs early and prevents further correct steps.
Identifying these distinct categories of errors provides a more granular understanding of why models fail on sequential reasoning tasks and helps in the interpretation of aggregate performance metrics reported in the main paper.
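To make these categories concrete, below is a minimal checker sketch (not the evaluation harness used in the paper) that replays a generated action list against the ground-truth maze structure and reports the first hard violation together with its category and step index. The function name and data layouts are illustrative; Pass@1 and path-optimality checks are handled separately.

```python
def first_violation(actions, adjacency, locked_doors, key_locations):
    """actions: list of (verb, arg) tuples in the prompt's output format.
    adjacency: set of frozenset({room_a, room_b}) pairs connected by any door.
    locked_doors: dict mapping frozenset pair -> required key id (initially locked).
    key_locations: dict mapping key id -> room where that key lies."""
    locked = dict(locked_doors)          # doors not yet unlocked
    held, used = set(), set()            # keys in hand / already consumed
    room = None
    for step, (verb, arg) in enumerate(actions):
        if verb == "start":
            room = arg
        elif verb == "move_to":
            pair = frozenset({room, arg})
            if pair not in adjacency:
                return step, "adjacency error"
            if pair in locked:
                return step, "locked door error"
            room = arg
        elif verb == "pick_up_key":
            if key_locations.get(arg) != room:
                return step, "key usage error"    # key is not in the current room
            held.add(arg)
        elif verb == "use_key":
            if arg not in held or arg in used:
                return step, "key usage error"    # not collected, or already spent
            used.add(arg)
        elif verb == "unlock_and_open_door_to":
            required = locked.get(frozenset({room, arg}))
            if required is None or required not in used:
                return step, "locked door error"  # door not locked here, or key not used first
            del locked[frozenset({room, arg})]    # door is now open
        elif verb == "rescue":
            pass                                  # goal check handled by Pass@1 scoring
    return None, None                             # no hard violation found

# Example 2 from Figure 7, replayed against its maze facts:
adj = {frozenset(p) for p in [("A1", "A2"), ("A2", "B2"), ("B1", "B2"), ("B1", "C1"), ("C1", "C2")]}
locked = {frozenset(("C1", "C2")): "1"}
keys = {"1": "A2"}
plan = [("start", "A1"), ("move_to", "A2"), ("pick_up_key", "1"), ("move_to", "B2"),
        ("move_to", "B1"), ("move_to", "C1"), ("use_key", "1"),
        ("unlock_and_open_door_to", "C2"), ("move_to", "C2"), ("rescue", "Alice")]
print(first_violation(plan, adj, locked, keys))   # -> (None, None)
```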
B.3 Violation Map: Qualitative Examples of Model Failures
This section provides qualitative examples of characteristic model failures to illustrate common error types. These examples visually support the discussion of failure modes in the main paper (Section 2.4, "A Key Failure Mode: Omission of Critical Steps"). Figure 8 illustrates a significant error by Gemini-2.5-Flash on a complex task, where the model generates an illegal path, bypassing necessary steps and locked doors. This exemplifies a breakdown in multi-step planning. Additionally, Figure 9 shows another common ’adjacency error,’ where a model attempts to jump between unconnected rooms. This type of error reveals a critical lapse in grounding its generated actions within the spatial adjacencies explicitly stated by the task’s input facts.
![Figure 8](figs/goodexample4040.png)
Figure 8: Illustrative failure case for Gemini-2.5-Flash on a 40x40 task with 2 locked doors on the optimal path. Left: Optimal path (yellow). Right: Model’s generated path showing an illegal adjacency jump (red arrow), bypassing multiple rooms and a locked door, despite only supporting facts being provided. This highlights a breakdown in multi-step planning.
![Figure 9](figs/mistakev2.png)
Figure 9: Illustrative failure case of an ’adjacency error’ in model-generated pathfinding on a 20x20 task with 2 locked doors on the optimal path. The left panel displays the optimal path (yellow) to the target (triangle). The right panel shows a suboptimal path (purple) generated by the model. This example highlights a common error where, after a sequence of actions (in this scenario, following a key acquisition), the model fails to navigate through valid connections. Instead, it attempts to ’jump’ directly between two unconnected rooms. This violation of room adjacency constraints is a key challenge in model performance.
B.4 Quantitative Analysis of Error Patterns
To understand how and when models begin to fail within a reasoning sequence, we analyze the distribution of the first violation step. We record the time step at which the initial violation occurs in a model’s generated path. Aggregating this step-indexed data across multiple instances allows us to create temporal distributions of errors. These distributions help determine whether errors tend to cluster early in the reasoning process (potentially indicating issues with initial planning or understanding of the overall problem complexity) or accumulate later (suggesting difficulties in maintaining long chains of inference or context). This analysis complements the discussion in the main paper (Section 2.4, "Path-Length Dependent First Errors: The Burden of Anticipated Complexity").
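A hypothetical aggregation of first-violation steps into logical-depth bins, mirroring the analysis described above, is sketched below; `records` is an illustrative placeholder, not released data.

```python
from collections import defaultdict
from statistics import mean

records = [  # (logical depth L of the instance, step index of the first violation)
    (20, 14), (20, 9), (60, 8), (60, 5), (100, 4), (100, 3),
]
by_depth = defaultdict(list)
for depth, first_err in records:
    by_depth[depth].append(first_err)
for depth in sorted(by_depth):
    print(f"L = {depth}: mean first-violation step = {mean(by_depth[depth]):.1f}")
```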
Figure 10 shows how the distribution of these first-error positions shifts with the overall problem complexity, represented by logical depth ( $L$ ). As detailed in the main paper, an increase in $L$ tends to cause errors to occur earlier in the reasoning chain.
![Figure 10](figs/failure_step_dist_vs_L.png)
Figure 10: Distribution of first-violation steps for Gemini-2.5-Flash across varying logical depths ( $L$ ). As $L$ (total required path length) increases, the distribution of first errors tends to shift leftward, indicating that models are more likely to fail at earlier steps in longer problems. This suggests that anticipated global complexity impacts reasoning from the outset. Experimental parameters in this figure are the same as those in Figure 1.
Similarly, Figure 11 illustrates how the introduction of contextual noise (distracting facts) affects the point of failure. Increased noise also tends to precipitate earlier errors in the reasoning sequence, as discussed in the main paper in relation to sensitivity to noise (Section 2.3) and its impact on error patterns (Section 2.4).
![Figure 11](figs/gemini-progress-ratio-vs-noise.png)
Figure 11: Impact of increasing noise ratio on the distribution of failure steps for Gemini 2.5 Flash. As noise (proportion of distracting facts) increases, failures tend to occur earlier in the reasoning chain. This reflects increased difficulty in isolating relevant information and maintaining focus. Fixed experimental parameters in this figure are the same as those in Figure 1.
Appendix C Supplementary Figures
This appendix provides supplementary figures that offer further visual support for analyses presented in the main paper. These figures illustrate the impact of various complexity dimensions and provide comparative views of model performance, elaborating on points made throughout Section 2 (Benchmarking Results) of the main paper.
Figure 12 details the performance of Llama-4 Maverick-17B-128E-Instruct under varying levels of noise and fact shuffling. This supports the discussion in the main paper (Section 2.3) on how these factors, especially in combination, affect success rates, with noise being a dominant factor.
![Figure 12](figs/single_model_vs_steps_count_varied_noise_shuffle_Llama-4-Maverick-17B-128E-Instruct-FP8.png)
Figure 12: Pass@1 success rate for Llama-4 Maverick-17B-128E-Instruct versus solution length ( $L$ ) under different noise and shuffle ratios. Left: Linear scale. Right: Log-linear scale. Performance degrades with increased noise but is less affected by shuffle ratios. Fixed experimental parameters in this figure are the same as those in Figure 1.
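Consistent with the exponential collapse in accuracy discussed in the main paper, curves like those in Figure 12 can be summarized by a decay of the form $a\,e^{-L/L_{0}}$ with a model-specific depth scale $L_{0}$. The sketch below shows one way such a fit could be computed; the data arrays are hypothetical placeholders, not results from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(L, a, L0):
    """Exponential decay of success rate with solution length L."""
    return a * np.exp(-L / L0)

L = np.array([10, 20, 30, 40, 50, 60, 70], dtype=float)     # solution lengths (illustrative)
success = np.array([0.9, 0.6, 0.4, 0.25, 0.15, 0.1, 0.06])  # hypothetical Pass@1 values
(a_hat, L0_hat), _ = curve_fit(decay, L, success, p0=(1.0, 20.0))
print(f"fitted a = {a_hat:.2f}, L0 = {L0_hat:.1f}")
```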
To illustrate the performance consistency and disparities across different models, as detailed in Section 2.6, Figure 13 presents scatter and density plots of mean progress ratios. These plots clearly demonstrate that model performance hierarchies are not strictly linear. They reveal ’performance inversions’—instances, also noted in Section 2.6, where models with typically lower overall performance (e.g., lower average $L_{0}$ ) occasionally solve specific complex problems that models with higher average $L_{0}$ values fail on.
![Figure 13](figs/progress_vs_progress.png)
Figure 13: Scatter and density plots of progress ratios per task instance, comparing pairs of models. These plots illustrate performance agreement and disparities on the same pathfinding task instances. Notably, Gemini-2.5-Flash (example) often succeeds on instances where other models achieve near-zero progress. Data from experiments in Figure 1 (main paper).
Figure 14 isolates the impact of shuffle ratio on model performance when other factors like noise are controlled. This visualization corresponds to the findings discussed in the main paper (Section 2.3, "Fact Ordering (Shuffle Ratio)") that simple reordering of facts has a minimal impact on the performance of the evaluated models under low-noise conditions.
Figure 15 isolates the impact of adding more examples to the instruction prompt, showing an improvement once more than a single example is included, compared to using none or only one.
Figure 16, added in this revised version of the supplementary material, shows that even the most recent SOTA models released by OpenAI exhibit the same performance drop observed in the main paper.
![Figure 14](figs/fig_vs_shuffle_fixed_L_keys2_noise0.2.png)
Figure 14: Impact of shuffle ratio on Pass@1 success rate. Varying the degree of mixing (shuffle) between supporting and distracting facts shows minimal impact on performance for Gemini 2.5 Flash and Llama-4 Maverick, suggesting robustness to fact order when noise is controlled. The generation and sampling of maze instances for these tasks follow the same methodology detailed for experiments in the main paper (Figures 3 and 4).
<details>
<summary>figs/maze_ablation_analysis.png Details</summary>

### Visual Description
## Line Chart: Llama-4-Maverick-17B-128E-Instruct-FP8 Performance
### Overview
The chart illustrates the success rate of different prompting strategies (shots and guided/unguided Chain of Thought) for the Llama-4-Maverick-17B-128E-Instruct-FP8 model across varying numbers of actions. Success rate declines sharply as the number of actions increases, with distinct patterns for each prompting strategy.
### Components/Axes
- **X-axis**: "Number of actions" (0 to 200, linear scale).
- **Y-axis**: "Success rate" (0 to 0.8, increments of 0.2).
- **Legend**: Located in the top-right corner, with five data series:
- **Green circles**: 5_shots_and_guided_CoT
- **Purple diamonds**: 3_shots_and_guided_CoT
- **Orange triangles**: 3_shot_unguided
- **Red triangles**: 1_shot_and_guided_CoT
- **Blue squares**: zero_shot_and_guided_CoT
### Detailed Analysis
1. **5_shots_and_guided_CoT (Green)**:
- Starts at ~0.75 success rate at 0 actions.
- Drops to ~0.25 at 50 actions, then declines to near 0 by 100 actions.
- Sharpest initial decline among all series.
2. **3_shots_and_guided_CoT (Purple)**:
- Begins at ~0.65 success rate at 0 actions.
- Declines to ~0.15 at 50 actions, then near 0 by 100 actions.
- Slightly less steep decline than the 5-shot guided series.
3. **3_shot_unguided (Orange)**:
- Starts at ~0.6 success rate at 0 actions.
- Drops to ~0.1 at 50 actions, then near 0 by 100 actions.
- Declines faster than 3-shot guided but slower than 5-shot guided.
4. **1_shot_and_guided_CoT (Red)**:
- Begins at ~0.55 success rate at 0 actions.
- Declines to ~0.05 at 50 actions, then near 0 by 100 actions.
- Steeper decline than 3-shot unguided.
5. **zero_shot_and_guided_CoT (Blue)**:
- Starts at ~0.5 success rate at 0 actions.
- Drops to ~0.02 at 50 actions, then near 0 by 100 actions.
- Slowest initial decline but lowest overall performance.
### Key Observations
- **Initial Performance**: Higher-shot guided strategies (5-shot, 3-shot) achieve the highest initial success rates.
- **Decline Rate**: All strategies show rapid success-rate decay as the number of actions increases, with the few-shot guided methods maintaining higher rates than the unguided series.
- **Unguided vs. Guided**: At a matched shot count, guided CoT outperforms the unguided variant (3-shot guided vs. 3-shot unguided).
- **Zero-Shot**: Zero-shot guided CoT starts lowest (~0.5) and remains the weakest series throughout, suggesting limited adaptability to complex tasks.
### Interpretation
The data demonstrates that **prompting strategy significantly impacts model performance**, with guided CoT methods (especially higher-shot variants) achieving better initial success. However, **task complexity (number of actions)** universally degrades performance, indicating limitations in handling multi-step reasoning. Zero-shot guided CoT starts lowest and decays to near zero, suggesting it struggles most on tasks requiring long action sequences. Guided CoT's structured reasoning mitigates the drop relative to unguided prompting, but even the best strategy retains little success beyond ~50 actions, and all strategies are near zero by ~100 actions. In-context examples and prompt guidance therefore shift the curves without changing the underlying collapse with logical depth.
</details>
Figure 15: The impact of including different numbers of reference examples in the prompt as part of in-context learning. Increasing the number of examples leads to slight improvements in performance. The experimental parameters used here are the same as those in Figure 1.
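For reference, the sketch below illustrates how the shot count and guided/unguided CoT conditions in Figure 15 could be assembled into a prompt. The `description`, `reasoning`, and `answer` fields and the trailing cue strings are assumptions made for illustration; the exact templates used in our experiments may differ.

```python
def build_prompt(task_description, examples, k=3, guided_cot=True):
    """Assemble a k-shot prompt. The guided-CoT variant includes each
    example's worked reasoning trace; the unguided variant shows only
    the final answers. Field names are hypothetical."""
    parts = []
    for ex in examples[:k]:
        parts.append(f"Task:\n{ex['description']}")
        if guided_cot:
            parts.append(f"Reasoning:\n{ex['reasoning']}")
        parts.append(f"Answer:\n{ex['answer']}")
    # Query instance, ending with a cue for the model to continue.
    parts.append(f"Task:\n{task_description}")
    parts.append("Reasoning:" if guided_cot else "Answer:")
    return "\n\n".join(parts)
```

Setting `k=0` with `guided_cot=True` reproduces the zero-shot guided condition, while `guided_cot=False` drops the reasoning traces for the unguided baselines.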
<details>
<summary>figs/model_comparison_openai.png Details</summary>

### Visual Description
## Line Chart: Success Rate vs. Number of Actions
### Overview
The chart illustrates the success rate of four AI models (GPT-5, OSS-120B, OSS-20B, Llama-4-Maverick) as the number of actions increases from 0 to 300. Success rate is plotted on the y-axis (0–1.0), while the x-axis represents the number of actions. All models show declining success rates with increasing actions, but the rate of decline varies significantly.
### Components/Axes
- **X-axis**: "Number of actions" (0–300, linear scale).
- **Y-axis**: "Success rate" (0–1.0, linear scale).
- **Legend**: Located in the top-right corner, mapping colors to models:
- Blue: GPT-5
- Orange: OSS-120B
- Green: OSS-20B
- Red: Llama-4-Maverick
### Detailed Analysis
1. **GPT-5 (Blue Line)**:
- Starts at **1.0 success rate** at 0 actions.
- Declines steadily, reaching **~0.62 at 100 actions**, **~0.25 at 200 actions**, and **~0.08 at 300 actions**.
- Maintains the highest success rate throughout.
2. **OSS-120B (Orange Line)**:
- Begins at **~0.98 at 0 actions**.
- Drops sharply to **~0.22 at 100 actions**, then plateaus near **0.01** by 300 actions.
- Steeper decline than GPT-5 despite a nearly identical starting point.
3. **OSS-20B (Green Line)**:
- Starts at **~0.88 at 0 actions**.
- Declines sharply to **~0.02 at 100 actions**, then stays near **0.0** through 300 actions.
- Faster decline than OSS-120B, combined with a lower initial success rate.
4. **Llama-4-Maverick (Red Line)**:
- Begins at **~0.65 at 0 actions**.
- Plummets to **~0.0** by 50 actions, remaining at 0 thereafter.
- Most drastic decline, with no success beyond 50 actions.
### Key Observations
- **GPT-5** consistently outperforms all models across all action counts.
- **OSS-120B** and **OSS-20B** show similar initial performance but diverge in decline rate.
- **Llama-4-Maverick** fails catastrophically after 50 actions, suggesting poor scalability.
- No lines intersect, indicating no model overtakes another in performance.
### Interpretation
The data suggests **GPT-5** sustains the deepest reasoning chains, retaining measurable accuracy even at 300 actions (~0.08), while the OSS models (120B and 20B) collapse at much shallower depths and Llama-4-Maverick fails almost immediately. All four curves follow the same qualitative pattern of rapidly decaying success as the number of actions grows; the models differ in the depth at which the collapse sets in, not in whether it occurs.
</details>
Figure 16: This figure shows that the recently released closed (GPT-5) and open-weight (OSS-20B/120B) models from OpenAI also follow the same universal failure pattern highlighted in this paper. The data and experimental settings are the same as those used in Figure 1 of the main paper. We include Llama-4-Maverick, which also appears in Figure 1, as a reference.
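The curves in Figure 16 exhibit the exponential collapse with logical depth discussed in the main text. As a rough illustration, the sketch below fits success ≈ exp(-n/L) in log space and extracts a characteristic depth L; the input values are approximate numbers read off the GPT-5 curve in the figure description above, and this is not the estimator behind the reported scaling fits.

```python
import numpy as np

def characteristic_depth(num_actions, success_rate, eps=1e-3):
    """Fit success_rate ~ exp(-num_actions / L) by least squares in log
    space and return L, the depth at which accuracy falls by a factor of e."""
    x = np.asarray(num_actions, dtype=float)
    y = np.clip(np.asarray(success_rate, dtype=float), eps, 1.0)
    slope, _intercept = np.polyfit(x, np.log(y), 1)
    return -1.0 / slope if slope < 0 else float("inf")

# Approximate values read off the GPT-5 curve in the figure description above.
L_gpt5 = characteristic_depth([0, 100, 200, 300], [1.0, 0.62, 0.25, 0.08])
```

Comparing L across models gives a single-number summary of how quickly each model's accuracy collapses as the required number of sequential actions grows.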