# seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
**Authors**:
- M.R. Ramezanali⋆, Salesforce AI, Palo Alto, CA (mramezanali@salesforce.com)
- M. Vazifeh⋆, Capital One, MIT, Cambridge, MA (mvazifeh@mit.edu)
- P. Santi, MIT, Cambridge, MA (psanti@mit.edu)
> ⋆ denotes equal contribution.
## Abstract
We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise control over several key complexity dimensions. seqBench allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, quantifying how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio of distracting to supporting facts about the environment. Our evaluations of state-of-the-art LLMs reveal a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, seqBench's fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in this paper alongside the benchmark's generation methodology and evaluation metrics. We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed to evolve alongside advancing models, the seqBench datasets are publicly released to spur deeper scientific inquiry into LLM reasoning, aiming to establish a clearer understanding of their true potential and current boundaries for robust real-world application.
Large Language Models (LLMs) have shown remarkable performance (Vaswani et al., 2017; Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022; Du et al., 2021; Fedus et al., 2022; Zoph et al., 2022) on a wide range of tasks and benchmarks spanning diverse human-like capabilities; however, these successes can obscure fundamental limitations in sequential reasoning that persist. Arguably, reasoning captures a purer form of intelligence, going beyond mere pattern matching or fact memorization, and is thus a critical capability to understand and enhance in AI systems. Recent studies show that state-of-the-art LLMs (OpenAI, 2025; Google DeepMind, 2025; Meta AI, 2025; Mistral AI, 2024; Anthropic, 2025) excel at complex benchmarks yet stumble on simple commonsense inferences trivial for an adult human (Nezhurina et al., 2025; Han et al., 2024; Sharma, 2024; Berglund et al., 2024; Yang et al., 2019). Most existing benchmarks saturate quickly, leaving little room for fine-grained attribution studies that systematically probe LLM failure modes. Consequently, a robust understanding of why and under what circumstances these models fail, especially on problems requiring sequential reasoning, remains elusive.
This gap, we argue, stems from the lack of evaluation benchmarks allowing systematic, multi-dimensional control over key independent factors that influence a task's overall reasoning difficulty. Most benchmarks (Cobbe et al., 2021; Hendrycks et al., 2021; Srivastava et al., 2023; Weston et al., 2015; Clark et al., 2018; Dua et al., 2019; Rein et al., 2023), despite their evaluation merits, do not support systematic variation of crucial complexity dimensions. This makes it difficult to isolate the specific conditions under which LLM reasoning falters. For instance, discerning whether a failure is due to the length of the required reasoning chain, the necessity of revising intermediate conclusions, or the density of distracting information is often not quantitatively possible. While prompting strategies like chain-of-thought (CoT) and model scaling have boosted aggregate performance, they often obscure sharp performance cliffs that can emerge when these underlying complexity dimensions are varied independently (Wei et al., 2023; Kojima et al., 2022). Without such systematic control, disentangling inherent architectural limitations from those addressable via scaling (model size, data, or compute), fine-tuning, or prompting techniques is challenging. A fine-grained understanding of these performance boundaries is crucial for developing more robust and reliable reasoning systems.
To complement recent efforts (Sprague et al., 2024; Tyagi et al., 2024; Kuratov et al., 2024; Tang and Kejriwal, 2025; Mirzaee et al., 2021; Tikhonov, 2024; Mirzaee and Kordjamshidi, 2022; Shi et al., 2022) in evaluating reasoning, and to address the need for more controlled analysis, we introduce seqBench, a tunable benchmark designed explicitly to probe and analyze sequential reasoning capabilities in language models. The dataset comprises synthetic yet linguistically grounded pathfinding task configurations on two-dimensional grids. Solving each problem requires sequential inference over relevant and distracting structured facts. Each instance is automatically verifiable and parameterized by controllable factors that directly address the previously identified gaps: (1) logical depth (total number of actions in the ground-truth solution, reflecting the length of the reasoning chain); (2) backtracking count (number of locked-door detours on the optimal path, requiring revision of tentative solution paths); and (3) noise ratio (proportion of distracting vs. supporting facts, testing robustness to irrelevant information). Performance along these dimensions can be quantified with fine-grained metrics (e.g., via the progress ratio defined in Section 1.3). We observe that beyond a certain logical depth, Pass@1 success collapses to near zero for all models (see Figure 1). These features enable precise attribution studies of model failure modes, offering insights into the brittle boundaries of current LLM generalization.
<details>
<summary>x1.png Details</summary>

Two vertically stacked line charts showing Pass@1 success rate versus the number of actions $L$ for eight models, on linear (top) and logarithmic (bottom) y-axes. Each model's data series is accompanied by a dashed exponential fit of the form $\sim\exp(-L/L_{0})$; a larger characteristic depth $L_{0}$ indicates slower decay and greater robustness. Fitted values: gemini-2.5-flash-preview-04-17 ($L_{0}=85.7$), gemini-2.0-flash ($L_{0}=40.2$), Llama-4-Maverick-17B-128E-Instruct-FP8 ($L_{0}=16.7$), Llama-3.3-70B-Instruct-Turbo ($L_{0}=10.2$), gemma-2-27b-it ($L_{0}=8.1$), Qwen2.5-Coder-32B-Instruct ($L_{0}=4.8$), Qwen2.5-7B-Instruct-Turbo ($L_{0}=4.0$), Llama-3.2-3B-Instruct-Turbo ($L_{0}=1.6$). All series start near 1.0 at $L=0$ and decay exponentially, appearing as straight lines on the log-scale panel and closely matching their fits; the hierarchy of $L_{0}$ values directly ranks model robustness over long action sequences.
</details>
Figure 1: Performance collapse of various models with increasing logical depth $L$ for a pathfinding task ( $N,M=40,\mathcal{B}=2$ keys, Noise Ratio $\mathcal{N}=0.0$ ). Success rates (Pass@1) are shown on linear (top panel) and logarithmic (bottom panel) y-axes, averaged over 5 runs per problem across 40 problems per unit $L$-bin. All evaluations used Temperature=1.0 and top-p=0.95 (Gemini-2.5-flash: 'auto' thinking). The displayed fits employ a Weighted Least Squares (WLS) method (Carroll and Ruppert, 2017) on log-success rates, with weights derived from the inverse squared residuals of a preliminary Ordinary Least Squares (OLS) fit. (Figure 16 in the supplementary material shows a similar pattern in recently released OpenAI models.)
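The two-stage fit described in the caption (a preliminary OLS fit on log-success rates, then a WLS refit with inverse-squared-residual weights) can be sketched in pure Python. This is an illustrative re-implementation under our own naming (`fit_decay` and the synthetic data below are not the authors' code):

```python
import math

def fit_decay(L, success, eps=1e-9):
    """Estimate L0 in success ~ exp(-L / L0) via OLS-then-WLS on log-success."""
    logy = [math.log(max(s, eps)) for s in success]

    def wls(weights):
        # Closed-form weighted linear regression of log-success on L.
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, L)) / sw
        my = sum(w * y for w, y in zip(weights, logy)) / sw
        cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, L, logy))
        var = sum(w * (x - mx) ** 2 for w, x in zip(weights, L))
        slope = cov / var
        return slope, my - slope * mx

    # Preliminary OLS fit (uniform weights).
    slope, intercept = wls([1.0] * len(L))
    resid = [y - (slope * x + intercept) for x, y in zip(L, logy)]
    # WLS refit: weights from inverse squared OLS residuals, floored at eps.
    w = [1.0 / max(r * r, eps) for r in resid]
    slope_w, _ = wls(w)
    return -1.0 / slope_w  # characteristic logical depth L0
```

On synthetic data generated exactly as `exp(-L/50)`, this recovers a characteristic depth of 50, mirroring how the per-model $L_{0}$ values in Figure 1 are obtained.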
Furthermore, the seqBench benchmark is built upon a scalable data-generation framework, allowing it to evolve alongside increasingly capable models and to support both model training and evaluation. Through evaluations of popular LLMs, we show that top-performing models exhibit steep, universal performance declines as any of the three complexity dimensions increases, while remaining comparatively robust to fact shuffling, even though the underlying logical structure is unchanged.
#### Contributions.
Our main contributions are:
1. seqBench: A Tunable Benchmark for Sequential Reasoning. We introduce an open-source framework for generating pathfinding tasks with fine-grained, orthogonal control over logical depth, backtracking steps, and noise ratio. We also evaluate secondary factors such as fact ordering (shuffle ratio; see the supplementary material for details).
2. Comprehensive LLM Attribution Study. Using seqBench, we demonstrate the significant impact of these controlled complexities on LLM performance, revealing sharp performance cliffs in state-of-the-art models even when search complexity is minimal.
The seqBench dataset is publicly available at https://huggingface.co/datasets/emnlp-submission/seqBench under the CC BY 4.0 license to facilitate benchmarking.
<details>
<summary>figs/llama4_deepdive.png Details</summary>

Two vertically stacked charts for Llama-4-Maverick-17B-128E-Instruct-FP8. Top: Pass@1 success rate versus number of actions, starting near 0.65 and dropping below 0.01 by roughly 70 actions; the dashed exponential fit $\propto\exp(-L/L_{0})$ with $L_{0}=16.7$ tracks the data closely. Bottom: precision, recall, and progress ratio (with error bars) versus number of actions. Precision stays high and stable (roughly 0.88 to 0.90) throughout; recall declines steadily from about 0.80 to about 0.28 with growing variance; the progress ratio drops sharply from about 0.45 toward 0.07. Together the panels indicate that the model remains locally precise while increasingly failing to recall required actions and to sustain progress over long sequences.
</details>
Figure 2: Top: Llama-4-Maverick-17B-128E-Instruct performance (Pass@1 success rate) versus the number of actions in the ground-truth path of the pathfinding problems ( $N,M=40,\mathcal{B}=2$ keys, Noise Ratio $\mathcal{N}=0.0$ ). The Pass@1 success rate across 5 runs per problem is averaged over problem instances sampled from action-count bins of width 1. Bottom: the mean progress ratio across all problems, together with mean precision and recall, highlights the model's gradually increasing struggle to complete the path. Temperature is set to 1.0 and top-p to 0.95 in all runs.
## 1 Methods
### 1.1 Dataset Generation
The seqBench dataset consists of spatial pathfinding tasks. Task instance generation, detailed below (Algorithm 1; See Appendix A for details), is predicated on the precise independent control of the three key complexity dimensions introduced earlier: Logical Depth ( $L$ ), Backtracking Count ( $\mathcal{B}$ ), and Noise Ratio ( $\mathcal{N}$ ). This allows the creation of instances with specific values for these parameters, enabling targeted studies of their impact on LLM reasoning.
Task instances are produced in a multi-stage process. Initially, primary generation parameters—maze dimensions ( $N,M$ ), target backtracks ( $\mathcal{B}_{\text{target}}$ ), and target noise ratio ( $\mathcal{N}_{\text{target}}$ )—are specified. An acyclic maze graph ( $M_{g}$ ) is formed on an $N\times M$ grid using Kruskal’s algorithm (Kleinberg and Tardos, 2006). Our "Rewind Construction" method (Algorithm 1) then embeds $\mathcal{B}_{\text{target}}$ backtracking maneuvers by working backward from a goal to strategically place keys and locked doors, yielding the instance’s actual backtracking count $\mathcal{B}$ . Finally, a natural language fact list ( $\mathcal{F}$ ) is derived from the maze, and distracting facts are added according to $\mathcal{N}_{\text{target}}$ to achieve the final noise ratio $\mathcal{N}$ . The logical depth $L$ (optimal path length) emerges from these generative steps, influenced by $N,M,\mathcal{B}_{\text{target}}$ , and construction stochasticity. While $L$ is not a direct input to the generation algorithm, the process is designed to yield a wide spectrum of logical depths. Each generated instance is then precisely annotated with its emergent $L$ value, alongside its effective $\mathcal{B}$ and $\mathcal{N}$ values. This annotation effectively makes $L$ a key, selectable parameter for users of the seqBench dataset, enabling them to choose or filter tasks by their desired logical depth. Our rewind construction method guarantees task solvability. The full seqBench benchmark is constructed by systematically applying this instance generation process (detailed in Algorithm 1) across a wide range of initial parameters. This includes varied grid sizes (e.g., $N\in\{5..50\},M\approx N$ ) and target backtracks ( $\mathcal{B}_{\text{target}}\in\{0..7\}$ ), yielding a large and diverse data pool. 
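The final step above, padding the supporting fact list with distractors to hit a target noise ratio, can be sketched as follows. This is our own illustration: `apply_noise` is a hypothetical helper, and it reads the noise ratio as (#distracting / #supporting), one plausible reading of the definition; the released generator's exact convention and fact wording may differ:

```python
import random

def apply_noise(supporting_facts, distractor_pool, noise_target, seed=0):
    """Pad the fact list with sampled distractors to approximate noise_target.

    Returns the combined fact list and the achieved ratio, since the pool
    may be too small to reach the target exactly.
    """
    rng = random.Random(seed)
    k = min(round(noise_target * len(supporting_facts)), len(distractor_pool))
    facts = supporting_facts + rng.sample(distractor_pool, k)
    achieved = k / len(supporting_facts)
    return facts, achieved
```

Annotating each instance with the achieved ratio (rather than trusting the target) matches the paper's distinction between $\mathcal{N}_{\text{target}}$ and the effective $\mathcal{N}$.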
For each $(N,M,\mathcal{B}_{\text{target}})$ configuration, multiple unique base mazes are generated, to which different noise ratios (e.g., $\mathcal{N}_{\text{target}}\in\{0..1\}$ ) are subsequently applied. It is important to note that the algorithm constrains backtracking complexity to a simple dependency chain. In this setting, retrieving the key for each locked door involves at most one backtracking step to pick up its corresponding key, without requiring the unlocking of additional doors along the optimal path. Combined with the uniform random placement of keys, this design ensures a well-balanced distribution of backtracking difficulty across the generated instances for each logical depth $L$ . Nevertheless, the same backward-in-time construction can be extended to generate tasks with higher backtracking complexity—for example, doors that require multiple keys, or intermediate doors that must be unlocked en route to other keys. Such extensions would introduce richer tree-structured dependency graphs and allow seqBench to probe model performance under more complex long-horizon reasoning regimes. The creation of this comprehensive data pool was computationally efficient, requiring approximately an hour of computation on a standard laptop while using minimal memory. The publicly released benchmark comprises a substantial collection of these generated instances, each annotated with its specific emergent logical depth $L$ , effective backtracking count $\mathcal{B}$ , and noise ratio $\mathcal{N}$ . This rich annotation is key, enabling researchers to readily select or filter task subsets by these dimensions for targeted studies (e.g., as done for Figure 1, where instances were sampled into $L$ -bins with other parameters fixed). 
For the experiments presented in this paper, specific subsets were drawn from this benchmark pool, often involving further filtering or parameter adjustments tailored to the objectives of each study; precise details for each experiment are provided in the relevant sections and figure captions. Full details on path derivation, fact compilation, and overall dataset generation parameters are provided in the Appendix A.
Input: Grid $N\times M$, target backtracks $\mathcal{B}$
Output: Maze graph $M_{g}$, locked doors $\mathcal{D}_{L}$, key info $\mathcal{K}_{I}$, path skeleton $\Pi_{S}$
$M_{g}\leftarrow$ acyclic graph on grid (Kruskal's);
$x\leftarrow C_{goal}\leftarrow$ random goal cell in $M_{g}$;
$\mathcal{D}_{L},\mathcal{K}_{I}\leftarrow\emptyset,\emptyset$; $b\leftarrow 0$;
$\Pi_{S}\leftarrow[(C_{goal},\text{GOAL})]$;
while $b<\mathcal{B}$ do
$c_{key}\leftarrow$ random cell in $M_{g}$ accessible from $x$ (path avoids $\mathcal{D}_{L}$ for this step);
$\pi_{seg}\leftarrow$ unique path in $M_{g}$ from $x$ to $c_{key}$;
if $\exists e\in\pi_{seg}$ such that $e\notin\mathcal{D}_{L}$ then
$d\leftarrow$ randomly selected such edge $e$;
$\mathcal{D}_{L}\leftarrow\mathcal{D}_{L}\cup\{d\}$;
$K_{id}\leftarrow$ new unique key ID;
$\mathcal{K}_{I}[K_{id}]\leftarrow\{\text{opens}:d,\text{loc}:c_{key}\}$;
$\Pi_{S}$.prepend($(c_{key},\text{PICKUP }K_{id})$, $(d,\text{UNLOCK }K_{id})$, $(\pi_{seg},\text{MOVE})$);
$x\leftarrow c_{key}$; $b\leftarrow b+1$;
else
break
end if
end while
$\Pi_{S}$.prepend($(x,\text{START})$);
return $M_{g},\mathcal{D}_{L},\mathcal{K}_{I},\Pi_{S}$;
Algorithm 1: Rewind Construction of Path Skeleton
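Algorithm 1 can be prototyped in a few dozen lines of Python. The sketch below is illustrative, not the released implementation: `kruskal_maze`, `tree_path`, and `rewind_construction` are our own names, using a union-find random spanning tree and the tree's unique paths:

```python
import random
from collections import defaultdict

def kruskal_maze(n, m, rng):
    """Random spanning tree over an n x m grid (acyclic maze graph)."""
    parent = {(r, c): (r, c) for r in range(n) for c in range(m)}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    edges = [((r, c), (r + dr, c + dc))
             for r in range(n) for c in range(m)
             for dr, dc in ((0, 1), (1, 0))
             if r + dr < n and c + dc < m]
    rng.shuffle(edges)
    adj = defaultdict(list)
    for a, b in edges:           # union edges in random order (Kruskal-style)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            adj[a].append(b)
            adj[b].append(a)
    return adj

def tree_path(adj, src, dst):
    """Unique path between two cells in the tree (iterative DFS)."""
    stack, seen = [(src, [src])], {src}
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    raise ValueError("disconnected")

def rewind_construction(n, m, backtracks, seed=0):
    """Walk backward from the goal, placing key/locked-door pairs."""
    rng = random.Random(seed)
    adj = kruskal_maze(n, m, rng)
    cells = list(adj)
    x = goal = rng.choice(cells)
    locked, keys = set(), {}
    skeleton = [(goal, "GOAL")]
    b = 0
    while b < backtracks:
        c_key = rng.choice(cells)
        seg = tree_path(adj, x, c_key)
        # Candidate door edges on this segment that are not yet locked.
        free = [e for e in zip(seg, seg[1:]) if frozenset(e) not in locked]
        if not free:
            break
        d = frozenset(rng.choice(free))
        locked.add(d)
        kid = f"key_{b}"
        keys[kid] = {"opens": d, "loc": c_key}
        skeleton = [(c_key, f"PICKUP {kid}"), (d, f"UNLOCK {kid}"),
                    (seg, "MOVE")] + skeleton
        x, b = c_key, b + 1
    skeleton = [(x, "START")] + skeleton
    return adj, locked, keys, skeleton
```

Because the maze is a tree, the path between any two cells is unique, which keeps search complexity minimal while the rewind loop injects the desired backtracking structure.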
### 1.2 Prompt Construction and Model Configuration
Our evaluation uses a standardized prompt template with four components: (i) task instructions and action schema, (ii) three few-shot examples of increasing complexity (simple navigation, single-key, and multi-key backtracking), (iii) optional reasoning guidance, and (iv) the problem's natural-language facts. All models are queried with temperature $T{=}1.0$, nucleus sampling $p{=}0.95$, and the maximum output-token limit allowed for each model. For each instance, we perform 5 independent runs to establish robust performance statistics. The complete prompt structure, shown in Figure 6, is provided in Appendix B.
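The four-component assembly can be sketched as below. This is a hypothetical illustration only; `build_prompt` and the placeholder strings are our own, and the literal wording used by seqBench is given in Appendix B:

```python
# Placeholder few-shot examples (the real ones appear in Appendix B).
FEW_SHOT = [
    "Example 1 (simple navigation): ...",
    "Example 2 (single key): ...",
    "Example 3 (multi-key backtracking): ...",
]

def build_prompt(facts, guidance=None):
    """Assemble the four prompt components in order."""
    parts = [
        # (i) task instructions and action schema
        "Navigate the maze. Answer with one action per line using: "
        "MOVE <direction>, PICKUP <key>, UNLOCK <door>.",
        # (ii) few-shot examples of increasing complexity
        *FEW_SHOT,
    ]
    if guidance:
        # (iii) optional reasoning guidance
        parts.append(guidance)
    # (iv) the problem's natural-language facts
    parts.append("Facts:\n" + "\n".join(facts))
    return "\n\n".join(parts)
```

Keeping the template fixed across models isolates the complexity dimensions under study from prompt variation.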
### 1.3 Evaluation Metrics
To analyze not just whether but how models fail, we employ several complementary metrics. Success Rate (Pass@1) measures the proportion of runs in which the predicted action sequence exactly matches the ground truth. The Progress Ratio (Tyagi et al., 2024), calculated as $k/n$ (where $n$ is the total number of ground-truth actions and $k$ is the number executed correctly before the first error), pinpoints where the reasoning chain breaks down. We also use Precision and Recall: precision is the proportion of predicted actions that are correct, while recall is the proportion of ground-truth actions that were correctly predicted. Low precision indicates hallucinated actions; low recall signifies missed necessary actions. Additionally, we visualize error locations via a Violation Map. This multi-faceted approach reveals each model's effective "reasoning horizon", the maximum sequence length it can reliably traverse. Further details on all metrics and visualizations are provided in the supplementary material.
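These metrics can be computed directly from the predicted and ground-truth action sequences. The scorer below is a minimal sketch of our own (`evaluate_run` is an assumed name, and the multiset-overlap reading of precision/recall is our interpretation, not the released evaluation code):

```python
from collections import Counter

def evaluate_run(predicted, ground_truth):
    """Score one run; both arguments are lists of action strings."""
    n = len(ground_truth)
    exact = predicted == ground_truth          # Pass@1 criterion
    # Progress ratio: length of the correct prefix before the first error.
    k = 0
    for p, g in zip(predicted, ground_truth):
        if p != g:
            break
        k += 1
    # Multiset overlap of actions for precision/recall.
    overlap = sum((Counter(predicted) & Counter(ground_truth)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / n if n else 0.0
    return {"pass@1": exact, "progress_ratio": k / n,
            "precision": precision, "recall": recall}
```

For example, a run that skips a required PICKUP but otherwise emits valid moves can score perfect precision while losing recall and progress ratio, exactly the dissociation visible in Figure 2.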
## 2 Benchmarking Results
<details>
<summary>figs/fig_vs_backtracking_fixed_L_shuffle1.0_noise0.0.png Details</summary>

### Visual Description
Three line charts share an x-axis ("Number of backtracking steps", 0–5) and compare five models: Llama-4-maverick-17b-128e-instruct-fp8, Qwen2.5-coder-32b-instruct, Llama-3.1-nemotron-70b-instruct-hf, Gemini-2.0-flash, and Gemini-2.5-flash-preview-04-17. Panels: (1) progress ratio mean (0.0–1.0), (2) success rate (0.0–1.0), (3) number of tokens (250–1750). Gemini-2.5-flash-preview leads both performance panels at every backtracking count (success rate ~0.88 at 0 steps, ~0.62 at 5) while using the fewest tokens (~280–420); Gemini-2.0-flash declines steeply (success ~0.55 to ~0.02); the Llama and Qwen models start lower and fall to near-zero success, while token counts remain stable or rise with backtracking (Llama-4-maverick highest, ~1580–1780).
</details>
Figure 3: Performance as a function of the number of required backtracking steps, operationalized via the number of locked doors with distributed keys along the optimal path. Holding all other complexity factors constant, all models exhibit a clear decline in both progress ratio and success rate as backtracking demands increase. Additionally, we report the corresponding rise in output token counts per model, highlighting the increased reasoning burden associated with longer dependency chains. Fixed experimental parameters are the same as those in Figure 1 (each point: 100 problems sampled with $L\in[40,60]$ ).
### 2.1 Evaluated Models
We evaluate a diverse set of transformer-based LLMs across different model families and parameter scales. Our analysis includes Gemini models (2.5-flash-preview, 2.0-flash), Meta’s Llama family (4-Maverick-17B, 3.3-70B, 3.2-3B), Google’s Gemma-2-27b, and Alibaba’s Qwen models (2.5-Coder-32B, 2.5-7B). [Note: GPT-5 was released during the preparation of this paper’s final version. Our analysis shows that this model exhibits the same performance degradation, as shown in Figure 16]. Access to some open-weight models and benchmarking infrastructure was facilitated by platforms such as Together AI https://www.together.ai/ and Google AI Studio https://aistudio.google.com/. Problem instances for varying logical depths ( $L$ ) were generated by sampling 40 problems for each $L$ , using a fixed maze size of $40\times 40$ and 2 keys, unless otherwise specified for specific experiments (e.g., when varying the number of keys for backtracking analysis). All models were evaluated using the standardized prompt template (see Figure 6), the inference settings detailed in Section 1.2, and a common response parsing methodology. For each task instance, we perform 5 independent runs to establish robust performance statistics, primarily analyzing Pass@1 success rates.
### 2.2 Universal Performance Collapse with Increasing Logical Depth
A central finding of our study is the universal collapse in reasoning performance observed across all evaluated LLMs when confronted with tasks requiring increasing sequential inference steps. As illustrated in Figure 1, Pass@1 success rates exhibit a consistent and sharp exponential decay as the ground-truth path length ( $L$ ) increases. Performance rapidly approaches near-zero past a model-specific point in this decay. To quantify and compare this exponential decay, we fit an exponential decay curve $P(L)=\exp(-L/L_{0})$ to the success rates, deriving a characteristic path length $L_{0}$ . This $L_{0}$ value, representing the path length at which performance drops by a factor of $e^{-1}$ , serves as a robust metric for each model’s sequential reasoning horizon. Plotting success rates on a semi-logarithmic (log-y) scale against $L$ reveals an approximately linear decay trend across the evaluated regime. This log-linear relationship suggests that errors may accumulate with a degree of independence at each reasoning step, eventually overwhelming the model’s capacity for coherent inference. The observed $L_{0}$ values vary significantly, from 85.7 for Gemini-2.5-Flash down to 1.6 for Llama-3.2-3B (Figure 1), underscoring a fundamental bottleneck in current transformer architectures for extended multi-step reasoning.
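The $L_{0}$ fit described above amounts to a zero-intercept linear regression of $\log P$ on $L$. The sketch below illustrates this with synthetic data; `fit_L0` is a hypothetical helper, not the paper's fitting code.

```python
# Sketch of the L0 fit: regress log success rate on path length L,
# i.e. fit P(L) = exp(-L / L0) by least squares in log space.
# Data below is synthetic, for illustration only.
import math


def fit_L0(lengths, success_rates):
    """Estimate L0 from (L, P) pairs via a zero-intercept
    log-linear fit: log P = -(1/L0) * L."""
    pts = [(L, math.log(p)) for L, p in zip(lengths, success_rates) if p > 0]
    # Least-squares slope of a line through the origin
    slope = sum(L * y for L, y in pts) / sum(L * L for L, _ in pts)
    return -1.0 / slope


# Synthetic success rates generated with L0 = 50; the fit recovers it.
lengths = [10, 20, 40, 80]
rates = [math.exp(-L / 50) for L in lengths]
print(round(fit_L0(lengths, rates), 1))  # → 50.0
```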
### 2.3 Impact of Independently Controlled Complexity Dimensions
Beyond the universal impact of logical depth ( $L$ ) discussed in Section 2.2, our benchmark’s ability to independently vary key complexity dimensions allows for targeted analysis of their distinct impacts on LLM reasoning performance. We highlight the effects of noise, backtracking, and fact ordering, primarily focusing on Pass@1 success rates, mean progress ratios, and response token counts.
<details>
<summary>figs/fig_vary_noise_fixed_L_keys2_shuffle1.0.png Details</summary>

### Visual Description
Three line charts share an x-axis ("Noise ratio", 0.00–1.00) and compare Llama-4-maverick-17b-128e-instruct-fp8 and Gemini-2.5-flash-preview-04-17. Panels: (1) mean progress ratio, (2) mean success rate (pass@1), (3) CoT tokens. Gemini starts far higher (success ~0.62 at zero noise) but declines steeply to ~0.02 at full noise; Llama stays near zero throughout (success ~0.04 to ~0.01). Gemini's CoT token usage is low and stable (~350–380) while Llama's is high (~1480–1700) and drifts slightly downward as noise increases.
</details>
Figure 4: Performance as a function of contextual noise for the Gemini 2.5 Flash and Llama-4 Maverick-17B-128E-Instruct models. As noise increases through the inclusion of distracting or irrelevant facts, both models exhibit a clear and consistent decline in performance. Fixed experimental parameters are the same as those in Figure 1 (each point: 100 problems sampled with $L\in[40,60]$ ; number of keys equal to 2).
#### Impact of Backtracking Requirements.
Increasing the number of required backtracking steps—operationalized via key-door mechanisms—also leads to a clear and significant decline in Pass@1 success rates and mean progress ratios across all evaluated models, as shown in Figure 3. Gemini 2.5 Flash-preview maintains the highest performance but still exhibits a notable drop as the backtracking count increases from 0 to 5. This decline in reasoning accuracy is generally accompanied by an increase in, or a sustained high level of, the mean number of response tokens (Figure 3, right panel). For example, models like Llama-4 Maverick and Gemini 2.5 Flash-preview show a clear upward trend or maintain high token counts as backtracking complexity rises, reflecting the increased reasoning effort or path length articulated by the models when managing more complex sequential dependencies.
#### Sensitivity to Noise Ratio.
Model performance is highly sensitive to the noise ratio—the proportion of distracting versus supporting facts. As demonstrated in Figure 4 for Gemini 2.5 Flash and Llama-4 Maverick, increasing the proportion of irrelevant facts consistently and significantly degrades both Pass@1 success rates and mean progress ratios. For instance, Gemini 2.5 Flash’s Pass@1 success rate drops from over 0.7 at zero noise to approximately 0.2 at a noise ratio of 1.0. Llama-4 Maverick, starting with lower performance, also shows a consistent decline. Interestingly, for these two models, the number of CoT (output) tokens remains relatively stable despite the increasing noise and degrading performance (Figure 4, right panel), suggesting that models do not necessarily "work harder" (in terms of output length) when faced with more distractors, but their accuracy suffers.
#### Fact Ordering (Shuffle Ratio).
In contrast to the strong effects of noise and backtracking, the shuffle ratio (the entropy of the fact presentation order) within the prompt appears to play a secondary role when varied in isolation. Our experiments, exemplified by the performance of Gemini 2.5 Flash and Llama-4 Maverick (see Appendix C Figure 14 for details), show that complete shuffling of facts (randomizing their presentation order without adding or removing any information) has a minimal impact on Pass@1 success rates and mean progress ratios. Output token counts also remain stable. This suggests a relative robustness to presentation order as long as all necessary information is present and distinguishable. However, as detailed in the supplementary material, when high noise and high shuffle co-occur, the combined effect can be more detrimental than either factor alone, though noise remains the dominant degrading factor.
### 2.4 Characterizing Key Failure Modes and Error Patterns
#### A Key Failure Mode: Omission of Critical Steps.
Beyond simply taking illegal shortcuts, detailed analysis reveals that LLMs often fail by omitting critical sub-goals necessary for task completion. Figure 2 (bottom panel) provides a quantitative view for Llama-4 Maverick (Meta AI, 2025), showing that while precision generally remains high (models infrequently hallucinate non-existent rooms or facts), recall and progress ratio plummet with increasing path length ( $L$ ). This indicates that models predominantly fail by missing necessary actions or entire crucial sub-sequences. For a qualitative example, even capable models like Gemini-2.5-Flash can neglect essential detours, such as collecting a required key, thereby violating sequential dependencies and rendering the task unsolvable (illustrative examples are provided in the Appendix B.4; see Figures 8 and 9). This pattern highlights a fundamental breakdown in robust multi-step planning and execution.
#### Path-Length Dependent First Errors: The Burden of Anticipated Complexity.
The propensity for models to make critical errors is not uniformly distributed across the reasoning process, nor is it solely a feature of late-stage reasoning fatigue. Examining the distribution of steps at which the first constraint violations occur reveals a counterintuitive pattern: as the total required path length ( $L$ ) of a problem increases, models tend to fail more frequently even at the earliest steps of the reasoning chain. This leftward shift in the first-error distribution, also observed under increasing noise (Appendix B.4; Figures 10 and 11), contradicts a simple cumulative error model where each step carries a fixed, independent failure probability. Instead, an error at an early step (e.g., step 5) becomes substantially more likely when the model is attempting to solve an 80-step problem versus a 20-step problem. This suggests that the overall anticipated complexity of the full problem influences reasoning quality from the very outset, indicating a struggle with global planning or maintaining coherence over longer horizons, rather than just an accumulation of local errors. This phenomenon may help explain why prompting techniques that decompose long problems into smaller, manageable sub-problems often succeed.
### 2.5 Disparity: Information Retention vs. Reasoning Capacity
On seqBench tasks, this disparity is quantitatively striking. While modern LLMs boast million-token contexts, their effective sequential reasoning depth typically remains on the order of hundreds of actions (Figure 1). This functional limit, even at several hundred actions (e.g., 300 actions, with each like (’move_to’, ’A12’) being 5-7 tokens, totaling 1.5k-2.1k tokens), still consumes a minute fraction of their nominal context. Consequently, the ratio of context capacity to reasoning tokens often spans from several hundred-fold (e.g., 500:1 for 300 actions consuming 2k tokens within a 1M context) to potentially higher values given fewer limiting actions or larger model contexts. This striking gap suggests that while transformers can store and retrieve vast information, their ability to reliably chain it for coherent, multi-step inference appears surprisingly constrained.
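The back-of-the-envelope arithmetic above can be checked directly; the figures below simply restate the example numbers from the text (300 actions, 5–7 tokens per action, a 1M-token context).

```python
# Sanity check of the ratio above: a functional limit of ~300 actions,
# at 5-7 tokens per action, set against a nominal 1M-token context.
actions = 300
tokens_per_action = (5, 7)
reasoning_tokens = tuple(t * actions for t in tokens_per_action)
context = 1_000_000
ratios = tuple(context // t for t in reasoning_tokens)
# reasoning_tokens is (1500, 2100); the context-to-reasoning ratio is
# roughly 666:1 down to 476:1, i.e. several hundred-fold.
print(reasoning_tokens, ratios)
```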
### 2.6 Challenging the Conventional Performance Hierarchy
While metrics like average $L_{0}$ provide a general ranking of model capabilities, our fine-grained analysis reveals instances that challenge a simple linear performance hierarchy. Scatter plots of progress ratios across different models on identical tasks (see Appendix C Figure 13) show intriguing cases where models with lower overall $L_{0}$ values (i.e., typically weaker models) occasionally solve specific complex problems perfectly, while models with higher average $L_{0}$ values fail on those same instances. These performance inversions suggest that sequential reasoning failures may not solely stem from insufficient scale (parameters or general training) but could also arise from more nuanced reasoning limitations.
## 3 Related Work
Recent advancements in benchmarks evaluating sequential reasoning capabilities of LLMs have illuminated various strengths and limitations across different dimensions of complexity. These benchmarks typically differ in how they isolate and quantify reasoning challenges, such as logical deduction, retrieval difficulty, combinatorial complexity, and sensitivity to irrelevant information. ZebraLogic (Lin et al., 2025), for instance, targets formal deductive inference through logic-grid puzzles framed as constraint-satisfaction problems (csp, 2008). While valuable for probing deduction, its core methodology leads to a search space that grows factorially with puzzle size (Sempolinski, 2009). This makes it challenging to disentangle intrinsic reasoning failures from the sheer combinatorial complexity of the search. As the ZebraLogic authors themselves acknowledge: “solving ZebraLogic puzzles for large instances may become intractable… the required number of reasoning tokens may increase exponentially with the size of the puzzle.” This inherent characteristic means that for larger puzzles, performance is primarily dictated by the manageability of the search space rather than the limits of sequential reasoning depth. GridPuzzle (Tyagi et al., 2024) complements this by providing a detailed error taxonomy for grid puzzles, focusing on what kinds of reasoning mistakes LLMs make. However, like ZebraLogic, it does not offer independent control over key complexity dimensions such as logical depth, backtracking needs, or noise, separate from the puzzle’s inherent search complexity.
Other benchmarks conflate reasoning with different cognitive demands. BABILong (Kuratov et al., 2024) tests models on extremely long contexts (up to 50M tokens), primarily assessing the ability to retrieve "needles" (facts) from a "haystack" (distracting text that does not contribute to solving the task). While valuable for evaluating long-context processing, this design makes it hard to disentangle retrieval failures from reasoning breakdowns, as performance is often dictated by finding the relevant information rather than reasoning over it. MuSR (Sprague et al., 2024) embeds reasoning tasks within lengthy narratives (e.g., murder mysteries), mixing information extraction challenges with complex, domain-specific reasoning structures. This realism obscures which specific aspect—extraction or reasoning depth—causes model failures. Dyna-bAbI (Tamari et al., 2021) offers a dynamic framework for compositional generalization but focuses on qualitative combinations rather than systematically varying quantitative complexity metrics needed to find precise failure points.
Spatial reasoning benchmarks, while relevant, also target different aspects. GRASP (Tang and Kejriwal, 2025) assesses practical spatial planning efficiency (like obstacle avoidance) in 2D grids, a different skill than the abstract sequential reasoning seqBench isolates. SPARTQA (Mirzaee et al., 2021) focuses on specialized spatial relational complexity (transitivity, symmetry) using coupled dimensions, preventing independent analysis of factors like path length. SpaRTUN (Mirzaee and Kordjamshidi, 2022) uses synthetic data primarily for transfer learning in Spatial Question Answering (SQA), aiming to improve model performance rather than serve as a diagnostic tool with controllable complexity. Similarly, StepGame (Shi et al., 2022) demonstrates performance decay with more reasoning steps in SQA but lacks the fine-grained, orthogonal controls over distinct complexity factors provided by seqBench.
In contrast, seqBench takes a targeted diagnostic approach. By deliberately simplifying the spatial environment to minimize search complexity, it isolates sequential reasoning. Its core contribution lies in the independent, fine-grained control over (1) logical depth (the number of sequential actions required to solve the task), (2) backtracking count (the number of backtracking steps along the optimal path), and (3) noise ratio (the ratio of supporting to distracting facts). This orthogonal parameterization allows us to precisely pinpoint when and why sequential reasoning capabilities degrade, revealing fundamental performance cliffs even when search and retrieval demands are trivial. seqBench thus offers a complementary tool for understanding the specific limitations of sequential inference in LLMs.
## 4 Limitations
While seqBench offers precise control over key reasoning complexities, our study has limitations that open avenues for future research:
1. Generalizability and Task Design Fidelity: Our current findings are rooted in synthetic spatial pathfinding tasks. While this allows for controlled experimentation, future work must extend seqBench ’s methodology to more diverse reasoning domains (e.g., mathematical proofs) and incorporate greater linguistic diversity (e.g., ambiguity) to assess the broader applicability of the observed phenomena of performance collapse (quantified by $L_{0}$ ) and failure patterns. Moreover, this work did not investigate whether similar failure modes arise when the problem is also presented visually (e.g., as maze images). Multimodal capabilities could influence spatial reasoning outcomes, and we have already extended the benchmark by releasing maze image generation code alongside the HuggingFace dataset. This dataset can also be used to help train multimodal reasoning models.
1. Model Scope and Understanding Deeper Failure Dynamics: Our current evaluation, while covering diverse public models, should be expanded to a wider array of LLMs—including recent proprietary and newer open-source variants (e.g., GPT, Claude, DeepSeek series)—to rigorously assess the universality of our findings on the characteristic length $L_{0}$ and failure patterns. Furthermore, while seqBench effectively characterizes how reasoning performance degrades with logical depth (i.e., by determining $L_{0}$ ), two complementary research thrusts are crucial for understanding why. First, systematic investigation is needed to disentangle how $L_{0}$ is influenced by factors such as model architecture, scale (parameters, training data, compute), fine-tuning strategies, and inference-time computation (e.g., chain-of-thought depth). Second, deeper analysis is required to explain the precise mechanisms underlying the observed exponential performance collapse characterized by $L_{0}$ and to account for other non-trivial error patterns, such as path-length dependent first errors. Additionally, the evaluation presented here does not consider how agentic systems capable of tool use perform as the reasoning complexity is tuned across various dimensions. Exploring such setups, where the LLM can externalize sub-problems, invoke tools, or backtrack programmatically, could provide valuable insights into whether the same exponential failure modes persist. In particular, one can define sequential problems where the degree of backtracking or sequential tool use can be systematically varied, and test whether similar performance drops emerge as the dependency chain grows. We highlight this as a promising direction for future research.
1. Impact of Prompting: Our current study employed standardized prompts and inference settings. A crucial next step is a robust sensitivity analysis to determine how the overall decay behavior is influenced by different prompting strategies (e.g., zero-shot vs. few-shot, decomposition techniques), varied decoding parameters (temperature, top-p), and interactive mechanisms such as self-verification or self-correction. Investigating the potential of these techniques to mitigate the observed sequential inference failures, particularly given seqBench ’s minimal search complexity, remains a key avenue for future research.
Addressing these points by leveraging frameworks like seqBench will be vital for developing LLMs with more robust and generalizable sequential reasoning capabilities, and for understanding their fundamental performance limits.
## 5 Conclusion
We introduced seqBench, a novel benchmark framework designed for the precise attribution of sequential reasoning failures in Large Language Models. seqBench ’s core strengths are threefold: fine-grained, independent control over fundamental complexity dimensions, most notably logical depth ( $L$ ), backtracking requirements, and noise ratio; provision of automatically verifiable solutions; and, critically, minimization of confounding factors like search complexity. This design allows seqBench to isolate and rigorously evaluate the sequential inference capabilities of LLMs, enabling the automatic quantification of fine-grained performance metrics (such as progress ratio) and providing a clear lens into mechanisms often obscured in other benchmarks. The framework’s inherent scalability and open-source nature position it as a durable tool for assessing and driving progress in current and future generations of models, ultimately aiming to enhance their utility for complex, real-world problems that often span multiple domains. Our comprehensive evaluations using seqBench reveal that reasoning accuracy consistently collapses exponentially with increasing logical depth across a diverse range of state-of-the-art LLMs. This collapse is characterized by a model-specific parameter $L_{0}$ (Section 2.2), indicating an inherent architectural bottleneck in maintaining coherent multi-step inference. By offering this precise analysis, seqBench provides a valuable resource in alignment with the goal of advancing NLP’s reach and fostering its responsible application in other fields. It encourages a shift beyond aggregate benchmark scores towards a more nuanced understanding of model capabilities, an essential step for rigorously assessing the true impact and potential risks of applying LLMs in new domains. 
The insights gleaned from seqBench can inform both NLP developers in building more robust models, and experts in other disciplines in setting realistic expectations and co-designing NLP solutions that are genuinely fit for purpose. Targeted improvements, guided by such fundamental understanding, are key to enhancing the robustness of sequential reasoning, making LLMs more reliable partners in interdisciplinary endeavors. Future work should leverage these insights to develop models that can overcome the observed performance cliffs and extend their effective reasoning horizons, thereby unlocking their transformative potential in diverse interdisciplinary applications—such as navigating complex scientific literature, supporting intricate legal analysis, or enabling robust multi-step planning in critical autonomous systems. Focusing on commonsense reasoning is paramount for NLP to achieve transformative societal impact, moving beyond incremental improvements to genuine breakthroughs.
## References
- csp (2008) 2008. Rina Dechter, Constraint Processing, Morgan Kaufmann (2003), ISBN 1-55860-890-7; Francesca Rossi, Peter van Beek, and Toby Walsh, editors, Handbook of Constraint Programming, Elsevier (2006), ISBN 978-0-444-52726-4. Computer Science Review, 2:123–130.
- Anthropic (2025) Anthropic. 2025. Claude 3.7 sonnet. https://www.anthropic.com/news/claude-3-7-sonnet.
- Berglund et al. (2024) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2024. The reversal curse: Llms trained on "a is b" fail to learn "b is a". Preprint, arXiv:2309.12288.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Carroll and Ruppert (2017) Raymond J Carroll and David Ruppert. 2017. Transformation and weighting in regression. Chapman and Hall/CRC.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. Preprint, arXiv:1803.05457.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
- Du et al. (2021) Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, and 8 others. 2021. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Preprint, arXiv:1903.00161.
- Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.
- Google DeepMind (2025) Google DeepMind. 2025. Gemini 2.5 pro experimental. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/.
- Han et al. (2024) Pengrui Han, Peiyang Song, Haofei Yu, and Jiaxuan You. 2024. In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models. Preprint, arXiv:2409.15454.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. Training compute-optimal large language models. Preprint, arXiv:2203.15556.
- Kleinberg and Tardos (2006) Jon Kleinberg and Eva Tardos. 2006. Algorithm Design. Pearson/Addison-Wesley, Boston.
- Kojima et al. (2022) Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
- Kuratov et al. (2024) Yury Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. 2024. Babilong: Testing the limits of llms with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554.
- Lieber et al. (2021) Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. https://www.ai21.com/blog/jurassic-1-technical-details-and-evaluation. White Paper.
- Lin et al. (2025) Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. 2025. Zebralogic: On the scaling limits of llms for logical reasoning. Preprint, arXiv:2502.01100.
- Meta AI (2025) Meta AI. 2025. Llama 4: Open and efficient multimodal language models. https://github.com/meta-llama/llama-models.
- Mirzaee et al. (2021) Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. 2021. Spartqa: A textual question answering benchmark for spatial reasoning. Preprint, arXiv:2104.05832.
- Mirzaee and Kordjamshidi (2022) Roshanak Mirzaee and Parisa Kordjamshidi. 2022. Transfer learning with synthetic corpora for spatial role labeling and reasoning. Preprint, arXiv:2210.16952.
- Mistral AI (2024) Mistral AI. 2024. Mistral large 2. https://mistral.ai/news/mistral-large-2407.
- Nezhurina et al. (2025) Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. 2025. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. Preprint, arXiv:2406.02061.
- OpenAI (2025) OpenAI. 2025. Openai gpt-5, o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, https://openai.com/index/introducing-gpt-5/. The paper’s supplementary material (appendix) was revised after the GPT-5 release with a new figure showing that GPT-5 exhibits the same failure pattern observed in this paper.
- Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, and 58 others. 2021. Scaling language models: Methods, analysis & insights from training Gopher. Preprint, arXiv:2112.11446.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. Preprint, arXiv:2311.12022.
- Sempolinski (2009) Peter Sempolinski. 2009. Automatic solutions of logic puzzles.
- Sharma (2024) Manasi Sharma. 2024. Exploring and improving the spatial reasoning abilities of large language models. In I Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation Models.
- Shi et al. (2022) Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. 2022. Stepgame: A new benchmark for robust multi-hop spatial reasoning in texts. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 11321–11329.
- Smith et al. (2022) Samuel Smith, Mostofa Patwary, Brian Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhenhao Liu, Shrimai Prabhumoye, Georgios Zerveas, Vikas Korthikanti, Eric Zhang, Rewon Child, Reza Yazdani Aminabadi, Jared Bernauer, Xia Song, Mohammad Shoeybi, Yuxin He, Michael Houston, Shishir Tiwary, and Bryan Catanzaro. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. Preprint, arXiv:2201.11990.
- Sprague et al. (2024) Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. 2024. Musr: Testing the limits of chain-of-thought with multistep soft reasoning. Preprint, arXiv:2310.16049.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, and 432 others. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Preprint, arXiv:2206.04615.
- Tamari et al. (2021) Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson Liu, Reut Tsarfaty, and Dafna Shahaf. 2021. Dyna-babi: unlocking babi’s potential with dynamic synthetic benchmarking. Preprint, arXiv:2112.00086.
- Tang and Kejriwal (2025) Zhisheng Tang and Mayank Kejriwal. 2025. Grasp: A grid-based benchmark for evaluating commonsense spatial reasoning. Preprint, arXiv:2407.01892.
- Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yi Du, Yanping Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Max Krikun, Dmitry Lepikhin, James Qin, and 38 others. 2022. Lamda: Language models for dialog applications. Preprint, arXiv:2201.08239.
- Tikhonov (2024) Alexey Tikhonov. 2024. Plugh: A benchmark for spatial understanding and reasoning in large language models. Preprint, arXiv:2408.04648.
- Tyagi et al. (2024) Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, and Chitta Baral. 2024. Step-by-step reasoning to solve grid puzzles: Where do llms falter? Preprint, arXiv:2407.14790.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
- Weston et al. (2015) Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards ai-complete question answering: A set of prerequisite toy tasks. Preprint, arXiv:1502.05698.
- Yang et al. (2019) Kaiyu Yang, Olga Russakovsky, and Jia Deng. 2019. SpatialSense: An adversarially crowdsourced benchmark for spatial relation recognition. In International Conference on Computer Vision (ICCV).
- Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. St-moe: Designing stable and transferable sparse expert models. Preprint, arXiv:2202.08906.
## Appendices
## Appendix A Dataset Generation Details
The seqBench benchmark generates pathfinding tasks by systematically controlling several complexity dimensions. As described in Section 1 (main paper), Algorithm 1 is central to this process. This appendix provides further details on the generation phases, natural language encoding of tasks, and specific dataset parameters.
### A.1 Generation Phases
The generation process, guided by Algorithm 1, involves three main phases:
1. Base Maze Construction: An initial $N\times M$ grid is populated, and an acyclic maze graph ( $M_{g}$ ) is formed using Kruskal’s algorithm (Kleinberg and Tardos, 2006). This ensures a simply connected environment where a unique path exists between any two cells if all internal "walls" (potential door locations) were open. The overall process results in maze instances like the one visualized in Figure 5.
1. Rewind Construction for Path Skeleton and Key/Door Placement: This phase implements the "Rewind Construction" (Algorithm 1 in the main paper). Starting from a randomly selected goal cell ( $C_{goal}$ ), the algorithm works backward to define a solvable path skeleton ( $\Pi_{S}$ ). It iteratively:
1. Selects a cell $c_{key}$ that would be a preceding point on a path towards the current cell $x$ (initially $C_{goal}$ ).
1. Identifies the unique path segment $\pi_{seg}$ in $M_{g}$ from $x$ to $c_{key}$ .
1. Randomly selects an edge $d$ on this segment $\pi_{seg}$ to become a locked door. This edge $d$ is added to the set of locked doors $\mathcal{D}_{L}$ .
1. A new unique key $K_{id}$ is conceptually placed at $c_{key}$ , and its information (which door it opens, its location) is stored in $\mathcal{K}_{I}$ .
1. The conceptual steps (moving along $\pi_{seg}$ , unlocking door $d$ with $K_{id}$ , picking up $K_{id}$ at $c_{key}$ ) are prepended (in reverse logical order) to the path skeleton $\Pi_{S}$ .
1. The current cell $x$ is updated to $c_{key}$ , and the process repeats until the target number of backtracks ( $\mathcal{B}$ ) is achieved or no valid placements remain.
This backward construction ensures solvability and controlled backtracking complexity. The final agent starting position is the cell $x$ at the end of this phase.
1. Fact Compilation and Noise Injection: Based on the final maze structure ( $M_{g},\mathcal{D}_{L},\mathcal{K}_{I}$ ), a set of natural language facts $\mathcal{F}$ is compiled. This includes facts describing room connections, key locations, and door states. Distracting facts are then introduced based on the target noise ratio $\mathcal{N}$ . These distractors might describe non-existent connections, spurious keys, or misleading adjacencies, chosen to be plausible yet incorrect.
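The backward pass of the Rewind Construction phase above can be sketched in a few dozen lines. This is an illustrative simplification, not the released generator: it assumes the acyclic maze graph is given as an adjacency dictionary, uses BFS to recover the unique tree path, and omits the validity checks (e.g., that $c_{key}$ is a preceding point on a path toward $x$) that Algorithm 1 enforces.

```python
import random
from collections import deque

def tree_path(adj, start, end):
    """Unique path between two cells in an acyclic maze graph (BFS on a tree)."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == end:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def rewind_construct(adj, goal, n_backtracks, seed=0):
    """Work backward from the goal: each iteration locks one door on the
    segment to a key cell and places that door's key there (sketch)."""
    rng = random.Random(seed)
    locked, keys = [], {}
    x = goal                                   # current cell, initially C_goal
    for key_id in range(n_backtracks):
        c_key = rng.choice([c for c in adj if c != x])
        seg = tree_path(adj, x, c_key)         # unique segment pi_seg in M_g
        i = rng.randrange(len(seg) - 1)
        door = (seg[i], seg[i + 1])            # edge d on pi_seg becomes locked
        locked.append(door)
        keys[key_id] = {"opens": door, "at": c_key}
        x = c_key                              # the agent must pass here first
    return x, locked, keys                     # x is the final start cell
```

By construction, every locked door sits between the start cell and the key that opens it, so the instance remains solvable with exactly the intended amount of backtracking.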
Figure 5: Example visualization of a $6\times 6$ seqBench maze instance. Red rectangles denote locked doors, dashed lines indicate the locations of keys corresponding to those doors, and triangles mark the start (upward-pointing) and goal (downward-pointing) positions. This illustrates the spatial nature of the tasks.
### A.2 Natural Language Encoding
Each task instance is translated into a set of atomic natural language facts. We use a consistent templating approach:
- Room Connections: "Room A1 and B1 are connected by an open door."
- Locked Connections: "Room C3 and D3 are connected by a closed and locked door."
- Key Requirements: "The locked door between C3 and D3 requires key 5." (Key IDs are simple integers).
- Key Placements: "Key 5 is in room E4." (Room IDs use spreadsheet-like notation, e.g., A1, B2).
- Starting Position: "Bob is in room A2."
- Goal Position: "Alice is in room D5."
The full set of facts for a given problem constitutes its description.
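The templating step, including the noise injection described in Appendix A.1, can be sketched as follows. The fact schema (tuple-based connections, a `keys` mapping, `Bob`/`Alice` roles) is a hypothetical simplification for illustration; the released pipeline also varies distractor types beyond spurious connections.

```python
import random

def compile_facts(connections, locked, keys, start, goal,
                  noise_ratio=0.0, seed=0):
    """Render a maze instance as atomic natural-language facts using the
    templates above, then append distractors per the target noise ratio."""
    rng = random.Random(seed)
    facts = [f"Room {a} and {b} are connected by an open door."
             for a, b in connections if (a, b) not in locked]
    for a, b in locked:
        facts.append(f"Room {a} and {b} are connected by a closed and locked door.")
    for key_id, (door, room) in keys.items():
        a, b = door
        facts.append(f"The locked door between {a} and {b} requires key {key_id}.")
        facts.append(f"Key {key_id} is in room {room}.")
    facts.append(f"Bob is in room {start}.")
    facts.append(f"Alice is in room {goal}.")
    # Noise injection (simplified): plausible but false connections, up to
    # the target distractor-to-supporting ratio.
    rooms = sorted({r for a, b in connections for r in (a, b)})
    for _ in range(int(noise_ratio * len(facts))):
        a, b = rng.sample(rooms, 2)
        if (a, b) not in connections and (b, a) not in connections:
            facts.append(f"Room {a} and {b} are connected by an open door.")
    rng.shuffle(facts)                 # facts are presented in random order
    return facts
```

Shuffling at the end ensures the fact order carries no positional hint about the solution path.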
### A.3 Dataset Parameters and Scope
The seqBench dataset was generated using the following parameter ranges based on the generation configuration:
- Grid Sizes ( $N\times M$ ): $N$ and $M$ range from 5 to 50 (e.g., [5,5], [6,6], …, [50,50]), with $M=N$ for all configurations.
- Target Backtracking Steps ( $\mathcal{B}$ ): Values from 0 to 7. This controls the number of key-door mechanisms deliberately placed on the optimal path.
- Noise Ratio ( $\mathcal{N}$ ): Values from $0.0$ (no distracting facts) to $1.0$ (equal number of supporting and distracting facts), typically in increments of $0.2$ .
- Instances per Configuration: For each primary configuration, defined by a specific grid size ( $N,M$ ) and a specific target backtracking step count ( $\mathcal{B}\in\{0..7\}$ ), 400 unique base maze instances were generated.
- Logical Depth ( $L$ ): As an emergent property, $L$ varies. Experiments typically select problems from these generated instances that fall into specific $L$ bins (e.g., $L\in[10,11),[11,12),\ldots$ ).
This generation pipeline, leveraging the described parameter ranges and variations, can produce a vast and diverse set of problem instances. The publicly released seqBench dataset, used for the analyses in this paper (see main paper for access link), comprises 7,079 such curated instances. This collection offers a rich resource for studying the combined effects of the controlled complexity dimensions.
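The configuration grid described above can be sketched as follows; variable names are ours, not the released pipeline's, and the instance counts refer to base mazes before curation:

```python
from itertools import product

# Parameter ranges from the generation configuration (Appendix A.3)
grid_sizes = [(n, n) for n in range(5, 51)]            # N = M, 5..50
backtracking = range(0, 8)                             # B in {0, ..., 7}
noise_ratios = [round(0.2 * i, 1) for i in range(6)]   # 0.0, 0.2, ..., 1.0

# A primary configuration is a (grid size, backtracking count) pair
configs = list(product(grid_sizes, backtracking))

# 400 unique base maze instances are generated per primary configuration
total_base_instances = len(configs) * 400
```

The publicly released dataset is a curated subset (7,079 instances) of this much larger pool.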
## Appendix B Prompt Design and Model Configuration Details
This appendix provides the complete details of the prompt structure and model configurations used for evaluating LLMs on the seqBench benchmark. The overall prompt, illustrated in Figure 6, concatenates four main components which are detailed below.
<details>
<summary>figs/prompt_template.png Details</summary>

### Visual Description
The prompt template is laid out as four content panels in a 2x2 grid, each with a rotated label on its outer edge: Task Description (top-left), Reasoning Guidance (top-right), Examples (bottom-left), and Problem Facts (bottom-right). The verbatim contents of each panel follow.
**Top-Left Panel: Task Description**
* **Introductory Statement:** "You are a problem solving agent that thinks carefully step by step based on provided facts and follows instructions closely."
* **TASK:** "Help Bob navigate through a maze of connected rooms to rescue Alice. Bob starts in a specified room and needs to find the optimal path to reach Alice's location, following the maze's rules about room connections and door locks."
* **MAZE DESCRIPTION CONTAINS:**
1. "Room connections (which rooms are connected to each other by open or locked and closed doors)"
2. "Door information (open or locked)"
3. "Key information (where they are located and which doors they unlock)"
4. "Starting location: Where Bob is at the start"
5. "Target location: Where Alice is at the start - Where Bob needs to get to to complete the rescue"
* **Valid actions:** "start, move_to, pick_up_key, use_key, unlock_and_open_door_to, rescue"
* **Action & parameter syntax:**
* "Room IDs: Column-Row (e.g., 'A1')"
* "Key IDs: positive integers (e.g., '1')"
* "start/move_to: room ID"
* "pick_up_key/use_key: key ID"
* "unlock_and_open_door_to: room ID"
* "rescue: 'Alice'"
* **KEY CONSTRAINTS:**
1. "Each move must be between adjacent and connected rooms"
2. "Keys must be picked up before use"
3. "Locked doors require use of their specific key to unlock"
4. "Optimal path minimizes actions/distance"
5. "use_key action always come right before unlock_and_open_door_to"
6. "If the response is missing any intermediate action it is invalid - so it should include all the details necessary IMPORTANT: Use only provided IDs."
* **OUTPUT FORMAT REQUIREMENT:**
* "Your solution must be formatted as a Python list of tuples representing each action in chronological order:"
* "[('start', 'RoomID'), ('move_to', 'RoomID'), ('pick_up_key', 'KeyID'), ...]"
* "Example format: [('start', 'A1'), ('move_to', 'B1'), ('pick_up_key', '3'), ('use_key', '3'), ('unlock_and_open_door_to', 'C1'), ('rescue', 'Alice')]"
**Top-Right Panel: Reasoning Guidance**
* **TO COMPLETE THIS TASK FOLLOW THESE STEPS:**
1. "Find the shortest path from Bob to Alice."
2. "Identify any locked doors on this path."
3. "For each locked door, find its required key."
4. "Plan key collection order to ensure you have each key before reaching its door."
5. "Track all actions while following the rules"
6. "Avoid unnecessary steps that increase the total path length."
* **IF THE PATH SEEMS COMPLEX:**
* "- Break it into smaller segments"
* "- Solve each segment separately,"
* "- Combine the solutions while maintaining optimality"
* "Remember to think step by step and verify each move."
* "Proceed to provide your solution as a list of tuples in chronological order."
**Bottom-Left Panel: Examples**
* **EXAMPLES:**
* **INPUT:**
* **FACTS:** "Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by an open door. Room D5 and E5 are connected by an open door. Room A2 and A1 are connected by an open door. Room A3 and B3 are connected by an open door. Room A1 and B1 are connected by an open door. Room A4 and A3 are connected by an open door. Room E5 and E4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Room D4 and E4 are connected by an open door. Bob is in room D5. Alice is in room C4."
* **OUTPUT:**
* "[('start', 'D5'), ('move_to', 'E5'), ('move_to', 'E4'), ('move_to', 'D4'), ('move_to', 'D3'), ('move_to', 'C3'), ('move_to', 'C4'), ('rescue', 'Alice')]"
* "END OF EXAMPLES"
**Bottom-Right Panel: Problem Facts**
* **PROBLEM:**
* **FACTS:** "Room A6 and A5 are connected by an open door. Room A6 and B6 are connected by an open door. Room B6 and C6 are connected by an open door. Room C6 and D6 are connected by an open door. Room C5 and C4 are connected by an open door. Room C4 and D4 are connected by an open door. Room D6 and D5 are connected by a closed and locked door. The locked door between D6 and D5 requires key 10. Key 10 is in room A5. Room D6 and E6 are connected by an open door. Room D5 and D4 are connected by an open door. Room E6 and F6 are connected by an open door. Room A4 and A3 are connected by an open door. Bob is in room F6. Alice is in room C5."
* **YOUR SOLUTION:** (This area is blank, awaiting a solution)
</details>
Figure 6: The complete prompt structure passed to the LLMs. This includes: Component 1 (System Instructions and Task Definition), one of the three Few-Shot Examples (Component 2, specifically a simple navigation task), Component 3 (Reasoning Guidance), and an illustration of where the Problem Instance Facts (Component 4) are inserted. For clarity and completeness, the full verbatim text for all three few-shot examples (Component 2) is provided in Figure 7.
### B.1 Overall Prompt Components
The prompt presented to the LLMs consists of the following components:
1. System Instructions and Task Definition (Component 1): Outlines the agent’s task, the structure of the maze description, valid actions and their syntax, key operational constraints, and the required output format.
1. Few-Shot Examples (Component 2): Three examples are provided to illustrate the task, ranging in complexity. One of these examples (a simple navigation task) is detailed in Figure 6. The verbatim text for all three examples is provided in Figure 7 for completeness.
1. Reasoning Guidance and Self-Assessment (Component 3): Offers step-by-step algorithmic tips for solving the task and requests the model to provide a self-assessment of its confidence and the perceived difficulty of the instance.
1. Problem Instance Facts (Component 4): The specific natural language facts describing the current maze configuration for the task instance. As illustrated in Figure 6, these facts are appended after the preceding components and are followed by the line "YOUR SOLUTION:" to prompt the model. These facts are generated using the templates described in Appendix A.
The three few-shot examples (Component 2) are reproduced below:
1. Example 1 (Simple Navigation): This example, as shown in Figure 6, involves navigating a maze with only open doors.
EXAMPLE:
INPUT:
Maze Structure: Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by an open door. Room D5 and E5 are connected by an open door. Room A2 and A1 are connected by an open door. Room A3 and B3 are connected by an open door. Room A1 and B1 are connected by an open door. Room A4 and A3 are connected by an open door. Room E5 and E4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Room D4 and E4 are connected by an open door. Bob is in room D5. Alice is in room C4.
OUTPUT:
Solution: [('start', 'D5'), ('move_to', 'E5'), ('move_to', 'E4'), ('move_to', 'D4'), ('move_to', 'D3'), ('move_to', 'C3'), ('move_to', 'C4'), ('rescue', 'Alice')]
1. Example 2 (Single-Key Backtracking): This example introduces a single locked door and a corresponding key.
EXAMPLE:
INPUT:
Maze Structure: Room A1 and A2 are connected by an open door. Room A2 and B2 are connected by an open door. Room B1 and B2 are connected by an open door. Room B1 and C1 are connected by an open door. Room C1 and C2 are connected by a closed and locked door. Door between C1 and C2 requires key 1. Key 1 is in room A2. Bob is in room A1. Alice is in room C2.
OUTPUT:
Solution: [('start', 'A1'), ('move_to', 'A2'), ('pick_up_key', '1'), ('move_to', 'B2'), ('move_to', 'B1'), ('move_to', 'C1'), ('use_key', '1'), ('unlock_and_open_door_to', 'C2'), ('move_to', 'C2'), ('rescue', 'Alice')]
1. Example 3 (Multi-Key Backtracking): This example presents a more complex scenario with multiple locked doors and keys, requiring more extensive backtracking.
EXAMPLE:
INPUT:
Maze Structure: Room B5 and B4 are connected by a closed and locked door. The locked door between B5 and B4 requires key 3. Key 3 is in room B5. Room B5 and C5 are connected by a closed and locked door. The locked door between B5 and C5 requires key 16. Key 16 is in room C5. Room B4 and C4 are connected by an open door. Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by a closed and locked door. The locked door between C3 and D3 requires key 10. Key 10 is in room C4. Room D5 and D4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Bob is in room C5. Alice is in room D5.
OUTPUT:
Solution: [('start', 'C5'), ('pick_up_key', '16'), ('use_key', '16'), ('unlock_and_open_door_to', 'B5'), ('move_to', 'B5'), ('pick_up_key', '3'), ('use_key', '3'), ('unlock_and_open_door_to', 'B4'), ('move_to', 'B4'), ('move_to', 'C4'), ('pick_up_key', '10'), ('move_to', 'C3'), ('use_key', '10'), ('unlock_and_open_door_to', 'D3'), ('move_to', 'D3'), ('move_to', 'D4'), ('move_to', 'D5'), ('rescue', 'Alice')]
Figure 7: Few-shot examples provided to guide the LLMs in the maze-solving task. These examples demonstrate simple navigation, single-key backtracking, and multi-key backtracking scenarios. The three examples illustrate increasing levels of complexity.
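Assembling the four prompt components can be sketched as simple string concatenation. This is a minimal sketch with placeholder strings; the function name is hypothetical, and the real component texts are those shown in Figures 6 and 7:

```python
def build_prompt(system_instructions, few_shot_examples,
                 reasoning_guidance, problem_facts):
    """Concatenate the four prompt components in order (illustrative sketch)."""
    parts = [
        system_instructions,                   # Component 1
        "\n\n".join(few_shot_examples),        # Component 2: three examples
        reasoning_guidance,                    # Component 3
        "PROBLEM:\nFACTS: " + problem_facts,   # Component 4
        "YOUR SOLUTION:",                      # trailing cue for the model
    ]
    return "\n\n".join(parts)

prompt = build_prompt(
    "You are a problem solving agent ...",     # placeholder texts
    ["EXAMPLE 1 ...", "EXAMPLE 2 ...", "EXAMPLE 3 ..."],
    "TO COMPLETE THIS TASK FOLLOW THESE STEPS: ...",
    "Bob is in room F6. Alice is in room C5.",
)
```

The trailing "YOUR SOLUTION:" line matches the cue that follows the problem facts in Figure 6.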
### B.2 Evaluation Metrics and Error Analysis Details
This section provides further details on our evaluation metrics and observed error categories, complementing the overview of metrics in Section 1 and the discussion of failure modes in Section 2 of the main paper.
#### Observed Violation Categories.
Failures in model solutions on seqBench tasks can be categorized into several types. Understanding these categories is crucial for interpreting model performance and failure modes. Key types of violations observed include:
- Adjacency errors (e.g., attempting to move between unconnected rooms).
- Locked door errors (e.g., navigating through locked doors without the correct key or without unlocking them).
- Key usage errors (e.g., attempting to use keys not yet collected, or using the wrong key for a door).
- Path inefficiency (e.g., taking unnecessary detours or redundant actions; while not always a hard violation that stops progress, this contributes to solutions not matching the optimal path and thus failing Pass@1).
- Missed critical actions (e.g., failing to pick up a necessary key or unlock a required door). This is a key failure mode discussed in the main paper (Section 2.4) and is often reflected in metrics like low recall or a low progress ratio if the omission occurs early and prevents further correct steps.
Identifying these distinct categories of errors provides a more granular understanding of why models fail on sequential reasoning tasks and helps in the interpretation of aggregate performance metrics reported in the main paper.
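A checker for these violation categories can be sketched as follows. The data structures, category labels, and rules below are illustrative assumptions, not the benchmark's actual validator:

```python
def first_violation(actions, open_doors, locked_doors, key_rooms, start):
    """Return (step_index, category) of the first violation, or None.
    Illustrative sketch covering adjacency, locked-door, and key-usage errors."""
    doors = {frozenset(d) for d in open_doors}
    locks = {frozenset(d): k for d, k in locked_doors.items()}
    room, held, unlocked = start, set(), set()
    for i, (act, arg) in enumerate(actions):
        if act in ("start", "rescue"):
            continue
        edge = frozenset((room, arg))
        if act == "move_to":
            if edge in locks and edge not in unlocked:
                return i, "locked_door"      # moving through a still-locked door
            if edge not in doors and edge not in locks:
                return i, "adjacency"        # rooms are not connected at all
            room = arg
        elif act == "pick_up_key":
            if key_rooms.get(arg) != room:
                return i, "key_usage"        # key is not in the current room
            held.add(arg)
        elif act == "use_key":
            if arg not in held:
                return i, "key_usage"        # key not yet collected
        elif act == "unlock_and_open_door_to":
            if locks.get(edge) is None or locks[edge] not in held:
                return i, "key_usage"        # no such lock, or wrong/missing key
            unlocked.add(edge)
    return None

maze = dict(
    open_doors=[("A1", "A2")],
    locked_doors={("A2", "B2"): "1"},
    key_rooms={"1": "A2"},
    start="A1",
)
good = [("start", "A1"), ("move_to", "A2"), ("pick_up_key", "1"),
        ("use_key", "1"), ("unlock_and_open_door_to", "B2"),
        ("move_to", "B2"), ("rescue", "Alice")]
```

Path inefficiency and missed critical actions are detected separately, by comparing the validated sequence against the known optimal path.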
### B.3 Violation Map: Qualitative Examples of Model Failures
This section provides qualitative examples of characteristic model failures to illustrate common error types. These examples visually support the discussion of failure modes in the main paper (Section 2.4, "A Key Failure Mode: Omission of Critical Steps"). Figure 8 illustrates a significant error by Gemini-2.5-Flash on a complex task: the model generates an illegal path that bypasses necessary steps and locked doors, exemplifying a breakdown in multi-step planning. Figure 9 shows another common 'adjacency error', where a model attempts to jump between unconnected rooms. This type of error reveals a critical lapse in grounding generated actions in the spatial adjacencies explicitly stated by the task's input facts.
Figure 8: Illustrative failure case for Gemini-2.5-Flash on a $40\times 40$ task with 2 locked doors on the optimal path. Left: optimal path (yellow). Right: the model's generated path, showing an illegal adjacency jump (red arrow) that bypasses multiple rooms and a locked door, despite only supporting facts being provided. This highlights a breakdown in multi-step planning.
Figure 9: Illustrative failure case of an 'adjacency error' in model-generated pathfinding on a 20×20 task with 2 locked doors on the optimal path. The left panel displays the optimal path (yellow) to the target (triangle). The right panel shows a suboptimal path (purple) generated by the model. This example highlights a common error where, after a sequence of actions (in this scenario, following a key acquisition), the model fails to navigate through valid connections and instead attempts to 'jump' directly between two unconnected rooms. Such violations of room-adjacency constraints are a key challenge for model performance.
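The adjacency check underlying this error category can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code; the graph representation and function name are hypothetical:

```python
def first_adjacency_violation(path, adjacency):
    """Return the index of the first step that moves between
    unconnected rooms, or None if every transition is valid.

    `path` is a sequence of room identifiers; `adjacency` maps each
    room to the set of rooms directly connected to it.
    """
    for i in range(len(path) - 1):
        if path[i + 1] not in adjacency.get(path[i], set()):
            return i + 1  # the step that 'jumps' to a non-adjacent room
    return None

# Tiny example: rooms A-B-C in a line; D is disconnected.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
assert first_adjacency_violation(["A", "B", "C"], adjacency) is None
assert first_adjacency_violation(["A", "B", "D"], adjacency) == 2
```

A validator of this form flags exactly the failure shown in Figure 9: the path is locally well-formed until the step that bypasses a valid connection.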
### B.4 Quantitative Analysis of Error Patterns
To understand how and when models begin to fail within a reasoning sequence, we analyze the distribution of the first violation step. We record the time step at which the initial violation occurs in a model’s generated path. Aggregating this step-indexed data across multiple instances allows us to create temporal distributions of errors. These distributions help determine whether errors tend to cluster early in the reasoning process (potentially indicating issues with initial planning or understanding of the overall problem complexity) or accumulate later (suggesting difficulties in maintaining long chains of inference or context). This analysis complements the discussion in the main paper (Section 2.4, "Path-Length Dependent First Errors: The Burden of Anticipated Complexity").
Figure 10 shows how the distribution of these first-error positions shifts with the overall problem complexity, represented by logical depth ( $L$ ). As detailed in the main paper, an increase in $L$ tends to cause errors to occur earlier in the reasoning chain.
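The aggregation described above can be sketched in a few lines. This is an illustrative reconstruction under assumed interfaces (the `is_valid_step` predicate and data layout are hypothetical), not the authors' pipeline:

```python
from collections import Counter

def first_violation_histogram(paths, is_valid_step):
    """Aggregate, across problem instances, the step index at which
    each model-generated path first violates an environment constraint.

    Returns a Counter mapping step index -> number of instances whose
    first violation occurred at that step.
    """
    counts = Counter()
    for path in paths:
        for step in range(1, len(path)):
            if not is_valid_step(path[step - 1], path[step]):
                counts[step] += 1
                break  # only the *first* violation is recorded
    return counts

# Example with integer rooms where only unit moves are legal.
is_valid_step = lambda a, b: abs(a - b) == 1
hist = first_violation_histogram([[0, 1, 2, 5], [0, 3]], is_valid_step)
assert hist == {3: 1, 1: 1}
```

Plotting such counters for each logical depth $L$ yields the step-indexed distributions shown in Figure 10.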
<details>
<summary>figs/failure_step_dist_vs_L.png Details</summary>

### Visual Description
## Chart Type: Series of Histograms - Distribution of Max Progress Step
### Overview
This image presents a series of eight vertically stacked histograms or bar charts. Each chart illustrates the distribution of "max progress step" values for a specific "Solution steps" count. The "Solution steps" values are provided as labels on the left side of each individual chart, ranging from 20 to 300. All charts share a common horizontal axis labeled "max progress step" at the bottom. The vertical axis, though unlabeled, implicitly represents frequency or count.
### Components/Axes
* **Overall Structure**: Eight individual charts are arranged vertically, separated by thin horizontal lines.
* **X-axis (Common to all charts)**:
* **Title**: "max progress step" (positioned centrally below the lowest chart).
* **Markers**: Numerical labels are present at 0, 50, 100, 150, 200, 250, and 300. Minor tick marks are visible at intervals of approximately 10 units.
* **Y-axis (Implicit for each chart)**:
* **Title**: Not explicitly labeled, but represents frequency or count of occurrences.
* **Markers**: No numerical markers are present, but the height of the vertical bars indicates relative frequency.
* **Chart Labels (Left side of each chart)**:
* "Solution steps: 20" (topmost chart)
* "Solution steps: 60"
* "Solution steps: 100"
* "Solution steps: 140"
* "Solution steps: 180"
* "Solution steps: 220"
* "Solution steps: 260"
* "Solution steps: 300" (bottommost chart)
* **Legend**: No legend is present. All data series are represented by black vertical bars.
### Detailed Analysis
Each chart displays a distribution of vertical bars along the "max progress step" axis. The height of the bars indicates the frequency of a particular "max progress step" value.
1. **Solution steps: 20**
* **Trend**: The distribution is heavily concentrated at the very low end of the "max progress step" axis.
* **Data Points**: Bars are visible from approximately 0 to 10. The tallest bars are very close to 0, likely between 0 and 5. There is no prominent peak around x=20.
2. **Solution steps: 60**
* **Trend**: This distribution shows a bimodal or highly skewed pattern with a dominant peak.
* **Data Points**: A cluster of very small bars is present near 0 (approximately 0-10). The most prominent feature is a very tall bar located precisely at x=60. A few very small bars extend beyond 60, up to approximately 100.
3. **Solution steps: 100**
* **Trend**: Similar to the previous chart, with a dominant peak shifted to the right.
* **Data Points**: Small bars are visible near 0 (approximately 0-10). A very tall bar is located at x=100. A few very small bars extend beyond 100, up to approximately 150.
4. **Solution steps: 140**
* **Trend**: Maintains the pattern of a dominant peak corresponding to the "Solution steps" value.
* **Data Points**: Small bars are visible near 0 (approximately 0-10). A very tall bar is located at x=140. A few very small bars extend beyond 140, up to approximately 190.
5. **Solution steps: 180**
* **Trend**: The dominant peak continues to shift rightward with increasing "Solution steps".
* **Data Points**: Small bars are visible near 0 (approximately 0-10). A very tall bar is located at x=180. A few very small bars extend beyond 180, up to approximately 230.
6. **Solution steps: 220**
* **Trend**: The pattern of a primary peak at the "Solution steps" value persists.
* **Data Points**: Small bars are visible near 0 (approximately 0-10). A very tall bar is located at x=220. A few very small bars extend beyond 220, up to approximately 270.
7. **Solution steps: 260**
* **Trend**: The distribution continues to show a strong peak at the corresponding "Solution steps" value.
* **Data Points**: Small bars are visible near 0 (approximately 0-10). A very tall bar is located at x=260. A few very small bars extend beyond 260, up to approximately 300.
8. **Solution steps: 300**
* **Trend**: The dominant peak is now at the maximum value of the x-axis.
* **Data Points**: Small bars are visible near 0 (approximately 0-10). A very tall bar is located at x=300. Due to the x-axis limit, it's unclear if the distribution extends further beyond 300.
### Key Observations
* **Consistent Primary Peak**: For "Solution steps" values of 60 and above, there is a very strong, singular peak in the distribution of "max progress step" that aligns precisely with the "Solution steps" value itself (e.g., for "Solution steps: 60", the peak is at "max progress step" = 60).
* **Initial Concentration**: All charts, regardless of the "Solution steps" value, show a small cluster of low-frequency bars very close to the origin (0-10) on the "max progress step" axis.
* **Rightward Shift**: As "Solution steps" increases, the dominant peak in the "max progress step" distribution shifts consistently to the right, maintaining its alignment with the "Solution steps" value.
* **Distribution Spread**: The spread of the distribution (the range over which small bars are visible) also tends to increase with higher "Solution steps" values, extending further to the right of the main peak.
* **Outlier Behavior**: The "Solution steps: 20" chart is an outlier. Unlike the others, it does not show a prominent peak at x=20. Instead, its distribution is entirely concentrated near 0, suggesting a different behavior or outcome for very low "Solution steps" counts.
### Interpretation
This series of histograms likely illustrates the performance or outcome of an iterative process, where "Solution steps" represents the number of iterations or steps taken to find a solution, and "max progress step" represents some measure of progress achieved.
The data suggests the following:
* **Direct Correlation (for Solution steps >= 60)**: For a sufficient number of "Solution steps" (60 or more), the process consistently achieves a "max progress step" that is directly proportional to, or even equal to, the number of "Solution steps" provided. This implies an efficient or deterministic process where more steps directly lead to more progress, up to a certain point.
* **Early Stage Behavior (Solution steps = 20)**: When the "Solution steps" are very low (e.g., 20), the process might not be able to make significant progress, or it might fail to reach a meaningful "max progress step". The concentration of "max progress step" values near 0 for "Solution steps: 20" suggests that the process either terminates very early or makes minimal progress in most cases when given only 20 steps. This could indicate a minimum threshold of "Solution steps" is required for the process to function effectively or reach its intended "progress step".
* **Residual Low Progress**: The persistent presence of small bars near "max progress step" = 0 across all charts could represent instances where the process failed immediately, encountered an error, or made no progress despite the allocated "Solution steps".
* **Potential for Over-shooting/Variability**: The small bars extending beyond the main peak (e.g., for "Solution steps: 60", bars up to 100) might indicate some variability or "over-shooting" in the "max progress step" achieved, or perhaps represent different scenarios where the process continues to make minor progress even after reaching the "Solution steps" count. However, the dominance of the main peak suggests these are less frequent occurrences.
In essence, the data demonstrates a strong relationship between the number of "Solution steps" and the "max progress step" achieved, with a clear threshold or different operational mode at very low "Solution steps" counts. The process appears to be highly effective and predictable once a certain number of steps are provided.
</details>
Figure 10: Distribution of first-violation steps for Gemini-2.5-Flash across varying logical depths ( $L$ ). As $L$ (total required path length) increases, the distribution of first errors tends to shift leftward, indicating that models are more likely to fail at earlier steps in longer problems. This suggests that anticipated global complexity impacts reasoning from the outset. Experimental parameters in this figure are the same as those in Figure 1.
Similarly, Figure 11 illustrates how the introduction of contextual noise (distracting facts) affects the point of failure. Increased noise also tends to precipitate earlier errors in the reasoning sequence, as discussed in the main paper in relation to sensitivity to noise (Section 2.3) and its impact on error patterns (Section 2.4).
<details>
<summary>figs/gemini-progress-ratio-vs-noise.png Details</summary>

### Visual Description
## Stacked Histograms: Distribution of Progress Ratio Across Varying Noise Ratios
### Overview
The image presents six individual histograms, stacked vertically, each depicting the distribution of a "progress ratio" under a specific "Noise ratio". The "Noise ratio" is a categorical variable ranging from 0.0 to 1.0, incrementing by 0.2. The horizontal axis, labeled "progress ratio", is shared across all plots and spans from 0.0 to 1.0. Each histogram primarily shows a bimodal distribution, with a significant concentration of values at `progress ratio = 1.0` and another cluster of values near `progress ratio = 0.0`. The color of the histogram bars and their baselines changes progressively from dark grey to reddish-brown as the "Noise ratio" increases.
### Components/Axes
* **Overall Structure:** Six distinct histogram plots are arranged vertically, each corresponding to a unique "Noise ratio" condition.
* **Horizontal Axis (X-axis):**
* **Title:** "progress ratio" (positioned centrally below the bottommost histogram).
* **Markers:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0. The axis represents a continuous range from 0.0 to 1.0.
* **Vertical Axis (Y-axis):**
* No explicit title or numerical markers are provided. The height of the bars implicitly represents the frequency or count of observations within each bin.
* **Categorical Labels (Left-aligned, positioned above each respective plot):**
* "Noise ratio: 0.0" (topmost plot)
* "Noise ratio: 0.2"
* "Noise ratio: 0.4"
* "Noise ratio: 0.6"
* "Noise ratio: 0.8"
* "Noise ratio: 1.0" (bottommost plot)
* **Color Scheme:** The visual representation of the histograms employs a color gradient for the bars and their baselines, correlating with the "Noise ratio":
* **Noise ratio 0.0:** Dark grey bars and baseline.
* **Noise ratio 0.2:** Medium grey-brown bars and baseline.
* **Noise ratio 0.4:** Light grey-brown bars and baseline.
* **Noise ratio 0.6:** Light brown bars and baseline.
* **Noise ratio 0.8:** Pale reddish-brown bars and baseline.
* **Noise ratio 1.0:** Reddish-brown bars and baseline.
### Detailed Analysis
Each histogram illustrates the distribution of "progress ratio" values, predominantly concentrated at the extremes of the 0.0 to 1.0 range.
1. **Noise ratio: 0.0** (Topmost plot, dark grey bars)
* **Trend:** The distribution is overwhelmingly concentrated at `progress ratio = 1.0`, represented by the tallest bar in the entire image. A very small cluster of bars is visible near `progress ratio = 0.0`, with the tallest of these appearing around `progress ratio = 0.02` to `0.04`, reaching approximately 5-10% of the height of the bar at 1.0. Frequencies for "progress ratio" values between approximately 0.1 and 0.9 are negligible, appearing as tiny, almost imperceptible lines along the baseline.
* **Approximate Relative Frequencies:** `progress ratio = 1.0` (highest frequency, ~1.0 relative); `progress ratio ≈ 0.02-0.04` (~0.05-0.1 relative frequency); other values (near zero).
2. **Noise ratio: 0.2** (Second plot from top, medium grey-brown bars)
* **Trend:** The distribution remains dominated by the bar at `progress ratio = 1.0`, though its relative height appears slightly reduced compared to the 0.0 case. The cluster of small bars near `progress ratio = 0.0` shows a slight increase in height and possibly a marginal spread. The tallest bar in this cluster, still around `progress ratio = 0.02` to `0.04`, now reaches approximately 10-15% of the height of the bar at 1.0.
* **Approximate Relative Frequencies:** `progress ratio = 1.0` (high frequency, slightly less than 0.0 case); `progress ratio ≈ 0.02-0.04` (~0.1-0.15 relative frequency); other values (near zero).
3. **Noise ratio: 0.4** (Third plot from top, light grey-brown bars)
* **Trend:** The bar at `progress ratio = 1.0` continues to be the most prominent, but its relative height has further decreased. The cluster of bars near `progress ratio = 0.0` shows a more noticeable increase in height and spread. The tallest bar in this cluster, still around `progress ratio = 0.02` to `0.04`, now reaches approximately 15-20% of the height of the bar at 1.0.
* **Approximate Relative Frequencies:** `progress ratio = 1.0` (decreasing high frequency); `progress ratio ≈ 0.02-0.04` (~0.15-0.2 relative frequency); other values (near zero).
4. **Noise ratio: 0.6** (Fourth plot from top, light brown bars)
* **Trend:** The bar at `progress ratio = 1.0` is still dominant but shows a further reduction in relative height. The cluster of bars near `progress ratio = 0.0` is now more pronounced, with several bars reaching significant heights. The tallest bar in this cluster, around `progress ratio = 0.02` to `0.04`, appears to be approximately 20-25% of the height of the bar at 1.0. The spread of these lower-end bars also seems to extend slightly further, possibly up to `progress ratio = 0.1`.
* **Approximate Relative Frequencies:** `progress ratio = 1.0` (further decreasing high frequency); `progress ratio ≈ 0.02-0.04` (~0.2-0.25 relative frequency); other values (near zero, with slight increase in spread near 0.0).
5. **Noise ratio: 0.8** (Fifth plot from top, pale reddish-brown bars)
* **Trend:** The bar at `progress ratio = 1.0` has significantly reduced in relative height, though it remains the single tallest bar in this specific plot. The cluster of bars near `progress ratio = 0.0` is now quite prominent, with multiple bars of varying heights. The tallest bar in this cluster, still around `progress ratio = 0.02` to `0.04`, is now roughly 30-40% of the height of the bar at 1.0. The spread of these bars extends more clearly up to `progress ratio = 0.1` and possibly very small frequencies up to `0.2`.
* **Approximate Relative Frequencies:** `progress ratio = 1.0` (significantly reduced high frequency); `progress ratio ≈ 0.02-0.04` (~0.3-0.4 relative frequency); `progress ratio ≈ 0.0-0.1` (more distributed frequencies).
6. **Noise ratio: 1.0** (Bottommost plot, reddish-brown bars)
* **Trend:** The bar at `progress ratio = 1.0` is at its lowest relative height among all plots, though it still represents a distinct peak. The cluster of bars near `progress ratio = 0.0` is now very prominent and spread out. The tallest bar in this cluster, still around `progress ratio = 0.02` to `0.04`, appears to be approximately 50-60% of the height of the bar at 1.0. The distribution near 0.0 is wider, with noticeable bars extending up to `progress ratio = 0.1` and very small bars up to `0.2` or `0.3`.
* **Approximate Relative Frequencies:** `progress ratio = 1.0` (lowest high frequency); `progress ratio ≈ 0.02-0.04` (~0.5-0.6 relative frequency); `progress ratio ≈ 0.0-0.1` (widest and most prominent distribution).
### Key Observations
* **Bimodal Nature:** All distributions are distinctly bimodal, with peaks at `progress ratio = 1.0` and a cluster of frequencies near `progress ratio = 0.0`.
* **Inverse Relationship with `progress ratio = 1.0`:** As the "Noise ratio" increases from 0.0 to 1.0, the relative frequency (height) of the bar at `progress ratio = 1.0` consistently decreases.
* **Direct Relationship with `progress ratio ≈ 0.0`:** Conversely, as the "Noise ratio" increases, the frequencies of "progress ratio" values near 0.0 (specifically around 0.02-0.04) increase in height and become more spread out, extending further along the x-axis towards 0.1 and beyond.
* **Shift in Dominance:** At low noise ratios (0.0, 0.2), the `progress ratio = 1.0` peak is overwhelmingly dominant. As noise increases, the cluster near `progress ratio = 0.0` gains significant prominence, reducing the relative dominance of the `progress ratio = 1.0` peak.
* **Color Progression:** The subtle color change of the plots from dark grey to reddish-brown visually reinforces the increasing "Noise ratio" parameter.
### Interpretation
This data likely illustrates the impact of increasing "Noise ratio" on a system's ability to achieve a desired "progress ratio". The "progress ratio" can be interpreted as a measure of task completion or success, where 1.0 represents full completion and values near 0.0 represent minimal or no progress.
* **Ideal Conditions (Noise ratio: 0.0):** With no noise, the system performs exceptionally well, almost always achieving full progress (`progress ratio = 1.0`). Only a negligible fraction of instances result in minimal progress. This suggests a highly efficient and robust system in an ideal environment.
* **Degradation with Noise:** As the "Noise ratio" increases, the system's performance degrades. The probability of achieving full progress (`progress ratio = 1.0`) steadily declines, while the probability of making minimal or no progress (`progress ratio ≈ 0.0`) increases. This indicates that noise directly interferes with the system's operational success.
* **Impact on Failures:** The broadening of the distribution near `progress ratio = 0.0` with increasing noise suggests that failures are not always absolute (i.e., exactly 0.0 progress) but can manifest as a range of very low progress values. This implies that noise can cause partial failures or significant setbacks rather than just complete halts.
* **Resilience at High Noise:** Even at the highest "Noise ratio" (1.0), achieving full progress (`progress ratio = 1.0`) remains the single most frequent outcome, albeit with a significantly reduced proportion compared to noiseless conditions. This could suggest a degree of inherent resilience or a mechanism that still allows for successful completion in a substantial number of cases, even under extreme noise, although a large proportion of attempts now result in minimal progress.
In summary, the data clearly demonstrates that increasing noise negatively impacts the system's ability to achieve full progress, shifting outcomes towards minimal progress, while still maintaining a notable, though diminished, capacity for complete success even under high noise conditions.
</details>
Figure 11: Impact of increasing noise ratio on the distribution of failure steps for Gemini 2.5 Flash. As noise (proportion of distracting facts) increases, failures tend to occur earlier in the reasoning chain. This reflects increased difficulty in isolating relevant information and maintaining focus. Fixed experimental parameters in this figure are the same as those in Figure 1.
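The progress ratio shown in Figure 11 can be computed with one plausible definition, namely the fraction of the optimal path correctly reproduced before the first deviation; the paper's exact metric may differ, so treat this as a sketch:

```python
def progress_ratio(model_path, optimal_path):
    """Fraction of the optimal path matched before the first deviation
    (1.0 = full solution, values near 0 = almost immediate failure)."""
    matched = 0
    for m, o in zip(model_path, optimal_path):
        if m != o:
            break
        matched += 1
    return matched / len(optimal_path)

assert progress_ratio([0, 1, 2, 3], [0, 1, 2, 3]) == 1.0
assert progress_ratio([0, 1, 9, 9], [0, 1, 2, 3]) == 0.5
```

Under this definition the bimodal pattern in Figure 11 corresponds to runs that either complete the full path (ratio 1.0) or fail within the first few steps (ratio near 0).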
## Appendix C Supplementary Figures
This appendix provides supplementary figures that offer further visual support for analyses presented in the main paper. These figures illustrate the impact of various complexity dimensions and provide comparative views of model performance, elaborating on points made throughout Section 2 (Benchmarking Results) of the main paper.
Figure 12 details the performance of Llama-4 Maverick-17B-128E-Instruct under varying levels of noise and fact shuffling. This supports the discussion in the main paper (Section 2.3) on how these factors, especially in combination, affect success rates, with noise being the dominant factor.
<details>
<summary>figs/single_model_vs_steps_count_varied_noise_shuffle_Llama-4-Maverick-17B-128E-Instruct-FP8.png Details</summary>

### Visual Description
## Line Charts: Success Rate vs. Number of Actions under Varying Conditions
### Overview
The image displays two line charts side-by-side, illustrating the "success rate" as a function of "number of actions" under different conditions of "noise" and "shuffle". Both charts present the same data series but differ in their Y-axis scaling: the left chart uses a linear scale, while the right chart uses a logarithmic scale. The charts also include two dashed lines representing exponential decay fits with different characteristic lengths (L).
### Components/Axes
**Common Elements (Both Charts):**
* **X-axis Label**: "number of actions"
* **X-axis Tick Markers**: 10, 20, 30, 40, 50, 60, 70. The data points are not aligned with these major tick marks but are positioned approximately at X-values of 7, 15, 25, 35, 45, 55, and 65.
* **Legend**: Located in the top-right quadrant of each chart, listing six data series with their corresponding colors and line styles.
* **Blue solid line with circular markers**: `noise = 0, shuffle = 0`
* **Orange solid line with circular markers**: `noise = 0, shuffle = 0.5`
* **Green solid line with circular markers**: `noise = 0.2, shuffle = 0`
* **Red solid line with circular markers**: `noise = 0.2, shuffle = 0.5`
* **Purple dashed line**: `∝ exp(-x/L), L = 24` (where '∝' means "proportional to")
* **Brown dashed line**: `∝ exp(-x/L), L = 14`
**Left Chart Specifics:**
* **Y-axis Label**: "success rate"
* **Y-axis Scale**: Linear, ranging from 0.0 to 1.0.
* **Y-axis Tick Markers**: 0.2, 0.4, 0.6, 0.8, 1.0.
**Right Chart Specifics:**
* **Y-axis Label**: "success rate"
* **Y-axis Scale**: Logarithmic, ranging from 10^-2 (0.01) to 10^0 (1.0).
* **Y-axis Tick Markers**: 10^-2, 10^-1, 10^0.
### Detailed Analysis
**Left Chart (Linear Y-axis)**
All data series show a decreasing trend in "success rate" as the "number of actions" increases.
* **Blue line (noise = 0, shuffle = 0)**: This line starts at a high success rate and decreases steadily.
* Data points (approx. X, Y): (7, 0.95), (15, 0.68), (25, 0.48), (35, 0.30), (45, 0.18), (55, 0.10), (65, 0.05).
* **Orange line (noise = 0, shuffle = 0.5)**: This line closely follows the blue line but is slightly below it, indicating a slightly lower success rate for the same number of actions.
* Data points (approx. X, Y): (7, 0.95), (15, 0.65), (25, 0.42), (35, 0.28), (45, 0.16), (55, 0.08), (65, 0.04).
* **Green line (noise = 0.2, shuffle = 0)**: This line shows a more rapid decrease in success rate compared to the blue and orange lines.
* Data points (approx. X, Y): (7, 0.90), (15, 0.48), (25, 0.28), (35, 0.15), (45, 0.08), (55, 0.04), (65, 0.02).
* **Red line (noise = 0.2, shuffle = 0.5)**: This line exhibits the steepest decline in success rate among the solid lines, consistently below the green line.
* Data points (approx. X, Y): (7, 0.88), (15, 0.42), (25, 0.20), (35, 0.10), (45, 0.05), (55, 0.02), (65, 0.01).
* **Purple dashed line (∝ exp(-x/L), L = 24)**: This line represents an exponential decay model. It starts around 0.9 and decays smoothly.
* Data points (approx. X, Y): (7, 0.90), (15, 0.68), (25, 0.48), (35, 0.34), (45, 0.24), (55, 0.17), (65, 0.12).
* **Brown dashed line (∝ exp(-x/L), L = 14)**: This line represents another exponential decay model with a shorter characteristic length, indicating a faster decay. It starts around 0.9 and decays more rapidly than the purple dashed line.
* Data points (approx. X, Y): (7, 0.90), (15, 0.50), (25, 0.28), (35, 0.16), (45, 0.09), (55, 0.05), (65, 0.03).
**Right Chart (Logarithmic Y-axis)**
All data series, when plotted on a logarithmic Y-axis, appear approximately linear, which is characteristic of exponential decay.
* **Blue line (noise = 0, shuffle = 0)**: This line appears mostly straight, indicating an exponential decay.
* Data points (approx. X, Y): (7, 0.95), (15, 0.68), (25, 0.48), (35, 0.30), (45, 0.18), (55, 0.10), (65, 0.05).
* **Orange line (noise = 0, shuffle = 0.5)**: This line also appears mostly straight and parallel to the blue line, but slightly below it.
* Data points (approx. X, Y): (7, 0.95), (15, 0.65), (25, 0.42), (35, 0.28), (45, 0.16), (55, 0.08), (65, 0.04).
* **Green line (noise = 0.2, shuffle = 0)**: This line is steeper than the blue and orange lines, indicating a faster exponential decay.
* Data points (approx. X, Y): (7, 0.90), (15, 0.48), (25, 0.28), (35, 0.15), (45, 0.08), (55, 0.04), (65, 0.02).
* **Red line (noise = 0.2, shuffle = 0.5)**: This line is the steepest among the solid lines, showing the most rapid exponential decay.
* Data points (approx. X, Y): (7, 0.88), (15, 0.42), (25, 0.20), (35, 0.10), (45, 0.05), (55, 0.02), (65, 0.01).
* **Purple dashed line (∝ exp(-x/L), L = 24)**: This line is perfectly straight on the logarithmic plot, confirming its exponential nature. It has a shallower slope compared to the brown dashed line.
* Data points (approx. X, Y): (7, 0.90), (15, 0.68), (25, 0.48), (35, 0.34), (45, 0.24), (55, 0.17), (65, 0.12).
* **Brown dashed line (∝ exp(-x/L), L = 14)**: This line is also perfectly straight on the logarithmic plot, with a steeper slope than the purple dashed line, reflecting its smaller characteristic length (L=14 vs L=24).
* Data points (approx. X, Y): (7, 0.90), (15, 0.50), (25, 0.28), (35, 0.16), (45, 0.09), (55, 0.05), (65, 0.03).
### Key Observations
* **Impact of Noise**: Increasing `noise` from 0 to 0.2 (comparing blue/orange to green/red) significantly reduces the success rate and steepens the decay. For example, at `number of actions` = 35, `noise = 0, shuffle = 0` (blue) has a success rate of ~0.30, while `noise = 0.2, shuffle = 0` (green) has a success rate of ~0.15.
* **Impact of Shuffle**: Introducing `shuffle = 0.5` (comparing blue to orange, or green to red) generally leads to a slightly lower success rate, but the effect is less pronounced than that of noise.
* **Exponential Decay**: All experimental data series (solid lines) exhibit a clear exponential decay trend, as evidenced by their approximate linearity on the logarithmic Y-axis plot.
* **Model Fits**:
* The `∝ exp(-x/L), L = 24` (purple dashed) line provides a reasonable fit for the `noise = 0` conditions (blue and orange lines), particularly at higher numbers of actions.
* The `∝ exp(-x/L), L = 14` (brown dashed) line provides a good fit for the `noise = 0.2, shuffle = 0.5` condition (red line), especially at higher numbers of actions. It also closely tracks the `noise = 0.2, shuffle = 0` (green) line.
* **Initial Values**: All series start with a success rate close to 1.0 (or 10^0) for a low number of actions (approx. 7). The initial success rate is slightly lower for `noise = 0.2` conditions.
### Interpretation
The data strongly suggests that the "success rate" in the observed system decays exponentially with the "number of actions". This exponential relationship is clearly demonstrated by the linear appearance of the data on the logarithmic Y-axis chart.
The presence of "noise" (specifically `noise = 0.2`) has a detrimental effect on the success rate, causing it to decay much faster. This is reflected in the steeper slopes of the green and red lines compared to the blue and orange lines on the logarithmic plot, and their lower success rates on the linear plot. This implies that noise introduces errors or inefficiencies that accumulate with more actions, leading to a quicker failure.
The "shuffle" parameter (`shuffle = 0.5`) also negatively impacts the success rate, but to a lesser extent than noise. It slightly reduces the success rate for both `noise = 0` and `noise = 0.2` conditions, suggesting that shuffling might introduce some disorder or less optimal sequencing of actions.
The exponential decay models (`∝ exp(-x/L)`) with different characteristic lengths (L) appear to be good approximations for the observed phenomena. A larger L (e.g., L=24) corresponds to a slower decay (higher success rate for more actions), which aligns with conditions of lower noise. A smaller L (e.g., L=14) corresponds to a faster decay, which aligns with conditions of higher noise and shuffle. This indicates that the system's resilience or "memory" (how long it can maintain a high success rate) is inversely related to the level of noise and shuffle. The characteristic length L can be interpreted as a measure of how many actions, on average, the system can tolerate before its success rate drops significantly.
</details>
Figure 12: Pass@1 success rate for Llama-4 Maverick-17B-128E-Instruct versus solution length ( $L$ ) under different noise and shuffle ratios. Left: Linear scale. Right: Log-linear scale. Performance degrades with increased noise but is less affected by shuffle ratios. Fixed experimental parameters in this figure are the same as those in Figure 1.
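The characteristic decay length $L$ of the dashed reference curves in Figure 12 can be estimated from empirical success rates by a least-squares fit in log space, since $\log(\text{success}) = a - x/L$ is linear in $x$. The snippet below is a self-contained sketch with synthetic data, not the fit actually used in the paper:

```python
import math

def fit_decay_length(steps, success_rates):
    """Least-squares fit of log(success) = a - x / L; returns the
    characteristic length L of the exponential decay exp(-x/L)."""
    xs = list(steps)
    ys = [math.log(s) for s in success_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    return -1.0 / slope

# Synthetic check: data generated with L = 24 should recover L = 24.
xs = [7, 15, 25, 35, 45, 55, 65]
ys = [math.exp(-x / 24) for x in xs]
assert abs(fit_decay_length(xs, ys) - 24) < 1e-6
```

A larger fitted $L$ means a slower decay (the low-noise curves), while a smaller $L$ captures the faster collapse under noise and shuffle.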
To illustrate the performance consistency and disparities across different models, as detailed in Section 2.6, Figure 13 presents scatter and density plots of mean progress ratios. These plots demonstrate that model performance hierarchies are not strictly linear. They reveal 'performance inversions': instances, also noted in Section 2.6, where models with typically lower overall performance (e.g., lower average $L_{0}$ ) occasionally solve specific complex problems that models with higher average $L_{0}$ values fail on.
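Given per-instance progress ratios for a pair of models, such inversions can be located directly. The definition below (the nominally weaker model fully solves an instance the stronger one does not) is one reasonable operationalization, introduced here for illustration:

```python
def performance_inversions(progress_strong, progress_weak):
    """Indices of instances where the nominally weaker model fully
    solves a task (progress ratio 1.0) that the stronger model fails."""
    return [i for i, (a, b) in enumerate(zip(progress_strong, progress_weak))
            if b == 1.0 and a < 1.0]

# model_a dominates on average, yet fails instance 2, which model_b solves.
model_a = [1.0, 1.0, 0.3, 1.0]
model_b = [0.2, 1.0, 1.0, 0.1]
assert performance_inversions(model_a, model_b) == [2]
```

In Figure 13 these inversions appear as points above the dashed y=x equality line in the plots where the stronger model is on the x-axis.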
<details>
<summary>figs/progress_vs_progress.png Details</summary>

### Visual Description
## Chart Type: Grid of 2D Density Contour Plots Comparing Model Progress Ratios
### Overview
This image displays a 2x3 grid of six individual contour plots, each comparing the "progress ratio" of two different language models or model versions. Each subplot visualizes the joint distribution of progress ratios for a pair of models, with contours indicating data density and scattered points representing individual data instances. A dashed diagonal line (y=x) is present in all plots, signifying equal performance between the two models being compared. The color scheme ranges from dark purple (lowest density) to bright yellow/green (highest density), with white areas indicating regions of zero or near-zero data density.
### Components/Axes
The entire grid shares common axis labels and ranges:
* **X-axis Label (bottom-most plot)**: `progress ratio`
* **Y-axis Label (left-most plots)**: `progress ratio`
* **Axis Range**: Both X and Y axes range from 0.0 to 1.0.
* **Major Tick Markers**: 0.2, 0.4, 0.6, 0.8.
* **Diagonal Line**: A dashed grey line extends from (0.0, 0.0) to (1.0, 1.0) in all plots, representing the line of equality (x=y).
* **Data Points**: Small, light grey dots are scattered across each plot, representing individual data instances.
* **Contour Colors**: A continuous color gradient is used to represent data density, typically a viridis-like colormap:
* Dark purple: Lowest density
* Blue/Cyan: Low to medium density
* Green: Medium to high density
* Yellow/Lime Green: Highest density
* White: Zero or extremely low density (no data points observed in these regions).
Each subplot has a unique title indicating the models being compared:
* **Top Row, Left Plot (R1C1)**: `x: DeepSeek-R1 vs y:gemini-2.0-flash`
* **Top Row, Middle Plot (R1C2)**: `x: DeepSeek-R1 vs y:gemini-2.5-flash-preview-04-17`
* **Top Row, Right Plot (R1C3)**: `x: gemini-2.0-flash vs y:gemini-2.5-flash-preview-04-17`
* **Bottom Row, Left Plot (R2C1)**: `x: DeepSeek-R1 vs y:Llama-4-Maverick-17B-128E-Instruct-FP8`
* **Bottom Row, Middle Plot (R2C2)**: `x: gemini-2.0-flash vs y:Llama-4-Maverick-17B-128E-Instruct-FP8`
* **Bottom Row, Right Plot (R2C3)**: `x: gemini-2.5-flash-preview-04-17 vs y:Llama-4-Maverick-17B-128E-Instruct-FP8`
### Detailed Analysis
The analysis is segmented by the 2x3 grid layout:
**Row 1: Comparisons involving DeepSeek-R1 and Gemini models**
* **R1C1: x: DeepSeek-R1 vs y:gemini-2.0-flash**
* **Trend**: The highest density (bright yellow/green) is concentrated in the bottom-left corner, centered approximately around (0.15, 0.15). The density gradually decreases outwards, forming concentric contours. A secondary, much less intense, and smaller density peak is visible in the top-right, around (0.8, 0.9). A significant white region, indicating no data, is present in the upper-middle part of the plot, roughly from (0.4, 0.5) to (0.8, 0.7), and also in the top-right corner above the secondary peak.
* **Data Points**: Numerous grey scattered points are densest in the bottom-left, following the yellow/green contour, and sparser elsewhere, consistent with the contour map. Most points are clustered near the y=x line in the bottom-left.
* **R1C2: x: DeepSeek-R1 vs y:gemini-2.5-flash-preview-04-17**
* **Trend**: This plot shows a very similar distribution to R1C1. The highest density is in the bottom-left corner, centered around (0.15, 0.15). Density decreases outwards. A large white region is prominent in the middle-right, extending from approximately (0.4, 0.5) to (0.8, 0.7), and another smaller white region in the top-right. There is no clear secondary peak in the top-right as seen in R1C1.
* **Data Points**: Grey scattered points are most concentrated in the bottom-left, aligning with the high-density contours.
* **R1C3: x: gemini-2.0-flash vs y:gemini-2.5-flash-preview-04-17**
* **Trend**: The distribution here is almost identical to R1C2. The highest density is in the bottom-left corner, centered around (0.15, 0.15). Density decreases outwards. A large white region is present in the middle-right, extending from approximately (0.4, 0.5) to (0.8, 0.7), and another smaller white region in the top-right.
* **Data Points**: Grey scattered points are most concentrated in the bottom-left, aligning with the high-density contours.
**Row 2: Comparisons involving Llama-4-Maverick-17B-128E-Instruct-FP8**
* **R2C1: x: DeepSeek-R1 vs y:Llama-4-Maverick-17B-128E-Instruct-FP8**
* **Trend**: This plot exhibits a bimodal distribution. A primary high-density area (yellow/green) is in the bottom-left corner, centered around (0.15, 0.15). A secondary, distinct high-density area (yellow/green) is located in the top-right corner, centered around (0.9, 0.9). Both peaks are centered on the y=x line. A large, irregular white region forms a diagonal band across the middle of the plot, roughly from (0.3, 0.4) to (0.7, 0.8), indicating a lack of data points in this intermediate progress ratio range.
* **Data Points**: Grey scattered points are densely clustered in both the bottom-left and top-right corners, corresponding to the high-density contours.
* **R2C2: x: gemini-2.0-flash vs y:Llama-4-Maverick-17B-128E-Instruct-FP8**
* **Trend**: The distribution is very similar to R2C1. A primary high-density area is in the bottom-left corner, centered around (0.15, 0.15). A secondary high-density area is in the top-right corner, centered around (0.9, 0.9). Both peaks are centered on the y=x line. A large, irregular white region forms a diagonal band across the middle, roughly from (0.3, 0.4) to (0.7, 0.8).
* **Data Points**: Grey scattered points are densely clustered in both the bottom-left and top-right corners.
* **R2C3: x: gemini-2.5-flash-preview-04-17 vs y:Llama-4-Maverick-17B-128E-Instruct-FP8**
* **Trend**: The distribution is nearly identical to R2C1 and R2C2. A primary high-density area is in the bottom-left corner, centered around (0.15, 0.15). A secondary high-density area is in the top-right corner, centered around (0.9, 0.9). Both peaks are centered on the y=x line. A large, irregular white region forms a diagonal band across the middle, roughly from (0.3, 0.4) to (0.7, 0.8).
* **Data Points**: Grey scattered points are densely clustered in both the bottom-left and top-right corners.
### Key Observations
1. **Common Low Performance**: All six plots show a significant concentration of data points and high density in the bottom-left corner (progress ratio < 0.2 for both models). This suggests that for a substantial number of instances, all compared models achieve a very low "progress ratio".
2. **Equality Line Adherence**: In all plots, the high-density regions are generally centered on or very close to the y=x dashed line, indicating that when one model performs well (or poorly), the other model in the pair tends to perform similarly for those specific instances.
3. **Top Row vs. Bottom Row Differences**:
* **Top Row (DeepSeek-R1 vs Gemini, Gemini 2.0 vs Gemini 2.5)**: These plots are characterized by a strong bottom-left peak and a large, contiguous white region in the middle-to-upper-right quadrant. This implies a lack of instances where both models achieve moderate to high progress ratios, or where one significantly outperforms the other in the higher range.
* **Bottom Row (All models vs Llama-4-Maverick)**: These plots show a distinct bimodal distribution with high-density peaks in both the bottom-left and top-right corners. This indicates that for many instances, both models either achieve very low progress ratios *or* very high progress ratios.
4. **Intermediate Performance Gap (Bottom Row)**: The prominent white diagonal band in the middle of the bottom row plots (roughly 0.3 to 0.8 progress ratio) suggests that there are very few instances where both models achieve intermediate progress ratios.
5. **Consistency within Rows**: The plots within the top row are very similar to each other, as are the plots within the bottom row. This suggests that DeepSeek-R1, Gemini 2.0-flash, and Gemini 2.5-flash-preview-04-17 have comparable performance profiles when evaluated against each other (top row) and when evaluated against Llama-4-Maverick-17B-128E-Instruct-FP8 (bottom row).
### Interpretation
The "progress ratio" appears to be a metric where higher values indicate better performance. The plots compare the performance of different models on the same set of task instances.
The consistent high density in the bottom-left corner across all plots suggests that a significant portion of the evaluated instances are challenging for all models, resulting in low progress ratios. This could indicate a baseline difficulty for the tasks or a common failure mode.
The top row plots, comparing DeepSeek-R1 with Gemini models and Gemini 2.0 with Gemini 2.5, show that while there are many instances of low performance, there are fewer instances where both models achieve moderate to high progress ratios. The large white region in the middle-right suggests that these model pairs rarely achieve high progress ratios simultaneously, or that one model rarely achieves a high progress ratio while the other achieves a moderate one. The similarity between R1C2 and R1C3 implies that DeepSeek-R1 and Gemini 2.0-flash have similar performance characteristics when compared to Gemini 2.5-flash-preview-04-17.
In contrast, the bottom row plots, comparing DeepSeek-R1, Gemini 2.0-flash, and Gemini 2.5-flash-preview-04-17 against Llama-4-Maverick-17B-128E-Instruct-FP8, reveal a different pattern. The bimodal distribution with peaks at both low and high progress ratios suggests that for these comparisons, the tasks tend to be either very easy (both models succeed with high progress ratios) or very difficult (both models fail with low progress ratios). The absence of data in the intermediate range (the white diagonal band) implies that there are few tasks where both models achieve a "medium" level of success. This could indicate a binary nature of success for these tasks when evaluated against Llama-4-Maverick, or that the tasks are not finely graded in terms of difficulty for these model pairs. The strong alignment of both peaks with the y=x line indicates that when one model performs well or poorly, Llama-4-Maverick tends to perform similarly on those specific instances. The striking similarity across all three bottom row plots further reinforces that DeepSeek-R1, Gemini 2.0-flash, and Gemini 2.5-flash-preview-04-17 exhibit very similar performance profiles when benchmarked against Llama-4-Maverick-17B-128E-Instruct-FP8.
</details>
Figure 13: Scatter and density plots of progress ratios per task instance, comparing model pairs on the same instances of pathfinding tasks. These plots illustrate performance agreement and disparities. Notably, Gemini-2.5-Flash, for example, often succeeds on instances where other models achieve near-zero progress. Data from experiments in Figure 1 (main paper).
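The 'performance inversions' visible in these scatter plots (points far off the y = x diagonal) can be counted directly from the aligned per-instance progress ratios. A minimal sketch, where the `margin` threshold is an illustrative choice rather than a value from the paper:

```python
def count_inversions(progress_a, progress_b, margin=0.5):
    """Count instances where one model's progress ratio exceeds the
    other's by at least `margin`, i.e. points lying far off the
    y = x diagonal in the pairwise scatter plot.

    `progress_a` and `progress_b` are per-instance progress ratios
    in [0, 1], aligned by task instance.
    """
    a_wins = sum(1 for a, b in zip(progress_a, progress_b) if a - b >= margin)
    b_wins = sum(1 for a, b in zip(progress_a, progress_b) if b - a >= margin)
    return a_wins, b_wins

print(count_inversions([0.9, 0.1, 0.5], [0.1, 0.9, 0.5]))  # (1, 1)
```

A nonzero count on both sides confirms that neither model strictly dominates the other at the instance level, even when their average $L_{0}$ values differ.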
Figure 14 isolates the impact of the shuffle ratio on model performance when other factors, such as noise, are controlled. This visualization corresponds to the finding discussed in the main paper (Section 2.3, "Fact Ordering (Shuffle Ratio)") that simply reordering facts has minimal impact on the performance of the evaluated models under low-noise conditions.
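One plausible way to realize a tunable shuffle ratio is to displace a fixed fraction of facts from their canonical order; the sketch below is illustrative only, and the paper's exact construction may differ:

```python
import random

def apply_shuffle_ratio(facts, shuffle_ratio, seed=0):
    """Illustrative sketch: permute a fraction `shuffle_ratio` of the
    facts among themselves, leaving the rest in canonical order.

    shuffle_ratio = 0.0 returns the original order; 1.0 permutes all
    positions. (The paper's actual construction may differ.)
    """
    rng = random.Random(seed)
    facts = list(facts)
    n = len(facts)
    k = int(round(shuffle_ratio * n))
    idx = rng.sample(range(n), k)       # positions selected for shuffling
    values = [facts[i] for i in idx]
    rng.shuffle(values)                 # permute only the selected facts
    for i, v in zip(idx, values):
        facts[i] = v
    return facts
```

Since the output is always a permutation of the input, the informational content of the fact list is unchanged; only its ordering varies with the ratio.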
Figure 15 isolates the impact of the number of in-context examples in the instruction prompt, showing a modest but consistent improvement as examples are added beyond the zero-shot and one-shot settings.
Figure 16 extends the analysis to the most recent SOTA models released by OpenAI, showing that they exhibit the same performance drop observed in the main paper.
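The exponential collapse of success rate beyond a model-specific depth can be characterized with a log-linear fit, success $\approx e^{\,a - L/\lambda}$. A minimal least-squares sketch for estimating the decay length $\lambda$ (not the paper's exact fitting procedure; zero-success points are dropped before taking logs):

```python
import math

def fit_exponential_decay(lengths, success):
    """Least-squares fit of log(success) = a - L / lam over points with
    success > 0; returns (a, lam), where lam is the decay length.
    """
    pts = [(L, math.log(s)) for L, s in zip(lengths, success) if s > 0]
    n = len(pts)
    sx = sum(L for L, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(L * L for L, _ in pts)
    sxy = sum(L * y for L, y in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, -1.0 / slope

# Synthetic check: data generated with lam = 50 should be recovered exactly.
lengths = [10, 20, 30, 40, 50]
success = [math.exp(-L / 50) for L in lengths]
a, lam = fit_exponential_decay(lengths, success)
```

On log-linear axes such a decay is a straight line, which is why the right-hand panels of the figures above make the collapse easiest to see.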
<details>
<summary>figs/fig_vs_shuffle_fixed_L_keys2_noise0.2.png Details</summary>

### Visual Description
## Chart Type: Performance Metrics of Language Models Across Shuffle Ratios
### Overview
This image displays three line charts arranged horizontally, comparing the performance of two language models, `Llama-4-Maverick-17B-128E-Instruct-FP8` and `gemini-2.5-flash-preview-04-17`, across varying "shuffle ratios". The metrics evaluated are "mean progress ratio", "mean success rate (Pass@1)", and "CoT tokens". Each chart plots these metrics against the "shuffle ratio" on the x-axis, which ranges from 0.0 to 1.0.
### Components/Axes
**Legend (located in the top-left chart area):**
* **Blue line with circular markers**: Represents `(Llama-4-Maverick-17B-128E-Instruct-FP8)`
* **Orange line with circular markers**: Represents `(gemini-2.5-flash-preview-04-17)`
This legend applies to all three charts.
**Common X-axis for all charts:**
* **Title**: `shuffle ratio`
* **Range**: 0.0 to 1.0
* **Tick Markers**: 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
**Chart 1 (Left):**
* **Y-axis Title**: `mean progress ratio`
* **Y-axis Range**: 0.0 to 1.0
* **Y-axis Tick Markers**: 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
**Chart 2 (Middle):**
* **Y-axis Title**: `mean success rate (Pass@1)`
* **Y-axis Range**: 0.0 to 1.0
* **Y-axis Tick Markers**: 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
**Chart 3 (Right):**
* **Y-axis Title**: `CoT tokens`
* **Y-axis Range**: 0 to 1600
* **Y-axis Tick Markers**: 0, 200, 400, 600, 800, 1000, 1200, 1400, 1600
### Detailed Analysis
**Chart 1: Mean Progress Ratio vs. Shuffle Ratio**
* **Blue Line (Llama-4-Maverick-17B-128E-Instruct-FP8)**:
* **Trend**: The line starts at a mean progress ratio of approximately 0.22 at `shuffle ratio` 0.0, slightly decreases to a minimum of about 0.18 at `shuffle ratio` 0.6, and then slightly increases to approximately 0.19 at `shuffle ratio` 1.0. The overall trend is relatively flat and low.
* **Data Points**:
* `shuffle ratio` 0.0: ~0.22
* `shuffle ratio` 0.2: ~0.20
* `shuffle ratio` 0.4: ~0.19
* `shuffle ratio` 0.6: ~0.18
* `shuffle ratio` 0.8: ~0.19
* `shuffle ratio` 1.0: ~0.19
* **Orange Line (gemini-2.5-flash-preview-04-17)**:
* **Trend**: The line starts at a mean progress ratio of approximately 0.64 at `shuffle ratio` 0.0, gradually increases to a peak of about 0.68 at `shuffle ratio` 0.4, and then slightly decreases and stabilizes around 0.63 for `shuffle ratio` 0.6 through 1.0. The overall trend is higher and relatively stable compared to the blue line.
* **Data Points**:
* `shuffle ratio` 0.0: ~0.64
* `shuffle ratio` 0.2: ~0.66
* `shuffle ratio` 0.4: ~0.68
* `shuffle ratio` 0.6: ~0.63
* `shuffle ratio` 0.8: ~0.63
* `shuffle ratio` 1.0: ~0.63
**Chart 2: Mean Success Rate (Pass@1) vs. Shuffle Ratio**
* **Blue Line (Llama-4-Maverick-17B-128E-Instruct-FP8)**:
* **Trend**: The line remains very low and almost flat across all `shuffle ratio` values, hovering just above 0.0. There is a slight increase at `shuffle ratio` 1.0.
* **Data Points**:
* `shuffle ratio` 0.0: ~0.01
* `shuffle ratio` 0.2: ~0.01
* `shuffle ratio` 0.4: ~0.01
* `shuffle ratio` 0.6: ~0.01
* `shuffle ratio` 0.8: ~0.01
* `shuffle ratio` 1.0: ~0.02
* **Orange Line (gemini-2.5-flash-preview-04-17)**:
* **Trend**: The line starts at a mean success rate of approximately 0.50 at `shuffle ratio` 0.0, increases to a peak of about 0.56 at `shuffle ratio` 0.4, then decreases to approximately 0.50 at `shuffle ratio` 1.0. The overall trend is higher and relatively stable compared to the blue line.
* **Data Points**:
* `shuffle ratio` 0.0: ~0.50
* `shuffle ratio` 0.2: ~0.52
* `shuffle ratio` 0.4: ~0.56
* `shuffle ratio` 0.6: ~0.53
* `shuffle ratio` 0.8: ~0.54
* `shuffle ratio` 1.0: ~0.50
**Chart 3: CoT tokens vs. Shuffle Ratio**
* **Blue Line (Llama-4-Maverick-17B-128E-Instruct-FP8)**:
* **Trend**: The line starts at approximately 1600 CoT tokens at `shuffle ratio` 0.0, slightly dips to ~1590 at 0.2, then gradually increases to a peak of approximately 1640 at `shuffle ratio` 0.8, before slightly decreasing to ~1620 at `shuffle ratio` 1.0. The overall trend is very high and stable, with minor fluctuations.
* **Data Points**:
* `shuffle ratio` 0.0: ~1600
* `shuffle ratio` 0.2: ~1590
* `shuffle ratio` 0.4: ~1600
* `shuffle ratio` 0.6: ~1620
* `shuffle ratio` 0.8: ~1640
* `shuffle ratio` 1.0: ~1620
* **Orange Line (gemini-2.5-flash-preview-04-17)**:
* **Trend**: The line remains very low and almost flat across all `shuffle ratio` values, consistently around 340-350 CoT tokens.
* **Data Points**:
* `shuffle ratio` 0.0: ~350
* `shuffle ratio` 0.2: ~350
* `shuffle ratio` 0.4: ~350
* `shuffle ratio` 0.6: ~340
* `shuffle ratio` 0.8: ~350
* `shuffle ratio` 1.0: ~340
### Key Observations
* **Performance Disparity**: The `gemini-2.5-flash-preview-04-17` model (orange line) consistently outperforms `Llama-4-Maverick-17B-128E-Instruct-FP8` (blue line) in both `mean progress ratio` and `mean success rate (Pass@1)` across all `shuffle ratio` values.
* **CoT Token Usage**: Conversely, `Llama-4-Maverick-17B-128E-Instruct-FP8` uses significantly more `CoT tokens` (around 1600) than `gemini-2.5-flash-preview-04-17` (around 350).
* **Stability Across Shuffle Ratios**: For both models and all metrics, the performance and token usage remain relatively stable, showing only minor fluctuations as the `shuffle ratio` changes from 0.0 to 1.0. There are no sharp drops or increases that would suggest a strong sensitivity to the `shuffle ratio` within this range.
* **Peak Performance**: `gemini-2.5-flash-preview-04-17` shows a slight peak in both `mean progress ratio` and `mean success rate (Pass@1)` around a `shuffle ratio` of 0.4.
### Interpretation
The data strongly suggests that `gemini-2.5-flash-preview-04-17` is a more efficient and effective model for the tasks measured by "mean progress ratio" and "mean success rate (Pass@1)" compared to `Llama-4-Maverick-17B-128E-Instruct-FP8`.
Specifically:
1. **Superior Performance**: `gemini-2.5-flash-preview-04-17` achieves significantly higher "mean progress ratio" (around 0.63-0.68) and "mean success rate (Pass@1)" (around 0.50-0.56) than `Llama-4-Maverick-17B-128E-Instruct-FP8` (which hovers around 0.18-0.22 for progress ratio and near 0.01-0.02 for success rate). This indicates that the Gemini model is much better at successfully completing tasks and making progress towards solutions.
2. **Resource Efficiency (CoT Tokens)**: Despite its superior performance, `gemini-2.5-flash-preview-04-17` utilizes substantially fewer "CoT tokens" (approximately 350) compared to `Llama-4-Maverick-17B-128E-Instruct-FP8` (approximately 1600). "CoT tokens" likely refer to Chain-of-Thought tokens, which are often used for reasoning or intermediate steps. The Gemini model's lower token usage, combined with higher success rates, implies greater efficiency in its reasoning process or a less verbose approach to problem-solving.
3. **Robustness to Shuffle Ratio**: The relatively flat trends across the `shuffle ratio` for all metrics suggest that the order or arrangement of input elements (as implied by "shuffle ratio") does not significantly impact the performance or token usage of either model within the tested range. This indicates a degree of robustness in how both models handle variations in input structure.
4. **Model Choice Implications**: For applications prioritizing high success rates and efficient resource usage (fewer tokens), `gemini-2.5-flash-preview-04-17` appears to be the clearly superior choice based on these metrics. `Llama-4-Maverick-17B-128E-Instruct-FP8`, while potentially having other strengths not measured here, demonstrates significantly lower performance and higher token consumption for these specific tasks.
</details>
Figure 14: Impact of shuffle ratio on Pass@1 success rate. Varying the degree of mixing (shuffle) between supporting and distracting facts shows minimal impact on performance for Gemini 2.5 Flash and Llama-4 Maverick, suggesting robustness to fact order when noise is controlled. The generation and sampling of maze instances for these tasks follow the same methodology detailed for experiments in the main paper (Figures 3 and 4).
<details>
<summary>figs/maze_ablation_analysis.png Details</summary>

### Visual Description
## Chart Type: Line Chart - Success Rate vs. Number of Actions for Llama-4-Maverick Model
### Overview
This image displays a line chart illustrating the "Success rate" as a function of the "Number of actions" for different configurations of the "Llama-4-Maverick-17B-128E-Instruct-FP8" model. Five distinct experimental setups are compared, varying in the number of "shots" (few-shot learning examples) and whether "guided CoT" (Chain-of-Thought) is employed. All lines show a rapid decrease in success rate with an increasing number of actions, eventually flattening out near zero.
### Components/Axes
**Chart Title (Top-left, within a white box with a black border):**
"Llama-4-Maverick-17B-128E-Instruct-FP8"
**X-axis:**
* **Title:** "Number of actions"
* **Range:** From 0 to approximately 225.
* **Major Ticks:** 0, 50, 100, 150, 200.
* **Grid Lines:** Vertical grid lines are present at each major tick mark.
**Y-axis:**
* **Title:** "Success rate"
* **Range:** From 0 to approximately 0.7.
* **Major Ticks:** 0, 0.2, 0.4, 0.6.
* **Grid Lines:** Horizontal grid lines are present at each major tick mark.
**Legend (Top-right, within a white box with a black border):**
The legend identifies five data series by color and marker:
* **Green line with circular markers:** "5_shots_and_guided_CoT"
* **Purple line with diamond markers:** "3_shots_and_guided_CoT"
* **Orange line with upward-pointing triangle markers:** "3_shot_unguided"
* **Red line with downward-pointing triangle markers:** "1_shot_and_guided_CoT"
* **Blue line with square markers:** "zero_shot_and_guided_CoT"
### Detailed Analysis
All five data series exhibit a similar overall trend: a steep decline in "Success rate" as the "Number of actions" increases, followed by a more gradual decrease, eventually approaching a success rate of zero. The most significant drop in success rate occurs within the first 50 actions for all configurations.
Here is a detailed breakdown of each series:
1. **Green line (5_shots_and_guided_CoT):**
* **Trend:** This line starts at the highest success rate and generally maintains the highest performance among all series for the initial range of actions. It shows a rapid decline up to approximately 50 actions, then a slower decline, flattening out near zero after about 100 actions.
* **Approximate Data Points (X, Y):**
* (X~12, Y~0.67)
* (X~22, Y~0.45)
* (X~32, Y~0.25)
* (X~42, Y~0.15)
* (X~52, Y~0.10)
* (X~72, Y~0.05)
* (X~102, Y~0.02)
* (X~122, Y~0.01)
* (X~142, Y~0.005)
* (X~162, Y~0.005)
* (X~182, Y~0.005)
* (X~202, Y~0.005)
2. **Purple line (3_shots_and_guided_CoT):**
* **Trend:** This line starts slightly below the 5-shot guided CoT, but very close to the 3-shot unguided CoT. It follows a similar rapid decline pattern, consistently performing slightly worse than the 5-shot guided CoT but generally better than the 1-shot and zero-shot guided CoT.
* **Approximate Data Points (X, Y):**
* (X~12, Y~0.65)
* (X~22, Y~0.43)
* (X~32, Y~0.23)
* (X~42, Y~0.12)
* (X~52, Y~0.08)
* (X~72, Y~0.03)
* (X~102, Y~0.01)
* (X~122, Y~0.005)
* (X~142, Y~0.005)
* (X~162, Y~0.005)
* (X~182, Y~0.005)
* (X~202, Y~0.005)
3. **Orange line (3_shot_unguided):**
* **Trend:** This line starts very close to the 5-shot and 3-shot guided CoT lines. It shows a rapid initial decline, crossing below the 3-shot guided CoT line around X=30-40 actions. For most of the range, it performs better than 1-shot and zero-shot guided CoT, but slightly worse than 3-shot guided CoT.
* **Approximate Data Points (X, Y):**
* (X~12, Y~0.66)
* (X~22, Y~0.42)
* (X~32, Y~0.22)
* (X~42, Y~0.10)
* (X~52, Y~0.07)
* (X~72, Y~0.03)
* (X~102, Y~0.01)
* (X~122, Y~0.005)
* (X~142, Y~0.005)
* (X~162, Y~0.005)
* (X~182, Y~0.005)
* (X~202, Y~0.005)
4. **Red line (1_shot_and_guided_CoT):**
* **Trend:** This line starts with a lower initial success rate compared to the 5-shot and 3-shot configurations. It exhibits a rapid decline, generally performing worse than all 3-shot and 5-shot methods, but slightly better than the zero-shot guided CoT for the initial phase.
* **Approximate Data Points (X, Y):**
* (X~12, Y~0.63)
* (X~22, Y~0.38)
* (X~32, Y~0.20)
* (X~42, Y~0.10)
* (X~52, Y~0.06)
* (X~72, Y~0.02)
* (X~102, Y~0.01)
* (X~122, Y~0.005)
* (X~142, Y~0.005)
* (X~162, Y~0.005)
* (X~182, Y~0.005)
* (X~202, Y~0.005)
5. **Blue line (zero_shot_and_guided_CoT):**
* **Trend:** This line consistently shows the lowest success rate among all configurations, particularly in the initial phase of actions. It follows the same rapid decline pattern as the others, eventually converging to near zero success rate.
* **Approximate Data Points (X, Y):**
* (X~12, Y~0.58)
* (X~22, Y~0.37)
* (X~32, Y~0.18)
* (X~42, Y~0.08)
* (X~52, Y~0.05)
* (X~72, Y~0.02)
* (X~102, Y~0.01)
* (X~122, Y~0.005)
* (X~142, Y~0.005)
* (X~162, Y~0.005)
* (X~182, Y~0.005)
* (X~202, Y~0.005)
### Key Observations
* **Diminishing Returns:** All configurations show a sharp decrease in success rate with an increasing number of actions, indicating that the model's ability to maintain a high success rate diminishes rapidly as more actions are required.
* **Impact of Shots:** For "guided CoT" methods, there is a clear positive correlation between the number of shots and the initial success rate: "5_shots_and_guided_CoT" consistently outperforms "3_shots_and_guided_CoT", which in turn outperforms "1_shot_and_guided_CoT" and "zero_shot_and_guided_CoT".
* **Guided vs. Unguided CoT:** "3_shots_and_guided_CoT" generally performs slightly better than "3_shot_unguided" for the initial actions (up to ~30 actions), but their performance becomes very similar thereafter. This suggests that guidance in CoT might offer a slight advantage at lower action counts but its benefit diminishes quickly.
* **Convergence to Zero:** Beyond approximately 100 actions, the success rates for all configurations converge to very low values (approaching 0.01 or less), suggesting a practical limit to the model's effectiveness for highly complex tasks requiring many actions.
* **Initial Performance Spread:** The largest differences in success rate between the configurations are observed at lower "Number of actions" (e.g., below 50). As the number of actions increases, the performance gap narrows significantly.
### Interpretation
The data suggests that for the "Llama-4-Maverick-17B-128E-Instruct-FP8" model, the "Success rate" is highly sensitive to the "Number of actions" required. This could imply that the model struggles with long-horizon tasks or tasks requiring extensive sequential reasoning, where each action introduces a potential point of failure.
The strong correlation between the number of "shots" and initial success rate for guided CoT methods highlights the importance of few-shot learning in improving the model's performance. Providing more examples (shots) significantly boosts the model's ability to succeed, especially when the task is relatively short (fewer actions).
The comparison between "3_shots_and_guided_CoT" and "3_shot_unguided" indicates that while Chain-of-Thought (CoT) itself is beneficial, explicit guidance might offer a marginal improvement, particularly for tasks with fewer steps. However, this benefit is not sustained for tasks requiring a larger number of actions, where the inherent difficulty of the task likely overshadows the guidance mechanism.
The rapid convergence of all success rates to near zero after a certain number of actions (around 100) points to a fundamental limitation in the model's ability to maintain high accuracy over extended sequences of operations. This could be due to error propagation, increasing complexity, or a lack of robust long-term planning capabilities. For practical applications, this suggests that this model configuration might be more suitable for tasks that can be completed within a relatively small number of actions. Further research might focus on improving the model's robustness for higher action counts or exploring alternative strategies for complex, multi-step problems.
</details>
Figure 15: The impact of including different numbers of reference examples in the prompt as part of in-context learning. Increasing the number of examples leads to slight improvements in performance. The experimental parameters used here are the same as those in Figure 1.
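The few-shot conditions compared in Figure 15 amount to varying how many worked examples precede the query in the prompt. A minimal sketch of such prompt assembly, with hypothetical field names (`task`, `solution`) and formatting that stands in for the paper's actual template:

```python
def build_few_shot_prompt(instruction, examples, query, k):
    """Assemble a k-shot prompt: the instruction, k worked examples,
    then the query task awaiting a solution.

    `examples` is a list of dicts with hypothetical keys
    'task' and 'solution'; the template below is illustrative.
    """
    parts = [instruction]
    for ex in examples[:k]:
        parts.append(f"Task: {ex['task']}\nSolution: {ex['solution']}")
    parts.append(f"Task: {query}\nSolution:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Solve the maze step by step.",
    [{"task": "t1", "solution": "s1"}, {"task": "t2", "solution": "s2"}],
    "t3",
    k=1,
)
```

Sweeping `k` from 0 upward while holding the query set fixed reproduces the zero-shot through 5-shot conditions of the ablation.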
<details>
<summary>figs/model_comparison_openai.png Details</summary>

### Visual Description
## Line Chart: Success Rate vs. Number of Actions
### Overview
This image displays a 2D line chart illustrating the relationship between "Success rate" (Y-axis) and "Number of actions" (X-axis) for four different models or systems: GPT-5, OSS-120B, OSS-20B, and Llama-4-Maverick. Each line represents a different model, showing how its success rate decreases as the number of actions increases. The chart uses a white background with a light grey grid for readability.
### Components/Axes
The chart is structured with a horizontal X-axis at the bottom and a vertical Y-axis on the left.
* **X-axis (Horizontal)**: Labeled "Number of actions".
* Range: From 0 to 300.
* Major ticks are marked at 0, 50, 100, 150, 200, 250, and 300.
* Minor grid lines are visible, suggesting intervals of 25 units.
* **Y-axis (Vertical)**: Labeled "Success rate".
* Range: From 0.0 to 1.0.
* Major ticks are marked at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* Minor grid lines are visible, suggesting intervals of 0.1 units.
* **Grid**: Light grey horizontal and vertical grid lines extend across the plotting area, aiding in data point estimation.
* **Legend**: Located in the top-right corner of the chart. It identifies the four data series by color and marker type (all using circular markers).
* **Blue line with circle marker**: GPT-5
* **Orange line with circle marker**: OSS-120B
* **Green line with circle marker**: OSS-20B
* **Red line with circle marker**: Llama-4-Maverick
### Detailed Analysis
The chart presents four distinct data series, each showing a generally decreasing trend in success rate as the number of actions increases.
1. **GPT-5 (Blue Line with Circle Markers)**:
* **Trend**: This line starts with the highest success rate and shows the most gradual decline among all models. It maintains a relatively high success rate for a larger number of actions before its decline steepens slightly and then flattens out at lower success rates.
* **Data Points**:
* At approximately 10 actions, the success rate is 1.0.
* At approximately 25 actions, the success rate is around 0.95.
* At 50 actions, the success rate is approximately 0.85.
* At 100 actions, the success rate is around 0.62.
* At approximately 140 actions, the success rate is about 0.52.
* At approximately 180 actions, the success rate is around 0.25.
* At approximately 220 actions, the success rate is about 0.18.
* At approximately 260 actions, the success rate is around 0.17.
* At 300 actions, the success rate is approximately 0.08.
2. **OSS-120B (Orange Line with Circle Markers)**:
* **Trend**: This line starts with a high success rate, slightly below GPT-5, and exhibits a steeper initial decline. Its success rate drops significantly faster than GPT-5, approaching zero around 200 actions.
* **Data Points**:
* At approximately 10 actions, the success rate is around 0.95.
* At approximately 25 actions, the success rate is about 0.90.
* At 50 actions, the success rate is approximately 0.72.
* At 100 actions, the success rate is around 0.23.
* At approximately 140 actions, the success rate is about 0.05.
* At approximately 180 actions, the success rate is around 0.01.
* From approximately 220 actions onwards, the success rate is effectively 0.00.
3. **OSS-20B (Green Line with Circle Markers)**:
* **Trend**: This line starts with a lower success rate compared to GPT-5 and OSS-120B and shows a very rapid decline. Its success rate drops to near zero much faster than the previous two models, reaching this point before 150 actions.
* **Data Points**:
* At approximately 10 actions, the success rate is around 0.88.
* At approximately 25 actions, the success rate is about 0.75.
* At approximately 40 actions, the success rate is around 0.55.
* At 50 actions, the success rate is approximately 0.31.
* At 100 actions, the success rate is around 0.01.
* From approximately 140 actions onwards, the success rate is effectively 0.00.
4. **Llama-4-Maverick (Red Line with Circle Markers)**:
* **Trend**: This line starts with the lowest initial success rate among all models and demonstrates the steepest and fastest decline. Its success rate plummets to near zero very quickly, reaching this point well before 100 actions.
* **Data Points**:
* At approximately 10 actions, the success rate is around 0.65.
* At approximately 25 actions, the success rate is about 0.38.
* At approximately 40 actions, the success rate is around 0.18.
* At 50 actions, the success rate is approximately 0.05.
* At 100 actions, the success rate is around 0.01.
* From approximately 140 actions onwards, the success rate is effectively 0.00.
### Key Observations
* **Performance Hierarchy**: GPT-5 consistently outperforms all other models across the entire range of "Number of actions," maintaining the highest success rate.
* **Rate of Decline**: Llama-4-Maverick shows the most rapid degradation in success rate, followed by OSS-20B, then OSS-120B, and finally GPT-5, which has the most resilient performance.
* **Threshold for Zero Success**:
* Llama-4-Maverick's success rate drops to near zero (approximately 0.05) by 50 actions and is effectively 0.00 by 140 actions.
* OSS-20B's success rate drops to near zero by 100 actions and effectively 0.00 by 140 actions.
* OSS-120B's success rate drops to near zero by 180 actions and effectively 0.00 by 220 actions.
* GPT-5's success rate remains above 0.05 even at 300 actions, indicating superior robustness.
* **Initial Performance**: While GPT-5 starts at a perfect 1.0 success rate, OSS-120B is very close at 0.95, and OSS-20B is also strong at 0.88 for 10 actions. Llama-4-Maverick starts significantly lower at 0.65.
### Interpretation
This chart compares the robustness of the different models (GPT-5, OSS-120B, OSS-20B, Llama-4-Maverick) as task length, represented by "Number of actions," increases. The "Success rate" is the fraction of tasks completed successfully.
The data suggests that:
* **GPT-5 is the most capable and robust model** among those tested. It maintains a high success rate even when faced with a large number of actions, indicating superior long-term coherence, memory, or planning abilities for complex tasks.
* **Model scale (implied by names like 120B and 20B)** appears to correlate with performance: OSS-120B, presumably a larger model than OSS-20B, performs better and degrades more slowly. This aligns with the common observation that larger models often exhibit better performance and generalization.
* **Llama-4-Maverick is the least effective** for tasks involving a higher number of actions, with its performance rapidly deteriorating. This could imply limitations in its ability to handle sequential dependencies, maintain context, or plan over extended sequences.
* The steepness of the curves indicates how quickly a model's performance degrades under increasing task complexity. A flatter curve (like GPT-5) signifies greater resilience.
* The point at which each curve approaches zero success rate can be considered a practical limit for that model's utility in tasks requiring that many actions. For instance, Llama-4-Maverick is practically unusable for tasks exceeding 100 actions, whereas GPT-5 still offers a non-trivial success rate even at 300 actions.
In essence, the chart provides a comparative benchmark of these models' ability to sustain performance under increasing operational demands, highlighting GPT-5's significant advantage in handling complex, multi-step tasks.
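As a rough illustration of the exponential collapse described above, the characteristic decay scale can be recovered from a handful of chart readings with a log-linear least-squares fit. The data points below are the approximate OSS-20B values transcribed from the description, and the single-exponential form s(n) ≈ exp(−n/τ) is an illustrative assumption, not the paper's fitted model:

```python
import math

# Approximate (actions, success-rate) pairs for OSS-20B, read off the chart
# description above (hypothetical transcription, for illustration only).
points = [(10, 0.88), (25, 0.75), (40, 0.55), (50, 0.31), (100, 0.01)]

# Assume s(n) ~ exp(-n / tau); then log s(n) is linear in n, so an ordinary
# least-squares fit of log-rate against actions recovers the slope -1/tau.
xs = [n for n, _ in points]
ys = [math.log(s) for _, s in points]
k = len(points)
mx = sum(xs) / k
my = sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
tau = -1.0 / slope  # actions per e-fold drop in success rate
print(f"estimated decay scale tau ~ {tau:.1f} actions")
```

A flatter curve such as GPT-5's corresponds to a larger τ under this fit; comparing τ across models gives a single-number summary of the resilience ranking noted above.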
</details>
Figure 16: This figure is added to show that the recent closed-weight (GPT-5) and open-weight (OSS-20B/120B) models released by OpenAI also follow the same universal failure pattern highlighted in this paper. The data and experimental settings are the same as those used in Figure 1 of the main paper. We include Llama-4-Maverick, which also appears in Figure 1, as the benchmark reference.