# seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
**Authors**:
- M.R. Ramezanali⋆, Salesforce AI, Palo Alto, CA (mramezanali@salesforce.com)
- M. Vazifeh⋆, Capital One, MIT, Cambridge, MA (mvazifeh@mit.edu)
- P. Santi, MIT, Cambridge, MA (psanti@mit.edu)

> ⋆ denotes equal contribution.
Abstract
We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, quantifying how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio of distracting to supporting facts about the environment. Our evaluations on state-of-the-art LLMs reveal a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, seqBench's fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in this paper alongside its generation methodology and evaluation metrics. We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed to evolve alongside advancing models, the seqBench datasets are publicly released to spur deeper scientific inquiry into LLM reasoning and to establish a clearer understanding of its true potential and current boundaries for robust real-world application.
Large Language Models (LLMs) have shown remarkable performance (Vaswani et al., 2017; Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022; Du et al., 2021; Fedus et al., 2022; Zoph et al., 2022) on a wide range of tasks and benchmarks spanning diverse human-like capabilities; however, these successes can obscure persistent, fundamental limitations in sequential reasoning. Arguably, reasoning captures a purer form of intelligence, going beyond mere pattern matching or fact memorization, and is thus a critical capability to understand and enhance in AI systems. Recent studies show that state-of-the-art LLMs (OpenAI, 2025; Google DeepMind, 2025; Meta AI, 2025; Mistral AI, 2024; Anthropic, 2025) excel at complex benchmarks yet stumble on simple commonsense inferences that are trivial for an adult human (Nezhurina et al., 2025; Han et al., 2024; Sharma, 2024; Berglund et al., 2024; Yang et al., 2019). Most existing benchmarks saturate quickly, leaving little room for fine-grained attribution studies that systematically probe LLM failure modes. Consequently, a robust understanding of why and under what circumstances these models fail, especially on problems requiring sequential reasoning, remains elusive.
This gap, we argue, stems from the lack of evaluation benchmarks allowing systematic, multi-dimensional control over the key independent factors that influence a task's overall reasoning difficulty. Most benchmarks (Cobbe et al., 2021; Hendrycks et al., 2021; Srivastava et al., 2023; Weston et al., 2015; Clark et al., 2018; Dua et al., 2019; Rein et al., 2023), despite their evaluation merits, do not support systematic variation of crucial complexity dimensions. This makes it difficult to isolate the specific conditions under which reasoning in LLMs falters. For instance, discerning whether a failure is due to the length of the required reasoning chain, the necessity to revise intermediate conclusions, or the density of distracting information is often not quantitatively possible. While prompting strategies like chain-of-thought (CoT) and model scaling have boosted aggregate performance, they often obscure sharp performance cliffs that emerge when these underlying complexity dimensions are varied independently (Wei et al., 2023; Kojima et al., 2022). Without such systematic control, disentangling inherent architectural limitations from those addressable via scaling (model size, data, or compute), fine-tuning, or prompting techniques is challenging. A fine-grained understanding of these performance boundaries is crucial for developing more robust and reliable reasoning systems.
To complement recent efforts (Sprague et al., 2024; Tyagi et al., 2024; Kuratov et al., 2024; Tang and Kejriwal, 2025; Mirzaee et al., 2021; Tikhonov, 2024; Mirzaee and Kordjamshidi, 2022; Shi et al., 2022) in evaluating reasoning, and to address the need for more controlled analysis, we introduce seqBench, a tunable benchmark designed explicitly to probe and analyze sequential reasoning capabilities in language models. The dataset comprises synthetic yet linguistically grounded pathfinding task configurations on two-dimensional grids. Solving each problem requires sequential inference over relevant and distracting structured facts. Each instance is automatically verifiable and parameterized by controllable factors that directly address the previously identified gaps: (1) logical depth (total number of actions in the ground-truth solution, reflecting the length of the reasoning chain); (2) backtracking count (number of locked-door detours on the optimal path, requiring revision of tentative solution paths); and (3) noise ratio (proportion of distracting vs. supporting facts, testing robustness to irrelevant information). Performance against these dimensions can be quantified with fine-grained metrics (e.g., via progress ratio as we define here). We observe that beyond a certain logical depth, Pass@1 success collapses to near zero for all models (see Figure 1). These features enable precise attribution studies of model failure modes, offering insights into the brittle boundaries of current LLM generalization.
[Figure 1 image: Pass@1 success rate vs. number of actions $L$ (0 to 300) on linear (top) and logarithmic (bottom) y-axes, one curve per model with exponential fits $\propto \exp(-L/L_0)$. Fitted $L_0$ values: gemini-2.5-flash-preview-04-17 = 85.7, gemini-2.0-flash = 40.2, Llama-4-Maverick-17B-128E-Instruct-FP8 = 16.7, Llama-3.3-70B-Instruct-Turbo = 10.2, gemma-2-27b-it = 8.1, Qwen2.5-Coder-32B-Instruct = 4.8, Qwen2.5-7B-Instruct-Turbo = 4.0, Llama-3.2-3B-Instruct-Turbo = 1.6.]
Figure 1: Performance collapse of various models with increasing logical depth $L$ for a pathfinding task ($N,M=40$, $\mathcal{B}=2$ keys, noise ratio $\mathcal{N}=0.0$). Success rates (Pass@1) are shown on linear (top panel) and logarithmic (bottom panel) y-axes, averaged over 5 runs per problem across 40 problems per unit $L$-bin. All evaluations used temperature 1.0 and top-p 0.95 (Gemini-2.5-flash: 'auto' thinking). The displayed fits employ a Weighted Least Squares (WLS) method (Carroll and Ruppert, 2017) on log success rates, with weights derived from the inverse squared residuals of a preliminary Ordinary Least Squares (OLS) fit. (Figure 16 in the supplementary section shows a similar pattern for recently released OpenAI models.)
Furthermore, the seqBench benchmark is built upon a scalable data generation framework, allowing it to evolve alongside increasingly capable models to support both model training and evaluation. Through evaluations on popular LLMs, we reveal that top-performing LLMs exhibit steep, universal declines as any of the three complexity dimensions increases, while remaining comparatively robust to fact shuffling, despite the underlying logical structure being unchanged.
Contributions.
Our main contributions are:
1. seqBench: A Tunable Benchmark for Sequential Reasoning. We introduce an open-source framework for generating pathfinding tasks with fine-grained, orthogonal control over logical depth, backtracking steps, and noise ratio. We also evaluate secondary factors such as fact ordering (shuffle ratio; see the supplementary material for details).
2. Comprehensive LLM Attribution Study. Using seqBench, we demonstrate the significant impact of these controlled complexities on LLM performance, revealing sharp performance cliffs in state-of-the-art models even when search complexity is minimal.
The seqBench dataset is publicly available at https://huggingface.co/datasets/emnlp-submission/seqBench under the CC BY 4.0 license to facilitate benchmarking.
[Figure 2 image: first panel, Llama-4-Maverick-17B-128E-Instruct-FP8 Pass@1 success rate vs. number of actions with an exponential fit $\propto \exp(-L/L_0)$, $L_0 = 16.7$; second panel, mean precision, recall, and progress ratio vs. number of actions with error bars, showing precision staying high while recall and progress ratio fall with $L$.]
Figure 2: On the left: Pass@1 success rate of the Llama-4-Maverick-17B-128E-Instruct model versus the number of actions in the ground-truth path of the pathfinding problems ($N,M=40$, $\mathcal{B}=2$ keys, noise ratio $\mathcal{N}=0.0$). The Pass@1 success rate across 5 runs per problem is averaged over problem instances sampled from action-count bins of width 1. On the right: the mean progress ratio across all problems, together with mean precision and recall, highlighting the model's gradually increasing struggle to complete the path. Temperature is set to 1.0 and top-p to 0.95 in all runs.
1 Methods
1.1 Dataset Generation
The seqBench dataset consists of spatial pathfinding tasks. Task instance generation, detailed below (Algorithm 1; See Appendix A for details), is predicated on the precise independent control of the three key complexity dimensions introduced earlier: Logical Depth ( $L$ ), Backtracking Count ( $\mathcal{B}$ ), and Noise Ratio ( $\mathcal{N}$ ). This allows the creation of instances with specific values for these parameters, enabling targeted studies of their impact on LLM reasoning.
Task instances are produced in a multi-stage process. Initially, the primary generation parameters are specified: maze dimensions ($N, M$), target backtracks ($\mathcal{B}_{\text{target}}$), and target noise ratio ($\mathcal{N}_{\text{target}}$). An acyclic maze graph ($M_{g}$) is formed on an $N \times M$ grid using Kruskal's algorithm (Kleinberg and Tardos, 2006). Our "Rewind Construction" method (Algorithm 1) then embeds $\mathcal{B}_{\text{target}}$ backtracking maneuvers by working backward from a goal to strategically place keys and locked doors, yielding the instance's actual backtracking count $\mathcal{B}$. Finally, a natural language fact list ($\mathcal{F}$) is derived from the maze, and distracting facts are added according to $\mathcal{N}_{\text{target}}$ to achieve the final noise ratio $\mathcal{N}$.

The logical depth $L$ (optimal path length) emerges from these generative steps, influenced by $N$, $M$, $\mathcal{B}_{\text{target}}$, and construction stochasticity. While $L$ is not a direct input to the generation algorithm, the process is designed to yield a wide spectrum of logical depths. Each generated instance is then precisely annotated with its emergent $L$ value, alongside its effective $\mathcal{B}$ and $\mathcal{N}$ values. This annotation makes $L$ a key, selectable parameter for users of the seqBench dataset, enabling them to choose or filter tasks by their desired logical depth. Our rewind construction method guarantees task solvability.

The full seqBench benchmark is constructed by systematically applying this instance generation process (detailed in Algorithm 1) across a wide range of initial parameters, including varied grid sizes (e.g., $N \in \{5, \dots, 50\}$, $M \approx N$) and target backtracks ($\mathcal{B}_{\text{target}} \in \{0, \dots, 7\}$), yielding a large and diverse data pool. For each $(N, M, \mathcal{B}_{\text{target}})$ configuration, multiple unique base mazes are generated, to which different noise ratios (e.g., $\mathcal{N}_{\text{target}} \in [0, 1]$) are subsequently applied.

It is important to note that the algorithm constrains backtracking complexity to a simple dependency chain: retrieving the key for each locked door involves at most one backtracking step to pick up its corresponding key, without requiring the unlocking of additional doors along the optimal path. Combined with the uniform random placement of keys, this design ensures a well-balanced distribution of backtracking difficulty across the generated instances for each logical depth $L$. Nevertheless, the same backward-in-time construction can be extended to generate tasks with higher backtracking complexity, for example doors that require multiple keys, or intermediate doors that must be unlocked en route to other keys. Such extensions would introduce richer tree-structured dependency graphs and allow seqBench to probe model performance under more complex long-horizon reasoning regimes.

The creation of this comprehensive data pool was computationally efficient, requiring approximately an hour of computation on a standard laptop while using minimal memory. The publicly released benchmark comprises a substantial collection of these generated instances, each annotated with its specific emergent logical depth $L$, effective backtracking count $\mathcal{B}$, and noise ratio $\mathcal{N}$.
This rich annotation is key, enabling researchers to readily select or filter task subsets by these dimensions for targeted studies (e.g., as done for Figure 1, where instances were sampled into $L$ -bins with other parameters fixed). For the experiments presented in this paper, specific subsets were drawn from this benchmark pool, often involving further filtering or parameter adjustments tailored to the objectives of each study; precise details for each experiment are provided in the relevant sections and figure captions. Full details on path derivation, fact compilation, and overall dataset generation parameters are provided in the Appendix A.
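As a concrete illustration of such filtering, here is a minimal sketch using the Hugging Face `datasets` library. The split and field names (`train`, `logical_depth`, `backtracks`, `noise_ratio`) are assumptions about the released schema, not documented identifiers.

```python
from datasets import load_dataset

# Load the public seqBench dataset; split/field names are illustrative guesses.
ds = load_dataset("emnlp-submission/seqBench", split="train")

# Select a subset by the annotated complexity dimensions, e.g. an L-bin
# comparable to the one used for Figures 3-4.
subset = ds.filter(
    lambda ex: 40 <= ex["logical_depth"] <= 60  # logical depth L in [40, 60]
    and ex["backtracks"] == 2                   # B = 2 keys
    and ex["noise_ratio"] == 0.0                # noise-free fact lists
)
print(f"{len(subset)} instances selected")
```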
Input: Grid $N \times M$, target backtracks $\mathcal{B}$
Output: Maze graph $M_{g}$, locked doors $\mathcal{D}_{L}$, key info $\mathcal{K}_{I}$, path skeleton $\Pi_{S}$

1. $M_{g} \leftarrow$ acyclic graph on grid (Kruskal's);
2. $x \leftarrow C_{goal} \leftarrow$ random goal cell in $M_{g}$;
3. $\mathcal{D}_{L}, \mathcal{K}_{I} \leftarrow \emptyset, \emptyset$; $b \leftarrow 0$;
4. $\Pi_{S} \leftarrow [(C_{goal}, \text{GOAL})]$;
5. while $b < \mathcal{B}$ do
6. &nbsp;&nbsp; $c_{key} \leftarrow$ random cell in $M_{g}$ accessible from $x$ (path avoids $\mathcal{D}_{L}$ for this step);
7. &nbsp;&nbsp; $\pi_{seg} \leftarrow$ unique path in $M_{g}$ from $x$ to $c_{key}$;
8. &nbsp;&nbsp; if $\exists\, e \in \pi_{seg}$ such that $e \notin \mathcal{D}_{L}$ then
9. &nbsp;&nbsp;&nbsp;&nbsp; $d \leftarrow$ randomly select such an edge $e$;
10. &nbsp;&nbsp;&nbsp;&nbsp; $\mathcal{D}_{L} \leftarrow \mathcal{D}_{L} \cup \{d\}$;
11. &nbsp;&nbsp;&nbsp;&nbsp; $K_{id} \leftarrow$ new unique key ID;
12. &nbsp;&nbsp;&nbsp;&nbsp; $\mathcal{K}_{I}[K_{id}] \leftarrow \{\text{opens}: d, \text{loc}: c_{key}\}$;
13. &nbsp;&nbsp;&nbsp;&nbsp; $\Pi_{S}$.prepend($(c_{key}, \text{PICKUP } K_{id})$, $(d, \text{UNLOCK } K_{id})$, $(\pi_{seg}, \text{MOVE})$);
14. &nbsp;&nbsp;&nbsp;&nbsp; $x \leftarrow c_{key}$; $b \leftarrow b + 1$;
15. &nbsp;&nbsp; else
16. &nbsp;&nbsp;&nbsp;&nbsp; break
17. &nbsp;&nbsp; end if
18. end while
19. $\Pi_{S}$.prepend($(x, \text{START})$);
20. return $M_{g}, \mathcal{D}_{L}, \mathcal{K}_{I}, \Pi_{S}$;

Algorithm 1 Rewind Construction of Path Skeleton
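For readers who prefer executable form, below is a minimal Python sketch of Algorithm 1. It substitutes networkx's random minimum spanning tree for Kruskal's algorithm and uses illustrative action labels and helper names; it is not the released generator.

```python
import random
import networkx as nx

def rewind_construction(N, M, B_target, seed=0):
    """Sketch of Algorithm 1: embed B_target key/locked-door backtracking
    maneuvers by working backward from the goal. A random minimum spanning
    tree stands in for Kruskal's algorithm; labels are illustrative."""
    rng = random.Random(seed)
    grid = nx.grid_2d_graph(N, M)
    for u, v in grid.edges:
        grid.edges[u, v]["w"] = rng.random()
    maze = nx.minimum_spanning_tree(grid, weight="w")  # acyclic maze graph M_g

    x = rng.choice(list(maze.nodes))                   # random goal cell
    locked, keys = set(), {}                           # D_L and K_I
    skeleton = [(x, "GOAL")]                           # path skeleton Pi_S
    b = 0
    while b < B_target:
        def path_clear(c):
            seg = nx.shortest_path(maze, x, c)         # unique tree path
            return all(frozenset(e) not in locked for e in zip(seg, seg[1:]))
        cells = [c for c in maze.nodes if c != x and path_clear(c)]
        if not cells:                                  # no eligible key cell
            break
        c_key = rng.choice(cells)
        seg = nx.shortest_path(maze, x, c_key)
        # Every edge on seg is unlocked (guaranteed by path_clear), so the
        # existence check of Algorithm 1 is satisfied; lock one at random.
        d = frozenset(rng.choice(list(zip(seg, seg[1:]))))
        locked.add(d)
        key_id = f"key_{b}"
        keys[key_id] = {"opens": d, "loc": c_key}
        # Prepend earlier-in-time events (construction runs backward in time).
        skeleton = [(c_key, f"PICKUP {key_id}"),
                    (d, f"UNLOCK {key_id}"),
                    (seg, "MOVE")] + skeleton
        x, b = c_key, b + 1
    skeleton = [(x, "START")] + skeleton
    return maze, locked, keys, skeleton
```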
1.2 Prompt Construction and Model Configuration
Our evaluation uses a standardized prompt template with four components: (i) task instructions and action schema, (ii) three few-shot examples of increasing complexity (simple navigation, single-key, and multi-key backtracking), (iii) optional reasoning guidance, and (iv) the problem's natural-language facts. All models are queried with temperature $T{=}1.0$, nucleus sampling $p{=}0.95$, and the maximum allowed output-token limit for each model. For each instance, we perform 5 independent runs to establish robust performance statistics. The complete prompt structure, shown in Figure 6, is provided in Appendix B.
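As a concrete illustration, here is a minimal sketch of the prompt assembly and sampling configuration just described; component contents and names are placeholders, not the actual template from Figure 6.

```python
def build_prompt(instructions, few_shot_examples, guidance, facts):
    """Assemble the four prompt components in order; contents are
    placeholders, not the actual template shown in Figure 6."""
    parts = [instructions]                       # (i) task + action schema
    parts += few_shot_examples                   # (ii) three worked examples
    if guidance:                                 # (iii) optional reasoning hints
        parts.append(guidance)
    parts.append("Facts:\n" + "\n".join(facts))  # (iv) instance fact list
    return "\n\n".join(parts)

SAMPLING = {"temperature": 1.0, "top_p": 0.95}   # plus per-model max tokens
N_RUNS = 5                                       # independent runs per instance
```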
1.3 Evaluation Metrics
To analyze not just whether models succeed but also how they fail, we employ several complementary metrics. Success Rate (Pass@1) measures the proportion of runs where the predicted action sequence exactly matches the ground truth. The Progress Ratio (Tyagi et al., 2024), calculated as $k/n$ (where $n$ is the total number of ground-truth actions and $k$ is the number executed correctly before the first error), pinpoints the position in the reasoning chain where breakdown occurs. We also use Precision and Recall: precision is the proportion of predicted actions that are correct, while recall is the proportion of ground-truth actions that were correctly predicted. Low precision indicates hallucinated actions, while low recall signifies missed necessary actions. Additionally, we visualize error locations via a Violation Map. This multi-faceted approach reveals each model's effective "reasoning horizon", the maximum sequence length it can reliably traverse. Further details on all metrics and visualizations are provided in the supplementary material.
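The sketch below shows one plausible implementation of these per-run metrics. The positional matching used for precision and recall is our reading of the definitions above, not necessarily the authors' exact scoring code.

```python
def evaluate_run(pred, gold):
    """One plausible implementation of the per-run metrics; `pred` and
    `gold` are action sequences, e.g. lists of ('move_to', 'A12') tuples."""
    pass_at_1 = float(pred == gold)                # exact sequence match

    # Progress ratio k/n: gold actions executed correctly before first error.
    k = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        k += 1
    progress_ratio = k / len(gold)

    # Positional precision/recall over the aligned prefix of the sequences.
    correct = sum(p == g for p, g in zip(pred, gold))
    precision = correct / len(pred) if pred else 0.0  # low => hallucinated
    recall = correct / len(gold)                      # low => missed actions
    return pass_at_1, progress_ratio, precision, recall
```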
2 Benchmarking Results
[Figure 3 image: three panels plotting mean progress ratio, Pass@1 success rate, and output token count against the number of backtracking steps (0 to 5) for Llama-4-Maverick-17B-128E-Instruct-FP8, Qwen2.5-Coder-32B-Instruct, Llama-3.1-Nemotron-70B-Instruct-HF, Gemini-2.0-flash, and Gemini-2.5-flash-preview-04-17.]
Figure 3: Performance as a function of the number of required backtracking steps, operationalized via the number of locked doors with distributed keys along the optimal path. Holding all other complexity factors constant, all models exhibit a clear decline in both progress ratio and success rate as backtracking demands increase. Additionally, we report the corresponding rise in output token counts per model, highlighting the increased reasoning burden associated with longer dependency chains. Fixed experimental parameters are the same as in Figure 1 (100 problems per data point, sampled from $L \in [40, 60]$).
2.1 Evaluated Models
We evaluate a diverse set of transformer-based LLMs across different model families and parameter scales. Our analysis includes Google's Gemini models (2.5-flash-preview, 2.0-flash) and Gemma-2-27b, Meta's Llama family (4-Maverick-17B, 3.3-70B, 3.2-3B), and Alibaba's Qwen models (2.5-Coder-32B, 2.5-7B). [Note: GPT-5 was released during the preparation of this paper's final version; our analysis shows that it exhibits the same performance degradation, as shown in Figure 16.] Access to some open-weight models and benchmarking infrastructure was facilitated by platforms such as Together AI (https://www.together.ai/) and Google AI Studio (https://aistudio.google.com/). Problem instances for varying logical depths ($L$) were generated by sampling 40 problems for each $L$, using a fixed maze size of $40 \times 40$ and 2 keys, unless otherwise specified for specific experiments (e.g., when varying the number of keys for the backtracking analysis). All models were evaluated using the standardized prompt template (see Figure 6), the inference settings detailed in Section 1.2, and a common response-parsing methodology. For each task instance, we perform 5 independent runs to establish robust performance statistics, primarily analyzing Pass@1 success rates.
2.2 Universal Performance Collapse with Increasing Logical Depth
A central finding of our study is the universal collapse in reasoning performance observed across all evaluated LLMs when confronted with tasks requiring increasing numbers of sequential inference steps. As illustrated in Figure 1, Pass@1 success rates exhibit a consistent and sharp exponential decay as the ground-truth path length ($L$) increases, rapidly approaching zero past a model-specific point. To quantify and compare this behavior, we fit an exponential decay curve $P(L)=\exp(-L/L_{0})$ to the success rates, deriving a characteristic path length $L_{0}$. This $L_{0}$ value, the path length at which performance drops by a factor of $e^{-1}$, serves as a robust metric for each model's sequential reasoning horizon. Plotting success rates on a semi-logarithmic (log-y) scale against $L$ reveals an approximately linear decay trend across the evaluated regime. This log-linear relationship suggests that errors may accumulate with a degree of independence at each reasoning step, eventually overwhelming the model's capacity for coherent inference. The observed $L_{0}$ values vary significantly, from 85.7 for Gemini-2.5-Flash down to 1.6 for Llama-3.2-3B (Figure 1), underscoring a fundamental bottleneck in current transformer architectures for extended multi-step reasoning.
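Note that under a constant per-step success probability $p$, $P(L) = p^{L} = \exp(-L/L_{0})$ with $L_{0} = -1/\ln p$, which is what motivates fitting a line to log success rates. To make the fitting procedure concrete, below is a minimal sketch of the weighted least-squares estimate of $L_{0}$ described in the Figure 1 caption; function and variable names are our own, and the released analysis code may differ.

```python
import numpy as np

def fit_L0(L, success, eps=1e-6):
    """Sketch of the fit from the Figure 1 caption: success ~ exp(-L / L0),
    estimated by WLS on log success rates, with weights taken from the
    inverse squared residuals of a preliminary OLS fit."""
    L = np.asarray(L, dtype=float)
    y = np.log(np.asarray(success, dtype=float))       # rates must be positive
    X = np.column_stack([np.ones_like(L), L])
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # preliminary OLS fit
    resid = y - X @ beta_ols
    w = 1.0 / np.maximum(resid ** 2, eps)              # WLS weights (floored)
    sw = np.sqrt(w)
    beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return -1.0 / beta_wls[1]                          # slope = -1 / L0
```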
2.3 Impact of Independently Controlled Complexity Dimensions
Beyond the universal impact of logical depth ( $L$ ) discussed in Section 2.2, our benchmark’s ability to independently vary key complexity dimensions allows for targeted analysis of their distinct impacts on LLM reasoning performance. We highlight the effects of noise, backtracking, and fact ordering, primarily focusing on Pass@1 success rates, mean progress ratios, and response token counts.
[Figure 4 image: three panels plotting mean progress ratio, mean Pass@1 success rate, and CoT token count against noise ratio (0.0 to 1.0) for Llama-4-Maverick-17B-128E-Instruct-FP8 and Gemini-2.5-flash-preview-04-17.]
Figure 4: Performance as a function of contextual noise for the Gemini 2.5 Flash and Llama-4-Maverick-17B-128E-Instruct models. As noise increases through the inclusion of distracting or irrelevant facts, both models exhibit a clear and consistent decline in performance. Fixed experimental parameters are the same as in Figure 1 (100 problems per data point, sampled from $L \in [40, 60]$, with the number of keys equal to 2).
Impact of Backtracking Requirements.
Increasing the number of required backtracking steps—operationalized via key-door mechanisms—also leads to a clear and significant decline in Pass@1 success rates and mean progress ratios across all evaluated models as shown in Figure 3. Gemini 2.5 Flash-preview maintains the highest performance but still exhibits a notable drop as backtracking count increases from 0 to 5. This decline in reasoning accuracy is generally accompanied by an increase or sustained high level in the mean number of response tokens (Figure 3, right panel). For example, models like Llama-4 Maverick and Gemini 2.5 Flash-preview show a clear upward trend or maintain high token counts as backtracking complexity rises, reflecting the increased reasoning effort or path length articulated by the models when managing more complex sequential dependencies.
Sensitivity to Noise Ratio.
Model performance is highly sensitive to the noise ratio—the proportion of distracting versus supporting facts. As demonstrated in Figure 4 for Gemini 2.5 Flash and Llama-4 Maverick, increasing the proportion of irrelevant facts consistently and significantly degrades both Pass@1 success rates and mean progress ratios. For instance, Gemini 2.5 Flash’s Pass@1 success rate drops from over 0.7 at zero noise to approximately 0.2 at a noise ratio of 1.0. Llama-4 Maverick, starting with lower performance, also shows a consistent decline. Interestingly, for these two models, the number of CoT (output) tokens remains relatively stable despite the increasing noise and degrading performance (Figure 4, right panel), suggesting that models do not necessarily "work harder" (in terms of output length) when faced with more distractors, but their accuracy suffers.
Fact Ordering (Shuffle Ratio).
In contrast to the strong effects of noise and backtracking, the shuffle ratio (entropy of fact presentation order) within the prompt appears to play a secondary role when varied in isolation. Our experiments, exemplified by the performance of Gemini 2.5 Flash and Llama-4 Maverick (see Appendix C, Figure 14, for details), show that complete shuffling of facts (randomizing their presentation order without adding or removing any information) has a minimal impact on Pass@1 success rates and mean progress ratios. Output token counts also remain stable. This suggests a relative robustness to presentation order as long as all necessary information is present and distinguishable. However, as detailed in the supplementary material, when high noise and high shuffle co-occur, the combined effect can be more detrimental than either factor alone, though noise remains the dominant degrading factor.
2.4 Characterizing Key Failure Modes and Error Patterns
A Key Failure Mode: Omission of Critical Steps.
Beyond simply taking illegal shortcuts, detailed analysis reveals that LLMs often fail by omitting critical sub-goals necessary for task completion. Figure 2 (bottom panel) provides a quantitative view for Llama-4 Maverick (Meta AI, 2025), showing that while precision generally remains high (models infrequently hallucinate non-existent rooms or facts), recall and progress ratio plummet with increasing path length ( $L$ ). This indicates that models predominantly fail by missing necessary actions or entire crucial sub-sequences. For a qualitative example, even capable models like Gemini-2.5-Flash can neglect essential detours, such as collecting a required key, thereby violating sequential dependencies and rendering the task unsolvable (illustrative examples are provided in the Appendix B.4; see Figures 8 and 9). This pattern highlights a fundamental breakdown in robust multi-step planning and execution.
Path-Length Dependent First Errors: The Burden of Anticipated Complexity.
The propensity for models to make critical errors is not uniformly distributed across the reasoning process, nor is it solely a feature of late-stage reasoning fatigue. Examining the distribution of steps at which the first constraint violations occur reveals a counterintuitive pattern: as the total required path length ($L$) of a problem increases, models tend to fail more frequently even at the earliest steps of the reasoning chain. This leftward shift in the first-error distribution, also observed under increasing noise (Appendix B.4; Figures 10 and 11), contradicts a simple cumulative error model in which each step carries a fixed, independent failure probability. Instead, an error at an early step (e.g., step 5) becomes substantially more likely when the model is attempting to solve an 80-step problem versus a 20-step problem. This suggests that the overall anticipated complexity of the full problem influences reasoning quality from the very outset, indicating a struggle with global planning or maintaining coherence over longer horizons, rather than just an accumulation of local errors. This phenomenon may help explain why prompting techniques that decompose long problems into smaller, manageable sub-problems often succeed.
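A minimal sketch of how such a first-error distribution can be tallied from per-run predictions follows; this is our own helper, mirroring the progress-ratio prefix logic, not the authors' analysis code.

```python
import numpy as np

def first_error_histogram(runs):
    """Tally the step index of each run's first deviation from ground truth
    (the view behind the violation map). `runs` is a list of (pred, gold)
    action-sequence pairs; a fully correct prefix counts its length."""
    positions = []
    for pred, gold in runs:
        k = next((i for i, (p, g) in enumerate(zip(pred, gold)) if p != g),
                 min(len(pred), len(gold)))
        positions.append(k)
    return np.bincount(positions)  # histogram over first-error step indices
```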
2.5 Disparity: Information Retention vs. Reasoning Capacity
On seqBench tasks, this disparity is quantitatively striking. While modern LLMs boast million-token contexts, their effective sequential reasoning depth typically remains on the order of hundreds of actions (Figure 1). Even at several hundred actions (e.g., 300 actions, each action such as ('move_to', 'A12') spanning 5-7 tokens, totaling 1.5k-2.1k tokens), this functional limit still consumes a minute fraction of the nominal context. Consequently, the ratio of context capacity to reasoning tokens often spans from several hundred-fold (e.g., 500:1 for 300 actions consuming 2k tokens within a 1M-token context) to far higher values for models with shorter reasoning horizons or larger contexts. This striking gap suggests that while transformers can store and retrieve vast information, their ability to reliably chain it for coherent, multi-step inference appears surprisingly constrained.
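Concretely, under the assumption of roughly 7 tokens per action used above:

$$\frac{C_{\text{context}}}{C_{\text{reasoning}}} \;\approx\; \frac{10^{6}\ \text{tokens}}{300\ \text{actions} \times 7\ \text{tokens/action}} \;\approx\; 476,$$

i.e., on the order of the 500:1 figure quoted for a 2k-token action sequence in a 1M-token context.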
2.6 Challenging the Conventional Performance Hierarchy
While metrics like average $L_{0}$ provide a general ranking of model capabilities, our fine-grained analysis reveals instances that challenge a simple linear performance hierarchy. Scatter plots of progress ratios across different models on identical tasks (see Appendix C Figure 13) show intriguing cases where models with lower overall $L_{0}$ values (i.e., typically weaker models) occasionally solve specific complex problems perfectly, while models with higher average $L_{0}$ values fail on those same instances. These performance inversions suggest that sequential reasoning failures may not solely stem from insufficient scale (parameters or general training) but could also arise from more nuanced reasoning limitations.
3 Related Work
Recent advancements in benchmarks evaluating sequential reasoning capabilities of LLMs have illuminated various strengths and limitations across different dimensions of complexity. These benchmarks typically differ in how they isolate and quantify reasoning challenges, such as logical deduction, retrieval difficulty, combinatorial complexity, and sensitivity to irrelevant information. ZebraLogic (Lin et al., 2025), for instance, targets formal deductive inference through logic-grid puzzles framed as constraint-satisfaction problems (csp, 2008). While valuable for probing deduction, its core methodology leads to a search space that grows factorially with puzzle size (Sempolinski, 2009). This makes it challenging to disentangle intrinsic reasoning failures from the sheer combinatorial complexity of the search. As the ZebraLogic authors themselves acknowledge: "solving ZebraLogic puzzles for large instances may become intractable… the required number of reasoning tokens may increase exponentially with the size of the puzzle." This inherent characteristic means that for larger puzzles, performance is primarily dictated by the manageability of the search space rather than the limits of sequential reasoning depth. GridPuzzle (Tyagi et al., 2024) complements this by providing a detailed error taxonomy for grid puzzles, focusing on what kinds of reasoning mistakes LLMs make. However, like ZebraLogic, it does not offer independent control over key complexity dimensions such as logical depth, backtracking needs, or noise, separate from the puzzle's inherent search complexity.
Other benchmarks conflate reasoning with different cognitive demands. BABILong (Kuratov et al., 2024) tests models on extremely long contexts (up to 50M tokens), primarily assessing the ability to retrieve "needles" (facts) from a "haystack" (distracting text that does not contribute to solving the task). While valuable for evaluating long-context processing, this design makes it hard to disentangle retrieval failures from reasoning breakdowns, as performance is often dictated by finding the relevant information rather than reasoning over it. MuSR (Sprague et al., 2024) embeds reasoning tasks within lengthy narratives (e.g., murder mysteries), mixing information extraction challenges with complex, domain-specific reasoning structures. This realism obscures which specific aspect—extraction or reasoning depth—causes model failures. Dyna-bAbI (Tamari et al., 2021) offers a dynamic framework for compositional generalization but focuses on qualitative combinations rather than systematically varying quantitative complexity metrics needed to find precise failure points.
Spatial reasoning benchmarks, while relevant, also target different aspects. GRASP (Tang and Kejriwal, 2025) assesses practical spatial planning efficiency (like obstacle avoidance) in 2D grids, a different skill than the abstract sequential reasoning seqBench isolates. SPARTQA (Mirzaee et al., 2021) focuses on specialized spatial relational complexity (transitivity, symmetry) using coupled dimensions, preventing independent analysis of factors like path length. SpaRTUN (Mirzaee and Kordjamshidi, 2022) uses synthetic data primarily for transfer learning in Spatial Question Answering (SQA), aiming to improve model performance rather than serve as a diagnostic tool with controllable complexity. Similarly, StepGame (Shi et al., 2022) demonstrates performance decay with more reasoning steps in SQA but lacks the fine-grained, orthogonal controls over distinct complexity factors provided by seqBench.
In contrast, seqBench takes a targeted diagnostic approach. By deliberately simplifying the spatial environment to minimize search complexity, it isolates sequential reasoning. Its core contribution lies in the independent, fine-grained control over (1) logical depth (the number of sequential actions required to solve the task), (2) backtracking count (the number of backtracking steps along the optimal path), and (3) noise ratio (the ratio of distracting to supporting facts). This orthogonal parameterization allows us to precisely pinpoint when and why sequential reasoning capabilities degrade, revealing fundamental performance cliffs even when search and retrieval demands are trivial. seqBench thus offers a complementary tool for understanding the specific limitations of sequential inference in LLMs.
4 Limitations
While seqBench offers precise control over key reasoning complexities, our study has limitations that open avenues for future research:
1. Generalizability and Task Design Fidelity: Our current findings are rooted in synthetic spatial pathfinding tasks. While this allows for controlled experimentation, future work must extend seqBench ’s methodology to more diverse reasoning domains (e.g., mathematical proofs) and incorporate greater linguistic diversity (e.g., ambiguity) to assess the broader applicability of the observed phenomena of performance collapse (quantified by $L_{0}$ ) and failure patterns. Moreover, this work did not investigate whether similar failure modes arise when the problem is also presented visually (e.g., as maze images). Multimodal capabilities could influence spatial reasoning outcomes, and we have already extended the benchmark by releasing maze image generation code alongside the HuggingFace dataset. This dataset can also be used to help train multimodal reasoning models.
2. Model Scope and Understanding Deeper Failure Dynamics: Our current evaluation, while covering diverse public models, should be expanded to a wider array of LLMs—including recent proprietary and newer open-source variants (e.g., GPT, Claude, DeepSeek series)—to rigorously assess the universality of our findings on the characteristic length $L_{0}$ and failure patterns. Furthermore, while seqBench effectively characterizes how reasoning performance degrades with logical depth (i.e., by determining $L_{0}$), two complementary research thrusts are crucial for understanding why. First, systematic investigation is needed to disentangle how $L_{0}$ is influenced by factors such as model architecture, scale (parameters, training data, compute), fine-tuning strategies, and inference-time computation (e.g., chain-of-thought depth). Second, deeper analysis is required to explain the precise mechanisms underlying the observed exponential performance collapse characterized by $L_{0}$ and to account for other non-trivial error patterns, such as path-length dependent first errors. Additionally, the evaluation presented here does not consider how agentic systems capable of tool use perform as reasoning complexity is tuned across various dimensions. Exploring such setups, where the LLM can externalize sub-problems, invoke tools, or backtrack programmatically, could provide valuable insights into whether the same exponential failure modes persist. In particular, one can define sequential problems where the degree of backtracking or sequential tool use is systematically varied, and test whether similar performance drops emerge as the dependency chain grows. We highlight this as a promising direction for future research.
3. Impact of Prompting: Our current study employed standardized prompts and inference settings. A crucial next step is a robust sensitivity analysis to determine how the overall decay behavior is influenced by different prompting strategies (e.g., zero-shot vs. few-shot, decomposition techniques), varied decoding parameters (temperature, top-p), and interactive mechanisms such as self-verification or self-correction. Investigating the potential of these techniques to mitigate the observed sequential inference failures, particularly given seqBench's minimal search complexity, remains a key avenue for future research.
Addressing these points by leveraging frameworks like seqBench will be vital for developing LLMs with more robust and generalizable sequential reasoning capabilities, and for understanding their fundamental performance limits.
5 Conclusion
We introduced seqBench, a novel benchmark framework designed for the precise attribution of sequential reasoning failures in Large Language Models. seqBench's core strength lies in its fine-grained, independent control over fundamental complexity dimensions (most notably logical depth ($L$), backtracking requirements, and noise ratio), its provision of automatically verifiable solutions, and, critically, its minimization of confounding factors like search complexity. This design allows seqBench to isolate and rigorously evaluate the sequential inference capabilities of LLMs, enabling the automatic quantification of fine-grained performance metrics (such as progress ratio) and providing a clear lens into mechanisms often obscured by other benchmarks. The framework's inherent scalability and open-source nature position it as a durable tool for assessing and driving progress in current and future generations of models, ultimately aiming to enhance their utility for complex, real-world problems that often span multiple domains.

Our comprehensive evaluations using seqBench reveal that reasoning accuracy consistently collapses exponentially with increasing logical depth across a diverse range of state-of-the-art LLMs. This collapse is characterized by a model-specific parameter $L_{0}$ (Section 2.2), indicating an inherent architectural bottleneck in maintaining coherent multi-step inference.

In alignment with the goal of advancing NLP's reach and fostering its responsible application in other fields, seqBench provides a valuable resource for precise analysis. It encourages a shift beyond aggregate benchmark scores toward a more nuanced understanding of model capabilities, an essential step for rigorously assessing the true impact and potential risks of applying LLMs in new domains. The insights gleaned from seqBench can inform both NLP developers in building more robust models and experts in other disciplines in setting realistic expectations and co-designing NLP solutions that are genuinely fit for purpose. Targeted improvements, guided by such fundamental understanding, are key to enhancing the robustness of sequential reasoning, making LLMs more reliable partners in interdisciplinary endeavors. Future work should leverage these insights to develop models that can overcome the observed performance cliffs and extend their effective reasoning horizons, unlocking their transformative potential in applications such as navigating complex scientific literature, supporting intricate legal analysis, or enabling robust multi-step planning in critical autonomous systems. Focusing on commonsense reasoning is paramount for NLP to achieve transformative societal impact, moving beyond incremental improvements to genuine breakthroughs.
References
- csp (2008) 2008. Rina Dechter, Constraint Processing, Morgan Kaufmann (2003), ISBN 1-55860-890-7; Francesca Rossi, Peter van Beek, and Toby Walsh, editors, Handbook of Constraint Programming, Elsevier (2006), ISBN 978-0-444-52726-4. Computer Science Review, 2:123–130.
- Anthropic (2025) Anthropic. 2025. Claude 3.7 sonnet. https://www.anthropic.com/news/claude-3-7-sonnet.
- Berglund et al. (2024) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2024. The reversal curse: Llms trained on "a is b" fail to learn "b is a". Preprint, arXiv:2309.12288.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Carroll and Ruppert (2017) Raymond J Carroll and David Ruppert. 2017. Transformation and weighting in regression. Chapman and Hall/CRC.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. Preprint, arXiv:1803.05457.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
- Du et al. (2021) Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, and 8 others. 2021. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Preprint, arXiv:1903.00161.
- Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.
- Google DeepMind (2025) Google DeepMind. 2025. Gemini 2.5 pro experimental. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/.
- Han et al. (2024) Pengrui Han, Peiyang Song, Haofei Yu, and Jiaxuan You. 2024. In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models. Preprint, arXiv:2409.15454.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. Training compute-optimal large language models. Preprint, arXiv:2203.15556.
- Kleinberg and Tardos (2006) Jon Kleinberg and Eva Tardos. 2006. Algorithm Design. Pearson/Addison-Wesley, Boston.
- Kojima et al. (2022) Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
- Kuratov et al. (2024) Yury Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. 2024. Babilong: Testing the limits of llms with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554.
- Lieber et al. (2021) Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. https://www.ai21.com/blog/jurassic-1-technical-details-and-evaluation. White Paper.
- Lin et al. (2025) Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. 2025. Zebralogic: On the scaling limits of llms for logical reasoning. Preprint, arXiv:2502.01100.
- Meta AI (2025) Meta AI. 2025. Llama 4: Open and efficient multimodal language models. https://github.com/meta-llama/llama-models.
- Mirzaee et al. (2021) Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. 2021. Spartqa: A textual question answering benchmark for spatial reasoning. Preprint, arXiv:2104.05832.
- Mirzaee and Kordjamshidi (2022) Roshanak Mirzaee and Parisa Kordjamshidi. 2022. Transfer learning with synthetic corpora for spatial role labeling and reasoning. Preprint, arXiv:2210.16952.
- Mistral AI (2024) Mistral AI. 2024. Mistral large 2. https://mistral.ai/news/mistral-large-2407.
- Nezhurina et al. (2025) Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. 2025. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. Preprint, arXiv:2406.02061.
- OpenAI (2025) OpenAI. 2025. Openai gpt-5, o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, https://openai.com/index/introducing-gpt-5/. The paper's supplementary material (appendix) was revised after the GPT-5 release with a new figure showing that GPT-5 exhibits the same failure pattern observed in this paper.
- Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Matthias Rauh, Po-Sen Huang, and 58 others. 2021. Scaling language models: Methods, analysis & insights from training Gopher. Preprint, arXiv:2112.11446.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. Preprint, arXiv:2311.12022.
- Sempolinski (2009) Peter Sempolinski. 2009. Automatic solutions of logic puzzles.
- Sharma (2024) Manasi Sharma. 2024. Exploring and improving the spatial reasoning abilities of large language models. In I Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation Models.
- Shi et al. (2022) Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. 2022. Stepgame: A new benchmark for robust multi-hop spatial reasoning in texts. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 11321–11329.
- Smith et al. (2022) Samuel Smith, Mostofa Patwary, Brian Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhenhao Liu, Shrimai Prabhumoye, Georgios Zerveas, Vikas Korthikanti, Eric Zhang, Rewon Child, Reza Yazdani Aminabadi, Jared Bernauer, Xia Song, Mohammad Shoeybi, Yuxin He, Michael Houston, Shishir Tiwary, and Bryan Catanzaro. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. Preprint, arXiv:2201.11990.
- Sprague et al. (2024) Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. 2024. Musr: Testing the limits of chain-of-thought with multistep soft reasoning. Preprint, arXiv:2310.16049.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, and 432 others. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Preprint, arXiv:2206.04615.
- Tamari et al. (2021) Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson Liu, Reut Tsarfaty, and Dafna Shahaf. 2021. Dyna-babi: unlocking babi’s potential with dynamic synthetic benchmarking. Preprint, arXiv:2112.00086.
- Tang and Kejriwal (2025) Zhisheng Tang and Mayank Kejriwal. 2025. Grasp: A grid-based benchmark for evaluating commonsense spatial reasoning. Preprint, arXiv:2407.01892.
- Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yi Du, Yanping Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Max Krikun, Dmitry Lepikhin, James Qin, and 38 others. 2022. Lamda: Language models for dialog applications. Preprint, arXiv:2201.08239.
- Tikhonov (2024) Alexey Tikhonov. 2024. Plugh: A benchmark for spatial understanding and reasoning in large language models. Preprint, arXiv:2408.04648.
- Tyagi et al. (2024) Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, and Chitta Baral. 2024. Step-by-step reasoning to solve grid puzzles: Where do llms falter? Preprint, arXiv:2407.14790.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
- Weston et al. (2015) Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards ai-complete question answering: A set of prerequisite toy tasks. Preprint, arXiv:1502.05698.
- Yang et al. (2019) Kaiyu Yang, Olga Russakovsky, and Jia Deng. 2019. SpatialSense: An adversarially crowdsourced benchmark for spatial relation recognition. In International Conference on Computer Vision (ICCV).
- Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. St-moe: Designing stable and transferable sparse expert models. Preprint, arXiv:2202.08906.
Appendices
Appendix A Dataset Generation Details
The seqBench benchmark generates pathfinding tasks by systematically controlling several complexity dimensions. As described in Section 1 (main paper), Algorithm 1 is central to this process. This appendix provides further details on the generation phases, natural language encoding of tasks, and specific dataset parameters.
A.1 Generation Phases
The generation process, guided by Algorithm 1, involves three main phases:
1. Base Maze Construction: An initial $N \times M$ grid is populated, and an acyclic maze graph ($M_{g}$) is formed using Kruskal's algorithm (Kleinberg and Tardos, 2006). This yields a simply connected environment in which a unique path exists between any two cells through the open passages; all internal "walls" are potential door locations. The overall process results in maze instances like the one visualized in Figure 5.
2. Rewind Construction for Path Skeleton and Key/Door Placement: This phase implements the "Rewind Construction" (Algorithm 1 in the main paper). Starting from a randomly selected goal cell ($C_{goal}$), the algorithm works backward to define a solvable path skeleton ($\Pi_{S}$). It iteratively:
    1. selects a cell $c_{key}$ that precedes the current cell $x$ (initially $C_{goal}$) on a path towards it;
    2. identifies the unique path segment $\pi_{seg}$ in $M_{g}$ from $x$ to $c_{key}$;
    3. randomly selects an edge $d$ on $\pi_{seg}$ to become a locked door, adding $d$ to the set of locked doors $\mathcal{D}_{L}$;
    4. conceptually places a new unique key $K_{id}$ at $c_{key}$, storing its information (which door it opens, its location) in $\mathcal{K}_{I}$;
    5. prepends the conceptual steps (moving along $\pi_{seg}$, unlocking door $d$ with $K_{id}$, picking up $K_{id}$ at $c_{key}$), in reverse logical order, to the path skeleton $\Pi_{S}$;
    6. updates the current cell $x$ to $c_{key}$, repeating until the target number of backtracks ($\mathcal{B}$) is achieved or no valid placements remain.
This backward construction guarantees solvability and controlled backtracking complexity. The final agent starting position is the cell $x$ at the end of this phase; a condensed code sketch of the procedure is given after the phase list.
3. Fact Compilation and Noise Injection: Based on the final maze structure ($M_{g}, \mathcal{D}_{L}, \mathcal{K}_{I}$), a set of natural language facts $\mathcal{F}$ is compiled. This includes facts describing room connections, key locations, and door states. Distracting facts are then introduced according to the target noise ratio $\mathcal{N}$. These distractors may describe non-existent connections, spurious keys, or misleading adjacencies, chosen to be plausible yet incorrect.
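For illustration, a condensed Python sketch of the rewind procedure is given below; the `maze.cells` collection and `maze.unique_path` helper are hypothetical stand-ins for the released implementation, and the step bookkeeping is simplified.

```python
import random

def rewind_construction(maze, goal, num_backtracks, rng=random):
    """Sketch of the rewind construction (Algorithm 1). `maze` is assumed
    to expose a `cells` collection and a `unique_path(a, b)` helper that
    returns the unique cell sequence between two cells of the acyclic
    maze graph; both are hypothetical stand-ins, not the released API."""
    locked_doors = []   # D_L: edges converted into locked doors
    key_info = {}       # K_I: key id -> (key location, door it opens)
    skeleton = []       # Pi_S: conceptual solution steps, grown at the front
    x = goal
    for key_id in range(1, num_backtracks + 1):
        c_key = rng.choice([c for c in maze.cells if c != x])
        segment = maze.unique_path(x, c_key)      # pi_seg, from x to c_key
        edges = list(zip(segment, segment[1:]))
        if not edges:
            break
        door = rng.choice(edges)                  # edge d becomes a locked door
        locked_doors.append(door)
        key_info[key_id] = (c_key, door)
        # The forward solution traverses pi_seg in reverse (c_key -> x), so
        # the new steps are prepended to the skeleton in reverse logical order.
        forward = list(reversed(segment))
        steps = [("pick_up_key", key_id)]
        for a, b in zip(forward, forward[1:]):
            if (a, b) == door or (b, a) == door:  # crossing the new locked edge
                steps += [("use_key", key_id), ("unlock_and_open_door_to", b)]
            steps.append(("move_to", b))
        skeleton = steps + skeleton
        x = c_key                                 # the rewind continues from the key cell
    return x, locked_doors, key_info, skeleton   # x is the agent's start cell
```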
[Figure: figs/compath_viz.png]
Figure 5: Example visualization of a $6 \times 6$ seqBench maze instance. Red rectangles denote locked doors, dashed lines indicate the locations of keys corresponding to those doors, and triangles mark the start (upward-pointing) and goal (downward-pointing) positions. This illustrates the spatial nature of the tasks.
A.2 Natural Language Encoding
Each task instance is translated into a set of atomic natural language facts. We use a consistent templating approach:
- Room Connections: "Room A1 and B1 are connected by an open door."
- Locked Connections: "Room C3 and D3 are connected by a closed and locked door."
- Key Requirements: "The locked door between C3 and D3 requires key 5." (Key IDs are simple integers).
- Key Placements: "Key 5 is in room E4." (Room IDs use spreadsheet-like notation, e.g., A1, B2).
- Starting Position: "Bob is in room A2."
- Goal Position: "Alice is in room D5."
The full set of facts for a given problem constitutes its description.
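A minimal sketch of this templating step, under assumed container types for the maze structure (all names are illustrative):

```python
def compile_facts(open_edges, locked_doors, key_info, start, goal):
    """Render a maze instance into the atomic fact templates of Appendix A.2.
    `open_edges`: iterable of (room, room) pairs; `locked_doors`: maps a
    (room, room) pair to the integer key id that opens it; `key_info`: maps
    a key id to the room holding that key. All names are illustrative."""
    facts = []
    for a, b in open_edges:
        facts.append(f"Room {a} and {b} are connected by an open door.")
    for (a, b), key_id in locked_doors.items():
        facts.append(f"Room {a} and {b} are connected by a closed and locked door.")
        facts.append(f"The locked door between {a} and {b} requires key {key_id}.")
    for key_id, room in key_info.items():
        facts.append(f"Key {key_id} is in room {room}.")
    facts.append(f"Bob is in room {start}.")
    facts.append(f"Alice is in room {goal}.")
    return facts

# Example: for a single-key instance like Example 2 in Figure 7, this yields
# facts such as "Room C1 and C2 are connected by a closed and locked door."
# and "Key 1 is in room A2."
```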
A.3 Dataset Parameters and Scope
The seqBench dataset was generated using the following parameter ranges based on the generation configuration:
- Grid Sizes ($N \times M$): $N$ and $M$ range from 5 to 50 (i.e., $5\times5$, $6\times6$, …, $50\times50$), with $M=N$ for all configurations.
- Target Backtracking Steps ( $\mathcal{B}$ ): Values from 0 to 7. This controls the number of key-door mechanisms deliberately placed on the optimal path.
- Noise Ratio ( $\mathcal{N}$ ): Values from $0.0$ (no distracting facts) to $1.0$ (equal number of supporting and distracting facts), typically in increments of $0.2$ .
- Instances per Configuration: For each primary configuration, defined by a specific grid size ($N, M$) and a specific target backtracking step count ($\mathcal{B} \in \{0, \dots, 7\}$), 400 unique base maze instances were generated.
- Logical Depth ($L$): As an emergent property, $L$ varies. Experiments typically select problems from these generated instances that fall into specific $L$ bins (e.g., $L \in [10,11), [11,12), \dots$).
This generation pipeline, leveraging the described parameter ranges and variations, can produce a vast and diverse set of problem instances. The publicly released seqBench dataset, used for the analyses in this paper (see main paper for access link), comprises 7,079 such curated instances. This collection offers a rich resource for studying the combined effects of the controlled complexity dimensions.
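For concreteness, the primary configuration grid implied by these ranges can be enumerated as follows; this is a sketch with illustrative names, not the exact generation script:

```python
from itertools import product

# Parameter ranges from Appendix A.3 (names are illustrative).
GRID_SIZES = [(n, n) for n in range(5, 51)]          # N x M with M = N, 5..50
BACKTRACKS = range(0, 8)                             # target backtracking steps B
NOISE_RATIOS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]        # noise ratio N, steps of 0.2
INSTANCES_PER_CONFIG = 400                           # base mazes per (size, B)

def primary_configurations():
    """Yield one entry per primary configuration (grid size, backtracks);
    noise ratios are applied to each base maze during fact compilation."""
    for (n, m), b in product(GRID_SIZES, BACKTRACKS):
        yield {"grid": (n, m), "backtracks": b, "instances": INSTANCES_PER_CONFIG}
```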
Appendix B Prompt Design and Model Configuration Details
This appendix provides the complete details of the prompt structure and model configurations used for evaluating LLMs on the seqBench benchmark. The overall prompt, illustrated in Figure 6, concatenates four main components which are detailed below.
[Figure: figs/prompt_template.png (panels: Task Description, Reasoning Guidance, Problem Facts, Examples)]
Figure 6: The complete prompt structure passed to the LLMs. This includes: Component 1 (System Instructions and Task Definition), one of the three Few-Shot Examples (Component 2, specifically a simple navigation task), Component 3 (Reasoning Guidance), and an illustration of where the Problem Instance Facts (Component 4) are inserted. For clarity and completeness, the full verbatim text for all three few-shot examples (Component 2) is provided in Figure 7.
B.1 Overall Prompt Components
The prompt presented to the LLMs consists of the following components:
1. System Instructions and Task Definition (Component 1): Outlines the agent's task, the structure of the maze description, valid actions and their syntax, key operational constraints, and the required output format.
2. Few-Shot Examples (Component 2): Three examples are provided to illustrate the task, ranging in complexity. One of these examples (a simple navigation task) is detailed in Figure 6. The verbatim text for all three examples is provided in Figure 7 for completeness.
3. Reasoning Guidance and Self-Assessment (Component 3): Offers step-by-step algorithmic tips for solving the task and asks the model to provide a self-assessment of its confidence and the perceived difficulty of the instance.
4. Problem Instance Facts (Component 4): The specific natural language facts describing the current maze configuration for the task instance. As illustrated in Figure 6, these facts are appended after the preceding components and are followed by the line "YOUR SOLUTION:" to prompt the model. These facts are generated using the templates described in Appendix A. A minimal sketch of how the components are concatenated appears below.
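As a rough illustration of how these four components could be concatenated (component strings and names here are placeholders, not the verbatim templates):

```python
def build_prompt(system_instructions, few_shot_examples, reasoning_guidance, facts):
    """Assemble the four prompt components in the order described above.
    `few_shot_examples` is a list of formatted example strings; `facts`
    is the list of natural language facts for the current instance."""
    parts = [system_instructions]                 # Component 1
    parts.extend(few_shot_examples)               # Component 2 (three examples)
    parts.append(reasoning_guidance)              # Component 3
    parts.append(" ".join(facts))                 # Component 4: problem facts
    parts.append("YOUR SOLUTION:")                # cue for the model's answer
    return "\n\n".join(parts)
```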
1. Example 1 (Simple Navigation): This example, as shown in Figure 6, involves navigating a maze with only open doors.
EXAMPLE:
INPUT:
Maze Structure: Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by an open door. Room D5 and E5 are connected by an open door. Room A2 and A1 are connected by an open door. Room A3 and B3 are connected by an open door. Room A1 and B1 are connected by an open door. Room A4 and A3 are connected by an open door. Room E5 and E4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Room D4 and E4 are connected by an open door. Bob is in room D5. Alice is in room C4.
OUTPUT:
Solution: [('start', 'D5'), ('move_to', 'E5'), ('move_to', 'E4'), ('move_to', 'D4'), ('move_to', 'D3'), ('move_to', 'C3'), ('move_to', 'C4'), ('rescue', 'Alice')]
2. Example 2 (Single-Key Backtracking): This example introduces a single locked door and a corresponding key.
EXAMPLE:
INPUT:
Maze Structure: Room A1 and A2 are connected by an open door. Room A2 and B2 are connected by an open door. Room B1 and B2 are connected by an open door. Room B1 and C1 are connected by an open door. Room C1 and C2 are connected by a closed and locked door. Door between C1 and C2 requires key 1. Key 1 is in room A2. Bob is in room A1. Alice is in room C2.
OUTPUT:
Solution: [('start', 'A1'), ('move_to', 'A2'), ('pick_up_key', '1'), ('move_to', 'B2'), ('move_to', 'B1'), ('move_to', 'C1'), ('use_key', '1'), ('unlock_and_open_door_to', 'C2'), ('move_to', 'C2'), ('rescue', 'Alice')]
3. Example 3 (Multi-Key Backtracking): This example presents a more complex scenario with multiple locked doors and keys, requiring more extensive backtracking.
EXAMPLE:
INPUT:
Maze Structure: Room B5 and B4 are connected by a closed and locked door. The locked door between B5 and B4 requires key 3. Key 3 is in room B5. Room B5 and C5 are connected by a closed and locked door. The locked door between B5 and C5 requires key 16. Key 16 is in room C5. Room B4 and C4 are connected by an open door. Room C4 and C3 are connected by an open door. Room C3 and D3 are connected by a closed and locked door. The locked door between C3 and D3 requires key 10. Key 10 is in room C4. Room D5 and D4 are connected by an open door. Room D4 and D3 are connected by an open door. Room A5 and B5 are connected by an open door. Bob is in room C5. Alice is in room D5.
OUTPUT:
Solution: [('start', 'C5'), ('pick_up_key', '16'), ('use_key', '16'), ('unlock_and_open_door_to', 'B5'), ('move_to', 'B5'), ('pick_up_key', '3'), ('use_key', '3'), ('unlock_and_open_door_to', 'B4'), ('move_to', 'B4'), ('move_to', 'C4'), ('pick_up_key', '10'), ('move_to', 'C3'), ('use_key', '10'), ('unlock_and_open_door_to', 'D3'), ('move_to', 'D3'), ('move_to', 'D4'), ('move_to', 'D5'), ('rescue', 'Alice')]
Figure 7: Few-shot examples provided to guide the LLMs in the maze-solving task. The three examples demonstrate, in increasing order of complexity, simple navigation, single-key backtracking, and multi-key backtracking scenarios.
B.2 Evaluation Metrics and Error Analysis Details
This section provides further details on specific aspects of our evaluation metrics and observed error categories, complementing the overview of metrics (Section 1) and the discussion of failure modes (Section 2) in the main paper.
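Since the required output format (Appendix B.1) is a Python list of tuples, model responses are first parsed before any metric computation; a minimal parsing sketch using only the standard library (the extraction heuristics here are simplified relative to any production evaluator) is:

```python
import ast

def parse_solution(response: str):
    """Extract and parse the first Python-style list of tuples in a model
    response; return None if no well-formed solution is found."""
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end == -1:
        return None
    try:
        solution = ast.literal_eval(response[start:end + 1])
    except (SyntaxError, ValueError):
        return None
    if isinstance(solution, list) and all(isinstance(t, tuple) for t in solution):
        return solution
    return None
```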
Observed Violation Categories.
Failures in model solutions on seqBench tasks can be categorized into several types. Understanding these categories is crucial for interpreting model performance and failure modes. Key types of violations observed include:
- Adjacency errors (e.g., attempting to move between unconnected rooms).
- Locked door errors (e.g., navigating through locked doors without the correct key or without unlocking them).
- Key usage errors (e.g., attempting to use keys not yet collected, or using the wrong key for a door).
- Path inefficiency (e.g., taking unnecessary detours or redundant actions; while not always a hard violation that stops progress, this contributes to solutions not matching the optimal path and thus failing Pass@1).
- Missed critical actions (e.g., failing to pick up a necessary key or unlock a required door). This is a key failure mode discussed in the main paper (Section 2.4) and is often reflected in metrics like low recall or a low progress ratio if the omission occurs early and prevents further correct steps.
Identifying these distinct categories of errors provides a more granular understanding of why models fail on sequential reasoning tasks and helps in the interpretation of aggregate performance metrics reported in the main paper.
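As a sketch of how such categories can be detected automatically, the following validator walks a parsed action sequence and reports the first hard violation; the data structures and category labels are illustrative, not the released evaluation code.

```python
def first_violation(solution, open_edges, locked_doors, key_locations, start):
    """Return (step index, violation category) for the first hard violation
    in a proposed action sequence, or None if the trace is legal.
    `open_edges`: set of frozenset room pairs; `locked_doors`: frozenset
    pair -> required key id; `key_locations`: key id -> room. All names
    and structures are illustrative."""
    room, held, unlocked = start, set(), set()
    for i, (action, arg) in enumerate(solution):
        if action == "start":
            if arg != start:
                return i, "missed critical action"   # wrong starting room
            room = arg
        elif action == "move_to":
            edge = frozenset((room, arg))
            if edge in locked_doors and edge not in unlocked:
                return i, "locked door error"        # moved through a locked door
            if edge not in open_edges and edge not in unlocked:
                return i, "adjacency error"          # rooms are not connected
            room = arg
        elif action == "pick_up_key":
            if key_locations.get(arg) != room:
                return i, "key usage error"          # key is not in this room
            held.add(arg)
        elif action == "use_key":
            if arg not in held:
                return i, "key usage error"          # key not yet collected
        elif action == "unlock_and_open_door_to":
            edge = frozenset((room, arg))
            if locked_doors.get(edge) not in held:
                return i, "key usage error"          # wrong or missing key
            unlocked.add(edge)
    return None  # rescue and optimality (Pass@1) checks are omitted in this sketch
```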
B.3 Violation Map: Qualitative Examples of Model Failures
This section provides qualitative examples of characteristic model failures to illustrate common error types. These examples visually support the discussion of failure modes in the main paper (Section 2.4, "A Key Failure Mode: Omission of Critical Steps"). Figure 8 illustrates a significant error by Gemini-2.5-Flash on a complex task, where the model generates an illegal path, bypassing necessary steps and locked doors. This exemplifies a breakdown in multi-step planning. Additionally, Figure 9 shows another common 'adjacency error', where a model attempts to jump between unconnected rooms. This type of error reveals a critical lapse in grounding the model's generated actions within the spatial adjacencies explicitly stated by the task's input facts.
[Figure: figs/goodexample4040.png (left: optimal path; right: model path)]
Figure 8: Illustrative failure case for Gemini-2.5-Flash on a $40 \times 40$ task with 2 locked doors on the optimal path. Left: Optimal path (yellow). Right: Model's generated path showing an illegal adjacency jump (red arrow), bypassing multiple rooms and a locked door, despite only supporting facts being provided. This highlights a breakdown in multi-step planning.
[Figure: figs/mistakev2.png (left: optimal path; right: model path)]
Figure 9: Illustrative failure case of an 'adjacency error' in model-generated pathfinding on a $20 \times 20$ task with 2 locked doors on the optimal path. The left panel displays the optimal path (yellow) to the target (triangle). The right panel shows a suboptimal path (purple) generated by the model. This example highlights a common error where, after a sequence of actions (in this scenario, following a key acquisition), the model fails to navigate through valid connections and instead attempts to 'jump' directly between two unconnected rooms. Such violations of room adjacency constraints are a recurring failure mode.
B.4 Quantitative Analysis of Error Patterns
To understand how and when models begin to fail within a reasoning sequence, we analyze the distribution of the first violation step. We record the time step at which the initial violation occurs in a model’s generated path. Aggregating this step-indexed data across multiple instances allows us to create temporal distributions of errors. These distributions help determine whether errors tend to cluster early in the reasoning process (potentially indicating issues with initial planning or understanding of the overall problem complexity) or accumulate later (suggesting difficulties in maintaining long chains of inference or context). This analysis complements the discussion in the main paper (Section 2.4, "Path-Length Dependent First Errors: The Burden of Anticipated Complexity").
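A sketch of this aggregation, assuming each evaluation record exposes the instance's logical depth and the index of its first violation (field names are illustrative):

```python
from collections import defaultdict

def first_violation_distribution(records, bin_width=40):
    """Histogram of first-violation steps, grouped by logical depth L.
    Each record is assumed to provide "logical_depth" and
    "first_violation_step" (None if the trace is fully legal); these
    field names are illustrative."""
    by_depth = defaultdict(list)
    for rec in records:
        step = rec["first_violation_step"]
        if step is not None:
            depth_bin = (rec["logical_depth"] // bin_width) * bin_width
            by_depth[depth_bin].append(step)
    return dict(by_depth)  # e.g. {40: [3, 17, ...], 80: [...], ...}
```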
Figure 10 shows how the distribution of these first-error positions shifts with the overall problem complexity, represented by logical depth ( $L$ ). As detailed in the main paper, an increase in $L$ tends to cause errors to occur earlier in the reasoning chain.
[Figure: figs/failure_step_dist_vs_L.png (distributions of max progress step, one panel per solution-step count)]
Figure 10: Distribution of first-violation steps for Gemini-2.5-Flash across varying logical depths ( $L$ ). As $L$ (total required path length) increases, the distribution of first errors tends to shift leftward, indicating that models are more likely to fail at earlier steps in longer problems. This suggests that anticipated global complexity impacts reasoning from the outset. Experimental parameters in this figure are the same as those in Figure 1.
Similarly, Figure 11 illustrates how the introduction of contextual noise (distracting facts) affects the point of failure. Increased noise also tends to precipitate earlier errors in the reasoning sequence, as discussed in the main paper in relation to sensitivity to noise (Section 2.3) and its impact on error patterns (Section 2.4).
[Figure: figs/gemini-progress-ratio-vs-noise.png (progress-ratio histograms, one panel per noise ratio)]
Figure 11: Impact of increasing noise ratio on the distribution of failure steps for Gemini-2.5-Flash. As noise (the proportion of distracting facts) increases, failures tend to occur earlier in the reasoning chain, reflecting increased difficulty in isolating relevant information and maintaining focus. Fixed experimental parameters in this figure are the same as those in Figure 1.
Appendix C Supplementary Figures
This appendix provides supplementary figures that offer further visual support for analyses presented in the main paper. These figures illustrate the impact of various complexity dimensions and provide comparative views of model performance, elaborating on points made throughout Section 2 (Benchmarking Results) of the main paper.
Figure 12 details the performance of Llama-4 Maverick-17B-128E-Instruct under varying levels of noise and fact shuffling. This supports the discussion in the main paper (Section 2.3) on how these factors, especially in combination, affect success rates, with noise being the dominant factor.
[Figure: figs/single_model_vs_steps_count_varied_noise_shuffle_Llama-4-Maverick-17B-128E-Instruct-FP8.png (success rate vs. number of actions under varied noise and shuffle; linear and log scales, with exp(-x/L) reference curves)]
Figure 12: Pass@1 success rate for Llama-4 Maverick-17B-128E-Instruct versus solution length ( $L$ ) under different noise and shuffle ratios. Left: Linear scale. Right: Log-linear scale. Performance degrades with increased noise but is less affected by shuffle ratios. Fixed experimental parameters in this figure are the same as those in Figure 1.
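The dashed reference curves in Figure 12 follow the exponential decay form $\propto \exp(-x/L)$ used throughout the paper. A fit of the characteristic depth can be reproduced along the following lines (the arrays hold placeholder values, not our measured data):

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(L, A, L0):
    """Exponential collapse of Pass@1 with logical depth: A * exp(-L / L0)."""
    return A * np.exp(-L / L0)

# Placeholder data: per-bin logical depths and measured Pass@1 success rates.
depths = np.array([10, 20, 30, 40, 50, 60, 70], dtype=float)
pass_at_1 = np.array([0.95, 0.62, 0.41, 0.26, 0.17, 0.11, 0.07])

(A_hat, L0_hat), _ = curve_fit(decay, depths, pass_at_1, p0=(1.0, 20.0))
print(f"fitted characteristic depth L0 = {L0_hat:.1f}")  # model-specific L0
```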
To illustrate the performance consistency and disparities across different models, as detailed in Section 2.6, Figure 13 presents scatter and density plots of mean progress ratios. These plots clearly demonstrate that model performance hierarchies are not strictly linear. They reveal 'performance inversions': instances, also noted in Section 2.6, where models with typically lower overall performance (e.g., lower average $L_{0}$) occasionally solve specific complex problems that models with higher average $L_{0}$ values fail on.
[Figure: figs/progress_vs_progress.png (pairwise density scatter plots of per-instance progress ratios for DeepSeek-R1, gemini-2.0-flash, gemini-2.5-flash-preview-04-17, and Llama-4-Maverick-17B-128E-Instruct-FP8)]
Figure 13: Scatter and density plots of per-instance progress ratios, comparing model pairs on identical pathfinding task instances to show performance agreement and disparities. Notably, Gemini 2.5 Flash, for example, often succeeds on instances where other models achieve near-zero progress. Data are from the experiments in Figure 1 (main paper).
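To quantify these inversions, one can count, for each model pair, the task instances on which the nominally weaker model strictly outperforms the stronger one. Below is a minimal sketch assuming per-instance progress ratios stored in a pandas DataFrame with one column per model; the column names in the usage comment are hypothetical.

```python
# Minimal sketch: counting 'performance inversions' between two models.
# Assumes `progress` has one row per task instance and one column per
# model, each holding that model's progress ratio in [0, 1].
import pandas as pd

def inversion_rate(progress: pd.DataFrame, weak: str, strong: str,
                   margin: float = 0.0) -> float:
    """Fraction of instances where the weaker model beats the stronger
    one by more than `margin` in progress ratio."""
    return float((progress[weak] > progress[strong] + margin).mean())

# Hypothetical usage with illustrative column names:
# inversion_rate(df, weak="Llama-4-Maverick", strong="gemini-2.5-flash")
```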
Figure 14 isolates the impact of the shuffle ratio on model performance when other factors, such as noise, are held fixed. This visualization corresponds to the finding discussed in the main paper (Section 2.3, "Fact Ordering (Shuffle Ratio)") that simply reordering facts has minimal impact on the evaluated models under low-noise conditions.
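As a concrete reading of this knob, the sketch below reorders a randomly chosen fraction `ratio` of fact positions while leaving the rest in their original order, so `ratio = 0` preserves the canonical ordering and `ratio = 1` fully permutes it. This is an illustrative reconstruction consistent with the description above, not seqBench's actual generator code.

```python
# Hedged sketch: applying a shuffle ratio to an ordered fact list.
# A fraction `ratio` of positions is sampled at random and their
# contents permuted; the remaining facts keep their original places.
import random

def apply_shuffle_ratio(facts: list[str], ratio: float,
                        rng: random.Random | None = None) -> list[str]:
    rng = rng or random.Random(0)
    n_shuffled = int(round(ratio * len(facts)))
    positions = rng.sample(range(len(facts)), n_shuffled)
    values = [facts[i] for i in positions]
    rng.shuffle(values)
    out = list(facts)
    for pos, val in zip(positions, values):
        out[pos] = val
    return out
```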
Figure 15 isolates the impact of the number of in-context examples in the instruction prompt, showing a modest improvement when several examples are included rather than none or only one.
Figure 16, added in this revised version of the supplementary section, shows that even the most recent state-of-the-art models released by OpenAI suffer from the same performance drop observed in the main paper.
<details>
<summary>figs/fig_vs_shuffle_fixed_L_keys2_noise0.2.png Details</summary>

Three line charts versus shuffle ratio (0.0 to 1.0) for Llama-4-Maverick-17B-128E-Instruct-FP8 and gemini-2.5-flash-preview-04-17: mean progress ratio (flat near 0.2 and 0.65-0.7, respectively), mean success rate Pass@1 (near 0.0 and 0.5-0.55), and CoT tokens (about 1600 versus about 350-400). All curves are essentially flat across the shuffle-ratio range.
</details>
Figure 14: Impact of shuffle ratio on Pass@1 success rate. Varying the degree of mixing (shuffling) between supporting and distracting facts has minimal impact on performance for Gemini 2.5 Flash and Llama-4 Maverick, suggesting robustness to fact order when noise is controlled. Maze instances for these tasks were generated and sampled with the same methodology as the experiments in the main paper (Figures 3 and 4).
<details>
<summary>figs/maze_ablation_analysis.png Details</summary>

Line chart of success rate versus number of actions (0 to 200) for Llama-4-Maverick-17B-128E-Instruct-FP8 under five prompting configurations: zero-, 1-, 3-, and 5-shot with guided CoT, plus 3-shot unguided. All configurations decay from roughly 0.45-0.68 at short lengths to about 0.01-0.03 beyond 100 actions.
</details>
Figure 15: The impact of including different numbers of reference examples in the prompt for in-context learning. Increasing the number of examples yields slight improvements in performance. The experimental parameters used here are the same as those in Figure 1.
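For reference, the sketch below shows one way a k-shot prompt with an optional guided-CoT preamble could be assembled for these ablations. The preamble wording and the task/solution template are placeholders, not the paper's actual prompts.

```python
# Hedged sketch: assembling a k-shot prompt with an optional guided-CoT
# preamble. Template strings are illustrative placeholders.
def build_prompt(task: str, examples: list[tuple[str, str]],
                 guided_cot: bool = True) -> str:
    parts = []
    if guided_cot:
        parts.append("Think step by step: track your current location, "
                     "keys collected, and doors opened before answering.")
    for ex_task, ex_solution in examples:  # the k in-context examples
        parts.append(f"Task:\n{ex_task}\nSolution:\n{ex_solution}")
    parts.append(f"Task:\n{task}\nSolution:")  # the query instance
    return "\n\n".join(parts)
```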
<details>
<summary>figs/model_comparison_openai.png Details</summary>

Line chart of success rate versus number of actions (0 to 300) for GPT-5, OSS-120B, OSS-20B, and Llama-4-Maverick. Approximate readings: GPT-5 falls from 1.0 to 0.08 at 300 actions; OSS-120B from 0.98 to near zero beyond 150; OSS-20B from 0.88 to near zero beyond 100; Llama-4-Maverick from 0.65 to near zero beyond 100.
</details>
Figure 16: Both the recent closed (GPT-5) and open-source (OSS-20B/120B) models released by OpenAI follow the same universal failure patterns highlighted in this paper. The data and experimental settings are the same as those used in Figure 1 of the main paper. Llama-4-Maverick, which also appears in Figure 1, is included as a reference baseline.
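A model-specific depth scale such as the average $L_{0}$ referenced earlier can be estimated by fitting these success-rate curves. The sketch below fits one plausible functional form (success near 1 up to a knee $L_{0}$, exponential decay beyond it) with `scipy.optimize.curve_fit`; both the form and the starting values are assumptions rather than the paper's fitting procedure, and the commented data points are approximate values read off Figure 16.

```python
# Hedged sketch: estimating a knee depth L0 and decay scale lam from
# success-rate-vs-length data, assuming s(L) = exp(-max(0, L - L0)/lam).
import numpy as np
from scipy.optimize import curve_fit

def knee_decay(L, L0, lam):
    # Success stays ~1 until depth L0, then decays exponentially.
    return np.exp(-np.maximum(0.0, L - L0) / lam)

def fit_depth(lengths, success_rates):
    (L0, lam), _ = curve_fit(
        knee_decay,
        np.asarray(lengths, dtype=float),
        np.asarray(success_rates, dtype=float),
        p0=[50.0, 50.0],
        bounds=([0.0, 1.0], [300.0, 500.0]),
    )
    return L0, lam

# Approximate GPT-5 readings from Figure 16:
# fit_depth([0, 25, 50, 100, 150, 200, 250, 300],
#           [1.0, 0.98, 0.84, 0.62, 0.51, 0.18, 0.16, 0.08])
```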