# A Formal Comparison Between Chain of Thought and Latent Thought
**Authors**: Kevin Xu, Issei Sato
## Abstract
Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.
## 1 Introduction
Transformer-based large language models (LLMs) (Vaswani et al., 2017) have shown strong performance across diverse tasks and have recently been extended to complex reasoning. Rather than directly predicting final answers, generating intermediate reasoning steps, known as chain of thought (CoT) (Wei et al., 2022), enhances reasoning abilities. This naturally raises the question: why is CoT effective for complex tasks? Recent studies have approached this question by framing reasoning as a computational problem and analyzing its complexity (Feng et al., 2023; Merrill and Sabharwal, 2024; Li et al., 2024; Nowak et al., 2024), showing that CoT improves performance by increasing the model’s effective depth through iterative computation, thereby enabling the solution of problems that would otherwise be infeasible.
As an alternative to CoT, recent work has explored latent thought, which reasons directly in the hidden state space rather than in the discrete token space. This paradigm includes chain of continuous thought (Coconut) (Hao et al., 2025), which replaces next-token embeddings with hidden states, and looped Transformer (looped TF), in which output hidden states are iteratively fed back as inputs (Dehghani et al., 2019). Such iterative architectures have been shown to enhance expressivity: Coconut enables the simultaneous exploration of multiple traces (Zhu et al., 2025a; Gozeten et al., 2025), while looped TF satisfies universality (Giannou et al., 2023; Xu and Sato, 2025). Empirically, looped TF achieves competitive performance with fewer parameters (Csordás et al., 2024; Bae et al., 2025; Zhu et al., 2025b) and improves reasoning performance (Saunshi et al., 2025).
Figure 1: Overview of the formal comparison between chain of thought and latent thought reasoning with respect to parallel computation and approximate counting, providing upper, lower, or exact bounds that highlight their respective characteristics.
These reasoning paradigms share the core idea of iteratively applying Transformers to enhance expressive power, which naturally leads to a fundamental question:
*What is the separation between chain of thought and latent thought?*
Recent studies characterize how expressivity scales with the number of iterations. Specifically, it has been shown that looped TF subsumes deterministic CoT (Saunshi et al., 2025), and exhibits a strict separation with only a logarithmic number of iterations (Merrill and Sabharwal, 2025a). Nevertheless, fundamental questions remain open:
*Does the separation extend beyond the logarithmic regime?*
*Is latent thought always more expressive than CoT?*
### 1.1 Our Contributions
In this work, we address both questions by clarifying the respective strengths and limitations of the two reasoning paradigms through a formal complexity-theoretic analysis of their expressive power. Specifically, we show that latent thought gains efficiency from its parallelizability, yielding separations beyond the polylogarithmic regime. In contrast, CoT benefits from stochasticity, which enables approximate counting. An overview is given in Fig. 1.
**Latent thought enables parallel reasoning.**
By formalizing decision problems as the evaluation of directed acyclic graphs (DAGs), we reveal the parallel computational capability that latent thought derives from its continuous hidden states. This analysis can be formalized by relating the class of decision problems realizable by the model to Boolean circuits. In particular, Boolean circuits composed of logic gates such as AND, OR, NOT, and Majority, with polylogarithmic depth $\log^{k}n$ for $k\in\mathbb{N}$ and input size $n$ , define the class $\mathsf{TC}^{k}$ , a canonical model of parallel computation. Circuit complexity plays a central role in analyzing the computational power of Transformer models: fixed-depth Transformers without CoT are known to be upper-bounded by $\mathsf{TC}^{0}$ (Merrill and Sabharwal, 2023), and subsequent studies analyze how CoT scales this computational power with the number of reasoning steps, in terms of Boolean circuit complexity (Li et al., 2024). We show that latent thought with $\log^{k}n$ iterations exactly captures the power of $\mathsf{TC}^{k}$ (Thm. 3.12); in contrast, CoT with $\log^{k}n$ steps cannot realize the full power of $\mathsf{TC}^{k}$ (Lem. 3.13) due to its inherent sequentiality. This yields a strict separation in favor of latent thought in the polylogarithmic regime (Thm. 3.15), showing its efficiency in terms of the required number of iterations.
**CoT enables approximate counting.**
A counting problem is a fundamental task in mathematics and computer science that determines the number of solutions satisfying a given set of constraints, including satisfying assignments of Boolean formulas, graph colorings, and partition functions (Arora and Barak, 2009). While exact counting for the complexity class $\#\mathsf{P}$ is generally computationally intractable, approximation provides a feasible alternative. We show that CoT supports fully polynomial-time randomized approximation schemes ( $\mathsf{FPRAS}$ ), yielding reliable estimates even in cases where exact counting via deterministic latent thought reasoning is intractable (Lemma 4.3). Furthermore, leveraging classical results connecting approximate counting and sampling (Jerrum et al., 1986), we extend this separation to distribution modeling: there exist target distributions that CoT can approximately represent and sample from, but that remain inaccessible to latent thought (Theorem 4.4). To the best of our knowledge, this constitutes the first formal separation in favor of CoT.
## 2 Background
### 2.1 Models of Computation
We define a class of reasoning paradigms in which a Transformer block (Vaswani et al., 2017) is applied iteratively. Informally, CoT generates intermediate reasoning steps explicitly as tokens in an autoregressive manner. Formal definitions and illustrations are given in Appendix A.
**Definition 2.1 (CoT, following Merrill and Sabharwal (2024))**
*Let ${\mathcal{V}}$ be a vocabulary, and let $\mathrm{TF}_{\mathrm{dec}}:{\mathcal{V}}^{*}\to{\mathcal{V}}$ denote a decoder-only Transformer. Given an input sequence $x=(x_{1},\dots,x_{n})\in{\mathcal{V}}^{n}$ , the outputs of CoT are defined by
$$
f_{\mathrm{cot}}^{0}(x)\coloneq x,\quad f_{\mathrm{cot}}^{k+1}(x)\coloneq f_{\mathrm{cot}}^{k}(x)\cdot\mathrm{TF}_{\mathrm{dec}}(f_{\mathrm{cot}}^{k}(x)),
$$
where $\cdot$ denotes concatenation. We define the output to be the last tokens of $f_{\mathrm{cot}}^{T(n)}(x)\in{\mathcal{V}}^{\,n+T(n)}$ .*
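To make the recursion concrete, here is a minimal sketch of Definition 2.1 in Python; `tf_dec` is a stand-in for an arbitrary decoder-only Transformer, not an implementation from the paper.

```python
from typing import Callable, List

Token = str

def cot_generate(tf_dec: Callable[[List[Token]], Token],
                 x: List[Token], T: int) -> List[Token]:
    """Unroll f_cot^T(x): each step appends the next token TF_dec emits."""
    seq = list(x)              # f_cot^0(x) = x
    for _ in range(T):         # f_cot^{k+1}(x) = f_cot^k(x) . TF_dec(f_cot^k(x))
        seq.append(tf_dec(seq))
    return seq                 # the answer is read off the last tokens
```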
Coconut feeds the final hidden state as the embedding of the next token. Although the original Coconut model (Hao et al., 2025) can generate both language tokens and hidden states, we focus exclusively on hidden state reasoning steps, in order to compare its representational power with that of CoT. Here, $\mathbb{F}$ denotes the set of finite-precision floating-point numbers, and $d\in\mathbb{N}$ denotes the embedding dimension.
**Definition 2.2 (Coconut)**
*Let ${\mathcal{V}}$ be a vocabulary and let $\mathrm{TF}^{\mathrm{Coconut}}_{\mathrm{dec}}:{\mathcal{V}}^{*}\times(\mathbb{F}^{d})^{*}\to\mathbb{F}^{d}$ be a decoder-only Transformer that maps a fixed token prefix together with a hidden state to the next hidden state. Given an input sequence $x=(x_{1},\dots,x_{n})\in{\mathcal{V}}^{n}$ , we define the hidden states recursively by
$$
h^{0}\coloneq\bigl(e(x_{i})\bigr)_{i=1}^{n},\quad h^{k+1}\coloneq\mathrm{TF}^{\mathrm{Coconut}}_{\mathrm{dec}}(x,h^{k}),
$$
where $e:{\mathcal{V}}\to\mathbb{F}^{d}$ denotes an embedding. The output after $T(n)$ steps is obtained by decoding a suffix of the hidden state sequence ending at $h^{T(n)}$ .*
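The recursion of Definition 2.2 can be sketched analogously; `tf_coconut` and `embed` are illustrative stand-ins, and we assume, as one natural reading of the definition, that each new hidden state is appended to the latent sequence.

```python
import numpy as np
from typing import Callable, List

def coconut_generate(tf_coconut: Callable[[List[str], np.ndarray], np.ndarray],
                     embed: Callable[[str], np.ndarray],
                     x: List[str], T: int) -> np.ndarray:
    """Unroll Definition 2.2: the next *hidden state* plays the role of
    the next token; no intermediate token is ever decoded."""
    h = np.stack([embed(t) for t in x])   # h^0 = (e(x_1), ..., e(x_n))
    for _ in range(T):                    # h^{k+1} = TF_coconut(x, h^k)
        h = np.vstack([h, tf_coconut(x, h)])
    return h                              # decode a suffix for the output
```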
Looped TFs, by contrast, feed the entire model output back into the input without generating explicit tokens, recomputing all hidden states of the sequence at every iteration.
**Definition 2.3 (Looped TF)**
*Let $\mathrm{TF}:\mathbb{F}^{d\times*}\to\mathbb{F}^{d\times*}$ denote a Transformer block. Given an input sequence $x=(x_{1},\dots,x_{n})\in{\mathcal{V}}^{n}$ , the outputs are defined recursively by
$$
f_{\mathrm{loop}}^{0}(x)\coloneq\bigl(e(x_{i})\bigr)_{i=1}^{n},\quad f_{\mathrm{loop}}^{k+1}(x)\coloneq\mathrm{TF}(f_{\mathrm{loop}}^{k}(x)),
$$
where $e:{\mathcal{V}}\to\mathbb{F}^{d}$ denotes an embedding. The output after $T(n)$ loop iterations is the decoded last tokens of $f_{\mathrm{loop}}^{T(n)}(x)$ .*
Here, we assume that the input for looped TF may include sufficient padding so that its length is always at least as large as the output length, as in (Merrill and Sabharwal, 2025b). The definitions of the models describe their core architectures; the specific details may vary depending on the tasks to which they are applied.
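For contrast, a sketch of the looped TF recursion of Definition 2.3: unlike Coconut, nothing is appended; the entire hidden-state sequence is recomputed at every iteration (`tf_block` and `embed` are again stand-ins).

```python
from typing import Callable, List, Sequence

def looped_tf(tf_block: Callable[[List[Sequence[float]]], List[Sequence[float]]],
              embed: Callable[[str], Sequence[float]],
              x: List[str], T: int) -> List[Sequence[float]]:
    """Unroll Definition 2.3: feed all output hidden states back as inputs."""
    h = [embed(t) for t in x]  # f_loop^0(x) = (e(x_1), ..., e(x_n))
    for _ in range(T):         # f_loop^{k+1}(x) = TF(f_loop^k(x))
        h = tf_block(h)        # same length in, same length out
    return h                   # decode the last positions for the output
```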
Table 1: Comparison between prior theoretical analyses and our work on the computational power of CoT, Coconut, and looped TF.
### 2.2 Related Work
To understand the expressive power of Transformers, previous work studies which classes of problems can be solved and with what computational efficiency. These questions can be naturally analyzed within the framework of computational complexity theory. Such studies on CoT and latent thought are summarized below and in Table 1.
**Computational power of chain of thought.**
The expressivity of Transformers is limited by bounded depth (Merrill and Sabharwal, 2023), whereas CoT enhances their expressiveness by effectively increasing the number of sequential computational steps, enabling the solution of problems that would otherwise be intractable for fixed-depth architectures (Feng et al., 2023). Recent work has investigated how the expressivity of CoT scales with the number of reasoning steps, formalizing CoT for decision problems and computational complexity classes (Merrill and Sabharwal, 2024; Li et al., 2024). Beyond decision problems, CoT has been further formalized in a probabilistic setting for representing probability distributions over strings (Nowak et al., 2024).
**Computational power of latent thought.**
Latent thought is an alternative paradigm for increasing the number of computational steps without being constrained to the language space, with the potential to enhance model expressivity. In particular, Coconut has been shown to enable the simultaneous exploration of multiple candidate reasoning traces (Zhu et al., 2025a; Gozeten et al., 2025). Looped TFs can simulate iterative algorithms (Yang et al., 2024; de Luca and Fountoulakis, 2024) and, more generally, realize polynomial-time computations (Giannou et al., 2023). Recent results further demonstrate advantages over chain of thought reasoning: looped TFs can subsume the class of deterministic computations realizable by CoT using the same number of iterations (Saunshi et al., 2025), and exhibit a strict separation within the same logarithmic iterations (Merrill and Sabharwal, 2025a). Concurrent work (Merrill and Sabharwal, 2025b) further establishes that looped TFs, when equipped with padding and a polylogarithmic number of loops, are computationally equivalent to uniform $\mathsf{TC}^{k}$ .
## 3 Latent Thought Enables Parallel Reasoning
We formalize the reasoning problem as a graph evaluation problem. Section 3.2 illustrates how each model approaches the same problem differently, providing intuitive insight into their contrasting capabilities. Building on these observations, Section 3.3 characterizes their expressive power and establishes a formal separation between them.
### 3.1 Problem Setting
Figure 2: Comparison of reasoning paradigms for evaluating a DAG. (a) A computation graph $G_{n}$ . (b) Latent thought can simulate the computation layer by layer in parallel, using a number of loops equal to the depth of the graph, $\mathrm{depth}(G_{n})$ . (c) CoT can sequentially simulate the computation node by node, using a number of steps proportional to the size of the graph, $O(\mathrm{size}(G_{n}))$ .
Reasoning problems that can be solved by straight-line programs admit representations as directed acyclic graphs (DAGs) (Aho and Ullman, 1972), as illustrated in Fig. 2 (a).
**Definition 3.1 (Computation graph)**
*Let $\Sigma$ be a finite alphabet, and let $\mathcal{F}$ denote a finite set of functions $f:\Sigma^{*}\to\Sigma$ . A computation graph is a directed acyclic graph $G_{n}=(V_{n},E_{n})$ that defines a function $F_{G_{n}}:\Sigma^{n}\to\Sigma^{m(n)}$ , where $m(n)$ denotes the output length. Here $V_{n}$ denotes the set of nodes, consisting of (i) $n$ input nodes with in-degree $0$ , (ii) function nodes labeled by $f\in\mathcal{F}$ , which take as arguments the predecessor nodes specified by their incoming edges in $E_{n}$ , and (iii) $m(n)$ output nodes with out-degree $0$ . The overall function is obtained by evaluating the graph in topological order. The size of the graph is $|V_{n}|$ , denoted by $\mathrm{size}(G_{n})$ , and its depth is the length of the longest path from an input to an output node, denoted by $\mathrm{depth}(G_{n})$ .*
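As a concrete rendering of Definition 3.1, the following sketch evaluates a computation graph in topological order; the encoding of nodes and edges is our own illustrative choice.

```python
import operator
from graphlib import TopologicalSorter

def evaluate_dag(inputs, nodes, edges):
    """Evaluate F_{G_n} by visiting nodes in topological order.

    inputs: {input node: symbol}, nodes: {function node: f in F},
    edges:  {function node: ordered list of predecessor nodes}.
    """
    values = dict(inputs)
    deps = {v: edges.get(v, []) for v in list(inputs) + list(nodes)}
    for v in TopologicalSorter(deps).static_order():
        if v in nodes:  # input nodes already carry their values
            values[v] = nodes[v](*[values[u] for u in edges[v]])
    return values

# e.g. a depth-1 graph computing y1 = x1 + x2:
assert evaluate_dag({"x1": 2, "x2": 3},
                    {"y1": operator.add},
                    {"y1": ["x1", "x2"]})["y1"] == 5
```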
**Assumptions on models.**
Our goal is to evaluate the computational efficiency of each model via an asymptotic analysis of how the required number of reasoning steps or loops scales with the input size $n$ . Beyond time complexity, we also allow the space complexity of the model to scale with the input size $n$ . In particular, the embedding dimension in Transformer blocks can be viewed as analogous to the number of processors in classical parallel computation models. Accordingly, we adopt a non-uniform computational model, in which a different model is allowed for each input size. This non-uniform setting is standard in the study of circuit complexity and parallel computation (Cook, 1985), and is consistent with prior analyses of Transformers and CoT (Sanford et al., 2024b; Li et al., 2024).
**On the fairness of comparing steps and loops.**
We analyze expressivity in terms of the number of reasoning steps. Although this may appear unfair in terms of raw computation, it is justified when comparing latency. Specifically, CoT benefits from KV caching, which makes each step computationally inexpensive; however, accessing cached states is typically memory-bound, leaving compute resources underutilized. In contrast, looped TFs recompute over the full sequence at each iteration, incurring higher arithmetic cost but achieving higher arithmetic intensity and better utilization of modern parallel hardware. As a result, the latency of looped TFs is comparable to that of CoT.
### 3.2 CoT Suffices with Size-scaled Steps and Latent Thought Suffices with Depth-scaled Iterations
We show how each model can evaluate DAGs, which provides a lower bound on their expressivity and offers intuition for the distinctions between the models, in terms of sequentiality and parallelizability. Before presenting our main result, we first state the underlying assumptions.
**Definition 3.2 (Log-precision; Merrill and Sabharwal, 2023)**
*The model is log-precision, where each scalar is stored with $O(\log n)$ bits and every arithmetic operation is rounded to that precision.*
**Assumption 3.3 (Poly-size graph)**
*$\mathrm{size}(G_{n})\in\mathsf{poly}(n)$ .*
**Assumption 3.4 (Poly-efficient approximation, cf. Feng et al., 2023)**
*Each node function of $G_{n}$ can be approximated by a log-precision feedforward network whose parameter size is polynomial in the input length and the inverse of the approximation error. We denote by $\mathrm{ff\_param}(G_{n})$ an upper bound such that every $f\in\mathcal{F}$ admits such a network with at most $\mathrm{ff\_param}(G_{n})$ parameters.*
Under these assumptions, we show that CoT can simulate computation by sequentially decoding nodes, where intermediate tokens serve as a scratchpad allowing the evaluation of each node once all its predecessors have been computed.
**Theorem 3.5 (CoT for DAGs)**
*Let $\{G_{n}\}_{n\in\mathbb{N}}$ be a family of computation graphs that satisfy Assumptions 3.3 and 3.4. Then, for each $n\in\mathbb{N}$ , there exists a log-precision CoT with parameter size bounded by $O(\mathrm{ff\_param}(G_{n}))$ , such that for every input $x\in\Sigma^{n}$ , the model outputs $F_{G_{n}}(x)$ with steps proportional to the “size” of the graph, i.e., $O(\mathrm{size}(G_{n}))$ .*
*Proof sketch.*
At each step, the attention layer retrieves the outputs of predecessor nodes from previously generated tokens, and a feed-forward layer then computes the node function, whose output is generated as the next token. ∎
In contrast, latent thought can operate in parallel, layer by layer, where all nodes at the same depth are computed simultaneously, provided that the model has sufficient size.
**Theorem 3.6 (Latent thought for DAGs)**
*Let $\{G_{n}\}_{n\in\mathbb{N}}$ be a family of computation graphs that satisfy Assumptions 3.3 and 3.4. Then, for each $n\in\mathbb{N}$ , there exist a log-precision Coconut and a looped TF with parameter size $O(\mathrm{ff\_param}(G_{n})\cdot\mathrm{size}(G_{n}))$ , such that for every input $x\in\Sigma^{n}$ , each computes $F_{G_{n}}(x)$ with iterations proportional to the “depth” of the graph $G_{n}$ , i.e., $O(\mathrm{depth}(G_{n}))$ .*
*Proof sketch.*
The role assignment of each layer is based on (Li et al., 2024), as shown in Figure 9. An attention layer aggregates its inputs into a single hidden state. Unlike discrete tokens, continuous latent states allow the simultaneous encoding of the outputs of multiple nodes, enabling the feed-forward layer to compute node functions in parallel. ∎
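The contrast between Theorems 3.5 and 3.6 can be seen in the following schedule sketch: the number of `rounds` below equals $\mathrm{depth}(G_{n})$ , whereas decoding one node per CoT step would take $\mathrm{size}(G_{n})$ steps (the function name and encoding are ours).

```python
def eval_layer_parallel(inputs, nodes, edges):
    """Evaluate a DAG level by level: every node whose predecessors are
    already available is computed in the same round, mirroring one
    latent-thought iteration per layer of the graph."""
    values, rounds = dict(inputs), 0
    pending = set(nodes)
    while pending:
        ready = {v for v in pending
                 if all(u in values for u in edges[v])}
        for v in ready:            # conceptually simultaneous
            values[v] = nodes[v](*[values[u] for u in edges[v]])
        pending -= ready
        rounds += 1                # rounds == depth(G_n) on termination
    return values, rounds
```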
**Remark.**
Illustrations are provided in Fig. 2, with formal proofs deferred to Appendix B. These results reveal distinct characteristics: CoT can utilize intermediate steps as scratchpad memory and perform computations sequentially, whereas latent thought can leverage structural parallelism to achieve greater efficiency with sufficient resources.
### 3.3 Separation in Polylogarithmic Iterations
In this section, we shift to formal decision problems to precisely characterize the computational power of each reasoning paradigm, clarify what cannot be achieved, and use these limitations to derive rigorous separations. We begin by defining their complexity classes, as in (Li et al., 2024).
**Definition 3.7 (Complexity Classes $\mathsf{CoT}$, $\mathsf{CT}$, and $\mathsf{Loop}$)**
*Let $\mathsf{CoT}[T(n),d(n),s(n)]$ , $\mathsf{CT}[T(n),d(n),s(n)]$ , and $\mathsf{Loop}[T(n),d(n),s(n)]$ denote the sets of languages ${\mathcal{L}}:\{0,1\}^{*}\to\{0,1\}$ for which there exists a deterministic CoT, Coconut, and looped TF, respectively, denoted by $M_{n}$ for each input size $n$ , with embedding size $O(d(n))$ and $O(s(n))$ bits of precision, such that for all $x\in\{0,1\}^{n}$ , the final output token after $O(T(n))$ iterations equals ${\mathcal{L}}(x)$ .*
Boolean circuits serve as a standard formal model of computation, where processes are defined by the evaluation of DAGs with well-established complexity measures.
**Definition 3.8 (Informal)**
*A Boolean circuit is a DAG over the alphabet $\Sigma=\{0,1\}$ , where each internal node (gate) computes a Boolean function such as AND, OR, or NOT. $\mathsf{SIZE}[s(n)]$ and $\mathsf{DEPTH}[d(n)]$ denote the class of languages decidable by a non-uniform circuit family $\{C_{n}\}$ with size $O(s(n))$ and depth $O(d(n))$ , respectively.*
First, we formalize the results of the previous section to show that latent thought iterations can represent circuit depth, whereas CoT corresponds to circuit size.
**Theorem 3.9 (Li et al., 2024)**
*For all $T(n)\in\mathrm{poly}(n)$ , it holds that
$$
\mathsf{SIZE}[T(n)]\subseteq\mathsf{CoT}[T(n),\log{n},1].
$$*
**Theorem 3.10**
*For any function $T(n)\in\mathrm{poly}(n)$ and any non-uniform circuit family $\{C_{n}\}$ , it holds that
$$
\mathsf{DEPTH}[T(n)]\subseteq\mathsf{Loop}[T(n),\mathrm{size}(C_{n}),1].
$$*
Boolean circuits serve as a formal model of parallel computations that run in polylogarithmic time using a polynomial number of processors (Stockmeyer and Vishkin, 1984).
**Definition 3.11**
*For each $k\in\mathbb{N}$ , the classes $\mathsf{NC}^{k}$ , $\mathsf{AC}^{k}$ , and $\mathsf{TC}^{k}$ consist of languages decidable by non-uniform circuit families of size $\mathsf{poly}(n)$ and depth $O(\log^{k}n)$ , using bounded-fanin Boolean gates, unbounded-fanin $\mathsf{AND}$ / $\mathsf{OR}$ gates, and threshold gates, respectively.*
We then characterize the exact computational power of latent thought in the parallel computation regime.
**Theorem 3.12**
*For each $k\in\mathbb{N},$ it holds that
$$
\mathsf{AC}^{k}\ (\textup{resp.}\ \mathsf{TC}^{k})=\mathsf{CT}[\log^{k}n,\mathsf{poly}(n),1\ (\textup{resp.}\ \log n)]=\mathsf{Loop}[\log^{k}n,\mathsf{poly}(n),1\ (\textup{resp.}\ \log n)].
$$*
*Proof sketch.*
The inclusion from circuits to latent thought follows from Theorem 3.10. For the converse inclusion, we build on the arguments of prior work (Merrill and Sabharwal, 2023; Li et al., 2024), which show that a fixed-depth Transformer block under finite precision is contained in $\mathsf{AC}^{0}$ (or $\mathsf{TC}^{0}$ under logarithmic precision). We extend their analysis to the looped setting, which can be unrolled into a $\mathsf{TC}^{k}$ circuit by composing a $\mathsf{TC}^{0}$ block for $\log^{k}n$ iterations. ∎
Moreover, we establish an upper bound on the power of CoT in the parallel computation regime. This limitation arises from the inherently sequential nature of CoT.
**Lemma 3.13**
*For each $k\in\mathbb{N},$ it holds that
$$
\mathsf{CoT}[\log^{k}{n},\mathsf{poly}(n),\log{n}]\subseteq\mathsf{TC}^{k-1}.
$$*
*Proof.*
The total $\log^{k}n$ steps can be divided into $\log^{k-1}n$ blocks, each consisting of $\log n$ steps. Since $\mathsf{CoT}[\log n,\mathsf{poly}(n),\log n]\subseteq\mathsf{TC}^{0}$ (Li et al., 2024), each block with the previous block’s outputs fed as inputs to the next block can be simulated in $\mathsf{TC}^{0}$ ; iterating this over $\log^{k-1}n$ layers yields a circuit in $\mathsf{TC}^{k-1}$ . ∎
Figure 3: The separation between latent thought and CoT for decision problems, under polylogarithmic iterations.
These results lead to a separation in expressive power under standard complexity assumptions, as illustrated in Figure 3.
**Theorem 3.14**
*For each $k\in\mathbb{N}$ , if $\mathsf{TC}^{k-1}\subsetneq\mathsf{NC}^{k}$ , then
$$
\mathsf{CoT}[\log^{k}n,\mathsf{poly}(n),\log{n}]\subsetneq\mathsf{Loop}[\log^{k}n,\mathsf{poly}(n),1].
$$*
**Theorem 3.15**
*For each $k\in\mathbb{N}$ , if $\mathsf{TC}^{k-1}\subsetneq\mathsf{TC}^{k}$ , then
$$
\mathsf{CoT}[\log^{k}n,\mathsf{poly}(n),\log n]\subsetneq\mathsf{Loop}[\log^{k}n,\mathsf{poly}(n),\log n].
$$*
**Remark.**
The claims follow directly from Theorem 3.12 and Lemma 3.13. The established separations of the complexity classes, as summarized in Fig. 3, show that latent thought reasoning enables efficient parallel solutions more effectively than CoT, which is inherently sequential.
## 4 CoT Enables Approximate Counting
In the previous section, we showed that for decision problems, latent thought can yield more efficient solutions than CoT. This naturally raises the question of whether latent thought is universally more powerful than CoT. While prior work has shown that CoT can be simulated by looped Transformer models for deterministic decision problems under deterministic decoding (Saunshi et al., 2025), we found that this result does not directly extend to probabilistic settings with stochastic decoding. Accordingly, we shift our focus from efficiency in terms of the number of reasoning steps to expressive capability under polynomially many iterations.
### 4.1 Preliminaries
**Approximate counting.**
Formally, let $\Sigma$ be a finite alphabet and let $R\subseteq\Sigma^{*}\times\Sigma^{*}$ be a relation. For an input $x\in\Sigma^{*}$ , define $R(x):=\{\,y\in\Sigma^{*}\mid(x,y)\in R\,\},$ and the counting problem is to determine $|R(x)|$ . A wide class of natural relations admits a recursive structure, which allows solutions to be constructed from smaller subproblems.
**Definition 4.1 (Informal: Self-reducibility; Schnorr, 1976)**
*A relation $R$ is self-reducible if there exists a polynomial-time procedure that, given any input $x$ and prefix $y_{1:k}$ (with respect to a fixed output order), produces a sub-instance $\psi(x,y_{1:k})$ such that every solution $z$ of $\psi(x,y_{1:k})$ extends $y_{1:k}$ to a solution of $R(x)$ (and conversely), i.e., $R\bigl(\psi(x,y_{1:k})\bigr)=\{z\mid\ \mathrm{concat}(y_{1:k},z)\in R(x)\,\}.$*
While exact counting is intractable, there exist efficient randomized approximation algorithms (Karp and Luby, 1983).
**Definition 4.2 (FPRAS)**
*An algorithm is called a fully polynomial-time randomized approximation scheme (FPRAS) for a function $f$ if, for any $\varepsilon>0$ and $\delta>0$ , it outputs an estimate $\hat{f}(x)$ such that
$$
\Pr\!\left[(1-\varepsilon)f(x)\leq\hat{f}(x)\leq(1+\varepsilon)f(x)\right]\geq 1-\delta,
$$
and runs in time polynomial in $|x|$ , $1/\varepsilon$ , and $\log(1/\delta)$ .*
The class of counting problems that admit an FPRAS is denoted by $\mathsf{FPRAS}$ . Although randomized algorithms provide only probabilistic guarantees, they are often both more efficient and simpler than their deterministic counterparts, denoted by $\mathsf{FPTAS}$ (Definition C.11). For example, counting the number of satisfying assignments of a DNF formula admits an FPRAS based on Monte Carlo methods (Karp et al., 1989), whereas no FPTAS is known for this problem. Moreover, probabilistic analysis enables us to capture algorithmic behavior on typical instances arising in real-world applications (Mitzenmacher and Upfal, 2017).
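For concreteness, here is a minimal sketch of the Monte Carlo FPRAS for DNF counting (Karp et al., 1989) referenced above; the clause encoding is our own, and clauses are assumed non-contradictory.

```python
import random

def dnf_count_fpras(clauses, n_vars, samples):
    """Karp-Luby estimator for the number of satisfying assignments of a
    DNF formula, given as a list of clauses {var_index: required_bool}."""
    # each clause i alone is satisfied by 2^(n_vars - |clause_i|) assignments
    weights = [2 ** (n_vars - len(c)) for c in clauses]
    total, hits = sum(weights), 0
    for _ in range(samples):
        # draw (i, a) uniformly from U = {(i, a) : a satisfies clause i}
        i = random.choices(range(len(clauses)), weights=weights)[0]
        a = dict(clauses[i])
        for v in range(n_vars):
            a.setdefault(v, random.random() < 0.5)
        # credit a only to its first satisfying clause, so every satisfying
        # assignment is counted exactly once inside U
        first = min(j for j, c in enumerate(clauses)
                    if all(a[v] == b for v, b in c.items()))
        hits += (i == first)
    return total * hits / samples  # |S|/|U| >= 1/len(clauses), so
                                   # poly many samples give a (1 +/- eps) estimate
```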
**Probabilistic models of computation.**
In contrast to the deterministic models considered in the previous section, we now study probabilistic models that define a conditional distribution over output strings $y=(y_{1},\ldots,y_{m})\in\Sigma^{*}$ . We consider autoregressive next-token prediction of the form
$$
p(y\mid x)=\prod_{i=1}^{m}p(y_{i}\mid x,y_{<i}),
$$
where the model is allowed to perform additional reasoning steps before producing each output token $y_{i}$ . This formulation was first used to formalize CoT for language modeling by Nowak et al. (2024). For latent thought, reasoning iterations are performed entirely in hidden space; no linguistic tokens are sampled except for the output token $y_{i}$ , as illustrated in Fig. 4. This definition is consistent with practical implementations (Csordás et al., 2024; Bae et al., 2025). Within this framework, we define complexity classes of probabilistic models, denoted by $\mathsf{pCoT}$ , $\mathsf{pCT}$ , and $\mathsf{pLoop}$ , respectively. Formal definitions are in Section C.1.
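The following sketch spells out this factorization for latent thought: all randomness sits in the sampling of each output token $y_{i}$ , while the $T$ latent iterations before it are deterministic (`loop_block`, `readout`, and `embed` are illustrative stand-ins, not an implementation from the paper).

```python
import numpy as np

def latent_autoregressive_sample(loop_block, readout, embed, x, m, T):
    """Sample y ~ prod_{i=1}^m p(y_i | x, y_<i), with T deterministic
    latent iterations before each output token (cf. Fig. 4)."""
    rng = np.random.default_rng()
    y = []
    for _ in range(m):
        h = np.stack([embed(t) for t in list(x) + y])   # re-encode the prefix
        for _ in range(T):                              # latent thought:
            h = loop_block(h)                           # deterministic
        probs = readout(h[-1])                          # vocab distribution
        y.append(int(rng.choice(len(probs), p=probs)))  # only y_i is sampled
    return y
```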
Figure 4: Probabilistic models of computation with latent thought. Each output token $y_{i}$ is generated autoregressively, with computation permitted in a continuous latent space prior to each token.
### 4.2 Separation in Approximate Counting
We first analyze the expressivity of the token-level conditional prediction at each step, $p(y_{i}\mid x,y_{<i})$ , and show that CoT is strictly more expressive than latent thought in this setting. The key distinction is whether intermediate computation permits sampling. CoT explicitly samples intermediate reasoning tokens, inducing stochastic computation and enabling the emulation of randomized algorithms. In contrast, latent thought performs only deterministic transformations in latent space, resulting in deterministic computation.
**Lemma 4.3 (Informal)**
*Assume that $\mathsf{FPTAS}\subsetneq\mathsf{FPRAS}$ for self-reducible relations. There exists a self-reducible relation $R$ and an associated function $f:\Sigma^{*}\times\Sigma^{*}\to\mathbb{N}$ defined by $f(x,y_{<i})\coloneqq|\{\,z\in\Sigma^{*}:(x,y_{<i}z)\in R\,\}|$ such that CoT with polynomially many steps admits an FPRAS for $f$ . In contrast, no latent thought with polynomially many iterations admits the same approximation guarantee.*
*Proof sketch.*
For self-reducible relations, approximating the counting function $f$ on subproblems is polynomial-time inter-reducible with approximating $|R(x)|$ (Jerrum et al., 1986). If latent thought with polynomially many iterations admitted an FPTAS for $f$ , then it would induce a deterministic FPTAS for $|R(x)|$ , contradicting the assumption. ∎
### 4.3 Separation in Approximate Sampling
We then move from token-level conditional distributions $p(y_{i}\mid x,y_{<i})$ to the full sequence-level distribution $p(y\mid x)$ . Beyond approximate counting, we establish a separation for approximate sampling problems. Specifically, we construct target distributions for which the complexity of each conditional can be reduced to approximate counting.
**Theorem 4.4**
*Assume that $\mathsf{FPTAS}\subsetneq\mathsf{FPRAS}$ for self-reducible relations. There exists a distribution $p(y\mid x)$ over $y\in\Sigma^{*}$ and $x\in\Sigma^{n}$ such that a CoT with a polynomial number of steps, whose induced output conditionals are denoted by $q(y_{i}\mid x,y_{<i})$ , admits an FPRAS for approximating the conditional probabilities $p(y_{i}\mid x,y_{<i})$ for all $x\in\Sigma^{n}$ , indices $i\geq 1$ , and prefixes $y_{<i}\coloneq(y_{1},\ldots,y_{i-1})$ . In contrast, no latent thought with polynomially many iterations admits the same approximation guarantee.*
*Proof sketch.*
Define the target distribution $p$ to be the uniform distribution supported on the solution set $R(x)$ . We rely on the classical result that approximate sampling from the uniform distribution over solutions, captured by the class $\mathsf{FPAUS}$ , is polynomial-time inter-reducible with approximate counting for self-reducible relations (Jerrum et al., 1986). Let $U(\cdot\mid x)$ denote the uniform distribution over solutions of a self-reducible relation $R(x)$ . This distribution admits an autoregressive factorization
$$
U(y\mid x)=\prod_{i=1}^{m}p(y_{i}\mid x,y_{<i}),\qquad p(y_{i}\mid x,y_{<i})=\frac{\bigl|\{\,z\in\Sigma^{*}:(x,y_{1:i}z)\in R\,\}\bigr|}{\bigl|\{\,z\in\Sigma^{*}:(x,y_{1:i-1}z)\in R\,\}\bigr|}.
$$
We show that each conditional probability, expressed as a ratio of subproblem counts, reduces to approximate counting. Applying Lemma 4.3 to these conditionals then yields the desired separation for approximate sampling. ∎
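A sketch of the counting-to-sampling direction used here, assuming for simplicity that all solutions share a common length $m$ ; `count` is an (approximate) counting oracle and all names are illustrative.

```python
import random

def sample_via_counting(count, x, sigma, m):
    """Autoregressively sample y (approximately) uniformly from R(x):
    each conditional p(y_i = a | x, y_<i) is exactly the ratio of
    subproblem counts from the proof sketch above (Jerrum et al., 1986)."""
    prefix = ()
    for _ in range(m):
        # weight of symbol a = number of solutions extending prefix + a
        weights = [count(x, prefix + (a,)) for a in sigma]
        prefix += (random.choices(sigma, weights=weights)[0],)
    return prefix
```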
Figure 5: The separation for approximate counting (sampling).
Consequently, we obtain the following separations in favor of CoT, as also shown in Fig. 5.
**Theorem 4.5**
*Assuming $\mathsf{FPTAS}\subsetneq\mathsf{FPRAS}$ for self-reducible relations, it holds that
$$
\forall\,\mathcal{M}\in\{\mathsf{pCT},\mathsf{pLoop}\},\quad\mathcal{M}[\mathsf{poly}(n)]\subsetneq\mathsf{pCoT}[\mathsf{poly}(n)].
$$*
## 5 Experiments
In this section, we provide empirical validation of our theoretical results on tasks with well-characterized complexity. Specifically, we study parallelizable tasks to empirically validate the efficiency of latent thought predicted in Section 3, and approximate counting and sampling tasks to demonstrate the effectiveness of CoT as shown in Section 4.
### 5.1 Experimental Setting
**Fundamental algorithmic reasoning tasks.**
We use four problems. (1) Word problems for finite non-solvable groups: given a sequence of generators, the task is to evaluate their composition, which is $\mathsf{NC}^{1}$ -complete (Barrington, 1986), also studied for Looped TF (Merrill and Sabharwal, 2025a). (2) $s$ – $t$ connectivity (STCON): given a directed graph $G=(V,E)$ and two vertices $s,t\in V$ , the task is to decide whether $t$ is reachable from $s$ , which belongs to $\mathsf{TC}^{1}$ (Gibbons and Rytter, 1989). (3) Arithmetic expression evaluation: given a formula consisting of $+,\times,-,/$ operations on integers, the task is to evaluate it. This problem is $\mathsf{TC}^{0}$ -reducible to Boolean formula evaluation (Feng et al., 2023), which is $\mathsf{NC}^{1}$ -complete (Buss, 1987). (4) Edit distance: given two strings $x$ and $y$ , the task is to compute the minimum cost to transform $x$ into $y$ . By reducing the dynamic programming formulation to shortest paths, this problem is in $\mathsf{TC}^{1}$ (Apostolico et al., 1990).
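As one concrete instantiation of task (1), a sketch of instance generation assuming the finite non-solvable group is $S_{5}$ , the canonical choice behind the $\mathsf{NC}^{1}$ -completeness result of Barrington (1986).

```python
import itertools
import random

S5 = list(itertools.permutations(range(5)))  # the symmetric group S_5

def compose(p, q):
    """(p o q)(i) = p(q(i))."""
    return tuple(p[j] for j in q)

def word_problem_instance(length):
    """Input: a sequence of group elements; label: their composition."""
    seq = [random.choice(S5) for _ in range(length)]
    product = (0, 1, 2, 3, 4)          # identity permutation
    for g in seq:
        product = compose(product, g)  # sequential ground truth
    return seq, product
```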
**Approximate counting tasks.**
We consider DNF counting and uniform sampling of graph colorings, both of which admit fully polynomial randomized approximation schemes for counting and sampling (FPRAS and FPAUS). Specifically, DNF counting admits an FPRAS via Monte Carlo sampling (Karp et al., 1989), while approximate counting and sampling of graph colorings admit an FPAUS based on rapidly mixing Markov chain Monte Carlo under suitable degree and color constraints (Jerrum, 1995).
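As a sketch of the sampler behind the coloring task, the heat-bath Glauber dynamics that is rapidly mixing when the number of colors exceeds twice the maximum degree (Jerrum, 1995); the greedy initialization additionally assumes $k$ exceeds the maximum degree.

```python
import random

def glauber_colorings(adj, k, steps):
    """Approximately uniform proper k-coloring of the graph adj
    (a dict: vertex -> iterable of neighbours) via Glauber dynamics."""
    coloring = {}
    for v in adj:                      # greedy proper initialization
        used = {coloring[u] for u in adj[v] if u in coloring}
        coloring[v] = min(c for c in range(k) if c not in used)
    vertices = list(adj)
    for _ in range(steps):             # one heat-bath update per step
        v = random.choice(vertices)
        used = {coloring[u] for u in adj[v]}
        coloring[v] = random.choice([c for c in range(k) if c not in used])
    return coloring                    # ~ uniform once the chain has mixed
```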
**Training strategy.**
Since our primary objective is to study expressive power, we allow flexibility in optimization and training strategies. For CoT models, training is performed with supervision from explicit sequential algorithms. For fewer CoT steps, we compare two strategies: uniformly selecting steps from the indices of the complete trajectory (Bavandpour et al., 2025), and stepwise internalization (distillation) methods (Deng et al., 2024). For latent thought, we observe that looped TF is easier to train than Coconut, and therefore adopt looped TF as our instantiation of latent thought, with curriculum learning applied to certain tasks.
Table 2: Accuracy (%) of CoT and looped TF on parallelizable tasks across different numbers of iterations. Here, $n$ denotes the problem size. For CoT, we report the best accuracy achieved across the two training strategies.
| Task | $n$ | CoT (increasing steps →) | | | | Looped TF (increasing loops →) | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Word Problem | 64 | 0.8 | 0.8 | 100.0 | 100.0 | 0.8 | 0.8 | 100.0 | 100.0 |
| Graph Connectivity | 32 | 80.8 | 95.8 | 99.0 | 99.0 | 81.0 | 81.4 | 88.2 | 100.0 |
| Arithmetic Evaluation | 32/16 | 43.7 | 99.4 | 99.5 | 99.7 | 47.3 | 47.6 | 48.2 | 82.5 |
| Edit Distance | 32/16 | 57.3 | 72.9 | 86.2 | 90.7 | 76.5 | 80.9 | 87.5 | 94.8 |
### 5.2 Results
Table 2 reports results on parallelizable tasks, comparing latent thought and CoT under varying numbers of iterations. Latent thought solves the problems with fewer iterations than CoT requires to reach comparable performance. These empirical results are consistent with our theoretical analysis: latent thought supports efficient parallel reasoning, in contrast to the inherently sequential nature of CoT.
Figure 6: Accuracy of looped TFs on the arithmetic evaluation task (top) and the connectivity task (bottom). Each curve shows the performance for a fixed loop count $r$ as the input size $n$ increases.
We also evaluate the relationship between performance, input size, and the number of iterations, as in prior studies (Sanford et al., 2024b; Merrill and Sabharwal, 2025a). Figure 6 presents our results for looped TFs, illustrating that as the input size $n$ increases, the number of loops required to maintain high accuracy grows only logarithmically, supporting our theoretical claim in the (poly-)logarithmic regime.
<details>
<summary>x8.png Details</summary>

### Visual Description
Line chart plotting relative error (lower is better) against iterations × trials on a logarithmic axis (10 to 1000). Looped stays flat at about 0.37 across the whole range, while CoT decreases from about 0.44 at 10 to about 0.25 at 1000, overtaking Looped as the total sample budget grows.
</details>
<details>
<summary>x9.png Details</summary>

### Visual Description
Line chart plotting total variation distance (lower is better, 0.0–1.0) against iterations per trial (30, 60, 90). Looped remains near 1.0 at all three points, while CoT starts near 0.3 and falls to roughly 0.0 by 90 iterations.
</details>
<details>
<summary>x10.png Details</summary>

### Visual Description
Bar chart titled “Distribution over valid graph colorings (sorted)”: probability mass per bin on a logarithmic axis (about $10^{-4}$ to $10^{-2}$) against the sorted colorings, with a dashed line marking the target uniform distribution at about $10^{-2}$. The CoT bars (90 steps) fluctuate around the uniform target, whereas Looped TF (90 loops) concentrates a mass of roughly $2\times 10^{-1}$ on a single coloring.
</details>
Figure 7: Performance of Looped TF and CoT. Top: Relative error for counting (left) and sampling (right). Bottom: Empirical distributions for approximate sampling of graph colorings.
Figure 7 shows the results on the approximate counting and approximate sampling tasks. For approximate counting, CoT performs Monte Carlo estimation: the effective number of samples is given by the product of the number of reasoning steps per trial and the number of independent trials. Accordingly, increasing the iteration count corresponds to increasing this total sample budget. For approximate sampling, CoT emulates Markov chain Monte Carlo, where the iteration axis represents the number of Markov chain steps per trial. In contrast, for looped TF on both tasks, the iteration count simply corresponds to the number of loop iterations applied within a single trial. These results support the separation in favor of CoT equipped with stochastic decoding.
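For concreteness, the following Python sketch shows the two classical procedures that CoT emulates here: Monte Carlo counting, whose total budget is steps per trial times trials, and Glauber dynamics for near-uniform proper $k$-colorings. The function names, the toy interfaces, and the greedy initialization are our own illustrative choices, not the paper's experimental implementation.

```python
import random

def monte_carlo_count(is_member, sample_uniform, universe_size,
                      steps_per_trial, trials, rng=random):
    """Estimate |S| = universe_size * Pr[x in S] by uniform sampling.
    The effective sample budget is steps_per_trial * trials, the
    quantity on the x-axis of Figure 7 (left)."""
    total = steps_per_trial * trials
    hits = sum(is_member(sample_uniform(rng)) for _ in range(total))
    return universe_size * hits / total

def glauber_colorings(adj, k, iterations, rng=random):
    """One MCMC trial for approximately uniform proper k-colorings
    (cf. Jerrum, 1995): start from a greedy proper coloring (possible
    when k exceeds the maximum degree), then repeatedly recolor a random
    vertex with a random color unused by its neighbors."""
    n = len(adj)
    coloring = [0] * n
    for v in range(n):  # greedy initialization, assumes k > max degree
        used = {coloring[u] for u in adj[v] if u < v}
        coloring[v] = next(c for c in range(k) if c not in used)
    for _ in range(iterations):  # iterations = Markov chain steps per trial
        v = rng.randrange(n)
        used = {coloring[u] for u in adj[v]}
        coloring[v] = rng.choice([c for c in range(k) if c not in used])
    return coloring
```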
## 6 Conclusion
We formally analyze the computational capabilities of chain-of-thought and latent thought reasoning, providing a rigorous comparison that reveals their respective strengths and limitations. Specifically, we show that latent thought enables efficient parallel computation, whereas CoT enables randomized approximate counting. Our theoretical results are further supported by experiments. These insights offer practical guidance for reasoning model design. For future work, an important direction is to investigate whether techniques such as distillation can reduce the number of iterations without compromising computational power. Another promising avenue is to extend our analysis to diffusion language models, which possess both parallelizability and stochasticity. Moreover, extending the analysis to realistic downstream tasks remains an important direction.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
## References
- A. V. Aho and J. D. Ullman (1972) Optimization of straight line programs. SIAM Journal on Computing 1 (1), pp. 1–19. Cited by: §3.1.
- A. Apostolico, M. J. Atallah, L. L. Larmore, and S. McFaddin (1990) Efficient parallel algorithms for string editing and related problems. SIAM Journal on Computing. Cited by: §5.1.
- S. Arora and B. Barak (2009) Computational complexity: a modern approach. Cambridge University Press. Cited by: §1.1.
- S. Bae, Y. Kim, R. Bayat, S. Kim, J. Ha, T. Schuster, A. Fisch, H. Harutyunyan, Z. Ji, A. Courville, et al. (2025) Mixture-of-recursions: learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524. Cited by: §1, §4.1.
- D. A. Barrington (1986) Bounded-width polynomial-size branching programs recognize exactly those languages in NC¹. In ACM Symposium on Theory of Computing, Cited by: §5.1.
- A. A. Bavandpour, X. Huang, M. Rofin, and M. Hahn (2025) Lower bounds for chain-of-thought reasoning in hard-attention transformers. In Forty-second International Conference on Machine Learning, External Links: Link Cited by: §D.1.2, §D.1.2, §5.1.
- S. R. Buss (1987) The boolean formula value problem is in alogtime. In Proceedings of the nineteenth annual ACM symposium on Theory of computing, pp. 123–131. Cited by: §5.1.
- S. A. Cook (1985) A taxonomy of problems with fast parallel algorithms. Information and control 64 (1-3), pp. 2–22. Cited by: §3.1.
- R. Csordás, K. Irie, J. Schmidhuber, C. Potts, and C. D. Manning (2024) MoEUT: mixture-of-experts universal transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §4.1.
- A. B. de Luca and K. Fountoulakis (2024) Simulation of graph algorithms with looped transformers. In Forty-first International Conference on Machine Learning, External Links: Link Cited by: §2.2.
- M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019) Universal transformers. In International Conference on Learning Representations, External Links: Link Cited by: §1.
- Y. Deng, Y. Choi, and S. Shieber (2024) From explicit cot to implicit cot: learning to internalize cot step by step. arXiv preprint arXiv:2405.14838. Cited by: §D.1.2, §5.1.
- P. Erdős and A. Rényi (1959) On random graphs I. Publicationes Mathematicae Debrecen 6, pp. 290–297. Cited by: §D.1.1.
- G. Feng, B. Zhang, Y. Gu, H. Ye, D. He, and L. Wang (2023) Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems 36, pp. 70757–70798. Cited by: §B.2, §D.1.1, §D.1.1, §D.1.2, §1, §2.2, Assumption 3.4, §5.1.
- A. Giannou, S. Rajput, J. Sohn, K. Lee, J. D. Lee, and D. Papailiopoulos (2023) Looped transformers as programmable computers. In International Conference on Machine Learning, pp. 11398–11442. Cited by: §1, §2.2.
- A. Gibbons and W. Rytter (1989) Efficient parallel algorithms. Cambridge University Press. Cited by: §5.1.
- H. A. Gozeten, M. E. Ildiz, X. Zhang, H. Harutyunyan, A. S. Rawat, and S. Oymak (2025) Continuous chain of thought enables parallel exploration and reasoning. arXiv preprint arXiv:2505.23648. Cited by: §1, §2.2, Table 1.
- S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. E. Weston, and Y. Tian (2025) Training large language models to reason in a continuous latent space. In Second Conference on Language Modeling, External Links: Link Cited by: §1, §2.1.
- M. R. Jerrum, L. G. Valiant, and V. V. Vazirani (1986) Random generation of combinatorial structures from a uniform distribution. Theoretical computer science 43, pp. 169–188. Cited by: Proposition C.13, Proposition C.16, Theorem C.19, §1.1, §4.2, §4.3.
- M. Jerrum (1995) A very simple algorithm for estimating the number of k-colorings of a low-degree graph. Random Structures & Algorithms 7 (2), pp. 157–165. Cited by: §D.2.2, §5.1.
- R. M. Karp, M. Luby, and N. Madras (1989) Monte-carlo approximation algorithms for enumeration problems. Journal of algorithms 10 (3), pp. 429–448. Cited by: §4.1, §5.1.
- R. M. Karp and M. Luby (1983) Monte-carlo algorithms for enumeration and reliability problems. In 24th Annual Symposium on Foundations of Computer Science, pp. 56–64. Cited by: §D.2.1, §4.1.
- Z. Li, H. Liu, D. Zhou, and T. Ma (2024) Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations, External Links: Link Cited by: §B.1, §B.3.1, §B.7, Definition B.1, Lemma B.10, Definition B.2, Definition B.3, Lemma B.5, Theorem B.9, §1.1, §1, §2.2, Table 1, §3.1, §3.2, §3.3, §3.3, §3.3, Theorem 3.9.
- Y. Liang, Z. Sha, Z. Shi, Z. Song, and Y. Zhou (2024) Looped relu mlps may be all you need as practical programmable computers. arXiv preprint arXiv:2410.09375. Cited by: §B.5.
- W. Merrill, A. Sabharwal, and N. A. Smith (2022) Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics. Cited by: item 3, §C.1.
- W. Merrill and A. Sabharwal (2023) The parallelism tradeoff: limitations of log-precision transformers. Transactions of the Association for Computational Linguistics 11, pp. 531–545. Cited by: §1.1, §2.2, §3.3, Definition 3.2.
- W. Merrill and A. Sabharwal (2024) The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, External Links: Link Cited by: §C.1, §1, §2.2, Definition 2.1.
- W. Merrill and A. Sabharwal (2025a) A little depth goes a long way: the expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961. Cited by: §A.2, §1, §2.2, §5.1, §5.2.
- W. Merrill and A. Sabharwal (2025b) Exact expressive power of transformers with padding. arXiv preprint arXiv:2505.18948. Cited by: §D.1.1, §2.1, §2.2, Table 1.
- M. Mitzenmacher and E. Upfal (2017) Probability and computing: randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press. Cited by: §4.1.
- F. Nowak, A. Svete, A. Butoi, and R. Cotterell (2024) On the representational capacity of neural language models with chain-of-thought reasoning. In Association for Computational Linguistics, External Links: Link Cited by: §C.1, Lemma C.6, §1, §2.2, Table 1, §4.1.
- C. Sanford, B. Fatemi, E. Hall, A. Tsitsulin, M. Kazemi, J. Halcrow, B. Perozzi, and V. Mirrokni (2024a) Understanding transformer reasoning capabilities via graph algorithms. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §D.1.1.
- C. Sanford, D. Hsu, and M. Telgarsky (2024b) Transformers, parallel computation, and logarithmic depth. In Forty-first International Conference on Machine Learning, External Links: Link Cited by: §B.5, §3.1, §5.2.
- N. Saunshi, N. Dikkala, Z. Li, S. Kumar, and S. J. Reddi (2025) Reasoning with latent thoughts: on the power of looped transformers. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §2.2, Table 1, §4.
- C. Schnorr (1976) Optimal algorithms for self-reducible problems. In Proceedings of the Third International Colloquium on Automata, Languages and Programming, Cited by: Definition C.15, Definition 4.1.
- L. Stockmeyer and U. Vishkin (1984) Simulation of parallel random access machines by circuits. SIAM Journal on Computing 13 (2), pp. 409–422. Cited by: §3.3.
- A. Svete and A. Sabharwal (2025) On the reasoning abilities of masked diffusion language models. arXiv preprint arXiv:2510.13117. Cited by: Table 1.
- L. G. Valiant (1979) The complexity of enumeration and reliability problems. SIAM Journal on Computing 8 (3), pp. 410–421. Cited by: §C.3.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §2.1.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: §1.
- K. Xu and I. Sato (2025) On expressive power of looped transformers: theoretical analysis and enhancement via timestep encoding. In Forty-second International Conference on Machine Learning, External Links: Link Cited by: §D.1.2, §1.
- L. Yang, K. Lee, R. D. Nowak, and D. Papailiopoulos (2024) Looped transformers are better at learning learning algorithms. In The Twelfth International Conference on Learning Representations, External Links: Link Cited by: §2.2.
- H. Zhu, S. Hao, Z. Hu, J. Jiao, S. Russell, and Y. Tian (2025a) Reasoning by superposition: a theoretical perspective on chain of continuous thought. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §2.2, Table 1.
- R. Zhu, Z. Wang, K. Hua, T. Zhang, Z. Li, H. Que, B. Wei, Z. Wen, F. Yin, H. Xing, et al. (2025b) Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741. Cited by: §1.
## Appendix A Formal Definitions
### A.1 Notation
Vectors are written in lowercase bold letters (e.g., ${\bm{x}}$ ) and matrices in uppercase bold letters (e.g., ${\bm{W}}$ ). The $i$ -th entry of a vector ${\bm{x}}$ is ${\bm{x}}_{i}$ , the vector from the $i$ -th to the $j$ -th entry is denoted by ${\bm{x}}_{i:j}$ , and the $(i,j)$ -th entry of a matrix ${\bm{W}}$ is ${\bm{W}}_{i,j}$ . We use the symbol $*$ to denote a “don’t care” value (or block of values). For $n\in\mathbb{N}^{+}$ , let $[n]\coloneq\{1,2,\ldots,n\}$ . We sometimes write column vectors horizontally, e.g., ${\bm{x}}=(x_{1},\ldots,x_{n})$ , for brevity. The Hadamard (element-wise) product is $\odot$ . ${\bm{e}}_{i}\in\{0,1\}^{d}$ is the $i$ -th standard basis vector, ${\bm{1}}_{d}\in\mathbb{R}^{d}$ (or $1_{d}$ ) the all-ones vector, and ${\bm{0}}_{d}\in\mathbb{R}^{d}$ the zero vector. ${\bm{I}}_{d}\in\mathbb{R}^{d\times d}$ denotes the $d\times d$ identity matrix, and $\mathbf{0}_{m\times n}\in\mathbb{R}^{m\times n}$ the $m\times n$ zero matrix. The indicator function is ${\bm{1}}[\cdot]$ , and $\bigoplus$ denotes block-diagonal concatenation. Functions on scalars or vectors are written in upright letters (e.g., $\mathrm{FFN}$ ), while functions on matrices are boldface (e.g., $\mathbf{ATTN}$ ). Boldface is also used when scalar- or vector-level functions are extended to sequence level and applied independently to each token (e.g., $\mathbf{FFN}$ ). Finally, $\mathsf{poly}(n)$ denotes the set of functions growing at most polynomially: $\mathsf{poly}(n)\coloneq\left\{f:\mathbb{N}\to\mathbb{N}\;\middle|\;\exists k\in\mathbb{N},\;\exists c>0,\;\forall n\in\mathbb{N},\;f(n)\leq c\cdot n^{k}\right\}.$
### A.2 Transformer Block
We define the computational components of a Transformer block using the notation of (Merrill and Sabharwal, 2025a). Let $\mathbb{F}_{s}$ denote the set of $s$ -bit floating-point numbers with truncated arithmetic (Definition B.1).
**Definition A.1 (Transformer)**
*A Transformer consists of the following components:
1. A word embedding function $\mathrm{WE}:{\mathcal{V}}\to\mathbb{F}_{s}^{m}$ , where ${\mathcal{V}}$ denotes the vocabulary set.
1. A positional embedding function $\mathrm{PE}:\mathbb{N}\to\mathbb{F}_{s}^{m}$ .
1. A multi-head self-attention layer $\mathbf{SA}:\mathbb{F}_{s}^{m\times N}\to\mathbb{F}_{s}^{m\times N}$ for arbitrary sequence length $N$ , parameterized by a matrix ${\bm{O}}:\mathbb{F}_{s}^{s\times H}\to\mathbb{F}_{s}^{m}$ and, for each head $h\in[H]$ with head size $s$ , matrices ${\bm{Q}}_{h},{\bm{K}}_{h},{\bm{V}}_{h}:\mathbb{F}_{s}^{m}\to\mathbb{F}_{s}^{s}$ . Given an input ${\bm{x}}_{i}\in\mathbb{F}_{s}^{m}$ for each position $i\in[N]$ , it computes the query $\mathbf{q}_{i,h}={\bm{Q}}_{h}{\bm{x}}_{i}$ , key ${\bm{k}}_{i,h}={\bm{K}}_{h}{\bm{x}}_{i}$ , and value $\mathbf{v}_{i,h}={\bm{V}}_{h}{\bm{x}}_{i}$ , and outputs ${\bm{O}}\cdot({\bm{a}}_{i,1},\ldots,{\bm{a}}_{i,H}),$ where each attention output ${\bm{a}}_{i,h}$ is defined, for softmax function, as:
$$
{\bm{a}}_{i,h}=\sum_{j=1}^{c(i)}\frac{\exp(\mathbf{q}_{i,h}^{\top}{\bm{k}}_{j,h})}{Z_{i,h}}\cdot\mathbf{v}_{j,h},\quad Z_{i,h}=\sum_{j=1}^{c(i)}\exp(\mathbf{q}_{i,h}^{\top}{\bm{k}}_{j,h}), \tag{1}
$$
with $c(i)=i$ for causal attention and $c(i)=N$ for full attention. For the saturated hardmax attention (Merrill et al., 2022), each attention output ${\bm{a}}_{i,h}$ is defined as:
$$
{\bm{a}}_{i,h}=\sum_{j\in M_{i,h}}\frac{1}{|M_{i,h}|}\,\mathbf{v}_{j,h},\quad M_{i,h}=\left\{j\in[c(i)]\;\middle|\;\mathbf{q}_{i,h}^{\top}{\bm{k}}_{j,h}=\max_{j^{\prime}}\mathbf{q}_{i,h}^{\top}{\bm{k}}_{j^{\prime},h}\right\}. \tag{2}
$$
1. A feedforward layer $\mathrm{FF}:\mathbb{F}_{s}^{m}\to\mathbb{F}_{s}^{m}$ with parameter ${\bm{W}}_{1}:\mathbb{F}_{s}^{m}\to\mathbb{F}_{s}^{w}$ , ${\bm{W}}_{2}:\mathbb{F}_{s}^{w}\to\mathbb{F}_{s}^{m}$ , and ${\bm{b}}\in\mathbb{F}_{s}^{m}$ , where $w$ is the hidden dimension. Given an input ${\bm{x}}_{i}\in\mathbb{F}_{s}^{m}$ , it outputs ${\bm{W}}_{2}\mathrm{ReLU}({\bm{W}}_{1}{\bm{x}}_{i}+{\bm{b}})$ , where $\mathrm{ReLU}({\bm{x}})=(\max\{0,{\bm{x}}_{1}\},\ldots,\max\{0,{\bm{x}}_{m}\})^{\top}$ .
1. An output function $\mathbf{OUT}:\mathbb{F}_{s}^{m}\to\mathbb{F}_{s}^{|{\mathcal{V}}|}$ , parameterized as a linear transformation.*
### A.3 Chain of Thought
**Definition A.2 (CoT)**
*Let the Transformer be defined as the composition:
$$
\mathrm{TF}_{\mathrm{dec}}\coloneq\mathbf{OUT}\circ(\operatorname{id}+\mathbf{FF}_{L})\circ(\operatorname{id}+\mathbf{SA}_{L})\circ\cdots\circ(\operatorname{id}+\mathbf{FF}_{1})\circ(\operatorname{id}+\mathbf{SA}_{1})\circ(\mathbf{WE}+\mathbf{PE}), \tag{3}
$$
where $\mathbf{SA}_{\ell}$ and $\mathbf{FF}_{\ell}$ denote the causal attention and the feedforward layers at depth $\ell\in[L]$ , respectively, and $\operatorname{id}$ denotes the identity function. The input tokens are first embedded via the word embedding function $\mathbf{WE}$ and the positional encoding $\mathbf{PE}$ , and the final output is produced by a linear projection $\mathbf{OUT}$ . Given an input sequence $x=(x_{1},\dots,x_{n})\in{\mathcal{V}}^{n}$ , we define the initial sequence as: $f_{\mathrm{cot}}^{0}(x)\coloneq x.$ Then, the CoT computes recursively as:
$$
f_{\mathrm{cot}}^{k+1}(x)\coloneq f_{\mathrm{cot}}^{k}(x)\cdot\mathrm{Dec}\,(\mathrm{TF}_{\mathrm{dec}}(f_{\mathrm{cot}}^{k}(x))), \tag{4}
$$
where $\cdot$ denotes concatenation, and $\mathrm{Dec}(\cdot)$ is a decoding function that maps the output logits to a token in ${\mathcal{V}}$ : in the deterministic model, $\mathrm{Dec}(z)\coloneq\arg\max_{i\in[|{\mathcal{V}}|]}z_{i}$ ; in the stochastic model, $\mathrm{Dec}(z)\sim\mathrm{Multinomial}({z_{i}}/{\sum_{j}z_{j}})$ , assuming $z_{i}>0$ for all $i$ . The final output of the CoT model after $T(n)$ steps is defined as the last $m$ output tokens of $f_{\mathrm{cot}}^{T(n)}(x)$ .*
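As an illustrative reading of Equation (4), the sketch below runs the CoT recursion under both decoding rules; `tf_dec` is a hypothetical stand-in for $\mathrm{TF}_{\mathrm{dec}}$ that returns the logit vector at the last position.

```python
import random

def cot_generate(tf_dec, tokens, num_steps, stochastic=False, rng=random):
    """CoT recursion f^{k+1}(x) = f^k(x) · Dec(TF_dec(f^k(x))) (Eq. 4).
    `tf_dec(seq)` returns the logits z over the vocabulary at the final
    position; in the stochastic model all z_i are assumed positive."""
    seq = list(tokens)
    for _ in range(num_steps):
        z = tf_dec(seq)
        if stochastic:
            # Dec(z) ~ Multinomial(z_i / sum_j z_j)
            token = rng.choices(range(len(z)), weights=z)[0]
        else:
            # Dec(z) = argmax_i z_i
            token = max(range(len(z)), key=z.__getitem__)
        seq.append(token)  # f^{k+1} = f^k · Dec(TF_dec(f^k))
    return seq  # the answer is read off the last m tokens
```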
### A.4 Continuous Thought
**Definition A.3 (Coconut)**
*Let the Transformer block be defined as the composition:
$$
\mathrm{TF}_{\mathrm{dec}}\coloneq(\operatorname{id}+\mathbf{FF}_{L})\circ(\operatorname{id}+\mathbf{SA}_{L})\circ\cdots\circ(\operatorname{id}+\mathbf{FF}_{1})\circ(\operatorname{id}+\mathbf{SA}_{1}), \tag{5}
$$
where $\mathbf{SA}_{\ell}$ and $\mathbf{FF}_{\ell}$ denote the causal attention and the feedforward layers at depth $\ell\in[L]$ , respectively, and $\operatorname{id}$ denotes the identity function. Given an input sequence $x=(x_{1},\dots,x_{n})\in{\mathcal{V}}^{n}$ , we define the initial sequence as: $f_{\mathrm{ct}}^{0}(x)\coloneq\mathbf{WE}(x)+\mathbf{PE}([n]),$ where the input tokens are first embedded via the word embedding function $\mathbf{WE}$ and the positional encoding $\mathbf{PE}$ . Then, the continuous thought (Coconut) computes recursively as:
$$
f_{\mathrm{ct}}^{k+1}(x)\coloneq f_{\mathrm{ct}}^{k}(x)\cdot(\mathrm{TF}_{\mathrm{dec}}(f_{\mathrm{ct}}^{k}(x)+\mathbf{PE}(k))), \tag{6}
$$
where $\cdot$ denotes concatenation. The final output of the model after $T(n)$ steps is defined as the last $m$ output tokens of $\mathrm{Dec}\left(\mathbf{OUT}\left(f_{\mathrm{ct}}^{T(n)}(x)\right)\right),$ where the final output is produced by a linear projection $\mathbf{OUT}$ and $\mathrm{Dec}(\cdot)$ is a decoding function that maps the output logits to a token in ${\mathcal{V}}$ : in the deterministic model, $\mathrm{Dec}(z)\coloneq\arg\max_{i\in[|{\mathcal{V}}|]}z_{i}$ ; in the stochastic model, $\mathrm{Dec}(z)\sim\mathrm{Multinomial}({z_{i}}/{\sum_{j}z_{j}})$ , assuming $z_{i}>0$ for all $i$ .*
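A minimal sketch of this recursion, assuming hidden states are numpy vectors; `tf_block`, `we`, `pe`, `out`, and `dec` are hypothetical stand-ins for $\mathrm{TF}_{\mathrm{dec}}$, $\mathbf{WE}$, $\mathbf{PE}$, $\mathbf{OUT}$, and $\mathrm{Dec}$:

```python
import numpy as np

def coconut_generate(tf_block, we, pe, out, dec, tokens, num_steps):
    """Coconut recursion f^{k+1} = f^k · TF_dec(f^k + PE(k)) (Eq. 6):
    the last hidden state is appended directly, with no decoding until
    the end. `tf_block` maps an (N, m) array to an (N, m) array."""
    h = np.stack([we(t) + pe(i + 1) for i, t in enumerate(tokens)])
    for k in range(num_steps):
        y = tf_block(h + pe(k))     # add PE(k) to every position
        h = np.vstack([h, y[-1:]])  # append the new latent thought
    return dec(out(h[-1]))          # decode only the final state
```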
### A.5 Looped Transformer
**Definition A.4 (Looped TF)**
*Let the Transformer block be defined as the composition:
$$
\mathrm{TF}\coloneq(\operatorname{id}+\mathbf{FF}_{L})\circ(\operatorname{id}+\mathbf{SA}_{L})\circ\cdots\circ(\operatorname{id}+\mathbf{FF}_{1})\circ(\operatorname{id}+\mathbf{SA}_{1}), \tag{7}
$$
where $\mathbf{SA}_{\ell}$ and $\mathbf{FF}_{\ell}$ denote the (non-causal) self-attention and feedforward layers at depth $\ell\in[L]$ . Given an input token sequence $x=(x_{1},\dots,x_{n})\in{\mathcal{V}}^{n}$ , the initial hidden state is: $f_{\mathrm{loop}}^{0}(x)\coloneq\mathbf{WE}(x).$ At each loop iteration $k$ , the hidden state is updated by:
$$
f_{\mathrm{loop}}^{k+1}(x)\coloneq\mathrm{TF}\left(f_{\mathrm{loop}}^{k}(x)\right). \tag{8}
$$
The final outputs after $T(n)$ loop iterations are decoded as $\mathrm{Dec}\circ\mathbf{OUT}\circ f_{\mathrm{loop}}^{T(n)}(x),$ and the model’s prediction is defined as the last $m\leq n$ tokens of this projected sequence.*
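The corresponding sketch, under the same illustrative conventions as above, makes the contrast with Definitions A.2 and A.3 visible: nothing is appended, so the sequence length, and hence the per-iteration cost, stays fixed.

```python
import numpy as np

def looped_tf(tf_block, we, out, dec, tokens, num_loops, m_out):
    """Looped-TF recursion f^{k+1} = TF(f^k) (Eq. 8): the same block is
    reapplied to the whole hidden-state sequence, with no concatenation.
    `m_out` is the number of output positions (m <= n in Definition A.4)."""
    h = np.stack([we(t) for t in tokens])     # f^0(x) = WE(x)
    for _ in range(num_loops):
        h = tf_block(h)                       # one loop iteration
    return [dec(out(v)) for v in h[-m_out:]]  # decode the last m positions
```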
<details>
<summary>x11.png Details</summary>

### Visual Description
Block diagrams of the three reasoning paradigms. In Chain of Thought and Continuous Thought, the Transformer maps the current sequence to an output that is appended back to the input at each step (the two diagrams differ only in what is appended, a token versus a hidden state); in the Looped Transformer, the output hidden states are fed back through the same block in a loop.
</details>
Figure 8: The models of reasoning paradigms based on iterative use of Transformer models.
## Appendix B Deferred Proofs for Section 3
### B.1 Precision Modeling
We focus on signed floating-point numbers, following (Li et al., 2024), but omit exponents for simplicity.
**Definition B.1 (Floating-point Representation, cf. (Li et al., 2024))**
*Consider floating-point numbers with a mantissa part of $s$ bits and a sign bit of 1, totaling $(s+1)$ bits. We denote the set of such floating-point numbers by $\mathbb{F}_{s}$ , and define $B_{s}\triangleq\max\mathbb{F}_{s}.$*
**Definition B.2 (Correct Rounding, cf. (Li et al., 2024))**
*For any $x\in\mathbb{R}$ and any closed subset $\mathbb{F}\subset\mathbb{R}$ containing $0$ , we define the correct rounding $\operatorname{round}(x,\mathbb{F})$ as the number in $\mathbb{F}$ closest to $x$ . In particular, rounding to a floating-point number with a mantissa of $s$ bits is denoted by $[\cdot]_{s}$ . Rounding applied to vectors operates coordinate-wise.*
They also define primitive arithmetic under finite precision by applying rounding after each basic operation. In particular, for multi-operand operations, rounding is applied after each binary operation. Finite-precision summation over more than two numbers is thus defined as follows.
**Definition B.3 (Summation with Iterative Rounding, (Li et al., 2024))**
*For any $s,n\in\mathbb{N}^{+}$ and vector ${\bm{x}}\in\mathbb{R}^{n}$ , define the summation with iterative rounding to $s$ -bit precision
$$
\mathrm{sum}_{s}:\ \bigcup_{n\in\mathbb{N}^{+}}(\mathbb{F}_{s})^{n}\to\mathbb{F}_{s}, \tag{9}
$$
where, for any $n\in\mathbb{N}^{+}$ and ${\bm{x}}=(x_{1},\dots,x_{n})\in(\mathbb{F}_{s})^{n}$ ,
$$
\mathrm{sum}_{s}(x)\coloneqq\Bigl[\;\Bigl[\;\cdots\Bigl[\,[x_{1}+x_{2}]_{s}+x_{3}\Bigr]_{s}+\cdots+x_{n-1}\Bigr]_{s}+x_{n}\;\Bigr]_{s}. \tag{10}
$$*
Based on this definition, all computations in the Transformer block of Definition A.1 can be represented in finite precision. The inner product and the matrix product are defined as
$$
{\bm{x}}^{\top}{\bm{y}}\coloneqq\operatorname{sum}_{s}({\bm{x}}\odot{\bm{y}}),\qquad({\bm{A}}{\bm{B}})_{i,j}\coloneqq{\bm{A}}_{i,:}^{\top}{\bm{B}}_{:,j}. \tag{11}
$$
Throughout this section, we interpret all operations as finite-precision computations as defined above.
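This rounding model is easy to make concrete. The sketch below models $\mathbb{F}_{s}$ as fixed-point numbers with $s$ fractional bits saturating at $\pm B_{s}$, a further simplification of Definition B.1 (which itself omits exponents); the function names are our own.

```python
def make_round(s, B_s):
    """A toy model of correct rounding [.]_s (Definitions B.1-B.2):
    s fractional bits with saturation at +/- B_s."""
    def rnd(x):
        q = round(x * (1 << s)) / (1 << s)
        return max(-B_s, min(B_s, q))
    return rnd

def sum_s(xs, rnd):
    """Summation with iterative rounding (Eq. 10): round after every
    binary addition, left to right."""
    acc = rnd(xs[0])
    for x in xs[1:]:
        acc = rnd(acc + x)
    return acc

def dot_s(x, y, rnd):
    """Finite-precision inner product (Eq. 11): x^T y = sum_s(x ⊙ y),
    with each elementwise product rounded as a primitive operation."""
    return sum_s([rnd(a * b) for a, b in zip(x, y)], rnd)
```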
### B.2 Definition of Assumption 3.4
Our definition of polynomially efficient approximation follows that of Feng et al. (2023), but differs in scope: while their framework targets real-valued functions, ours applies to symbolic functions.
**Definition B.4 (Polynomially-efficient approximation)**
*We say that a function $f_{n}:\Sigma^{\ell(n)}\to\Sigma$ admits a polynomially efficient approximation if, for a sufficiently small error tolerance $0<\delta\leq\tfrac{1}{3}$ , there exists a feedforward network $\mathrm{FF}:\mathbb{F}_{s(n)}^{\ell(n)\cdot|\Sigma|}\to\mathbb{F}_{s(n)}^{|\Sigma|},\ s(n)=O(\log n),$ such that the following holds: for every input ${\bm{x}}=(x_{1},\ldots,x_{\ell(n)})\in\Sigma^{\ell(n)}$ ,
$$
\mathrm{FF}(\bm{e}(x_{1}),\ldots,\bm{e}(x_{\ell(n)}))_{i}\;=\;\begin{cases}\;\geq 1-\delta&\text{if}\quad\bm{e}(f_{n}({\bm{x}}))={\bm{e}}_{i},\\
\;\leq\delta&\text{else },\end{cases} \tag{12}
$$
where $\bm{e}:\Sigma\to\{0,1\}^{|\Sigma|}$ denote the one-hot encoding. Moreover, the number of parameters of the feedforward network is bounded by a polynomial in $\ell(n)$ and $1/\delta$ .*
### B.3 Technical lemmas
In this section, we provide the key components for our constructive proofs.
#### B.3.1 Orthogonal Vectors
We follow the notation of (Li et al., 2024). For any positive integer $s\in\mathbb{N}^{+}$ and $x\in\{0,1,\dots,2^{s}-1\}$ , we denote by $\mathsf{bin}_{s}(x)\in\{0,1\}^{s}$ the standard binary representation of $x$ using $s$ bits, defined such that $x=\sum_{i=1}^{s}2^{i-1}\cdot(\mathsf{bin}_{s}(x))_{i}.$ We further define the signed binary encoding of $x$ , denoted by $\mathsf{sbin}_{s}(x)\in\{-1,1\}^{s}$ , as $\mathsf{sbin}_{s}(x)=2\cdot\mathsf{bin}_{s}(x)-(1,\ldots,1).$ Let ${\bm{x}},{\bm{y}}\in\mathbb{R}^{s}$ be two vectors of the same length. We define their interleaving, denoted by ${{\bm{x}}}^{\frown}{{\bm{y}}}\in\mathbb{R}^{2s}$ , as $({{\bm{x}}}^{\frown}{{\bm{y}}})_{2i-1}={\bm{x}}_{i}$ and $({{\bm{x}}}^{\frown}{{\bm{y}}})_{2i}={\bm{y}}_{i}$ for all $i\in[s]$ . Orthogonal vectors under finite-precision arithmetic can then be constructed as follows.
**Lemma B.5 ((Li et al., 2024))**
*For any $s\in\mathbb{N}^{+}$ , let ${\bm{q}}_{i}={\mathsf{sbin}_{s}(i)}^{\frown}{1_{s}}$ and ${\bm{k}}_{i}=2^{s+1}\cdot({\mathsf{sbin}_{s}(i)}^{\frown}{(-1_{s})})$ for all $i\in[2^{s}-1]$ . Then $\left\langle{\bm{q}}_{i},{\bm{k}}_{j}\right\rangle_{s}=-B_{s}$ if $i\neq j$ and $\left\langle{\bm{q}}_{i},{\bm{k}}_{j}\right\rangle_{s}=0$ if $i=j$ . Since $\bigl[\exp(-B_{s})\bigr]_{s}\leq\bigl[2^{-s-1}\bigr]_{s}=0$ , it follows that $\left[\exp(\left\langle{\bm{q}}_{i},{\bm{k}}_{j}\right\rangle_{s})\right]_{s}=\mathbf{1}[i=j]$ for all $i,j\in[2^{s}-1]$ .*
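The inner-product pattern can be checked numerically. The following sketch verifies, in exact arithmetic, that $\langle{\bm{q}}_{i},{\bm{k}}_{j}\rangle$ vanishes exactly when $i=j$ and is at most $-2^{s+2}$ otherwise; the finite-precision model then clamps the negative case to $-B_{s}$. The helper names are ours.

```python
def sbin(x, s):
    """Signed binary encoding sbin_s(x) in {-1, 1}^s (bit i-1 of x)."""
    return [2 * ((x >> (i - 1)) & 1) - 1 for i in range(1, s + 1)]

def interleave(x, y):
    """x ⌢ y: entries of x at odd positions, entries of y at even ones."""
    out = []
    for a, b in zip(x, y):
        out.extend((a, b))
    return out

s = 4
for i in range(1, 2 ** s):
    for j in range(1, 2 ** s):
        q = interleave(sbin(i, s), [1] * s)
        k = [2 ** (s + 1) * v for v in interleave(sbin(j, s), [-1] * s)]
        ip = sum(a * b for a, b in zip(q, k))
        # <q_i, k_j> = 2^{s+1} (<sbin(i), sbin(j)> - s): zero iff i = j,
        # and at most -2^{s+2} otherwise, so exp(.) rounds to 1[i = j].
        assert (ip == 0) == (i == j)
        assert i == j or ip <= -(2 ** (s + 2))
```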
#### B.3.2 Position Selector
**Lemma B.6**
*For any $m\in\mathbb{N}^{+}$ and ${\bm{x}}\in\mathbb{F}_{s}^{m}$ with $x_{i}>0$ for all $i\in[m]$ , there exists a feedforward layer $\mathrm{FF}:\mathbb{F}_{s}^{2m}\to\mathbb{F}_{s}^{2m}$ such that, for any $i\in[m]$ ,
$$
(\operatorname{id}+\mathrm{FF})(({\bm{x}},{\bm{e}}_{i}))=({\bm{x}}\odot{\bm{e}}_{i},{\bm{e}}_{i})\,. \tag{13}
$$*
* Proof*
Let the input be ${\bm{z}}=({\bm{x}},{\bm{e}}_{i})\in\mathbb{F}_{s}^{2m}$ . Set the weight ${\bm{W}}_{1}\in\mathbb{F}_{s}^{m\times 2m}$ and bias ${\bm{b}}\in\mathbb{F}_{s}^{m}$ by
$$
{\bm{W}}_{1}=\bigl[\,{\bm{I}}_{m}\;\;\;-B_{s}{\bm{I}}_{m}\,\bigr],\qquad{\bm{b}}={\bm{0}}, \tag{14}
$$
to have ${\bm{W}}_{1}{\bm{z}}+{\bm{b}}={\bm{x}}-B_{s}{\bm{e}}_{i}.$ Applying the ReLU activation coordinate-wise gives
$$
\mathrm{ReLU}({\bm{x}}-B_{s}{\bm{e}}_{i})_{j}=\begin{cases}0,&j=i,\\
x_{j},&j\neq i.\end{cases} \tag{15}
$$
Hence,
$$
{\bm{h}}\coloneqq\mathrm{ReLU}({\bm{W}}_{1}{\bm{z}}+{\bm{b}})={\bm{x}}\odot({\bm{1}}-{\bm{e}}_{i}). \tag{16}
$$

Next, set the second linear layer ${\bm{W}}_{2}\in\mathbb{F}_{s}^{2m\times m}$ by
$$
{\bm{W}}_{2}=\begin{bmatrix}-{\bm{I}}_{m}\\
\mathbf{0}_{m\times m}\end{bmatrix}. \tag{17}
$$
Thus we have
$$
{\bm{z}}+{\bm{W}}_{2}{\bm{h}}=\begin{bmatrix}{\bm{x}}\\
{\bm{e}}_{i}\end{bmatrix}+\begin{bmatrix}-{\bm{h}}\\
{\bm{0}}\end{bmatrix}=\begin{bmatrix}{\bm{x}}\\
{\bm{e}}_{i}\end{bmatrix}+\begin{bmatrix}-{\bm{x}}\odot({\bm{1}}-{\bm{e}}_{i})\\
{\bm{0}}\end{bmatrix}=\begin{bmatrix}{\bm{x}}\odot{\bm{e}}_{i}\\
{\bm{e}}_{i}\end{bmatrix}. \tag{18}
$$
∎
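A direct transcription of this construction as a hypothetical numpy sketch (not part of the paper), useful for checking the algebra above:

```python
import numpy as np

def position_selector(x, e_i, B_s):
    """Lemma B.6: a one-hidden-layer FF plus residual connection that
    keeps only the selected coordinate of x (assumes 0 < x_j <= B_s)."""
    m = len(x)
    z = np.concatenate([x, e_i])                    # input (x, e_i)
    W1 = np.concatenate([np.eye(m), -B_s * np.eye(m)], axis=1)
    h = np.maximum(W1 @ z, 0.0)                     # h = x ⊙ (1 - e_i), Eq. 16
    W2 = np.vstack([-np.eye(m), np.zeros((m, m))])  # Eq. 17
    return z + W2 @ h                               # (x ⊙ e_i, e_i), Eq. 18

# e.g. position_selector(np.array([2., 3., 5.]), np.eye(3)[1], B_s=8.0)
# returns [0., 3., 0., 0., 1., 0.]
```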
#### B.3.3 Feedforward layers
**Lemma B.7**
*Let $N\in\mathbb{N}^{+}$ . For each $i\in[N]$ , let $f_{i}:\Sigma^{\ell_{i}}\to\Sigma$ admit polynomially-efficient approximation by $\mathrm{FF}_{i}$ . Then there exists a feedforward layer $\mathrm{FF}:\ \mathbb{F}_{s}^{\sum_{i=1}^{N}\ell_{i}|\Sigma|}\to\mathbb{F}_{s}^{N|\Sigma|}$ such that for every input tuple ${\bm{x}}_{i}=(x^{(i)}_{1},\ldots,x^{(i)}_{\ell_{i}})\in\Sigma^{\ell_{i}}$ ,
$$
\mathrm{FF}\!\left(\big\|_{i=1}^{N}\,(\bm{e}(x^{(i)}_{1}),\ldots,\bm{e}(x^{(i)}_{\ell_{i}}))\right)\;=\;\big\|_{i=1}^{N}\,\mathrm{FF}_{i}({\bm{x}}_{i}), \tag{19}
$$
where $\|$ denotes concatenation.*
* Proof*
For each $i\in[N]$ , by Definition B.4, there exist width $w_{i}\in\mathbb{N}$ and parameters
$$
{\bm{W}}^{(i)}_{1}\in\mathbb{F}_{s}^{w_{i}\times\ell_{i}|\Sigma|},\quad{\bm{W}}^{(i)}_{2}\in\mathbb{F}_{s}^{|\Sigma|\times w_{i}},\quad{\bm{b}}^{(i)}\in\mathbb{F}_{s}^{w_{i}} \tag{20}
$$
such that $\mathrm{FF}_{i}({\bm{x}}_{i})={\bm{W}}^{(i)}_{2}\mathrm{ReLU}({\bm{W}}^{(i)}_{1}\,\bm{e}({\bm{x}}_{i})+{\bm{b}}^{(i)}).$ Now, define block-diagonal matrices
$$
{\bm{W}}_{1}\coloneqq\bigoplus_{i=1}^{N}{\bm{W}}^{(i)}_{1},\qquad{\bm{W}}_{2}\coloneqq\bigoplus_{i=1}^{N}{\bm{W}}^{(i)}_{2},\qquad{\bm{b}}\coloneqq\bigoplus_{i=1}^{N}{\bm{b}}^{(i)}. \tag{21}
$$
Then the single feedforward layer
$$
\mathrm{FF}({\bm{x}})\coloneqq{\bm{W}}_{2}\mathrm{ReLU}({\bm{W}}_{1}{\bm{x}}+{\bm{b}}) \tag{22}
$$
applies each block independently to its corresponding input, yielding exactly $\mathrm{FF}({\bm{x}})=\big\|_{i=1}^{N}\mathrm{FF}_{i}({\bm{x}}_{i}).$ ∎
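The block-diagonal stacking can be written out directly; this sketch assumes scipy's `block_diag` and represents each $\mathrm{FF}_{i}$ by its parameter triple.

```python
import numpy as np
from scipy.linalg import block_diag

def combine_ffns(ffns):
    """Lemma B.7: given triples (W1_i, b_i, W2_i), build one feedforward
    layer that applies every FF_i independently to its own input block
    (Eqs. 21-22)."""
    W1s, bs, W2s = zip(*ffns)
    W1, W2 = block_diag(*W1s), block_diag(*W2s)
    b = np.concatenate(bs)
    return lambda x: W2 @ np.maximum(W1 @ x + b, 0.0)
```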
**Lemma B.8**
*Let $f:\Sigma^{\ell}\to\Sigma$ be a function that admits polynomially-efficient approximation (Definition B.4). Then, there exist two feedforward layers
$$
\mathrm{FF}_{1}:\mathbb{F}_{s}^{(1+\ell)|\Sigma|+1}\to\mathbb{F}_{s}^{(1+\ell)|\Sigma|+1},\quad\mathrm{FF}_{2}:\mathbb{F}_{s}^{(1+\ell)|\Sigma|+1}\to\mathbb{F}_{s}^{(1+\ell)|\Sigma|+1}, \tag{23}
$$
such that, for every ${\bm{x}}=(x_{1},\ldots,x_{\ell})\in\Sigma^{\ell}$ and $t\in\{0,1\}$ ,
$$
(\operatorname{id}+\mathrm{FF}_{2})\circ(\operatorname{id}+\mathrm{FF}_{1})\big({\bm{0}},\ \bm{e}(x_{1}),\ldots,\bm{e}(x_{\ell}),\ t\big)\;=\;(t\cdot\bm{e}(f({\bm{x}})),\ \bm{e}(x_{1}),\ldots,\bm{e}(x_{\ell}),\ t). \tag{24}
$$*
* Proof*
Let $n\coloneqq|\Sigma|$ and $L\coloneqq\ell n$ . Write ${\bm{u}}=(\bm{e}(x_{1}),\ldots,\bm{e}(x_{\ell}))\in\{0,1\}^{L}$ . By Definition B.4, there exist $w_{f}\in\mathbb{N}$ , matrices ${\bm{W}}^{(f)}_{1}\in\mathbb{F}_{s}^{w_{f}\times L}$ , ${\bm{W}}^{(f)}_{2}\in\mathbb{F}_{s}^{n\times w_{f}}$ , and bias ${\bm{b}}^{(f)}\in\mathbb{F}_{s}^{w_{f}}$ such that
$$
\mathrm{FF}_{f}({\bm{u}})\coloneqq{\bm{W}}^{(f)}_{2}\,\mathrm{ReLU}({\bm{W}}^{(f)}_{1}{\bm{u}}+{\bm{b}}^{(f)}),\quad\mathrm{FF}_{f}({\bm{u}})_{i}=\begin{cases}\geq 1-\delta&\text{if }\ \bm{e}(f({\bm{x}}))={\bm{e}}_{i},\\
\leq\delta&\text{otherwise}.\end{cases} \tag{25}
$$

Set the first layer as
$$
{\bm{W}}^{(1)}_{1}=\begin{bmatrix}{\bm{0}}&{\bm{W}}^{(f)}_{1}&\mathbf{0}_{\,w_{f}\times 1}\\
{\bm{0}}&{\bm{I}}_{L}&\mathbf{0}_{\,L\times 1}\end{bmatrix},\quad{\bm{b}}^{(1)}=\begin{bmatrix}{\bm{b}}^{(f)}\\
{\bm{0}}_{L}\end{bmatrix},\quad{\bm{W}}^{(1)}_{2}=\begin{bmatrix}{\bm{W}}^{(f)}_{2}&\,{\bm{0}}\\
\mathbf{0}_{\,L\times w_{f}}&\,{\bm{0}}\\
\mathbf{0}_{\,1\times w_{f}}&{\bm{0}}\end{bmatrix}, \tag{26}
$$
and define $\mathrm{FF}_{1}({\bm{0}},{\bm{u}},t)\coloneqq{\bm{W}}^{(1)}_{2}\,\mathrm{ReLU}\big({\bm{W}}^{(1)}_{1}({\bm{0}},{\bm{u}},t)+{\bm{b}}^{(1)}\big)$ . Then, it holds that
$$
\begin{aligned}
\mathrm{FF}_{1}({\bm{0}},{\bm{u}},t)&={\bm{W}}^{(1)}_{2}\,\mathrm{ReLU}\left(\begin{bmatrix}{\bm{W}}^{(f)}_{1}{\bm{u}}+{\bm{b}}^{(f)}\\
{\bm{u}}\end{bmatrix}\right)=\begin{bmatrix}{\bm{W}}^{(f)}_{2}\mathrm{ReLU}({\bm{W}}^{(f)}_{1}{\bm{u}}+{\bm{b}}^{(f)})\\
{\bm{0}}\end{bmatrix}=\begin{bmatrix}\mathrm{FF}_{f}({\bm{u}})\\
{\bm{0}}\end{bmatrix}.
\end{aligned}\tag{27}
$$
Thus we have $(\operatorname{id}+\mathrm{FF}_{1})({\bm{0}},{\bm{u}},t)=\bigl(\mathrm{FF}_{f}({\bm{u}}),\,{\bm{u}},\,t\bigr).$ For the second layer, choose $\delta\leq\tfrac{1}{3}$ and $M\geq 1$ and set
$$
{\bm{W}}^{(2)}_{1}=\begin{bmatrix}2{\bm{I}}_{n}&\mathbf{0}_{n\times L}&M{\bm{1}}_{n}\\
2{\bm{I}}_{n}&\mathbf{0}_{n\times L}&M{\bm{1}}_{n}\\
{\bm{I}}_{n}&\mathbf{0}_{n\times L}&\mathbf{0}_{n\times 1}\\
-{\bm{I}}_{n}&\mathbf{0}_{n\times L}&\mathbf{0}_{n\times 1}\end{bmatrix},\ {\bm{b}}^{(2)}=\begin{bmatrix}-M{\bm{1}}_{n}\\
(1-M){\bm{1}}_{n}\\
{\bm{0}}_{n}\\
{\bm{0}}_{n}\end{bmatrix},\ {\bm{W}}^{(2)}_{2}=\begin{bmatrix}\ {\bm{I}}_{n}&-{\bm{I}}_{n}&-{\bm{I}}_{n}&\ {\bm{I}}_{n}\ \\
\ \mathbf{0}_{\,L\times n}&\mathbf{0}_{\,L\times n}&\mathbf{0}_{\,L\times n}&\mathbf{0}_{\,L\times n}\\
\ \mathbf{0}_{\,1\times n}&\mathbf{0}_{\,1\times n}&\mathbf{0}_{\,1\times n}&\mathbf{0}_{\,1\times n}\end{bmatrix}. \tag{30}
$$
and define $\mathrm{FF}_{2}({\bm{y}})\coloneqq{\bm{W}}^{(2)}_{2}\,\mathrm{ReLU}\big({\bm{W}}^{(2)}_{1}{\bm{y}}+{\bm{b}}^{(2)}\big)$ . Then it holds that, for ${\bm{z}}\coloneqq\mathrm{FF}_{f}({\bm{u}})$ ,
$$
\begin{aligned}
\mathrm{FF}_{2}({\bm{z}},{\bm{u}},t)&={\bm{W}}^{(2)}_{2}\begin{bmatrix}\mathrm{ReLU}\bigl(2{\bm{z}}+Mt{\bm{1}}_{n}-M{\bm{1}}_{n}\bigr)\\
\mathrm{ReLU}\bigl(2{\bm{z}}+Mt{\bm{1}}_{n}+(1-M){\bm{1}}_{n}\bigr)\\
\mathrm{ReLU}({\bm{z}})\\
\mathrm{ReLU}(-{\bm{z}})\end{bmatrix}={\bm{W}}^{(2)}_{2}\begin{bmatrix}\mathrm{ReLU}\bigl(2{\bm{z}}+M(t-1){\bm{1}}_{n}\bigr)\\
\mathrm{ReLU}\bigl(2{\bm{z}}+1+M(t-1){\bm{1}}_{n}\bigr)\\
{\bm{z}}\\
{\bm{0}}\end{bmatrix}\\
&=\begin{bmatrix}\mathrm{ReLU}\bigl({\bm{z}}-\delta+M(t-1)\bigr)-\mathrm{ReLU}\bigl({\bm{z}}-1+M(t-1)\bigr)-{\bm{z}}\\
\mathbf{0}\\
0\end{bmatrix},
\end{aligned}\tag{31}
$$
which satisfies
$$
\mathrm{ReLU}\bigl({\bm{z}}-\delta+M(t-1)\bigr)-\mathrm{ReLU}\bigl({\bm{z}}-\delta-1+M(t-1)\bigr)\;=\;\begin{cases}\;\bm{e}(f({\bm{x}}))&\text{if}\quad t=1,\\
\;{\bm{0}}&\text{if}\quad t=0.\end{cases} \tag{34}
$$

Therefore, the composition satisfies $(\operatorname{id}+\mathrm{FF}_{2})\circ(\operatorname{id}+\mathrm{FF}_{1})\bigl({\bm{0}},{\bm{u}},t\bigr)=(t\cdot\bm{e}(f({\bm{x}})),\ {\bm{u}},\ t).$ ∎
### B.4 Proof for Theorem 3.5
* Proof*
Let $G_{n}=(V_{n},E_{n})$ be a computation graph over a finite set of functions $\mathcal{F}=\{f_{1},f_{2},\dots,f_{|\mathcal{F}|}\}$ . Each node $v\in V_{n}$ is labeled by a one-hot vector $\bm{e}(v)\in\{0,1\}^{|\mathcal{F}|}$ indicating the function assigned to $v$ from $\mathcal{F}$ . Let $v_{1},v_{2},\dots,v_{|V_{n}|}$ denote a fixed topological ordering of $V_{n}$ , with inputs appearing first and outputs last. For each function $f_{i}\in\mathcal{F}$ , let $C_{f_{i}}(n)\coloneqq\max\{\,|\mathrm{pred}(v)|:v\in V_{n},\;\bm{e}(v)={\bm{e}}_{i}\,\}$ , and define $C_{\mathrm{sum}}(n)\coloneqq\sum_{f\in\mathcal{F}}C_{f}(n),\ C_{\mathrm{max}}(n)\coloneqq\max_{f\in\mathcal{F}}C_{f}(n).$ Let the precision be $s(n)=C\cdot\lceil\log_{2}n\rceil$ , where $C\in\mathbb{N}$ is a sufficiently large integer such that $2^{s(n)}\geq n^{C}$ exceeds the maximum polynomial step bound under consideration. We denote by $\mathrm{pred}(v_{i})\in\mathbb{F}_{s(n)}^{C_{\max}(n)}$ the vector of predecessor indices of node $v_{i}$ ; that is, if $v_{i}$ has $d\leq C_{\max}(n)$ incoming edges from nodes $v_{j_{1}},\dots,v_{j_{d}}$ , then $\mathrm{pred}(v_{i})=(j_{1},\dots,j_{d},{\bm{0}})$ , where zeros are used for padding so that the length is exactly $C_{\max}(n)$ . Let the vocabulary be ${\mathcal{V}}=\Sigma$ . At decoding step $k$ , the model has access to the concatenated sequence
$$
(x_{1},x_{2},\ldots,x_{n},\,y_{1},y_{2},\ldots,y_{k})\in\Sigma^{n+k}, \tag{35}
$$
where $x=(x_{1},\ldots,x_{n})\in\Sigma^{n}$ denotes the input, and $y_{i}$ is the token generated at the $i$ -th CoT step. For each node $v_{j}$ , let $v_{j}(x)$ denote its value on input $x$ . We assume that every intermediate output satisfies $y_{i}=v_{n+i}(x).$ Under this assumption, we prove by induction that the model generates the next token correctly, i.e., $y_{k+1}=v_{n+k+1}(x).$
**Embedding.**
The embedding at position $i\in[n+k]$ , denoted by ${\bm{h}}^{(0)}_{i}\in\mathbb{F}_{s(n)}^{m}$ , where $m\coloneq|\Sigma|+|\mathcal{F}|+(1+C_{\mathrm{max}}(n))s(n)+|\Sigma|C_{\mathrm{sum}}(n)$ is defined as
$$
{\bm{h}}^{(0)}_{i}=\left(\bm{e}(v_{i}(x)),\,\bm{e}(v_{i+1}),\,\mathsf{sbin}_{s(n)}(i),\,\bm{\mathrm{sbinpred}}_{s(n)}(v_{i+1}),\,{\bm{0}}_{|\Sigma|C_{\mathrm{sum}}(n)}\right), \tag{36}
$$
where $\bm{e}\colon\Sigma\to\{0,1\}^{|\Sigma|}$ denotes the one-hot encoding of a symbol, and $\bm{\mathrm{sbinpred}}_{s(n)}(v)\in\mathbb{F}^{C_{\mathrm{max}}(n)\cdot s(n)}_{s(n)}$ encodes the binary representations of the predecessor indices:
$$
\bm{\mathrm{sbinpred}}_{s(n)}(v_{i})\coloneq\left(\mathsf{sbin}_{s(n)}(\mathrm{pred}(v_{i})_{1}),\,\ldots,\,\mathsf{sbin}_{s(n)}(\mathrm{pred}(v_{i})_{C_{\mathrm{max}}(n)})\right). \tag{37}
$$
This embedding is constructed, for $z\in\Sigma$ , as
$$
\mathbf{WE}(z)=\left(\bm{e}(z),\,{\bm{0}}\right),\quad\mathbf{PE}(i)=\left({\bm{0}},\,\bm{e}(v_{i+1}),\,\mathsf{sbin}_{s(n)}(i),\,\bm{\mathrm{sbinpred}}_{s(n)}(v_{i+1}),\,\mathbf{0}\right). \tag{38}
$$
**Attention layer.**
The first attention layer consists of $C_{\mathrm{max}}(n)$ heads. The $h$ -th head is configured to attend to the position corresponding to the $h$ -th predecessor. Specifically, for each position $i$ and head $h\in[C_{\mathrm{max}}(n)]$ , the attention vectors are defined as:
$$
{\bm{q}}_{i,h}={\mathsf{sbin}_{s(n)}(\mathrm{pred}(v_{i+1})_{h})}^{\frown}{1_{s(n)}},\qquad{\bm{k}}_{i,h}=2^{s(n)+1}\cdot\left({\mathsf{sbin}_{s(n)}(i)}^{\frown}{(-1_{s(n)})}\right),\qquad\mathbf{v}_{i,h}=\bm{e}(v_{i}(x)), \tag{39}
$$
where vectors of different lengths are zero-padded to match the dimension. By Lemma B.5, at the last position $i=n+k$ , the $h$ -th attention head retrieves the value of the $h$ -th predecessor of $v_{n+k+1}$ :
$$
{\bm{a}}_{n+k,h}=\bm{e}\bigl(v_{\mathrm{pred}(v_{n+k+1})_{h}}(x)\bigr) \tag{42}
$$
With an appropriate output projection ${\bm{O}}$ such that
$$
{\bm{O}}({\bm{a}}_{i,1},\ldots,{\bm{a}}_{i,H})=({\bm{0}},\,\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{1}}(x)),\,\ldots,\,\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{C_{\mathrm{max}}(n)}}(x)),\,{\bm{0}}), \tag{43}
$$
the hidden state at position $n+k$ after the attention layer is given by
$$
{\bm{h}}^{(0.5)}_{n+k}=\Big(\bm{e}(v_{n+k}(x)),\,\bm{e}(v_{n+k+1}),\,\mathsf{sbin}_{s(n)}(n+k),\,\bm{\mathrm{sbinpred}}_{s(n)}(v_{n+k+1}),\,\underbrace{\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{1}}(x)),\,\ldots,\,\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{C_{\mathrm{max}}(n)}}(x))}_{\text{updated}}\,,{\bm{0}}_{C_{\mathrm{sum}}(n)|\Sigma|}\Big). \tag{44}
$$
The second and third attention layers are disabled (i.e., all attention weights are set to zero).
**Feed-forward layer.**
By Lemma B.7, a single feed-forward layer can approximate multiple functions by partitioning the input into blocks. The first feed-forward layer then places the arguments, gathered by attention, into the correct positions. By Lemma B.6, where the vector ${\bm{e}}_{i}$ therein corresponds to ${\bm{1}}_{|\Sigma|}$ here, the hidden state at the last position, denoted by ${\bm{h}}^{(1)}_{n+k}$ , becomes
$$
{\bm{h}}^{(1)}_{n+k}=\Big(({\bm{h}}^{(0.5)}_{n+k})_{1:r},\ \big\|_{j=1}^{|\mathcal{F}|}\big(\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{1}}(x)),\ldots,\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{C_{j}(n)}}(x))\big)\Big), \tag{46}
$$
where $r=|\Sigma|+|\mathcal{F}|+(1+C_{\mathrm{max}}(n))\,s(n)$ .
By Assumption 3.4 and Lemmas B.8 and B.7, there exist feed-forward layers $\mathrm{FF}_{2},\mathrm{FF}_{3}:\mathbb{F}_{s(n)}^{m}\to\mathbb{F}_{s(n)}^{m}$ such that, for every input tuple ${\bm{x}}_{j}\coloneq(x^{(j)}_{1},\ldots,x^{(j)}_{C_{j}(n)})\in\Sigma^{C_{j}(n)}$ for $j\in[|\mathcal{F}|]$ and every ${\bm{t}}\in\{0,1\}^{|\mathcal{F}|}$ , the composition $\mathcal{FF}\coloneq(\operatorname{id}+\mathrm{FF}_{3})\circ(\operatorname{id}+\mathrm{FF}_{2})$ satisfies
$$
\mathcal{FF}\left(*,\,{\bm{t}},\,*,\,\big\|_{j=1}^{|\mathcal{F}|}\,(\bm{e}(x^{(j)}_{1}),\ldots,\bm{e}(x^{(j)}_{C_{j}(n)}))\right)\;=\;\left(\sum^{|\mathcal{F}|}_{j=1}{\bm{t}}_{j}\cdot\bm{e}(f_{j}({\bm{x}}_{j})),*\right), \tag{47}
$$
where $*$ denotes an unspecified vector. Since the second and third attention layers are disabled, applying the second and third feed-forward layers $\mathrm{FF}_{2}$ and $\mathrm{FF}_{3}$ yields, after the third layer, the hidden state
$$
\begin{aligned}
{\bm{h}}^{(3)}_{n+k}&=\mathcal{FF}\left({\bm{h}}^{(1)}_{n+k}\right)\\
&=\mathcal{FF}\left(*,\,\bm{e}(v_{n+k+1}),\,*,\,\big\|_{j=1}^{|\mathcal{F}|}\big(\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{1}}(x)),\ldots,\bm{e}(v_{\mathrm{pred}(v_{n+k+1})_{C_{j}(n)}}(x))\big)\right)\\
&=\Bigg(\sum_{j=1}^{|\mathcal{F}|}\bm{e}(v_{n+k+1})_{j}\cdot\bm{e}\Big(f_{j}\big(v_{\mathrm{pred}(v_{n+k+1})_{1}}(x),\ldots,v_{\mathrm{pred}(v_{n+k+1})_{C_{j}(n)}}(x)\big)\Big),\;*\Bigg)\\
&=\Big(\bm{e}\Big(f_{l}\big(v_{\mathrm{pred}(v_{n+k+1})_{1}}(x),\ldots,v_{\mathrm{pred}(v_{n+k+1})_{C_{l}(n)}}(x)\big)\Big),\;*\Big),\quad\text{where }\bm{e}(v_{n+k+1})={\bm{e}}_{l},\\
&=\Big(\bm{e}\big(v_{n+k+1}(x)\big),\;*\Big).
\end{aligned}\tag{48}
$$
**Output layer.**
The final output is given by
$$
{\bm{h}}_{n+k}=\mathbf{OUT}({\bm{h}}^{(3)}_{n+k})=\begin{bmatrix}{\bm{I}}_{|\Sigma|}&\mathbf{0}\end{bmatrix}{\bm{h}}^{(3)}_{n+k}=\bm{e}\!\left(v_{n+k+1}(x)\right). \tag{53}
$$
The decoding function then outputs the symbol corresponding to the maximum score,
$$
y_{k+1}=\mathrm{Dec}({\bm{h}}_{n+k})=\arg\max_{j\in[|\Sigma|]}\bm{e}(v_{n+k+1}(x))_{j}=v_{n+k+1}(x). \tag{54}
$$
By induction on $k$ , the model computes the values at all nodes in topological order. The parameter size of the model is determined by the requirements of the feedforward layers, $O(\mathrm{ff\_param}(G_{n}))$ . The dimensions and number of heads of the attention layers depend on $C_{\max}(n)$ , which is precisely what is already required for the feedforward layers to approximate the target functions. ∎
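Stripped of the Transformer machinery, the simulated computation is simply a topological-order evaluation of the DAG. The following hypothetical sketch mirrors the role of each component in the construction above.

```python
def cot_evaluate_dag(inputs, nodes, funcs):
    """High-level view of the Theorem 3.5 simulation: the CoT tape lists
    node values in topological order; at each step, attention retrieves
    the predecessors' values and the feedforward block applies the
    node's function, whose output Dec appends as the next token.

    `nodes` is a list of (function_name, predecessor_positions) pairs
    for the non-input nodes, positions being 1-based tape indices."""
    tape = list(inputs)                      # y_i = v_{n+i}(x) inductively
    for name, preds in nodes:
        args = [tape[p - 1] for p in preds]  # what the attention heads copy
        tape.append(funcs[name](*args))      # what FF computes and Dec emits
    return tape
```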
### B.5 Proof of Theorem 3.6 for Looped Transformer
<details>
<summary>x12.png Details</summary>

### Visual Description
Diagram of one looped-TF iteration: the input tokens ($a$, $b$, $c$, $d$) feed an attention block that aggregates all positions, and the aggregated sequence enters an MLP block, labeled with node functions $f_{i}$ and $f_{j}$, whose output is fed back for the next iteration.
</details>
Figure 9: Illustration of the role of the attention and feedforward layers in looped TFs for evaluating DAGs: the attention layer uniformly attends to and aggregates all inputs at each position, while the feedforward layer simultaneously simulates the functions of the nodes.
*Proof*
In the proof, we assume that the computation graph contains at most $n$ output nodes. This assumption is without loss of generality: if the number of output nodes exceeds the number of input nodes, we can simply pad the input with dummy nodes (e.g., fixed zeros), thereby reducing the setting to the same case. We construct a model in which (1) the attention layer aggregates the inputs, and (2) the looped feed-forward layer performs the computation of all nodes in parallel, as illustrated in Figure 9. We first show that a feed-forward layer followed by an attention layer can copy all input tokens to each position. Given an input sequence $x=(x_{1},x_{2},\ldots,x_{n})\in\Sigma^{n}$ and a one-hot encoding $\bm{e}:\Sigma\to\{0,1\}^{|\Sigma|}$ , assume each input at position $i\in[n]$ is embedded as
$$
{\bm{h}}_{i}=\bigl(\,{\bm{0}},\;\bm{e}(x_{i}),\;{\bm{e}}_{i}\,\bigr)\in\{0,1\}^{n|\Sigma|+|\Sigma|+n}, \tag{55}
$$
where ${\bm{e}}_{i}\in\{0,1\}^{n}$ . By Lemma B.6, when substituting ${\bm{x}}=(\bm{e}(x_{1}),\ldots,\bm{e}(x_{n}))$ into the lemma, there exists a feed-forward layer $\mathrm{FF}_{1}$ such that
$$
(\operatorname{id}+\mathrm{FF}_{1})({\bm{h}}_{i})=\bigl(\,({\bm{e}}_{i})_{1}\cdot\bm{e}(x_{i}),\;({\bm{e}}_{i})_{2}\cdot\bm{e}(x_{i}),\;\ldots,\;({\bm{e}}_{i})_{n}\cdot\bm{e}(x_{i}),\;\bm{e}(x_{i}),\;{\bm{e}}_{i}\,\bigr). \tag{56}
$$
To aggregate all positions via uniform attention, we use a single-head attention layer with:
$$
\mathbf{q}_{i}={\bm{k}}_{i}=\mathbf{1}_{n|\Sigma|},\quad\mathbf{v}_{i}=n\big(({\bm{e}}_{i})_{1}\cdot\bm{e}(x_{i}),\;({\bm{e}}_{i})_{2}\cdot\bm{e}(x_{i}),\;\ldots,\;({\bm{e}}_{i})_{n}\cdot\bm{e}(x_{i})\big)\quad\text{for all }i\in[n], \tag{57}
$$
With an appropriate output projection, the output of the attention layer at position $i$ becomes
$$
\begin{aligned}
\frac{1}{n}\sum^{n}_{j=1}1\cdot\mathbf{v}_{j} &= \Bigl(\sum_{j=1}^{n}({\bm{e}}_{j})_{1}\,\bm{e}(x_{j}),\;\;\sum_{j=1}^{n}({\bm{e}}_{j})_{2}\,\bm{e}(x_{j}),\;\;\ldots,\;\;\sum_{j=1}^{n}({\bm{e}}_{j})_{n}\,\bm{e}(x_{j})\Bigr)\\
&= \bigl(\,\bm{e}(x_{1}),\;\bm{e}(x_{2}),\;\ldots,\;\bm{e}(x_{n})\,\bigr).
\end{aligned}\tag{58}
$$
Then, we show that the feed-forward layer can encode the entire computation graph into its weights and simulate all nodes simultaneously. Let the flag vector for each node be $(t_{1},\ldots,t_{N})\in\{0,1\}^{N}$ , where $N\coloneqq|V_{n}|=\mathrm{size}(G_{n})$ . By Lemmas B.7 and B.8, there exist feed-forward layers $\mathrm{FF}_{2},\mathrm{FF}_{3}:\mathbb{F}_{s}^{N(|\Sigma|+1)}\to\mathbb{F}_{s}^{N(|\Sigma|+1)}$ such that, for the input vector $(z_{1},\ldots,z_{N})\in\Sigma^{N}$ ,
$$
\mathcal{FF}\Bigl(\,\bm{e}(z_{1}),t_{1},\ldots,\bm{e}(z_{N}),t_{N}\Bigr)\;=\;\big\|_{i=1}^{N}\,\Bigl(t_{i}\cdot\bm{e}\bigl(f_{v_{i}}({\bm{z}}^{(i)})\bigr),\;\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}t_{p_{i,j}}\Bigr), \tag{60}
$$
where $\mathcal{FF}\coloneq(\operatorname{id}+\mathrm{FF}_{3})\circ(\operatorname{id}+\mathrm{FF}_{2})$ . Here, $f_{v_{i}}$ denotes the function associated with node $v_{i}$ , $p_{i,1},\ldots,p_{i,m_{i}}$ denote the indices of the predecessor nodes of $v_{i}$ , and ${\bm{z}}^{(i)}=(z_{p_{i,1}},\ldots,z_{p_{i,m_{i}}})$ denotes their values. The last term, $\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}t_{p_{i,j}}$ , can be obtained using a linear layer. For the $k$ -th loop, assume by induction that the hidden state is
$$
{\bm{h}}(k)\coloneq\bigl(t_{1,k-1}\cdot\bm{e}(v_{1}(x)),\,t_{1,k},\,t_{2,k-1}\cdot\bm{e}(v_{2}(x)),\,t_{2,k},\,\ldots,\,t_{N,k-1}\cdot\bm{e}(v_{N}(x)),\,t_{N,k}\bigr), \tag{61}
$$
where $v_{i}(x)\in\Sigma$ denotes the value computed by node $v_{i}$ given the input $x$ , and $t_{i,k}\in\{0,1\}$ indicates whether node $v_{i}$ lies within depth at most $k$ . Under this assumption, it holds that
$$
\begin{aligned}
\mathcal{FF}({\bm{h}}(k)) &= \big\|_{i=1}^{N}\,\Bigl(t_{i,k}\cdot\bm{e}\bigl(f_{v_{i}}(v_{p_{i,1}}(x),v_{p_{i,2}}(x),\ldots,v_{p_{i,m_{i}}}(x))\bigr),\;\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}t_{p_{i,j},k}\Bigr)\\
&= \big\|_{i=1}^{N}\,\Bigl(t_{i,k}\cdot\bm{e}\bigl(v_{i}(x)\bigr),\;t_{i,k+1}\Bigr)\;=\;{\bm{h}}(k+1).
\end{aligned}\tag{62}
$$
To extract the output node corresponding to each position, denoted by $o_{i}$ , in the final loop iteration, by Lemma B.6 there exists a feedforward layer $\mathrm{FF}_{4}$ such that
$$
(\operatorname{id}+\mathrm{FF}_{4})({\bm{h}}(k),{\bm{e}}_{o_{i}})=\bigl(t_{o_{i},k}\cdot\bm{e}(v_{o_{i}}(x)),\;*\bigr)\,. \tag{64}
$$
Summary
We construct the looped model as follows. Each input token $x_{i}\in\Sigma$ at position $i\in[n]$ is embedded as
$$
{\bm{h}}^{(0)}_{i}=\bigl(\,{\bm{e}}_{i},\;{\bm{e}}_{o_{i}},\;\bm{e}(x_{i}),\;0,\;\bm{e}(x_{i}),\;0,\;\ldots,\;\bm{e}(x_{i}),\;0,\;{\bm{0}}_{(1+|\Sigma|)(N-n)+|\Sigma|}\bigr)\in\{0,1\}^{2n+(1+|\Sigma|)N+|\Sigma|}. \tag{65}
$$
The first attention layer is an identity map, while the first feedforward layer computes
$$
{\bm{h}}^{(1)}_{i}\;=\;\bigl(\,{\bm{e}}_{i},\;{\bm{e}}_{o_{i}},\;({\bm{e}}_{i})_{1}\cdot\bm{e}(x_{i}),\;0,\;({\bm{e}}_{i})_{2}\cdot\bm{e}(x_{i}),\;0,\;\ldots,\;({\bm{e}}_{i})_{n}\cdot\bm{e}(x_{i}),\;0,\;{\bm{0}}_{(1+|\Sigma|)(N-n)+|\Sigma|}\,\bigr). \tag{66}
$$
The second attention layer uniformly gathers all positions and appends a constant $1$ to each block:
$$
\begin{aligned}
{\bm{h}}^{(1.5)}_{i} &= \bigl(\,{\bm{e}}_{i},\;{\bm{e}}_{o_{i}},\;\bm{e}(x_{1}),\;1,\;\bm{e}(x_{2}),\;1,\;\ldots,\;\bm{e}(x_{n}),\;1,\;{\bm{0}}_{(1+|\Sigma|)(N-n)+|\Sigma|}\,\bigr)\\
&= \bigl(\,{\bm{e}}_{i},\;{\bm{e}}_{o_{i}},\;{\bm{h}}(0),\;{\bm{0}}_{|\Sigma|}\,\bigr).
\end{aligned}\tag{67}
$$
We now proceed by induction. Assume that at the $k$ -th iteration, the output after the second attention layer at position $i$ is
$$
{\bm{h}}^{(1.5)}_{k,i}\;=\;\bigl(\,{\bm{e}}_{i},\;{\bm{e}}_{o_{i}},\;{\bm{h}}(k),\;t_{o_{i},k-1}\cdot\bm{e}(v_{o_{i}}(x))\bigr), \tag{69}
$$
where $t_{o_{i},k-1}\coloneq\bigl({\bm{h}}^{(1.5)}_{k-1,i}\bigr)_{\,2n+(1+|\Sigma|)(o_{i}-1)}$ denotes the indicator flag showing whether the $o_{i}$ -th node has been reached, i.e., whether it lies within depth at most $k-1$ .
After passing through the second feedforward layer, the third attention layer with all weights set to zero, and the third feedforward layer, the hidden state updates to
$$
{\bm{h}}^{(3)}_{k,i}\;=\;\bigl(\,{\bm{e}}_{i},\;{\bm{e}}_{o_{i}},\;{\bm{h}}(k+1),\;t_{o_{i},k-1}\cdot\bm{e}(v_{o_{i}}(x))\,\bigr). \tag{70}
$$
After the fourth feedforward layer, the hidden state becomes
$$
{\bm{h}}^{(4)}_{k,i}\;=\;\bigl(\,{\bm{e}}_{i},\;{\bm{e}}_{o_{i}},\;{\bm{h}}(k+1),\;t_{o_{i},k}\cdot\bm{e}(v_{o_{i}}(x))\bigr). \tag{71}
$$
By induction, after the final iteration, i.e., after $\mathrm{depth}(G_{n})$ loop iterations over the computation graph $G_{n}$ , we obtain
$$
{\bm{h}}^{(4)}_{\mathrm{depth}(G_{n}),i}\;=\;\bigl(\,*,\;t_{o_{i},\mathrm{depth}(G_{n})}\cdot\bm{e}(v_{o_{i}}(x))\bigr)\;=\;\bigl(\,*,\;\bm{e}(v_{o_{i}}(x))\bigr). \tag{72}
$$
The final output is given by
$$
z_{i}=\mathbf{OUT}({\bm{h}}^{(4)}_{\mathrm{depth}(G_{n}),i})=\begin{bmatrix}{\bm{0}}&{\bm{I}}_{|\Sigma|}\end{bmatrix}({\bm{h}}^{(4)}_{\mathrm{depth}(G_{n}),i})=\bm{e}(v_{o_{i}}(x)). \tag{73}
$$
The decoding function then selects the symbol corresponding to the maximum score:
$$
y_{i}\;=\;\mathrm{Dec}(z_{i})=\arg\max_{j\in[|\Sigma|]}\,\bigl(\bm{e}(v_{o_{i}}(x))\bigr)_{j}=v_{o_{i}}(x). \tag{74}
$$
The parameter size grows proportionally with the input dimension $\mathrm{size}(G_{n})$ , since the function must be approximated along each dimension. Therefore, it can be bounded by $O\!\left(\mathrm{ff\_param}(G_{n})\cdot\mathrm{size}(G_{n})\right).$ ∎
Discussion: Our proof compresses the entire input into a single position before applying an arbitrary feed-forward computation, which may appear to deviate from the standard Transformer architecture. An alternative approach is to distribute computation across multiple positions, as shown by Sanford et al. (2024b), which can reduce the required embedding dimension. We nevertheless adopt the single-position construction to isolate the core characteristics of looped models: attention is used purely as an information aggregation mechanism, a single Transformer layer suffices, and all computation is carried out by the feed-forward network in latent space. Such latent-space computation is strictly more expressive than computation in the language space, as supported by recent work on looped ReLU networks (Liang et al., 2024).
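The core of the construction, after attention has aggregated the input into one position, is the iterated update ${\bm{h}}(k+1)=\mathcal{FF}({\bm{h}}(k))$ , one loop per level of the computation graph. The sketch below (our own toy illustration; the parity-chain `ff` and all names are assumptions, not the paper's construction) shows a fixed feed-forward map making one more node "correct" per loop.

```python
import numpy as np

# Minimal sketch (illustrative) of the looped update h(k+1) = FF(h(k)):
# a fixed map is iterated depth(G_n) times, and node i becomes correct at
# loop i. Here the "graph" is a parity chain y_i = y_{i-1} XOR x_i.
def loop(ff, h0, depth):
    h = h0
    for _ in range(depth):  # one loop iteration per level of the DAG
        h = ff(h)
    return h

x = np.array([1, 0, 1, 1])

def ff(h):
    out = h.copy()
    for i in range(1, len(h)):
        out[i] = h[i - 1] ^ x[i]  # all nodes updated simultaneously
    return out

print(loop(ff, x.copy(), depth=len(x) - 1))  # prefix parities: [1 1 0 1]
```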
### B.6 Proof of Theorem 3.6 for Continuous Thought
*Proof*
The proof builds on the constructions for CoT and the looped TF. Let the vocabulary be ${\mathcal{V}}=\Sigma$ , and let each node $v\in V_{n}$ be labeled by a one-hot vector $\bm{e}(v)\in\{0,1\}^{|\mathcal{F}|}$ . At decoding step $k$ , the model has access to the concatenated sequence
$$
(x_{1},x_{2},\ldots,x_{n},\,h_{1},h_{2},\ldots,h_{k}) \tag{75}
$$
where $x=(x_{1},\ldots,x_{n})\in\Sigma^{n}$ denotes the input, and $h_{i}\in\mathbb{F}_{s}^{m}$ is the hidden state generated at the $i$ -th Coconut step. For each node $v_{j}$ , let $v_{j}(x)$ denote its value on input $x$ . We assume that
$$
h_{k}\coloneq\bigl(t_{1,k-1}\cdot\bm{e}(v_{1}(x)),\,t_{1,k},\,t_{2,k-1}\cdot\bm{e}(v_{2}(x)),\,t_{2,k},\,\ldots,\,t_{N,k-1}\cdot\bm{e}(v_{N}(x)),\,t_{N,k},\,{\bm{e}}^{\prime}_{k},\,{\bm{0}}_{|\Sigma|}\bigr),
$$
where $N\coloneqq\mathrm{size}(G_{n})$ , $t_{i,k}\in\{0,1\}$ indicates whether node $v_{i}$ lies within depth at most $k$ , and $\bm{e}:\Sigma\to\{0,1\}^{|\Sigma|}$ denotes one-hot encoding, and the vector ${\bm{e}}^{\prime}_{k}\in\{0,1\}^{N}$ is defined by
$$
{\bm{e}}^{\prime}_{k}=\begin{cases}\mathbf{0}&k<\mathrm{depth}(G_{n}),\\[4.0pt]
\bm{e}(v_{o_{i}})&k=\mathrm{depth}(G_{n})+i,\end{cases} \tag{76}
$$
where $v_{o_{i}}$ denotes the $i$ -th output node, and ${\bm{e}}^{\prime}_{k}$ can be encoded by positional embeddings. Under this assumption, we prove by induction that the model generates $h_{k+1}(x)$ . We first consider the base case $k=1$ , in which the model receives only ${\bm{x}}$ . We focus on the final token. Using the same construction as in the looped TF, we can show that a single feed-forward layer followed by an attention layer suffices to copy all input tokens to every position. Specifically, the hidden representation at the last position becomes
$$
h_{1}=\bigl(\bm{e}(v_{1}(x)),\,1,\,\ldots,\,\bm{e}(v_{n}(x)),\,1,\,{\bm{0}},\ 1,\,{\bm{0}},\,{\bm{e}}^{\prime}_{0},\,{\bm{0}}\bigr), \tag{77}
$$
where $v_{i}(x)=x_{i}$ for all $i\in[n]$ , corresponding to the input nodes. In what follows, we consider the case $k>1$ . By the same argument as for Looped TF, based on Equation 62, we show that the feed-forward layers can encode the entire computation graph into their weights and simulate all nodes simultaneously. That is, there exist two feed-forward layers whose composition, denoted by $\mathcal{FF}$ , satisfies $\mathcal{FF}(h_{k})=h_{k+1}.$ Consider the last feed-forward layer applied after the zero-weight attention layer. By substituting ${\bm{e}}_{i}={\bm{e}}_{k}^{\prime}$ and ${\bm{x}}={(h_{k})}_{1:m-|\Sigma|}$ into Lemma B.6, there exists a feed-forward network $\mathrm{FF}$ such that
$$
(\operatorname{id}+\mathrm{FF})(h_{k})=\begin{cases}h_{k+1},&\text{if }k<\mathrm{depth}(G_{n}),\\[6.0pt]
\Bigl({(h_{k+1})}_{1:m-|\Sigma|},\,\bm{e}(v_{o_{i}}(x))\Bigr),&\text{if }k=\mathrm{depth}(G_{n})+i.\end{cases} \tag{78}
$$
In particular, for all $k<\mathrm{depth}(G_{n})$ , the output of the last token, for the input $(x_{1},x_{2},\ldots,x_{n},\;h_{1},h_{2},\ldots,h_{k})$ is $h_{k+1}$ . For the final positions corresponding to output nodes, namely $k=\mathrm{depth}(G_{n})+i$ , the hidden state ${\bm{h}}_{k}$ contains the embedding of the output symbol $v_{o_{i}}(x)$ . Applying the output projection yields
$$
\mathbf{OUT}({\bm{h}}_{k})=\begin{bmatrix}\mathbf{0}&{\bm{I}}_{|\Sigma|}\end{bmatrix}{\bm{h}}_{k}=\bm{e}\!\left(v_{o_{i}}(x)\right). \tag{79}
$$ Finally, the decoding function outputs the symbol in $\Sigma$ corresponding to the maximum coordinate of $\mathbf{OUT}({\bm{h}}_{k})$ . ∎
### B.7 Proof for Theorem 3.12
For the upper bound $\mathsf{Loop}[\log^{k}n,\,\mathsf{poly}(n),\,1\ \textup{(resp.\ }\log n)]\subseteq\mathsf{AC}^{k}\ \textup{(resp.\ }\mathsf{TC}^{k}),$ we follow the argument of Li et al. (2024). Their key observation is that a restricted form of automaton can model iterative computation under constant precision: the rounding operation preserves monotonicity, and constant precision yields counter-free, restricted state spaces, which are therefore computable by $\mathsf{AC}^{0}$ circuits. In the case of polynomial precision, prefix summation can be simulated by $\mathsf{TC}^{0}$ , which also allows the detection and correction of rounding in floating-point arithmetic.
**Theorem B.9 (Li et al., 2024)**
*For $s(n)\in\mathsf{poly}(n)$ , $\mathrm{sum}_{s(n)}:(\mathbb{F}_{s(n)})^{n}\to\mathbb{F}_{s(n)}$ is computable by $\mathsf{TC}^{0}$ circuits, and by $\mathsf{AC}^{0}$ circuits when $s(n)$ is constant.*
It has also been shown that gates can be efficiently simulated by feedforward layers.
**Lemma B.10 (Li et al., 2024)**
*Unbounded-fanin $\mathsf{AND},\mathsf{OR}$ (resp. $\mathsf{MAJORITY}$ ) $:\{0,1\}^{n}\to\{0,1\}$ can be simulated by a two-layer feedforward ReLU network with constant (resp. $\log n$ ) bits of precision, constant hidden dimension, and $n$ additional constant inputs of value $1$ .*
*Proof for Theorem 3.12*
$\mathsf{Loop}[\log^{k}n,\,\mathsf{poly}(n),\,1\ \textup{(resp.\ }\log n)]\subseteq\mathsf{AC}^{k}\ \textup{(resp.\ }\mathsf{TC}^{k})$ : In Transformers, constant-depth computation is defined by Summation with Iterative Rounding (see Definition B.3), which by Theorem B.9 can be simulated in $\mathsf{AC}^{0}$ (resp. $\mathsf{TC}^{0}$ ). A looped TF simply stacks these computations vertically through iteration, and thus the result follows.

$\mathsf{AC}^{k}\ \textup{(resp.\ }\mathsf{TC}^{k})\subseteq\mathsf{Loop}[\log^{k}n,\,\mathsf{poly}(n),\,1\ \textup{(resp.\ }\log n)]$ : Since Boolean circuits are DAGs, the claim follows directly from Theorem 3.10 together with Lemma B.10. ∎
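To illustrate Lemma B.10, the following sketch (our own, with 0/1 integer inputs; not the paper's construction) computes unbounded fan-in gates with a two-hidden-unit ReLU block: clipping the input sum at an integer threshold realizes the indicator $\mathbb{1}[s\geq\theta]$ exactly.

```python
import numpy as np

# Sketch of Lemma B.10 (illustrative): with 0/1 inputs the sum s is an
# integer, so relu(s - theta + 1) - relu(s - theta) = 1[s >= theta] exactly,
# using only two hidden ReLU units.
relu = lambda z: np.maximum(z, 0)

def gate(x, theta):
    s = np.sum(x)
    return relu(s - theta + 1) - relu(s - theta)

x = np.array([1, 0, 1, 1, 0])
n = len(x)
print(gate(x, 1))           # OR: 1 iff at least one input is 1
print(gate(x, n))           # AND: 1 iff all inputs are 1
print(gate(x, n // 2 + 1))  # MAJORITY: 1 iff more than half the inputs are 1
```

Note that only MAJORITY needs $\log n$ bits of precision to represent the threshold; AND and OR with constant thresholds fit in constant precision, matching the lemma.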
## Appendix C Deferred Proofs for Section 4
### C.1 Definition for Models of Computation
We model a language model as a probabilistic process that, given an input and a generated prefix, produces either an internal reasoning state or an output token. This process induces a probability distribution over final outputs. We first formalize CoT under this setting. We focus on the saturated hardmax attention as in (Merrill et al., 2022; Nowak et al., 2024).
**Definition C.1 (Language model with CoT)**
*Let ${\mathcal{V}}$ be a vocabulary. Given an input $x\in{\mathcal{V}}^{*}$ , a language model with CoT stochastically generates a sequence of output blocks of the form
$$
\Big(r_{1},\,e,\,y_{1},\,e^{\prime},\;r_{2},\,e,\,y_{2},\,e^{\prime},\;\cdots\;r_{m},\,e,\,y_{m},\,e^{\prime}\Big), \tag{80}
$$
where each $r_{i}\in{\mathcal{V}}^{*}$ represents an explicit reasoning trace, $y_{i}\in{\mathcal{V}}$ is an output token, and $e,e^{\prime}\in{\mathcal{V}}$ are special delimiter tokens. The final output is the string $y_{1}\cdots y_{m}$ . Generation proceeds autoregressively: at iteration $i$ , the model generates a reasoning segment $r_{i}$ followed by an output token $y_{i}$ , conditioned on the input $x$ , the previously generated outputs $y_{<i}$ , and prior reasoning segments $r_{<i}$ . We denote by $p(y\mid x)$ the resulting distribution over final outputs.*
Then we define the language models with latent thought reasoning: Coconut and looped TF.
**Definition C.2 (Language model with Coconut)**
*Let ${\mathcal{V}}$ be a vocabulary. Given an input $x\in{\mathcal{V}}^{*}$ , a language model with Coconut generates a sequence of blocks
$$
\Big(r_{1},\,y_{1},\,r_{2},\,y_{2},\,\ldots,\,r_{m},\,y_{m}\Big), \tag{81}
$$
where each $r_{i}\in{\mathbb{F}^{d}}^{*}$ represents an internal continuous reasoning state, and each $y_{i}\in{\mathcal{V}}$ is an output token. Generation proceeds autoregressively. At each iteration, the internal continuous reasoning state $r_{i}$ is generated deterministically as a function of the input $x$ , the previously generated outputs $y_{<i}$ , and the prior internal states $r_{<i}$ , while the output token $y_{i}$ is generated stochastically. The decision of when to emit an output token is determined internally by the model.*
For looped TF models, the computation proceeds through internally repeated iterations before producing any output tokens. The number of iterations is determined by the model as a function of the input.
**Definition C.3 (Language model with looped TF)**
*Given an input $x\in{\mathcal{V}}^{*}$ , a looped TF produces an output sequence autoregressively. At each iteration $i\in[m]$ , the model internally performs a number of repeated transformation steps before emitting an output token $y_{i}$ , conditioned on the input $x$ and the previously generated outputs $y_{<i}$ . The number of internal loop iterations required to generate each output token is determined internally by the model as a function of the input $x$ and the generated prefix $y_{<i}$ .*
We define complexity classes corresponding to language models under different reasoning paradigms. In contrast to the non-uniform models typically used in parallel computation analysis, we adopt a uniform setting analogous to Turing machines: a single model with a fixed set of parameters is applied to all input lengths, while the number of reasoning steps is allowed to grow as a function of the input size $n$ . Following the convention of Merrill and Sabharwal (2024), we allow $O(\log n)$ bits of numerical precision to represent positional embeddings and internal activations. Furthermore, we extend the output space beyond simple binary decisions. Specifically, the model’s output is not restricted to $\Sigma^{k}$ , but can be an arbitrary finite binary string in $\Sigma^{*}$ . This extension enables the model to represent functions of the form $f:\Sigma^{*}\to\mathbb{R}$ , thereby capturing counting, probabilistic modeling, and approximation tasks.
**Definition C.4 (Complexity Classes $\mathsf{pCoT}$ , $\mathsf{pCT}$ , and $\mathsf{pLOOP}$ )**
*Let $\mathsf{pCoT}[T(n)]$ , $\mathsf{pCT}[T(n)]$ , and $\mathsf{pLOOP}[T(n)]$ denote the classes of probabilistic computations that define an output distribution $p:\Sigma^{*}\times\Sigma^{*}\to[0,1]$ . A function $f$ belongs to such a class if there exists a language model $\mathcal{M}$ , employing CoT, Coconut, or looped TF, respectively, such that for any input $x\in\Sigma^{n}$ , the model induces the distribution $p(x,\cdot)$ within $O(T(n))$ steps using $O(\log n)$ bits of numerical precision.*
### C.2 Lemma: Universality of Chain of Thought
**Definition C.5**
*A probabilistic Turing machine $M=(Q,\Sigma,\Gamma,q_{0},q_{1},\delta_{1},\delta_{2})$ is defined as:
- $Q$ is a finite set of states,
- $\Sigma$ is the input/output alphabet,
- $\Gamma$ is the tape alphabet, with $\Sigma\subseteq\Gamma$ and a distinguished blank symbol $\sqcup\in\Gamma$ ,
- $q_{0}\in Q$ denotes the initial state, and $q_{1}\in Q$ denotes the final state,
- $\delta_{1},\delta_{2}:Q\times\Gamma\to Q\times\Gamma\times(\Sigma\cup\{\varepsilon\})\times\{L,S,R\}$ are two transition functions.
At each step, $M$ chooses uniformly at random between $\delta_{1}$ and $\delta_{2}$ . Each transition $\delta_{i}(q,a)=(q^{\prime},b,\sigma,D)$ is interpreted as: writing $b\in\Gamma$ on the work tape, writing $\sigma\in\Sigma$ on the output tape, where $\varepsilon$ means that nothing is written, and moving the tape head in direction $D\in\{L,S,R\}$ , where $L$ = left, $R$ = right, and $S$ = stay. The output of $M$ is defined to be the string remaining on the output tape when the machine halts.*
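To make the two-transition dynamics concrete, here is a minimal Python sketch (our own illustration; the toy machine and all names are assumptions, not from the paper) of a PTM that flips a fair coin between $\delta_{1}$ and $\delta_{2}$ at every step.

```python
import random

# Minimal sketch (illustrative) of Definition C.5: a probabilistic TM that
# chooses uniformly between two transition functions at each step.
def run_ptm(delta1, delta2, tape, q0, q_final, max_steps=10_000):
    tape = dict(enumerate(tape))  # sparse work tape; blank symbol is "_"
    head, q, output = 0, q0, []
    for _ in range(max_steps):
        if q == q_final:
            break
        delta = random.choice((delta1, delta2))        # uniform coin flip
        q, b, sigma, move = delta[(q, tape.get(head, "_"))]
        tape[head] = b                                 # write the work tape
        if sigma != "":                                # "" plays the role of epsilon
            output.append(sigma)                       # write the output tape
        head += {"L": -1, "S": 0, "R": 1}[move]
    return "".join(output)

# Toy machine: emit one random bit per step and halt after reading 3 cells.
d1 = {(i, "_"): (i + 1, "_", "0", "R") for i in range(3)}
d2 = {(i, "_"): (i + 1, "_", "1", "R") for i in range(3)}
print(run_ptm(d1, d2, "___", q0=0, q_final=3))  # a uniformly random 3-bit string
```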
Prior work has shown that probabilistic CoT can simulate any such PTM, thereby demonstrating its universality.
**Lemma C.6 (Nowak et al., 2024)**
*Let $M$ be a PTM with input and output alphabet $\Sigma$ , and running time bounded by $T(n)$ on inputs of length $n$ . Then there exists a log-precision CoT model with stochastic decoding, denoted $\mathsf{CoT}_{M}$ , such that the induced output distribution of $\mathsf{CoT}_{M}$ coincides exactly with that of $M$ . Formally, for every input string $x\in\Sigma^{*}$ and output string $y\in\Sigma^{*}$ ,
$$
\Pr\bigl[\mathsf{CoT}_{M}(x)=y\bigr]\;=\;\Pr\bigl[M(x)=y\bigr]. \tag{82}
$$
Moreover, the number of reasoning steps of $\mathsf{CoT}_{M}$ is bounded by $\mathsf{poly}(|x|)$ .*
### C.3 Self-Reducibility and Complexity of Approximate Counting
Here, we provide the definitions of relations and the associated counting problems.
**Definition C.7 (Relation)**
*A relation over an alphabet $\Sigma$ is a subset $R\subseteq\Sigma^{*}\times\Sigma^{*}.$ For an input $x\in\Sigma^{*}$ , we denote
$$
R(x):=\{\,y\in\Sigma^{*}:(x,y)\in R\,\}.
$$*
**Definition C.8 (Counting)**
*Given a relation $R$ , the associated counting function is defined as
$$
N_{R}:\Sigma^{*}\to\mathbb{N},\qquad N_{R}(x):=|R(x)|.
$$*
**Definition C.9 ($p$-relation)**
*A relation $R\subseteq\Sigma^{*}\times\Sigma^{*}$ is called a $p$ -relation if
- the membership $(x,y)\in R$ can be decided in time polynomial in $|x|$ , and
- there exists a polynomial $p$ such that for all $(x,y)\in R$ , we have $|y|\leq p(|x|)$ .*
**Definition C.10**
*The extension counting function associated with a relation $R\subseteq\Sigma^{*}\times\Sigma^{*}$ is
$$
\operatorname{EXT}_{R}:\Sigma^{*}\times\Sigma^{*}\to\mathbb{N},\qquad\operatorname{EXT}_{R}(x,w):=\bigl|\{\,z\in\Sigma^{*}:(x,wz)\in R\,\}\bigr|.
$$*
The complexity class $\#\mathsf{P}$ consists of counting problems associated with $p$ -relations (Valiant, 1979). Then, we provide definitions of schemes for approximate counting. In the remainder of this paper, we say that an algorithm produces an output approximating $f(x)$ within ratio $1+\varepsilon$ if its output $\hat{f}(x)$ satisfies
$$
(1-\varepsilon)f(x)\leq\hat{f}(x)\leq(1+\varepsilon)f(x).
$$
**Definition C.11 (FPTAS)**
*An algorithm is called a fully polynomial-time approximation scheme (FPTAS) for a function $f$ if, for any input $x$ and any $\varepsilon>0$ , it produces an output approximating $f(x)$ within ratio $1+\varepsilon$ , and runs in time polynomial in $|x|$ and $1/\varepsilon$ .*
**Definition C.12 (FPRAS)**
*An algorithm is a fully polynomial-time randomized approximation scheme (FPRAS) for a function $f$ if, for any input $x$ and any $\varepsilon>0$ and $\delta>0$ , it produces an output approximating $f(x)$ within ratio $1+\varepsilon$ with probability at least $1-\delta$ , and runs in time polynomial in $|x|$ , $1/\varepsilon$ , and $\log(1/\delta)$ .*
The following proposition formalizes the relationship between approximate counting and the extension counting function.
**Proposition C.13 (Jerrum et al., 1986)**
*Let $R$ be a $p$ -relation and let $x\in\Sigma^{n}$ . Let $m=p(n)$ and define $r=1+\frac{\varepsilon}{2m}$ . If there exists an FPRAS that approximates $\mathrm{Ext}_{R}(x,w)$ within ratio $r$ for all prefixes $w$ with probability at least $1-\delta/m$ , then there exists an FPRAS that approximates $N_{R}(x)$ within ratio $1+\varepsilon$ with probability at least $1-\delta$ .*
This reduction transforms the global counting problem into a sequence of local estimation steps.
**Proposition C.14**
*For any $0<\varepsilon\leq 1$ and any integer $m\geq 1$ , it holds that
$$
\left(1+\frac{\varepsilon}{2m}\right)^{m}<1+\varepsilon. \tag{83}
$$*
*Proof*
We use the standard inequality $1+x\leq e^{x}$ , which holds for all $x\in\mathbb{R}$ . Substituting $x=\frac{\varepsilon}{2m}$ , we have:
$$
\left(1+\frac{\varepsilon}{2m}\right)^{m}\leq\left(\exp\left(\frac{\varepsilon}{2m}\right)\right)^{m}=e^{\varepsilon/2}. \tag{84}
$$
For $0<\varepsilon\leq 1$ , we show that $e^{\varepsilon/2}<1+\varepsilon$ . Let $f(\varepsilon)=1+\varepsilon-e^{\varepsilon/2}$ . Then $f(0)=0$ and $f^{\prime}(\varepsilon)=1-\frac{1}{2}e^{\varepsilon/2}$ . Since $e^{\varepsilon/2}\leq e^{1/2}<2$ for all $\varepsilon\in[0,1]$ , it follows that $f^{\prime}(\varepsilon)>0$ . Thus, $f$ is strictly increasing on $[0,1]$ , implying $f(\varepsilon)>f(0)=0$ , or equivalently $e^{\varepsilon/2}<1+\varepsilon$ . ∎
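As a quick numerical sanity check (our own, not part of the paper), the inequality of Proposition C.14 can be verified on a grid of $(m,\varepsilon)$ values:

```python
# Numeric check (illustrative) of Proposition C.14: (1 + eps/2m)^m < 1 + eps.
for m in (1, 5, 50, 1000):
    for eps in (0.01, 0.5, 1.0):
        assert (1 + eps / (2 * m)) ** m < 1 + eps, (m, eps)
print("ok: (1 + eps/2m)^m < 1 + eps on all grid points")
```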
*Proof for Proposition C.13*
Let $x\in\Sigma^{n}$ and $m=p(n)$ . We express $N_{R}(x)$ as a product of $m$ ratios. Let $w=y_{1}y_{2}\dots y_{m}$ be a witness, and let $w^{(i)}$ denote its prefix of length $i$ . We can write:
$$
N_{R}(x)=\mathrm{Ext}_{R}(x,\lambda)=\prod_{i=1}^{m}\frac{\mathrm{Ext}_{R}(x,w^{(i-1)})}{\mathrm{Ext}_{R}(x,w^{(i)})}\cdot\mathrm{Ext}_{R}(x,w^{(m)}). \tag{85}
$$
In the standard self-reducibility framework, we estimate each ratio $\rho_{i}=\mathrm{Ext}_{R}(x,w^{(i-1)})/\mathrm{Ext}_{R}(x,w^{(i)})$ using the assumed approximation scheme. Let $A_{i}$ be the estimator for the $i$ -th ratio such that:
$$
\mathbb{P}\left[\frac{1}{r}\leq\frac{A_{i}}{\rho_{i}}\leq r\right]\geq 1-\frac{\delta}{m}. \tag{86}
$$
By the union bound, the probability that at least one of these $m$ estimations fails to stay within the ratio $r$ is at most $m\cdot(\delta/m)=\delta$ . Therefore, with probability at least $1-\delta$ , all $m$ estimations are successful. Under this condition, the final estimate $\hat{N}=\prod_{i=1}^{m}A_{i}$ satisfies:
$$
\frac{1}{r^{m}}\leq\frac{\hat{N}}{N_{R}(x)}\leq r^{m}. \tag{87}
$$
From Proposition C.14, we have $r^{m}=(1+\frac{\varepsilon}{2m})^{m}<1+\varepsilon$ . Furthermore, for $0<\varepsilon\leq 1$ , it holds that $1/r^{m}>(1+\varepsilon)^{-1}\geq 1-\varepsilon$ , ensuring a relative error of at most $\varepsilon$ . Regarding complexity, each of the $m$ calls to the FPRAS for $\mathrm{Ext}_{R}$ runs in time $\mathrm{poly}(n,(r-1)^{-1},\log(m/\delta))$ . Substituting $r-1=\varepsilon/2m$ , the runtime per call is $\mathrm{poly}(n,m/\varepsilon,\log(m/\delta))$ . Since $m=p(n)$ is a polynomial in $n$ , the total running time is $\mathrm{poly}(n,1/\varepsilon,\log(1/\delta))$ , which satisfies the requirements for an FPRAS. ∎
We have shown that approximate counting reduces to approximating the extension counting function. For a general relation $R$ , however, the connection between the extension counting function and the original counting problem for $R$ remains unclear. This gap is bridged by the notion of self-reducibility: for self-reducible relations, the extension counting function can be reduced back to the original counting problem for $R$ .
**Definition C.15 (Schnorr, 1976)**
*A relation $R\subseteq\Sigma^{*}\times\Sigma^{*}$ is self-reducible if:
1. There exists a polynomial-time computable function $g:\Sigma^{*}\to\mathbb{N}$ such that $(x,y)\in R\Rightarrow|y|=g(x)$ ;
1. There exists a polynomial-time Turing machine that decides membership in $R$ .
1. There exist polynomial-time computable functions $\psi:\Sigma^{*}\times\Sigma^{*}\to\Sigma^{*}$ and $\sigma:\Sigma^{*}\to\mathbb{N}$ such that
$$
\begin{aligned}
\sigma(x) &= O(\log|x|),\\
g(x)>0 &\Rightarrow\sigma(x)>0\quad\forall x\in\Sigma^{*},\\
|\psi(x,w)| &\leq|x|\quad\forall x,w\in\Sigma^{*},
\end{aligned}\tag{88}
$$
and such that, for all $x\in\Sigma^{*}$ , $y=y_{1}\dots y_{n}\in\Sigma^{*}$ ,
$$
\langle x,y_{1},\dots,y_{n}\rangle\in R\iff\langle\psi(x,y_{1}\dots y_{\sigma(x)}),y_{\sigma(x)+1},\dots,y_{n}\rangle\in R. \tag{91}
$$*
For example, SAT is self-reducible: by fixing a prefix of variables and applying the reduction map $\psi$ , the problem is simplified to a smaller instance whose solutions extend the chosen prefix. Consider the Boolean formula $F=(x_{1}\vee x_{2})\ \wedge\ (\neg x_{1}\vee x_{3})\ \wedge\ (\neg x_{2}\vee\neg x_{3}),$ and suppose we fix the first variable to $x_{1}=1$ . The residual instance is obtained by applying $\psi(F,(1))$ , which substitutes $x_{1}=1$ and simplifies the formula by deleting satisfied clauses and removing falsified literals: $\psi(F,(1))\;=\;(x_{3})\ \wedge\ (\neg x_{2}\vee\neg x_{3}).$ The unit clause $(x_{3})$ forces $x_{3}=1$ , which in turn simplifies $(\neg x_{2}\vee\neg x_{3})$ to $\neg x_{2}$ , yielding $x_{2}=0$ . Hence the unique residual assignment is $(x_{2},x_{3})=(0,1)$ , and together with the prefix $x_{1}=1$ , we obtain the satisfying assignment $(x_{1},x_{2},x_{3})=(1,0,1).$
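The reduction map $\psi$ in this worked example can be sketched in a few lines of Python (our own illustration; the clause encoding and names are assumptions, not from the paper): fix a prefix assignment, delete satisfied clauses, and remove falsified literals.

```python
# Sketch (illustrative) of the SAT self-reduction map psi.
# Clauses are tuples of signed ints: 1 means x1, -2 means NOT x2.
def psi(clauses, prefix):
    """prefix: dict {var: 0/1} fixing the first variables."""
    residual = []
    for clause in clauses:
        kept, satisfied = [], False
        for lit in clause:
            v = abs(lit)
            if v in prefix:
                if (lit > 0) == bool(prefix[v]):
                    satisfied = True   # clause already true: drop it
                    break
                # otherwise the literal is falsified: remove it from the clause
            else:
                kept.append(lit)
        if not satisfied:
            residual.append(tuple(kept))
    return residual

F = [(1, 2), (-1, 3), (-2, -3)]  # (x1 v x2)(NOT x1 v x3)(NOT x2 v NOT x3)
print(psi(F, {1: 1}))            # [(3,), (-2, -3)], matching the text
```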
For self-reducible relations, the extension counting function is no harder to approximate than the original counting problem.
**Proposition C.16 (Jerrum et al., 1986)**
*Let $R$ be self-reducible. If there exists an FPRAS for $N_{R}$ , then there exists an FPRAS for $\mathrm{Ext}_{R}$ .*
*Proof*
By the definition of self-reducibility, the extension function $\mathrm{Ext}_{R}(x,w)$ , which counts the number of strings $y$ such that $(x,wy)\in R$ , can be mapped to the counting problem of a modified instance. Specifically, for any prefix $w$ where $|w|\leq\sigma(x)$ , there exists a polynomial-time computable mapping $\psi$ such that:
$$
\mathrm{Ext}_{R}(x,w)=|\{y\in\Sigma^{g(x)-|w|}:(x,wy)\in R\}|=N_{R}(\psi(x,w)). \tag{92}
$$
Since $R$ is self-reducible, the instance $x_{w}=\psi(x,w)$ can be constructed in polynomial time relative to $|x|$ . By the hypothesis, there exists an FPRAS for $N_{R}$ , which provides a randomized $(1\pm\epsilon)$ -approximation of $N_{R}(x_{w})$ in time polynomial in $|x_{w}|$ and $1/\epsilon$ . Consequently, this algorithm serves as an FPRAS for $\mathrm{Ext}_{R}(x,w)$ , as it runs in polynomial time and satisfies the required approximation guarantees. ∎
### C.4 Proof for Theorem 4.3
**Lemma C.17 (Formal Statement ofLemma4.3)**
*Assume that $\mathsf{FPTAS}\subsetneq\mathsf{FPRAS}$ for self-reducible relations. There exists a self-reducible relation $R$ and an associated function $\mathrm{Ext}_{R}:\Sigma^{*}\times\Sigma^{*}\to\mathbb{N}$ defined by $\mathrm{Ext}_{R}(x,y_{<i})\coloneqq|\{\,z\in\Sigma^{*}:(x,y_{<i}z)\in R\,\}|$ such that language models with CoT using polynomially many reasoning steps, which output a distribution for a given input $(x,y_{<i})\in\Sigma^{n}\times\Sigma^{*}$ by using a linear head for the last hidden state before emitting the output token, admit an FPRAS for $\mathrm{Ext}_{R}$ , whereas no latent thought with polynomially many iterations admits the same approximation guarantee using a linear head for the last hidden state before emitting the output token.*
*Proof*
By Lemma C.6, CoT can simulate any probabilistic Turing machine running in polynomial time; thus, it can implement an FPRAS for $N_{R}$ . By Proposition C.16, the existence of an FPRAS for $N_{R}$ further implies the existence of an FPRAS for the extension function $\mathrm{Ext}_{R}$ . On the other hand, latent thought consisting of a polynomial number of iterations can always be simulated by a deterministic polynomial-time Turing machine, provided that all state transitions in the latent computation are deterministic. Consequently, if such a latent thought process were to admit an FPRAS for $\mathrm{Ext}_{R}$ , it would effectively yield a deterministic polynomial-time approximation scheme. By Proposition C.16, the existence of such a scheme for $\mathrm{Ext}_{R}$ would imply the existence of an FPTAS for $N_{R}$ . However, under the standard complexity-theoretic assumption that $\mathsf{FPTAS}\subsetneq\mathsf{FPRAS}$ for self-reducible relations, there exists a self-reducible relation $R$ whose counting function admits an FPRAS but no FPTAS. This yields a contradiction. ∎
### C.5 Proof for Theorem 4.4
**Definition C.18 (FPAUS)**
*Uniform generation asks to sample an element $y$ uniformly at random from $R(x)$ . A fully polynomial almost uniform sampler (FPAUS) for $R$ is a randomized algorithm that, given an input $x\in\Sigma^{*}$ and an accuracy parameter $\varepsilon>0$ , runs in time polynomial in $|x|$ and $\log(1/\varepsilon)$ , and outputs a distribution $q(\cdot\mid x)$ such that
$$
\bigl\|q(\cdot\mid x)-U(R(x))\bigr\|_{\mathrm{TV}}\;\leq\;\varepsilon,
$$
where $U(R(x))$ denotes the uniform distribution over the set $R(x)$ , and $\|\cdot\|_{\mathrm{TV}}$ denotes total variation distance.*
For self-reducible relations, the following holds.
**Theorem C.19 (Jerrum et al., 1986)**
*Let $R$ be a self-reducible relation. There exists an FPRAS for approximating $|R(x)|$ if and only if there exists an FPAUS for sampling uniformly from $R(x)$ .*
*Proof of Theorem 4.4*
The target uniform conditional distribution is defined as follows:
$$
p(y_{i}\mid x,y_{<i})\;:=\;\frac{\operatorname{EXT}_{R}(x,y_{<i}y_{i})}{\sum_{u\in\Sigma}\operatorname{EXT}_{R}(x,y_{<i}u)}\qquad(y_{i}\in\Sigma). \tag{93}
$$
Assume an FPRAS $\mathcal{A}(x,\varepsilon,\delta)$ exists for the self-reducible relation $|R(x)|$ . By Proposition C.16, the existence of an FPRAS for $|R(x)|$ implies the existence of an FPRAS for the extension function $\operatorname{EXT}_{R}$ . We construct a CoT that samples $y\in R(x)$ by sequentially approximating these conditional probabilities. For each step $i\in\{1,\dots,m(n)\}$ , the CoT computes $(1\pm\varepsilon)$ -accurate estimates $\widehat{\operatorname{EXT}}_{R}(x,y_{<i}u)$ for all $u\in\Sigma$ using $\mathcal{A}$ , and induces the following distribution:
$$
\pi(y_{i}\mid x,y_{<i})\;:=\;\frac{\widehat{\operatorname{EXT}}_{R}(x,y_{<i}y_{i})}{\sum_{u\in\Sigma}\widehat{\operatorname{EXT}}_{R}(x,y_{<i}u)}. \tag{94}
$$
Conditioned on the event that all estimates in Equation 94 are $(1\pm\varepsilon)$ -accurate, the multiplicative error is bounded by:
$$
\frac{1-\varepsilon}{1+\varepsilon}\leq\frac{\pi(y_{i}\mid x,y_{<i})}{p(y_{i}\mid x,y_{<i})}\leq\frac{1+\varepsilon}{1-\varepsilon}. \tag{95}
$$
To ensure the cumulative approximation error remains within $(1\pm\varepsilon^{\prime})$ , we set the local precision to $\varepsilon\leq\frac{\varepsilon^{\prime}}{2+\varepsilon^{\prime}}$ , which yields $\frac{1+\varepsilon}{1-\varepsilon}\leq 1+\varepsilon^{\prime}$ and $\frac{1-\varepsilon}{1+\varepsilon}\geq 1-\varepsilon^{\prime}$ . To ensure the failure probability is at most $\delta^{\prime}$ , we apply a union bound over the $m(n)$ generation steps and the $|\Sigma|$ calls per step. By setting the local confidence to $\delta\leq\frac{\delta^{\prime}}{m(n)(|\Sigma|+1)}$ , the joint success event holds with probability at least $1-\delta^{\prime}$ . Under this event, the CoT correctly simulates an FPRAS for $p(y_{i}\mid x,y_{<i})$ in total time $\text{poly}(n,1/\varepsilon^{\prime},\log(1/\delta^{\prime}))$ . Finally, since CoT can represent an FPRAS by Lemma 4.3, it satisfies the requirements for the construction.

On the other hand, we show that latent thought cannot compute such an approximation. Suppose, for contradiction, that the model could compute the conditional distribution $\pi(y_{i}\mid x,y_{<i})$ to within a $(1\pm\varepsilon)$ relative error in a single step. We define the estimator for the total count $Z(x)=|R(x)|$ as:
$$
\widehat{Z}(x)\;\coloneq\;\Biggl(\prod_{i=1}^{m(n)}\pi(y_{i}\mid x,y_{<i})\Biggr)^{-1}. \tag{96}
$$
Since the true distribution satisfies $p(y\mid x)=1/|R(x)|=\prod_{i}p(y_{i}\mid x,y_{<i})$ , the relative error of $\widehat{Z}(x)$ is governed by the product of local errors:
$$
(1+\varepsilon)^{-m(n)}|R(x)|\leq\widehat{Z}(x)\leq(1-\varepsilon)^{-m(n)}|R(x)|. \tag{97}
$$
By setting $\varepsilon\leq\frac{\varepsilon^{\prime}}{2m(n)}$ , we apply Proposition C.14 and the properties of multiplicative error:
$$
(1+\varepsilon)^{-m(n)}\geq 1-m(n)\varepsilon\geq 1-\varepsilon^{\prime}/2>1-\varepsilon^{\prime}, \tag{98}
$$
and for sufficiently small $\varepsilon$ ,
$$
(1-\varepsilon)^{-m(n)}\leq 1+2m(n)\varepsilon\leq 1+\varepsilon^{\prime}. \tag{99}
$$
Substituting these bounds into Equation 97, we obtain:
$$
(1-\varepsilon^{\prime})|R(x)|\leq\widehat{Z}(x)\leq(1+\varepsilon^{\prime})|R(x)|. \tag{100}
$$
This implies that if the model could compute $\pi$ accurately, $\widehat{Z}(x)$ would constitute an FPTAS for $|R(x)|$ . However, under the standard complexity-theoretic assumption that $\mathsf{FPTAS}\subsetneq\mathsf{FPRAS}$ for self-reducible relations, this yields a contradiction. ∎
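The sequential sampler in this proof can be sketched directly (our own illustration, not the paper's code): emit $y$ symbol by symbol with probability proportional to the extension counts. The sketch below uses exact counts for clarity; replacing `ext` with FPRAS estimates gives the almost-uniform CoT sampler.

```python
import itertools, random

# Sketch (illustrative) of the sampler in the proof of Theorem 4.4: exact
# extension counts give exactly uniform samples over R(x); FPRAS estimates
# would give the almost-uniform stochastic-decoding CoT sampler.
SIGMA = (0, 1)

def ext(member, m, prefix):
    """Number of completions z with prefix + z in R(x), |prefix + z| = m."""
    r = m - len(prefix)
    return sum(member(prefix + list(z)) for z in itertools.product(SIGMA, repeat=r))

def sample(member, m):
    y = []
    for _ in range(m):
        weights = [ext(member, m, y + [u]) for u in SIGMA]
        y.append(random.choices(SIGMA, weights=weights)[0])
    return y

# R(x): satisfying assignments of (y1 OR y2) AND ((NOT y2) OR y3)
member = lambda y: (y[0] or y[1]) and ((not y[1]) or y[2])
print(sample(member, 3))  # a uniformly random satisfying assignment
```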
## Appendix D Experimental Details
### D.1 Fundamental Algorithmic Reasoning Tasks
#### D.1.1 Task Settings
Word Problem
We define a sequence prediction task based on finite groups such as the symmetric group $S_{5}$ . Given a sequence of group elements of length $k$ , the model is required to output the cumulative products obtained by scanning the sequence from left to right. Formally, for an input sequence $(g_{1},g_{2},\ldots,g_{k})$ , the target sequence is $(g_{1},\;g_{1}g_{2},\;g_{1}g_{2}g_{3},\;\ldots,\;g_{1}g_{2}\cdots g_{k}).$ We follow the setting of (Merrill and Sabharwal, 2025b).
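A minimal data-generation sketch for this task (our own illustration; the composition convention and names are assumptions) is as follows.

```python
import itertools, random

# Sketch (illustrative) of the word-problem data: cumulative products in S5.
# A permutation is a tuple p with p[i] = image of i; (p o q)[i] = p[q[i]]
# (one composition convention; the task only requires consistency).
S5 = list(itertools.permutations(range(5)))

def compose(p, q):
    return tuple(p[q[i]] for i in range(5))

def make_example(k, rng=random):
    gs = [rng.choice(S5) for _ in range(k)]
    prefix, targets = None, []
    for g in gs:
        prefix = g if prefix is None else compose(prefix, g)
        targets.append(prefix)  # running product after each prefix
    return gs, targets

xs, ys = make_example(4)
assert ys[0] == xs[0]  # first target is the first element itself
```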
Connectivity
To ensure that the reachability labels are approximately balanced, we generate undirected graphs according to the Erdős–Rényi model (Erdős and Rényi, 1959) $G(n,p)$ , where $n$ is the number of vertices and each possible edge is included independently with probability $p$ . In the supercritical regime ( $pn=c>1$ ), a single “giant” connected component emerges, occupying a fraction $s\in(0,1)$ of the vertices, which satisfies $s=1-e^{-cs}.$ Consequently, the probability that two uniformly random vertices are both in this component—and hence mutually reachable—is approximately $s^{2}$ . To target a reachability probability of $1/2$ , we set $s\approx\sqrt{\tfrac{1}{2}}\approx 0.707,c\approx\tfrac{-\ln(1-s)}{s}\approx 1.74,$ and thus $p=\tfrac{c}{n}\approx\tfrac{1.7}{n}.$ In practice, for each graph of size $n$ we fix $p=1.7/n$ , which empirically yields $\Pr[\text{reachable}]\approx 50\%$ for $n\in[50,100]$ . We follow the encoding scheme of Sanford et al. (2024a). The input to the model is serialized as a flat token sequence consisting of three parts: $v_{0}\,v_{1}\,\cdots\,v_{n-1}\;e_{1}\,e_{2}\,\cdots\,e_{m}\;s,t$ where each vertex is denoted by a token $v_{i}$ , each edge is represented as a pair “ u,v ” with $u<v$ , and the final token “ s,t ” specifies the source–target pair for the reachability query.
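A sketch of the instance generation (our own illustration; function names and the token format details are assumptions) follows.

```python
import random

# Sketch (illustrative) of connectivity data: an Erdos-Renyi graph G(n, p)
# with p = 1.7/n, serialized as vertex tokens, edge tokens, and the query,
# labeled by reachability via a graph search from s.
def make_instance(n, rng=random):
    p = 1.7 / n
    edges = [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]
    s, t = rng.sample(range(n), 2)
    tokens = [f"v{i}" for i in range(n)] + [f"{u},{v}" for u, v in edges] + [f"{s},{t}"]
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    seen, frontier = {s}, [s]
    while frontier:
        u = frontier.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w); frontier.append(w)
    return tokens, t in seen

tokens, reachable = make_instance(50)
print(reachable)  # roughly half of the instances are reachable at p = 1.7/n
```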
Arithmetic Expression Evaluation
Following Feng et al. (2023), we generate expressions over integers modulo $r$ using the four operations $\{+,-,\times,\div\}$ , where multiplication and division are defined via precomputed modular tables. To guarantee that each expression evaluates to a specific target value, we grow expressions backwards: starting from a sampled number, we iteratively replace it with a binary sub-expression that preserves its value under modular arithmetic. Different from Feng et al. (2023), we fix the modulus to $r=3$ , as our focus lies in evaluating the reasoning over expressions rather than exploring the properties of each modular arithmetic system.
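The backward-growth procedure can be sketched as follows (our own illustration under the stated $r=3$ setting; the splitting rules and names are assumptions).

```python
import random

# Sketch (illustrative) of backward expression growth mod r = 3: start from
# the target value and repeatedly replace one numeral leaf with a
# value-preserving binary subexpression. Division means multiplication by
# the modular inverse, per the precomputed tables described in the text.
R = 3
INV = {a: next(b for b in range(1, R) if (a * b) % R == 1) for a in (1, 2)}

def split(value, rng=random):
    """Return (left, op, right) with left op right == value (mod R)."""
    op = rng.choice("+-*/")
    if op == "+":
        a = rng.randrange(R); return a, op, (value - a) % R
    if op == "-":
        a = rng.randrange(R); return (value + a) % R, op, a
    if op == "*":
        a = rng.choice((1, 2)); return a, op, (value * INV[a]) % R
    a = rng.choice((1, 2));  return (value * a) % R, op, a  # (value*a)/a == value

def grow(value, steps, rng=random):
    expr = str(value)
    for _ in range(steps):
        i = next(k for k, c in enumerate(expr) if c.isdigit())  # leftmost leaf
        l, op, r = split(int(expr[i]), rng)
        expr = expr[:i] + f"({l}{op}{r})" + expr[i + 1:]
    return expr

print(grow(2, steps=3))  # a random expression that evaluates to 2 (mod 3)
```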
Edit Distance
The Edit Distance task requires computing the minimum number of edit operations needed to transform one string into another. The allowed operations are insertion, deletion, and replacement of a single character, and the objective is to predict the total edit distance given two input strings. To build the dataset, we follow Feng et al. (2023). We first generate two strings over a randomly sampled alphabet. The first string has a fixed length, while the second string is produced in two possible ways: with probability $0.4$ , it is drawn as a random string of nearly the same length (within $\pm 3$ characters), and with probability $0.6$ , it is derived from the first string by applying a given number of random edit operations. Each edit operation is chosen uniformly among deletion, replacement, and insertion. To avoid trivial cases, string pairs that are identical or whose lengths differ excessively are rejected and resampled. Finally, the shorter string is always placed first to maintain a consistent input format. An example instance in this format is shown below:

`s v d h s s e e … v e | s h d s s s s … e s e <sep> 20`

Here, the two input strings are separated by the token “|”, “<sep>” marks the end of the inputs, and the final number “20” denotes the computed edit distance.
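A sketch of the pair generation and DP labeling (our own illustration; alphabet, names, and the rejection steps are simplified assumptions) is given below.

```python
import random

# Sketch (illustrative) of the edit-distance data: generate a string pair
# and label it with the DP edit distance (insert / delete / replace).
def edit_distance(a, b):
    dp = list(range(len(b) + 1))       # one-row Levenshtein DP
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete
                                     dp[j - 1] + 1,      # insert
                                     prev + (ca != cb))  # replace / match
    return dp[-1]

def make_pair(length, n_edits, alphabet="svdhe", rng=random):
    s = "".join(rng.choice(alphabet) for _ in range(length))
    t = list(s)
    for _ in range(n_edits):           # the "edited copy" branch (prob. 0.6)
        op = rng.choice("idr") if t else "i"
        pos = rng.randrange(len(t) + 1) if op == "i" else rng.randrange(len(t))
        if op == "i":   t.insert(pos, rng.choice(alphabet))
        elif op == "d": del t[pos]
        else:           t[pos] = rng.choice(alphabet)
    t = "".join(t)
    a, b = sorted((s, t), key=len)     # shorter string always placed first
    return f"{a} | {b} <sep> {edit_distance(a, b)}"

print(make_pair(10, 4))
```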
#### D.1.2 Training Configuration
Configuration of chain of thought
For CoT models, training is performed with supervision of step-by-step algorithms.
(1) Word problem: the CoT algorithm sequentially scans the token sequence, producing at each prefix the evaluation result of the expression step by step; the overall length of the CoT sequence thus matches the input length.
(2) Graph connectivity: following Bavandpour et al. (2025), the algorithm sequence is simply the trace of a breadth-first search (BFS) starting from the source $s$ . At each step, the model emits the incident edges of the currently expanded node in the order they are visited, and the sequence terminates as soon as the target $t$ is discovered. To implement this algorithm, we maintain a list (“scratchpad”) initialized with a dummy marker and the source, $(\texttt{N},s)$ . We iterate through this list from left to right (i.e., queue order). Whenever the current node $u$ is expanded, we append to the end of the list all incident edges $(u,v)$ for neighbors $v$ , followed by a separator token $(u,\texttt{N})$ . A code sketch follows this list.
(3) Arithmetic expression evaluation: following Feng et al. (2023), the CoT takes the fully expanded expression and repeatedly evaluates one innermost subexpression, writing down the simplified expression at each step until only a single numeral remains, e.g., $2*(0+1)/2\to 2*1/2\to 2/2\to 1$ . The overall CoT sequence has quadratic length.
(4) Edit distance: following Feng et al. (2023), the CoT algorithm outputs the DP table entries in the same order they are computed, i.e., row by row from top-left to bottom-right (topological order), which yields a quadratic number of steps in the input length.
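The BFS scratchpad of item (2) can be sketched as follows (our own illustration; names and the exact token tuples are assumptions).

```python
from collections import deque

# Sketch (illustrative) of the connectivity CoT supervision: the scratchpad
# trace of a BFS from s, emitting each expanded node's incident edges in
# visit order plus a separator (u, N), stopping once t is discovered.
def bfs_trace(adj, s, t):
    trace = [("N", s)]                 # dummy marker and the source
    queue, seen = deque([s]), {s}
    while queue:
        u = queue.popleft()            # left-to-right, i.e., queue order
        for v in adj.get(u, ()):
            trace.append((u, v))       # emit the incident edge
            if v == t:
                return trace, True     # terminate as soon as t is found
            if v not in seen:
                seen.add(v); queue.append(v)
        trace.append((u, "N"))         # separator: end of u's expansion
    return trace, False

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs_trace(adj, s=0, t=3))
```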
Optimization and model details.
We trained all models using the AdamW optimizer with a linear learning rate schedule. The initial learning rate was set to $1\times 10^{-4}$ with a weight decay of $0.01$ , and a batch size of $256$ . Training was continued until the training loss plateaued. For looped TFs, curriculum learning was applied to all tasks except edit distance: the input size was increased by $2$ for the word problem task, and by $4$ for the connectivity and arithmetic evaluation tasks. The model architecture was based on standard Transformers with an embedding dimension of $256$ and $4$ attention heads. The number of Transformer layers varied by task: two layers for word problems, a single layer for connectivity, and, for the looped TF on both arithmetic evaluation and edit distance, a single-layer time-modulated model (Xu and Sato, 2025) to stabilize training. For CoT, we use the same configuration of the Transformer block.
Uniform selection.
To estimate a lower bound on the number of reasoning steps required by CoT for each task, we first follow the procedure introduced in prior work (Bavandpour et al., 2025). Specifically, given a complete CoT trajectory consisting of $T$ intermediate reasoning steps, we construct shortened trajectories by uniformly selecting $k$ step indices from $\{1,\dots,T\}$ , i.e., by taking a uniformly spaced subsequence of the original trajectory. Only the selected intermediate steps are used, while the remaining steps are removed. We then evaluate task performance as a function of $k$ , as shown in Table 3.
Table 3: Results for CoT on parallelizable tasks trained with uniform selection. Accuracy (%) is reported under three shortened-trajectory settings and under the full trajectory; $n$ denotes the input size.
| Task | $n$ | Shortened (1) | Shortened (2) | Shortened (3) | Full CoT |
| --- | --- | --- | --- | --- | --- |
| Word Problem | 64 | 0.8 | 0.8 | 0.8 | 100.0 |
| Graph Connectivity | 32 | 81.0 | 81.4 | 83.6 | 100.0 |
| Arithmetic Evaluation | 16 | 41.0 | 41.5 | 41.2 | 82.5 |
| Edit Distance | 16 | 69.2 | 70.3 | 82.6 | 94.8 |
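The uniform selection of step indices described above can be sketched in a few lines (our own illustration using `numpy`; names are assumptions).

```python
import numpy as np

# Sketch (illustrative) of uniform selection: keep k uniformly spaced step
# indices out of a T-step CoT trajectory and drop the rest.
def uniform_select(trajectory, k):
    T = len(trajectory)
    idx = np.linspace(0, T - 1, num=k).round().astype(int)
    return [trajectory[i] for i in idx]

steps = [f"step{i}" for i in range(1, 101)]
print(uniform_select(steps, 4))  # ['step1', 'step34', 'step67', 'step100']
```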
Stepwise internalization.
We adopt stepwise internalization proposed by Deng et al. (2024), a curriculum-based training procedure that gradually removes CoT tokens and encourages the model to internalize intermediate reasoning within its hidden states. Starting from a model trained on full CoT trajectories, we progressively truncate intermediate reasoning tokens according to a predefined schedule and finetune the model at each stage. Specifically, when the CoT length is greater than $128$ , we remove $16$ tokens per stage until the remaining CoT length reaches $128$ . We then continue removing $8$ tokens per stage until the CoT length is reduced to $8$ , training the model for $16$ epochs at each stage. The results for each fundamental algorithmic reasoning task are shown in Fig. 10.
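The truncation schedule just described can be written as a short sketch (our own; stage lengths beyond the stated rules are assumptions).

```python
# Sketch (illustrative) of the stepwise-internalization schedule: CoT tokens
# kept at each finetuning stage, starting from the full trajectory length.
def schedule(full_len):
    lengths, cur = [full_len], full_len
    while cur > 8:
        cur -= 16 if cur > 128 else 8  # remove 16 tokens above length 128, else 8
        lengths.append(max(cur, 8))
    return lengths

print(schedule(160))  # [160, 144, 128, 120, 112, ..., 16, 8]
```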
[Figure 10 panels: accuracy (%) versus the number of CoT steps for Word Problem ($n=64$), Graph Connectivity ($n=32$), Arithmetic Evaluation ($n=16$), and Edit Distance ($n=16$).]
Figure 10: Accuracy as a function of the CoT steps during stepwise internalization.
### D.2 Approximate Counting and Approximate Sampling
#### D.2.1 Approximate Counting of DNF Formulas
To generate the dataset, we first construct a DNF formula $F$ by sampling $m$ clauses, each consisting of $w$ distinct literals over $n$ Boolean variables. Each literal is independently assigned to be either positive or negated. The formula is then serialized into a token sequence in which each clause is represented by its index together with variable–value pairs such as “ $2=$ $+1$ ” or “ $4=$ $-1$ ”. For looped TFs, we prepare $100{,}000$ training samples and $1{,}000$ test samples. For CoT models, we instead generate an online dataset with the following structure: to train the CoT models, we simulate a single trial of the randomized counting algorithm of Karp and Luby (1983). The sequence concatenates the serialized formula, the sampled clause, the full assignment, and the verification outcome, separated by <sep> tokens and terminated by <eos>. The CoT model is trained in an autoregressive manner, where prediction targets are defined by shifting the token sequence while masking out the formula description. We trained the models using the AdamW optimizer with a linear learning rate schedule. The initial learning rate was set to $1\times 10^{-4}$ , with a weight decay of $0.01$ . We used a batch size of $256$ (reduced to $32$ for the $1000$ -loop setting) and trained for $10{,}000$ iterations. For inference, we sum the number of output tokens (steps) used by CoT across trials, and count the number of loop iterations in the looped TF.
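A single Karp–Luby trial, which the CoT supervision simulates, can be sketched as follows (our own illustration; the clause encoding and names are assumptions, not the paper's code).

```python
import random

# Sketch (illustrative) of one Karp-Luby trial for DNF counting. A clause is
# a dict {var: required value}. The trial samples a clause proportionally to
# its number of satisfying assignments |S_i| = 2^(n - w_i), completes it to a
# full assignment, and checks whether it is the first clause satisfied.
def karp_luby_trial(clauses, n, rng=random):
    weights = [2 ** (n - len(c)) for c in clauses]       # |S_i| per clause
    i = rng.choices(range(len(clauses)), weights=weights)[0]
    x = {v: rng.randrange(2) for v in range(n)}          # random assignment...
    x.update(clauses[i])                                 # ...forced to satisfy clause i
    first = next(j for j, c in enumerate(clauses)
                 if all(x[v] == b for v, b in c.items()))
    return first == i, sum(weights)

# Estimate #F as (fraction of successful trials) * sum_i |S_i|.
clauses = [{0: 1}, {1: 1}]  # x0 OR x1 written as a DNF, n = 2, true count 3
trials = [karp_luby_trial(clauses, 2) for _ in range(20000)]
total = trials[0][1]
print(total * sum(ok for ok, _ in trials) / len(trials))  # ~ true count (3)
```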
#### D.2.2 Approximate Sampling of Graph Colorings
We consider the problem of approximately sampling a proper $k$ -coloring of a graph, also known as almost-uniform generation, a canonical randomized task closely related to $\#\mathsf{P}$ -hard counting problems. Given an undirected graph $G=(V,E)$ with $n=|V|$ vertices and maximum degree $\Delta$ , a proper $k$ -coloring is an assignment of colors from $\{1,\dots,k\}$ to vertices such that no adjacent vertices share the same color. Let $\Omega_{k}(G)$ denote the set of all proper $k$ -colorings of $G$ . We restrict attention to graphs of bounded degree and assume $k\geq 2\Delta+1.$ Under this condition, classical results in approximate counting show that proper $k$ -colorings admit an FPAUS (Jerrum, 1995). The approximation relies on Markov Chain Monte Carlo (MCMC) sampling using Glauber dynamics for graph colorings. Starting from an arbitrary proper coloring, the Markov chain repeatedly selects a vertex uniformly at random and proposes to recolor it with a randomly chosen color, accepting the update only if the resulting coloring remains proper. When $k\geq 2\Delta+1$ , this Markov chain is known to be rapidly mixing, converging to the uniform distribution over $\Omega_{k}(G)$ in polynomial time. For our experiments, we generate an undirected Erdős–Rényi random graph, where each edge is included independently with probability $p=\frac{1.7}{n},$ using a fixed random seed for reproducibility. For simplicity and ease of analysis, we focus on small graphs with $n=3$ and set the number of colors to $k=5$ .
Each sample is generated by running $T$ steps of Glauber dynamics on the space $\Omega_{k}(G)$ of proper $k$ -colorings. Starting from a greedy proper initialization, at each step we uniformly select a vertex and a color; the recoloring is accepted if and only if it preserves properness. The final state of the Markov chain is treated as an approximate sample from the uniform distribution over $\Omega_{k}(G)$ . In addition to the final coloring, we optionally record the entire sequence of proposals and accept/reject outcomes for CoT and use it as supervision, which is not included for latent thought.

To train sequence models, we serialize the graph structure, the initial coloring, and either the MCMC history or the final coloring into a single token sequence. We evaluate the distribution of solutions generated by the model rather than only solution accuracy. Given an input $x$ , we draw $N$ independent samples from the model using ancestral decoding. Generation continues until an end-of-sequence (EOS) token is produced, and from each generated sequence we extract the final $n$ tokens immediately preceding the first EOS token, which encode a candidate solution. The resulting samples define an empirical distribution $\hat{P}$ over generated solutions via normalized occurrence counts. For each instance, the task provides an exact enumeration $\mathcal{Y}$ of all valid solutions. We define the reference distribution $P_{\mathrm{true}}$ as the uniform distribution over this solution set, assigning probability $1/|\mathcal{Y}|$ to each $y\in\mathcal{Y}$ and zero otherwise. We quantify the discrepancy between the empirical distribution $\hat{P}$ and the uniform reference distribution using the total variation distance $\frac{1}{2}\sum_{y\in\mathcal{Y}\cup\hat{\mathcal{Y}}}\left|\hat{P}(y)-P_{\mathrm{true}}(y)\right|.$

We trained the models using the AdamW optimizer with a linear learning rate schedule. The initial learning rate was set to $1\times 10^{-4}$ , with a weight decay of $0.01$ . We used a batch size of $256$ and trained for $5{,}000$ iterations. For inference, we measure the average number of steps (loops) per generation. We generate $N=50{,}000$ samples for CoT and $N=10{,}000$ samples for looped TFs. The resulting histograms of CoT outputs over the target support are shown in Figure 11.
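The Glauber sampler and the TV evaluation can be sketched as follows (our own illustration on a toy path graph; the graph, initialization, and names are assumptions, not the experimental configuration).

```python
import random
from collections import Counter
from itertools import product

# Sketch (illustrative) of the coloring experiment: Glauber dynamics over
# proper k-colorings, plus the TV distance of the empirical distribution
# from uniform over all proper colorings (enumerable for tiny n).
def glauber(adj, k, coloring, steps, rng=random):
    c = list(coloring)
    for _ in range(steps):
        v = rng.randrange(len(c))
        color = rng.randrange(k)
        if all(c[u] != color for u in adj[v]):  # accept iff still proper
            c[v] = color
    return tuple(c)

adj = {0: [1], 1: [0, 2], 2: [1]}               # path graph, n = 3
k, init, N = 5, (0, 1, 0), 10000
samples = Counter(glauber(adj, k, init, steps=30) for _ in range(N))
proper = [c for c in product(range(k), repeat=3)
          if all(c[u] != c[v] for u in adj for v in adj[u] if u < v)]
tv = 0.5 * sum(abs(samples[c] / N - 1 / len(proper))
               for c in set(proper) | set(samples))
print(len(proper), tv)  # 80 proper colorings; TV shrinks as steps grow
```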
[Figure 11, panels x14–x16: bar charts of CoT output counts over colorings (sorted by count), each with a dashed line marking the uniform level. Panel titles: "CoT with 30 steps, TV 0.272"; "CoT with 60 steps, TV 0.231"; "CoT with 90 steps, TV 0.027".]
Figure 11: Output distributions of CoT compared against the uniform target distribution. The total variation distance to uniform decreases as the number of CoT steps increases (0.272, 0.231, and 0.027 for 30, 60, and 90 steps, respectively).