# Training Large Language Models to Reason in a Continuous Latent Space
**Authors**: Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian
¹FAIR at Meta  ²UC San Diego  (*Work done at Meta)
(December 23, 2025)
## Abstract
Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens primarily serve textual coherence and are not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of natural language, we introduce a new paradigm, Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed a "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path as in CoT. Coconut outperforms CoT on certain logical reasoning tasks that require substantial backtracking during planning, while using fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
## 1 Introduction
Large language models (LLMs) have demonstrated remarkable reasoning abilities, emerging from extensive pretraining on human languages (Dubey et al., 2024; Achiam et al., 2023). While next-token prediction is an effective training objective, it imposes a fundamental constraint on the LLM as a reasoning machine: the explicit reasoning process must be generated in word tokens. For example, a prevalent approach, known as chain-of-thought (CoT) reasoning (Wei et al., 2022), involves prompting or training LLMs to generate solutions step by step using natural language. However, this stands in stark contrast to findings from human cognition research. Neuroimaging studies have consistently shown that the language network, a set of brain regions responsible for language comprehension and production, remains largely inactive during various reasoning tasks (Amalric and Dehaene, 2019; Monti et al., 2012, 2007, 2009; Fedorenko et al., 2011). Further evidence indicates that human language is optimized for communication rather than reasoning (Fedorenko et al., 2024).
A significant issue arises when LLMs use language for reasoning: the amount of computation required for each reasoning token varies greatly, yet current LLM architectures allocate nearly the same computing budget to predicting every token. Most tokens in a reasoning chain are generated solely for fluency, contributing little to the actual reasoning process. In contrast, some critical tokens require complex planning and pose huge challenges to LLMs. While previous work has attempted to address these problems by prompting LLMs to generate succinct reasoning chains (Madaan and Yazdanbakhsh, 2022), or by performing additional reasoning before generating certain critical tokens (Zelikman et al., 2024), these solutions remain constrained within the language space and do not resolve the fundamental problems. Ideally, LLMs would have the freedom to reason without any language constraints, and translate their findings into language only when necessary.
<details>
<summary>figures/figure_1_meta_3.png Details</summary>

### Visual Description
## Diagram: Comparison of Chain-of-Thought (CoT) and Chain of Continuous Thought (Coconut) Reasoning Methods
### Overview
The image is a technical diagram comparing two prompting/reasoning methods for Large Language Models (LLMs). It visually contrasts the standard **Chain-of-Thought (CoT)** process with a proposed **Chain of Continuous Thought (Coconut)** method. The diagram is split into two distinct panels, each illustrating the flow of data through a "Large Language Model" block.
### Components/Axes
The diagram is organized into two main panels, left and right, each with a consistent vertical flow from bottom (input) to top (output).
**Left Panel: Chain-of-Thought (CoT)**
* **Title:** "Chain-of-Thought (CoT)" at the top.
* **Central Component:** A large, grey, rounded rectangle labeled "Large Language Model".
* **Input Layer (Bottom):**
* **Label (left):** "input token"
* **Elements:** A sequence of tokens: `[Question]`, `x_i`, `x_{i+1}`, `x_{i+2}`, `...`, `x_{i+j}`.
* **Label (left, above tokens):** "input embedding"
* **Elements:** Corresponding yellow rounded rectangles (embeddings) above each input token.
* **Processing & Output Layer (Top):**
* **Label (left):** "last hidden state"
* **Elements:** A sequence of purple rounded rectangles (hidden states) emerging from the top of the LLM block.
* **Label (left, above hidden states):** "output token (sampling)"
* **Elements:** Tokens `x_i`, `x_{i+1}`, `x_{i+2}`, `...`, `x_{i+j}` and finally `[Answer]` are shown above the hidden states, with upward arrows indicating they are sampled from them.
* **Flow:** Arrows show a sequential, auto-regressive process. The output token `x_i` is fed back as the next input token, creating a loop. The final output is `[Answer]`.
**Right Panel: Chain of Continuous Thought (Coconut)**
* **Title:** "Chain of Continuous Thought (Coconut)" at the top. The word "Continuous" is highlighted in orange.
* **Subtitle/Annotation:** "Last hidden states are used as input embeddings" is written below the title.
* **Central Component:** A large, grey, rounded rectangle labeled "Large Language Model".
* **Input Layer (Bottom):**
* **Elements:** A sequence starting with `[Question]`, followed by a special token `<bot>` (beginning of thought), then a series of orange-to-purple gradient rounded rectangles, and ending with a special token `<eot>` (end of thought).
* **Flow:** Arrows show that the hidden state from one step is directly used as the input embedding for the next step, creating a continuous chain within the model's latent space.
* **Processing & Output Layer (Top):**
* **Elements:** A series of orange-to-purple gradient rounded rectangles (continuous thought states) emerge from the top of the LLM block.
* **Final Output:** A single purple hidden state leads to the final output token `[Answer]`, indicated by an upward arrow.
* **Key Difference:** There is no intermediate sampling of discrete output tokens (`x_i`, `x_{i+1}`, etc.) during the reasoning chain. The process operates on continuous hidden states until the final answer is generated.
### Detailed Analysis
**Chain-of-Thought (CoT) Process Flow:**
1. The model receives an initial `[Question]` token.
2. It generates a sequence of intermediate reasoning tokens (`x_i`, `x_{i+1}`, `x_{i+2}`...).
3. Each generated token is explicitly sampled (converted from a hidden state to a discrete token) and then fed back as the next input.
4. This creates a visible, discrete chain of "thought" tokens before the final `[Answer]` is produced.
**Chain of Continuous Thought (Coconut) Process Flow:**
1. The model receives an initial `[Question]` token and a special `<bot>` token.
2. Instead of generating discrete tokens, the model's last hidden state from one step is directly used as the input embedding for the next step.
3. This forms a chain of continuous thought states (represented by gradient-colored blocks) within the model's internal representation space.
4. After a series of these continuous steps, signaled by an `<eot>` token, the model produces the final `[Answer]`.
### Key Observations
1. **Discrete vs. Continuous:** The core distinction is that CoT reasons using a sequence of discrete, sampled tokens, while Coconut reasons using a sequence of continuous hidden state vectors.
2. **Special Tokens:** Coconut introduces explicit control tokens (`<bot>`, `<eot>`) to manage the continuous reasoning process, which are absent in the standard CoT diagram.
3. **Feedback Loop:** CoT has an explicit external feedback loop (output token becomes next input). Coconut's feedback loop is internal, passing hidden states directly.
4. **Visual Coding:** Color is used meaningfully. CoT uses solid yellow (input embeddings) and solid purple (hidden states). Coconut uses a yellow-to-orange-to-purple gradient for its continuous thought states, visually representing the transformation of information.
### Interpretation
This diagram illustrates a fundamental shift in how an LLM can perform multi-step reasoning.
* **What it suggests:** The Coconut method proposes that forcing a model to generate discrete language tokens for every reasoning step (as in CoT) may be a bottleneck or an unnecessary constraint. By allowing the model to reason in its native, continuous hidden state space, it might achieve more efficient, flexible, or powerful reasoning.
* **Relationship between elements:** The diagram positions Coconut as an evolution or alternative to CoT. Both start with a question and end with an answer, but the "black box" of intermediate reasoning is handled differently. CoT's process is transparent but rigid (tied to vocabulary), while Coconut's is opaque (continuous vectors) but potentially more fluid.
* **Notable implications:** The use of `<bot>` and `<eot>` tokens suggests the continuous reasoning process has a defined start and stop point, which is crucial for control. The absence of intermediate text output in Coconut implies the reasoning process is not directly interpretable to humans in real-time, which is a trade-off between potential performance and explainability. This diagram is likely from a research paper proposing the Coconut method, arguing that moving the reasoning chain from the discrete token space to the continuous latent space of the model is a promising direction for improving LLM capabilities.
</details>
Figure 1: A comparison of Chain of Continuous Thought (Coconut) with Chain-of-Thought (CoT). In CoT, the model generates the reasoning process as a word token sequence (e.g., $[x_i, x_{i+1}, \ldots, x_{i+j}]$ in the figure). Coconut regards the last hidden state as a representation of the reasoning state (termed "continuous thought"), and directly uses it as the next input embedding. This allows the LLM to reason in an unrestricted latent space instead of a language space.
In this work we instead explore LLM reasoning in a latent space by introducing a novel paradigm, Coconut (Chain of Continuous Thought). It involves a simple modification to the traditional CoT process: instead of mapping between hidden states and language tokens using the language model head and embedding layer, Coconut directly feeds the last hidden state (a continuous thought) as the input embedding for the next token (Figure 1). This modification frees the reasoning from being within the language space, and the system can be optimized end-to-end by gradient descent, as continuous thoughts are fully differentiable. To enhance the training of latent reasoning, we employ a multi-stage training strategy inspired by Deng et al. (2024), which effectively utilizes language reasoning chains to guide the training process.
Interestingly, our proposed paradigm leads to an efficient reasoning pattern. Unlike language-based reasoning, continuous thoughts in Coconut can encode multiple potential next steps simultaneously, allowing for a reasoning process akin to breadth-first search (BFS). While the model may not initially make the correct decision, it can maintain many possible options within the continuous thoughts and progressively eliminate incorrect paths through reasoning, guided by some implicit value functions. This advanced reasoning mechanism surpasses traditional CoT, even though the model is not explicitly trained or instructed to operate in this manner, as seen in previous works (Yao et al., 2023; Hao et al., 2023).
Experimentally, Coconut successfully enhances the reasoning capabilities of LLMs. For math reasoning (GSM8k, Cobbe et al., 2021), using continuous thoughts is shown to be beneficial to reasoning accuracy, mirroring the effects of language reasoning chains. This indicates the potential to scale and solve increasingly challenging problems by chaining more continuous thoughts. On logical reasoning tasks, including ProntoQA (Saparov and He, 2022) and our newly proposed ProsQA (Section 4.1), which requires stronger planning ability, Coconut and some of its variants even surpass language-based CoT methods while generating significantly fewer tokens during inference. We believe these findings underscore the potential of latent reasoning and can provide valuable insights for future research.
## 2 Related Work
Chain-of-thought (CoT) reasoning. We use the term chain-of-thought broadly to refer to methods that generate an intermediate reasoning process in language before outputting the final answer. This includes prompting LLMs (Wei et al., 2022; Khot et al., 2022; Zhou et al., 2022), or training LLMs to generate reasoning chains, either with supervised finetuning (Yue et al., 2023; Yu et al., 2023) or reinforcement learning (Wang et al., 2024; Havrilla et al., 2024; Shao et al., 2024; Yu et al., 2024a). Madaan and Yazdanbakhsh (2022) classified the tokens in CoT into symbols, patterns, and text, and proposed to guide the LLM to generate concise CoT based on analysis of their roles. Recent theoretical analyses have demonstrated the usefulness of CoT from the perspective of model expressivity (Feng et al., 2023; Merrill and Sabharwal, 2023; Li et al., 2024). By employing CoT, the effective depth of the transformer increases because the generated outputs are looped back to the input (Feng et al., 2023). These analyses, combined with the established effectiveness of CoT, motivated our design that feeds the continuous thoughts back to the LLM as the next input embedding. While CoT has proven effective for certain tasks, its autoregressive generation nature makes it challenging to mimic human reasoning on more complex problems (LeCun, 2022; Hao et al., 2023), which typically require planning and search. There are works that equip LLMs with explicit tree search algorithms (Xie et al., 2023; Yao et al., 2023; Hao et al., 2024), or train the LLM on search dynamics and trajectories (Lehnert et al., 2024; Gandhi et al., 2024; Su et al., 2024). In our analysis, we find that after removing the constraint of a language space, a new reasoning pattern similar to BFS emerges, even though the model is not explicitly trained in this way.
Latent reasoning in LLMs. Previous works mostly define latent reasoning in LLMs as the hidden computation in transformers (Yang et al., 2024; Biran et al., 2024). Yang et al. (2024) constructed a dataset of two-hop reasoning problems and discovered that it is possible to recover the intermediate variable from the hidden representations. Biran et al. (2024) further proposed to intervene in the latent reasoning by "back-patching" the hidden representation. Shalev et al. (2024) discovered parallel latent reasoning paths in LLMs. Another line of work has shown that, even when the model generates a CoT to reason, it may actually rely on a different latent reasoning process. This phenomenon is known as the unfaithfulness of CoT reasoning (Wang et al., 2022; Turpin et al., 2024). To enhance the latent reasoning of LLMs, previous research proposed augmenting them with additional tokens. Goyal et al. (2023) pretrained the model by randomly inserting a learnable <pause> token into the training corpus. This improves the LLM's performance on a variety of tasks, especially when followed by supervised finetuning with <pause> tokens. On the other hand, Pfau et al. (2024) further explored the usage of filler tokens, e.g., "...", and concluded that they work well for highly parallelizable problems. However, Pfau et al. (2024) noted that these methods do not extend the expressivity of the LLM the way CoT does; hence, they may not scale to more general and complex reasoning problems. Wang et al. (2023) proposed to predict a planning token as a discrete latent variable before generating the next reasoning step. Recently, it has also been found that one can "internalize" the CoT reasoning into latent reasoning in the transformer with knowledge distillation (Deng et al., 2023) or a special training curriculum which gradually shortens CoT (Deng et al., 2024). Yu et al. (2024b) also proposed to distill a model that can reason latently from data generated with complex reasoning algorithms. These training methods can be combined with our framework, and specifically, we find that breaking down the learning of continuous thoughts into multiple stages, inspired by iCoT (Deng et al., 2024), is very beneficial for training. Recently, looped transformers (Giannou et al., 2023; Fan et al., 2024) have been proposed to solve algorithmic tasks; their computing process bears some similarity to continuous thoughts, but we focus on common reasoning tasks and aim at investigating latent reasoning in comparison to language-space reasoning.
## 3 Coconut : Chain of Continuous Thought
In this section, we introduce our new paradigm Coconut (Chain of Continuous Thought) for reasoning in an unconstrained latent space. We begin by introducing the background and notation we use for language models. For an input sequence $x=(x_1,...,x_T)$ , the standard large language model $M$ can be described as:
$$
H_t = \mathrm{Transformer}(E_t + P_t)
$$
$$
M(x_{t+1} \mid x_{\leq t}) = \mathrm{softmax}(W h_t)
$$
where $E_t=[e(x_1), e(x_2), ..., e(x_t)]$ is the sequence of token embeddings up to position $t$; $P_t=[p(1), p(2), ..., p(t)]$ is the sequence of positional embeddings up to position $t$; $H_t \in \mathbb{R}^{t \times d}$ is the matrix of the last hidden states for all tokens up to position $t$; $h_t$ is the last hidden state of position $t$, i.e., $h_t=H_t[t,:]$; $e(\cdot)$ is the token embedding function; $p(\cdot)$ is the positional embedding function; $W$ is the parameter of the language model head.
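To make this notation concrete, the following is a minimal PyTorch sketch of the standard forward pass above. The toy dimensions and the `nn.TransformerEncoder` stand-in are our own illustration, not the GPT-2 implementation used in the experiments.

```python
import torch
import torch.nn as nn

vocab_size, d_model, n_positions = 100, 32, 64

token_emb = nn.Embedding(vocab_size, d_model)            # e(.)
pos_emb = nn.Embedding(n_positions, d_model)             # p(.)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size, bias=False)     # W

x = torch.randint(0, vocab_size, (1, 10))                # x_1, ..., x_t
E_t = token_emb(x)                                       # [e(x_1), ..., e(x_t)]
P_t = pos_emb(torch.arange(x.size(1))).unsqueeze(0)      # [p(1), ..., p(t)]

# causal mask so position t only attends to positions <= t
causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
H_t = transformer(E_t + P_t, mask=causal)                # last hidden states, (1, t, d)
h_t = H_t[:, -1]                                         # h_t = H_t[t, :]
M_next = torch.softmax(lm_head(h_t), dim=-1)             # M(x_{t+1} | x_{<=t})
```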
Method Overview. In the proposed Coconut method, the LLM switches between the "language mode" and "latent mode" (Figure 1). In language mode, the model operates as a standard language model, autoregressively generating the next token. In latent mode, it directly utilizes the last hidden state as the next input embedding. This last hidden state represents the current reasoning state, termed a "continuous thought".
Special tokens <bot> and <eot> are employed to mark the beginning and end of the latent thought mode, respectively. As an example, we assume latent reasoning occurs between positions $i$ and $j$, i.e., $x_i=$ <bot> and $x_j=$ <eot>. When the model is in the latent mode ($i<t<j$), we use the last hidden state from the previous token to replace the input embedding, i.e., $E_t=[e(x_1), e(x_2), ..., e(x_i), h_i, h_{i+1}, ..., h_{t-1}]$. After the latent mode finishes ($t \geq j$), the input reverts to using the token embedding, i.e., $E_t=[e(x_1), e(x_2), ..., e(x_i), h_i, h_{i+1}, ..., h_{j-1}, e(x_j), ..., e(x_t)]$. It is noteworthy that $M(x_{t+1} \mid x_{\leq t})$ is not defined when $i<t<j$, since the latent thought is not intended to be mapped back to language space. However, $\mathrm{softmax}(W h_t)$ can still be calculated for probing purposes (see Section 4).
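As an illustration, here is a hedged sketch of how the input embedding sequence $E_t$ could be assembled under this mode-switching rule; the position-indexed `last_hidden` dictionary and the function name are our own assumptions, not the authors' code.

```python
import torch

def build_inputs_embeds(token_ids, last_hidden, i, j, embed_fn):
    """Assemble E_t under Coconut's mode-switching rule (1-indexed positions).

    For t <= i and t >= j the input is the token embedding e(x_t); for the
    latent positions i < t < j it is the previous position's last hidden
    state h_{t-1}. `last_hidden[t]` holds h_t from an earlier forward pass.
    """
    embeds = []
    for t in range(1, len(token_ids) + 1):
        if i < t < j:
            embeds.append(last_hidden[t - 1])          # latent mode: h_{t-1}
        else:
            embeds.append(embed_fn(token_ids[t - 1]))  # language mode: e(x_t)
    return torch.stack(embeds)                         # E_t, shape (t, d)
```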
Training Procedure. In this work, we focus on a problem-solving setting where the model receives a question as input and is expected to generate an answer through a reasoning process. We leverage language CoT data to supervise continuous thought by implementing a multi-stage training curriculum inspired by Deng et al. (2024). As shown in Figure 2, in the initial stage, the model is trained on regular CoT instances. In the subsequent stages, at the $k$-th stage, the first $k$ reasoning steps in the CoT are replaced with $k \times c$ continuous thoughts, where $c$ is a hyperparameter controlling the number of latent thoughts that replace a single language reasoning step. (If a language reasoning chain is shorter than $k$ steps, all the language thoughts are removed.) Following Deng et al. (2024), we also reset the optimizer state when training stages switch. We insert <bot> and <eot> tokens to encapsulate the continuous thoughts.
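The stage-wise data construction can be sketched as follows, following Figure 2. The token names and the capping behavior for short chains are our reading of the procedure, not the authors' released code.

```python
BOT, EOT, LATENT = "<bot>", "<eot>", "<latent>"

def make_stage_example(question, steps, answer, k, c=1):
    """Sketch of stage-k example construction (Figure 2).

    The first k language reasoning steps are replaced with k*c latent slots
    (capped when the chain is shorter than k steps, in which case all
    language steps are removed). The <latent> placeholders are overwritten
    with hidden states during the forward pass, not embedded as tokens.
    """
    n_latent = min(k, len(steps)) * c
    return question + [BOT] + [LATENT] * n_latent + [EOT] + steps[k:] + answer

# Stage 1 with c=1:
# make_stage_example(["[Q]"], ["[Step 1]", "[Step 2]"], ["[A]"], k=1)
# -> ["[Q]", "<bot>", "<latent>", "<eot>", "[Step 2]", "[A]"]
```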
<details>
<summary>figures/figure_2_meta_5.png Details</summary>

### Visual Description
## Diagram: Language CoT (Chain-of-Thought) Training Data Structure
### Overview
The image is a technical diagram illustrating the structure of training data for a language model using a Chain-of-Thought (CoT) approach. It depicts a progressive training methodology across multiple stages, where the model learns to generate intermediate reasoning steps ("Thoughts") before producing a final answer. The diagram is composed of a main content area on the left and a legend on the right.
### Components/Axes
**Legend (Top-Right Corner):**
* `[Thought]` : continuous thought (represented by a rounded rectangle with a yellow-to-green gradient fill).
* `[ ... ]` : sequence of tokens (represented by text in square brackets).
* `<...>` : special token (represented by text in angle brackets).
* `___` : calculating loss (represented by an underline beneath text).
**Main Content (Left and Center):**
The diagram is organized into rows, each representing a different stage in the training process. The stages are labeled on the far left.
* **Header Row:** Labeled "Language CoT (training data)". Contains a single gray rounded rectangle with the text: `[Question] [Step 1] [Step 2] [Step 3] ... [Step N] [Answer]`.
* **Stage 0:** Labeled "Stage 0". Contains a gray rounded rectangle with the text: `[Question] <bot> <eot> [Step 1] [Step 2] ... [Step N] [Answer]`. The segments `[Step 1]`, `[Step 2]`, and `[Step N]` are underlined.
* **Stage 1:** Labeled "Stage 1". Contains a gray rounded rectangle with the text: `[Question] <bot> [Thought] <eot> [Step 2] [Step 3] ... [Step N] [Answer]`. The `[Thought]` block has the yellow-green gradient. The segments `[Step 2]`, `[Step 3]`, and `[Step N]` are underlined.
* **Stage 2:** Labeled "Stage 2". Contains a gray rounded rectangle with the text: `[Question] <bot> [Thought] [Thought] <eot> [Step 3] ... [Step N] [Answer]`. Both `[Thought]` blocks have the yellow-green gradient. The segments `[Step 3]` and `[Step N]` are underlined.
* **Ellipsis Row:** Contains a single centered ellipsis (`...`), indicating intermediate stages between Stage 2 and Stage N.
* **Stage N:** Labeled "Stage N". Contains a gray rounded rectangle with the text: `[Question] <bot> [Thought] [Thought] ... [Thought] <eot> [Answer]`. Multiple `[Thought]` blocks are shown, all with the yellow-green gradient. The final `[Answer]` segment is underlined.
### Detailed Analysis
The diagram outlines a curriculum or progressive training schedule:
1. **Baseline (Language CoT Header):** The standard format is a linear sequence: Question -> multiple reasoning Steps -> Answer.
2. **Stage 0 (Initialization):** The model is trained with the question followed immediately by special tokens `<bot>` (beginning of thought) and `<eot>` (end of thought), with no generated thoughts in between. The loss is calculated on the explicit reasoning steps (`[Step 1]` through `[Step N]`) and the final answer.
3. **Stage 1 (Single Thought):** The model is now trained to generate a single continuous `[Thought]` block after `<bot>` and before `<eot>`. The loss calculation shifts: it is no longer performed on `[Step 1]` (which is replaced by the model-generated thought), but begins from `[Step 2]` onward to the answer.
4. **Stage 2 (Multiple Thoughts):** The model generates two `[Thought]` blocks. The loss calculation shifts further, starting from `[Step 3]`.
5. **Progression to Stage N:** The pattern continues. At each subsequent stage, the model generates an additional `[Thought]` block, and the supervised loss calculation on the explicit "Step" tokens begins one step later. By **Stage N**, the model generates a sequence of thoughts that fully replaces all explicit intermediate steps (`[Step 1]` to `[Step N]`), and the loss is calculated only on the final `[Answer]`.
### Key Observations
* **Spatial Arrangement:** The legend is consistently placed in the top-right. The stages are arranged in a clear top-to-bottom vertical flow, implying a temporal or sequential progression in training.
* **Visual Coding:** The `[Thought]` blocks are uniquely identified by a color gradient, distinguishing model-generated content from the static token sequences like `[Question]` and `[Step X]`.
* **Loss Calculation Migration:** The underlined segments (`___`) visually track how the objective function (loss calculation) migrates from being applied to all explicit steps to being applied only to the final answer as the model learns to internalize the reasoning process.
* **Special Token Function:** `<bot>` and `<eot>` act as delimiters, framing the section where the model is expected to generate its chain of thought.
### Interpretation
This diagram illustrates a **progressive distillation or internalization training protocol** for teaching a language model to perform chain-of-thought reasoning. The core idea is to gradually shift the model's reliance from memorizing and reproducing explicit, human-written reasoning steps (`[Step 1, 2, ... N]`) to generating its own continuous reasoning process (`[Thought]`).
* **What it demonstrates:** It's a method for moving from supervised learning on step-by-step demonstrations to a more autonomous form of reasoning. The early stages provide strong supervision on the reasoning structure, while later stages encourage the model to develop its own internal representation of the thought process.
* **Relationship between elements:** Each stage builds directly upon the previous one. The `<bot>`/`<eot>` tokens provide the structural scaffold, the `[Thought]` blocks represent the model's learned internal reasoning, and the migrating loss calculation (`___`) is the training signal that drives this internalization.
* **Notable Anomaly/Insight:** The key insight is the **decoupling of the "thought generation" task from the "answer generation" task** in the loss function. By Stage N, the model is only supervised on producing the correct final answer given its own generated thoughts, which is a form of **reinforcement of the reasoning-to-answer mapping**. This suggests the goal is not just to generate plausible thoughts, but to generate thoughts that are *useful for arriving at the correct answer*.
</details>
Figure 2: Training procedure of Chain of Continuous Thought (Coconut). Given training data with language reasoning steps, at each training stage we integrate $c$ additional continuous thoughts ( $c=1$ in this example), and remove one language reasoning step. The cross-entropy loss is then used on the remaining tokens after continuous thoughts.
During the training process, we mask the loss on questions and latent thoughts. It is important to note that the objective does not encourage the continuous thought to compress the removed language thought, but rather to facilitate the prediction of future reasoning. Therefore, it's possible for the LLM to learn more effective representations of reasoning steps than human language provides.
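A minimal sketch of this loss masking, assuming HuggingFace-style labels where positions set to -100 are ignored by the cross-entropy loss; the exact treatment of <eot> depends on the termination strategy described under Inference Process below.

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch cross-entropy / HF loss

def make_labels(input_ids, question_len, n_latent):
    """Mask the loss on the question and the latent span.

    We mask the question tokens plus <bot> and the n_latent
    continuous-thought positions, leaving supervision on the remaining
    language tokens (a sketch; interfaces are our own).
    """
    labels = input_ids.clone()
    labels[:, : question_len + 1 + n_latent] = IGNORE_INDEX
    return labels
```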
Training Details. Our proposed continuous thoughts are fully differentiable and allow for back-propagation. We perform $n+1$ forward passes when $n$ latent thoughts are scheduled in the current training stage, computing a new latent thought with each pass and finally conducting an additional forward pass to obtain the loss on the remaining text sequence. While we can avoid repetitive computation by using a KV cache, the sequential nature of the multiple forward passes poses challenges for parallelism. Further optimizing the training efficiency of Coconut remains an important direction for future research.
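A hedged sketch of these $n+1$ forward passes with a KV cache, assuming a HuggingFace GPT-2 backbone (as in our experiments); the function interface is our own illustration, not the released implementation.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

def coconut_forward(prefix_embeds, suffix_embeds, n_latent):
    """Run the n+1 sequential forward passes described above (n_latent >= 1).

    `prefix_embeds` covers the question up through <bot>; `suffix_embeds`
    starts with e(<eot>) and covers the remaining supervised text. The KV
    cache avoids recomputing the prefix, and gradients flow through the
    hidden states because they are never detached.
    """
    out = model(inputs_embeds=prefix_embeds, use_cache=True,
                output_hidden_states=True)
    thought = out.hidden_states[-1][:, -1:, :]       # input for thought 1
    for _ in range(n_latent - 1):                    # passes 2 .. n
        out = model(inputs_embeds=thought,
                    past_key_values=out.past_key_values,
                    use_cache=True, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]
    # final pass: last thought, then <eot> and the remaining tokens
    out = model(inputs_embeds=torch.cat([thought, suffix_embeds], dim=1),
                past_key_values=out.past_key_values, use_cache=True)
    return out.logits                                # scored for the loss
```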
Inference Process. The inference process for Coconut is analogous to standard language model decoding, except that in latent mode, we directly feed the last hidden state as the next input embedding. A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent reasoning, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiment for simplicity, unless specified otherwise.
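A sketch of this inference procedure with a fixed latent length (option b), again assuming a HuggingFace-style model. `prompt_embeds` is assumed to cover the question plus <bot>, and `eot_embed` to be the embedding of <eot>; both special tokens would have been added to the vocabulary during training.

```python
import torch

@torch.no_grad()
def coconut_generate(model, prompt_embeds, eot_embed, n_latent,
                     max_new_tokens=64):
    """Greedy decoding with a fixed number of continuous thoughts (a sketch)."""
    out = model(inputs_embeds=prompt_embeds, use_cache=True,
                output_hidden_states=True)
    for _ in range(n_latent):                        # latent mode
        thought = out.hidden_states[-1][:, -1:, :]
        out = model(inputs_embeds=thought,
                    past_key_values=out.past_key_values,
                    use_cache=True, output_hidden_states=True)
    out = model(inputs_embeds=eot_embed,             # back to language mode
                past_key_values=out.past_key_values, use_cache=True)
    generated = []
    for _ in range(max_new_tokens):                  # standard greedy decoding
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        out = model(input_ids=next_id,
                    past_key_values=out.past_key_values, use_cache=True)
    return torch.cat(generated, dim=-1)
```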
## 4 Experiments
We validate the feasibility of LLM reasoning in a continuous latent space through experiments on three datasets. We mainly evaluate accuracy by comparing the model-generated answers with the ground truth. The number of newly generated tokens per question is also analyzed, as a measure of reasoning efficiency. We report the wall-clock time comparison in Appendix 8.
### 4.1 Reasoning Tasks
Math Reasoning. We use GSM8k (Cobbe et al., 2021) as the dataset for math reasoning. It consists of grade school-level math problems. Compared to the other datasets in our experiments, the problems are more diverse and open-domain, closely resembling real-world use cases. Through this task, we explore the potential of latent reasoning in practical applications. To train the model, we use a synthetic dataset generated by Deng et al. (2023).
Logical Reasoning. Logical reasoning involves the proper application of known conditions to prove or disprove a conclusion using logical rules. This requires the model to choose from multiple possible reasoning paths, where the correct decision often relies on exploration and planning ahead. We use 5-hop ProntoQA (Saparov and He, 2022) questions, with fictional concept names. For each problem, a tree-structured ontology is randomly generated and described in natural language as a set of known conditions. The model is asked to judge whether a given statement is correct based on these conditions. This serves as a simplified simulation of more advanced reasoning tasks, such as automated theorem proving (Chen et al., 2023; DeepMind, 2024).
We found that ProntoQA could be made more challenging: the distracting branches in its randomly generated ontologies are always small, reducing the need for complex planning. To address this, we apply a new dataset construction pipeline that uses randomly generated DAGs to structure the known conditions. The resulting dataset requires the model to perform substantial planning and searching over the graph to find the correct reasoning chain. We refer to this new dataset as ProsQA (**Pro**of with **S**earch **Q**uestion-**A**nswering). A visualized example is shown in Figure 7. More details of the datasets can be found in Appendix 7.
### 4.2 Experimental Setup
We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to $1 \times 10^{-4}$ and the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when the training stages switch.
Math Reasoning. By default, we use 2 latent thoughts (i.e., $c=2$) for each reasoning step. We analyze the correlation between performance and $c$ in Section 4.4 (Figure 3). The model goes through 3 stages besides the initial stage. We then add one more stage, where we still use $3 \times c$ continuous thoughts as in the penultimate stage, but remove the remaining language reasoning chain entirely. This handles the long-tail distribution of reasoning chains longer than 3 steps. We train the model for 6 epochs in the initial stage, and 3 epochs in each remaining stage.
Logical Reasoning. We use one continuous thought for every reasoning step (i.e., $c=1$ ). The model goes through 6 training stages in addition to the initial stage, because the maximum number of reasoning steps is 6 in these two datasets. The model then fully reasons with continuous thoughts to solve the problems in the last stage. We train the model for 5 epochs per stage.
For all datasets, after the standard schedule, the model remains in the final training stage until the 50th epoch. We select the checkpoint based on accuracy on the validation set. For inference, we manually set the number of continuous thoughts to be consistent with the final training stage. We use greedy decoding for all experiments.
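For reference, the hyperparameters above can be consolidated as follows; the dataclass is purely our own summary device, not the authors' released configuration format.

```python
from dataclasses import dataclass

@dataclass
class StageSchedule:
    c: int                # continuous thoughts per language reasoning step
    extra_stages: int     # training stages beyond the initial (full-CoT) stage
    epochs_initial: int
    epochs_per_stage: int

COMMON = dict(base_model="gpt2", lr=1e-4, effective_batch_size=128,
              reset_optimizer_between_stages=True, max_epochs=50,
              decoding="greedy")

# GSM8k: 3 stages plus one extra stage that keeps 3*c thoughts but drops
# all remaining language steps (long-tail handling).
GSM8K = StageSchedule(c=2, extra_stages=3 + 1, epochs_initial=6,
                      epochs_per_stage=3)
PRONTOQA = PROSQA = StageSchedule(c=1, extra_stages=6, epochs_initial=5,
                                  epochs_per_stage=5)
```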
### 4.3 Baselines and Variants of Coconut
We consider the following baselines: (1) CoT: We use the complete reasoning chains to train the language model with supervised finetuning, and during inference, the model generates a reasoning chain before outputting an answer. (2) No-CoT: The LLM is trained to directly generate the answer without using a reasoning chain. (3) iCoT (Deng et al., 2024): The model is trained with language reasoning chains and follows a carefully designed schedule that "internalizes" CoT. As the training goes on, tokens at the beginning of the reasoning chain are gradually removed until only the answer remains. During inference, the model directly predicts the answer. (4) Pause token (Goyal et al., 2023): The model is trained using only the question and answer, without a reasoning chain. However, different from No-CoT, special <pause> tokens are inserted between the question and answer, which are believed to provide the model with additional computational capacity to derive the answer. For a fair comparison, the number of <pause> tokens is set the same as continuous thoughts in Coconut.
We also evaluate some variants of our method: (1) w/o curriculum: Instead of the multi-stage training, we directly use the data from the last stage, which only includes questions and answers, to train Coconut. The model uses continuous thoughts to solve the whole problem. (2) w/o thought: We keep the multi-stage training which removes language reasoning steps gradually, but do not use any continuous latent thoughts. While this is similar to iCoT in the high-level idea, the exact training schedule is set to be consistent with Coconut, rather than iCoT, which ensures a stricter comparison. (3) Pause as thought: We use special <pause> tokens to replace the continuous thoughts, and apply the same multi-stage training curriculum as Coconut.
### 4.4 Results and Discussion
| Method | GSM8k Acc. (%) | GSM8k # Tokens | ProntoQA Acc. (%) | ProntoQA # Tokens | ProsQA Acc. (%) | ProsQA # Tokens |
| --- | --- | --- | --- | --- | --- | --- |
| CoT | 42.9 $\pm$ 0.2 | 25.0 | 98.8 $\pm$ 0.8 | 92.5 | 77.5 $\pm$ 1.9 | 49.4 |
| No-CoT | 16.5 $\pm$ 0.5 | 2.2 | 93.8 $\pm$ 0.7 | 3.0 | 76.7 $\pm$ 1.0 | 8.2 |
| iCoT | 30.0† | 2.2 | 99.8 $\pm$ 0.3 | 3.0 | 98.2 $\pm$ 0.3 | 8.2 |
| Pause Token | 16.4 $\pm$ 1.8 | 2.2 | 77.7 $\pm$ 21.0 | 3.0 | 75.9 $\pm$ 0.7 | 8.2 |
| Coconut (Ours) | 34.1 $\pm$ 1.5 | 8.2 | 99.8 $\pm$ 0.2 | 9.0 | 97.0 $\pm$ 0.3 | 14.2 |
| - w/o curriculum | 14.4 $\pm$ 0.8 | 8.2 | 52.4 $\pm$ 0.4 | 9.0 | 76.1 $\pm$ 0.2 | 14.2 |
| - w/o thought | 21.6 $\pm$ 0.5 | 2.3 | 99.9 $\pm$ 0.1 | 3.0 | 95.5 $\pm$ 1.1 | 8.2 |
| - pause as thought | 24.1 $\pm$ 0.7 | 2.2 | 100.0 $\pm$ 0.1 | 3.0 | 96.6 $\pm$ 0.8 | 8.2 |
Table 1: Results on three datasets: GSM8k, ProntoQA, and ProsQA. Higher accuracy indicates stronger reasoning ability, while generating fewer tokens indicates better efficiency. †Result taken from Deng et al. (2024).
<details>
<summary>x1.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thoughts per Step
### Overview
The image displays a line chart plotting the relationship between the number of thoughts per step (x-axis) and the resulting accuracy percentage (y-axis). The chart features a single data series represented by a solid blue line, accompanied by a semi-transparent blue shaded region indicating a confidence interval or range of uncertainty around the central trend.
### Components/Axes
* **Chart Type:** Line chart with a confidence band.
* **X-Axis:**
* **Label:** "# Thoughts per step"
* **Scale:** Linear, with major tick marks and labels at the integer values 0, 1, and 2.
* **Y-Axis:**
* **Label:** "Accuracy (%)"
* **Scale:** Linear, with major tick marks and labels at 26, 28, 30, 32, 34, and 36.
* **Data Series:**
* A single, solid blue line representing the central trend (e.g., mean or median accuracy).
* A semi-transparent blue shaded area surrounding the line, representing the confidence interval or variability.
* **Legend:** No explicit legend is present in the image. The single data series and its confidence band are self-explanatory from the axis labels.
* **Spatial Layout:** The chart area is bounded by a light gray grid. The axes labels are positioned conventionally: the x-axis label is centered below the axis, and the y-axis label is rotated 90 degrees and positioned to the left of the axis.
### Detailed Analysis
* **Trend Verification:** The solid blue line exhibits a clear, positive, upward slope from left to right. This indicates that as the number of thoughts per step increases, the measured accuracy also increases.
* **Data Point Extraction (Approximate):**
* At **# Thoughts per step = 0**: The line starts at an accuracy of approximately **25.5%**. The confidence interval at this point is relatively narrow, spanning roughly from 25.0% to 26.0%.
* At **# Thoughts per step = 1**: The line passes through an accuracy of approximately **31.2%**. The confidence interval has widened, spanning roughly from 29.5% to 32.8%.
* At **# Thoughts per step = 2**: The line ends at an accuracy of approximately **34.2%**. The confidence interval is at its widest here, spanning roughly from 32.8% to 35.6%.
* **Confidence Interval Behavior:** The shaded blue region (confidence interval) is narrowest at the starting point (0 thoughts) and progressively widens as the number of thoughts per step increases. This indicates that the estimate of accuracy becomes less precise (has greater uncertainty) for higher numbers of thoughts per step.
### Key Observations
1. **Positive Correlation:** There is a strong, positive relationship between the number of thoughts per step and accuracy.
2. **Diminishing Returns:** The slope of the line appears slightly steeper between 0 and 1 thoughts per step than between 1 and 2, suggesting the marginal gain in accuracy may decrease with additional thoughts.
3. **Increasing Uncertainty:** The widening confidence interval is a critical feature. While average accuracy improves with more thoughts, the range of possible outcomes (or the uncertainty in the estimate) also expands significantly.
### Interpretation
The data suggests that allocating more "thoughts" (likely referring to computational steps, reasoning iterations, or internal processing cycles) per decision step in a system leads to a measurable improvement in its accuracy. This is a common pattern in complex problem-solving or AI systems, where additional computation allows for more exploration, verification, or refinement of a solution.
However, the chart reveals a crucial trade-off: **increased performance comes with increased variability or uncertainty.** The widening confidence band implies that while the system's *average* performance improves, its *consistency* may decrease. At 2 thoughts per step, the system has a higher average accuracy than at 0 or 1, but the range of possible accuracy scores is much broader. This could mean that in some runs, the system performs exceptionally well (near 35.6%), while in others, it performs only marginally better than the average at 1 thought per step (near 32.8%).
This pattern is vital for system design. If reliability and predictability are paramount, the optimal operating point might be at 1 thought per step, where accuracy is good and uncertainty is moderate. If maximizing peak performance is the goal and some variability is acceptable, then 2 thoughts per step is preferable. The chart does not show data beyond 2 thoughts, so it is unknown if accuracy plateaus or if uncertainty continues to grow.
</details>
Figure 3: Accuracy on GSM8k with different numbers of continuous thoughts.
We show the overall results on all datasets in Table 1. Continuous thoughts effectively enhance LLM reasoning, as shown by the consistent improvement over no-CoT. It even shows better performance than CoT on ProntoQA and ProsQA. We describe several key conclusions from the experiments as follows.
"Chaining" continuous thoughts enhances reasoning. In conventional CoT, the output token serves as the next input, which has been shown to increase the effective depth of LLMs and enhance their expressiveness (Feng et al., 2023). We explore whether latent space reasoning retains this property, as it would suggest that this method could scale to solve increasingly complex problems by chaining multiple latent thoughts.
In our experiments with GSM8k, we found that Coconut outperformed other architectures trained with similar strategies, particularly surpassing the latest baseline, iCoT (Deng et al., 2024). Its performance is also significantly better than that of Coconut (pause as thought), which likewise allows more computation in the LLM. While Pfau et al. (2024) empirically show that filler tokens, such as the special <pause> tokens, can benefit highly parallelizable problems, our results show that the Coconut architecture is more effective for general problems, e.g., math word problems, where a reasoning step often depends heavily on previous steps. Additionally, we experimented with adjusting the hyperparameter $c$, which controls the number of latent thoughts corresponding to one language reasoning step (Figure 3). As we increased $c$ from 0 to 1 to 2, the model's performance steadily improved. We discuss the case of larger $c$ in Appendix 9. These results suggest that a chaining effect similar to CoT can be observed in the latent space.
In the two other synthetic tasks, we found that the variants of Coconut (w/o thought or pause as thought) and the iCoT baseline also achieve impressive accuracy. This indicates that the model's computational capacity may not be the bottleneck in these tasks. In contrast, GSM8k, being an open-domain question-answering task, likely involves more complex contextual understanding and modeling, placing higher demands on computational capability.
Latent reasoning outperforms language reasoning in planning-intensive tasks. Complex reasoning often requires the model to "look ahead" and evaluate the appropriateness of each step. Among our datasets, GSM8k and ProntoQA are relatively straightforward for next-step prediction, due to intuitive problem structures and limited branching. In contrast, ProsQA's randomly generated DAG structure significantly challenges the model's planning capabilities. As shown in Table 1, CoT does not offer a notable improvement over No-CoT. However, Coconut, its variants, and iCoT substantially enhance reasoning on ProsQA, indicating that latent space reasoning provides a clear advantage in tasks demanding extensive planning. An in-depth analysis of this process is provided in Section 5.
The LLM still needs guidance to learn latent reasoning. In the ideal case, the model should learn the most effective continuous thoughts automatically through gradient descent on questions and answers (i.e., Coconut w/o curriculum). However, from the experimental results, we found the models trained this way do not perform any better than no-CoT.
<details>
<summary>figures/figure_4_meta.png Details</summary>

### Visual Description
## Diagram: Large Language Model Processing of a Math Word Problem
### Overview
The image is a technical diagram illustrating the process of a Large Language Model (LLM) generating a response to a specific math word problem. It visualizes the flow from input text, through the model's internal processing, to the final output token selection via a Language Model (LM) head, which assigns probabilities to candidate answer tokens.
### Components/Axes
The diagram is structured in three main horizontal layers, with a clear flow from bottom to top.
1. **Input Layer (Bottom):**
* **Text:** A word problem is presented as the input prompt: "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
* **Token Sequence:** Above the text, a sequence of tokens is shown, represented by rounded rectangles. The sequence begins with an ellipsis (`...`), followed by a yellow token, a `<bot>` (beginning of thought) token, three gradient-colored (yellow to purple) tokens, and an `<eot>` (end of thought) token. A curved arrow points from the start of the text to the beginning of this token sequence, indicating the text is tokenized for input.
2. **Model Core (Center):**
* A large, grey, rounded rectangle labeled **"Large Language Model"** spans the width of the diagram. This represents the main computational body of the LLM.
* **Internal Flow:** Faint, curved arrows connect the input token sequence (below) to the output token sequence (above), passing through the "Large Language Model" block. This illustrates the transformation of input representations into output representations.
3. **Output & Prediction Layer (Top):**
* **Output Token Sequence:** A sequence of tokens mirrors the input structure above the model block. It starts with a `<bot>` token, followed by three gradient-colored tokens, an `<eot>` token, the number `540`, and an ellipsis (`...`).
* **LM Head:** An orange, rounded rectangle labeled **"LM head"** is positioned above the second output token (the first gradient token after `<bot>`). A dashed arrow points from this token up to the LM head, indicating it is the current token being predicted.
* **Prediction Bar Chart:** To the right of the LM head, a small horizontal bar chart displays the model's top predictions for the next token. The chart has three entries:
* `"180"` with a probability of `0.22` (longest bar).
* `" 180"` (note the leading space) with a probability of `0.20`.
* `"9"` with a probability of `0.13`.
* A dashed, curved arrow connects the LM head to this prediction chart.
### Detailed Analysis
* **Tokenization:** The input problem is broken into discrete tokens. The diagram uses symbolic tokens (`<bot>`, `<eot>`) and colored shapes to represent this process. The three gradient tokens between `<bot>` and `<eot>` in the input likely correspond to the core of the problem statement.
* **Model Processing:** The "Large Language Model" block processes the input token sequence. The internal arrows suggest a sequential or transformer-based processing flow where information from earlier tokens influences later ones.
* **Output Generation:** The model generates an output sequence. The diagram shows the model has already produced the tokens `<bot>`, three intermediate tokens, `<eot>`, and the number `540`. The ellipsis (`...`) indicates the sequence may continue.
* **Next-Token Prediction:** The focus is on predicting the token that should follow the first gradient token after the initial `<bot>` in the output sequence. The LM head evaluates the model's internal state at that point.
* **Candidate Answers & Probabilities:** The LM head's output is a probability distribution over potential next tokens. The top three candidates are all numerical answers to the word problem:
1. `"180"` (Probability: ~0.22)
2. `" 180"` (Probability: ~0.20) - This is a distinct token, likely representing the number with a leading space.
3. `"9"` (Probability: ~0.13)
* **Spatial Grounding:** The LM head and its prediction chart are located in the **top-center** of the diagram, directly above the output token sequence it is analyzing. The prediction chart is to the **right** of the LM head label.
### Key Observations
1. **Multiple Correct Representations:** The model considers two visually similar but token-distinct representations of the number 180 (`"180"` and `" 180"`) as the most likely answers, with a combined probability of approximately 0.42.
2. **Presence of an Incorrect Candidate:** The number `"9"` is also a top candidate, though with lower probability. This may represent a partial calculation (e.g., 3 sprints * 3 times = 9) or a common error mode.
3. **Answer after `<eot>`:** The output sequence (`<bot> ... <eot> 540 ...`) shows the number `540` appearing after the `<eot>` token. This is by design: `<eot>` marks the end of the latent thought segment, after which the model returns to language mode and emits the final answer.
4. **Visual Coding:** The diagram uses color and shape consistently: yellow for general tokens, gradient for problem-specific tokens, orange for the prediction component (LM head and its bars), and purple for special control tokens (`<bot>`, `<eot>`).
### Interpretation
This diagram offers a revealing look into the "black box" of an LLM solving a reasoning task. It demonstrates that the model does not simply compute the answer (540) in one step. Instead, it operates in a **token-by-token, probabilistic** fashion.
* **What the data suggests:** The model's internal reasoning leads it to consider the intermediate answer "180" (which is 3 sprints * 60 meters) as a highly probable next step, even before arriving at the final correct answer of 540 (180 meters per session * 3 sessions per week). This suggests the model may be performing a **stepwise calculation** mirroring human problem-solving.
* **How elements relate:** The flow from input text to tokenized sequence, through the model core, to the LM head's probability distribution, visually maps the **inference pipeline**. The LM head acts as a translator from the model's high-dimensional internal state to a human-interpretable choice over discrete tokens.
* **Notable patterns/anomalies:** The high probability for `" 180"` (with a space) highlights how **tokenization artifacts** can influence model behavior. The appearance of `540` after `<eot>` reflects the intended mode switch: once the latent segment closes, the model returns to language output and emits the answer. The diagram ultimately reveals that LLM "reasoning" is a **stochastic process of selecting the most likely next piece of information**, not a deterministic execution of a mathematical algorithm.
</details>
Figure 4: A case study where we decode the continuous thought into language tokens.
With the multi-stage curriculum, which decomposes training into easier objectives, Coconut is able to achieve top performance across various tasks. The multi-stage training also integrates well with pause tokens (Coconut - pause as thought). Despite using the same architecture and similar multi-stage training objectives, we observed a small gap between the performance of iCoT and Coconut (w/o thought). The finer-grained removal schedule (token by token) and a few other tricks in iCoT may ease the training process; we leave combining iCoT and Coconut as future work. While the multi-stage training used for Coconut has proven effective, further research is needed to develop better and more general strategies for learning to reason in latent space, especially without supervision from language reasoning chains.
Continuous thoughts are efficient representations of reasoning. Though continuous thoughts are not intended to be decoded into language tokens, decoding them anyway provides an intuitive interpretation. We show a case study in Figure 4 of a math word problem solved by Coconut ($c=1$). The first continuous thought can be decoded into tokens such as "180", " 180" (with a leading space), and "9". Note that the reasoning trace for this problem is $3 \times 3 \times 60 = 9 \times 60 = 540$, or equivalently $3 \times 3 \times 60 = 3 \times 180 = 540$. The decoded interpretations of the first thought correspond precisely to the first intermediate variables of these two traces. Moreover, the continuous thought encodes a distribution over different traces. As shown in Section 5.3, this feature enables a more advanced reasoning pattern for planning-intensive reasoning tasks.
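Such probing amounts to applying $\mathrm{softmax}(W h_t)$ from Section 3 to a continuous thought. A minimal sketch, assuming a HuggingFace GPT-2 whose LM head is `model.lm_head`; the helper name is ours.

```python
import torch

def probe_thought(lm_head, thought, tokenizer, k=3):
    """Decode a continuous thought through the LM head for probing only.

    Computes softmax(W h_t); the thought itself is never replaced by these
    tokens during reasoning. `thought` has shape (1, d).
    """
    probs = torch.softmax(lm_head(thought), dim=-1)
    top = probs.topk(k, dim=-1)
    return [(tokenizer.decode([int(i)]), p.item())
            for i, p in zip(top.indices[0], top.values[0])]

# e.g. on the first thought in Figure 4 this would surface
# candidates like "180", " 180", and "9" with their probabilities.
```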
## 5 Understanding the Latent Reasoning in Coconut
In this section, we present an analysis of the latent reasoning process with a variant of Coconut. By leveraging its ability to switch between language and latent space reasoning, we can control the model to interpolate between fully latent reasoning and fully language reasoning and test the performance at each point (Section 5.2). This also enables us to interpret the latent reasoning process as a tree search (Section 5.3). Based on this perspective, we explain why latent reasoning can make decisions easier for LLMs (Section 5.4).
### 5.1 Experimental Setup
Methods. The design of Coconut allows us to control the number of latent thoughts by manually setting the position of the <eot> token during inference. When we enforce Coconut to use $k$ continuous thoughts, the model is expected to output the remaining reasoning chain in language, starting from the $(k+1)$-th step. In our experiments, we test variants of Coconut on ProsQA with $k \in \{0, 1, 2, 3, 4, 5, 6\}$. Note that these variants differ only at inference time and share the same model weights. We also report the performance of CoT and No-CoT as references.
To address the issue of forgetting earlier training stages, we modify the original multi-stage training curriculum by always mixing data from other stages with a certain probability ( $p=0.3$ ). This updated training curriculum yields similar performance and enables effective control over the switch between latent and language reasoning.
Metrics. We apply two sets of evaluation metrics. The first is based on the correctness of the final answer, regardless of the reasoning process; it is the metric used in the main experimental results above (Table 1). To enable fine-grained analysis, we define a second metric on the reasoning process. Assuming we have a complete language reasoning chain, which specifies a path in the graph, we can classify it as: (1) Correct Path: the output is one of the shortest paths to the correct answer. (2) Longer Path: a valid path that correctly answers the question but is longer than the shortest path. (3) Hallucination: the path includes nonexistent edges or is disconnected. (4) Wrong Target: a valid path in the graph, but the destination node is not the one being asked about. These four categories naturally apply to the output from Coconut ($k=0$) and CoT, which generate the full path. For Coconut with $k>0$, which outputs only a partial path in language (with the initial steps in continuous reasoning), we classify the reasoning as a Correct Path if some valid completion of it exists, and we define Longer Path and Wrong Target for partial paths analogously. If no valid completion exists, it is classified as Hallucination. In No-CoT and Coconut with larger $k$, the model may only output the final answer without any partial path, which falls into (5) Correct Label or (6) Incorrect Label. These six categories cover all cases without overlap.
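These path categories can be computed mechanically on the underlying graph. A sketch for fully verbalized chains, representing the ProsQA DAG as a set of directed edges; the helper names are ours, not the paper's evaluation code.

```python
from collections import deque

def shortest_len(edges, src, dst):
    """BFS shortest path length over a directed graph; None if unreachable."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist.get(dst)

def classify_path(edges, path, source, correct_target):
    """Classify a fully verbalized chain into the four path categories above."""
    ok = path and path[0] == source and \
         all((u, v) in edges for u, v in zip(path, path[1:]))
    if not ok:
        return "Hallucination"       # nonexistent edge or disconnected
    if path[-1] != correct_target:
        return "Wrong Target"        # valid path, wrong destination
    if len(path) - 1 == shortest_len(edges, source, correct_target):
        return "Correct Path"
    return "Longer Path"
```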
<details>
<summary>figures/figure_5_revised_1111.png Details</summary>

### Visual Description
## Horizontal Bar Charts: Method Performance Analysis
### Overview
The image displays two horizontal bar charts comparing the performance of different reasoning methods. The left chart, titled "Final answer," shows the accuracy percentage for each method. The right chart, titled "Reasoning Process," provides a stacked bar breakdown of the reasoning outcomes for the same methods, categorized by type of error or success. A legend on the right defines the color-coded categories for the stacked bars.
### Components/Axes
**Left Chart ("Final answer"):**
* **Y-axis (Vertical):** Labeled "Method". Categories listed from top to bottom: "Ours (k=6)", "Ours (k=5)", "Ours (k=4)", "Ours (k=3)", "Ours (k=2)", "Ours (k=1)", "Ours (k=0)", "no-CoT", "CoT".
* **X-axis (Horizontal):** Labeled "Accuracy (%)". Scale ranges from 70 to 100, with major tick marks at 70, 80, 90, 100.
**Right Chart ("Reasoning Process"):**
* **Y-axis (Vertical):** Labeled "Method". Identical categories as the left chart, in the same order.
* **X-axis (Horizontal):** Labeled "Count". Scale ranges from 0 to 500, with major tick marks at 0, 100, 200, 300, 400, 500.
* **Legend (Top-Right):** Titled "Category". Contains six color-coded entries:
* Dark Green: "Correct Label"
* Light Green: "Correct Path"
* Red: "Incorrect Label"
* Orange: "Longer path"
* Purple: "Wrong Target"
* Brown: "Hallucination"
### Detailed Analysis
**Left Chart - Final Answer Accuracy:**
* **Trend:** Accuracy generally increases as the parameter 'k' increases from 0 to 6 for the "Ours" method. The "no-CoT" method has the lowest accuracy, while "CoT" performs better than "no-CoT" but worse than most "Ours" variants.
* **Approximate Data Points (Accuracy %):**
* Ours (k=6): ~96%
* Ours (k=5): ~96%
* Ours (k=4): ~96%
* Ours (k=3): ~96%
* Ours (k=2): ~91%
* Ours (k=1): ~86%
* Ours (k=0): ~87%
* no-CoT: ~74%
* CoT: ~80%
**Right Chart - Reasoning Process Breakdown (Stacked Bars, Total Count ~500 per method):**
* **Ours (k=6):** Dominated by a large "Correct Label" (dark green) segment (~480 count), with a very small "Incorrect Label" (red) segment (~20 count).
* **Ours (k=5):** Similar to k=6: large "Correct Label" (~480), small "Incorrect Label" (~20).
* **Ours (k=4):** Similar to k=6 and k=5: large "Correct Label" (~480), small "Incorrect Label" (~20).
* **Ours (k=3):** Mix of categories. "Correct Label" (dark green, ~360), "Correct Path" (light green, ~80), "Incorrect Label" (red, ~20), "Hallucination" (brown, ~40).
* **Ours (k=2):** "Correct Path" (light green, ~340), "Longer path" (orange, ~60), "Wrong Target" (purple, ~40), "Hallucination" (brown, ~60).
* **Ours (k=1):** "Correct Path" (light green, ~310), "Longer path" (orange, ~60), "Wrong Target" (purple, ~60), "Hallucination" (brown, ~70).
* **Ours (k=0):** "Correct Path" (light green, ~320), "Longer path" (orange, ~50), "Wrong Target" (purple, ~60), "Hallucination" (brown, ~70).
* **no-CoT:** "Correct Label" (dark green, ~370), "Incorrect Label" (red, ~130).
* **CoT:** "Correct Path" (light green, ~250), "Longer path" (orange, ~80), "Wrong Target" (purple, ~40), "Hallucination" (brown, ~130).
### Key Observations
1. **Performance Plateau:** The "Ours" method's final answer accuracy plateaus at approximately 96% for k=3, 4, 5, and 6.
2. **Reasoning Quality Shift:** There is a clear shift in reasoning quality as 'k' increases. For k=0,1,2, the reasoning is primarily "Correct Path" (light green) but with significant error categories (Longer path, Wrong Target, Hallucination). For k=4,5,6, the reasoning is overwhelmingly "Correct Label" (dark green) with minimal errors.
3. **"no-CoT" Anomaly:** The "no-CoT" method has very low final answer accuracy (~74%), yet its reasoning process shows a high proportion of "Correct Label" (~74% of its bar). This suggests it often identifies the correct label but fails to produce a correct final answer, possibly due to a lack of structured reasoning.
4. **"CoT" vs. "Ours (k=0)":** The standard Chain-of-Thought ("CoT") method has lower accuracy (~80%) than "Ours (k=0)" (~87%). Its reasoning breakdown shows a larger "Hallucination" segment compared to "Ours (k=0)".
### Interpretation
The data demonstrates the effectiveness of the proposed method ("Ours") with increasing values of the parameter 'k'. The charts tell a two-part story:
1. **Outcome vs. Process:** The left chart shows the final outcome (accuracy), while the right chart reveals the quality of the underlying reasoning process. High final accuracy (k=4,5,6) is strongly correlated with a reasoning process that predominantly yields the "Correct Label" directly, minimizing errors like hallucinations or wrong targets.
2. **The Role of 'k':** The parameter 'k' appears to control the depth or quality of the reasoning process. At low 'k' (0,1,2), the model finds a correct reasoning path ("Correct Path") but is prone to execution errors (Longer path, Wrong Target, Hallucination), leading to lower final accuracy. At high 'k' (4,5,6), the model consistently identifies the correct label through its reasoning, resulting in high and stable final accuracy. The plateau suggests diminishing returns beyond k=4.
3. **Diagnostic Value of the Breakdown:** The "Reasoning Process" chart is crucial for diagnosis. For instance, the "no-CoT" method's high "Correct Label" count but low accuracy indicates a disconnect between its internal labeling and its final answer generation. The "CoT" method's significant "Hallucination" segment points to a specific weakness in its reasoning chain that the "Ours" method (especially with higher k) appears to mitigate.
In essence, the visualization argues that the "Ours" method, when configured with a sufficient 'k' (â„4), not only achieves high accuracy but does so via a robust and reliable reasoning process that avoids common failure modes like hallucinations and incorrect targets.
</details>
Figure 5: The accuracy of the final answer (left) and the reasoning process (right) for multiple variants of Coconut and baselines on ProsQA.
![ProsQA case study comparing CoT, Coconut (k=1), and Coconut (k=2)](figures/figure_6_meta_3.png)
Figure 6: A case study of ProsQA. The model trained with CoT hallucinates an edge (Every yumpus is a rempus) after getting stuck in a dead end. Coconut (k=1) outputs a path that ends with an irrelevant node. Coconut (k=2) solves the problem correctly.
![Latent search trees for Coconut (k=1) and Coconut (k=2)](figures/figure_7_meta_4.png)
Figure 7: An illustration of the latent search trees. The example is the same test case as in Figure 6. The height of a node (denoted as $h$ in the figure) is defined as its longest distance to any leaf node in the graph. We show the probability of the first concept predicted by the model following the latent thoughts (e.g., "lempus" in the left figure), calculated as the product of the probabilities of all tokens within the concept, conditioned on the previous context (omitted in the figure for brevity). This metric can be interpreted as an implicit value function estimated by the model, assessing the potential of each node to lead to the correct answer.
### 5.2 Interpolating between Latent and Language Reasoning
Figure 5 shows a comparative analysis of different reasoning methods on ProsQA. As more reasoning is done with continuous thoughts (increasing $k$), both the final answer accuracy (Figure 5, left) and the rate of correct reasoning processes ("Correct Label" and "Correct Path" in Figure 5, right) improve. Additionally, the rates of "Hallucination" and "Wrong Target" decrease; these errors typically occur when the model makes a wrong move early on. This indicates better planning ability when more reasoning happens in latent space.
A case study is shown in Figure 6, where CoT hallucinates a nonexistent edge, Coconut ($k=1$) reaches a wrong target, but Coconut ($k=2$) solves the problem correctly. In this example, the model cannot accurately determine which edge to choose at the earlier step. However, because latent reasoning avoids committing to a hard choice upfront, the model can progressively eliminate incorrect options in subsequent steps and achieve higher accuracy at the end of reasoning. We show more evidence and details of this reasoning process in Section 5.3.
The comparison between CoT and Coconut ($k=0$) reveals another interesting observation: even when Coconut is forced to generate a complete reasoning chain, its answer accuracy is still higher than CoT's. The generated reasoning paths are also more accurate, with less hallucination. From this, we infer that the training method of mixing different stages improves the model's ability to plan ahead. The training objective of CoT always concentrates on generating the immediate next step, making the model "shortsighted". In later stages of Coconut training, the first few steps are hidden, allowing the model to focus more on future steps. This relates to the findings of Gloeckle et al. (2024), who propose multi-token prediction as a new pretraining objective to improve the LLM's ability to plan ahead.
### 5.3 Interpreting the Latent Search Tree
Given the intuition that continuous thoughts can encode multiple potential next steps, latent reasoning can be interpreted as a search tree, rather than merely a reasoning "chain". Taking the case of Figure 7 as a concrete example, the first step could be selecting one of the children of Alex, i.e., {lempus, sterpus, zhorpus, grimpus}. We depict all possible branches in the left part of Figure 7. Similarly, in the second step, the frontier nodes are the grandchildren of Alex (Figure 7, right).
Unlike a standard breadth-first search (BFS), which explores all frontier nodes uniformly, the model demonstrates the ability to prioritize promising nodes while pruning less relevant ones. To uncover the model's preferences, we analyze its subsequent outputs in language space. For instance, if the model is forced to switch back to language space after a single latent thought ($k=1$), it predicts the next step in a structured format, such as "every [Concept A] is a [Concept B]." By examining the probability distribution over potential fillers for [Concept A], we can derive numeric values for the children of the root node Alex (Figure 7, left). Similarly, when $k=2$, the prediction probabilities for all frontier nodes (the grandchildren of Alex) are obtained (Figure 7, right).
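The value extraction can be sketched as follows, under stated assumptions: `prefix_embeds` holds the embeddings of the question, the $k$ latent thoughts, `<eot>`, and the template prefix up to the [Concept A] slot; the function name and interface are ours, not the paper's code.

```python
import torch
import torch.nn.functional as F

# The implicit value of a candidate concept is the product of its token
# probabilities given the prefix, e.g. p("lempus") = p("le")p("mp")p("us").
@torch.no_grad()
def concept_probability(model, tok, prefix_embeds, concept):
    prob = 1.0
    embeds = prefix_embeds
    for tid in tok(" " + concept, add_special_tokens=False).input_ids:
        logits = model(inputs_embeds=embeds).logits[:, -1, :]
        prob *= F.softmax(logits, dim=-1)[0, tid].item()
        next_embed = model.get_input_embeddings()(torch.tensor([[tid]]))
        embeds = torch.cat([embeds, next_embed], dim=1)
    return prob
```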
![Cumulative values of top-1/2/3 candidate nodes for the first and second thoughts, by percentile](figures/percentile.png)
Figure 8: Analysis of parallelism in latent tree search. The left plot depicts the cumulative value of the top-1, top-2, and top-3 candidate nodes for the first thoughts, calculated across test cases and ranked by percentile. The significant gaps between the lines reflect the model's ability to explore alternative latent thoughts in parallel. The right plot shows the corresponding analysis for the second thoughts, where the gaps between lines are narrower, indicating reduced parallelism and increased certainty in reasoning as the search tree develops. This shift highlights the model's transition toward more focused exploration in later stages.
The probability distribution can be viewed as the model's implicit value function, estimating each node's potential to reach the target. As shown in the figure, "lempus", "zhorpus", "grimpus", and "sterpus" have values of 0.33, 0.16, 0.32, and 0.01, respectively. This indicates that in the first continuous thought, the model has mostly ruled out "sterpus" as an option but remains uncertain about the correct choice among the other three. By the second thought, however, the model has mostly ruled out the other options and focuses on "rorpus".
Figure 8 presents an analysis of the parallelism in the model's latent reasoning across the first and second thoughts. For the first thoughts (left panel), the cumulative values of the top-1, top-2, and top-3 candidate nodes are computed and plotted against their respective percentiles across the test set. The noticeable gaps between the three lines indicate that the model maintains significant diversity in its reasoning paths at this stage, suggesting broad exploration of alternative possibilities. In contrast, the second thoughts (right panel) show a narrowing of these gaps: the model transitions from parallel exploration to more focused reasoning, likely as it gains certainty about the most promising paths.
### 5.4 Why is a Latent Space Better for Planning?
![Value vs. node height for correct and incorrect nodes, at the first and second thoughts](figures/value_stats_meta_2.png)
Figure 9: The correlation between the prediction probabilities of concepts and their heights.
In this section, we explore why latent reasoning is advantageous for planning, drawing on the search tree perspective and the value function defined earlier. Referring to our illustrative example, a key distinction between "sterpus" and the other three options lies in the structure of the search tree: "sterpus" is a leaf node (Figure 7). This makes it immediately identifiable as an incorrect choice, as it cannot lead to the target node "bompus". In contrast, the other nodes have more descendants to explore, making their evaluation more challenging.
To quantify a node's exploratory potential, we measure its height in the tree, defined (as in Figure 7) as the longest distance to any leaf node. Based on this notion, we hypothesize that nodes with lower heights are easier to evaluate accurately, as their exploratory potential is limited. Consistent with this hypothesis, in our example the model exhibits greater uncertainty between "grimpus" and "lempus", both of which have a height of 2, higher than the other candidates.
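For concreteness, the height metric is a standard recursion over the DAG; the sketch below is our illustration (the `children` mapping is a hypothetical interface), matching the definition in Figure 7, where a leaf like "sterpus" has height 0.

```python
# Height of a node = longest distance to any leaf in the DAG.
# `children` maps each node to the list of its children.
def node_height(node, children):
    kids = children.get(node, [])
    if not kids:
        return 0                    # leaves have height 0
    return 1 + max(node_height(k, children) for k in kids)
```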
To test this hypothesis more rigorously, we analyze the correlation between the model's prediction probabilities and node heights during the first and second latent reasoning steps across the test set. Figure 9 reveals a clear pattern: the model successfully assigns lower values to incorrect nodes and higher values to correct nodes when their heights are low. However, as node heights increase, this distinction becomes less pronounced, indicating greater difficulty in accurate evaluation.
In conclusion, these findings highlight the benefits of leveraging latent space for planning. By delaying definite decisions and expanding the latent reasoning process, the model pushes its exploration closer to the search tree's terminal states, making it easier to distinguish correct nodes from incorrect ones.
## 6 Conclusion
In this paper, we presented Coconut, a novel paradigm for reasoning in continuous latent space. Through extensive experiments, we demonstrated that Coconut significantly enhances LLM reasoning capabilities. Notably, our detailed analysis highlighted how an unconstrained latent space allows the model to develop an effective reasoning pattern similar to BFS. Future work is needed to further refine and scale latent reasoning methods. One promising direction is pretraining LLMs with continuous thoughts, which may enable models to generalize more effectively across a wider range of reasoning scenarios. We anticipate that our findings will inspire further research into latent reasoning methods, ultimately contributing to the development of more advanced machine reasoning systems.
## Acknowledgement
The authors express their sincere gratitude to Jihoon Tack for his valuable discussions throughout the course of this work.
## References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Amalric and Dehaene (2019) Marie Amalric and Stanislas Dehaene. A distinct cortical network for mathematical knowledge in the human brain. NeuroImage, 189:19–31, 2019.
- Biran et al. (2024) Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775, 2024.
- Chen et al. (2023) Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, 2023.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- DeepMind (2024) Google DeepMind. Ai achieves silver-medal standard solving international mathematical olympiad problems, 2024. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/.
- Deng et al. (2023) Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023.
- Deng et al. (2024) Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint arXiv:2405.14838, 2024.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Fan et al. (2024) Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. arXiv preprint arXiv:2409.15647, 2024.
- Fedorenko et al. (2011) Evelina Fedorenko, Michael K Behr, and Nancy Kanwisher. Functional specificity for high-level linguistic processing in the human brain. Proceedings of the National Academy of Sciences, 108(39):16428–16433, 2011.
- Fedorenko et al. (2024) Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a tool for communication rather than thought. Nature, 630(8017):575–586, 2024.
- Feng et al. (2023) Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2023.
- Gandhi et al. (2024) Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683, 2024.
- Giannou et al. (2023) Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023.
- Gloeckle et al. (2024) Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024.
- Goyal et al. (2023) Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2023.
- Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
- Hao et al. (2024) Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, et al. Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. arXiv preprint arXiv:2404.05221, 2024.
- Havrilla et al. (2024) Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642, 2024.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.
- LeCun (2022) Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022.
- Lehnert et al. (2024) Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search dynamics bootstrapping. arXiv preprint arXiv:2402.14083, 2024.
- Li et al. (2024) Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 2024.
- Madaan and Yazdanbakhsh (2022) Aman Madaan and Amir Yazdanbakhsh. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022.
- Merrill and Sabharwal (2023) William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023.
- Monti et al. (2007) Martin M Monti, Daniel N Osherson, Michael J Martinez, and Lawrence M Parsons. Functional neuroanatomy of deductive inference: a language-independent distributed network. Neuroimage, 37(3):1005–1016, 2007.
- Monti et al. (2009) Martin M Monti, Lawrence M Parsons, and Daniel N Osherson. The boundaries of language and thought in deductive inference. Proceedings of the National Academy of Sciences, 106(30):12554–12559, 2009.
- Monti et al. (2012) Martin M Monti, Lawrence M Parsons, and Daniel N Osherson. Thought beyond language: neural dissociation of algebra and natural language. Psychological science, 23(8):914–922, 2012.
- Pfau et al. (2024) Jacob Pfau, William Merrill, and Samuel R Bowman. Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Saparov and He (2022) Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022.
- Shalev et al. (2024) Yuval Shalev, Amir Feder, and Ariel Goldstein. Distributional reasoning in llms: Parallel reasoning processes in multi-hop reasoning. arXiv preprint arXiv:2406.13858, 2024.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Su et al. (2024) DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng. Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces. arXiv preprint arXiv:2410.09918, 2024.
- Turpin et al. (2024) Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2024.
- Wang et al. (2022) Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001, 2022.
- Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024.
- Wang et al. (2023) Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, William Yang Wang, and Alessandro Sordoni. Guiding language model reasoning with planning tokens. arXiv preprint arXiv:2310.05707, 2023.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- Xie et al. (2023) Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36, 2023.
- Yang et al. (2024) Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837, 2024.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2023.
- Yu et al. (2024a) Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, and Lianhui Qin. Flow of reasoning: Efficient training of llm policy with divergent thinking. arXiv preprint arXiv:2406.05673, 2024a.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Yu et al. (2024b) Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. Distilling system 2 into system 1. arXiv preprint arXiv:2407.06023, 2024b.
- Yue et al. (2023) Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
- Zelikman et al. (2024) Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024.
- Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
## 7 Datasets
### 7.1 Examples
We provide some examples of the questions and CoT solutions for the datasets used in our experiments.
GSM8k
Question = "John cuts his grass to 2 inches. It grows .5 inches per month. When it gets to 4 inches he cuts it back down to 2 inches. It cost $100 to get his grass cut. How much does he pay per year?" Steps = ["<<4-2=2>>", "<<2/.5=4>>", "<<12/4=3>>", "<<100*3=300>>"] Answer = "300"
ProntoQA
Question = "Brimpuses are not luminous. Shumpuses are amenable. Each yumpus is a lorpus. Gorpuses are shumpuses. Each zumpus is a grimpus. Gorpuses are rompuses. Dumpuses are not floral. Lempuses are cold. Brimpuses are impuses. Every lorpus is floral. Every rompus is transparent. Grimpuses are muffled. Rompuses are yumpuses. Rompuses are wumpuses. Zumpuses are fast. Wumpuses are bitter. Every sterpus is orange. Each lorpus is a vumpus. Yumpuses are feisty. Each yumpus is a lempus. Gorpuses are snowy. Zumpuses are gorpuses. Every lorpus is a sterpus. Stella is a brimpus. Stella is a zumpus. True or false: Stella is not floral." Steps = ["Stella is a zumpus. Zumpuses are gorpuses.", "Stella is a gorpus. Gorpuses are rompuses.", "Stella is a rompus. Rompuses are yumpuses.", "Stella is a yumpus. Each yumpus is a lorpus.", "Stella is a lorpus. Every lorpus is floral.", "Stella is floral."] Answer = "False"
ProsQA
Question = "Every shumpus is a rempus. Every shumpus is a yimpus. Every terpus is a fompus. Every terpus is a gerpus. Every gerpus is a brimpus. Alex is a rempus. Every rorpus is a scrompus. Every rorpus is a yimpus. Every terpus is a brimpus. Every brimpus is a lempus. Tom is a terpus. Every shumpus is a timpus. Every yimpus is a boompus. Davis is a shumpus. Every gerpus is a lorpus. Davis is a fompus. Every shumpus is a boompus. Every shumpus is a rorpus. Every terpus is a lorpus. Every boompus is a timpus. Every fompus is a yerpus. Tom is a dumpus. Every rempus is a rorpus. Is Tom a lempus or scrompus?" Steps = ["Tom is a terpus.", "Every terpus is a brimpus.", "Every brimpus is a lempus."] Answer = "Tom is a lempus."
### 7.2 Construction of ProsQA
To construct the dataset, we first compile a set of typical entity names, such as "Alex" and "Jack," along with fictional concept names like "lorpus" and "rorpus," following the setting of ProntoQA (Saparov and He, 2022). Each problem is structured as a binary question: "Is [Entity] a [Concept A] or [Concept B]?" Assuming [Concept A] is the correct answer, we build a directed acyclic graph (DAG) where each node represents an entity or a concept. The graph is constructed such that a path exists from [Entity] to [Concept A] but not to [Concept B].
Algorithm 1 describes the graph construction process. The DAG is built incrementally by adding nodes and randomly connecting them with edges. To preserve the validity of the binary choice, with some probability we enforce that a new node cannot simultaneously be a descendant of both node $0$ and node $1$. This separation maintains distinct families of nodes and balances their sizes to prevent model shortcuts.
After the graph is constructed, nodes without parents are assigned entity names, while the other nodes receive concept names. To formulate a question of the form "Is [Entity] a [Concept A] or [Concept B]?", we designate node $0$ in the graph as [Entity], a leaf node labeled $1$ as [Concept A], and a leaf node labeled $2$ as [Concept B]. This setup ensures a path from [Entity] to [Concept A] without any connection to [Concept B], yielding a moderately complex reasoning path. Finally, to avoid positional biases, [Concept A] and [Concept B] are randomly permuted in each question.
Algorithm 1 Graph Construction for ProsQA
$edges \leftarrow \{\}$
$nodes \leftarrow \{0, 1\}$
$labels \leftarrow \{0{:}1,\ 1{:}2\}$
$\triangleright$ Labels: 1 (descendant of node 0), 2 (descendant of node 1), 3 (both), 0 (neither).
$groups \leftarrow \{0{:}\{\},\ 1{:}\{0\},\ 2{:}\{1\},\ 3{:}\{\}\}$
$idx \leftarrow 2$
while $idx < N$ do $\triangleright$ For each new node, randomly add edges from existing nodes.
$n\_in\_nodes \leftarrow poisson(1.5)$
$rand \leftarrow random()$
if $rand \le 0.35$ then
$candidates \leftarrow groups[0] \cup groups[1]$ $\triangleright$ Cannot be a descendant of node 1.
else if $rand \le 0.7$ then
$candidates \leftarrow groups[0] \cup groups[2]$ $\triangleright$ Cannot be a descendant of node 0.
else
$candidates \leftarrow nodes$
end if
$n\_in\_nodes \leftarrow \min(len(candidates), n\_in\_nodes)$
$weights \leftarrow [depth\_to\_root(c) \cdot 1.5 + 1 \ \forall\ c \in candidates]$
$\triangleright$ Sampling weights prioritize deeper nodes, so the solution reasoning chain is expected to be longer.
$in\_nodes \leftarrow random\_choice(candidates, n\_in\_nodes, prob = weights / sum(weights))$
$cur\_label \leftarrow 0$
for $in\_idx \in in\_nodes$ do
$cur\_label \leftarrow cur\_label \mid labels[in\_idx]$ $\triangleright$ Update the label with bitwise OR.
$edges.append((in\_idx, idx))$
end for
$groups[cur\_label].append(idx)$
$labels[idx] \leftarrow cur\_label$
$nodes \leftarrow nodes \cup \{idx\}$
$idx \leftarrow idx + 1$
end while
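For readers who prefer executable code, below is a runnable Python transcription of Algorithm 1. It is a sketch under stated assumptions: $N=23$ matches the average node count in Table 2, `depth_to_root` is taken as the longest distance back to a parentless node, and nodes that draw zero parents simply join group 0, consistent with the pseudocode; this is our reading, not the released implementation.

```python
import numpy as np

def build_prosqa_graph(N=23, seed=0):
    rng = np.random.default_rng(seed)
    edges, nodes = [], [0, 1]
    labels = {0: 1, 1: 2}  # 1: descendant of node 0, 2: of node 1, 3: both, 0: neither
    groups = {0: [], 1: [0], 2: [1], 3: []}
    parents = {0: [], 1: []}

    def depth_to_root(c):  # assumption: longest distance back to a root
        ps = parents[c]
        return 0 if not ps else 1 + max(depth_to_root(p) for p in ps)

    idx = 2
    while idx < N:
        n_in = int(rng.poisson(1.5))
        r = rng.random()
        if r <= 0.35:                       # cannot become a descendant of node 1
            candidates = groups[0] + groups[1]
        elif r <= 0.7:                      # cannot become a descendant of node 0
            candidates = groups[0] + groups[2]
        else:
            candidates = list(nodes)
        n_in = min(len(candidates), n_in)
        in_nodes = []
        if n_in > 0:                        # weight deeper nodes more heavily
            weights = np.array([depth_to_root(c) * 1.5 + 1 for c in candidates],
                               dtype=float)
            in_nodes = [int(c) for c in rng.choice(
                candidates, size=n_in, replace=False, p=weights / weights.sum())]
        cur_label = 0
        for j in in_nodes:                  # label = bitwise OR of parent labels
            cur_label |= labels[j]
            edges.append((j, idx))
        parents[idx] = in_nodes
        groups[cur_label].append(idx)
        labels[idx] = cur_label
        nodes.append(idx)
        idx += 1
    return nodes, edges, labels
```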
| # Nodes | # Edges | Len. of Shortest Path | # Shortest Paths |
| --- | --- | --- | --- |
| 23.0 | 36.0 | 3.8 | 1.6 |
Table 2: Average statistics of the graph structures in ProsQA.
### 7.3 Statistics
We show the size of all datasets in Table 3.
| Dataset | Training | Validation | Test |
| --- | --- | --- | --- |
| GSM8k | 385,620 | 500 | 1,319 |
| ProntoQA | 9,000 | 200 | 800 |
| ProsQA | 17,886 | 300 | 500 |
Table 3: Sizes of the training, validation, and test splits of the datasets.
## 8 Clock-Time Reasoning Efficiency Metric
We present a clock-time comparison to evaluate reasoning efficiency. The reported values represent the average inference time per test case (in seconds), with a batch size of 1, measured on an Nvidia A100 GPU. For the no-CoT and CoT baselines, we employ the standard `generate` method from the transformers library (https://github.com/huggingface/transformers). Our results show that clock time is generally proportional to the number of newly generated tokens, as detailed in Table 1.
| Method | GSM8k | ProntoQA | ProsQA |
| --- | --- | --- | --- |
| No-CoT | 0.03 | 0.03 | 0.08 |
| CoT | 0.26 | 0.85 | 0.47 |
| Coconut | 0.09 | 0.11 | 0.15 |
Table 4: Inference time (in seconds) comparison across tasks and methods.
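A minimal sketch of this timing protocol, assuming a CUDA device and any generation callable (the function name is ours): synchronize the GPU around generation so the wall-clock measurement captures the full inference, then average over test cases.

```python
import time
import torch

# Wall-clock seconds for one test case at batch size 1.
def time_one_case(generate_fn, *args, **kwargs):
    torch.cuda.synchronize()
    start = time.perf_counter()
    generate_fn(*args, **kwargs)
    torch.cuda.synchronize()
    return time.perf_counter() - start
```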
## 9 Using More Continuous Thoughts
In Figure 3, we present the performance of Coconut on GSM8k using $c \in \{0, 1, 2\}$. When experimenting with $c = 3$, we observe a slight performance drop accompanied by increased variance. Analysis of the training logs indicates that adding three continuous thoughts at once, particularly during the final stage transition, leads to a sharp spike in training loss, causing instability. Future work will explore finer-grained schedules, such as incrementally adding continuous thoughts one at a time while removing fewer language tokens, as in iCoT (Deng et al., 2024). Additionally, combining language and latent reasoning, e.g., generating the reasoning skeleton in language and completing the reasoning process in latent space, could provide a promising direction for improving performance and stability.