# Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning
Abstract
Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inference occurs within latent spaces. By decoupling reasoning from language, latent reasoning promises richer cognitive representations and more flexible, faster inference. Researchers have explored various directions in this promising field, including training methodologies, structural innovations, and internal reasoning mechanisms. This paper presents a comprehensive overview and analysis of this reasoning paradigm. We begin by proposing a unified taxonomy from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. We then provide in-depth discussions and comparative analyses of representative methods, highlighting their design patterns, strengths, and open challenges. We aim to provide a structured foundation for advancing this emerging direction in LLM reasoning. The relevant papers will be regularly updated at https://github.com/EIT-NLP/Awesome-Latent-CoT.
Xinghao Chen 1,2 *, Anhao Zhao 2 *, Heming Xia 1, Xuan Lu 2, Hanlin Wang 1, Yanjun Chen 1,2, Wei Zhang 2, Jian Wang 1 †, Wenjie Li 1, Xiaoyu Shen 2 †
*Equal contributions. †Corresponding authors.
1 Department of Computing, The Hong Kong Polytechnic University
2 Ningbo Digital Twin Institute, Eastern Institute of Technology, Ningbo, China
xing-hao.chen@connect.polyu.hk, plclmezboss@gmail.com, jian51.wang@polyu.edu.hk, xyshen@eitech.edu.cn
1 Introduction
“Whereof one cannot speak, thereof one must be silent.” – Ludwig Wittgenstein
Large Language Models (LLMs) have demonstrated remarkable capabilities on complex reasoning tasks (Guo et al., 2025; OpenAI, 2025; Qwen, 2025) via Chain-of-Thought (CoT) reasoning (Wei et al., 2022; Chen et al., 2025b), which encourages models to reason step-by-step through natural language. This approach not only improves interpretability but often leads to better task performance (Kojima et al., 2022; Chu et al., 2024).
Despite its utility, explicit CoT reasoning is inherently constrained by its reliance on natural language for representing each step. This linguistic mediation leads to two primary challenges. First, it introduces computational inefficiency (Lin et al., 2025b; Feng et al., 2025; Qu et al., 2025; Sui et al., 2025; Wang et al., 2025a; Liu et al., 2025), as not all tokens in the articulated thought process carry informative content. Second, human thinking often transcends the limits of language: other aspects of cognition, such as abstract insights, intuitive leaps, or highly compositional thoughts, resist complete or precise verbalization (Wittgenstein, 1922; Pinker, 1994). For these tasks, as noted by Hao et al. (2024), forcing the verbalization of every step can be not only difficult but also an unnatural constraint on the reasoning process itself.
Figure 1: Explicit CoT (left) generates reasoning steps with natural language, while latent CoT (right) allows the model to reason internally in latent spaces.
- Token-wise Strategies (§ 3)
  - Discrete Tokens (§ 3.1): Pause Tokens (Goyal et al., 2024); Planning Tokens (Wang et al., 2024b); Thinking Tokens (Herel and Mikolov, 2024); Filler Tokens (Pfau et al., 2024); Disentangled Inference (Jin et al., 2025b); Quiet-STaR (Zelikman et al., 2024); BoLT (Ruan et al., 2025); Reasoning CPT (Ishibashi et al., 2025); Token Assorted (Su et al., 2025); Latent Preference Coding (Gong et al., 2025); PHD-Transformer (Wu et al., 2025)
  - Continuous Tokens (§ 3.2): Coconut (Hao et al., 2024); CCoT (Cheng and Durme, 2024); HCoT (Liu et al., 2024b); SoftCoT (Xu et al., 2025a); LightThinker (Zhang et al., 2025); Cocomix (Tack et al., 2025); CODI (Shen et al., 2025b); SoftCoT++ (Xu et al., 2025c)
- Internal Mechanisms (§ 4)
  - Structural CoT (§ 4.1): CoTFormer (Mohtashami et al., 2025); Huginn (Geiping et al., 2025); RELAY (Yu et al., 2025); ITT (Chen et al., 2025e); Looped Transformers (Saunshi et al., 2025a)
  - Representational CoT (§ 4.2): STaR (Zelikman et al., 2022); ICoT (Deng et al., 2023); Stepwise Internalization (Deng et al., 2024); System 2 Distillation (Yu et al., 2024)
- Analysis (§ 5): Hou et al. (2023); Brinkmann et al. (2024); Yang et al. (2024); Yom Din et al. (2024); Shalev et al. (2024); Wang et al. (2024a); Liu et al. (2024a); Kudo et al. (2025); Yu (2025); Lin et al. (2025a); Zhang and Viteri (2025); Wang et al. (2025b)
- Applications (§ 6): Heima (Shen et al., 2025a); XS-CoT (Xue et al., 2025); DEBATER (Ji et al., 2025); ReaRec (Tang et al., 2025)
Figure 2: Taxonomy of Latent Chain-of-Thought (CoT) reasoning.
These inherent limitations of natural language and explicit reasoning have directly motivated a shift towards Latent Chain-of-Thought reasoning. As illustrated in Figure 1, models reason not through language tokens but in latent spaces, offering a more abstract and efficient medium for a thought-like process. This process can be viewed as “de-linguistified” reasoning, enabling richer thought representations, faster inference through compressed computation, and greater flexibility for non-verbal cognitive patterns (Lindsey et al., 2025).
Yet, latent CoT also raises critical challenges: (1) unsupervisable processes, as their internal reasoning processes occur in latent spaces that are not directly interpretable by humans (Lindsey et al., 2025); (2) evaluation gaps, with no clear metrics to distinguish deep latent reasoning from input-output shortcuts (Ameisen et al., 2025); and (3) alignment risks, where the inability to inspect or constrain latent trajectories complicates ethical control (Xu et al., 2025b; Ruan et al., 2025).
Despite these open questions, the rapid yet fragmented development of latent reasoning research highlights the pressing need for a clear and structured understanding within the research community. In this work, we present the first comprehensive survey of latent Chain-of-Thought reasoning. Our key contributions are threefold: (1) Systematic taxonomy: We introduce a structured taxonomy of latent CoT research, dividing existing work into four distinct categories. Within each, we organize representative studies into a coherent framework that clarifies their methodological assumptions and innovations (as illustrated in Figure 2); (2) In-depth analysis: Building on this taxonomy, we conduct a comprehensive analysis of representative works in each category, comparing training strategies, design paradigms, supervision signals, and efficiency trade-offs; and (3) Challenge identification and research frontiers: We identify critical open problems and outline promising directions for future research.
We aim to consolidate the fragmented landscape of latent reasoning and facilitate future developments in this emerging direction.
2 Overview
This paper presents a comprehensive survey of latent CoT reasoning in LLMs. We begin by examining methodological advances, which fall into two major categories: Token-wise strategies (§ 3), including both discrete tokens (§ 3.1) and continuous tokens (§ 3.2); and Internal mechanisms (§ 4), which divide into structural and representational forms. Beyond design mechanisms, we review a growing body of work on the analysis and interpretability of latent reasoning (§ 5). Finally, we discuss real-world applications (§ 6), challenges and future directions (§ 7).
3 Token-wise Strategies
While explicit CoT has significantly enhanced the reasoning capabilities of LLMs by generating reasoning steps, it often increases computational costs and inference latency. To mitigate these limitations and further extend the expressive capacity of reasoning models, recent work has explored the use of token-wise strategies, which are designed not only to streamline reasoning but also to unlock more abstract and compact cognitive processes. We categorize these external tokens into two primary types: Discrete Tokens, which are symbolic, and often serve as explicit control cues; and Continuous Tokens, which are learned embeddings in latent spaces and facilitate implicit reasoning.
3.1 Discrete Tokens
Discrete tokens, which serve as symbolic representations of intermediate reasoning steps or cognitive operations, have emerged as a promising paradigm for enhancing the reasoning capabilities of LLMs. They significantly contribute to improved task performance and greater efficiency.
Early studies of discrete tokens introduced simple markers such as “[pause]” or ellipses (“…”) to segment reasoning steps, which significantly improved multi-step task performance (Pfau et al., 2024; Herel and Mikolov, 2024). Prior to these efforts, Goyal et al. (2024) proposed adaptive, learnable “pause tokens” that dynamically allocate computational resources. These tokens enable delayed prediction, allowing models to perform additional internal computation before generating outputs, thereby enhancing accuracy on logic-intensive tasks. Beyond these pioneering explorations, researchers have developed more sophisticated tokens to encode complex reasoning structures. For example, Wang et al. (2024b) introduced “planning tokens” derived from heuristics or variational autoencoders (VAEs) to improve coherence and precision in reasoning. To disentangle cognitive processes and enhance interpretability, Jin et al. (2025b) proposed specialized tokens such as “memory” and “reason”, which modularize reasoning by isolating specific cognitive operations.
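The core mechanics of delayed prediction can be sketched in a few lines. This is a minimal, hypothetical illustration (the token strings and helper below are ours, not any paper's implementation): pause tokens are appended to the prompt, the model's outputs at those positions are discarded, and the answer is read only after the final pause, granting the model extra forward passes of internal computation.

```python
# Hypothetical sketch of "pause token" insertion for delayed prediction.
# The token strings and helper function are illustrative, not taken from
# any cited paper's actual tokenizer or training code.

PAUSE = "[pause]"

def insert_pauses(prompt_tokens, n_pauses):
    """Append n pause tokens after the prompt. During decoding, model
    outputs at pause positions are ignored; they only buy the model
    additional internal computation before it must commit to an answer."""
    return prompt_tokens + [PAUSE] * n_pauses

tokens = insert_pauses(["What", "is", "7", "*", "8", "?"], n_pauses=3)
answer_start = len(tokens)  # the answer is read only from this position on
```

In trained variants, the pause token has a learnable embedding, and the number of pauses can be tuned or adapted per input rather than fixed as here.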
To further advance modularized reasoning, Zelikman et al. (2024) introduced Quiet-STaR, a method that uses learnable tokens to mark the boundaries of internal rationales. This approach enables language models to infer unstated reasoning steps, leading to improved generalization on challenging tasks without requiring task-specific fine-tuning. Building on this foundation, Ruan et al. (2025) proposed BoLT, which models the thought process as a trainable latent variable. This innovation allows models to infer and refine sequences of cognitive steps during pretraining, enhancing their ability to tackle complex reasoning tasks. Ishibashi et al. (2025) expanded on BoLT by introducing continual pretraining (CPT) with synthetic data containing hidden thought processes. Their reasoning CPT framework reconstructed the implicit cognitive steps underlying texts, significantly improving reasoning across diverse domains. These advancements were particularly impactful in specialized areas such as STEM and law, demonstrating notable performance gains on challenging tasks and showcasing the transferability of reasoning skills across domains.
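Treating the thought process as a trainable latent variable, as in BoLT-style training, can be summarized with a standard variational lower bound. The notation below is ours, under generic latent-variable modeling assumptions, not the paper's exact formulation: with context $x$, observed text $y$, and latent thought sequence $z$,

```latex
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid x, z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right)
```

Training then alternates between inferring plausible thoughts with the posterior $q_\phi$ and fitting the model on text augmented with those inferred thoughts.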
Pfau et al. (2024) pointed out that the structural organization of tokens is more critical than their semantic content. Surprisingly, replacing meaningful tokens with neutral placeholders yields negligible performance loss, underscoring the importance of token structure. Inspired by this finding, compression-based approaches have emerged to address computational inefficiencies. For example, Su et al. (2025) employed vector-quantized VAEs (VQ-VAEs) to condense reasoning steps into discrete latent tokens, reducing computational costs while maintaining performance. To further enhance token-based frameworks, Gong et al. (2025) extended this compression-based strategy to preference modeling, leveraging a learnable codebook of latent codes to align reasoning outputs with human expectations. The Parallel Hidden Decoding Transformer (PHD-Transformer) series introduced a pivotal innovation by utilizing hidden decoding tokens for efficient length scaling (Wu et al., 2025). This method achieves deeper reasoning and better task performance without increasing the size of the key-value (KV) cache, addressing long-context reasoning and enhancing the utility of discrete tokens.
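The vector-quantization step underlying such compression-based approaches reduces to a nearest-codebook lookup: each continuous reasoning-step embedding is mapped to the index of its closest codebook entry, yielding a discrete latent token. The sketch below shows only this lookup, with illustrative sizes and random data, not any cited paper's architecture or training loss.

```python
import numpy as np

# Minimal sketch of the VQ-VAE quantization step that condenses continuous
# reasoning-step embeddings into discrete latent tokens. Codebook size,
# dimensions, and data are illustrative stand-ins.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))         # 64 discrete codes, dim 16
step_embeddings = rng.normal(size=(5, 16))   # 5 reasoning-step vectors

# Nearest-neighbour lookup: squared distance from every step embedding to
# every code, then argmin over the codebook axis.
dists = ((step_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
latent_tokens = dists.argmin(axis=1)         # shape (5,), ints in [0, 64)
```

In a full system, these indices are trained jointly with the encoder and decoder (with a straight-through gradient estimator) so that the discrete tokens remain informative about the original reasoning steps.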
Overall, discrete tokens have progressed from simple markers to versatile tools for abstract cognitive modeling. They serve as powerful mechanisms that advance LLM reasoning capabilities, improving both efficiency and interpretability.
Figure 3: Illustration of representative continuous-token-based methods. Intrinsic methods generate and consume continuous tokens within a single LLM; auxiliary methods use external modules to generate continuous tokens.
3.2 Continuous Tokens
In contrast to discrete tokens, a growing body of research investigates latent reasoning through continuous representations, where reasoning processes are modeled as trajectories within high-dimensional embedding spaces rather than explicit textual sequences. This shift reflects a significant transition from hard, discrete tokens to soft, continuous tokens, offering more flexible and compact representations of intermediate reasoning states. We categorize existing methods based on whether the latent reasoning is integrated during post-training or pre-training.
Post-training methods offer an efficient way to equip LLMs with latent reasoning capabilities using minimal additional data. Based on whether a single LLM is responsible for producing and consuming the continuous tokens as well as generating the final output, we categorize existing methods into two types: 1) Intrinsic methods keep the whole pipeline inside a single LLM; and 2) Auxiliary methods introduce a separate module that generates continuous tokens, which are then injected into the main model. Both methods aim to address the key question: how can we guide continuous tokens toward the correct reasoning direction? Figure 3 provides a comparative illustration of these approaches.
Among intrinsic methods, COCONUT (Hao et al., 2024) made pioneering efforts to enable internal reasoning by feeding the model's last hidden state back in as its next input embedding, effectively allowing latent iteration without producing explicit rationales. This recurrent reuse of internal states supports breadth-first exploration and improves efficiency. To improve the semantic directionality of these latent trajectories, CODI (Shen et al., 2025b) introduced a self-distillation loss that forces the student model's hidden activations at a designated token position to mimic the teacher model's hidden activations under explicit CoT supervision. LightThinker (Zhang et al., 2025) trained the model to decide when and how to compress reasoning into latent "gist" tokens, using strategically placed masking to reduce KV cache usage. These studies show that intrinsic latent representations can elicit viable reasoning behavior. Adding structural priors or alignment objectives significantly stabilizes learning and improves generalization, demonstrating that internal trajectories benefit from consistent directional guidance.
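The intrinsic recipe can be illustrated with a minimal sketch. Here `forward` is a toy stand-in for a transformer step (a random linear map with a nonlinearity, not any model from the cited papers); the loop simply feeds the last hidden state back as the next input embedding, which is the core of COCONUT-style latent iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size

# Stand-in for a transformer: one fixed random linear map with a nonlinearity.
W = rng.standard_normal((D, D)) / np.sqrt(D)

def forward(h):
    """Toy 'LLM' step: returns the last hidden state for the current input."""
    return np.tanh(h @ W)

def latent_reasoning(prompt_embedding, num_latent_steps):
    """COCONUT-style loop: the last hidden state becomes the next input
    embedding, so reasoning iterates in latent space without emitting text."""
    h = prompt_embedding
    trajectory = [h]
    for _ in range(num_latent_steps):
        h = forward(h)          # hidden state of the 'continuous thought'
        trajectory.append(h)    # fed back as the next input embedding
    return h, trajectory

h0 = rng.standard_normal(D)
final_state, traj = latent_reasoning(h0, num_latent_steps=4)
print(len(traj))  # initial state plus four latent steps
```

The decisive design question, as discussed above, is what objective shapes these latent states; the loop itself provides the mechanism but no directional guidance.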
Among auxiliary methods, HCoT (Liu et al., 2024b) trained a dedicated auxiliary CoT model to generate and compress the full thought process into a compact special token representation, which was then passed to the main model as input for answer generation. Following a similar process, CCoT (Cheng and Durme, 2024) encoded complete reasoning sequences into variable-length latent embeddings using a trained CCoT model $\varphi$ , replacing explicit chains with dense, semantically rich contemplation tokens. The contemplation tokens were supervised to match a subset of hidden states precomputed from concatenated input. A subset was selected via a scorer, and subsequently fed into a trained decoder $\psi$ to generate final answers. To reduce training cost and ensure stability and generalization across different domains, SoftCoT (Xu et al., 2025a) combined a frozen assistant model with a trained projection layer to generate "soft tokens" that plug directly into a frozen LLM. SoftCoT++ (Xu et al., 2025c) extended SoftCoT to the test-time scaling paradigm by enabling diverse explorations in the continuous space. SoftCoT++ perturbs the latent space using multiple specialized initial tokens, and applies contrastive learning to promote diversity among soft thoughts.
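The SoftCoT-style data flow in Figure 3 can be sketched as follows. All dimensions and weights are illustrative (not from the paper); the point is that the assistant and the main LLM stay frozen, and only the linear projection bridging their embedding spaces would be trained:

```python
import numpy as np

rng = np.random.default_rng(1)
D_ASSIST, D_MAIN, N_SOFT = 6, 10, 2   # toy sizes, chosen for illustration

# Frozen assistant "LLM": maps prompt embeddings to N_SOFT continuous tokens.
W_assist = rng.standard_normal((D_ASSIST, D_ASSIST))
def assistant(prompt):
    return np.tanh(prompt @ W_assist)[:N_SOFT]   # (N_SOFT, D_ASSIST)

# The only trainable component: a linear projection into the main LLM's space.
W_proj = rng.standard_normal((D_ASSIST, D_MAIN)) * 0.1

def build_main_input(prompt_main, soft_tokens):
    """Prepend projected soft tokens to the frozen main LLM's input embeddings."""
    projected = soft_tokens @ W_proj             # (N_SOFT, D_MAIN)
    return np.concatenate([projected, prompt_main], axis=0)

prompt_assist = rng.standard_normal((4, D_ASSIST))
prompt_main = rng.standard_normal((5, D_MAIN))
soft = assistant(prompt_assist)
main_input = build_main_input(prompt_main, soft)
print(main_input.shape)  # (7, 10): 2 soft tokens + 5 prompt embeddings
```

Keeping both LLMs frozen is what makes this recipe cheap: gradients from the answer loss only have to update `W_proj`.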
While post-training methods consistently yield improvements in efficiency, reducing token usage and latency, their reasoning performance often matches, rather than exceeds, that of explicit CoT prompting on standard benchmarks. This ceiling suggests that, without deeper objectives that sculpt latent trajectories, continuous-token reasoning may continue to lean on capabilities learned in text space.
Pre-training methods take a step further by embedding latent reasoning directly into the model's cognitive prior during the pre-training phase. Rather than treating reasoning as a generative process, these methods model it as an internalizable, optimizable process within the latent space of representations.
CoCoMix (Tack et al., 2025) introduced this idea by mixing continuous, high-level "concepts" into the model's hidden states during pre-training. These concepts were extracted using a sparse autoencoder trained on the activations of a pretrained model and selected based on their causal influence on the next-token prediction. CoCoMix enhanced LLMs by interleaving predicted concepts alongside token embeddings, creating a latent scaffold that improves both performance and interpretability. Unlike post-training strategies that treat latent reasoning as a side effect, pre-training embeds it as a native cognitive faculty, potentially yielding more generalizable and cognitively aligned models.
4 Internal Mechanisms
Recent research has explored the internal computational mechanisms that underlie reasoning within LLMs. These internal mechanisms focus on how reasoning can emerge implicitly through internal architectures and representations, without relying on explicit token-level traces. We categorize this line of work into two main directions: (1) Structural CoT, which examines how architectural depth, recurrence, and looping computations support latent reasoning; and (2) Representational CoT, which explores how intermediate reasoning processes can be embedded directly into the model's hidden states, without requiring explicit intermediate outputs.
4.1 Structural CoT
Given the impressive reasoning capabilities exhibited by LLMs, recent work has attempted to investigate the scaling laws specific to reasoning tasks. Ye et al. (2025) suggested that scaling laws for reasoning are more nuanced than previously understood, with model depth playing a critical role alongside parameter count. At a fixed parameter budget, deeper-but-narrower models tend to outperform wider counterparts. This challenges the conventional wisdom of scaling laws, yet aligns with intuitive reasoning: the success of test-time scaling closely resembles shared-weight strategies (Lan et al., 2020; Dehghani et al., 2019), where reusing the same layers across multiple tokens effectively constructs deeper computational graphs. Further empirical evidence reinforces the importance of depth in reasoning. For example, Chen and Zou (2024) found that a minimal depth is a necessary condition for the emergence of CoT reasoning. While increasing depth presents a promising approach to enhancing reasoning, by enabling iterative refinement of latent representations, the continual addition of layers imposes substantial computational and memory overheads, thereby limiting scalability in practice.
<details>
<summary>2505.16782v1/x3.png Details</summary>

### Visual Description
This image is a technical diagram illustrating the architecture and flow of a system, likely a neural network model, with an emphasis on recurrent processing. It depicts a sequence of layers and modules, along with explanatory notes for each stage. The diagram uses color-coding to associate components with their descriptions.
The diagram can be segmented into three main conceptual regions:
1. **Main Processing Flow (Left Column):** A vertical stack of computational layers.
2. **Input/Output (Right Side, Horizontal):** The entry and exit points of the system.
3. **Explanatory Notes (Right Column, Aligned with Flow):** Descriptions for the main components.
---
### **1. Main Processing Flow (Left Column)**
The core of the system is represented by a vertical stack of three distinct types of layers/modules, connected by upward-pointing arrows indicating data flow.
* **Bottom Component:**
* **Label:** "Embedding Layer"
* **Shape:** Rounded rectangle with a green outline and light green diagonal hatching.
* **Input:** Receives input from "Prompt" via a black arrow with a green outline.
* **Middle Components:**
* **Label:** "Recurrent Module (Loop Transformer / RNNs)"
* **Shape:** Rounded rectangle with a blue outline and light blue diagonal hatching.
* **Connection:** An upward-pointing blue arrow connects the "Embedding Layer" to the first "Recurrent Module".
* **Indication of Repetition:** An ellipsis (`...`) is placed between two instances of the "Recurrent Module", signifying that there can be multiple such modules stacked.
* **Connection:** An upward-pointing blue arrow connects the last "Recurrent Module" to the "Decoding Layer".
* **Top Component:**
* **Label:** "Decoding Layer"
* **Shape:** Rounded rectangle with a purple outline and light purple diagonal hatching.
* **Output:** Sends output to "Answer" via a black arrow with a purple outline.
---
### **2. Input and Output (Right Side)**
* **Input:**
* **Label:** "Prompt"
* **Shape:** Rounded rectangle with a grey fill and black outline.
* **Connection:** A black arrow, outlined in green, points from "Prompt" to the "Embedding Layer".
* **Output:**
* **Label:** "Answer"
* **Shape:** Rounded rectangle with a grey fill and black outline.
* **Connection:** A black arrow, outlined in purple, points from the "Decoding Layer" to "Answer".
---
### **3. Explanatory Notes (Right Column)**
These notes provide context and function descriptions for the corresponding components, indicated by matching colors and small connecting circles.
* **Bottom Explanation (Associated with Embedding Layer and Prompt):**
* **Shape:** An oval-shaped speech bubble with a green outline.
* **Text:** "Encoding the input into a latent space."
* **Spatial Grounding:** Located to the right of the "Embedding Layer" and below the "Recurrent Module" explanations. A small green circle and line connect it to the "Embedding Layer" and the arrow from "Prompt".
* **Middle Explanation (Associated with Recurrent Module):**
* **Shape:** A dashed blue circle containing text, with additional text to its right.
* **Text (inside dashed circle):** "Iterate K times"
* **Text (to the right of dashed circle):** "Shared-parameter module iterates hidden state in latent space."
* **Spatial Grounding:** Located to the right of the "Recurrent Module" stack. The blue color of the dashed circle and text matches the "Recurrent Module" components. This indicates that the recurrent modules involve iteration and shared parameters to evolve a hidden state in a latent space.
* **Top Explanation (Associated with Decoding Layer and Answer):**
* **Shape:** An oval-shaped speech bubble with a purple outline.
* **Text:** "Decoding representations from the latent space."
* **Spatial Grounding:** Located to the right of the "Decoding Layer" and above the "Recurrent Module" explanations. A small purple circle and line connect it to the "Decoding Layer" and the arrow to "Answer".
---
**Summary of Flow and Functionality:**
The diagram illustrates a process where a "Prompt" is first processed by an "Embedding Layer" (encoding the input into a latent space). This embedded representation then passes through one or more "Recurrent Module" instances (which are described as "Loop Transformer / RNNs"). These recurrent modules iterate "K times" using shared parameters to evolve a hidden state within the latent space. Finally, the output from the recurrent modules is fed into a "Decoding Layer" (which decodes representations from the latent space) to produce the final "Answer".
</details>
Figure 4: Illustration of structural CoT mechanisms, where latent reasoning emerges through iterative refinement of the hidden state via a recurrent module. Existing work commonly interprets each recurrence as a discrete reasoning step in the CoT.
Inspired by evidence from recurrent architectures in the "deep thinking" literature (Schwarzschild et al., 2021; McLeish and Tran-Thanh, 2023), which demonstrated inherent advantages in learning complex, iterative algorithms, recent research has shifted toward exploring recurrent methodologies for efficient latent reasoning, as illustrated in Figure 4. As an early exploration in this direction, Mohtashami et al. (2025) introduced CoTFormer, which emulates CoT reasoning by interleaving and looping representations. This approach maintains computational efficiency while mimicking the step-wise nature of human reasoning. To enable arbitrary computational depth at test time, Geiping et al. (2025) proposed Huginn, a novel recurrent framework that dynamically allocates resources through RNN-like iterative computations. Huginn achieves performance comparable to larger, static-depth models but with improved efficiency. Building upon the length generalization capability of looped architectures, Yu et al. (2025) proposed RELAY, which explicitly aligns CoT reasoning steps with loop iterations in a Looped Transformer. Intermediate supervision is applied during training to guide reasoning across steps, and the resulting reasoning chains are used to fine-tune an autoregressive model, enhancing performance on tasks that exceed training sequence lengths. To further improve reasoning on critical tokens, Chen et al. (2025e) introduced the Inner Thinking Transformer (ITT), where each Transformer layer is treated as a discrete reasoning step. By incorporating adaptive token routing and residual refinement, ITT dynamically allocates computation across tokens, achieving strong reasoning capabilities with fewer parameters and less training data. Finally, Saunshi et al. (2025b) empirically showed that deepening via recurrence, rather than increasing parameter count, can significantly enhance reasoning ability, reinforcing the trend toward recurrent strategies for latent reasoning.
These studies validate the potential of increased depth, achieved either through stacking or shared-weight mechanisms, to effectively support latent-space reasoning. This line of thinking drives research toward more computationally efficient ways that harness depth for reasoning-intensive tasks.
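The shared-weight recurrence depicted in Figure 4 can be sketched in a few lines. The three stages (embedding, recurrent module, decoding) are toy random linear maps standing in for real networks; the key property is that raising the iteration count `k` deepens the computational graph without adding any parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # toy hidden size

# Toy stand-ins for the three stages in Figure 4 (weights are illustrative).
W_embed = rng.standard_normal((D, D)) / np.sqrt(D)
W_loop = rng.standard_normal((D, D)) / np.sqrt(D)    # shared across iterations
W_decode = rng.standard_normal((D, D)) / np.sqrt(D)

def looped_forward(x, k):
    """Embed, iterate one shared-parameter module k times, then decode.
    Each recurrence is commonly read as one latent reasoning step."""
    h = np.tanh(x @ W_embed)          # encode the input into the latent space
    for _ in range(k):
        h = np.tanh(h @ W_loop) + h   # recurrent refinement (residual update)
    return h @ W_decode               # decode from the latent space

x = rng.standard_normal(D)
shallow = looped_forward(x, k=2)
deep = looped_forward(x, k=8)  # same parameter count, deeper computation
print(shallow.shape, deep.shape)
```

This is also why such models can allocate more compute at test time: `k` is a runtime knob rather than an architectural constant.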
4.2 Representational CoT
In addition to the exploration of depth-driven reasoning, another promising avenue involves internalizing explicit CoT directly into the latent representations of LLMs. Early implementations of representational internalized CoT utilized rationale-augmented fine-tuning strategies, explicitly teaching models to predict intermediate reasoning outcomes without generating textual outputs (Zelikman et al., 2022). Subsequent advancements further refined this approach through sophisticated knowledge distillation methods, training student models to emulate hidden-state reasoning trajectories exhibited by teacher models performing explicit CoT (Deng et al., 2023). Additionally, phased fine-tuning paradigms (Deng et al., 2024) and self-distillation frameworks (Yu et al., 2024) enable LLMs to implicitly internalize complex reasoning pathways within their latent representations without explicitly articulating intermediate reasoning steps. Overall, this line of work shows that it is effective to condense reasoning processes into compact and computationally efficient latent structures.
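The core objective behind hidden-state knowledge distillation can be written down directly. In this hypothetical sketch the teacher's hidden states come from a run with explicit CoT and the student's from a run without it; aligning the two trajectories (here with a simple mean-squared error) is what pushes the reasoning into the student's latent representations:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 5, 8  # toy sequence length and hidden size

# Hypothetical hidden states: teacher runs explicit CoT, student does not.
teacher_hidden = rng.standard_normal((T, D))
student_hidden = teacher_hidden + 0.1 * rng.standard_normal((T, D))

def distillation_loss(student, teacher):
    """Mean-squared alignment of student hidden states to the teacher's
    reasoning trajectory; minimizing it internalizes the teacher's steps."""
    return float(np.mean((student - teacher) ** 2))

loss = distillation_loss(student_hidden, teacher_hidden)
print(round(loss, 4))
```

Concrete systems differ in which positions are aligned and what distance is used, but the pattern of supervising hidden states rather than output tokens is common to this line of work.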
In summary, structural and representational approaches offer two complementary pathways for internalizing reasoning within LLMs. Structural methods leverage architectural depth (such as via stacking, recurrence, or weight sharing) to support iterative computation, effectively simulating multi-step reasoning in a layer-wise manner. In contrast, representational methods encode reasoning processes directly within hidden states, enabling models to perform inference without emitting explicit intermediate steps. Together, these approaches underscore the dual importance of computational structure and internal representation in achieving efficient and powerful latent CoT reasoning.
5 Analysis and Interpretability
Since latent CoT decouples reasoning from explicit linguistic traces, it naturally raises the question: do LLMs internally simulate step-by-step reasoning, or do they rely on shallow heuristics that only approximate such behavior? This has encouraged analytical studies from various perspectives, including interpreting internal computation as evidence of structured reasoning, identifying shortcut mechanisms, and analyzing latent reasoning dynamics.
5.1 Internal Computation Interpretation
Several studies posit that LLMs can carry out multi-step reasoning implicitly within their hidden states, even when no explicit CoT prompt is provided. These works attempt to uncover internal structures indicative of decompositional processes. Hou et al. (2023) recovered reasoning trees from attention patterns, revealing distributed latent inference across transformer layers. Brinkmann et al. (2024) dissected a transformer trained on symbolic logic tasks and revealed an emergent recurrent computation mechanism: the model reuses internal representations across depth to simulate iterative reasoning, despite lacking explicit recurrence in its architecture. Shalev et al. (2024) showed that hidden states simultaneously encode multiple intermediate reasoning paths, indicating parallel evaluation of latent inference options. Wang et al. (2024a) showed that grokked transformers shift from memorization to generalizable algorithmic patterns, forming implicit reasoning circuits that simulate step-by-step inference without explicit CoT, even in shallow models. Yang et al. (2024) demonstrated that LLMs can retrieve intermediate bridge facts without being prompted, providing behavioral evidence of latent multi-hop reasoning. All these findings support the view that reasoning can be internally enacted without the need for external verbalization.
5.2 Shortcut Mechanisms
A line of research argues that correct outputs may result not from latent reasoning, but from shortcut strategies acquired during pre-training. These studies highlight cases where models succeed by exploiting surface-level correlations or pattern completion, rather than engaging in true inference. Yom Din et al. (2024) demonstrated that final answers were often linearly decodable from early hidden layers via the logit lens, implying that later computations may simply rephrase an already-available result. This challenges the assumption that depth corresponds to incremental reasoning. Liu et al. (2024a) showed that LLMs can learn expert-like shortcuts by skipping intermediate reasoning steps. Lin et al. (2025a) identified a reliance on token-level spurious associations, revealing fragile positional heuristics rather than compositional inference. Yu (2025) indicated that LLMs dynamically alternate between shortcut mechanisms and latent multi-step reasoning depending on task complexity. These studies caution against interpreting accurate outputs as evidence of genuine reasoning. Instead, they highlight how shortcut mechanisms, rooted in superficial correlations and positional heuristics, can produce seemingly coherent answers without underlying inference, underscoring the importance of identifying when such shortcuts are at play.
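The logit-lens probe used in this line of analysis is simple to state: project each layer's hidden state through the model's unembedding matrix and read off the top token. The sketch below uses random toy weights, so the decoded ids carry no meaning; in a real model, the answer token stabilizing across early layers is the signature of an "already-available" result:

```python
import numpy as np

rng = np.random.default_rng(4)
D, V = 8, 20  # toy hidden size and vocabulary size

W_unembed = rng.standard_normal((D, V))  # stand-in output embedding matrix

def logit_lens(hidden_states):
    """Project per-layer hidden states through the unembedding matrix and
    take the argmax, showing at which depth the answer becomes decodable."""
    logits = hidden_states @ W_unembed        # (num_layers, V)
    return logits.argmax(axis=-1)             # top token id per layer

# Hypothetical hidden states at the final position, one row per layer.
layers = rng.standard_normal((6, D))
token_ids = logit_lens(layers)
print(token_ids)  # one candidate token id per layer
```

If the same id appears from an early layer onward, later layers may be refining phrasing rather than performing additional reasoning, which is exactly the shortcut concern raised above.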
5.3 Latent Reasoning Dynamics
Bridging the two perspectives above, recent work has focused on representational analysis and controlled interventions to better characterize and steer latent reasoning dynamics. Kudo et al. (2025) used causal interventions to identify mixed reasoning strategies, showing that simple answers are computed prior to explicit reasoning, whereas harder tasks trigger active step-by-step inference. Zhang and Viteri (2025) discovered a latent CoT vector, an activation-space direction that, when added to internal states, elicits CoT behavior without explicit prompts, revealing latent CoT as an internally accessible processing mode. Complementing this, Wang et al. (2025b) proposed CoE, a representation of hidden-state trajectories during reasoning, identifying distinct patterns linked to reasoning success that enable latent self-evaluation. Overall, latent reasoning leaves measurable traces in the activation space and may be controllable or interpretable through geometric and dynamic analysis, offering new avenues for understanding and harnessing latent CoT reasoning.
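The activation-steering intervention behind such a "latent CoT vector" amounts to adding a scaled direction to an internal state. The direction and scale below are arbitrary illustrations (a real steering vector is estimated from contrastive activations); the sketch shows only the mechanics of the intervention:

```python
import numpy as np

rng = np.random.default_rng(5)
D = 8  # toy hidden size

# Hypothetical "latent CoT vector": a unit direction in activation space.
cot_direction = rng.standard_normal(D)
cot_direction /= np.linalg.norm(cot_direction)

def steer(hidden_state, direction, alpha):
    """Activation steering: add a scaled direction to an internal state to
    elicit a behavior (here, CoT-style processing) without changing the prompt."""
    return hidden_state + alpha * direction

h = rng.standard_normal(D)
h_steered = steer(h, cot_direction, alpha=4.0)
# The state moves along the steering direction by exactly alpha.
print(round(float((h_steered - h) @ cot_direction), 4))
```

Because the intervention is a single vector addition at inference time, it doubles as an analysis tool: sweeping `alpha` reveals how strongly the behavior depends on that direction.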
6 Applications
Latent CoT reasoning has been successfully applied in many domains due to its reasoning efficiency. Below, we discuss representative applications of latent CoT reasoning.
Textual Reasoning.
Existing latent CoT methods have been systematically evaluated on natural-language reasoning tasks, including mathematical reasoning (Cobbe et al., 2021; Deng et al., 2023; Hendrycks et al., 2021b; Miao et al., 2020; Patel et al., 2021; Ling et al., 2017), general commonsense reasoning (Talmor et al., 2019; Suzgun et al., 2023; Rein et al., 2024; Hendrycks et al., 2021a), and logical multi-hop reasoning datasets (Yang et al., 2018; Geva et al., 2021; Saparov and He, 2023; Hao et al., 2024). However, latent reasoning methods have yet to be evaluated on several high-bar reasoning benchmarks that have become standard for assessing Large Reasoning Models (MAA, 2024), or on code-centric datasets (Jimenez et al., 2024; Jain et al., 2025). Moreover, there remains a lack of benchmarks that are both aligned with real-world applications and specifically designed to showcase the advantages of latent reasoning.
Multimodal Reasoning and Generation.
Latent reasoning has recently been extended to multimodal domains, where generating step-by-step explanations in natural language becomes both inefficient and semantically brittle. Heima (Shen et al., 2025a) introduces compact latent "thinking tokens" that summarize intermediate reasoning steps during multimodal tasks, cutting generation cost without hurting accuracy; XS-CoT (Xue et al., 2025) hides cross-lingual speech reasoning inside a semi-implicit token schedule that speeds non-core-language responses; and LatentLM (Sun et al., 2024) treats every modality as just another latent token, enabling a truly unified generative interface. They suggest that latent CoT reasoning is no longer confined to text. As modalities proliferate, the ability to steer and edit these hidden trajectories may become the key to controllable, efficient multimodal intelligence.
Retrieval-Augmented Generation and Recommendation.
Recent work (Chen et al., 2025a; Song et al., 2025; Jin et al., 2025a) has integrated explicit reasoning mechanisms within Retrieval-Augmented Generation (RAG) frameworks, and compressing these retrieval-reasoning steps in latent space could further cut tokens and latency. Recent work on pluggable virtual tokens for RAG (Zhu et al., 2024) suggests that latent tokens can serve as lightweight carriers of external knowledge and implicit reasoning. DEBATER (Ji et al., 2025) incorporates a Chain-of-Deliberation (CoD) mechanism into dense retrieval. CoD introduces a sequence of prompt tokens to stimulate the latent reasoning capability of LLMs during document representation. It further employs self-distillation to integrate multiple reasoning steps into a unified embedding. In the recommendation area, ReaRec (Tang et al., 2025) leverages latent reasoning to enhance user interest modeling, recursively feeding the final hidden state of a user behavior sequence back into the network for multiple rounds and using special positional embeddings to distinguish between original behavioral inputs and internal reasoning steps.
7 Challenges and Future Directions
In this section, we highlight key obstacles that hinder the full realization of latent reasoning's potential and outline critical areas for future research.
7.1 Challenges
Training Difficulties
Despite their efficiency and inference speed, current latent reasoning methods still underperform explicit reasoning approaches in accuracy and problem-solving capability. This gap may stem from the difficulty of training, as current training methods typically optimize for explicit reasoning outputs rather than directly supervising latent reasoning processes. A key challenge remains in developing training methods that can fully activate LLMs' internal reasoning capabilities.
Generalization Issues
The training methods for implicit reasoning demonstrate stability primarily on fixed patterns but exhibit poor generalization capabilities. Models trained with latent space reasoning techniques often struggle when faced with novel problem structures or reasoning patterns not encountered during training (Lin et al., 2025a). This fragility suggests that current approaches to latent reasoning may be learning to compress specific reasoning templates rather than developing truly flexible reasoning capabilities in abstract space.
Interpretability Concerns
Recent studies suggest that models often perform reasoning in their "heads" that is not reflected in their verbalized CoTs, raising concerns about unfaithful or hidden internal processes (Chen et al., 2025d; Lindsey et al., 2025). The shift from explicit to implicit reasoning further introduces significant challenges for identifying errors and understanding how the model draws a particular conclusion.
7.2 Future Directions
To effectively advance latent reasoning, several promising directions merit exploration: (1) Alternative Architectures. These may play a crucial role in enhancing the expressiveness and efficiency of latent reasoning. Beyond conventional Transformers, recurrent or looped variants (Saunshi et al., 2025c) enable reasoning through parameter reuse across multiple steps. In multimodal domains, diffusion model-based architectures present compelling alternatives, potentially due to their ability to model global dependencies and non-sequential reasoning in a parallel, noise-aware manner. Recent work has successfully demonstrated the effectiveness of integrating diffusion models and latent CoT (Ye et al., 2024; Huang et al., 2025). (2) Interpretability and Verification. These are critical concerns that warrant further exploration in latent reasoning. Developing methods to probe, decode, or verify latent representations is crucial for improving transparency and calibrating reasoning behavior (Chen et al., 2025c). (3) Training Approaches. Most existing training methods are insufficient to effectively shape latent reasoning capabilities. Reinforcement learning provides a promising paradigm for exploring the potential of LLMs to develop latent reasoning through self-evolution (Guo et al., 2025), using reward signals to implicitly sculpt a structured reasoning space aligned with task objectives. In addition, curriculum learning enables models to gradually acquire increasingly abstract reasoning skills via a simple-to-complex training process. (4) LLM Agents. These may benefit significantly from latent CoT reasoning, particularly in terms of inference efficiency. Such agents often generate lengthy and verbose reasoning sequences, introducing substantial computational overhead (Zhou et al., 2025; Li et al., 2024; Zhang et al., 2024).
With latent CoT reasoning, these agents are expected to perform more compact and faster planning and decision-making. (5) Social Intelligence and Theory of Mind. Latent reasoning provides a natural substrate for modeling the nested mental states essential to Theory of Mind, the capacity to infer others' beliefs, desires, and intentions (Ma et al., 2023). Embedding latent belief modeling into reasoning pipelines could offer a scalable path toward socially competent AI.
8 Conclusion
This paper presents a comprehensive survey of latent CoT reasoning with LLMs. By moving reasoning beyond surface-level language into the latent space, latent CoT reasoning enables more abstract, efficient, and scalable inference. We summarize the key methods, identify major challenges, and highlight promising future directions. We hope this survey serves as a foundation and offers valuable insights to support further exploration in this emerging field.
Limitations
This survey offers a comprehensive review of existing methodologies and analyses in the emerging field of latent reasoning with LLMs. However, due to the breadth and rapid evolution of related work, particularly in the areas of interpretability, internal analysis, and alignment, we may have inadvertently omitted other valuable contributions. We outline several promising future directions, including alternative architectures, training paradigms, LLM agents, and Theory-of-Mind modeling, which we highlight as areas for continued exploration. Additionally, as many surveyed works rely on small-scale models or limited benchmarks, there is a need for more up-to-date and rigorous empirical validation. We advocate for continued, in-depth research to provide practitioners with actionable and robust insights into the design and deployment of latent reasoning models.
Ethics Statement
This survey is based entirely on publicly available research papers, models, and datasets. All referenced works are properly cited and used in accordance with their respective licenses and intended purposes. While latent reasoning introduces novel challenges in interpretability and alignment, this survey aims to provide a neutral, structured overview of the field without promoting specific deployments. We emphasize the importance of future work addressing fairness, safety, and transparency in latent reasoning.
References
- Ameisen et al. (2025) Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. 2025. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread.
- Brinkmann et al. (2024) Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, and Christian Bartelt. 2024. A mechanistic analysis of a transformer trained on a symbolic multi-step reasoning task. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4082–4102, Bangkok, Thailand. Association for Computational Linguistics.
- Chen et al. (2025a) Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. 2025a. Research: Learning to reason with search for llms via reinforcement learning. Preprint, arXiv:2503.19470.
- Chen et al. (2025b) Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. 2025b. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. Preprint, arXiv:2503.09567.
- Chen et al. (2025c) Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, and Zhangyang Wang. 2025c. Seal: Steerable reasoning calibration of large language models for free. Preprint, arXiv:2504.07986.
- Chen and Zou (2024) Xingwu Chen and Difan Zou. 2024. What can transformer learn with varying depth? case studies on sequence learning tasks. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.
- Chen et al. (2025d) Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. 2025d. Reasoning models don't always say what they think. Preprint, arXiv:2505.05410.
- Chen et al. (2025e) Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. 2025e. Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking. Preprint, arXiv:2502.13842.
- Cheng and Van Durme (2024) Jeffrey Cheng and Benjamin Van Durme. 2024. Compressed chain of thought: Efficient reasoning through dense representations. Preprint, arXiv:2412.13171.
- Chu et al. (2024) Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. 2024. Navigate through enigmatic labyrinth a survey of chain of thought reasoning: Advances, frontiers and future. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1173–1203, Bangkok, Thailand. Association for Computational Linguistics.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
- Dehghani et al. (2019) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In International Conference on Learning Representations.
- Deng et al. (2024) Yuntian Deng, Yejin Choi, and Stuart Shieber. 2024. From explicit cot to implicit cot: Learning to internalize cot step by step. Preprint, arXiv:2405.14838.
- Deng et al. (2023) Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. 2023. Implicit chain of thought reasoning via knowledge distillation. Preprint, arXiv:2311.01460.
- Feng et al. (2025) Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. 2025. Efficient reasoning models: A survey. Preprint, arXiv:2504.10903.
- Geiping et al. (2025) Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. 2025. Scaling up test-time compute with latent reasoning: A recurrent depth approach. Preprint, arXiv:2502.05171.
- Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361.
- Gong et al. (2025) Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, and Dongyan Zhao. 2025. Latent preference coding: Aligning large language models via discrete latent codes. Preprint, arXiv:2505.04993.
- Goyal et al. (2024) Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. 2024. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.
- Hao et al. (2024) Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024. Training large language models to reason in a continuous latent space. Preprint, arXiv:2412.06769.
- Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Herel and Mikolov (2024) David Herel and Tomas Mikolov. 2024. Thinking tokens for language modeling. Preprint, arXiv:2405.08644.
- Hou et al. (2023) Yifan Hou, Jiaoda Li, Yu Fei, Alessandro Stolfo, Wangchunshu Zhou, Guangtao Zeng, Antoine Bosselut, and Mrinmaya Sachan. 2023. Towards a mechanistic interpretation of multi-step reasoning capabilities of language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4902–4919, Singapore. Association for Computational Linguistics.
- Huang et al. (2025) Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, and Guo-Jun Qi. 2025. Reinforcing the diffusion chain of lateral thought with diffusion language models. Preprint, arXiv:2505.10446.
- Ishibashi et al. (2025) Yoichi Ishibashi, Taro Yano, and Masafumi Oyamada. 2025. Mining hidden thoughts from texts: Evaluating continual pretraining with synthetic data for llm reasoning. Preprint, arXiv:2505.10182.
- Jain et al. (2025) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations.
- Ji et al. (2025) Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu, Ge Yu, and Maosong Sun. 2025. Learning more effective representations for dense retrieval through deliberate thinking before search. Preprint, arXiv:2502.12974.
- Jimenez et al. (2024) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations.
- Jin et al. (2025a) Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025a. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. Preprint, arXiv:2503.09516.
- Jin et al. (2025b) Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, and Yongfeng Zhang. 2025b. Disentangling memory and reasoning ability in large language models. Preprint, arXiv:2411.13504.
- Kojima et al. (2022) Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
- Kudo et al. (2025) Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, and Kentaro Inui. 2025. Think-to-talk or talk-to-think? when llms come up with an answer in multi-step arithmetic reasoning. Preprint, arXiv:2412.01113.
- Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
- Li et al. (2024) Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, and Yunxin Liu. 2024. Personal llm agents: Insights and survey about the capability, efficiency and security. Preprint, arXiv:2401.05459.
- Lin et al. (2025a) Tianhe Lin, Jian Xie, Siyu Yuan, and Deqing Yang. 2025a. Implicit reasoning in transformers is reasoning through shortcuts. Preprint, arXiv:2503.07604.
- Lin et al. (2025b) Zicheng Lin, Tian Liang, Jiahao Xu, Qiuzhi Lin, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, and Zhaopeng Tu. 2025b. Critical tokens matter: Token-level contrastive estimation enhances llm's reasoning capability. Preprint, arXiv:2411.19943.
- Lindsey et al. (2025) Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. 2025. On the biology of a large language model. Transformer Circuits Thread.
- Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.
- Liu et al. (2024a) Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. 2024a. Can language models learn to skip steps? In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Liu et al. (2024b) Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, and Weiqi Luo. 2024b. Expediting and elevating large language model reasoning via hidden chain-of-thought decoding. Preprint, arXiv:2409.08561.
- Liu et al. (2025) Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, and Bryan Hooi. 2025. Efficient inference for large reasoning models: A survey. Preprint, arXiv:2503.23077.
- Ma et al. (2023) Ziqiao Ma, Jacob Sansom, Run Peng, and Joyce Chai. 2023. Towards a holistic landscape of situated theory of mind in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1011–1031, Singapore. Association for Computational Linguistics.
- MAA (2024) MAA. 2024. American invitational mathematics examination - aime. Accessed in February 2024, from American Invitational Mathematics Examination - AIME 2024.
- McLeish and Tran-Thanh (2023) Sean Michael McLeish and Long Tran-Thanh. 2023. [re] end-to-end algorithm synthesis with recurrent networks: Logical extrapolation without overthinking. In ML Reproducibility Challenge 2022.
- Miao et al. (2020) Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, Online. Association for Computational Linguistics.
- Mohtashami et al. (2025) Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. 2025. CoTFormer: A chain of thought driven architecture with budget-adaptive computation cost at inference. In The Thirteenth International Conference on Learning Representations.
- OpenAI (2025) OpenAI. 2025. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.
- Pfau et al. (2024) Jacob Pfau, William Merrill, and Samuel R. Bowman. 2024. Let's think dot by dot: Hidden computation in transformer language models. In First Conference on Language Modeling.
- Pinker (1994) Steven Pinker. 1994. The Language Instinct: How the Mind Creates Language. Harper Collins, New York.
- Qu et al. (2025) Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, and Yu Cheng. 2025. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. Preprint, arXiv:2503.21614.
- Qwen (2025) Qwen. 2025. Qwen3: Think deeper, act faster. https://qwenlm.github.io/blog/qwen3/.
- Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.
- Ruan et al. (2025) Yangjun Ruan, Neil Band, Chris J. Maddison, and Tatsunori Hashimoto. 2025. Reasoning to learn from latent thoughts. Preprint, arXiv:2503.18866.
- Saparov and He (2023) Abulhair Saparov and He He. 2023. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In The Eleventh International Conference on Learning Representations.
- Saunshi et al. (2025a) Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. 2025a. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations.
- Saunshi et al. (2025b) Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. 2025b. Reasoning with latent thoughts: On the power of looped transformers. Preprint, arXiv:2502.17416.
- Saunshi et al. (2025c) Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. 2025c. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations.
- Schwarzschild et al. (2021) Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. 2021. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks. In Advances in Neural Information Processing Systems.
- Shalev et al. (2024) Yuval Shalev, Amir Feder, and Ariel Goldstein. 2024. Distributional reasoning in llms: Parallel reasoning processes in multi-hop reasoning. Preprint, arXiv:2406.13858.
- Shen et al. (2025a) Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. 2025a. Efficient reasoning with hidden thinking. Preprint, arXiv:2501.19201.
- Shen et al. (2025b) Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. 2025b. Codi: Compressing chain-of-thought into continuous space via self-distillation. Preprint, arXiv:2502.21074.
- Song et al. (2025) Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. Preprint, arXiv:2503.05592.
- Su et al. (2025) DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, and Qinqing Zheng. 2025. Token assorted: Mixing latent and text tokens for improved language model reasoning. Preprint, arXiv:2502.03275.
- Sui et al. (2025) Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, and Xia Hu. 2025. Stop overthinking: A survey on efficient reasoning for large language models. Preprint, arXiv:2503.16419.
- Sun et al. (2024) Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. 2024. Multimodal latent language modeling with next-token diffusion. Preprint, arXiv:2412.08635.
- Suzgun et al. (2023) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. 2023. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, Toronto, Canada. Association for Computational Linguistics.
- Tack et al. (2025) Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, and Xian Li. 2025. Llm pretraining with continuous concepts. Preprint, arXiv:2502.08524.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
- Tang et al. (2025) Jiakai Tang, Sunhao Dai, Teng Shi, Jun Xu, Xu Chen, Wen Chen, Wu Jian, and Yuning Jiang. 2025. Think before recommend: Unleashing the latent reasoning power for sequential recommendation. Preprint, arXiv:2503.22675.
- Wang et al. (2024a) Boshi Wang, Xiang Yue, Yu Su, and Huan Sun. 2024a. Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization. In Advances in Neural Information Processing Systems.
- Wang et al. (2025a) Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, and Kam-Fai Wong. 2025a. Harnessing the reasoning economy: A survey of efficient reasoning for large language models. Preprint, arXiv:2503.24377.
- Wang et al. (2024b) Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, William Yang Wang, and Alessandro Sordoni. 2024b. Guiding language model reasoning with planning tokens. In First Conference on Language Modeling.
- Wang et al. (2025b) Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, and Rui Wang. 2025b. Latent space chain-of-embedding enables output-free llm self-evaluation. In The Thirteenth International Conference on Learning Representations.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Wittgenstein (1922) Ludwig Wittgenstein. 1922. Tractatus Logico-Philosophicus. Annalen der Naturphilosophie.
- Wu et al. (2025) Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, and Xun Zhou. 2025. Efficient pretraining length scaling. Preprint, arXiv:2504.14992.
- Xu et al. (2025a) Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. 2025a. Softcot: Soft chain-of-thought for efficient reasoning with llms. Preprint, arXiv:2502.12134.
- Xu et al. (2025b) Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. 2025b. Softcot: Soft chain-of-thought for efficient reasoning with llms. Preprint, arXiv:2502.12134.
- Xu et al. (2025c) Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. 2025c. Softcot++: Test-time scaling with soft chain-of-thought reasoning. Preprint, arXiv:2505.11484.
- Xue et al. (2025) Hongfei Xue, Yufeng Tang, Hexin Liu, Jun Zhang, Xuelong Geng, and Lei Xie. 2025. Enhancing non-core language instruction-following in speech llms via semi-implicit cross-lingual cot reasoning. Preprint, arXiv:2504.20835.
- Yang et al. (2024) Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. 2024. Do large language models latently perform multi-hop reasoning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10210–10229, Bangkok, Thailand. Association for Computational Linguistics.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
- Ye et al. (2024) Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, and Lingpeng Kong. 2024. Diffusion of thoughts: Chain-of-thought reasoning in diffusion language models. In Advances in Neural Information Processing Systems.
- Ye et al. (2025) Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. 2025. Physics of language models: Part 2.1, grade-school math and the hidden reasoning process. In The Thirteenth International Conference on Learning Representations.
- Yom Din et al. (2024) Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. 2024. Jump to conclusions: Short-cutting transformers with linear transformations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9615–9625, Torino, Italia. ELRA and ICCL.
- Yu et al. (2024) Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. 2024. Distilling system 2 into system 1. Preprint, arXiv:2407.06023.
- Yu et al. (2025) Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, and Di He. 2025. Enhancing auto-regressive chain-of-thought through loop-aligned reasoning. Preprint, arXiv:2502.08482.
- Yu (2025) Yijiong Yu. 2025. Do llms really think step-by-step in implicit reasoning? Preprint, arXiv:2411.15862.
- Zelikman et al. (2024) Eric Zelikman, Georges Raif Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah Goodman. 2024. Quiet-STaR: Language models can teach themselves to think before speaking. In First Conference on Language Modeling.
- Zelikman et al. (2022) Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems.
- Zhang and Viteri (2025) Jason Zhang and Scott Viteri. 2025. Uncovering latent chain of thought vectors in language models. Preprint, arXiv:2409.14026.
- Zhang et al. (2025) Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. 2025. Lightthinker: Thinking step-by-step compression. Preprint, arXiv:2502.15589.
- Zhang et al. (2024) Yang Zhang, Shixin Yang, Chenjia Bai, Fei Wu, Xiu Li, Zhen Wang, and Xuelong Li. 2024. Towards efficient llm grounding for embodied multi-agent collaboration. Preprint, arXiv:2405.14314.
- Zhou et al. (2025) Xueyang Zhou, Guiyao Tie, Guowen Zhang, Weidong Wang, Zhigang Zuo, Di Wu, Duanfeng Chu, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. 2025. Large reasoning models in agent scenarios: Exploring the necessity of reasoning capabilities. Preprint, arXiv:2503.11074.
- Zhu et al. (2024) Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, and Ji-Rong Wen. 2024. One token can help! learning scalable and pluggable virtual tokens for retrieval-augmented large language models. Preprint, arXiv:2405.19670.