# Chain-of-Thought Tokens are Computer Program Variables
Abstract
Chain-of-thought (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has proven effective in helping LLMs solve complex reasoning tasks. However, the inner mechanism of CoT remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only the tokens that store intermediate results achieves comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form does not affect model performance. We also randomly intervene on some values in the CoT, and observe that subsequent CoT tokens and the final answer change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs, albeit with potential drawbacks such as unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.
Fangwei Zhu, Peiyi Wang, Zhifang Sui School of Computer Science, State Key Laboratory of Multimedia Information Processing, Peking University zhufangwei2022@stu.pku.edu.cn wangpeiyi9979@gmail.com szf@pku.edu.cn
1 Introduction
Chain-of-thought (CoT) Wei et al. (2022) is a widely adopted technique that greatly boosts the capability of large language models (LLMs) on reasoning tasks like solving mathematical problems Shao et al. (2024); Wang et al. (2024) or generating code Guo et al. (2025). By requiring language models to generate intermediate steps before reaching the final result, chain-of-thought enables LLMs to perform advanced reasoning, significantly outperforming standard supervised learning methods. Various methods have been explored to unlock chain-of-thought reasoning in LLMs, for example prompt design Wei et al. (2022); Khot et al. (2022); Zhou et al. (2022), instruction tuning Yue et al. (2023); Yu et al. (2023), and reinforcement learning Havrilla et al. (2024); Wang et al. (2024); Guo et al. (2025).
There have been theoretical studies on the efficacy of chain-of-thought Deng et al. (2024); Li et al. (2024); Chen et al. (2024). These studies reveal that while compositional problems requiring serial computation are exponentially hard for language models to solve directly, CoT helps models solve them with polynomial complexity. More interestingly, CoT tokens do not need to fall within the “language space”: latent vectors can also enable language models to perform complex reasoning Hao et al. (2024), indicating that CoTs are more than mere thought traces. Still, the mechanism of how CoT works and the role of CoT tokens are not fully explored.
In this paper, we propose a novel hypothesis that CoT tokens function like computer program variables. To be specific, the tokens in CoT store intermediate values that will be used in subsequent computations, and these values are partially mutable to control the final output. As long as the important intermediate values are calculated and stored, the CoT that leads to the final answer could be represented in different forms.
To verify this hypothesis, we conduct an empirical study on two types of problems that both require long-chain serial computation: multi-digit multiplication and dynamic programming. By comparing the performance of vanilla prompting with CoT prompting, we confirm that CoT is crucial for these problems. However, we also find that removing non-result tokens does not cause performance drops, which means that the tokens storing intermediate values matter most in chain-of-thought.
We further explore whether intermediate values can be represented in different forms. We attempt to compress consecutive digits of a number into a single latent vector, and experimental results show that this does not harm model performance. This phenomenon indicates that the existence, rather than the form, of intermediate values is what matters to language models. However, when the degree of compression exceeds the model's capacity, reasoning fails.
To further confirm that the intermediate values are causally connected to the output, we intervene on some tokens in the CoT, replacing them with random values. We observe that LLMs ignore previous steps and use the intervened value in subsequent computations, supporting that CoT tokens are causally related to the final result. We conclude that CoT tokens function like variables in computer programs.
To sum up, in this paper we empirically study the function of CoT tokens and find that: (1) CoT tokens play a role similar to variables in computer programs, storing intermediate values used in subsequent computations; (2) the intermediate values can be stored in CoT tokens in different forms; (3) the values in CoT tokens are causally related to the final output and can be intervened on like program variables. These findings are helpful for understanding alternative forms of CoT and could assist in designing more concise CoTs.
2 Preliminary
2.1 Chain-of-Thoughts
Chain-of-thought (CoT) Wei et al. (2022) is a technique commonly used with decoder-only transformers. Given input text $x$ , CoT generates intermediate steps $z$ prior to the final answer $y$ . In other words, instead of modeling the probability distribution $P(y|x)$ , CoT models the joint distribution $P(y,z|x)=P(z|x)P(y|x,z)$ .
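This factorization can be checked numerically on a toy example; the probability tables below are made-up illustrative numbers, not model outputs:

```python
# Toy numerical check of the CoT factorization P(y, z | x) = P(z | x) * P(y | x, z).

# P(z | x): probability of each intermediate-step sequence given the input
p_z_given_x = {"z1": 0.7, "z2": 0.3}

# P(y | x, z): probability of each final answer given input and steps
p_y_given_xz = {
    ("y1", "z1"): 0.9, ("y2", "z1"): 0.1,
    ("y1", "z2"): 0.2, ("y2", "z2"): 0.8,
}

def p_joint(y, z):
    """P(y, z | x) via the chain rule."""
    return p_z_given_x[z] * p_y_given_xz[(y, z)]

# Marginalizing out z recovers P(y | x), the quantity direct prompting models.
p_y1 = sum(p_joint("y1", z) for z in p_z_given_x)  # 0.7*0.9 + 0.3*0.2 = 0.69
```

Direct prompting must model the marginal $P(y|x)$ in one shot, while CoT spreads the computation across the intermediate sequence $z$.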
For convenience, we use two special tokens <COT> and </COT> to separate CoT tokens from the final result in our experiments.
2.2 Compositional Tasks
It has been noticed that LLMs may fail on seemingly trivial problems like multi-digit multiplication. What these problems have in common is that they require strict multi-hop reasoning to derive correct predictions, demanding that language models perform step-by-step reasoning as humans do. In this paper, we choose two representative tasks to study the role of CoT tokens:
Algorithm 1 Digit-wise multiplication
Require: Integers $a$ and $b$
Ensure: Value of $a*b$
Partial = [ ]
for Digit $b[i]$ in $b$ do
carry $←$ 0
for Digit $a[j]$ in $a$ do
x $←$ $a[j]*b[i]$ + carry
digit $←$ x mod 10
carry $←$ x/10
end for
res $←$ Combine digits and last carry
Add res $*10^{i}$ to Partial
end for
while Len(Partial) > 1 do
x $←$ Partial[0] + Partial[1]
Partial $←$ [x] + Partial[2:]
end while
return Partial[0]
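Algorithm 1 can be sketched in runnable Python as follows; the current digit of each step is $x \bmod 10$, the carry is $\lfloor x/10\rfloor$, and each partial result is shifted by its digit position before the final summation loop:

```python
def long_multiply(a: int, b: int) -> int:
    """Long-form multiplication via digit-wise products, following Algorithm 1.
    Digits are processed least-significant first."""
    a_digits = [int(c) for c in str(a)][::-1]
    partial = []
    for i, bd in enumerate(int(c) for c in str(b)[::-1]):
        carry = 0
        digits = []
        for ad in a_digits:
            x = ad * bd + carry
            digits.append(x % 10)  # current digit
            carry = x // 10        # carried into the next position
        digits.append(carry)       # combine digits and last carry
        res = sum(d * 10 ** k for k, d in enumerate(digits))
        partial.append(res * 10 ** i)  # shift by digit position of b
    # add partial results up pairwise, as in the while loop of Algorithm 1
    while len(partial) > 1:
        partial = [partial[0] + partial[1]] + partial[2:]
    return partial[0]

print(long_multiply(1234, 5678))  # 7006652
```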
Multi-digit Multiplication
Calculating the product of two multi-digit numbers $(x,y)$ requires executing multiple operations based on procedural rules Dziri et al. (2023). A commonly adopted solution is the long-form multiplication algorithm, which iteratively calculates digit-wise products and adds them up to obtain the final result. We describe the algorithm in Algorithm 1; see Appendix A for prompt and dataset construction details.
Algorithm 2 Maximum path sum in a grid
Require: An $m*n$ matrix $W$
Ensure: Maximum weight sum $s$ on a path
DP $←$ $m*n$ matrix filled with 0
for $i$ in range( $m$ ) do
for $j$ in range( $n$ ) do
DP[ $i$ ][ $j$ ] = $\max$ (DP[ $i-1$ ][ $j$ ], DP[ $i$ ][ $j-1$ ]) + W[ $i$ ][ $j$ ]
end for
end for
return DP[ $m-1$ ][ $n-1$ ]
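Algorithm 2 can be sketched in runnable Python; out-of-range neighbors (above the first row or left of the first column) are treated as 0:

```python
def max_path_sum(W):
    """Maximum path sum in a grid (Algorithm 2), moving only down or right."""
    m, n = len(W), len(W[0])
    DP = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            best_prev = max(
                DP[i - 1][j] if i > 0 else 0,  # coming from above
                DP[i][j - 1] if j > 0 else 0,  # coming from the left
            )
            DP[i][j] = best_prev + W[i][j]
    return DP[m - 1][n - 1]

grid = [[1, 3, 1],
        [1, 5, 1],
        [4, 2, 1]]
print(max_path_sum(grid))  # 1+3+5+2+1 = 12
```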
Dynamic Programming
Dynamic programming (DP) is an algorithmic paradigm that breaks a complicated problem down into simpler sub-problems and then recursively solves them. In our experiments, we use the “Maximum Path Sum in a Grid” problem: given an $m× n$ grid filled with non-negative numbers where only moving downward and rightward is allowed, find a path from the top-left to the bottom-right corner that maximizes the sum of all numbers along the path. This classic problem can be solved with dynamic programming in $O(m× n)$ time. We describe the algorithm in Algorithm 2; see Appendix B for prompt and dataset construction details.
[x1.png: accuracy heatmap for plain-prompting multiplication over operand digit counts digit_a × digit_b (1–5).]
(a) Plain multiplication
[x2.png: accuracy heatmap for CoT multiplication over operand digit counts digit_a × digit_b (1–5); accuracy stays between 0.88 and 1.00.]
(b) CoT multiplication
[x3.png: accuracy heatmap for plain-prompting DP over problem sizes (1–5).]
(c) Plain DP
[x4.png: accuracy heatmap for CoT DP over problem sizes (1–5); near-uniform accuracy ≥ 0.99.]
(d) CoT DP
Figure 1: Comparison on model accuracy between plain prompting and chain-of-thought prompting.
3 CoT Tokens Store Intermediate Results
Experimental Setup
In all of our experiments, we use Qwen-2.5-1.5B Yang et al. (2024) as the backbone model. On each task, we finetune the model on the corresponding training data and then evaluate whether the generated final answer matches the gold answer. Training uses a learning rate of 1e-5 for 1 epoch. See Appendix C for detailed hyperparameter settings.
3.1 Necessity of Chain-of-Thoughts
We start by examining the effectiveness of CoT by comparing model performance under direct prompting and CoT settings. As illustrated in Figure 1, the model trained with direct prompts struggles starting from 3*3-digit multiplication and completely fails on larger numbers. In contrast, the model easily solves multiplication problems with chain-of-thought, achieving near-perfect accuracy.
The same applies to dynamic programming problems. Direct prompting fails as the number of intermediate states increases, while CoT maintains its competence. These results support the conclusion that chain-of-thought is necessary for solving inherently serial problems that require multi-step reasoning, just as previous research suggests Li et al. (2024); Chen et al. (2024).
[x5.png: accuracy heatmap for multiplication with non-result tokens removed from the CoT (operand digit counts 1–5); accuracy remains high across digit pairs.]
Figure 2: Model performance when non-result tokens are removed from CoT in multi-digit multiplication. Removing these tokens has little impact.
3.2 Removing Non-result Tokens
One concern about CoT is whether it can be compressed into a more concise form. An obvious approach is to remove less important tokens. Specifically, we remove tokens that are neither a number nor a symbol (for example, the word “carry” in multiplication problems; see Appendix C for details), making the CoT a sequence consisting purely of the intermediate results needed to solve the task.
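A minimal sketch of this removal rule, assuming whitespace-separated CoT tokens (the exact filtering used in the paper is specified in Appendix C; this token format and the helper name are illustrative):

```python
import re

def compress_cot(cot: str) -> str:
    """Keep only tokens that are numbers or symbols, dropping words like 'carry'.
    Whitespace tokenization here is an illustrative simplification."""
    kept = [tok for tok in cot.split() if not re.search(r"[A-Za-z]", tok)]
    return " ".join(kept)

print(compress_cot("digit = 6 , carry = 2 ; partial : 56088"))
# -> "= 6 , = 2 ; : 56088"
```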
Figure 2 shows the model performance after removing these tokens. While the removal makes the CoT unreadable, models finetuned on compressed CoT still achieve satisfactory performance, even outperforming the full-CoT setting. We infer from this phenomenon that intermediate results matter more than semantic completeness in CoT. In other words, the central function of CoT is to store the sequence of intermediate results.
[x6.png: diagram of the latent-token reasoning pipeline, showing <COT> and repeated <LAT> tokens flowing through the language model, with latent embeddings replacing input embeddings and a latent output head predicting the next latent embedding.]
Figure 3: The model structure used to reason with latent tokens. We use one-hot vectors as the latent embedding of latent tokens <LAT>. When the input token is a latent token, we use its projected latent embedding to replace the original input embedding. Correspondingly, a latent output head is added to predict the latent embedding of the next token from the last hidden state.
[x7.png: accuracy heatmap for multiplication with latent tokens (operand digit counts 1–5); accuracy remains close to 1.00 across digit pairs.]
(a) Multiplication
[x8.png: accuracy heatmap for DP with latent tokens (problem sizes 1–5); accuracy remains high across all pairs.]
(b) DP
Figure 4: Model performances when merging intermediate results into latent tokens.
3.3 Merging Results into Latent Tokens
Another concern about CoT is whether intermediate results need to be recorded explicitly as text tokens. To test this, we merge some of the intermediate results and represent the merged results with latent tokens.
Method design.
As depicted in Figure 3, we use latent tokens <LAT> to store intermediate results, with each latent token storing a complete number. For simplicity, we use a binary vector $\mathbf{l}=(l_{1},l_{2},\dots,l_{d})\in\mathbb{R}^{d}$, formed by concatenating one one-hot encoding per digit, as the embedding of a latent token: a vector with $d=10n$ dimensions can represent a number $N$ of at most $n$ digits.
$$
l_{10k+x}=\begin{cases}1, & \lfloor\frac{N}{10^{k}}\rfloor \bmod 10 = x\\
0, & \lfloor\frac{N}{10^{k}}\rfloor \bmod 10 \neq x\end{cases} \tag{1}
$$
We start by setting all values in $\mathbf{l}$ to 0. If the $k$-th digit of $N$ under the little-endian system (i.e., $k=0$ is the units digit) is $x$, we set $l_{10k+x}=1$. In this way, a number is represented with a single latent token instead of multiple digit tokens.
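As a concrete illustration, the encoding in Equation (1) can be sketched in a few lines of Python (our own minimal sketch; the function names are hypothetical and not taken from the released code):

```python
def encode_latent(n: int, num_digits: int) -> list[int]:
    """Encode a non-negative integer into a d = 10 * num_digits binary vector.

    The k-th little-endian digit x of n sets position 10*k + x to 1, so the
    vector is a concatenation of one one-hot block per digit (Equation 1).
    """
    assert 0 <= n < 10 ** num_digits
    l = [0] * (10 * num_digits)
    for k in range(num_digits):
        x = (n // 10 ** k) % 10
        l[10 * k + x] = 1
    return l


def decode_latent(l: list[int]) -> int:
    """Invert the encoding: read off the hot index within each 10-dim block."""
    num_digits = len(l) // 10
    return sum(l[10 * k:10 * k + 10].index(1) * 10 ** k
               for k in range(num_digits))
```

For example, $N=21$ with $n=2$ sets exactly two positions, $l_{1}$ (units digit 1) and $l_{12}$ (tens digit 2), to 1.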
To support reasoning with latent tokens, we augment the Transformer architecture with an input projection module $P_{in}$ and a latent output head $P_{out}$. When the input token $c_{t}$ at position $t$ is a latent token, we feed its latent embedding $\mathbf{l}_{t}$ to the projection module $P_{in}$ and use the projected vector as the input embedding. Correspondingly, the last hidden state $\mathbf{h}_{t}$ is fed to the latent output head $P_{out}$, in addition to the default LM head, to predict the latent embedding of the next token $\mathbf{l}_{t+1}$.
We use linear layers to implement $P_{in}$ and $P_{out}$ , which can be described as:
$$
P_{in}(\mathbf{l}_{t})=\mathbf{W}_{in}\mathbf{l}_{t}+\mathbf{b}_{in},\qquad P_{out}(\mathbf{h}_{t})=\mathbf{W}_{out}\mathbf{h}_{t}+\mathbf{b}_{out} \tag{2}
$$
where $\mathbf{W}_{in},\mathbf{b}_{in},\mathbf{W}_{out},\mathbf{b}_{out}$ are randomly initialized trainable parameters.
An additional latent loss $\mathcal{L}_{lat}$ is introduced to train the augmented model:
$$
\mathcal{L}_{lat}=\frac{1}{N_{l}}\sum_{c_{t}=\text{<LAT>}}\text{BCE}\left(\sigma(P_{out}(\mathbf{h}_{t})),\mathbf{y}\right) \tag{4}
$$
where $N_{l}$ is the number of latent tokens, $\mathbf{y}$ is the gold latent embedding, BCE is the binary cross-entropy loss function, and $\sigma$ is the sigmoid function.
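The projections of Equation (2) and the latent loss of Equation (4) can be sketched as follows (an illustrative NumPy version; the dimensions and names are our own assumptions, and in practice $P_{in}$ and $P_{out}$ are trained jointly with the Transformer rather than applied as standalone functions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 50, 768  # latent dimension and model hidden size (illustrative values)

# Randomly initialized projection parameters (Equation 2).
W_in, b_in = rng.normal(size=(h, d)) * 0.02, np.zeros(h)
W_out, b_out = rng.normal(size=(d, h)) * 0.02, np.zeros(d)


def p_in(l):
    """Map a latent embedding (d,) to an input embedding (h,)."""
    return W_in @ l + b_in


def p_out(hid):
    """Map a last hidden state (h,) to latent logits (d,)."""
    return W_out @ hid + b_out


def latent_loss(hidden_states, targets):
    """Mean binary cross-entropy over latent positions (Equation 4)."""
    total = 0.0
    for hid, y in zip(hidden_states, targets):
        p = 1.0 / (1.0 + np.exp(-p_out(hid)))  # sigma(P_out(h_t))
        total += -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return total / len(targets)
```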
(a) Successful intervention
(b) Intervention with a shortcut error
Figure 5: Examples of a successful intervention (left) and an intervention with a shortcut error (right). Blue numbers refer to relevant values in the original CoT, red numbers refer to the intervention, green numbers refer to values that change as expected, but purple numbers do not change due to a shortcut error.
Experimental setup.
For multiplication problems, we replace each digit-wise multiplication step with a single latent token and set $d=20$; for DP problems, we replace each intermediate state with a single latent token and set $d=50$. The final training loss is the sum of the latent loss $\mathcal{L}_{lat}$ and the default LM head loss.
Figure 4 shows model performance when trained with latent tokens. Surprisingly, merging digit tokens into a single latent token does not hurt performance: the model retains most of its ability to solve the problems. On multiplication problems, the accuracy with latent tokens is almost identical to that with full CoT; on 5*5 multiplication, latent tokens even surpass the original CoT, suggesting that the surface form of intermediate results does not matter.
However, latent tokens do bring a disadvantage on DP problems, where they must store larger numbers: for example, accuracy drops by 9% on 4*5 DP problems. This suggests that the computational complexity between tokens must not exceed a certain limit, which we discuss further in Section 4.2.
4 CoT Tokens are Mutable Variables
In the previous section, we find that while CoT is essential for solving complex problems (Section 3.1), the tokens that store intermediate results matter more than the others (Section 3.2). Meanwhile, compressing intermediate results into latent tokens does not noticeably harm model performance (Section 3.3), indicating that intermediate results can be stored in different forms.
Here, we continue to examine whether these stored intermediate results are causally related to the final prediction, and how the computational complexity between intermediate results affects model performance.
(a) Success rate
(b) Error breakdown
Figure 6: (a) Success rate of intervention. When the intervened output is the same as simulated, we view it as a successful intervention. (b) Error breakdown. Shortcut error occupies a large percentage of the errors.
Figure 7: Demonstration of the alternative merging strategy. Each line refers to the compare-then-add state transfer function in the original setup. Nodes corresponding to the dashed boxes will not appear in the new CoT. It will cost at most 3 compare-then-add operations (red lines) to transfer states between new matrix tokens.
4.1 Intervening on the Values of CoT Tokens
An inherent problem is that, while CoT is essential for reaching the right answer, some intermediate results may merely correlate with the output rather than causally affect it. To investigate this, we perform intervention experiments: we replace intermediate results and observe whether the final result changes as expected.
For multiplication problems, we randomly choose a substep in the CoT and replace its result with a different random number; for DP problems, we randomly choose an intermediate state and replace it with a different random number. For simplicity, we perform interventions on 4*4 problems and replace only one number in each data entry. Details are described in Appendix E.
As shown in Figure 6(a), the intervention achieves a decent success rate on both tasks, clearly indicating that the intermediate values stored in CoT tokens are causally related to the final answer. We also observe that subsequent reasoning steps change correspondingly. Take Figure 5(a) as an example: when we change the carry from 2 to 4, the model generates $8493*7=59471$ instead of 59451, exactly as simulated. In other words, tokens in CoT not only store intermediate values; they also act as “variables” that affect subsequent reasoning steps.
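The expected downstream effect of such an intervention can be simulated outside the model. The sketch below (our own illustrative code, not from the released repository) replays the digit-wise multiplication CoT and reproduces the example in Figure 5(a): forcing the first carry from 2 to 4 changes the partial result of $8493*7$ from 59451 to 59471.

```python
def digitwise_multiply(n, digit, intervene_step=None, new_carry=None):
    """Replay the digit-wise CoT for n * digit (0 <= digit <= 9).

    If intervene_step is given, the carry produced at that (0-indexed,
    little-endian) step is overwritten with new_carry, and all remaining
    steps are computed from the intervened value, as a faithful model should.
    """
    digits = [int(c) for c in str(n)][::-1]  # little-endian digits of n
    carry, out_digits = 0, []
    for i, d in enumerate(digits):
        prod = d * digit + carry
        out_digits.append(prod % 10)       # the generated result digit
        carry = prod // 10                 # the generated carry token
        if i == intervene_step:
            carry = new_carry              # intervention on the carry value
    while carry:                           # flush the final carry
        out_digits.append(carry % 10)
        carry //= 10
    return int("".join(map(str, reversed(out_digits))))
```

A successful intervention is one where the model's intervened output matches this simulation; mismatches are what we categorize as errors below.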
Another interesting observation is that the success rate on multiplication problems is significantly lower than that on DP problems. We investigate the causes of unsuccessful interventions and group them into five categories: (1) addition error, where the model fails to add up the partial multiplication results; (2) reconstruction error, where a partial multiplication result conflicts with its digit-wise results; (3) copy error, where partial multiplication results do not correctly appear in the addition step; (4) shortcut error, where the model has learned a “shortcut” for certain multiplications (usually when one of the operands is 0 or 1); and (5) misc error, covering the remaining errors.
Figure 6(b) illustrates the distribution of error types; among the five, shortcut errors occupy the largest portion. As shown in Figure 5(b), while changing the carry from 0 to 9 affects the next digit as intended, the model does not change its result in the substep $7967*1000$. When multiplying a number $x$ by 1, the model appears to take a shortcut of directly copying $x$ rather than aggregating the digit-wise multiplication results.
To sum up, language models use the values in CoT tokens the way computer programs use variables, but they may develop shortcuts on easy subproblems that leave certain variables unused.
Figure 8: Probing accuracy on different layers. Intermediate variable values can only be probed on late layers, regardless of the overall accuracy.
4.2 Probing the Limit of CoT Tokens
In Section 3.3, we discover that intermediate values can be compressed into latent tokens. This naturally raises the question: to what extent can these values be compressed? To address it, we adopt more aggressive compression strategies and use linear probing classifiers to observe how the compression affects the final output.
We choose 5*5 DP problems as the base task and reuse the latent-token setting of Section 3.3. Specifically, we introduce an alternative strategy that merges two adjacent latent tokens in a row into one latent token (Figure 7), yielding a 3*3 matrix of CoT tokens instead of a 5*5 one. However, the computational complexity between CoT tokens also increases: a state transfer now costs up to 3 times as many compare-then-add operations as in the original case.
For each CoT token <LAT>, we use a linear probe to recover its latent embedding $\mathbf{l}$ from the hidden state $\mathbf{h}_{k}$ at each layer $k$ of the previous token, training a separate probe $P_{k}$ for each layer:
$$
P_{k}(\mathbf{h}_{k})=\mathbf{W}_{k}\mathbf{h}_{k}+\mathbf{b}_{k} \tag{5}
$$
where $\mathbf{W}_{k}$ and $\mathbf{b}_{k}$ are trainable parameters.
After training the probes on the training set, we evaluate them with two metrics: element accuracy, the ratio of correctly predicted individual dimensions, and token accuracy, the ratio of latent tokens whose dimensions are all predicted correctly.
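A layer-wise probe of this kind, together with the two metrics, can be sketched as follows (an illustrative NumPy version using a closed-form least-squares fit; the names and shapes are our own assumptions, and the paper's probes are presumably trained by gradient descent):

```python
import numpy as np

def fit_linear_probe(H, L):
    """Fit P(h) = W h + b by least squares, given hidden states H of shape
    (N, h_dim) and binary latent embeddings L of shape (N, d)."""
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])  # absorb the bias b into W
    Wb, *_ = np.linalg.lstsq(H1, L, rcond=None)
    return Wb  # shape (h_dim + 1, d)


def probe_metrics(Wb, H, L):
    """Element accuracy: fraction of correctly predicted dimensions.
    Token accuracy: fraction of tokens with every dimension correct."""
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])
    pred = (H1 @ Wb > 0.5).astype(int)  # threshold regression outputs
    element_acc = (pred == L).mean()
    token_acc = (pred == L).all(axis=1).mean()
    return element_acc, token_acc
```

By construction, token accuracy can never exceed element accuracy, since a token counts as correct only when all of its dimensions are.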
Figure 8 shows the results of probing CoT tokens. Aggressively merging CoT tokens significantly lowers both element accuracy and token accuracy, suggesting that there exists a limit on the computational complexity between tokens, beyond which the LLM can no longer correctly compute the next intermediate variable.
Figure 9: Accuracy breakdown by the scale of target values. When computational complexity between tokens exceeds a limit, the model will fail.
Figure 9 further breaks down the accuracy by the scale of the values stored in merged latent tokens. Merging latent tokens has little impact on numbers with 1 or 2 digits, but drives accuracy to near 0 on larger values. A plausible explanation is that small numbers are easier to compute, so the model can “afford” the extra computational cost introduced by merging latent tokens.
Another interesting observation is that the two accuracy curves share a similar pattern: token accuracy stays at 0 in early layers and rises rapidly around layer 20. Previous work Stolfo et al. (2023); Zhu et al. (2025) concludes that LLMs tend to use early-to-mid layers to gather and process information from previous tokens, and determine the output only in mid-to-late layers. We may further conjecture that the role of each layer does not change with the computational complexity between CoT tokens.
5 Discussion
Explaining alternative CoT forms.
By viewing CoTs as programs, we can explain alternative CoT forms in a novel way. For example, the success of internalizing CoT steps Deng et al. (2024); Hao et al. (2024) can be viewed as compressing explicit step tokens into implicit latent tokens that cover all essential intermediate values, and the validity of inserting meaningless filler tokens Goyal et al. (2024); Pfau et al. (2024) comes from enabling LLMs to store intermediate results in the hidden states of the filler tokens. By reserving space for complex reasoning steps and compressing simple ones, we could design high-efficiency CoTs.
Generalization of CoT “programs”.
From the experiments in the previous sections, we can see that CoT tokens store intermediate values, and these values are subsequently used in the way variables function in computer programs. It has been proven that a Transformer equipped with a recurrent module is Turing-complete Pérez et al. (2021). However, there is also evidence that LLMs may struggle to generalize on compositional problems Dziri et al. (2023): models trained on easy problems fail on more complex ones. In real-world settings, the type and complexity of the desired program are unknown, so a general-purpose LLM needs to first determine which type of program to use, or in other words, to generate a meta-program first.
Identification of intermediate variable tokens.
It is not surprising that the CoT generated by LLMs is partially redundant and could be shortened. In Section 3.2, we find that preserving only the value tokens retains most of a language model's ability. While it is easy to judge whether a token stores an intermediate result in multiplication and DP problems, identifying variable tokens in general tasks is harder: Madaan and Yazdanbakhsh (2022) find that plain text helps LLMs elicit semantic commonsense knowledge, which may be infused into later CoT tokens. Developing an approach to identifying variable tokens would therefore benefit CoT compression.
Estimation of computational complexity between variable tokens.
Section 4.2 shows that LLMs fail when the computational complexity between variable tokens exceeds a certain limit. However, it is difficult to estimate the exact complexity limit for LLMs. It is possible to calculate the theoretical bound of ability for finite-precision Transformers Chen et al. (2024), but how LLMs process semantic information is still largely opaque, and unexpected features may appear Lindsey et al. (2025). Moreover, LLMs are not guaranteed to solve similar subproblems in the same way; they may take shortcuts (Section 4.1) that largely affect the computational complexity between variable tokens. We hope that the broader research community can help estimate the computational complexity between variable tokens in different types of questions.
6 Related Work
6.1 Chain-of-Thought Reasoning
Chain-of-Thoughts (CoT) Wei et al. (2022) is a commonly adopted technique in LLMs. Nowadays, CoT refers to a broad range of approaches that require LLMs to generate an intermediate reasoning process before reaching the final answer. Typical approaches include designing the prompt Wei et al. (2022); Khot et al. (2022); Zhou et al. (2022) and finetuning LLMs on existing chain-of-thoughts Yue et al. (2023); Yu et al. (2023). Recently, reinforcement learning has also revealed great potential in enabling LLMs to perform complex reasoning without extensive human annotations Havrilla et al. (2024); Wang et al. (2024); Shao et al. (2024); Guo et al. (2025). While the tokens in CoT can be classified into symbols, patterns, and text, all of which contribute to the final answer Madaan and Yazdanbakhsh (2022), it seems that LLMs can still perform well with a small number of CoT tokens Xu et al. (2025).
Aside from plain text, researchers have also explored alternative forms of CoT. Some works focus on search abilities, like tree-form thought traces Yao et al. (2023); Xie et al. (2023) and Monte-Carlo Tree Search (MCTS) algorithms Zhang et al. (2024); Guan et al. (2025). Another line of work attempts to reason in a latent space: Goyal et al. (2024) uses a pause token to help models process extra computation before reaching an answer, and Pfau et al. (2024) shows it is also possible to replace CoT with meaningless filler tokens. On top of this, Deng et al. (2024) tries to train models with gradually shortened CoT, and COCONUT Hao et al. (2024) proposes the continuous thought paradigm, where the last hidden state of a latent token is used as the next input embedding.
6.2 Theoretical Analysis on CoT
It has been noticed that LLMs face difficulty in compositional problems where combining multiple reasoning steps is strictly required, which may be an intrinsic drawback of the Transformer architecture Dziri et al. (2023). Feng et al. (2023) explains the phenomenon with circuit complexity theory, reaching the conclusion that it is impossible for a constant-depth log-precision transformer to solve certain math problems like linear equations; with the help of CoT, however, the model can solve these problems in polynomial complexity. Li et al. (2024) further extends this conclusion: constant-depth transformers using constant-bit precision can solve any problem solvable by boolean circuits, as long as they are equipped with a CoT whose number of steps exceeds the circuit size. Chen et al. (2024) analyzes the problem with a multi-party autoregressive communication model, and finds that it is exponentially harder for Transformer models to solve composition tasks that require more steps than the model has layers, and that CoT makes the problem exponentially easier.
In fact, Transformer models are powerful enough to represent finite-state automata Liu et al. (2022), and can even be Turing-complete Pérez et al. (2021), simulating computer programs when equipped with loop modules Giannou et al. (2023). We believe that these findings can also be extended to chain-of-thought reasoning.
7 Conclusion
In this paper, we empirically explore the role CoT tokens play in reasoning. By observing model performance on multi-digit multiplication and dynamic programming problems, we confirm that CoT is essential for solving these compositional problems. We further find that model ability is mostly preserved when only the tokens that store intermediate results are kept, and that these intermediate results can be stored in different forms, such as latent token embeddings.
To validate the causal connection between CoT tokens and model output, we randomly replace some values in the CoT, and find that both the subsequent reasoning process and the final result change in accordance with the intervention. In this respect, CoT tokens behave like computer program variables. However, on easy subproblems LLMs may learn shortcuts that are unfaithful to the generated reasoning process, and the intervention fails in these scenarios. We also train probing classifiers to recover variable values from hidden states at different layers, and find that there exists a computational complexity limit between CoT tokens: intermediate values can be compressed within a single latent CoT token, but the model drastically fails when the computational complexity exceeds this limit.
Our work conducts preliminary experiments on the function of CoT tokens, and open questions remain, such as generalization ability, variable identification, and complexity limit estimation, which we leave for future exploration.
Limitations
In this paper we empirically demonstrate that an important function of CoT tokens is to store intermediate values, and that these values function like program variables. However, we are currently not able to give a theoretical proof of this statement.
Another limitation of our work is that the experiments are conducted on two synthetic tasks with Qwen-2.5-1.5B, as it is difficult to identify and analyze intermediate results in real-world datasets like GSM8K and MATH. Future experiments on other problems and models would be beneficial.
References
- Chen et al. (2024) Lijie Chen, Binghui Peng, and Hongxun Wu. 2024. Theoretical limitations of multi-layer transformer. arXiv preprint arXiv:2412.02975.
- Deng et al. (2024) Yuntian Deng, Yejin Choi, and Stuart Shieber. 2024. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint arXiv:2405.14838.
- Dziri et al. (2023) Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, and 1 others. 2023. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36:70293–70332.
- Feng et al. (2023) Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. 2023. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798.
- Giannou et al. (2023) Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. 2023. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR.
- Goyal et al. (2024) Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. 2024. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations.
- Guan et al. (2025) Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. 2025. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Hao et al. (2024) Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- Havrilla et al. (2024) Alexander Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. 2024. Teaching large language models to reason with reinforcement learning. In AI for Math Workshop@ ICML 2024.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations.
- Li et al. (2024) Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. 2024. Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations.
- Lindsey et al. (2025) Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, and 8 others. 2025. On the biology of a large language model. Transformer Circuits Thread.
- Liu et al. (2022) Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. 2022. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations.
- Madaan and Yazdanbakhsh (2022) Aman Madaan and Amir Yazdanbakhsh. 2022. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686.
- Pérez et al. (2021) Jorge Pérez, Pablo Barceló, and Javier Marinkovic. 2021. Attention is turing-complete. Journal of Machine Learning Research, 22(75):1–35.
- Pfau et al. (2024) Jacob Pfau, William Merrill, and Samuel R Bowman. 2024. Let’s think dot by dot: Hidden computation in transformer language models. In First Conference on Language Modeling.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- Stolfo et al. (2023) Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. 2023. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7035–7052.
- Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2024. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Xie et al. (2023) Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. 2023. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36:41618–41650.
- Xu et al. (2025) Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. 2025. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600.
- Yang et al. (2024) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, YU Jincheng, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations.
- Yue et al. (2023) Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. In The Twelfth International Conference on Learning Representations.
- Zhang et al. (2024) Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, and 1 others. 2024. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. arXiv preprint arXiv:2410.02884.
- Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and 1 others. 2022. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations.
- Zhu et al. (2025) Fangwei Zhu, Damai Dai, and Zhifang Sui. 2025. Language models encode the value of numbers linearly. In Proceedings of the 31st International Conference on Computational Linguistics, pages 693–709.
Appendix A Details on Multiplication Task
Dataset construction
For each problem scale of $m \times n$, which multiplies an $m$-digit number $a$ with an $n$-digit number $b$, we generate 100,000 data entries by randomly sampling $a$ and $b$. When the scale is small (for example $1 \times 2$), we exhaustively generate all number pairs instead. The generated data entries are then divided into train and test splits with a ratio of 90%/10%.
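The sampling procedure above can be sketched as follows; the helper name `make_entry` and the entry format are illustrative assumptions, not the paper's actual code.

```python
import random

def make_entry(m: int, n: int) -> dict:
    """Sample one m-digit by n-digit multiplication problem.

    Hypothetical helper mirroring the dataset construction described
    above; the exact prompt/answer format is an assumption.
    """
    a = random.randint(10 ** (m - 1), 10 ** m - 1)  # random m-digit number
    b = random.randint(10 ** (n - 1), 10 ** n - 1)  # random n-digit number
    return {"prompt": f"{a}*{b}=", "answer": a * b}
```

For small scales, one would instead enumerate all $(a, b)$ pairs exhaustively.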
Prompt and CoT Formulation
We use simple multiplication expressions as prompts. Figure 10 shows an example prompt for querying the model to perform multiplication.
For convenience, we use <tool_call> as the start-of-CoT token <COT>, </tool_call> as the end-of-CoT token </COT>, and <|fim_middle|> as the latent token <LAT>, which already exist in the tokenizer vocabulary.
We formulate the reasoning process with the digit-wise multiplication algorithm, an example of which is demonstrated in Figure 11. In the compressed CoT setting, we remove all tokens that merely represent text semantics in the CoT, namely “Calculate”, “digit”, “carry”, “Result of”, “Add up partial results:” and “The final result is:”; an example is demonstrated in Figure 12.
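The digit-wise steps in the CoT (“x*y=p, digit d, carry c”) follow standard long multiplication; a minimal sketch that reproduces the digit/carry sequence (the function name is ours):

```python
def digitwise_steps(a: int, multiplier_digit: int):
    """Yield (product, digit, carry) for each digit of `a`,
    least-significant first, as in the "x*y=p, digit d, carry c"
    lines of the full CoT."""
    carry = 0
    for d in reversed(str(a)):  # process digits right-to-left
        prod = int(d) * multiplier_digit
        total = prod + carry
        digit, carry = total % 10, total // 10
        yield prod, digit, carry
```

For instance, `digitwise_steps(3773, 2)` reproduces the “3773*20” block of Figure 11: (6, 6, 0), (14, 4, 1), (14, 5, 1), (6, 7, 0).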
Prompt example
3773*6821=
Figure 10: Example prompt for the multi-digit multiplication task.
Full CoT example
3773*6821=<tool_call>Calculate 3773*1
3*1=3, digit 3, carry 0
7*1=7, digit 7, carry 0
7*1=7, digit 7, carry 0
3*1=3, digit 3, carry 0
Result of 3773*1=3773
Calculate 3773*20
3*2=6, digit 6, carry 0
7*2=14, digit 4, carry 1
7*2=14, digit 5, carry 1
3*2=6, digit 7, carry 0
Result of 3773*20=75460
Calculate 3773*800
3*8=24, digit 4, carry 2
7*8=56, digit 8, carry 5
7*8=56, digit 1, carry 6
3*8=24, digit 0, carry 3
Result of 3773*800=3018400
Calculate 3773*6000
3*6=18, digit 8, carry 1
7*6=42, digit 3, carry 4
7*6=42, digit 6, carry 4
3*6=18, digit 2, carry 2
Result of 3773*6000=22638000
Add up partial results: 3773+75460+3018400+22638000
3773+75460+3018400+22638000=79233+3018400+22638000
79233+3018400+22638000=3097633+22638000
3097633+22638000=25735633
The final result is: 3773*6821=25735633</tool_call> Result: 25735633
Figure 11: Example CoT for the multi-digit multiplication task.
Compressed CoT example
3773*6821=<tool_call>3773*1
3*1 3 0
7*1 7 0
7*1 7 0
3*1 3 0
3773*1=3773
3773*20
3*2 6 0
7*2 4 1
7*2 5 1
3*2 7 0
3773*20=75460
3773*800
3*8 4 2
7*8 8 5
7*8 1 6
3*8 0 3
3773*800=3018400
3773*6000
3*6 8 1
7*6 3 4
7*6 6 4
3*6 2 2
3773*6000=22638000
3773+75460+3018400+22638000
3773+75460+3018400+22638000=79233+3018400+22638000
79233+3018400+22638000=3097633+22638000
3097633+22638000=25735633
3773*6821=25735633</tool_call> Result: 25735633
Figure 12: Example CoT after compression for the multi-digit multiplication task.
Appendix B Details on Dynamic Programming Task
Dataset construction
Similar to the multiplication problems, we generate 100,000 data entries for each problem scale of $m \times n$ (whose input matrix has $m$ rows and $n$ columns), and divide them into train and test splits with a ratio of 90%/10%.
To control the value of intermediate states within a reasonable range, we ensure all values $x$ in the input matrix satisfy $1<x<100$. In other words, each input value has at most two digits.
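The intermediate values stored in the CoT form the standard max-path-sum dynamic programming table; a minimal reference implementation:

```python
def dp_table(grid):
    """Compute the DP table for the task: from the top-left corner to
    the bottom-right corner, moving only rightwards or downwards, each
    cell holds the best sum of a path ending there. These cell values
    are the intermediate results that the CoT stores."""
    m, n = len(grid), len(grid[0])
    dp = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            best_prev = max(
                dp[i - 1][j] if i > 0 else 0,  # come from above
                dp[i][j - 1] if j > 0 else 0,  # come from the left
            )
            dp[i][j] = best_prev + grid[i][j]
    return dp
```

On the input matrix of Figure 14 this produces exactly the table inside the `<tool_call>` span, ending at 498.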
Prompt formulation
We use a matrix whose shape is the same as the input matrix to store intermediate values. The special tokens <COT>, </COT> and <LAT> are the same as those used in the multiplication problems.
An example of the input prompt is shown in Figure 13, and an example of the full prompt is shown in Figure 14. Notice that we do not have a compressed version of CoT in dynamic programming tasks.
Prompt example
Find a path in the given table from the top-left corner to the bottom-right corner that maximizes the sum of the numbers on it. You can only move rightwards or downwards. Table:
85 93 45 79 49
28 12 37 57 76
3 22 37 55 68
26 2 57 7 100
87 11 12 67 89
Figure 13: Example Prompt for the dynamic programming task.
Full CoT example
Find a path in the given table from the top-left corner to the bottom-right corner that maximizes the sum of the numbers on it. You can only move rightwards or downwards. Table:
15 5 59 62 22
41 61 7 12 27
98 60 34 94 24
45 40 12 77 11
56 94 46 34 45
<tool_call>15 20 79 141 163
56 117 124 153 190
154 214 248 342 366
199 254 266 419 430
255 349 395 453 498</tool_call> Result: 498
Figure 14: Example CoT for the dynamic programming task.
Appendix C Main Experiment Settings
For all of our experiments, we use Qwen-2.5-1.5B Yang et al. (2024) from the huggingface model hub as the base model. On each task, we finetune the model on the training set and then evaluate the model on the test set of the corresponding prompt type. We use the full-parameter supervised finetuning setting and do not use parameter-efficient training techniques.
During training, we use the AdamW optimizer with a learning rate of $1e-5$. The weight decay is set to 0 and the gradient clipping threshold is set to 1. We train the model for 1 epoch with a training batch size of 4 by default. For small datasets like $1 \times 2$ digit multiplication, we increase this to 10 epochs to ensure convergence.
The models on multiplication problems are trained under BFloat16 precision, while models on DP problems are trained under Float32 precision.
During evaluation, we use a batch size of 1. We only check the correctness of the final result, and do not check the values in the CoT.
Appendix D Latent Experiment Settings
The hyperparameters in the latent experiments are the same as in the main experiments. For convenience, we use <|fim_middle|> as the latent token <LAT>.
In multiplication problems, the dimension of latent embeddings is set to 20 (10 for digit results and 10 for carry results). In dynamic programming problems, the dimension of latent embeddings is set to 50 to store values no larger than 100,000. The latent projection module $P_{in}$ and the latent output head $P_{out}$ are trained together with the backbone model at the same learning rate. We simply add the latent loss $\mathcal{L}_{lat}$ to the original LM head loss $\mathcal{L}_{lm}$ to obtain the final loss $\mathcal{L}=\mathcal{L}_{lat}+\mathcal{L}_{lm}$.
Figure 15 shows an example of latent CoT in multiplication problems, and Figure 16 shows an example of latent CoT in dynamic programming problems.
Latent CoT example
8493*8877=<tool_call>8493*7 <|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>|59451
8493*70 <|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>|594510
8493*800 <|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>|6794400
8493*8000 <|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>|67944000
59451+594510+6794400+67944000
59451+594510+6794400+67944000=653961+6794400+67944000
653961+6794400+67944000=7448361+67944000
7448361+67944000=75392361
8493*8877=75392361</tool_call> Result: 75392361
Figure 15: Example latent CoT for the multi-digit multiplication task.
Latent CoT example
Find a path in the given table from the top-left corner to the bottom-right corner that maximizes the sum of the numbers on it. You can only move rightwards or downwards. Table:
15 5 59 62 22
41 61 7 12 27
98 60 34 94 24
45 40 12 77 11
56 94 46 34 45
<tool_call><|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>
<|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>
<|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>
<|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|>
<|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|><|fim_middle|></tool_call> Result: 498
Figure 16: Example latent CoT for the dynamic programming task.
Appendix E Intervention Experiment Details
In the intervention experiments, we randomly substitute a numeric value in the CoT generated by trained models on the test set. The interventions are performed on the full CoT texts. The substituted number has the same digit length as the original but a different value. To prevent outlier values, when substituting numbers with 2 or more digits, we keep the first digit the same as in the original number.
We choose the number to substitute within the following range:
Multiplication
- $x$ or $y$ in “digit $x$, carry $y$” statements;
- A random number $x$ in “Add up partial results:” statements;
- The first partial result $x$ in “$a_{1}+...+a_{n}=x+...$” statements;
- The result $x$ in the “The final result is: $...=x$” statement.
Dynamic programming
- A random intermediate value in the CoT.
After the intervention, we truncate all tokens after the intervened value, and feed the partial CoT into the trained models to complete the CoT and obtain the final answer.
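The substitution rule can be sketched as follows (the function name is ours; the actual implementation may differ):

```python
import random

def substitute_number(s: str) -> str:
    """Replace the number string `s` with a different number of the same
    digit length. For numbers with 2+ digits, keep the first digit to
    avoid outlier values, per the intervention rule described above."""
    n = len(s)
    while True:
        if n == 1:
            cand = str(random.randint(0, 9))
        else:
            # keep the leading digit, redraw the rest
            cand = s[0] + "".join(str(random.randint(0, 9)) for _ in range(n - 1))
        if cand != s:  # ensure the value actually changes
            return cand
```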
The detailed breakdown of errors in multiplication problems is shown in Table 1 (1 entry with deformed CoT is excluded):
| Category | Count |
| --- | --- |
| Total | 9999 |
| Success | 7383 |
| Error | 2616 |
| Addition error | 767 |
| Reconstruct error | 496 |
| Shortcut error | 1291 |
| Copy error | 6 |
| Misc error | 56 |
Table 1: Intervention error breakdown in multiplication problems.
Appendix F Probing Experiment Details
In the probing experiments, we probe the latent CoT for simplicity. We first collect hidden states of the LLM at different layers, and then train the probe classifiers. The training set of hidden states is collected by running the trained model on the original training set, and likewise for the test set.
We use a learning rate of $1e-3$ and a gradient clipping threshold of $1$. We train the probe classifiers for 4 epochs with a training batch size of 32 and an evaluation batch size of 64.
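A probe of this kind is typically a linear softmax classifier over hidden states; a minimal numpy sketch whose defaults match the hyperparameters above (the probe architecture beyond a linear layer is an assumption):

```python
import numpy as np

def train_probe(H, y, num_classes, lr=1e-3, epochs=4, batch_size=32, seed=0):
    """Train a linear softmax probe mapping hidden states H (n, d) to
    integer labels y (n,) with mini-batch gradient descent. A sketch of
    the probing setup; the paper's exact probe may differ."""
    rng = np.random.default_rng(seed)
    d = H.shape[1]
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    for _ in range(epochs):
        order = rng.permutation(len(H))
        for start in range(0, len(H), batch_size):
            idx = order[start:start + batch_size]
            logits = H[idx] @ W + b
            logits -= logits.max(axis=1, keepdims=True)  # for stability
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)
            p[np.arange(len(idx)), y[idx]] -= 1.0  # dL/dlogits for cross-entropy
            W -= lr * H[idx].T @ p / len(idx)
            b -= lr * p.mean(axis=0)
    return W, b
```

Probe accuracy on held-out hidden states then indicates whether a layer linearly encodes the variable value.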