# Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
**Authors**: Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu
> Arizona State University
Corresponding author: {czhao93, ztan36, pingchua, daweili5, bjiang14, yancheng.wang, yingzhen.yang, huanliu}@asu.edu
Abstract
Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning. Our code is available at GitHub: https://github.com/ChengshuaiZhao0/DataAlchemy.
1 Introduction
Recent years have witnessed Large Language Models’ (LLMs) dominant role in various domains (Zhao et al., 2023; Li et al., 2025b; Zhao et al., 2025; Ting et al., 2025) through versatile prompting techniques (Wei et al., 2022; Yao et al., 2023; Kojima et al., 2022). Among these, Chain-of-Thought (CoT) prompting (Wei et al., 2022) has emerged as a prominent method for eliciting structured reasoning from LLMs (a.k.a., CoT reasoning). By appending a simple cue such as “Let’s think step by step,” LLMs decompose complex problems into intermediate steps, producing outputs that resemble human-like reasoning. It has been shown to be effective in tasks requiring logical inference (Xu et al., 2024), mathematical problem solving (Imani et al., 2023), and commonsense reasoning (Wei et al., 2022). The empirical successes of CoT reasoning lead to the perception that LLMs engage in deliberate inferential processes (Yu et al., 2023; Zhang et al., 2024a; Ling et al., 2023; Zhang et al., 2024c).
However, a closer examination reveals inconsistencies that challenge this optimistic view. Consider this straightforward question: “The day the US was established is in a leap year or a normal year?” When prompted with the CoT prefix, the modern LLM Gemini responded: “The United States was established in 1776. 1776 is divisible by 4, but it’s not a century year, so it’s a leap year. Therefore, the day the US was established was in a normal year.” This response exemplifies a concerning pattern: the model correctly recites the leap year rule and articulates intermediate reasoning steps, yet produces a logically inconsistent conclusion (i.e., asserting 1776 is both a leap year and a normal year). Such inconsistencies suggest that there is a distinction between human-like inference and CoT reasoning.
An expanding body of analyses reveals that LLMs tend to rely on surface-level semantics and clues rather than logical procedures (Bentham et al., 2024; Chen et al., 2025b; Lanham et al., 2023). LLMs construct superficial chains of logic based on learned token associations, often failing on tasks that deviate from commonsense heuristics or familiar templates (Tang et al., 2023). In the reasoning process, performance degrades sharply when irrelevant clauses are introduced, which indicates that models cannot grasp the underlying logic (Mirzadeh et al., 2024). This fragility becomes even more apparent when models are tested on more complex tasks, where they frequently produce incoherent solutions and fail to follow consistent reasoning paths (Shojaee et al., 2025). Collectively, these pioneering works deepen the skepticism surrounding the true nature of CoT reasoning.
In light of this line of research, we question CoT reasoning by proposing an alternative lens through data distribution and further investigating why and when it fails. We hypothesize that CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. As such, its effectiveness is inherently limited by the nature and extent of the distribution discrepancy between training data and the test queries. Guided by this data distribution lens, we dissect CoT reasoning via three dimensions: (1) task: to what extent CoT reasoning can handle tasks that involve transformations or previously unseen task structures; (2) length: how CoT reasoning generalizes to chains whose length differs from that of the training data; and (3) format: how sensitive CoT reasoning is to surface-level variations in query form. To evaluate each aspect, we introduce DataAlchemy, a controlled and isolated experimental environment that allows us to train LLMs from scratch and systematically probe them under various distribution shifts.
Our findings reveal that CoT reasoning works effectively when applied to in-distribution or near in-distribution data but becomes fragile and prone to failure even under moderate distribution shifts. In some cases, LLMs generate fluent yet logically inconsistent reasoning steps. The results suggest that what appears to be structured reasoning can be a mirage, emerging from memorized or interpolated patterns in the training data rather than logical inference. These insights carry important implications for both practitioners and researchers. For practitioners, our results highlight the risk of relying on CoT as a plug-and-play solution for reasoning tasks and caution against equating CoT-style output with human thinking. For researchers, the results underscore the ongoing challenge of achieving reasoning that is both faithful and generalizable, motivating the need to develop models that can move beyond surface-level pattern recognition to exhibit deeper inferential competence. Our contributions are summarized as follows:
- Novel perspective. We propose a data distribution lens for CoT reasoning, illuminating that its effectiveness stems from structured inductive biases learned from in-distribution training data. This framework provides a principled lens for understanding why and when CoT reasoning succeeds or fails.
- Controlled environment. We introduce DataAlchemy, an isolated experimental framework that enables training LLMs from scratch and systematically probing CoT reasoning. This controlled setting allows us to isolate and analyze the effects of distribution shifts on CoT reasoning without interference from complex patterns learned during large-scale pre-training.
- Empirical validation. We conduct systematic empirical validation across three critical dimensions: task, length, and format. Our experiments demonstrate that CoT reasoning exhibits sharp performance degradation under distribution shifts, revealing that seemingly coherent reasoning masks shallow pattern replication.
- Real-world implication. This work reframes the understanding of contemporary LLMs’ reasoning capabilities and emphasizes the risk of over-reliance on CoT reasoning as a universal problem-solving paradigm. It underscores the necessity for proper evaluation methods and the development of LLMs that possess authentic and generalizable reasoning capabilities.
2 Related Work
2.1 LLM Prompting and CoT
Chain-of-Thought (CoT) prompting revolutionized how we elicit reasoning from Large Language Models by decomposing complex problems into intermediate steps (Wei et al., 2022). By augmenting few-shot exemplars with reasoning chains, CoT showed substantial performance gains on various tasks (Xu et al., 2024; Imani et al., 2023; Wei et al., 2022). Building on this, several variants emerged. Zero-shot CoT triggers reasoning without exemplars using instructional prompts (Kojima et al., 2022), and self-consistency enhances performance via majority voting over sampled chains (Wang et al., 2023). To reduce manual effort, Auto-CoT generates CoT exemplars using the models themselves (Zhang et al., 2023). Beyond linear chains, Tree-of-Thought (ToT) frames CoT as a tree search over partial reasoning paths (Yao et al., 2023), enabling lookahead and backtracking. SymbCoT combines symbolic reasoning with CoT by converting problems into formal representations (Xu et al., 2024). Recent work increasingly integrates CoT into the LLM inference process, generating long-form CoTs (Jaech et al., 2024; Team, 2024; Guo et al., 2025; Team et al., 2025). This enables flexible strategies like mistake correction, step decomposition, reflection, and alternative reasoning paths (Yeo et al., 2025; Chen et al., 2025a). The success of prompting techniques and long-form CoTs has led many to view them as evidence of emergent, human-like reasoning in LLMs. In this work, we challenge that viewpoint by adopting a data-centric perspective and demonstrating that CoT behavior arises largely from pattern matching over training distributions.
2.2 Discussion on Illusion of LLM Reasoning
While Chain-of-Thought prompting has led to impressive gains on complex reasoning tasks, a growing body of work has started questioning the nature of these improvements. One major line of research highlights the fragility of CoT reasoning. Minor and semantically irrelevant perturbations such as distractor phrases or altered symbolic forms can cause significant performance drops in state-of-the-art models (Mirzadeh et al., 2024; Tang et al., 2023). Models often incorporate such irrelevant details into their reasoning, revealing a lack of sensitivity to salient information. Other studies show that models prioritize the surface form of reasoning over logical soundness; in some cases, longer but flawed reasoning paths yield better final answers than shorter, correct ones (Bentham et al., 2024). Similarly, performance does not scale with problem complexity as expected—models may overthink easy problems and give up on harder ones (Shojaee et al., 2025). Another critical concern is the faithfulness of the reasoning process. Intervention-based studies reveal that final answers often remain unchanged even when intermediate steps are falsified or omitted (Lanham et al., 2023), a phenomenon dubbed the illusion of transparency (Bentham et al., 2024; Chen et al., 2025b). Together, these findings suggest that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text. However, a systematic understanding of why and when CoT reasoning fails is still a mystery.
2.3 OOD Generalization of LLMs
Out-of-distribution (OOD) generalization, where test inputs differ from training data, remains a key challenge in machine learning, particularly for large language models (LLMs) (Yang et al., 2024, 2023; Budnikov et al., 2025; Zhang et al., 2024b). Recent studies show that LLMs prompted to learn novel functions often revert to similar functions encountered during pretraining (Wang et al., 2024; Garg et al., 2022). Likewise, LLM generalization frequently depends on mapping new problems onto familiar compositional structures (Song et al., 2025). CoT prompting improves OOD generalization (Wei et al., 2022), with early work demonstrating length generalization for multi-step problems beyond training distributions (Yao et al., 2025; Shen et al., 2025). However, this ability is not inherent to CoT and heavily depends on model architecture and training setups. For instance, strong generalization in arithmetic tasks was achieved only when algorithmic structures were encoded into positional encodings (Cho et al., 2024). Similarly, finer-grained CoT demonstrations during training boost OOD performance, highlighting the importance of data granularity (Wang et al., 2025a). Theoretical and empirical evidence shows that CoT generalizes well only when test inputs share latent structures with training data; otherwise, performance declines sharply (Wang et al., 2025b; Li et al., 2025a). Despite its promise, CoT still struggles with genuinely novel tasks or formats. In light of these findings, we propose rethinking CoT reasoning through a data distribution lens: decomposing CoT into task, length, and format generalization, and systematically investigating each in a controlled setting.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Task Generalization and Transformations
### Overview
The image presents a diagram illustrating different types of generalization and transformations applied to a basic set of elements. It shows how a sequence of letters (e.g., "APPLE") can be transformed using operations like ROT transformation and cyclic shift, and how these transformations are generalized across tasks, lengths, and formats. The diagram uses color-coding to distinguish between input, output, training, and testing data.
### Components/Axes
* **Basic atoms A:** A set of letters from A to Z.
* **Element l = 5:** An example element "APPLE" with a length of 5.
* **f1: ROT Transformation +13:** A transformation that rotates each letter by 13 positions in the alphabet.
* **f2: Cyclic Shift +1:** A transformation that cyclically shifts the positions of the letters within the sequence by 1.
* **Element:** Shows ID, Comp, and OOD (Out-of-Distribution) examples of the element "ABCD".
* **Task Generalization:** Generalization of tasks using ID, Comp, POOD, and OOD examples.
* **Transformation:** Shows the transformations applied to ID, Comp, POOD, and OOD.
* **Length Generalization:** Generalization based on text length.
* **Format Generalization:** Generalization based on format, including insertion, deletion, and modification.
* **Legend:**
* Red circle: Input
* Blue circle: Output
* Red rectangle: Training
* Dashed red rectangle: Testing
### Detailed Analysis
**1. Basic Atoms and Element Transformations:**
* The diagram starts with a set of basic atoms, the letters A through Z.
* An example element "APPLE" of length 5 is given.
* Two transformations are applied to "APPLE":
* **f1: ROT Transformation +13:** "APPLE" becomes "NCCYR".
* **f2: Cyclic Shift +1:** "APPLE" becomes "EAPPL".
**2. Element Examples:**
* **ID:** The element "ABCD" is shown.
* **Comp:** The element "ABCD" is shown.
* **OOD:** The element "ABCD" is shown.
**3. Task Generalization:**
* The diagram shows how tasks are generalized using ID, Comp, POOD, and OOD examples.
* The input data (red rectangles) is transformed into output data (blue rectangles) using a function *fcomp*.
* The transformations are:
* ID: f1 o f1 -> f1
* Comp: {f1 o f1 -> f2, f2 o f2 o f1} -> f2 o f2
* POOD: f1 o f1 -> f2
* OOD: f1 o f1 -> f2
**4. Length Generalization:**
* The diagram shows how the length of the text is generalized.
* Examples of text lengths are "ABCD", "ABC", and "ABCDA".
* The input data (red rectangles) is transformed into output data (blue rectangles) using a function *fs*.
* The reasoning step is f1 o f1 -> f1 and f1 o f1 -> f1 o f1 o f1.
**5. Format Generalization:**
* The diagram shows how the format of the text is generalized.
* Examples of format generalizations are insertion, deletion, and modification.
* The input data (red rectangles) is transformed into output data (blue rectangles) using a function *fs*.
* Insertion: "ABCD" -> "AB?CD"
* Deletion: "ABCD" -> "ACD"
* Modify: "ABC?" -> "ABC?"
### Key Observations
* The diagram uses color-coding to clearly distinguish between input, output, training, and testing data.
* The transformations are applied to the elements to generate new elements.
* The diagram shows how the transformations are generalized across tasks, lengths, and formats.
### Interpretation
The diagram illustrates a system for generalizing transformations on sequences of letters. It demonstrates how simple transformations (ROT, Cyclic Shift) can be applied and then generalized across different tasks, text lengths, and formats. The use of ID, Comp, POOD, and OOD examples suggests a focus on handling both in-distribution and out-of-distribution scenarios. The functions *fcomp* and *fs* represent the underlying mechanisms for these generalizations. The diagram highlights the importance of considering different aspects of generalization when designing systems that operate on sequential data.
</details>
Figure 1: Framework of DataAlchemy. It creates an isolated and controlled environment to train LLMs from scratch and probe the task, length, and format generalization.
3 The Data Distribution Lens
We propose a fundamental reframing to understand what CoT actually represents. We hypothesize that the underlying mechanism is better understood through the lens of data distribution: rather than executing explicit reasoning procedures, CoT operates as a pattern-matching process that interpolates and extrapolates from the statistical regularities present in its training distribution. Specifically, we posit that CoT’s success stems not from a model’s inherent reasoning capacity, but from its ability to generalize conditionally to out-of-distribution (OOD) test cases that are structurally similar to in-distribution exemplars.
To formalize this view, we model CoT prompting as a conditional generation process constrained by the distributional properties of the training data. Let $\mathcal{D}_{\text{train}}$ denote the training distribution over input-output pairs $(x,y)$, where $x$ represents a reasoning problem and $y$ denotes the solution sequence (including intermediate reasoning steps). The model learns an approximation $f_{\theta}(x)\approx y$ by minimizing empirical risk over samples drawn from $\mathcal{D}_{\text{train}}$.
Let the expected training risk be defined as:
$$
R_{\text{train}}(f_{\theta})=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{train}}}[\ell(f_{\theta}(x),y)], \tag{1}
$$
where $\ell$ is a task-specific loss function (e.g., cross-entropy, token-level accuracy). At inference time, given a test input $x_{\text{test}}$ sampled from a potentially different distribution $\mathcal{D}_{\text{test}}$, the model generates a response $y_{\text{test}}$ conditioned on patterns learned from $\mathcal{D}_{\text{train}}$. The corresponding expected test risk is:
$$
R_{\text{test}}(f_{\theta})=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{test}}}[\ell(f_{\theta}(x),y)]. \tag{2}
$$
The degree to which the model generalizes from $\mathcal{D}_{\text{train}}$ to $\mathcal{D}_{\text{test}}$ is governed by the distributional discrepancy between the two, which we quantify using divergence measures:
**Definition 3.1 (Distributional Discrepancy)**
*Given training distribution $\mathcal{D}_{\text{train}}$ and test distribution $\mathcal{D}_{\text{test}}$ , the distributional discrepancy is defined as:
$$
\Delta(\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}})=\mathcal{H}(\mathcal{D}_{\text{train}}\parallel\mathcal{D}_{\text{test}}) \tag{3}
$$
where $\mathcal{H}(\cdot\parallel\cdot)$ is a divergence measure (e.g., KL divergence, Wasserstein distance) that quantifies the statistical distance between the two distributions.*
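As an illustration of Definition 3.1, the sketch below computes the KL divergence between two discrete distributions. The choice of KL, the dictionary representation, and the toy distributions over transformation compositions are assumptions made for this example and are not taken from the DataAlchemy code.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as {outcome: probability}."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0.0:
            continue  # terms with zero mass under p contribute nothing
        q_x = q.get(outcome, 0.0)
        if q_x == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical distributions over transformation compositions.
train = {"f1,f1": 1.0}
same = {"f1,f1": 1.0}
mixed = {"f1,f1": 0.5, "f2,f2": 0.5}
disjoint = {"f2,f2": 1.0}

print(kl_divergence(train, same))      # 0.0
print(kl_divergence(train, mixed))     # log(2), about 0.693
print(kl_divergence(train, disjoint))  # inf
```

Note that KL divergence is infinite when the supports are disjoint, so a bounded measure may be more convenient when comparing fully out-of-distribution test sets.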
**Theorem 3.1 (CoT Generalization Bound)**
*Let $f_{\theta}$ denote a model trained on $\mathcal{D}_{\text{train}}$ with expected training risk $R_{\text{train}}(f_{\theta})$ . For a test distribution $\mathcal{D}_{\text{test}}$ , the expected test risk $R_{\text{test}}(f_{\theta})$ is bounded by:
$$
R_{\text{test}}(f_{\theta})\leq R_{\text{train}}(f_{\theta})+\Lambda\cdot\Delta(\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}})+\mathcal{O}\left(\sqrt{\frac{\log(1/\delta)}{n}}\right) \tag{4}
$$
where $\Lambda>0$ is a Lipschitz constant that depends on the model architecture and task complexity, $n$ is the training sample size, and the bound holds with probability at least $1-\delta$, where $\delta$ is the failure probability.*
The proof is provided in Appendix A.1.
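To make the structure of the bound in Theorem 3.1 concrete, the following sketch evaluates its right-hand side for given inputs. The function name, the constant `c` standing in for the factor hidden by the $\mathcal{O}(\cdot)$ term, and all numeric values are assumptions chosen purely for illustration.

```python
import math

def cot_risk_bound(train_risk, lipschitz, discrepancy, delta, n, c=1.0):
    """Right-hand side of the bound: R_train + Lambda * Delta + c * sqrt(log(1/delta) / n).

    `c` is a placeholder for the constant hidden in the O(.) term (an assumption).
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return train_risk + lipschitz * discrepancy + c * math.sqrt(math.log(1.0 / delta) / n)

# Hypothetical values: the bound grows linearly with the discrepancy term.
for discrepancy in (0.0, 0.5, 1.0):
    print(discrepancy, cot_risk_bound(0.01, 1.0, discrepancy, delta=0.05, n=456_976))
```

With a large training set the concentration term is small, so in this sketch the discrepancy term dominates the gap between training and test risk.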
Building on this data distribution perspective, we identify three critical dimensions along which distributional shifts can occur, each revealing different aspects of CoT’s pattern-matching nature: ➊ Task generalization examines how well CoT transfers across different types of reasoning tasks. Novel tasks may have unique elements and underlying logical structure, which introduces distributional shifts that challenge the model’s ability to apply learned reasoning patterns. ➋ Length generalization investigates CoT’s robustness to reasoning chains of varying lengths. Since training data typically contains reasoning sequences within a certain length range, test cases requiring substantially longer or shorter reasoning chains represent a form of distributional shift along the sequence length dimension. This length discrepancy could stem from the number of reasoning steps or from the text-dependent solution space. ➌ Format generalization explores how sensitive CoT is to variations in prompt formulation and structure. For various reasons (e.g., sophisticated training data or the diverse backgrounds of users), it is challenging for LLM practitioners to design a golden prompt that elicits the knowledge suitable for the current case. Detailed definitions and implementations of the three dimensions are given in subsequent sections.
Each dimension provides a unique lens for understanding the boundaries of CoT’s effectiveness and the mechanisms underlying its apparent reasoning capabilities. By systematically varying these dimensions in controlled experimental settings, we can empirically validate our hypothesis that CoT performance degrades predictably as distributional discrepancy increases, thereby revealing its fundamental nature as a pattern-matching rather than reasoning system.
4 DataAlchemy: An Isolated and Controlled Environment
To systematically investigate the influence of distributional shifts on CoT reasoning capabilities, we introduce DataAlchemy, a synthetic dataset framework designed for controlled experimentation. This environment enables us to train language models from scratch under precisely defined conditions, allowing for rigorous analysis of CoT behavior across different OOD scenarios. The overview is shown in Figure 1.
4.1 Basic Atoms and Elements
Let $\mathcal{A}=\{\texttt{A},\texttt{B},\texttt{C},...,\texttt{Z}\}$ denote the alphabet of 26 basic atoms. An element $\mathbf{e}$ is defined as an ordered sequence of atoms:
$$
\mathbf{e}=(a_{0},a_{1},\ldots,a_{l-1})\quad\text{where}\quad a_{i}\in\mathcal{A},\quad l\in\mathbb{Z}^{+} \tag{5}
$$
This design gives direct control over the size of the dataset $\mathcal{D}$ (i.e., $|\mathcal{D}|=|\mathcal{A}|^{l}$) through the element length $l$, so that language models of various capacities can be trained. It also allows us to systematically probe text length generalization capabilities.
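As a quick check of the dataset-size formula, the sketch below enumerates elements lazily and computes $|\mathcal{A}|^{l}$. The helper names are hypothetical and are not taken from the DataAlchemy repository.

```python
from itertools import islice, product
import string

ATOMS = string.ascii_uppercase  # the 26 basic atoms A..Z

def elements(length):
    """Lazily yield every element (a tuple of atoms) of the given length."""
    return product(ATOMS, repeat=length)

def dataset_size(length):
    """Number of distinct elements of the given length: |A| ** l."""
    return len(ATOMS) ** length

first_three = ["".join(e) for e in islice(elements(4), 3)]
print(first_three)      # ['AAAA', 'AAAB', 'AAAC']
print(dataset_size(4))  # 456976
```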
4.2 Transformations
A transformation is an operation on elements, $F:\mathbf{e}\to\hat{\mathbf{e}}$. In this work, we consider two fundamental transformations: the ROT Transformation and the Cyclic Position Shift. To formally define the transformations, we introduce a bijective mapping $\phi:\mathcal{A}\to\mathbb{Z}_{26}$, where $\mathbb{Z}_{26}=\{0,1,\ldots,25\}$, such that $\phi(c)$ maps a character to its zero-based alphabetical index.
**Definition 4.1 (ROT Transformation)**
*Given an element $\mathbf{e}=(a_{0},\ldots,a_{l-1})$ and a rotation parameter $n\in\mathbb{Z}$, the ROT Transformation $f_{\text{rot}}$ produces an element $\hat{\mathbf{e}}=(\hat{a}_{0},\ldots,\hat{a}_{l-1})$. Each atom $\hat{a}_{i}$ is:
$$
\hat{a}_{i}=\phi^{-1}((\phi(a_{i})+n)\pmod{26}) \tag{6}
$$
This operation cyclically shifts each atom $n$ positions forward in alphabetical order. For example, if $\mathbf{e}=(\texttt{A},\texttt{P},\texttt{P},\texttt{L},\texttt{E})$ and $n=13$ , then $f_{\text{rot}}(\mathbf{e},13)=(\texttt{N},\texttt{C},\texttt{C},\texttt{Y},\texttt{R})$ .*
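Definition 4.1 translates almost directly into code. The sketch below is a minimal illustration; the names `phi`, `phi_inv`, and `f_rot` mirror the notation of the definition but are otherwise our own.

```python
import string

ATOMS = string.ascii_uppercase  # A..Z

def phi(atom):
    """Map an atom to its zero-based alphabetical index."""
    return ATOMS.index(atom)

def phi_inv(index):
    """Map an index in 0..25 back to its atom."""
    return ATOMS[index]

def f_rot(element, n):
    """ROT transformation: shift every atom n positions forward, wrapping mod 26."""
    return tuple(phi_inv((phi(a) + n) % 26) for a in element)

rotated = f_rot(tuple("APPLE"), 13)
print("".join(rotated))  # NCCYR
```

Because Python's `%` returns a non-negative result for a positive modulus, negative rotation parameters wrap around correctly as well.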
**Definition 4.2 (Cyclic Position Shift)**
*Given an element $\mathbf{e}=(a_{0},\ldots,a_{l-1})$ and a shift parameter $n\in\mathbb{Z}$, the Cyclic Position Shift $f_{\text{pos}}$ produces an element $\hat{\mathbf{e}}=(\hat{a}_{0},\ldots,\hat{a}_{l-1})$. Each atom $\hat{a}_{i}$ is defined by a cyclic shift of indices:
$$
\hat{a}_{i}=a_{(i-n)\pmod{l}} \tag{7}
$$
This transformation cyclically shifts the positions of the atoms within the sequence by $n$ positions to the right. For instance, if $\mathbf{e}=(\texttt{A},\texttt{P},\texttt{P},\texttt{L},\texttt{E})$ and $n=1$ , then $f_{\text{pos}}(\mathbf{e},1)=(\texttt{E},\texttt{A},\texttt{P},\texttt{P},\texttt{L})$ .*
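The index arithmetic in Definition 4.2 can be checked with a few lines of code. This is a minimal sketch in which the function name `f_pos` follows the notation of the definition.

```python
def f_pos(element, n):
    """Cyclic position shift: output position i takes the atom from position (i - n) mod l."""
    length = len(element)
    return tuple(element[(i - n) % length] for i in range(length))

shifted = f_pos(tuple("APPLE"), 1)
print("".join(shifted))  # EAPPL
```

Shifting by the element length returns the original element, and a negative parameter shifts to the left.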
**Definition 4.3 (Generalized Compositional Transformation)**
*To model multi-step reasoning, we define a compositional transformation as the successive application of a sequence of operations. Let $S=(f_{1},f_{2},...,f_{k})$ be a sequence of operations, where each $f_{i}$ is one of the fundamental transformations $\mathcal{F}=\{f_{\text{rot}},f_{\text{pos}}\}$ with its respective parameters. The compositional transformation $f_{\text{S}}$ for the sequence $S$ is the function composition:
$$
f_{\text{S}}=f_{k}\circ f_{k-1}\circ\cdots\circ f_{1} \tag{8}
$$
The resulting element $\hat{\mathbf{e}}$ is obtained by applying the operations sequentially to an initial element $\mathbf{e}$ :
$$
\hat{\mathbf{e}}=f_{k}(f_{k-1}(\ldots(f_{1}(\mathbf{e}))\ldots)) \tag{9}
$$*
This design enables the construction of arbitrarily complex transformation chains by varying the type, parameters, order, and length of operations within the sequence. At sampling time, we naturally obtain the CoT reasoning steps by recording the intermediate results:
$$
\underbrace{f_{\text{S}}(\mathbf{e}):}_{\text{Query}}\quad\underbrace{\mathbf{e}\xrightarrow{f_{1}}\mathbf{e}^{(1)}\xrightarrow{f_{2}}\mathbf{e}^{(2)}\cdots\xrightarrow{f_{k-1}}\mathbf{e}^{(k-1)}\xrightarrow{f_{k}}}_{\text{CoT reasoning steps}}\underbrace{\boxed{\hat{\mathbf{e}}}}_{\text{Answer}} \tag{10}
$$
$$
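The decomposition into query, reasoning steps, and answer can be sketched as follows. The helper `cot_chain` and the arrow-separated output are hypothetical; the actual serialization used for training is described in the appendix and may differ.

```python
import string

ATOMS = string.ascii_uppercase

def f_rot(element, n):
    """ROT transformation: shift each atom n positions forward in the alphabet."""
    return tuple(ATOMS[(ATOMS.index(a) + n) % 26] for a in element)

def f_pos(element, n):
    """Cyclic position shift: move atoms n positions to the right."""
    length = len(element)
    return tuple(element[(i - n) % length] for i in range(length))

def f1(element):
    return f_rot(element, 13)

def f2(element):
    return f_pos(element, 1)

def cot_chain(element, operations):
    """Apply operations in order (f_1 first) and record every intermediate element."""
    states = [tuple(element)]
    for op in operations:
        states.append(op(states[-1]))
    return states

chain = cot_chain("ABCD", [f1, f2])
steps = ["".join(state) for state in chain]
print(" -> ".join(steps))  # ABCD -> NOPQ -> QNOP
```

The last entry of the chain is the answer; the entries before it are the intermediate reasoning steps.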
4.3 Environment Setting
Through systematic manipulation of elements and transformations, DataAlchemy offers a flexible and controllable framework for training LLMs from scratch, facilitating rigorous investigation of diverse OOD scenarios. Unless otherwise specified, we employ a decoder-only language model, GPT-2 (Radford et al., 2019), with a configuration of 4 layers, 32 hidden dimensions, and 4 attention heads. We utilize a Byte-Pair Encoding (BPE) tokenizer. Both the LLMs and the tokenizer follow the general modern LLM pipeline. At inference time, we set the temperature to 1e-5. For rigor, we also study LLMs with various parameters, architectures, and temperatures in Section 8. Details of the implementation are provided in Appendix B. Each element consists of 4 basic atoms, which produces 456,976 samples for each dataset with varied transformations and token amounts. We initialize the two transformations as $f_{1}=f_{\text{rot}}(\mathbf{e},13)$ and $f_{2}=f_{\text{pos}}(\mathbf{e},1)$. We consider the exact match rate, Levenshtein distance (i.e., edit distance) (Yujian and Bo, 2007), and BLEU score (Papineni et al., 2002) as metrics and evaluate the produced reasoning steps, answer, and full chain. Examples of the datasets and evaluations are shown in Appendix C.
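For readers who want to reproduce the simpler metrics, the sketch below implements the exact match rate and the Levenshtein distance in pure Python. Normalizing the edit distance by the longer string is an assumption on our part, suggested by the fact that the reported values lie between 0 and 1; BLEU is omitted here because it is more involved.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length (normalization is an assumption)."""
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0

def exact_match_rate(preds, refs):
    """Fraction of predictions that equal their reference exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

preds = ["QNOP", "NOPQ"]
refs = ["QNOP", "QNOP"]
print(exact_match_rate(preds, refs))             # 0.5
print(normalized_edit_distance("NOPQ", "QNOP"))  # 0.5
```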
5 Task Generalization
Task generalization represents a fundamental challenge for CoT reasoning, as it directly tests a model’s ability to apply learned concepts and reasoning patterns to unseen scenarios. In our controlled experiments, both transformations and elements can be novel. Accordingly, we decompose task generalization into two primary dimensions: element generalization and transformation generalization.
Task Generalization Complexity. Guided by the data distribution lens, we first introduce a measure for generalization difficulty:
**Proposition 5.1 (Task Generalization Complexity)**
*For a reasoning chain $f_{S}$ operating on elements $\mathbf{e}=(a_{0},...,a_{l-1})$ , define:
$$
\operatorname{TGC}(C)=\alpha\sum_{i=1}^{m}\mathbb{I}\left[a_{i}\notin\mathcal{E}^{i}_{\text{train}}\right]+\beta\sum_{j=1}^{n}\mathbb{I}\left[f_{j}\notin\mathcal{F}_{\text{train}}\right]+\gamma\,\mathbb{I}\left[\left(f_{1},f_{2},\ldots,f_{k}\right)\notin\mathcal{P}_{\text{train}}\right]+C_{T} \tag{11}
$$
as a measurement of task discrepancy $\Delta_{\text{task}}$, where $\alpha,\beta,\gamma$ are weighting parameters for different novelty types and $C_{T}$ is a task-specific constant. $\mathcal{E}^{i}_{\text{train}}$, $\mathcal{F}_{\text{train}}$, and $\mathcal{P}_{\text{train}}$ denote the bit-wise element set, the relation set, and the set of relation orders used during training.*
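A small sketch may help in reading Proposition 5.1. It counts novel atoms per position, novel transformations, and a novel ordering, and then combines them with the weights. Reading the bit-wise element set as one set of seen atoms per position, the unit weights, and the toy training sets are all assumptions made for illustration.

```python
def tgc(element, chain, train_atoms_by_pos, train_transforms, train_orders,
        alpha=1.0, beta=1.0, gamma=1.0, c_t=0.0):
    """Task generalization complexity under unit weights (weights are assumptions)."""
    # Atoms not seen at their position during training.
    novel_atoms = sum(a not in train_atoms_by_pos[i] for i, a in enumerate(element))
    # Transformations in the chain that never appeared in training.
    novel_transforms = sum(f not in train_transforms for f in chain)
    # Whether the ordered chain as a whole is new.
    novel_order = tuple(chain) not in train_orders
    return alpha * novel_atoms + beta * novel_transforms + gamma * novel_order + c_t

atoms = [set("ABCD")] * 4  # hypothetical: atoms A..D were seen at every position

in_dist = tgc("ABCD", ("f1", "f1"), atoms, {"f1"}, {("f1", "f1")})
comp = tgc("ABCD", ("f2", "f2"), atoms, {"f1", "f2"},
           {("f1", "f1"), ("f1", "f2"), ("f2", "f1")})
pood = tgc("ABCD", ("f1", "f2"), atoms, {"f1"}, {("f1", "f1")})
ood = tgc("ABCD", ("f2", "f2"), atoms, {"f1"}, {("f1", "f1")})
print(in_dist, comp, pood, ood)  # 0.0 1.0 2.0 3.0
```

Under these assumptions the score increases as more of the test chain is unfamiliar, and novel atoms add to it in the same way.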
We establish a critical threshold beyond which CoT reasoning fails exponentially:
**Theorem 5.1 (Task Generalization Failure Threshold)**
*There exists a threshold $\tau$ such that when $\text{TGC}(C)>\tau$ , the probability of correct CoT reasoning drops exponentially:
$$
P(\text{correct}|C)\leq e^{-\delta(\text{TGC}(C)-\tau)} \tag{12}
$$*
The proof is provided in Appendix A.2.
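The bound in Theorem 5.1 can be evaluated numerically to see how quickly it decays. The values of $\tau$ and $\delta$ below are hypothetical, and returning 1.0 at or below the threshold reflects only that the theorem makes no claim there.

```python
import math

def success_upper_bound(tgc_value, tau, delta):
    """Upper bound on P(correct | C): exp(-delta * (TGC - tau)) once TGC exceeds tau."""
    if tgc_value <= tau:
        return 1.0  # the theorem says nothing here, so use the trivial bound
    return math.exp(-delta * (tgc_value - tau))

# Hypothetical threshold and decay rate.
for value in (1.0, 2.0, 3.0, 4.0):
    print(value, round(success_upper_bound(value, tau=1.0, delta=2.0), 4))
```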
5.1 Transformation Generalization
Transformation generalization evaluates the ability of CoT reasoning to effectively transfer when models encounter novel transformations during testing, which is an especially prevalent scenario in real-world applications.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Scatter Plot: BLEU Score vs. Edit Distance with Distribution Shift
### Overview
The image is a scatter plot showing the relationship between BLEU Score and Edit Distance. The color of each data point represents the Distribution Shift, ranging from blue (low) to red (high). The plot shows a cluster of points with high BLEU scores and low edit distances, and another cluster with lower BLEU scores and higher edit distances.
### Components/Axes
* **X-axis:** Edit Distance, ranging from 0.00 to 0.30 in increments of 0.05.
* **Y-axis:** BLEU Score, ranging from 0.2 to 1.0 in increments of 0.1.
* **Color Legend (Right Side):** Distribution Shift, ranging from 0.2 (blue) to 0.8 (red). The color gradient indicates the magnitude of the distribution shift.
### Detailed Analysis
* **Cluster 1 (Top-Left):** A cluster of approximately 4 blue data points is located in the top-left corner, indicating high BLEU scores (approximately 1.0) and low Edit Distances (approximately 0.0). These points have a low Distribution Shift.
* **Cluster 2 (Center-Right):** A larger cluster of points, ranging in color from purple to red, is located in the center-right of the plot. These points have BLEU scores ranging from approximately 0.2 to 0.7, and Edit Distances ranging from approximately 0.1 to 0.3. The Distribution Shift varies from moderate to high.
* **Individual Points:** There are a few scattered points between the two main clusters. For example, there is a purple point with an Edit Distance of approximately 0.1 and a BLEU score of approximately 0.7.
### Key Observations
* There is a clear separation between the high-BLEU/low-Edit Distance cluster and the lower-BLEU/higher-Edit Distance cluster.
* Higher Edit Distances are generally associated with lower BLEU scores and higher Distribution Shifts.
* Lower Edit Distances are generally associated with higher BLEU scores and lower Distribution Shifts.
### Interpretation
The scatter plot suggests an inverse relationship between Edit Distance and BLEU Score. As the Edit Distance increases, the BLEU Score tends to decrease. The Distribution Shift appears to be correlated with both Edit Distance and BLEU Score, with higher shifts generally occurring when Edit Distance is high and BLEU Score is low. The cluster of points with high BLEU scores and low Edit Distances likely represents a scenario where the generated text is very similar to the reference text, while the other cluster represents a scenario where the generated text is significantly different. The color gradient adds another dimension, suggesting that the distribution shift is also a factor in the performance of the system.
</details>
Figure 2: Performance of CoT reasoning on transformation generalization. Efficacy of CoT reasoning declines as the degree of distributional discrepancy increases.
Experimental Setup. To systematically evaluate the impact of transformations, we conduct experiments by varying transformations between training and testing sets while keeping other factors constant (e.g., elements, length, and format). Guided by the intuition formalized in Proposition 5.1, we define four incremental levels of distribution shift in transformations as shown in Figure 1: (i) In-Distribution (ID): The transformations in the test set are identical to those observed during training, e.g., $f_{1}\circ f_{1}→ f_{1}\circ f_{1}$ . (ii) Composition (CMP): Test samples comprise novel compositions of previously encountered transformations, though each individual transformation remains familiar, e.g., ${f_{1}\circ f_{1},f_{1}\circ f_{2},f_{2}\circ f_{1}}→ f_{2}\circ f_{2}$ . (iii) Partial Out-of-Distribution (POOD): Test data include compositions involving at least one novel transformation not seen during training, e.g., $f_{1}\circ f_{1}→ f_{1}\circ f_{2}$ . (iv) Out-of-Distribution (OOD): The test set contains entirely novel transformation types that are unseen in training, e.g., $f_{1}\circ f_{1}→ f_{2}\circ f_{2}$ .
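The four shift levels above can be sketched as a small classifier. This is only an illustrative sketch: the function name `shift_level` and the representation of a composition as a tuple of transformation names are our assumptions, not part of the released code.

```python
from typing import FrozenSet, Tuple

# Hypothetical encoding: ("f1", "f2") stands for the composition f1 ∘ f2.
Composition = Tuple[str, ...]


def shift_level(train: FrozenSet[Composition], test: Composition) -> str:
    """Label a test composition as ID, CMP, POOD, or OOD relative to training."""
    if test in train:
        return "ID"  # the exact composition was observed during training
    seen = {f for comp in train for f in comp}
    novel = [f for f in test if f not in seen]
    if not novel:
        return "CMP"  # every transformation is familiar, the combination is new
    if len(novel) < len(test):
        return "POOD"  # at least one novel and at least one familiar transformation
    return "OOD"  # every transformation in the test composition is novel


if __name__ == "__main__":
    train = frozenset({("f1", "f1")})
    for test in [("f1", "f1"), ("f1", "f2"), ("f2", "f2")]:
        print(test, shift_level(train, test))
```

The examples in the setup map onto this sketch directly: training on $f_{1}\circ f_{1}$ alone yields ID, POOD, and OOD for the three test compositions in the loop, while training on the three mixed compositions and testing on $f_{2}\circ f_{2}$ yields CMP.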
Table 1: Full chain evaluation under different scenarios for transformation generalization.
| Transformation (train → test) | Scenario | Exact Match | Edit Distance | BLEU Score |
| --- | --- | --- | --- | --- |
| $f_{1}\circ f_{1}→ f_{1}\circ f_{1}$ | ID | 100.00% | 0 | 1 |
| $\{f_{2}\circ f_{2},f_{1}\circ f_{2},f_{2}\circ f_{1}\}→ f_{1}\circ f_{1}$ | CMP | 0.01% | 0.1326 | 0.6867 |
| $f_{1}\circ f_{2}→ f_{1}\circ f_{1}$ | POOD | 0.00% | 0.1671 | 0.4538 |
| $f_{2}\circ f_{2}→ f_{1}\circ f_{1}$ | OOD | 0.00% | 0.2997 | 0.2947 |
Findings. Figure 2 illustrates the performance of the full chain under different distribution discrepancies, computed by the task generalization complexity (normalized between 0 and 1) in Definition 5.1. In general, the effectiveness of CoT reasoning decreases as the distribution discrepancy increases. For the instance shown in Table 1, when tested on data with transformation $f_{1}\circ f_{1}$ , moving from ID to CMP, POOD, and OOD reduces the exact match from 100% to 0.01%, 0%, and 0%, while the edit distance rises from 0 to 0.13, 0.17, and 0.30. Apart from ID, LLMs cannot produce a correct full chain in most cases, although they occasionally produce correct CoT reasoning under some CMP and POOD conditions by accident. As shown in Table 2, from $f_{1}\circ f_{2}$ to $f_{2}\circ f_{2}$ , the LLMs can correctly answer 0.1% of questions. A close examination reveals that this is a coincidence: for example, the query element A, N, A, N happens to produce the same result under the two operations, as detailed in Appendix D.1. When we break the full chain into reasoning steps and answers, we observe a strong inconsistency between the two. For example, under the composition generalization setting, the reasoning steps are entirely correct on test data distributions $f_{1}\circ f_{1}$ and $f_{2}\circ f_{2}$ , but the answers are wrong. Probing these inconsistent cases in Appendix D.1, we find that when a novel transformation (say $f_{1}\circ f_{1}$ ) is present, LLMs try to generalize the reasoning paths based on the most similar ones (i.e., $f_{1}\circ f_{2}$ ) seen during training, which leads to correct reasoning paths yet incorrect answers, echoing the example in the introduction.
Similarly, generalization from $f_{1}\circ f_{2}$ to $f_{2}\circ f_{1}$ or vice versa allows LLMs to produce correct answers despite unfaithful reasoning paths; the correct answers are attributable to the commutative property between the two orthogonal transformations. Collectively, the above results indicate that CoT reasoning fails to generalize to novel transformations, and not even to novel compositions of familiar transformations. Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training.
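To make the coincidence on the query A, N, A, N and the commutativity argument concrete, the sketch below uses two stand-in transformations. We assume here, purely for illustration, that $f_{1}$ is a ROT-13 letter substitution and $f_{2}$ is a cyclic left shift by one position; the actual definitions are given in the earlier sections and may differ.

```python
from typing import List


def f1(seq: List[str]) -> List[str]:
    """Assumed stand-in for f1: ROT-13 applied to each uppercase letter."""
    return [chr((ord(c) - ord("A") + 13) % 26 + ord("A")) for c in seq]


def f2(seq: List[str]) -> List[str]:
    """Assumed stand-in for f2: cyclic left shift of the sequence by one position."""
    return seq[1:] + seq[:1]


if __name__ == "__main__":
    query = list("ANAN")
    # Two different compositions collapse to the same output on this query.
    print(f1(f2(query)), f2(f2(query)))

    generic = list("ABCD")
    # On a generic query the two compositions disagree.
    print(f1(f2(generic)), f2(f2(generic)))
    # An element-wise substitution commutes with a positional shift.
    print(f1(f2(generic)) == f2(f1(generic)))
```

Under these assumptions, a model that has only learned $f_{1}\circ f_{2}$ would still score an exact match on A, N, A, N when asked for $f_{2}\circ f_{2}$, and the answers for $f_{1}\circ f_{2}$ and $f_{2}\circ f_{1}$ coincide regardless of the order in which the reasoning steps are written.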
Table 2: Evaluation of the different components of CoT reasoning on transformation generalization. CoT reasoning shows inconsistency between the reasoning steps and the answers.
| Transformation (train → test) | Exact Match (Reasoning) | Exact Match (Answer) | Exact Match (Full Chain) | Edit Distance (Reasoning) | Edit Distance (Answer) | Edit Distance (Full Chain) |
| --- | --- | --- | --- | --- | --- | --- |
| $\{f_{1}\circ f_{1},f_{1}\circ f_{2},f_{2}\circ f_{1}\}→ f_{2}\circ f_{2}$ | 100.00% | 0.01% | 0.01% | 0.000 | 0.481 | 0.133 |
| $\{f_{1}\circ f_{2},f_{2}\circ f_{1},f_{2}\circ f_{2}\}→ f_{1}\circ f_{1}$ | 100.00% | 0.01% | 0.01% | 0.000 | 0.481 | 0.133 |
| $f_{1}\circ f_{2}→ f_{2}\circ f_{1}$ | 0.00% | 100.00% | 0.00% | 0.373 | 0.000 | 0.167 |
| $f_{2}\circ f_{1}→ f_{1}\circ f_{2}$ | 0.00% | 100.00% | 0.00% | 0.373 | 0.000 | 0.167 |
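The component-wise evaluation in Table 2 can be sketched as follows. This is a minimal illustration rather than the evaluation code used in our experiments: the function names are hypothetical, the edit distance is a plain token-level Levenshtein distance, and normalizing by the longer sequence is an assumption.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,             # delete x
                           cur[j - 1] + 1,          # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def score_chain(pred_steps: List[str], pred_answer: List[str],
                ref_steps: List[str], ref_answer: List[str]) -> Dict[str, Dict[str, float]]:
    """Score reasoning steps, answer, and full chain separately."""
    parts = {
        "reasoning": (pred_steps, ref_steps),
        "answer": (pred_answer, ref_answer),
        "full_chain": (pred_steps + pred_answer, ref_steps + ref_answer),
    }
    return {
        name: {"exact_match": float(p == r),
               "edit_distance": normalized_edit_distance(p, r)}
        for name, (p, r) in parts.items()
    }


if __name__ == "__main__":
    # Correct reasoning steps paired with a wrong answer, as in the CMP rows.
    print(score_chain(["N", "O"], ["X", "Y"], ["N", "O"], ["A", "B"]))
```

Scoring the components separately is what exposes the pattern in the table: a chain can have perfect reasoning steps and still fail the full-chain exact match because the answer is wrong, or the reverse.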
Experiment settings. To further probe when CoT reasoning can generalize to unseen transformations, we conduct supervised fine-tuning (SFT) on a small portion $\lambda$ of unseen data. In this way, we can decrease the distribution discrepancy between the training and test sets, which might help LLMs to generalize to test queries.
Findings. As shown in Figure 3, a very small portion ( $\lambda=1.5\times 10^{-4}$ ) of data is generally enough for the model to generalize quickly to unseen transformations. The smaller the discrepancy between the training and testing data, the faster the model generalizes. This indicates that once similar patterns appear in the training data, LLMs can generalize to the test dataset.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Exact Match vs. SFT Data Ratio
### Overview
The image is a line chart comparing the "Exact Match (%)" against the "SFT Data Ratio (×10⁻⁴)" for four different categories: ID, CMP, POOD, and OOD. The chart illustrates how the exact match percentage changes as the SFT data ratio increases for each category.
### Components/Axes
* **Y-axis (Vertical):** "Exact Match (%)", ranging from 0 to 100 in increments of 20.
* **X-axis (Horizontal):** "SFT Data Ratio (×10⁻⁴)", ranging from 0 to 6 in increments of 1.
* **Legend (Right):**
* *ID*: Light blue dotted line with diamond markers.
* *CMP*: Light purple solid line with square markers.
* *POOD*: Light red dashed line with circle markers.
* *OOD*: Light teal dash-dotted line with triangle markers.
### Detailed Analysis
* **ID (Light Blue, Dotted, Diamond Markers):** The "ID" line remains consistently at 100% across all SFT Data Ratio values.
* (0, 100)
* (1, 100)
* (2, 100)
* (3, 100)
* (4, 100)
* (5, 100)
* (6, 100)
* **CMP (Light Purple, Solid, Square Markers):** The "CMP" line increases sharply from approximately 0% to approximately 80% between SFT Data Ratio values of 0 and 1. It then continues to increase, reaching approximately 95% at a ratio of 2, and plateaus near 100% after a ratio of 4.
* (0, 0)
* (1, 80)
* (2, 95)
* (3, 98)
* (4, 99)
* (5, 100)
* (6, 100)
* **POOD (Light Red, Dashed, Circle Markers):** The "POOD" line increases from approximately 0% to approximately 50% between SFT Data Ratio values of 0 and 1. It continues to increase, reaching approximately 90% at a ratio of 3, and plateaus near 100% after a ratio of 5.
* (0, 0)
* (1, 50)
* (2, 75)
* (3, 90)
* (4, 95)
* (5, 98)
* (6, 100)
* **OOD (Light Teal, Dash-Dotted, Triangle Markers):** The "OOD" line increases gradually from approximately 0% to approximately 50% between SFT Data Ratio values of 0 and 2. It continues to increase, reaching approximately 90% at a ratio of 5, and plateaus near 90% after a ratio of 5.
* (0, 0)
* (1, 10)
* (2, 45)
* (3, 70)
* (4, 85)
* (5, 90)
* (6, 90)
### Key Observations
* The "ID" category consistently achieves a 100% exact match, regardless of the SFT Data Ratio.
* The "CMP" category shows the most rapid improvement in exact match percentage with increasing SFT Data Ratio.
* The "OOD" category shows the slowest improvement in exact match percentage with increasing SFT Data Ratio.
* All categories except "ID" show a positive correlation between SFT Data Ratio and Exact Match (%).
### Interpretation
The chart suggests that increasing the SFT Data Ratio generally improves the exact match percentage for the "CMP", "POOD", and "OOD" categories. The "ID" category, however, maintains a perfect match regardless of the SFT Data Ratio, implying it may be inherently easier to match or is already optimized. The "CMP" category benefits the most from increased SFT data, while "OOD" benefits the least. This could indicate differences in the complexity or characteristics of the data within each category. The data suggests that "CMP" and "POOD" benefit from increased SFT data ratio, but eventually plateau, suggesting diminishing returns. "OOD" continues to improve with increased SFT data ratio, but at a slower rate.
</details>
Figure 3: Performance on unseen transformation using SFT in various levels of distribution shift. Introducing a small amount of unseen data helps CoT reasoning to generalize across different scenarios.
5.2 Element Generalization
Element generalization is another critical factor to consider when LLMs try to generalize to new tasks.
Experiment settings. As with transformation generalization, we fix other factors and consider three progressive distribution shifts for elements: ID, CMP, and OOD, as shown in Figure 1. In the CMP setting, we test whether CoT reasoning generalizes to novel combinations when all the basic atoms of the elements have been seen, e.g., $(\texttt{A},\texttt{B},\texttt{C},\texttt{D})→(\texttt{B},\texttt{C},\texttt{D},\texttt{A})$ . CMP can be further subdivided according to the atom order in the combination, which is measured by the edit distance $n$ . In the OOD setting, the atoms that constitute the elements are entirely unseen during training.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Heatmaps: BLEU Score and Exact Match vs. Transformation
### Overview
The image presents two heatmaps, one displaying BLEU scores and the other displaying exact match percentages, for different scenarios (ID, CMP, OOD) under various transformations (f1, f2, f1·f1, f1·f2, f2·f1, f2·f2). The heatmaps use a color gradient from blue to red to represent the values, with blue indicating lower values and red indicating higher values.
### Components/Axes
* **Top Heatmap:**
* **Y-axis (Scenario):** ID, CMP, OOD
* **X-axis (Transformation):** f1, f2, f1·f1, f1·f2, f2·f1, f2·f2
* **Color Scale:** Represents BLEU Score, ranging from 0.0 (blue) to 1.0 (red).
* **Title:** BLEU Score (located on the right side of the heatmap)
* **Bottom Heatmap:**
* **Y-axis (Scenario):** ID, CMP, OOD
* **X-axis (Transformation):** f1, f2, f1·f1, f1·f2, f2·f1, f2·f2
* **Color Scale:** Represents Exact Match (%), ranging from 0% (blue) to 100% (red).
* **Title:** Exact Match (%) (located on the right side of the heatmap)
* **Shared X-axis Label:** Transformation (located below the bottom heatmap)
### Detailed Analysis
**Top Heatmap (BLEU Score):**
* **ID Scenario:** The BLEU score is 1.00 for all transformations (f1, f2, f1·f1, f1·f2, f2·f1, f2·f2).
* **CMP Scenario:** The BLEU scores vary across transformations:
* f1: 0.71
* f2: 0.62
* f1·f1: 0.65
* f1·f2: 0.68
* f2·f1: 0.32
* f2·f2: 0.16
* **OOD Scenario:** The BLEU scores are generally lower than CMP:
* f1: 0.00
* f2: 0.00
* f1·f1: 0.46
* f1·f2: 0.35
* f2·f1: 0.40
* f2·f2: 0.35
**Bottom Heatmap (Exact Match %):**
* **ID Scenario:** The exact match percentage is 100% for all transformations (f1, f2, f1·f1, f1·f2, f2·f1, f2·f2).
* **CMP Scenario:** The exact match percentage is 0% for all transformations (f1, f2, f1·f1, f1·f2, f2·f1, f2·f2).
* **OOD Scenario:** The exact match percentage is 0% for all transformations (f1, f2, f1·f1, f1·f2, f2·f1, f2·f2).
### Key Observations
* The ID scenario consistently shows perfect BLEU scores and exact matches across all transformations.
* The CMP scenario has varying BLEU scores depending on the transformation, but always 0% exact match.
* The OOD scenario generally has the lowest BLEU scores and 0% exact match.
* Transformations involving f2 (f2·f1, f2·f2) tend to have lower BLEU scores in the CMP scenario compared to transformations involving f1.
### Interpretation
The heatmaps illustrate the performance of a system under different scenarios and transformations. The ID scenario represents in-distribution data, where the system performs perfectly. The CMP scenario represents a compositional split, where the system's performance varies depending on the specific transformation applied. The OOD scenario represents out-of-distribution data, where the system struggles significantly.
The BLEU score measures the similarity between the generated output and the reference output, while the exact match percentage measures the proportion of generated outputs that are identical to the reference outputs. The fact that the ID scenario has both high BLEU scores and high exact match percentages indicates that the system is able to generate accurate and precise outputs for in-distribution data. The lower BLEU scores and zero exact match percentages for the CMP and OOD scenarios suggest that the system is not able to generalize well to unseen data or compositional variations.
The difference in BLEU scores between different transformations in the CMP scenario suggests that some transformations are more challenging for the system than others. For example, transformations involving f2 may introduce more noise or ambiguity, leading to lower BLEU scores. The zero exact match percentages for CMP and OOD scenarios indicate that the system rarely produces the exact expected output, even when the BLEU score is non-zero, suggesting that while the generated output may be semantically similar, it is not identical to the reference.
</details>
Figure 4: Element generalization results on various scenarios and relations.
Findings. As with transformation generalization, performance degrades sharply under distribution shift, consistently across all transformations, as shown in Figure 4. From ID to CMP and OOD, the exact match drops from 1.0 to 0 in all cases. Most strikingly, the BLEU score is 0 when transferred to the $f_{1}$ and $f_{2}$ transformations. A failure case in Appendix D.1 shows that the models fail to produce any response when novel elements are present. We further explore when CoT reasoning can generalize to novel elements by conducting SFT. The results are summarized in Figure 5. We evaluate the exact match of the full chain under several CMP scenarios defined by the edit distance $n$ . The result is similar to SFT on transformations: performance increases rapidly when similar examples (a small $n$ ) are present in the training data. Interestingly, the exact match rate for CoT reasoning aligns with the lower bound of performance when $n=3$ , which might suggest that the generalization of CoT reasoning to novel elements is very limited, even with SFT on the downstream task. We further analyze the exact match of the reasoning, answer, and token during training for $n=3$ , as summarized in Figure 5b, and find a mismatch in accuracy between the answer and the reasoning step during the training process, which might partly explain why CoT reasoning is inconsistent in some cases.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: Exact Match vs. SFT Data Ratio for Different 'n' Values
### Overview
The image is a line chart showing the relationship between "SFT Data Ratio" (x-axis) and "Exact Match (%)" (y-axis) for four different values of 'n' (n=1, n=2, n=3, n=4). The chart illustrates how the exact match percentage changes as the SFT data ratio increases, for each 'n' value.
### Components/Axes
* **X-axis:** "SFT Data Ratio" ranges from 0.0 to 1.0 in increments of 0.1.
* **Y-axis:** "Exact Match (%)" ranges from 0 to 100 in increments of 20.
* **Legend:** Located on the right side of the chart, it identifies the line colors and markers corresponding to each 'n' value:
* **Green (Triangle marker):** n=4
* **Salmon (Circle marker):** n=3
* **Blue-Purple (Square marker):** n=2
* **Light Blue (Diamond marker):** n=1
### Detailed Analysis
* **n=1 (Light Blue, Diamond marker, Dotted Line):** The line remains constant at 100% across all SFT Data Ratio values.
* (0.0, 100%), (0.1, 100%), (0.2, 100%), (0.3, 100%), (0.4, 100%), (0.5, 100%), (0.6, 100%), (0.7, 100%), (0.8, 100%), (0.9, 100%), (1.0, 100%)
* **n=2 (Blue-Purple, Square marker, Solid Line):** The line increases sharply from 0% to 100% between SFT Data Ratio values of 0.0 and 0.7.
* (0.0, 0%), (0.1, ~10%), (0.2, ~30%), (0.3, ~75%), (0.4, ~90%), (0.5, ~98%), (0.6, ~100%), (0.7, ~100%), (0.8, ~100%), (0.9, ~100%), (1.0, ~100%)
* **n=3 (Salmon, Circle marker, Dashed Line):** The line increases gradually from 0% to 100% between SFT Data Ratio values of 0.0 and 0.8.
* (0.0, 0%), (0.1, ~10%), (0.2, ~20%), (0.3, ~40%), (0.4, ~60%), (0.5, ~75%), (0.6, ~90%), (0.7, ~95%), (0.8, ~100%), (0.9, ~100%), (1.0, ~100%)
* **n=4 (Green, Triangle marker, Dash-Dot Line):** The line increases gradually from 0% to approximately 80% between SFT Data Ratio values of 0.0 and 0.7, then continues to increase to 100%.
* (0.0, 0%), (0.1, ~10%), (0.2, ~20%), (0.3, ~30%), (0.4, ~40%), (0.5, ~50%), (0.6, ~65%), (0.7, ~80%), (0.8, ~90%), (0.9, ~95%), (1.0, ~100%)
### Key Observations
* When n=1, the exact match is always 100%, regardless of the SFT Data Ratio.
* As 'n' increases from 2 to 4, the SFT Data Ratio required to achieve a high exact match percentage also increases.
* The 'n=2' line shows the steepest increase in exact match percentage as the SFT Data Ratio increases.
### Interpretation
The chart demonstrates the impact of the 'n' parameter on the exact match percentage, given different SFT Data Ratios. A lower 'n' value (n=1) guarantees a 100% exact match, irrespective of the SFT Data Ratio. However, as 'n' increases, a higher SFT Data Ratio is needed to achieve a similar level of exact match. This suggests that 'n' influences the sensitivity of the exact match to the amount of SFT data used. The sharp increase for n=2 indicates a critical threshold in SFT Data Ratio for achieving high accuracy, while the gradual increase for n=3 and n=4 suggests a more proportional relationship.
</details>
(a) Performance on unseen element via SFT in various CMP scenarios.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Exact Match vs. SFT Data Ratio
### Overview
The image is a bar chart comparing the "Exact Match (%)" for "Reasoning Step", "Answer", and "Full Chain" against varying "SFT Data Ratio" values. The chart shows how the exact match percentage changes as the SFT data ratio increases.
### Components/Axes
* **Y-axis:** "Exact Match (%)", with a scale from 0.0 to 1.0 in increments of 0.2.
* **X-axis:** "SFT Data Ratio", with values 0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, and 1.0.
* **Legend (Top-Left):**
* Reasoning Step (Blue with diagonal hatching)
* Answer (Red with diagonal hatching)
* Full Chain (Light Blue with diagonal hatching)
### Detailed Analysis
Here's a breakdown of the data for each category:
* **Reasoning Step (Blue):**
* Trend: Generally increases with the SFT Data Ratio.
* Values:
* 0.0: ~0.0
* 0.1: ~0.04
* 0.2: ~0.19
* 0.3: ~0.45
* 0.5: ~0.73
* 0.7: ~0.90
* 0.8: ~0.97
* 1.0: ~0.99
* **Answer (Red):**
* Trend: Generally increases with the SFT Data Ratio.
* Values:
* 0.0: ~0.0
* 0.1: ~0.06
* 0.2: ~0.19
* 0.3: ~0.40
* 0.5: ~0.68
* 0.7: ~0.87
* 0.8: ~0.95
* 1.0: ~0.98
* **Full Chain (Light Blue):**
* Trend: Generally increases with the SFT Data Ratio.
* Values:
* 0.0: ~0.0
* 0.1: ~0.03
* 0.2: ~0.20
* 0.3: ~0.41
* 0.5: ~0.70
* 0.7: ~0.88
* 0.8: ~0.96
* 1.0: ~0.99
### Key Observations
* All three categories ("Reasoning Step", "Answer", and "Full Chain") show a positive correlation between "SFT Data Ratio" and "Exact Match (%)".
* The "Reasoning Step" category consistently has a slightly higher "Exact Match (%)" than the "Answer" and "Full Chain" categories for most "SFT Data Ratio" values.
* The "Exact Match (%)" values for all three categories converge and approach 1.0 as the "SFT Data Ratio" increases to 0.8 and 1.0.
* At lower "SFT Data Ratio" values (0.0 and 0.1), the "Exact Match (%)" is very low for all categories.
### Interpretation
The chart suggests that increasing the "SFT Data Ratio" significantly improves the "Exact Match (%)" for all three categories: "Reasoning Step", "Answer", and "Full Chain". This indicates that the model's performance, measured by exact match, is highly dependent on the amount of SFT (Supervised Fine-Tuning) data used. The "Reasoning Step" category performing slightly better than "Answer" and "Full Chain" might indicate that the model benefits more from fine-tuning on reasoning steps compared to the final answer or the full chain of reasoning. The convergence of all three categories at higher "SFT Data Ratio" values suggests that with sufficient fine-tuning data, the model can achieve near-perfect exact match performance regardless of whether it's evaluated on reasoning steps, answers, or the full chain.
</details>
(b) Evaluation of CoT reasoning in SFT.
Figure 5: SFT performances for element generalization. SFT helps to generalize to novel elements.
6 Length Generalization
Length generalization examines how CoT reasoning degrades when models encounter test cases that differ in length from their training distribution. The difference in length can arise in the text space or in the reasoning space of the problem. Therefore, we decompose length generalization into two complementary aspects: text length generalization and reasoning step generalization. Guided by intuition, we first propose a measure of the length discrepancy.
Length Extrapolation Bound. We characterize how the error grows under length extrapolation with a Gaussian-shaped degradation:
**Proposition 6.1 (Length Extrapolation Gaussian Degradation)**
*For a model trained on chain-of-thought sequences of fixed length $L_{\text{train}}$ , the generalization error at test length $L$ follows a Gaussian degradation profile:
$$
\mathcal{E}(L)=\mathcal{E}_{0}+\left(1-\mathcal{E}_{0}\right)\cdot\left(1-\exp\left(-\frac{\left(L-L_{\text{train}}\right)^{2}}{2\sigma^{2}}\right)\right) \tag{13}
$$
where $\mathcal{E}_{0}$ is the in-distribution error at $L=L_{\text{train}}$ , $\sigma$ is the length generalization width parameter, and $L$ is the test sequence length.*
The proof is provided in Appendix A.3.
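To build intuition for Eq. (13), the following sketch evaluates the bound numerically. The function name and the example values of $\mathcal{E}_{0}$ and $\sigma$ are illustrative assumptions, not fitted quantities from our experiments.

```python
import math


def length_generalization_error(test_len: float, train_len: float,
                                e0: float, sigma: float) -> float:
    """Evaluate Eq. (13): error as a function of the gap to the training length."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    gap = test_len - train_len
    return e0 + (1.0 - e0) * (1.0 - math.exp(-(gap ** 2) / (2.0 * sigma ** 2)))


if __name__ == "__main__":
    # Illustrative values only: trained at length 4, zero in-distribution error.
    for test_len in range(2, 7):
        err = length_generalization_error(test_len, train_len=4, e0=0.0, sigma=1.0)
        print(test_len, round(err, 4))
```

The expression equals $\mathcal{E}_{0}$ at $L=L_{\text{train}}$, is symmetric in the sign of the length gap, and approaches 1 as the gap grows; a smaller $\sigma$ makes the degradation sharper.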
6.1 Text Length Generalization
Text length generalization evaluates how CoT performance varies when the input text length (i.e., the element length $l$ ) differs from training examples. Considering the way LLMs process long text, this aspect is crucial because real-world problems often involve varying degrees of complexity that manifest as differences in problem statement length, context size, or information density.
Experiment settings. We pre-train LLMs on a dataset that contains only text of length $l=4$ while fixing other factors, and evaluate the performance on a variety of lengths. We consider three different padding strategies during pre-training: (i) None: no padding is used. (ii) Padding: each sequence is padded to the maximum length of the context window. (iii) Group: the text is grouped and truncated into segments of a maximum length.
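The Padding and Group strategies can be sketched on token lists as follows. This is a simplified illustration under stated assumptions: the `<pad>` token, the function names, and the choice to drop the incomplete final segment when grouping are ours and may differ from the released implementation.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token


def pad_to_window(seqs: List[List[str]], window: int) -> List[List[str]]:
    """Padding: extend each sequence with PAD up to the context window.

    Sequences already at or beyond the window length are left unchanged.
    """
    return [s + [PAD] * (window - len(s)) for s in seqs]


def group_texts(seqs: List[List[str]], window: int) -> List[List[str]]:
    """Group: concatenate all sequences and cut them into full-length segments.

    The trailing remainder shorter than the window is dropped (an assumption).
    """
    flat = [tok for s in seqs for tok in s]
    usable = len(flat) - len(flat) % window
    return [flat[i:i + window] for i in range(0, usable, window)]


if __name__ == "__main__":
    data = [["a", "b"], ["c", "d", "e"]]
    print(pad_to_window(data, window=4))
    print(group_texts(data, window=4))
```

The difference relevant to length generalization is what fills the context window: with Padding the remaining positions hold uninformative pad tokens, whereas with Group they hold real text from neighboring samples.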
Table 3: Evaluation for text length generalization. Each metric is reported in three columns, one per padding strategy.
| Text Length | Exact Match | | | Edit Distance | | | BLEU Score | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 0.00% | 0.00% | 0.00% | 0.3772 | 0.4969 | 0.5000 | 0.4214 | 0.1186 | 0.0000 |
| 3 | 0.00% | 0.00% | 0.00% | 0.2221 | 0.3203 | 0.2540 | 0.5471 | 0.1519 | 0.0000 |
| 4 | 100.00% | 100.00% | 100.00% | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
| 5 | 0.00% | 0.00% | 0.00% | 0.1818 | 0.2667 | 0.2000 | 0.6220 | 0.1958 | 0.2688 |
| 6 | 0.00% | 0.00% | 0.00% | 0.3294 | 0.4816 | 0.3337 | 0.4763 | 0.1174 | 0.2077 |
Findings. As illustrated in Table 3, CoT reasoning fails to generalize directly to test cases of other lengths, even though those lengths present only a mild distribution shift. Further, performance declines as the length discrepancy increases, as shown in Figure 6. For instance, from data with $l=4$ to those with $l=3$ or $l=5$ , the BLEU score decreases from 1 to 0.55 and 0.62, respectively. Examples in Appendix D.1 indicate that LLMs attempt to produce CoT reasoning with the same length as the training data by adding or removing tokens in the reasoning chains. Moreover, we consider different padding strategies to decrease the divergence between the training data and test cases. We find that padding to the maximum length does not contribute to length generalization. However, performance improves when we replace the padding with text via the group strategy, which indicates its effectiveness.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar Chart: BLEU Score and Edit Distance vs. Text Length
### Overview
The image is a bar chart comparing BLEU scores and Edit Distances for different text lengths (2 to 6) under three conditions: "None", "Group", and "Padding". The chart uses a dual y-axis, with the left axis representing BLEU score (ranging from 0.0 to 1.0) and the right axis representing Edit Distance (ranging from 0.00 to 1.00). The x-axis represents Text Length.
### Components/Axes
* **X-axis:** Text Length (2, 3, 4, 5, 6)
* **Left Y-axis:** BLEU Score (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
* **Right Y-axis:** Edit Distance (0.00, 0.20, 0.40, 0.60, 0.80, 1.00)
* **Legend:** Located in the lower-left corner.
* None (light purple with diagonal lines)
* Group (light red with diagonal lines)
* Padding (light blue with diagonal lines)
### Detailed Analysis
Here's a breakdown of the data for each text length and condition:
* **Text Length 2:**
* None: BLEU Score ~0.42, Edit Distance ~0.58
* Group: BLEU Score ~0.50, Edit Distance ~0.50
* Padding: BLEU Score ~0.40, Edit Distance ~0.60
* **Text Length 3:**
* None: BLEU Score ~0.54, Edit Distance ~0.46
* Group: BLEU Score ~0.70, Edit Distance ~0.30
* Padding: BLEU Score ~0.74, Edit Distance ~0.26
* **Text Length 4:**
* None: BLEU Score ~0.98, Edit Distance ~0.02
* Group: BLEU Score ~0.98, Edit Distance ~0.02
* Padding: BLEU Score ~0.98, Edit Distance ~0.02
* **Text Length 5:**
* None: BLEU Score ~0.62, Edit Distance ~0.38
* Group: BLEU Score ~0.62, Edit Distance ~0.38
* Padding: BLEU Score ~0.90, Edit Distance ~0.10
* **Text Length 6:**
* None: BLEU Score ~0.48, Edit Distance ~0.52
* Group: BLEU Score ~0.50, Edit Distance ~0.50
* Padding: BLEU Score ~0.80, Edit Distance ~0.20
### Key Observations
* For text length 4, all three conditions ("None", "Group", "Padding") achieve near-perfect BLEU scores (close to 1.0) and minimal Edit Distances (close to 0.0).
* "Padding" generally results in higher BLEU scores and lower Edit Distances compared to "None" and "Group", especially for text lengths 5 and 6.
* The BLEU scores for "None" and "Group" conditions are relatively low for text lengths 2, 5, and 6, indicating poorer performance.
### Interpretation
The chart suggests that the "Padding" condition generally improves the quality of the generated text, as indicated by higher BLEU scores and lower Edit Distances. The performance is particularly notable for text length 4, where all conditions perform exceptionally well. The lower scores for "None" and "Group" at certain text lengths suggest that these conditions may not be as effective in maintaining text quality. The relationship between text length and performance varies depending on the condition, indicating that the optimal strategy may depend on the specific text length being considered.
</details>
Figure 6: Performance of text length generalization across various padding strategies. Group strategies contribute to length generalization.
6.2 Reasoning Step Generalization
Reasoning step generalization investigates whether models can extrapolate to reasoning chains that require a different number of steps $k$ from those observed during training, which is a common setting in multi-step reasoning tasks.
Experiment settings. Similar to text length generalization, we first pre-train the LLM with reasoning step $k=2$ , and evaluate on data with reasoning step $k=1$ or $k=3$ .
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Exact Match vs. Data Percentage for k=1 and k=2
### Overview
The image is a line chart comparing the "Exact Match (%)" against "Data Percentage" for two different parameters, k=1 and k=2. The chart shows how the exact match percentage changes as the data percentage increases for each parameter.
### Components/Axes
* **Y-axis (Vertical):** "Exact Match (%)". Scale ranges from 0 to 80, with implicit ticks at 20, 40, and 60.
* **X-axis (Horizontal):** "Data Percentage". Scale ranges from 0.0 to 1.0, with ticks at intervals of 0.1 (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0).
* **Legend (Bottom-Right):**
* Blue line with circle markers: "k=1"
* Coral/Salmon dashed line with square markers: "k=2"
### Detailed Analysis
* **k=1 (Blue Line):**
* Trend: The line slopes upward, indicating an increasing exact match percentage as the data percentage increases.
* Data Points:
* At 0.0 Data Percentage, Exact Match is approximately 0%.
* At 0.1 Data Percentage, Exact Match is approximately 32%.
* At 0.2 Data Percentage, Exact Match is approximately 54%.
* At 0.3 Data Percentage, Exact Match is approximately 65%.
* At 0.4 Data Percentage, Exact Match is approximately 70%.
* At 0.5 Data Percentage, Exact Match is approximately 78%.
* At 0.6 Data Percentage, Exact Match is approximately 86%.
* At 0.7 Data Percentage, Exact Match is approximately 90%.
* At 0.8 Data Percentage, Exact Match is approximately 92%.
* At 0.9 Data Percentage, Exact Match is approximately 92%.
* At 1.0 Data Percentage, Exact Match is approximately 92%.
* **k=2 (Coral/Salmon Dashed Line):**
* Trend: The line slopes downward after an initial increase, indicating a decreasing exact match percentage as the data percentage increases beyond a certain point.
* Data Points:
* At 0.0 Data Percentage, Exact Match is approximately 92%.
* At 0.1 Data Percentage, Exact Match is approximately 93%.
* At 0.2 Data Percentage, Exact Match is approximately 90%.
* At 0.3 Data Percentage, Exact Match is approximately 85%.
* At 0.4 Data Percentage, Exact Match is approximately 70%.
* At 0.5 Data Percentage, Exact Match is approximately 38%.
* At 0.6 Data Percentage, Exact Match is approximately 12%.
* At 0.7 Data Percentage, Exact Match is approximately 5%.
* At 0.8 Data Percentage, Exact Match is approximately 3%.
* At 0.9 Data Percentage, Exact Match is approximately 2%.
* At 1.0 Data Percentage, Exact Match is approximately 1%.
### Key Observations
* The "k=1" line shows a positive correlation between data percentage and exact match percentage.
* The "k=2" line shows a negative correlation between data percentage and exact match percentage after an initial increase.
* The two lines intersect at approximately 0.4 data percentage, where both have an exact match percentage of around 70%.
* For low data percentages (0.0 to 0.3), k=2 has a significantly higher exact match percentage than k=1.
* For high data percentages (0.6 to 1.0), k=1 has a significantly higher exact match percentage than k=2.
### Interpretation
The chart suggests that the optimal value of 'k' depends on the available data percentage. When the data percentage is low, k=2 performs better, providing a higher exact match percentage. However, as the data percentage increases, k=1 becomes the better choice, as its exact match percentage increases while k=2's decreases. This could indicate that k=2 is more effective with limited data, while k=1 benefits from a larger dataset. The intersection point at 0.4 data percentage represents a threshold where the performance of the two parameters switches.
</details>
(a) Reasoning step. From k=2 to k=1
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Exact Match vs. Data Percentage for k=2 and k=3
### Overview
The image is a line chart comparing the "Exact Match (%)" against "Data Percentage" for two different values of 'k' (k=2 and k=3). The x-axis represents the Data Percentage, ranging from 0.0 to 1.0 in increments of 0.1. The y-axis represents the Exact Match in percentage, ranging from 0 to 80 in increments of 10. Two lines are plotted: one for k=3 (blue with circle markers) and one for k=2 (coral/orange with square markers).
### Components/Axes
* **X-axis:** Data Percentage, ranging from 0.0 to 1.0 with increments of 0.1.
* **Y-axis:** Exact Match (%), ranging from 0 to 80 with increments of 10.
* **Legend:** Located in the bottom-right corner.
* k=3: Blue line with circle markers.
* k=2: Coral/Orange dashed line with square markers.
### Detailed Analysis
* **k=3 (Blue line with circle markers):**
* Trend: Initially near 0, the line gradually increases with data percentage, showing a positive correlation after a data percentage of 0.5.
* Data Points:
* 0.0 Data Percentage: ~0% Exact Match
* 0.1 Data Percentage: ~0% Exact Match
* 0.2 Data Percentage: ~2% Exact Match
* 0.3 Data Percentage: ~6% Exact Match
* 0.4 Data Percentage: ~10% Exact Match
* 0.5 Data Percentage: ~10% Exact Match
* 0.6 Data Percentage: ~12% Exact Match
* 0.7 Data Percentage: ~28% Exact Match
* 0.8 Data Percentage: ~48% Exact Match
* 0.9 Data Percentage: ~62% Exact Match
* 1.0 Data Percentage: ~72% Exact Match
* **k=2 (Coral/Orange dashed line with square markers):**
* Trend: Initially high, the line decreases with data percentage until around 0.6, rises slightly at 0.7 to 0.8, then drops to near 0 at 0.9 and 1.0.
* Data Points:
* 0.0 Data Percentage: ~80% Exact Match
* 0.1 Data Percentage: ~54% Exact Match
* 0.2 Data Percentage: ~35% Exact Match
* 0.3 Data Percentage: ~22% Exact Match
* 0.4 Data Percentage: ~17% Exact Match
* 0.5 Data Percentage: ~12% Exact Match
* 0.6 Data Percentage: ~10% Exact Match
* 0.7 Data Percentage: ~17% Exact Match
* 0.8 Data Percentage: ~19% Exact Match
* 0.9 Data Percentage: ~3% Exact Match
* 1.0 Data Percentage: ~0% Exact Match
### Key Observations
* For k=3, the exact match percentage increases as the data percentage increases.
* For k=2, the exact match percentage decreases as the data percentage increases, with a slight increase around 0.7 to 0.8.
* The k=2 line starts with a much higher exact match percentage than the k=3 line.
* The k=3 line surpasses the k=2 line in exact match percentage around a data percentage of 0.6.
### Interpretation
The chart illustrates how exact match varies with the data percentage for reasoning lengths k=2 and k=3. Read together with the panel caption (from k=2 to k=3), the data percentage is consistent with the share of k=3 data mixed into training. Exact match on k=2 is high when little k=3 data is included and degrades as that share grows, while exact match on k=3 starts near zero and improves substantially once k=3 data dominates. The two lines cross between 0.5 and 0.6 data percentage.
</details>
(b) Reasoning step. From k=2 to k=3
Figure 7: SFT performances for reasoning step generalization.
Findings. As showcased in Figure 7, CoT reasoning does not generalize across data requiring a different number of reasoning steps. We then reduce the distribution discrepancy by gradually increasing the ratio of unseen data while keeping the dataset size the same when pre-training the model, and we evaluate the performance on both datasets. Performance on the target dataset increases with this ratio. At the same time, the LLMs cannot generalize to the original training dataset because only a small amount of its data remains for training. The trend is similar in both directions of step generalization (from k=2 to k=1 and from k=2 to k=3), which matches intuition and directly supports our hypothesis.
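The mixing protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_mixture`, the sampling with replacement, and the toy `("k=2", i)` examples are all assumptions made for the sketch.

```python
import random

def build_mixture(source, target, ratio, size, seed=0):
    """Sample a fixed-size training set in which `ratio` of examples come from `target`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_target = round(ratio * size)   # examples drawn from the unseen (target) data
    n_source = size - n_target       # the rest come from the original data
    mix = rng.choices(target, k=n_target) + rng.choices(source, k=n_source)
    rng.shuffle(mix)
    return mix

# Hypothetical toy pools standing in for the k=2 (original) and k=3 (unseen) datasets.
source = [("k=2", i) for i in range(100)]
target = [("k=3", i) for i in range(100)]

for ratio in (0.0, 0.3, 1.0):
    mix = build_mixture(source, target, ratio, size=50)
    share = sum(1 for tag, _ in mix if tag == "k=3") / len(mix)
    print(f"ratio={ratio:.1f} size={len(mix)} target_share={share:.2f}")
```

Because the total size is held fixed, every target example added displaces an original one, which is one way to read the opposing trends of the two curves in Figure 7.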
7 Format Generalization
Format generalization assesses the robustness of CoT reasoning to surface-level variations in test queries. This dimension is especially crucial for determining whether models have internalized flexible, transferable reasoning strategies or remain reliant on the specific templates and phrasings encountered during training.
Format Alignment Score. We introduce a metric for measuring prompt similarity:
**Definition 7.1 (Format Alignment Score)**
*For training prompt distribution $P_{train}$ and test prompt $p_{test}$ :
$$
\text{PAS}(p_{test})=\max_{p\in P_{train}}\cos(\phi(p),\phi(p_{test})) \tag{14}
$$
where $\phi$ is a prompt embedding function.*
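To make Eq. (14) concrete, the following is a minimal sketch of the score. The embedding function $\phi$ is not specified in this section, so the bag-of-tokens `embed` below is purely a hypothetical stand-in, as are the function names and the example prompts.

```python
import math
from collections import Counter

def embed(prompt):
    """Hypothetical phi: bag-of-tokens counts over whitespace-separated tokens."""
    return Counter(prompt.split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def alignment_score(train_prompts, test_prompt):
    """Eq. (14): maximum cosine similarity between the test prompt and any training prompt."""
    phi_test = embed(test_prompt)
    return max(cosine(embed(p), phi_test) for p in train_prompts)

# Hypothetical prompts, for illustration only.
train_prompts = ["A B C D [F1] <answer>", "E F G H [F2] <answer>"]
print(alignment_score(train_prompts, "A B C D [F1] <answer>"))    # seen format
print(alignment_score(train_prompts, "A B noise C D [F1] <answer>"))  # perturbed format
print(alignment_score(train_prompts, "x y z"))                      # unrelated prompt
```

A learned sentence encoder could replace `embed` without changing `alignment_score`; only the choice of $\phi$ would differ.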
<details>
<summary>x10.png Details</summary>

### Visual Description
## Bar Chart: Edit Distance vs. Noise Level
### Overview
The image is a bar chart comparing the edit distance for different types of modifications (Insertion, Deletion, Modify) and the overall edit distance ("All") at varying noise levels (5% to 30%). The chart uses different patterned bars to represent each modification type.
### Components/Axes
* **X-axis:** Noise Level (%), with markers at 5, 10, 15, 20, 25, and 30.
* **Y-axis:** Edit Distance, ranging from 0.0 to 0.8, with gridlines at intervals of 0.2.
* **Legend (Top-Left):**
* "All" - Light purple bars with diagonal stripes.
* "Insertion" - Light red bars with small dots.
* "Deletion" - Light blue bars with a cross-hatch pattern.
* "Modify" - Light green bars with small dots.
### Detailed Analysis
Here's a breakdown of the edit distance for each modification type at different noise levels:
* **All (Light Purple with Diagonal Stripes):**
* 5%: ~0.2
* 10%: ~0.35
* 15%: ~0.48
* 20%: ~0.58
* 25%: ~0.65
* 30%: ~0.72
Trend: The "All" edit distance consistently increases as the noise level increases.
* **Insertion (Light Red with Small Dots):**
* 5%: ~0.27
* 10%: ~0.46
* 15%: ~0.48
* 20%: ~0.73
* 25%: ~0.72
* 30%: ~0.82
Trend: The "Insertion" edit distance generally increases with noise level, with a slight plateau between 20% and 25%.
* **Deletion (Light Blue with Cross-Hatch):**
* 5%: ~0.14
* 10%: ~0.25
* 15%: ~0.24
* 20%: ~0.43
* 25%: ~0.35
* 30%: ~0.57
Trend: The "Deletion" edit distance increases with noise level, but with some fluctuations.
* **Modify (Light Green with Small Dots):**
* 5%: ~0.19
* 10%: ~0.36
* 15%: ~0.34
* 20%: ~0.62
* 25%: ~0.71
* 30%: ~0.79
Trend: The "Modify" edit distance generally increases with noise level.
### Key Observations
* The "Insertion" modification type has the highest edit distance at higher noise levels (20% and above).
* The "Deletion" modification type has the lowest edit distance across all noise levels.
* The "All" edit distance generally falls between the "Insertion" and "Deletion" values, as expected.
* All edit distances tend to increase as the noise level increases.
### Interpretation
The chart demonstrates how different types of modifications contribute to the overall edit distance as noise levels increase. The "Insertion" modifications appear to be the most sensitive to noise, resulting in the highest edit distances at higher noise levels. The "Deletion" modifications are less sensitive, resulting in lower edit distances. The "All" edit distance reflects the combined effect of all modification types. This data suggests that the "Insertion" operations are more prone to errors or changes when noise is introduced, while "Deletion" operations are more robust.
</details>
(a) Format generalization. Performance under various perturbation methods.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: BLEU Score vs. Noise Level
### Overview
The image is a line chart comparing the BLEU (Bilingual Evaluation Understudy) score against the noise level (in percentage) for four different conditions: "None", "Prompt", "Transformation", and "Element". The chart illustrates how the BLEU score changes as the noise level increases for each condition.
### Components/Axes
* **X-axis:** Noise Level (%), with markers at 10, 20, 30, 40, 50, 60, 70, 80, and 90.
* **Y-axis:** BLEU Score, ranging from 0.0 to 1.0, with markers at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend:** Located on the right side of the chart, it identifies the four conditions:
* None (light blue, dotted line, diamond marker)
* Prompt (light purple, solid line, square marker)
* Transformation (light red, dashed line, circle marker)
* Element (light green, dash-dot line, triangle marker)
### Detailed Analysis
* **None (Light Blue, Dotted Line, Diamond Marker):** The BLEU score remains almost constant at approximately 1.0 across all noise levels.
* At 10% Noise Level: BLEU Score ~ 1.0
* At 90% Noise Level: BLEU Score ~ 1.0
* **Prompt (Light Purple, Solid Line, Square Marker):** The BLEU score decreases as the noise level increases.
* At 10% Noise Level: BLEU Score ~ 0.9
* At 20% Noise Level: BLEU Score ~ 0.85
* At 30% Noise Level: BLEU Score ~ 0.78
* At 40% Noise Level: BLEU Score ~ 0.72
* At 50% Noise Level: BLEU Score ~ 0.66
* At 60% Noise Level: BLEU Score ~ 0.6
* At 70% Noise Level: BLEU Score ~ 0.54
* At 80% Noise Level: BLEU Score ~ 0.48
* At 90% Noise Level: BLEU Score ~ 0.43
* **Transformation (Light Red, Dashed Line, Circle Marker):** The BLEU score decreases as the noise level increases.
* At 10% Noise Level: BLEU Score ~ 0.9
* At 20% Noise Level: BLEU Score ~ 0.77
* At 30% Noise Level: BLEU Score ~ 0.65
* At 40% Noise Level: BLEU Score ~ 0.57
* At 50% Noise Level: BLEU Score ~ 0.47
* At 60% Noise Level: BLEU Score ~ 0.38
* At 70% Noise Level: BLEU Score ~ 0.3
* At 80% Noise Level: BLEU Score ~ 0.2
* At 90% Noise Level: BLEU Score ~ 0.1
* **Element (Light Green, Dash-Dot Line, Triangle Marker):** The BLEU score decreases sharply as the noise level increases, reaching near zero at higher noise levels.
* At 10% Noise Level: BLEU Score ~ 0.65
* At 20% Noise Level: BLEU Score ~ 0.45
* At 30% Noise Level: BLEU Score ~ 0.25
* At 40% Noise Level: BLEU Score ~ 0.15
* At 50% Noise Level: BLEU Score ~ 0.05
* At 60% Noise Level: BLEU Score ~ 0.02
* At 70% Noise Level: BLEU Score ~ 0.01
* At 80% Noise Level: BLEU Score ~ 0.00
* At 90% Noise Level: BLEU Score ~ 0.00
### Key Observations
* The "None" condition maintains a consistently high BLEU score regardless of the noise level.
* The "Element" condition is the most sensitive to noise, with its BLEU score dropping rapidly as noise increases.
* The "Prompt" and "Transformation" conditions show a gradual decrease in BLEU score as noise increases, with "Transformation" decreasing more rapidly than "Prompt".
### Interpretation
The chart demonstrates the impact of noise on the BLEU scores of different conditions. The "None" condition likely represents a baseline scenario without any added noise or modifications, hence its stable and high BLEU score. The "Element" condition is highly susceptible to noise, suggesting that the elements being evaluated are easily disrupted by noisy data. The "Prompt" and "Transformation" conditions show intermediate levels of sensitivity, indicating that these methods are somewhat robust to noise but still affected by it. The data suggests that the "None" condition is the most robust, while the "Element" condition is the least robust to noise. The "Prompt" condition is more robust than the "Transformation" condition.
</details>
(b) Format generalization. Performance vs. various applied perturbation areas.
Figure 8: Performance of format generalization.
Experiment settings. To systematically probe this, we introduce four distinct perturbation modes that simulate real-world scenarios: (i) insertion, where a noise token is inserted before the original token; (ii) deletion, where the original token is removed; (iii) modification, where the original token is replaced with a noise token; and (iv) hybrid, which combines multiple perturbations. Each mode is applied to each token with probability $p$ , enabling us to quantify the model’s resilience to increasing degrees of prompt distribution shift.
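The four perturbation modes can be sketched as a single token-level function. This is an illustrative sketch under stated assumptions: the noise vocabulary `NOISE_VOCAB` is hypothetical, and the hybrid mode is assumed to pick one of the three basic operations uniformly at random for each perturbed token, which the text does not specify.

```python
import random

NOISE_VOCAB = ["<n0>", "<n1>", "<n2>"]  # hypothetical noise tokens
BASIC_MODES = ("insertion", "deletion", "modification")

def perturb(tokens, mode, p, seed=0):
    """Apply `mode` independently to each token with probability `p`."""
    if mode not in BASIC_MODES + ("hybrid",):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() >= p:
            out.append(tok)  # token left untouched
            continue
        # Assumption: hybrid chooses one basic operation per perturbed token.
        op = rng.choice(BASIC_MODES) if mode == "hybrid" else mode
        if op == "insertion":
            out.extend([rng.choice(NOISE_VOCAB), tok])  # noise goes before the token
        elif op == "modification":
            out.append(rng.choice(NOISE_VOCAB))         # token replaced by noise
        # op == "deletion": the token is simply dropped
    return out

query = ["A", "B", "C", "D", "[F1]"]
for mode in ("insertion", "deletion", "modification", "hybrid"):
    print(mode, perturb(query, mode, p=0.3))
```

Sweeping `p` over a grid such as 5% to 30% would then produce progressively larger format shifts for each mode.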
Findings. As shown in Figure 8a, CoT reasoning is easily affected by format changes. Insertion, deletion, modification, and the hybrid mode each create a format discrepancy that reduces correctness. Among them, deletion affects performance the least, while insertion has a relatively large influence on the results. We further divide the query into several sections: elements, transformations, and prompt tokens. As shown in Figure 8b, elements and transformations play an important role in the format, whereas changes to other tokens rarely affect the results.
8 Temperature and Model Size
Temperature and model size generalization explores how variations in sampling temperature and model capacity can influence the stability and robustness of CoT reasoning. For the sake of rigorous evaluation, we further investigate whether different choices of temperatures and model sizes may significantly affect our results.
Experiment settings. We explore the impact of different temperatures on the validity of the presented results, adopting the same setting as in the transformation generalization experiments.
Findings. As illustrated in Figure 9a, LLMs tend to generate consistent and reliable CoT reasoning across a broad range of temperature settings (e.g., from 1e-5 up to 1), provided the values remain within a suitable range. This stability is maintained even when the models are evaluated under a variety of distribution shifts.
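As background for why very high temperatures behave differently, the sketch below shows the common convention of dividing logits by the temperature before the softmax. It assumes that convention applies here; the decoding code is not given in this section, and the logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Next-token distribution after scaling logits by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
for t in (1e-5, 0.01, 0.1, 1.0, 5.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At small temperatures nearly all probability mass sits on the top token, so sampling is close to greedy decoding; at large temperatures the distribution approaches uniform and sampled tokens become close to random.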
<details>
<summary>x12.png Details</summary>

### Visual Description
## Heatmap: BLEU Score and Edit Distance vs. Temperature for Different Scenarios
### Overview
The image presents two heatmaps arranged vertically. The top heatmap visualizes BLEU scores for different scenarios across varying temperatures. The bottom heatmap visualizes Edit Distance for the same scenarios and temperatures. The x-axis represents temperature, and the y-axis represents different scenarios (OOD, POOD, CMP). Color intensity corresponds to the score/distance value, with a color gradient legend provided on the right side of each heatmap.
### Components/Axes
* **X-axis (shared):** Temperature, with values: 1e-05, 0.01, 0.1, 1.0, 5.0, 10.0
* **Y-axis (top):** Scenario (BLEU Score), with categories: OOD, POOD, CMP
* **Y-axis (bottom):** Scenario (Edit Distance), with categories: OOD, POOD, CMP
* **Color Legend (right, top):** BLEU Score, ranging from approximately 0.0 (dark blue) to 0.75 (red). The legend shows markers at 0.25 and 0.50.
* **Color Legend (right, bottom):** Edit Distance, ranging from approximately 0.0 (dark blue) to 1.0 (red). The legend shows markers at 0.25, 0.50, and 0.75.
### Detailed Analysis
**Top Heatmap: BLEU Score**
* **Scenario CMP:** The BLEU score is high (approximately 0.68) for temperatures 1e-05, 0.01, 0.1, and 1.0. It drops significantly to 0.019 at temperature 5.0 and further to 0.002 at temperature 10.0.
* **Scenario POOD:** The BLEU score is moderate (approximately 0.45) for temperatures 1e-05, 0.01, 0.1, and 1.0. It drops significantly to 0.010 at temperature 5.0 and further to 0.002 at temperature 10.0.
* **Scenario OOD:** The BLEU score is low (approximately 0.27) for temperatures 1e-05, 0.01, 0.1, and 1.0. It drops significantly to 0.003 at temperature 5.0 and further to 0.001 at temperature 10.0.
**Bottom Heatmap: Edit Distance**
* **Scenario CMP:** The Edit Distance is low (approximately 0.13) for temperatures 1e-05, 0.01, 0.1, and 1.0. It increases significantly to 0.760 at temperature 5.0 and further to 0.830 at temperature 10.0.
* **Scenario POOD:** The Edit Distance is low (approximately 0.167) for temperatures 1e-05, 0.01, 0.1, and 1.0. It increases significantly to 0.790 at temperature 5.0 and further to 0.824 at temperature 10.0.
* **Scenario OOD:** The Edit Distance is low (approximately 0.299) for temperatures 1e-05, 0.01, 0.1, and 1.0. It increases significantly to 0.834 at temperature 5.0 and further to 0.846 at temperature 10.0.
### Key Observations
* For BLEU score, all scenarios (CMP, POOD, OOD) show a high score at low temperatures (1e-05 to 1.0) and a drastic drop at higher temperatures (5.0 and 10.0).
* For Edit Distance, all scenarios (CMP, POOD, OOD) show a low distance at low temperatures (1e-05 to 1.0) and a significant increase at higher temperatures (5.0 and 10.0).
* The BLEU score is highest for CMP and lowest for OOD across the lower temperature range.
* The Edit Distance is lowest for CMP and highest for OOD across the lower temperature range.
### Interpretation
The heatmaps suggest an inverse relationship between temperature and model performance, as measured by BLEU score, and a direct relationship between temperature and Edit Distance. At lower temperatures (1e-05 to 1.0), the models perform relatively well, with higher BLEU scores and lower Edit Distances. However, as the temperature increases (5.0 and 10.0), the performance degrades significantly, indicated by the sharp drop in BLEU scores and the substantial increase in Edit Distances.
The different scenarios (CMP, POOD, OOD) also exhibit varying levels of performance. CMP consistently shows the best performance (highest BLEU score, lowest Edit Distance), while OOD shows the worst performance (lowest BLEU score, highest Edit Distance). This suggests that the model is better suited for the CMP scenario compared to the OOD scenario. The POOD scenario falls in between CMP and OOD in terms of performance.
The data implies that a lower temperature is preferable for optimal model performance, and that the model's effectiveness is highly dependent on the specific scenario. The drastic change in performance at higher temperatures warrants further investigation into the model's behavior and potential overfitting or instability at those temperature settings.
</details>
(a) Influences of various temperatures.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Chart: Exact Match (%) vs. SFT Ratio
### Overview
The image is a line chart comparing the "Exact Match (%)" against the "SFT Ratio ($\times 10^{-4}$)" for five different data series, labeled as 68K, 589K, 4.8M, 35M, and 543M. The chart illustrates how the exact match percentage changes with the SFT ratio for each series.
### Components/Axes
* **Y-axis (Vertical):** "Exact Match (%)", ranging from 0 to 100, with gridlines at intervals of 20.
* **X-axis (Horizontal):** "SFT Ratio ($\times 10^{-4}$)", ranging from 1.0 to 4.0, with gridlines at intervals of approximately 0.5.
* **Legend (Top-Right):**
* Blue solid line: 68K
* Red dashed line: 589K
* Green dash-dotted line: 4.8M
* Light blue dotted line: 35M
* Orange dash-dot-dotted line: 543M
### Detailed Analysis
* **68K (Blue solid line):** Starts at approximately 40% at an SFT Ratio of 1.0, increases to approximately 45% at 1.2, then rises sharply to approximately 85% at 1.3, and reaches 100% at an SFT Ratio of approximately 2.0, remaining at 100% thereafter.
* **589K (Red dashed line):** Begins at approximately 30% at an SFT Ratio of 1.0, increases to approximately 90% at 1.2, peaks at approximately 95% at 1.3, then decreases to approximately 80% at 1.5, before rising again to 100% at an SFT Ratio of approximately 2.0, remaining at 100% thereafter.
* **4.8M (Green dash-dotted line):** Remains at approximately 0% until an SFT Ratio of approximately 2.3, then increases sharply to 100% at an SFT Ratio of approximately 2.7, remaining at 100% thereafter.
* **35M (Light blue dotted line):** Remains at approximately 0% until an SFT Ratio of approximately 1.6, then increases sharply to approximately 65% at an SFT Ratio of approximately 2.3, and reaches 100% at an SFT Ratio of approximately 3.0, remaining at 100% thereafter.
* **543M (Orange dash-dot-dotted line):** Remains at approximately 0% until an SFT Ratio of approximately 3.5, then increases sharply to 100% at an SFT Ratio of approximately 4.0.
### Key Observations
* The 68K and 589K series achieve 100% exact match at lower SFT ratios compared to the 4.8M, 35M, and 543M series.
* The 4.8M, 35M, and 543M series exhibit a sharp transition from 0% to 100% exact match within a narrow range of SFT ratios.
* The 589K series shows a slight dip in exact match percentage before reaching 100%.
### Interpretation
The chart suggests that "Exact Match (%)" depends strongly on the "SFT Ratio ($\times 10^{-4}$)", and that the ratio needed to reach 100% varies across the five series (68K, 589K, 4.8M, 35M, and 543M), which the panel caption identifies as model sizes. The two smallest series (68K and 589K) reach high exact match at lower SFT ratios, while the larger series (4.8M, 35M, and 543M) need higher SFT ratios to reach similar levels. The sharp transitions in the 4.8M, 35M, and 543M series could indicate a threshold effect, where a certain SFT ratio is needed before exact match rises. The chart alone does not explain the dip in the 589K series near an SFT ratio of 1.5.
</details>
(b) Influences of various sizes.
Figure 9: Temperature and model size. The findings hold under different temperatures and model sizes.
Experiment settings. We further examine the influence of model size by employing the same experimental configuration as used in the novel relation SFT study. In particular, we first pretrain models of different sizes using the transformation $f_{1}\circ f_{1}$ , and subsequently perform SFT on $f_{2}\circ f_{2}$ while varying the SFT ratios.
Findings. Figure 9b shows the accuracy of models of different sizes under different SFT ratios, which closely matches the result of our default model size across all evaluated settings and configurations.
9 Discussion and Implication
Our investigation, conducted through the controlled environment of DataAlchemy, reveals that the apparent reasoning prowess of Chain-of-Thought (CoT) is largely a brittle mirage. The findings across task, length, and format generalization experiments converge on a conclusion: CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
While our experiments utilized models trained from scratch in a controlled environment, the principles uncovered are extensible to large-scale pre-trained models. We summarize the implications for practitioners as follows.
Guard Against Over-reliance and False Confidence. CoT should not be treated as a “plug-and-play” module for robust reasoning, especially in high-stakes domains like medicine, finance, or legal analysis. The ability of LLMs to produce “fluent nonsense” —plausible but logically flawed reasoning chains—can be more deceptive and damaging than an outright incorrect answer, as it projects a false aura of dependability. Sufficient auditing from domain experts is indispensable.
Prioritize Out-of-Distribution (OOD) Testing. Standard validation practices, where the test set closely mirrors the training set, are insufficient to gauge the true robustness of a CoT-enabled system. Practitioners must implement rigorous adversarial and OOD testing that systematically probes for vulnerabilities across task, length, and format variations.
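One way to operationalize this recommendation is a small harness that reports exact match separately for an in-distribution split and for each shifted split. The sketch below is illustrative only: the split names, the toy queries, and the lookup-table `toy_model` are hypothetical stand-ins, not the DataAlchemy setup.

```python
def exact_match(model, examples):
    """Percentage of examples whose prediction equals the reference exactly."""
    if not examples:
        raise ValueError("examples must be non-empty")
    hits = sum(1 for query, answer in examples if model(query) == answer)
    return 100.0 * hits / len(examples)

def robustness_report(model, splits):
    """Exact match per named split, so shifted splits are never averaged away."""
    return {name: exact_match(model, examples) for name, examples in splits.items()}

# Toy stand-in for a model: it only answers queries it has memorized.
MEMORIZED = {"A B [f1]": "B C", "C D [f1]": "D E"}

def toy_model(query):
    return MEMORIZED.get(query, "")

splits = {
    "in_distribution": [("A B [f1]", "B C"), ("C D [f1]", "D E")],
    "task_shift": [("A B [f2]", "B A")],          # unseen transformation
    "length_shift": [("A B C [f1]", "B C D")],    # longer input than in training
    "format_shift": [("A B [f1] ?", "B C")],      # extra prompt token
}

report = robustness_report(toy_model, splits)
for name, score in report.items():
    print(f"{name}: {score:.1f}%")
```

Reporting each split on its own line keeps a strong in-distribution score from masking failures on the shifted splits.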
Recognize Fine-Tuning as a Patch, Not a Panacea. Our results show that Supervised Fine-Tuning (SFT) can quickly “patch” a model’s performance on a new, specific data distribution. However, this should not be mistaken for achieving true generalization. It simply expands the model’s “in-distribution” bubble slightly. Relying on SFT to fix every OOD failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability.
10 Conclusion
In this paper, we critically examine the CoT reasoning of LLMs through the lens of data distribution, revealing that the perceived structured reasoning capability largely arises from inductive biases shaped by in-distribution training data. We propose a controlled environment, DataAlchemy, allowing systematic probing of CoT reasoning along three crucial dimensions: task structure, reasoning length, and query format. Empirical findings consistently demonstrate that CoT reasoning effectively reproduces reasoning patterns closely aligned with training distributions but suffers significant degradation when faced with distributional deviations. Such observations reveal the inherent brittleness and superficiality of current CoT reasoning capabilities. We provide insights that emphasize real-world implications for both practitioners and researchers.
References
- Bentham et al. (2024) O. Bentham, N. Stringham, and A. Marasovic. Chain-of-thought unfaithfulness as disguised accuracy. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=ydcrP55u2e. Reproducibility Certification.
- Budnikov et al. (2025) M. Budnikov, A. Bykova, and I. P. Yamshchikov. Generalization potential of large language models. Neural Computing and Applications, 37(4):1973–1997, 2025.
- Chen et al. (2025a) Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025a.
- Chen et al. (2025b) Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, C. Denison, J. Schulman, A. Somani, P. Hase, M. Wagner, F. Roger, et al. Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410, 2025b.
- Cho et al. (2024) H. Cho, J. Cha, P. Awasthi, S. Bhojanapalli, A. Gupta, and C. Yun. Position coupling: Improving length generalization of arithmetic transformers using task structure. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=5cIRdGM1uG.
- Garg et al. (2022) S. Garg, D. Tsipras, P. S. Liang, and G. Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems, 35:30583–30598, 2022.
- Guo et al. (2025) D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Imani et al. (2023) S. Imani, L. Du, and H. Shrivastava. Mathprompter: Mathematical reasoning using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37–42, 2023.
- Jaech et al. (2024) A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- Kojima et al. (2022) T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Lanham et al. (2023) T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
- Li et al. (2025a) H. Li, S. Lu, P.-Y. Chen, X. Cui, and M. Wang. Training nonlinear transformers for chain-of-thought inference: A theoretical generalization analysis. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=n7n8McETXw.
- Li et al. (2025b) Y. Li, Z. Lai, W. Bao, Z. Tan, A. Dao, K. Sui, J. Shen, D. Liu, H. Liu, and Y. Kong. Visual large language models for generalized and specialized applications. arXiv preprint arXiv:2501.02765, 2025b.
- Ling et al. (2023) Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memisevic, and H. Su. Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems, 36:36407–36433, 2023.
- Mirzadeh et al. (2024) I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.
- Papineni et al. (2002) K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- Radford et al. (2019) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Shen et al. (2025) Z. Shen, H. Yan, L. Zhang, Z. Hu, Y. Du, and Y. He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025.
- Shojaee et al. (2025) P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025.
- Song et al. (2025) J. Song, Z. Xu, and Y. Zhong. Out-of-distribution generalization via composition: a lens through induction heads in transformers. Proceedings of the National Academy of Sciences, 122(6):e2417182122, 2025.
- Tang et al. (2023) X. Tang, Z. Zheng, J. Li, F. Meng, S.-C. Zhu, Y. Liang, and M. Zhang. Large language models are in-context semantic reasoners rather than symbolic reasoners. arXiv preprint arXiv:2305.14825, 2023.
- Team et al. (2025) K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
- Team (2024) Q. Team. Qwq: Reflect deeply on the boundaries of the unknown. Hugging Face, 2024.
- Ting et al. (2025) L. P.-Y. Ting, C. Zhao, Y.-H. Zeng, Y. J. Lim, and K.-T. Chuang. Beyond rag: Reinforced reasoning augmented generation for clinical notes. arXiv preprint arXiv:2506.05386, 2025.
- Wang et al. (2024) Q. Wang, Y. Wang, Y. Wang, and X. Ying. Can in-context learning really generalize to out-of-distribution tasks? arXiv preprint arXiv:2410.09695, 2024.
- Wang et al. (2023) X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
- Wang et al. (2025a) Y. Wang, F.-C. Chang, and P.-Y. Wu. Chain-of-thought prompting for out-of-distribution samples: A latent-variable study. arXiv e-prints, pages arXiv–2504, 2025a.
- Wang et al. (2025b) Y. Wang, F.-C. Chang, and P.-Y. Wu. A theoretical framework for ood robustness in transformers using gevrey classes. arXiv preprint arXiv:2504.12991, 2025b.
- Wei et al. (2022) J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- Xu et al. (2024) J. Xu, H. Fei, L. Pan, Q. Liu, M.-L. Lee, and W. Hsu. Faithful logical reasoning via symbolic chain-of-thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13326–13365, 2024.
- Yang et al. (2024) J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12):5635–5662, 2024.
- Yang et al. (2023) L. Yang, Y. Song, X. Ren, C. Lyu, Y. Wang, J. Zhuo, L. Liu, J. Wang, J. Foster, and Y. Zhang. Out-of-distribution generalization in natural language processing: Past, present, and future. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4533–4559, 2023.
- Yao et al. (2023) S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.
- Yao et al. (2025) X. Yao, R. Ren, Y. Liao, and Y. Liu. Unveiling the mechanisms of explicit cot training: How chain-of-thought enhances reasoning generalization. arXiv e-prints, pages arXiv–2502, 2025.
- Yeo et al. (2025) E. Yeo, Y. Tong, M. Niu, G. Neubig, and X. Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025.
- Yu et al. (2023) Z. Yu, L. He, Z. Wu, X. Dai, and J. Chen. Towards better chain-of-thought prompting strategies: A survey. arXiv preprint arXiv:2310.04959, 2023.
- Yujian and Bo (2007) L. Yujian and L. Bo. A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095, 2007.
- Zhang et al. (2024a) X. Zhang, C. Du, T. Pang, Q. Liu, W. Gao, and M. Lin. Chain of preference optimization: Improving chain-of-thought reasoning in llms. Advances in Neural Information Processing Systems, 37:333–356, 2024a.
- Zhang et al. (2024b) Y. Zhang, H. Wang, S. Feng, Z. Tan, X. Han, T. He, and Y. Tsvetkov. Can llm graph reasoning generalize beyond pattern memorization? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2289–2305, 2024b.
- Zhang et al. (2023) Z. Zhang, A. Zhang, M. Li, and A. Smola. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5NTt8GFjUHkr.
- Zhang et al. (2024c) Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024, 2024c.
- Zhao et al. (2025) C. Zhao, Z. Tan, C.-W. Wong, X. Zhao, T. Chen, and H. Liu. Scale: Towards collaborative content analysis in social science with large language model agents and human intervention. arXiv preprint arXiv:2502.10937, 2025.
- Zhao et al. (2023) W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
Appendix A Proof of Theorems
A.1 Proof of CoT Generalization Bound
*Proof.*
Let $f_{\theta}$ be a model trained on samples from the distribution $\mathcal{D}_{\text{train}}$ using a loss function $\ell(f_{\theta}(x),y)$ that is $\Lambda$ -Lipschitz and bounded. The expected test risk is given by
$$
R_{\text{test}}(f_{\theta})=\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{test}}}\left[\ell(f_{\theta}(x),y)\right]. \tag{15}
$$
We can decompose the test risk as
$$
R_{\text{test}}(f_{\theta})=R_{\text{train}}(f_{\theta})+\left(R_{\text{test}}(f_{\theta})-R_{\text{train}}(f_{\theta})\right). \tag{16}
$$
To bound the discrepancy between $R_{\text{test}}$ and $R_{\text{train}}$ , we invoke a standard result from statistical learning theory. Given that $\ell$ is $\Lambda$ -Lipschitz and the discrepancy measure $\Delta(\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}})$ is an integral probability metric (e.g., Wasserstein-1 distance), we have
$$
\left|R_{\text{test}}(f_{\theta})-R_{\text{train}}(f_{\theta})\right|\leq\Lambda\cdot\Delta(\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}}). \tag{17}
$$
Therefore, the test risk satisfies
$$
R_{\text{test}}(f_{\theta})\leq R_{\text{train}}(f_{\theta})+\Lambda\cdot\Delta(\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}}). \tag{18}
$$
We next account for the generalization gap between the empirical training risk $\hat{R}_{\text{train}}(f_{\theta})$ and the expected training risk $R_{\text{train}}(f_{\theta})$. By applying a concentration inequality (e.g., Hoeffding's inequality), with probability at least $1-\delta$, we have
$$
R_{\text{train}}(f_{\theta})\leq\hat{R}_{\text{train}}(f_{\theta})+\mathcal{O}\left(\sqrt{\frac{\log(1/\delta)}{n}}\right), \tag{19}
$$
where $n$ is the number of training samples. Combining the above, we obtain that with high probability,
$$
R_{\text{test}}(f_{\theta})\leq\hat{R}_{\text{train}}(f_{\theta})+\Lambda\cdot\Delta(\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}})+\mathcal{O}\left(\sqrt{\frac{\log(1/\delta)}{n}}\right). \tag{20}
$$
This concludes the proof. ∎
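The right-hand side of Eq. (20) can be evaluated numerically. The sketch below is illustrative only: the function name `cot_risk_bound` and the parameter `loss_max` are hypothetical, and the $\mathcal{O}(\cdot)$ term is instantiated with the one-sided Hoeffding constant for a single fixed hypothesis and a loss bounded in $[0,\text{loss\_max}]$, which is an assumption rather than something the proof specifies.

```python
import math


def cot_risk_bound(emp_train_risk, lipschitz, discrepancy, n, delta, loss_max=1.0):
    """Evaluate the right-hand side of Eq. (20) under stated assumptions.

    emp_train_risk: empirical training risk
    lipschitz:      Lipschitz constant Lambda of the loss
    discrepancy:    Delta(D_train, D_test), e.g. a Wasserstein-1 distance
    n:              number of training samples
    delta:          failure probability of the concentration step
    loss_max:       assumed upper bound on the loss (hypothetical parameter)
    """
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    # One-sided Hoeffding term for a fixed hypothesis and a loss in [0, loss_max].
    concentration = loss_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return emp_train_risk + lipschitz * discrepancy + concentration


if __name__ == "__main__":
    for shift in (0.0, 0.1, 0.5):
        print(shift, cot_risk_bound(0.02, 1.0, shift, n=10_000, delta=0.05))
```

For fixed training risk and sample size, the bound grows linearly in the discrepancy, which is the dependence the theorem emphasizes.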
A.2 Proof of Task Generalization Failure Threshold
We establish the exponential decay bound through a probabilistic analysis of reasoning failure modes in the presence of task generalization complexity.
Let $\Omega$ denote the sample space of all possible reasoning configurations, and let $C\in\Omega$ represent a specific configuration. We define the following events: $A_{i}$ as the event that element $a_{i}$ is novel, i.e., $a_{i}\notin\mathcal{E}^{i}_{\text{train}}$; $F_{j}$ as the event that transformation $f_{j}$ is novel, i.e., $f_{j}\notin\mathcal{F}_{\text{train}}$; and $\mathcal{Q}$ as the event that the transformation sequence $(f_{1},f_{2},\ldots,f_{k})$ is novel, i.e., $(f_{1},f_{2},\ldots,f_{k})\notin\mathcal{P}_{\text{train}}$.
We assume that the reasoning failures induced by novel arguments, functions, and patterns contribute independently to the overall failure probability, and hence model the success probability as a product of component-wise success rates:
$$
P(\text{correct}|C)=P_{0}\prod_{i=1}^{m}\rho_{a}^{\mathbb{I}[A_{i}]}\prod_{j=1}^{n}\rho_{f}^{\mathbb{I}[F_{j}]}\rho_{p}^{\mathbb{I}[\mathcal{Q}]}\rho_{c}^{C_{T}}
$$
where $P_{0}\in(0,1]$ represents the baseline success probability when all components are within the training distribution, and $\rho_{a},\rho_{f},\rho_{p},\rho_{c}\in(0,1)$ are the degradation factors associated with novel arguments, functions, patterns, and task-specific complexity, respectively. Taking the logarithm of both sides gives
$$
\ln P(\text{correct}\mid C)=\ln P_{0}+\sum_{i=1}^{m}\mathbb{I}[A_{i}]\,\ln\rho_{a}+\sum_{j=1}^{n}\mathbb{I}[F_{j}]\,\ln\rho_{f}+\mathbb{I}[\mathcal{Q}]\,\ln\rho_{p}+C_{T}\,\ln\rho_{c} \tag{21}
$$
For notational convenience, we define the positive constants:
$$
\xi_{a}:=-\ln\rho_{a}>0,\quad\xi_{f}:=-\ln\rho_{f}>0,\quad\xi_{p}:=-\ln\rho_{p}>0,\quad\xi_{c}:=-\ln\rho_{c}>0
$$
hence we have:
$$
\ln P(\text{correct}|C)=\ln P_{0}-\xi_{a}\sum_{i=1}^{m}\mathbb{I}[A_{i}]-\xi_{f}\sum_{j=1}^{n}\mathbb{I}[F_{j}]-\xi_{p}\mathbb{I}[\mathcal{Q}]-\xi_{c}C_{T} \tag{22}
$$
Lemma: Relationship to TGC. The expression in Eq. (22) can be bounded in terms of $\text{TGC}(C)$ as follows:
$$
\ln P(\text{correct}|C)\leq\ln P_{0}-\delta\cdot\text{TGC}(C) \tag{23}
$$
where $\delta=\min(\frac{\xi_{a}}{\alpha},\frac{\xi_{f}}{\beta},\frac{\xi_{p}}{\gamma},\xi_{c})>0$ .
Proof of Lemma: From the definition of $\text{TGC}(C)$ in Eq. (11), we have:
$$
\text{TGC}(C)=\alpha\sum_{i=1}^{m}\mathbb{I}[A_{i}]+\beta\sum_{j=1}^{n}\mathbb{I}[F_{j}]+\gamma\mathbb{I}[\mathcal{Q}]+C_{T} \tag{24}
$$
By the definition of $\delta$ , each term in Eq. (22) satisfies:
$$
\xi_{a}\sum_{i=1}^{m}\mathbb{I}[A_{i}]\geq\delta\alpha\sum_{i=1}^{m}\mathbb{I}[A_{i}] \tag{25}
$$
$$
\xi_{f}\sum_{j=1}^{n}\mathbb{I}[F_{j}]\geq\delta\beta\sum_{j=1}^{n}\mathbb{I}[F_{j}] \tag{26}
$$
$$
\xi_{p}\mathbb{I}[\mathcal{Q}]\geq\delta\gamma\mathbb{I}[\mathcal{Q}] \tag{27}
$$
$$
\xi_{c}C_{T}\geq\delta C_{T} \tag{28}
$$
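Summing these inequalities gives $\xi_{a}\sum_{i=1}^{m}\mathbb{I}[A_{i}]+\xi_{f}\sum_{j=1}^{n}\mathbb{I}[F_{j}]+\xi_{p}\mathbb{I}[\mathcal{Q}]+\xi_{c}C_{T}\geq\delta\cdot\text{TGC}(C)$, and substituting this into Eq. (22) establishes Eq. (23).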
We now define the threshold $\tau:=\frac{\ln P_{0}}{\delta}$ . From Eq. (23), when $\text{TGC}(C)>\tau$ , we have:
$$
\begin{aligned}
\ln P(\operatorname{correct}\mid C) &\leq \ln P_{0}-\delta\cdot\operatorname{TGC}(C) \\
&=\delta(\tau-\operatorname{TGC}(C)) \\
&=-\delta(\operatorname{TGC}(C)-\tau)
\end{aligned} \tag{29}
$$
Exponentiating both sides yields the desired bound: $P(\operatorname{correct}\mid C)\leq e^{-\delta(\operatorname{TGC}(C)-\tau)}$.
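The derivation in this subsection can be checked numerically. The following sketch computes the product-form success probability, $\text{TGC}(C)$ from Eq. (24), and the exponential bound, then confirms that the bound dominates the exact value. All function names and all numeric values for $\rho$, $\alpha$, $\beta$, $\gamma$ are hypothetical choices for illustration; it assumes $\alpha,\beta,\gamma>0$ and $C_{T}\geq 0$.

```python
import math


def tgc(n_novel_args, n_novel_funcs, novel_pattern, c_t, alpha, beta, gamma):
    """Task generalization complexity as written in Eq. (24)."""
    return alpha * n_novel_args + beta * n_novel_funcs + gamma * int(novel_pattern) + c_t


def exact_success(p0, n_novel_args, n_novel_funcs, novel_pattern, c_t,
                  rho_a, rho_f, rho_p, rho_c):
    """Product-form success probability under the independence assumption."""
    return (p0 * rho_a ** n_novel_args * rho_f ** n_novel_funcs
            * rho_p ** int(novel_pattern) * rho_c ** c_t)


def success_upper_bound(p0, tgc_value, rho_a, rho_f, rho_p, rho_c, alpha, beta, gamma):
    """Exponential bound exp(-delta * (TGC - tau)) with delta and tau as defined above."""
    xi_a, xi_f, xi_p, xi_c = (-math.log(r) for r in (rho_a, rho_f, rho_p, rho_c))
    delta = min(xi_a / alpha, xi_f / beta, xi_p / gamma, xi_c)
    tau = math.log(p0) / delta
    return math.exp(-delta * (tgc_value - tau))


if __name__ == "__main__":
    # Illustrative values only; none of these come from the paper.
    rhos = dict(rho_a=0.5, rho_f=0.4, rho_p=0.6, rho_c=0.8)
    weights = dict(alpha=1.0, beta=2.0, gamma=1.0)
    config = dict(n_novel_args=2, n_novel_funcs=1, novel_pattern=True, c_t=1.0)
    value = tgc(**config, **weights)
    print("TGC:", value)
    print("exact:", exact_success(0.9, **config, **rhos))
    print("bound:", success_upper_bound(0.9, value, **rhos, **weights))
```

Because $\delta$ is the smallest of the normalized decay rates, the bound is tight only when every active component decays at that slowest rate; otherwise it can be loose.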
A.3 Proof of Length Extrapolation Bound
*Proof.*
Consider a transformer model $f_{\theta}$ processing sequences of length $L$. The model implicitly learns position-dependent representations through positional encodings $\text{PE}(i)\in\mathbb{R}^{d}$ for position $i\in\{1,\ldots,L\}$ and attention patterns $A_{ij}=\text{softmax}\left(\frac{Q_{i}K_{j}^{T}}{\sqrt{d}}\right)$. During training on fixed length $L_{\text{train}}$, the model learns a specific distribution:
$$
p_{\text{train}}(\mathbf{h})=p(\mathbf{h}\mid L=L_{\text{train}}) \tag{32}
$$
where $\mathbf{h}=\{h_{1},\ldots,h_{L}\}$ represents hidden states. For sequences of length $L\neq L_{\text{train}}$, we encounter distribution shift in two forms: (1) positional encoding mismatch, where the model has never seen positions $i>L_{\text{train}}$ if $L>L_{\text{train}}$, and (2) attention pattern disruption, where the learned attention patterns are calibrated for length $L_{\text{train}}$. The KL divergence between training and test distributions can be bounded:
$$
D_{KL}(p_{\text{test}}\|p_{\text{train}})\propto|L-L_{\text{train}}|^{2} \tag{33}
$$
This quadratic relationship arises from linear accumulation of positional encoding errors and quadratic growth in attention pattern misalignment due to pairwise interactions. Let $\mathcal{E}(L)$ be the prediction error at length $L$. We decompose it as:
$$
\mathcal{E}(L)=\mathcal{E}_{\text{inherent}}(L)+\mathcal{E}_{\text{shift}}(L) \tag{34}
$$
where $\mathcal{E}_{\text{inherent}}(L)=\mathcal{E}_{0}$ is the inherent model error (constant) and $\mathcal{E}_{\text{shift}}(L)$ is the error due to distribution shift. The distribution shift error follows from the Central Limit Theorem. As the error accumulates over sequence positions, the total shift error converges to:
$$
\mathcal{E}_{\text{shift}}(L)=(1-\mathcal{E}_{0})\cdot\left(1-\exp\left(-\frac{(L-L_{\text{train}})^{2}}{2\sigma^{2}}\right)\right) \tag{35}
$$
This form ensures that $\mathcal{E}_{\text{shift}}(L_{\text{train}})=0$ (no shift at the training length) and $\lim_{|L-L_{\text{train}}|\to\infty}\mathcal{E}_{\text{shift}}(L)=1-\mathcal{E}_{0}$ (so the total error is bounded by 1). The width parameter $\sigma$ is given by:
$$
\sigma=\sigma_{0}\cdot\sqrt{\frac{d}{L_{\text{train}}}} \tag{36}
$$
where $\sigma_{0}$ is a model-specific constant, $d$ is the model dimension, and the $\sqrt{d/L_{\text{train}}}$ factor captures the concentration of measure in high dimensions. Therefore, the total error follows:
$$
\mathcal{E}(L)=\mathcal{E}_{0}+(1-\mathcal{E}_{0})\cdot\left(1-\exp\left(-\frac{(L-L_{\text{train}})^{2}}{2\sigma^{2}}\right)\right) \tag{37}
$$
This Gaussian form naturally emerges from the accumulation of position-dependent errors and matches the experimental observation of near-zero error at $L=L_{\text{train}}$ with symmetric increase in both directions. ∎
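The shape of Eq. (37) is easy to inspect numerically. The sketch below evaluates the total error as a function of test length, with the width taken from Eq. (36). The function name `length_error` and the example values of $\mathcal{E}_{0}$ and $\sigma_{0}$ are hypothetical; they are not fitted to any result in the paper.

```python
import math


def length_error(length, train_length, e0, sigma0, d):
    """Total error E(L) from Eq. (37), with sigma from Eq. (36).

    length:       test sequence length L
    train_length: training length L_train
    e0:           inherent error E_0 in [0, 1)
    sigma0:       model-specific constant (illustrative value, not from the paper)
    d:            model dimension
    """
    sigma = sigma0 * math.sqrt(d / train_length)
    shift = (1.0 - e0) * (1.0 - math.exp(-((length - train_length) ** 2) / (2.0 * sigma ** 2)))
    return e0 + shift


if __name__ == "__main__":
    for length in range(1, 8):
        print(length, round(length_error(length, train_length=4, e0=0.05, sigma0=0.5, d=32), 4))
```

The curve equals $\mathcal{E}_{0}$ at $L=L_{\text{train}}$, rises symmetrically on both sides, and saturates at 1 for large $|L-L_{\text{train}}|$.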
Appendix B Experiment Details
We fine-tune a GPT-2–style decoder-only Transformer with a vocabulary size of 10,000. The model supports a maximum context length of 256 tokens. The hidden dimension is 32, the number of Transformer layers is 4, and the number of attention heads is 4. Each block includes a GELU-activated feed-forward sublayer with width $4\times d_{\text{model}}$.
The model is trained using the AdamW optimizer in mixed precision (FP16). The default learning rate is $3\times 10^{-3}$, and the schedule follows a cosine decay with a 10% warm-up ratio. Training is conducted for 10 epochs, using a batch size of 1024. A weight decay of 0.01 is applied, and gradient norms are clipped at 1.0.
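As a rough illustration of the schedule described above, the sketch below computes the learning rate at a given optimizer step. Only the peak rate ($3\times 10^{-3}$), the cosine decay, and the 10% warm-up ratio come from the text; the linear shape of the warm-up, the decay to zero, and the name `lr_at_step` are assumptions.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-3, warmup_ratio=0.1):
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the text): the warm-up is linear and the
    cosine phase decays to zero at total_steps.
    """
    warmup_steps = round(warmup_ratio * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warm-up that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine from peak_lr (progress = 0) down to 0 (progress = 1).
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    total = 1000
    for step in (0, 50, 99, 100, 550, 1000):
        print(step, lr_at_step(step, total))
```

Deep-learning libraries usually ship an equivalent scheduler, so a hand-written function like this is mainly useful for plotting or sanity-checking a configuration.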
Appendix C Illustration of Datasets
Below are examples of transformations $f_{1}$ and $f_{2}$:
| Transformation[F1]: | A A F Q [F1] <answer> N N S D |
| --- | --- |
| Transformation[F2]: | A A L P [F2] <answer> A L P A |
Aside from single transformations, we can compose transformations arbitrarily:
| Transformation[F1F2]: | A C I A [F1] [F2] <think> N P V N [F2] <answer> P V N N |
| --- | --- |
| Transformation[F2F2]: | N O V S [F2] [F2] <think> O V S N [F2] <answer> V S N O |
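The examples above can be reproduced with a few lines of code. The sketch below is an inference from the examples, not the released DataAlchemy implementation: $f_{1}$ is modeled as a shift of each letter by 13 positions (consistent with A → N and Q → D) and $f_{2}$ as a cyclic left shift of the sequence by one position (consistent with A A L P → A L P A). The helper names and the rendering of chains longer than two transformations are hypothetical.

```python
import string

ALPHABET = string.ascii_uppercase


def f1(tokens):
    """Shift every letter by 13 positions in the alphabet (inferred from the examples)."""
    return [ALPHABET[(ALPHABET.index(t) + 13) % 26] for t in tokens]


def f2(tokens):
    """Cyclically shift the sequence left by one position (inferred from the examples)."""
    return tokens[1:] + tokens[:1]


TRANSFORMS = {"F1": f1, "F2": f2}


def render_example(tokens, names):
    """Render an input, its transformation tags, and each intermediate result.

    Intermediate steps are followed by <think>, the last step by <answer>.
    """
    current = list(tokens)
    names = list(names)
    pieces = []
    while names:
        marker = "<think>" if len(names) > 1 else "<answer>"
        pieces.append(" ".join(current + [f"[{n}]" for n in names] + [marker]))
        current = TRANSFORMS[names.pop(0)](current)
    pieces.append(" ".join(current))
    return " ".join(pieces)


if __name__ == "__main__":
    print(render_example("A A F Q".split(), ["F1"]))
    print(render_example("A C I A".split(), ["F1", "F2"]))
    print(render_example("N O V S".split(), ["F2", "F2"]))
```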
We use exact match, edit distance, and BLEU score to measure the discrepancy between the generated tokens and the labels. For examples with more than one transformation, we can further measure the discrepancy for the reasoning steps and the final answer separately.
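A minimal sketch of two of these metrics, computed at the token level, is given below. BLEU is omitted for brevity. Dividing the edit distance by the longer sequence length is one simple normalization chosen for illustration and may differ from the normalization used in the paper; all function names are hypothetical.

```python
def exact_match(pred, ref):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(list(pred) == list(ref))


def edit_distance(pred, ref):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if p == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length (an illustrative normalization)."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0


def split_reasoning_answer(text):
    """Split generated text into (reasoning tokens, answer tokens) at the last <answer>."""
    reasoning, _, answer = text.rpartition("<answer>")
    return reasoning.split(), answer.split()


if __name__ == "__main__":
    generated = "N N N Q [F1] <answer> N N Q N"
    expected = "N N N Q [F1] <answer> A A A D"
    gen_reason, gen_answer = split_reasoning_answer(generated)
    exp_reason, exp_answer = split_reasoning_answer(expected)
    print("reasoning exact match:", exact_match(gen_reason, exp_reason))
    print("answer edit distance:", normalized_edit_distance(gen_answer, exp_answer))
```

Scoring the two parts separately makes it possible to distinguish a correct reasoning chain with a wrong answer from a wrong chain that happens to reach the right answer.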
Appendix D Additional Experimental Results
D.1 Additional Qualitative Analysis
D.1.1 Coincidence Caused by Orthogonal Transformations
The following case shows that, even when the transformation differs, a model trained on the transformation $f_{2}\circ f_{1}$ can still provide the correct answer through incorrect reasoning:
Prompt: 'A A A B [F1] [F2] <answer>'
Generated: 'B A A A [F1] <answer> O N N N'
Expected: 'O N N N'
D.1.2 Correct reasoning but failed in final answer
The following case shows a model pretrained on the union of three transformations, $f_{1}\circ f_{2}$, $f_{2}\circ f_{1}$, and $f_{2}\circ f_{2}$, and tested on $f_{1}\circ f_{1}$. The generated reasoning step matches the expected one, but the final answer is wrong:
Prompt: 'A A A D [R1] [R1] <think>'
Generated: 'N N N Q [R1] <answer> N N Q N'
Expected: 'N N N Q [R1] <answer> A A A D'
D.1.3 Failure to generalize to novel element
The following case shows that a model trained on the element set $a_{i}\in[A,M]$ cannot generalize to unseen elements such as N or O:
Prompt: 'N N N O [F1] [F1] <think>'
Generated: 'R V Q S [F1] <answer> E I D F'
Expected: 'A A A B [F1] <answer> N N N O'
D.1.4 LLM reproduces CoT reasoning at seen lengths
The following case shows that a model trained on $f_{1}\circ f_{1}$ tries to reproduce the length seen in the training data by adding tokens to the reasoning chain, even when prompted with the seen transformation $f_{1}$:
Prompt: 'A A B D [f1] <answer>'
Generated: 'N O A Z N N O Q [f1] <answer> A A B D'
Expected: 'N N O Q'