# Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
**Authors**: Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu
> Arizona State University
{czhao93, ztan36, pingchua, daweili5, bjiang14, yancheng.wang, yingzhen.yang, huanliu}@asu.edu
## Abstract
Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning. Our code is available at GitHub: https://github.com/ChengshuaiZhao0/DataAlchemy.
## 1 Introduction
Recent years have witnessed Large Language Models’ (LLMs) dominant role in various domains (Zhao et al., 2023; Li et al., 2025b; Zhao et al., 2025; Ting et al., 2025) through versatile prompting techniques (Wei et al., 2022; Yao et al., 2023; Kojima et al., 2022). Among these, Chain-of-Thought (CoT) prompting (Wei et al., 2022) has emerged as a prominent method for eliciting structured reasoning from LLMs (a.k.a., CoT reasoning). By appending a simple cue such as “Let’s think step by step,” LLMs decompose complex problems into intermediate steps, producing outputs that resemble human-like reasoning. It has been shown to be effective in tasks requiring logical inference Xu et al. (2024), mathematical problem solving (Imani et al., 2023), and commonsense reasoning (Wei et al., 2022). The empirical successes of CoT reasoning lead to the perception that LLMs engage in deliberate inferential processes (Yu et al., 2023; Zhang et al., 2024a; Ling et al., 2023; Zhang et al., 2024c).
However, a closer examination reveals inconsistencies that challenge this optimistic view. Consider this straightforward question: “The day the US was established is in a leap year or a normal year?” When prompted with the CoT prefix, the modern LLM Gemini responded: “The United States was established in 1776. 1776 is divisible by 4, but it’s not a century year, so it’s a leap year. Therefore, the day the US was established was in a normal year.” This response exemplifies a concerning pattern: the model correctly recites the leap year rule and articulates intermediate reasoning steps, yet produces a logically inconsistent conclusion (i.e., asserting 1776 is both a leap year and a normal year). Such inconsistencies suggest that there is a distinction between human-like inference and CoT reasoning.
An expanding body of analyses reveals that LLMs tend to rely on surface-level semantics and clues rather than logical procedures (Bentham et al., 2024; Chen et al., 2025b; Lanham et al., 2023). LLMs construct superficial chains of logic based on learned token associations, often failing on tasks that deviate from commonsense heuristics or familiar templates (Tang et al., 2023). In the reasoning process, performance degrades sharply when irrelevant clauses are introduced, which indicates that models cannot grasp the underlying logic (Mirzadeh et al., 2024). This fragility becomes even more apparent when models are tested on more complex tasks, where they frequently produce incoherent solutions and fail to follow consistent reasoning paths (Shojaee et al., 2025). Collectively, these pioneering works deepen the skepticism surrounding the true nature of CoT reasoning.
In light of this line of research, we question CoT reasoning by proposing an alternative lens through data distribution and further investigating why and when it fails. We hypothesize that CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. As such, its effectiveness is inherently limited by the nature and extent of the distribution discrepancy between the training data and the test queries. Guided by this data distribution lens, we dissect CoT reasoning via three dimensions: (1) task — to what extent CoT reasoning can handle tasks involving novel transformations or previously unseen task structures; (2) length — how CoT reasoning generalizes to chains whose length differs from that of the training data; and (3) format — how sensitive CoT reasoning is to surface-level variations in query form. To evaluate each aspect, we introduce DataAlchemy, a controlled and isolated environment that allows us to train LLMs from scratch and systematically probe them under various distribution shifts.
Our findings reveal that CoT reasoning works effectively when applied to in-distribution or near in-distribution data but becomes fragile and prone to failure even under moderate distribution shifts. In some cases, LLMs generate fluent yet logically inconsistent reasoning steps. The results suggest that what appears to be structured reasoning can be a mirage, emerging from memorized or interpolated patterns in the training data rather than logical inference. These insights carry important implications for both practitioners and researchers. For practitioners, our results highlight the risk of relying on CoT as a plug-and-play solution for reasoning tasks and caution against equating CoT-style output with human thinking. For researchers, the results underscore the ongoing challenge of achieving reasoning that is both faithful and generalizable, motivating the need to develop models that can move beyond surface-level pattern recognition to exhibit deeper inferential competence. Our contributions are summarized as follows:
- Novel perspective. We propose a data distribution lens for CoT reasoning, illuminating that its effectiveness stems from structured inductive biases learned from in-distribution training data. This framework provides a principled lens for understanding why and when CoT reasoning succeeds or fails.
- Controlled environment. We introduce DataAlchemy, an isolated experimental framework that enables training LLMs from scratch and systematically probing CoT reasoning. This controlled setting allows us to isolate and analyze the effects of distribution shifts on CoT reasoning without interference from complex patterns learned during large-scale pre-training.
- Empirical validation. We conduct systematic empirical validation across three critical dimensions— task, length, and format. Our experiments demonstrate that CoT reasoning exhibits sharp performance degradation under distribution shifts, revealing that seemingly coherent reasoning masks shallow pattern replication.
- Real-world implication. This work reframes the understanding of contemporary LLMs’ reasoning capabilities and emphasizes the risk of over-reliance on CoT reasoning as a universal problem-solving paradigm. It underscores the necessity for proper evaluation methods and the development of LLMs that possess authentic and generalizable reasoning capabilities.
## 2 Related Work
### 2.1 LLM Prompting and CoT
Chain-of-Thought (CoT) prompting revolutionized how we elicit reasoning from Large Language Models by decomposing complex problems into intermediate steps (Wei et al., 2022). By augmenting few-shot exemplars with reasoning chains, CoT showed substantial performance gains on various tasks (Xu et al., 2024; Imani et al., 2023; Wei et al., 2022). Building on this, several variants emerged. Zero-shot CoT triggers reasoning without exemplars using instructional prompts (Kojima et al., 2022), and self-consistency enhances performance via majority voting over sampled chains (Wang et al., 2023). To reduce manual effort, Auto-CoT generates CoT exemplars using the models themselves (Zhang et al., 2023). Beyond linear chains, Tree-of-Thought (ToT) frames CoT as a tree search over partial reasoning paths (Yao et al., 2023), enabling lookahead and backtracking. SymbCoT combines symbolic reasoning with CoT by converting problems into formal representations (Xu et al., 2024). Recent work increasingly integrates CoT into the LLM inference process, generating long-form CoTs (Jaech et al., 2024; Team, 2024; Guo et al., 2025; Team et al., 2025). This enables flexible strategies like mistake correction, step decomposition, reflection, and alternative reasoning paths (Yeo et al., 2025; Chen et al., 2025a). The success of prompting techniques and long-form CoTs has led many to view them as evidence of emergent, human-like reasoning in LLMs. In this work, we challenge that viewpoint by adopting a data-centric perspective and demonstrating that CoT behavior arises largely from pattern matching over training distributions.
### 2.2 Discussion on Illusion of LLM Reasoning
While Chain-of-Thought prompting has led to impressive gains on complex reasoning tasks, a growing body of work has started questioning the nature of these improvements. One major line of research highlights the fragility of CoT reasoning. Minor and semantically irrelevant perturbations such as distractor phrases or altered symbolic forms can cause significant performance drops in state-of-the-art models (Mirzadeh et al., 2024; Tang et al., 2023). Models often incorporate such irrelevant details into their reasoning, revealing a lack of sensitivity to salient information. Other studies show that models prioritize the surface form of reasoning over logical soundness; in some cases, longer but flawed reasoning paths yield better final answers than shorter, correct ones (Bentham et al., 2024). Similarly, performance does not scale with problem complexity as expected—models may overthink easy problems and give up on harder ones (Shojaee et al., 2025). Another critical concern is the faithfulness of the reasoning process. Intervention-based studies reveal that final answers often remain unchanged even when intermediate steps are falsified or omitted (Lanham et al., 2023), a phenomenon dubbed the illusion of transparency (Bentham et al., 2024; Chen et al., 2025b). Together, these findings suggest that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text. However, a systematic understanding of why and when CoT reasoning fails is still a mystery.
### 2.3 OOD Generalization of LLMs
Out-of-distribution (OOD) generalization, where test inputs differ from training data, remains a key challenge in machine learning, particularly for large language models (LLMs) (Yang et al., 2023, 2024; Budnikov et al., 2025; Zhang et al., 2024b). Recent studies show that LLMs prompted to learn novel functions often revert to similar functions encountered during pretraining (Wang et al., 2024; Garg et al., 2022). Likewise, LLM generalization frequently depends on mapping new problems onto familiar compositional structures (Song et al., 2025). CoT prompting improves OOD generalization (Wei et al., 2022), with early work demonstrating length generalization for multi-step problems beyond training distributions (Yao et al., 2025; Shen et al., 2025). However, this ability is not inherent to CoT and heavily depends on model architecture and training setups. For instance, strong generalization in arithmetic tasks was achieved only when algorithmic structures were encoded into positional encodings (Cho et al., 2024). Similarly, finer-grained CoT demonstrations during training boost OOD performance, highlighting the importance of data granularity (Wang et al., 2025a). Theoretical and empirical evidence shows that CoT generalizes well only when test inputs share latent structures with training data; otherwise, performance declines sharply (Wang et al., 2025b; Li et al., 2025a). Despite its promise, CoT still struggles with genuinely novel tasks or formats. In light of these findings, we propose rethinking CoT reasoning through a data distribution lens: decomposing CoT into task, length, and format generalization, and systematically investigating each in a controlled setting.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Text Transformation Task Generalization Framework
### Overview
This image is a technical diagram illustrating a framework for evaluating generalization in text transformation tasks. It defines basic atomic elements, specific transformation functions, and then categorizes different types of generalization challenges (Task, Length, Format) that a model might face. The diagram uses a consistent color-coding scheme (defined in a legend) to distinguish between input, output, training, and testing data across various scenarios.
### Components/Axes
The diagram is segmented into several distinct regions:
1. **Top-Left: Basic Definitions**
* **Basic atoms A**: A grid containing the 26 uppercase English letters (A-Z).
* **Element l = 5**: An example element (word) "APPLE" decomposed into its constituent letters: A, P, P, L, E.
* **Transformation Definitions**:
* **f1: ROT Transformation**: Illustrated with a downward arrow and the label "+13". It shows the input "APPLE" transforming to the output "N C C Y R". This represents a ROT13 cipher (each letter shifted 13 places in the alphabet).
* **f2: Cyclic Shift**: Illustrated with a circular arrow and the label "+1". It shows the input "APPLE" transforming to the output "E A P P L". This represents a cyclic shift where the last letter moves to the front.
2. **Top-Right: Legend**
* A box containing four colored circles with labels:
* **Red Circle**: Input
* **Blue Circle**: Output
* **Light Red Square**: Training
* **Light Blue Square**: Testing
3. **Main Right Section: Task Generalization**
This section is divided into three subsections, each exploring a different dimension of generalization.
* **Element (Top-Left of this section)**: A table with three rows (ID, Comp, OOD) and two columns (Training, Testing).
| Category | Training | Testing |
|----------|----------|---------|
| ID (In-Distribution) | A B C D -> A B C D | A B C D -> A B C D |
| Comp (Compositional) | A B C D -> A B C D | D C B A -> A B C D |
| OOD (Out-Of-Distribution) | A B C D -> A B C D | A B C E -> A B C E |
* **Transformation (Top-Right of this section)**: A table with four rows (ID, Comp, POOD, OOD) and columns describing function composition.
| Category | Training | Testing |
|----------|----------|---------|
| ID | f1 | f1 |
| Comp | {f1, f1 o f1, f2, f2 o f1} | f2 o f2 |
| POOD (Probabilistic OOD) | f1, f2 | f1, f2 |
| OOD | f1, f2 | f2, f2 |
A diagram to the right shows a **Training** block (light red) feeding into a function `f_comp`, which produces a **Testing** block (light blue).
* **Length Generalization (Bottom-Left of this section)**:
* A **Text Length** table showing training on length 4 ("A B C D") and testing on lengths 3 ("A B C") and 5 ("A B C D A").
| Training Length | Testing Lengths |
|-----------------|-----------------|
| 4 (A B C D) | 3 (A B C), 5 (A B C D A) |
* A diagram showing a **Training** block (light red) feeding into a function `f_s`, which produces a **Testing** block (light blue).
* A **Reasoning Step** box showing the composition `f1 o f1 o f1` being simplified to `f1` (since `f1 o f1` is the identity for ROT13).
* **Format Generalization (Bottom-Right of this section)**:
* Three rows illustrating different format operations:
| Operation | Training | Testing |
|-----------|----------|---------|
| Insertion | A B C D -> A B C D | A B C D -> A B γ C D |
| Deletion | A B C D -> A B C D | A B C D -> A C D |
| Modify | A B C D -> A B C D | A B C D -> A B C ? |
* A diagram showing a **Training** block (light red) feeding into a function `f_s`, which produces a **Testing** block (light blue).
### Detailed Analysis
* **Spatial Grounding**: The legend is positioned in the top-right corner of the entire image. The "Task Generalization" title is centered above its three subsections. Within "Task Generalization," the "Element" and "Transformation" tables are side-by-side at the top, while "Length Generalization" and "Format Generalization" are side-by-side at the bottom.
* **Color-Coding Consistency**: The red/blue color scheme from the legend is applied consistently. For example, in the "Element" table, the input sequences "A B C D" are in red boxes, and the output sequences are in blue boxes. The training data blocks are light red, and testing data blocks are light blue in the flow diagrams.
* **Transformation Examples**: The f1 (ROT13) transformation is shown to map A->N, P->C, L->Y, E->R. The f2 (Cyclic Shift +1) transformation maps the sequence [A,P,P,L,E] to [E,A,P,P,L].
* **Generalization Taxonomy**: The diagram explicitly categorizes generalization challenges into:
1. **Element Generalization**: Testing on known vs. unknown symbols within the alphabet.
2. **Transformation Generalization**: Testing on seen vs. unseen functions or compositions of functions.
3. **Length Generalization**: Testing on sequences of lengths not seen during training.
4. **Format Generalization**: Testing on inputs/outputs with structural modifications (insertion, deletion, substitution) not present in training.
### Key Observations
* The framework is designed to systematically probe a model's ability to generalize beyond its training distribution across multiple axes (elements, functions, length, format).
* The "Comp" (Compositional) and "POOD" categories suggest a focus on testing systematic understanding, not just memorization. For instance, "Comp" in Transformation tests if the model can apply a novel composition of known functions (`f2 o f2`).
* The "Reasoning Step" box highlights that for certain transformations (like ROT13), understanding the algebraic property (`f1 o f1 = identity`) is key to generalization.
* The "Format Generalization" section tests robustness to noise or variations in the input/output structure, which is a common real-world challenge.
### Interpretation
This diagram outlines a rigorous evaluation protocol for artificial intelligence systems, particularly those designed for symbolic reasoning or algorithmic tasks involving text. It moves beyond simple accuracy metrics on held-out data to assess **systematic generalization**—the ability to apply learned rules to novel situations in a compositional and predictable way.
The framework suggests that true understanding of a task like text transformation requires more than memorizing input-output pairs. A robust system should:
1. Recognize that the underlying rules (like ROT13) apply uniformly to all elements, including unseen ones (OOD Element).
2. Understand functions as composable objects, enabling it to execute or predict the outcome of novel function chains (Comp/POOD Transformation).
3. Apply rules independently of sequence length (Length Generalization).
4. Maintain functional integrity despite superficial format changes (Format Generalization).
The inclusion of "Reasoning Step" implies that the ideal system would not just perform the transformation but also internalize the underlying logic (e.g., the cyclic nature of ROT13). This diagram is likely from a research paper or technical report focused on measuring and improving the **systematicity**, **compositionality**, and **robustness** of neural networks or other AI models on algorithmic tasks.
</details>
Figure 1: Framework of DataAlchemy. It creates an isolated and controlled environment to train LLMs from scratch and probe the task, length, and format generalization.
## 3 The Data Distribution Lens
We propose a fundamental reframing to understand what CoT actually represents. We hypothesize that the underlying mechanism is better understood through the lens of data distribution: rather than executing explicit reasoning procedures, CoT operates as a pattern-matching process that interpolates and extrapolates from the statistical regularities present in its training distribution. Specifically, we posit that CoT’s success stems not from a model’s inherent reasoning capacity, but from its ability to generalize conditionally to out-of-distribution (OOD) test cases that are structurally similar to in-distribution exemplars.
To formalize this view, we model CoT prompting as a conditional generation process constrained by the distributional properties of the training data. Let $D_{\text{train}}$ denote the training distribution over input-output pairs $(x,y)$, where $x$ represents a reasoning problem and $y$ denotes the solution sequence (including intermediate reasoning steps). The model learns an approximation $f_\theta(x) \approx y$ by minimizing the empirical risk over samples drawn from $D_{\text{train}}$.
Let the expected training risk be defined as:
$$
R_{\text{train}}(f_\theta) = \mathbb{E}_{(x,y)\sim D_{\text{train}}}\left[\ell(f_\theta(x), y)\right], \tag{1}
$$
where $\ell$ is a task-specific loss function (e.g., cross-entropy, token-level accuracy). At inference time, given a test input $x_{\text{test}}$ sampled from a potentially different distribution $D_{\text{test}}$, the model generates a response $y_{\text{test}}$ conditioned on patterns learned from $D_{\text{train}}$. The corresponding expected test risk is:
$$
R_{\text{test}}(f_\theta) = \mathbb{E}_{(x,y)\sim D_{\text{test}}}\left[\ell(f_\theta(x), y)\right]. \tag{2}
$$
The degree to which the model generalizes from $D_{\text{train}}$ to $D_{\text{test}}$ is governed by the distributional discrepancy between the two, which we quantify using divergence measures:
**Definition 3.1 (Distributional Discrepancy)**
*Given training distribution $D_{\text{train}}$ and test distribution $D_{\text{test}}$, the distributional discrepancy is defined as:
$$
\Delta(D_{\text{train}}, D_{\text{test}}) = H(D_{\text{train}} \parallel D_{\text{test}}) \tag{3}
$$
where $H(\cdot \parallel \cdot)$ is a divergence measure (e.g., KL divergence, Wasserstein distance) that quantifies the statistical distance between the two distributions.*
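As a concrete instance of the divergence $H(\cdot \parallel \cdot)$ in Eq. (3), the minimal sketch below computes the KL divergence between two toy discrete distributions; the 4-symbol vocabulary and the probability values are hypothetical, chosen only for illustration.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q), one valid choice of H(. || .) in Eq. (3)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two toy next-token distributions over a 4-symbol vocabulary.
p_train = [0.70, 0.20, 0.05, 0.05]   # skewed "training" distribution
p_test  = [0.25, 0.25, 0.25, 0.25]   # uniform "test" distribution

shift = kl_divergence(p_train, p_test)
print(shift)                           # strictly positive: the distributions differ
print(kl_divergence(p_train, p_train)) # 0.0: zero discrepancy for identical distributions
```

The discrepancy is zero exactly when the two distributions coincide, which is the in-distribution case where CoT performance is expected to be strongest.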
**Theorem 3.1 (CoT Generalization Bound)**
*Let $f_\theta$ denote a model trained on $D_{\text{train}}$ with expected training risk $R_{\text{train}}(f_\theta)$. For a test distribution $D_{\text{test}}$, the expected test risk $R_{\text{test}}(f_\theta)$ is bounded by:
$$
R_{\text{test}}(f_\theta) \le R_{\text{train}}(f_\theta) + \Lambda \cdot \Delta(D_{\text{train}}, D_{\text{test}}) + O\left(\sqrt{\frac{\log(1/\delta)}{n}}\right) \tag{4}
$$
where $\Lambda > 0$ is a Lipschitz constant that depends on the model architecture and task complexity, $n$ is the training sample size, and the bound holds with probability $1-\delta$, where $\delta$ is the failure probability.*
The proof is provided in Appendix A.1.
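To see how the bound behaves, the sketch below plugs numbers into the right-hand side of Eq. (4); every constant here ($R_{\text{train}}$, $\Lambda$, $\Delta$, $n$, $\delta$) is an assumption for illustration, not a value measured in this work.

```python
import math

def cot_risk_bound(r_train, lam, delta_dist, n, delta_fail):
    """Right-hand side of Eq. (4): training risk + discrepancy term + estimation term."""
    return r_train + lam * delta_dist + math.sqrt(math.log(1.0 / delta_fail) / n)

# Hypothetical values: identical except for the distributional discrepancy Delta.
in_dist = cot_risk_bound(r_train=0.05, lam=1.2, delta_dist=0.0, n=100_000, delta_fail=0.05)
ood     = cot_risk_bound(r_train=0.05, lam=1.2, delta_dist=0.5, n=100_000, delta_fail=0.05)
print(in_dist, ood)  # the bound loosens linearly in Delta, by lam * Delta
```

With the discrepancy term at zero the bound reduces to training risk plus the usual $O(\sqrt{\log(1/\delta)/n})$ estimation error; any shift $\Delta > 0$ loosens it by $\Lambda \cdot \Delta$.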
Building on this data distribution perspective, we identify three critical dimensions along which distributional shifts can occur, each revealing different aspects of CoT’s pattern-matching nature: ➊ Task generalization examines how well CoT transfers across different types of reasoning tasks. Novel tasks may involve unique elements and underlying logical structures, introducing distributional shifts that challenge the model’s ability to apply learned reasoning patterns. ➋ Length generalization investigates CoT’s robustness to reasoning chains of varying lengths. Since training data typically contains reasoning sequences within a certain length range, test cases requiring substantially longer or shorter reasoning chains represent a distributional shift along the sequence-length dimension. This length discrepancy can arise from the number of reasoning steps or from the text-dependent solution space. ➌ Format generalization explores how sensitive CoT is to variations in prompt formulation and structure. For various reasons (e.g., sophisticated training data or diverse user backgrounds), it is challenging for LLM practitioners to design a golden prompt that elicits knowledge suitable for the case at hand. The detailed definitions and implementations of these dimensions are given in the subsequent sections.
Each dimension provides a unique lens for understanding the boundaries of CoT’s effectiveness and the mechanisms underlying its apparent reasoning capabilities. By systematically varying these dimensions in controlled experimental settings, we can empirically validate our hypothesis that CoT performance degrades predictably as distributional discrepancy increases, thereby revealing its fundamental nature as a pattern-matching rather than reasoning system.
## 4 DataAlchemy: An Isolated and Controlled Environment
To systematically investigate the influence of distributional shifts on CoT reasoning capabilities, we introduce DataAlchemy, a synthetic dataset framework designed for controlled experimentation. This environment enables us to train language models from scratch under precisely defined conditions, allowing for rigorous analysis of CoT behavior across different OOD scenarios. The overview is shown in Figure 1.
### 4.1 Basic Atoms and Elements
Let $A=\{\texttt{A},\texttt{B},\texttt{C},…,\texttt{Z}\}$ denote the alphabet of 26 basic atoms. An element $e$ is defined as an ordered sequence of atoms:
$$
e=(a_0,a_1,\ldots,a_{l-1}), \quad a_i \in A,\; l \in \mathbb{Z}^{+} \tag{5}
$$
This design provides versatile control over the dataset size $|D|=|A|^{l}$ by varying the element length $l$, allowing us to train language models of various capacities. Meanwhile, it also allows us to systematically probe text length generalization capabilities.
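The element space is easy to enumerate exactly; a minimal sketch (helper names are ours) confirms the $|D|=|A|^{l}$ count used later in Section 4.3:

```python
from itertools import product

ALPHABET = [chr(ord("A") + i) for i in range(26)]  # the 26 basic atoms A-Z

def all_elements(length):
    """Yield every element of the given length; there are |A|**length of them."""
    return ("".join(p) for p in product(ALPHABET, repeat=length))

# With l = 4, the dataset holds 26**4 = 456,976 distinct elements.
count = sum(1 for _ in all_elements(4))
print(count)  # 456976
```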
### 4.2 Transformations
A transformation is an operation on elements, $f: e \to \hat{e}$. In this work, we consider two fundamental transformations: the ROT Transformation and the Cyclic Position Shift. To formally define them, we introduce a bijective mapping $\varphi: A \to \mathbb{Z}_{26}$, where $\mathbb{Z}_{26}=\{0,1,\ldots,25\}$, such that $\varphi(c)$ maps a character to its zero-based alphabetical index.
**Definition 4.1 (ROT Transformation)**
*Given an element $e=(a_0,\ldots,a_{l-1})$ and a rotation parameter $n \in \mathbb{Z}$, the ROT Transformation $f_{\text{rot}}$ produces an element $\hat{e}=(\hat{a}_0,\ldots,\hat{a}_{l-1})$. Each atom $\hat{a}_i$ is:
$$
\hat{a}_i = \varphi^{-1}\left((\varphi(a_i)+n) \bmod 26\right) \tag{6}
$$
This operation cyclically shifts each atom $n$ positions forward in alphabetical order. For example, if $e=(\texttt{A},\texttt{P},\texttt{P},\texttt{L},\texttt{E})$ and $n=13$, then $f_{\text{rot}}(e,13)=(\texttt{N},\texttt{C},\texttt{C},\texttt{Y},\texttt{R})$.*
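Definition 4.1 can be sketched in a few lines of Python (a string-based toy version assuming uppercase atoms, not the paper's implementation):

```python
def f_rot(element, n):
    """ROT transformation (Eq. 6): shift each atom n places forward, wrapping mod 26."""
    phi = lambda c: ord(c) - ord("A")            # character -> zero-based index
    phi_inv = lambda i: chr(i % 26 + ord("A"))   # index -> character (mod 26)
    return "".join(phi_inv(phi(a) + n) for a in element)

print(f_rot("APPLE", 13))  # NCCYR, matching the worked example above
```

Note that applying ROT13 twice returns the original element, the algebraic identity the length-generalization probes rely on.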
**Definition 4.2 (Cyclic Position Shift)**
*Given an element $e=(a_0,\ldots,a_{l-1})$ and a shift parameter $n \in \mathbb{Z}$, the Cyclic Position Shift $f_{\text{pos}}$ produces an element $\hat{e}=(\hat{a}_0,\ldots,\hat{a}_{l-1})$. Each atom $\hat{a}_i$ is defined by a cyclic shift of indices:
$$
\hat{a}_i = a_{(i-n) \bmod l} \tag{7}
$$
This transformation cyclically shifts the positions of the atoms within the sequence by $n$ positions to the right. For instance, if $e=(\texttt{A},\texttt{P},\texttt{P},\texttt{L},\texttt{E})$ and $n=1$, then $f_{\text{pos}}(e,1)=(\texttt{E},\texttt{A},\texttt{P},\texttt{P},\texttt{L})$.*
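Definition 4.2 admits an equally small sketch (again a string-based toy version of ours):

```python
def f_pos(element, n):
    """Cyclic position shift (Eq. 7): move every atom n positions to the right, wrapping."""
    l = len(element)
    return "".join(element[(i - n) % l] for i in range(l))

print(f_pos("APPLE", 1))  # EAPPL, matching the worked example above
```

Shifting by a multiple of the element length is the identity, so $f_{\text{pos}}$ also has exploitable algebraic structure.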
**Definition 4.3 (Generalized Compositional Transformation)**
*To model multi-step reasoning, we define a compositional transformation as the successive application of a sequence of operations. Let $S=(f_1,f_2,\ldots,f_k)$ be a sequence of operations, where each $f_i$ is one of the fundamental transformations in $F=\{f_{\text{rot}}, f_{\text{pos}}\}$ with its respective parameters. The compositional transformation $f_S$ for the sequence $S$ is the function composition:
$$
f_S = f_k \circ f_{k-1} \circ \cdots \circ f_1 \tag{8}
$$
The resulting element $\hat{e}$ is obtained by applying the operations sequentially to an initial element $e$:
$$
\hat{e} = f_k(f_{k-1}(\cdots(f_1(e))\cdots)) \tag{9}
$$*
This design enables the construction of arbitrarily complex transformation chains by varying the type, parameters, order, and length of operations within the sequence. At sampling time, we can naturally acquire the CoT reasoning steps by decomposing the intermediate process:
$$
\underbrace{f_S(e):}_{\text{Query}}\ \underbrace{e \xrightarrow{f_1} e^{(1)} \xrightarrow{f_2} e^{(2)} \cdots \xrightarrow{f_{k-1}} e^{(k-1)} \xrightarrow{f_k}}_{\text{CoT reasoning steps}}\ \underbrace{\boxed{\hat{e}}}_{\text{Answer}} \tag{10}
$$
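A minimal sketch of how a compositional chain exposes its intermediate elements as CoT steps; the helper names are ours, and the two transformations follow Definitions 4.1 and 4.2 with $n=13$ and $n=1$:

```python
def f_rot(element, n=13):
    """ROT13 on uppercase atoms (Definition 4.1)."""
    return "".join(chr((ord(c) - 65 + n) % 26 + 65) for c in element)

def f_pos(element, n=1):
    """Cyclic position shift to the right (Definition 4.2)."""
    return "".join(element[(i - n) % len(element)] for i in range(len(element)))

def apply_chain(element, ops):
    """Apply f_S = f_k o ... o f_1, recording every intermediate element (Eq. 10)."""
    steps = [element]
    for f in ops:
        steps.append(f(steps[-1]))
    return steps

# A 2-step chain f2 o f1 on a length-4 element: the middle entries are the CoT steps.
chain = apply_chain("APPL", [f_rot, f_pos])
print(" -> ".join(chain))  # APPL -> NCCY -> YNCC
```

The query is the pair (element, chain), the reasoning steps are the interior elements, and the final entry is the answer, exactly the decomposition in Eq. (10).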
### 4.3 Environment Setting
Through systematic manipulation of elements and transformations, DataAlchemy offers a flexible and controllable framework for training LLMs from scratch, facilitating rigorous investigation of diverse OOD scenarios. Unless otherwise specified, we employ a decoder-only GPT-2 language model (Radford et al., 2019) configured with 4 layers, 32 hidden dimensions, and 4 attention heads, together with a Byte-Pair Encoding (BPE) tokenizer; both follow the general modern LLM pipeline. At inference time, we set the temperature to 1e-5. For rigor, we also study LLMs with various parameter counts, architectures, and temperatures in Section 8. Implementation details are provided in Appendix B. Each element consists of 4 basic atoms, which yields $26^4 = 456{,}976$ samples for each dataset with varied transformations and token amounts. We initialize the two transformations as $f_1=f_{\text{rot}}(\cdot,13)$ and $f_2=f_{\text{pos}}(\cdot,1)$. We adopt the exact match rate, Levenshtein distance (i.e., edit distance) (Yujian and Bo, 2007), and BLEU score (Papineni et al., 2002) as metrics, evaluating the produced reasoning steps, the answer, and the full chain. Examples of the datasets and evaluations are shown in Appendix C.
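Of the three metrics, edit distance is the least obvious to compute by eye; a minimal dynamic-programming sketch (our own helper, not the paper's implementation) shows how prediction errors are scored:

```python
def levenshtein(a, b):
    """Edit distance between a predicted sequence and the reference sequence."""
    prev = list(range(len(b) + 1))           # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = curr
    return prev[-1]

print(levenshtein("NCCY", "NCCY"))  # 0: exact match
print(levenshtein("NCCY", "NCCA"))  # 1: one wrong atom
```

Exact match corresponds to distance 0, so the two metrics agree on perfect outputs while edit distance grades partial failures more smoothly.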
## 5 Task Generalization
Task generalization represents a fundamental challenge for CoT reasoning, as it directly tests a model’s ability to apply learned concepts and reasoning patterns to unseen scenarios. In our controlled experiments, both the transformations and the elements can be novel. Accordingly, we decompose task generalization into two primary dimensions: element generalization and transformation generalization.
Task Generalization Complexity. Guided by the data distribution lens, we first introduce a measure for generalization difficulty:
**Proposition 5.1 (Task Generalization Complexity)**
*For a reasoning chain $f_S$ operating on elements $e=(a_0,\ldots,a_{l-1})$, define:
$$
TGC(C) = \alpha\sum_{i=1}^{m}\mathbb{I}\left[a_i \notin E^{i}_{\text{train}}\right] + \beta\sum_{j=1}^{n}\mathbb{I}\left[f_j \notin F_{\text{train}}\right] + \gamma\,\mathbb{I}\left[(f_1,f_2,\ldots,f_k) \notin P_{\text{train}}\right] + C_T \tag{11}
$$
as a measurement of the task discrepancy $\Delta_{\text{task}}$, where $\alpha, \beta, \gamma$ are weighting parameters for the different novelty types and $C_T$ is a task-specific constant. $E^{i}_{\text{train}}$, $F_{\text{train}}$, and $P_{\text{train}}$ denote the position-wise element set, the transformation set, and the set of transformation orders used during training.*
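A toy, executable reading of Eq. (11): indicator sums over novel atoms, novel transformations, and a novel transformation order. All sets, weights, and the example inputs below are illustrative assumptions, not the paper's configuration.

```python
def tgc(element, ops, train_elems, train_ops, train_orders,
        alpha=1.0, beta=1.0, gamma=1.0, c_t=0.0):
    """Toy version of Eq. (11): weighted count of the three kinds of novelty."""
    novel_atoms = sum(a not in train_elems[i] for i, a in enumerate(element))
    novel_ops = sum(f not in train_ops for f in ops)
    novel_order = tuple(ops) not in train_orders
    return alpha * novel_atoms + beta * novel_ops + gamma * novel_order + c_t

# Hypothetical training regime: atoms A-D at every position, op "f1", chain ("f1",).
train_elems = [set("ABCD")] * 4
score_id  = tgc("ABCD", ["f1"], train_elems, {"f1"}, {("f1",)})  # fully in-distribution
score_ood = tgc("ABCE", ["f2"], train_elems, {"f1"}, {("f1",)})  # novel atom, op, order
print(score_id, score_ood)  # 0.0 3.0
```

A fully in-distribution query scores $C_T$ (here 0), while each kind of novelty adds its weighted penalty, so higher $TGC$ corresponds to larger task discrepancy.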
We establish a critical threshold beyond which CoT reasoning fails exponentially:
**Theorem 5.1 (Task Generalization Failure Threshold)**
*There exists a threshold $τ$ such that when $TGC(C)>τ$ , the probability of correct CoT reasoning drops exponentially:
$$
P(\text{correct} \mid C) \le e^{-\delta\,(TGC(C)-\tau)} \tag{12}
$$*
The proof is provided in Appendix A.2.
### 5.1 Transformation Generalization
Transformation generalization evaluates whether CoT reasoning transfers effectively when models encounter novel transformations at test time, an especially prevalent scenario in real-world applications.
Figure 2: Performance of CoT reasoning on transformation generalization. Efficacy of CoT reasoning declines as the degree of distributional discrepancy increases.
Experimental Setup. To systematically evaluate the impact of transformations, we conduct experiments by varying transformations between the training and testing sets while keeping other factors constant (e.g., elements, length, and format). Guided by the intuition formalized in Proposition 5.1, we define four incremental levels of distribution shift in transformations, as shown in Figure 1: (i) In-Distribution (ID): the transformations in the test set are identical to those observed during training, e.g., $f_1\circ f_1\to f_1\circ f_1$. (ii) Composition (CMP): test samples comprise novel compositions of previously encountered transformations, though each individual transformation remains familiar, e.g., $\{f_1\circ f_1, f_1\circ f_2, f_2\circ f_1\}\to f_2\circ f_2$. (iii) Partial Out-of-Distribution (POOD): test data include compositions involving at least one novel transformation not seen during training, e.g., $f_1\circ f_1\to f_1\circ f_2$. (iv) Out-of-Distribution (OOD): the test set contains entirely novel transformation types that are unseen in training, e.g., $f_1\circ f_1\to f_2\circ f_2$.
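The four shift levels above can be sketched as a small classifier over transformation compositions. This is an illustrative sketch under assumed tuple representations, not the authors' evaluation code.

```python
def shift_level(test_comp, train_comps):
    """Classify a test-time composition (tuple of transformation names,
    e.g. ("f1", "f2")) against the set of training compositions."""
    train_atoms = {f for comp in train_comps for f in comp}
    if test_comp in train_comps:
        return "ID"                      # composition seen during training
    novel = [f for f in test_comp if f not in train_atoms]
    if not novel:
        return "CMP"                     # known parts, unseen composition
    if len(novel) < len(test_comp):
        return "POOD"                    # at least one novel transformation
    return "OOD"                         # entirely novel transformations

train = {("f1", "f1"), ("f1", "f2"), ("f2", "f1")}
print(shift_level(("f1", "f1"), train))  # -> ID
print(shift_level(("f2", "f2"), train))  # -> CMP
print(shift_level(("f1", "f3"), train))  # -> POOD
print(shift_level(("f3", "f3"), train))  # -> OOD
```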
Table 1: Full chain evaluation under different scenarios for transformation generalization.
| Train → Test | Scenario | Exact Match | Edit Distance | BLEU |
| --- | --- | --- | --- | --- |
| $f_1\circ f_1\to f_1\circ f_1$ | ID | 100.00% | 0.0000 | 1.0000 |
| $\{f_2\circ f_2, f_1\circ f_2, f_2\circ f_1\}\to f_1\circ f_1$ | CMP | 0.01% | 0.1326 | 0.6867 |
| $f_1\circ f_2\to f_1\circ f_1$ | POOD | 0.00% | 0.1671 | 0.4538 |
| $f_2\circ f_2\to f_1\circ f_1$ | OOD | 0.00% | 0.2997 | 0.2947 |
Findings. Figure 2 illustrates the performance of the full chain under different degrees of distribution discrepancy, computed by the task generalization complexity (normalized between 0 and 1) in Proposition 5.1. In general, the effectiveness of CoT reasoning decreases as the distribution discrepancy increases. For instance, as shown in Table 1, moving from in-distribution to composition, POOD, and OOD when tested on data with transformation $f_1\circ f_1$, the exact match drops from 100% to 0.01%, 0%, and 0%, while the edit distance grows from 0 to 0.13 and 0.17. Beyond ID, LLMs fail to produce a correct full chain in most cases, although they occasionally produce correct CoT reasoning under some composition and POOD conditions by accident. As shown in Table 2, when generalizing from $\{f_1\circ f_1, f_1\circ f_2, f_2\circ f_1\}$ to $f_2\circ f_2$, the LLMs correctly answer only 0.01% of questions. A close examination reveals that these successes are coincidences: for example, the query element A, N, A, N happens to yield the same result under both operations, as detailed in Appendix D.1. When we further break the full chain into reasoning steps and answers, we observe a strong inconsistency between them. For example, under the composition generalization setting, the reasoning steps are entirely correct on the test distributions $f_1\circ f_1$ and $f_2\circ f_2$, yet the final answers are wrong. Probing these inconsistent cases in Appendix D.1, we find that when a novel transformation (say $f_1\circ f_1$) is present, LLMs generalize the reasoning path from the most similar one seen during training (i.e., $f_1\circ f_2$), which yields a correct-looking reasoning path but an incorrect answer, echoing the example in the introduction. Similarly, generalization from $f_1\circ f_2$ to $f_2\circ f_1$, or vice versa, allows LLMs to produce correct answers, attributable to the commutative property of the two orthogonal transformations, paired with unfaithful reasoning paths.
Collectively, the above results indicate that CoT reasoning fails to generalize to novel transformations, not even to novel compositions of seen transformations. Rather than demonstrating a true understanding of the task, CoT reasoning under task shift appears to replicate patterns learned during training.
Table 2: Evaluation of different components of CoT reasoning on transformation generalization. CoT reasoning shows inconsistency between the reasoning steps and answers.
| Train → Test | EM (Reasoning) | EM (Answer) | EM (Full Chain) | Edit Dist. (Reasoning) | Edit Dist. (Answer) | Edit Dist. (Full Chain) |
| --- | --- | --- | --- | --- | --- | --- |
| $\{f_1\circ f_1, f_1\circ f_2, f_2\circ f_1\}\to f_2\circ f_2$ | 100.00% | 0.01% | 0.01% | 0.000 | 0.481 | 0.133 |
| $\{f_1\circ f_2, f_2\circ f_1, f_2\circ f_2\}\to f_1\circ f_1$ | 100.00% | 0.01% | 0.01% | 0.000 | 0.481 | 0.133 |
| $f_1\circ f_2\to f_2\circ f_1$ | 0.00% | 100.00% | 0.00% | 0.373 | 0.000 | 0.167 |
| $f_2\circ f_1\to f_1\circ f_2$ | 0.00% | 100.00% | 0.00% | 0.373 | 0.000 | 0.167 |
Experiment settings. To further probe when CoT reasoning can generalize to unseen transformations, we conduct supervised fine-tuning (SFT) on a small portion $λ$ of unseen data. In this way, we can decrease the distribution discrepancy between the training and test sets, which might help LLMs to generalize to test queries.
Findings. As shown in Figure 3, a very small portion of unseen data ( $λ=1.5\times 10^{-4}$ ) is generally enough for the model to quickly generalize to unseen transformations. The smaller the discrepancy between the training and test data, the faster the model generalizes. This indicates that similar patterns appearing in the training data help LLMs generalize to the test set.
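The SFT mixing step can be sketched as follows: blend a fraction $λ$ of unseen-distribution examples into the training pool. This is a minimal illustration with hypothetical data names, not the authors' training pipeline, and it assumes $λ$ is measured relative to the size of the original training set.

```python
import random

def mix_sft_data(train_set, unseen_set, lam, seed=0):
    """Return the training pool augmented with a lam-fraction (of the
    training-set size) of examples drawn from the unseen distribution."""
    rng = random.Random(seed)
    k = max(1, round(lam * len(train_set)))
    return train_set + rng.sample(unseen_set, k)

base = [f"seen-{i}" for i in range(100_000)]      # hypothetical ID data
unseen = [f"unseen-{i}" for i in range(1_000)]    # hypothetical OOD data
mixed = mix_sft_data(base, unseen, lam=1.5e-4)
print(len(mixed) - len(base))  # -> 15 unseen examples added
```

Even such a tiny injection shifts the training distribution toward the test queries, which is consistent with the rapid generalization observed in Figure 3.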
Figure 3: Performance on unseen transformations using SFT under various levels of distribution shift. Introducing a small amount of unseen data helps CoT reasoning generalize across different scenarios.
### 5.2 Element Generalization
Element generalization is another critical factor to consider when LLMs try to generalize to new tasks.
Experiment settings. Similar to transformation generalization, we fix the other factors and consider three progressive distribution shifts for elements, as shown in Figure 1: ID, CMP, and OOD. Note that under composition, we test whether CoT reasoning generalizes to novel combinations of atoms when all basic atoms in the elements have been seen, e.g., $(\texttt{A},\texttt{B},\texttt{C},\texttt{D})→(\texttt{B},\texttt{C},\texttt{D},\texttt{A})$. CMP can be further stratified by the atom order of the combination, measured by the edit distance $n$. Under OOD, the atoms that constitute the elements are entirely unseen during training.
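The distance between a recombined element and a training element can be measured with an edit distance over atom sequences. The paper's exact definition of $n$ may differ; the sketch below uses the standard Levenshtein distance for illustration.

```python
def edit_distance(a, b):
    """Classic single-row dynamic-programming Levenshtein distance
    between two atom sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

# The CMP example from the text: a cyclic shift of the four seen atoms.
print(edit_distance(("A", "B", "C", "D"), ("B", "C", "D", "A")))  # -> 2
```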
Figure 4: Element generalization results on various scenarios and relations.
Findings. Similar to transformation generalization, performance degrades sharply under distribution shift, consistently across all transformations, as shown in Figure 4. From ID to CMP and OOD, the exact match decreases from 100% to 0% in all cases. Most strikingly, under OOD the BLEU score drops to 0 for the $f_1$ and $f_2$ transformations; a failure case in Appendix D.1 shows that the model cannot produce any output when novel elements are present. We further explore when CoT reasoning can generalize to novel elements by conducting SFT; the results are summarized in Figure 5. We evaluate the exact match of the full chain under CMP scenarios stratified by the edit distance $n$. The result mirrors SFT on transformations: performance increases rapidly when similar examples (small $n$) are present in the training data. Interestingly, the exact match of CoT reasoning tracks the lower bound of performance when $n=3$, which might suggest that the generalization of CoT reasoning to novel elements is very limited, even with SFT on the downstream task. Further analyzing the exact match of the reasoning steps, answers, and full chain during training for $n=3$, summarized in Figure 5b, we find a mismatch in accuracy between the answers and the reasoning steps during the training process, which may partly explain why CoT reasoning is inconsistent in some cases.
(a) Performance on unseen elements via SFT under various CMP scenarios.
(b) Evaluation of CoT reasoning in SFT.
Figure 5: SFT performance for element generalization. SFT helps the model generalize to novel elements.
## 6 Length Generalization
Length generalization examines how CoT reasoning degrades when models encounter test cases that differ in length from their training distribution. The difference in length can arise from the text space or the reasoning space of the problem. Therefore, we decompose length generalization into two complementary aspects: text length generalization and reasoning step generalization. Guided by this intuition, we first propose a measure of the length discrepancy.
Length Extrapolation Bound. We establish a Gaussian degradation bound for length extrapolation:
**Proposition 6.1 (Length Extrapolation Gaussian Degradation)**
*For a model trained on chain-of-thought sequences of fixed length $L_{train}$, the generalization error at test length $L$ degrades following a Gaussian profile:
$$
E(L)=E_0+\left(1-E_0\right)\cdot\left(1-\exp\left(-\frac{\left(L-L_{train}\right)^2}{2\sigma^2}\right)\right) \tag{13}
$$
where $E_0$ is the in-distribution error at $L=L_{train}$, $\sigma$ is the length generalization width parameter, and $L$ is the test sequence length.*
The proof is provided in Appendix A.3.
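As a quick numeric sanity check, Eq. (13) can be evaluated directly; the values of $E_0$ and $\sigma$ below are arbitrary illustrations, not fitted parameters from the paper.

```python
import math

def length_error(L, L_train=4, E0=0.02, sigma=1.0):
    """Generalization error bound of Eq. (13): error rises from E0 at
    L = L_train toward 1 as |L - L_train| grows."""
    return E0 + (1 - E0) * (1 - math.exp(-((L - L_train) ** 2)
                                         / (2 * sigma ** 2)))

for L in range(2, 7):
    print(L, round(length_error(L), 3))
```

The curve is minimal at the training length and rises symmetrically on both sides, matching the degradation pattern observed in the text-length experiments below.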
### 6.1 Text Length Generalization
Text length generalization evaluates how CoT performance varies when the input text length (i.e., the element length $l$) differs from the training examples. Given how LLMs process long text, this aspect is crucial: real-world problems often involve varying degrees of complexity that manifest as differences in problem statement length, context size, or information density.
Experiment settings. We pre-train LLMs on data with text length fixed at $l=4$ while holding other factors constant, and evaluate performance across a range of lengths. We consider three padding strategies during pre-training: (i) None: no padding is used. (ii) Padding: each sequence is padded to the maximum length of the context window. (iii) Group: texts are concatenated and split into segments of a maximum length.
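A minimal sketch of the group strategy, assuming it follows the standard concatenate-and-chunk preprocessing; the function name is ours:

```python
def group_texts(tokenized_examples, max_length):
    """Concatenate tokenized examples into one stream, then split it
    into fixed-size segments; the remainder shorter than max_length
    is dropped, so no padding tokens are needed."""
    stream = [tok for example in tokenized_examples for tok in example]
    usable = (len(stream) // max_length) * max_length
    return [stream[i:i + max_length] for i in range(0, usable, max_length)]

print(group_texts([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_length=4))
# [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Unlike padding to the context-window length, every position in each segment carries real text, which is the property the findings below attribute the improvement to.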
Table 3: Evaluation for text length generalization.
| Text Length $l$ | Exact Match (None) | Exact Match (Padding) | Exact Match (Group) | Edit Distance (None) | Edit Distance (Padding) | Edit Distance (Group) | BLEU (None) | BLEU (Padding) | BLEU (Group) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 0.00% | 0.00% | 0.00% | 0.3772 | 0.4969 | 0.5000 | 0.4214 | 0.1186 | 0.0000 |
| 3 | 0.00% | 0.00% | 0.00% | 0.2221 | 0.3203 | 0.2540 | 0.5471 | 0.1519 | 0.0000 |
| 4 | 100.00% | 100.00% | 100.00% | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
| 5 | 0.00% | 0.00% | 0.00% | 0.1818 | 0.2667 | 0.2000 | 0.6220 | 0.1958 | 0.2688 |
| 6 | 0.00% | 0.00% | 0.00% | 0.3294 | 0.4816 | 0.3337 | 0.4763 | 0.1174 | 0.2077 |
Findings. As illustrated in Table 3, CoT reasoning fails outright on test cases of unseen lengths, even though those lengths represent only a mild distribution shift. Moreover, performance declines as the length discrepancy increases, as shown in Figure 6: moving from $l=4$ to $l=3$ or $l=5$, the BLEU score drops from 1 to 0.55 and 0.62, respectively. Examples in Appendix D.1 indicate that LLMs attempt to produce CoT reasoning of the same length as the training data by adding or removing tokens in the reasoning chains, and the efficacy of length generalization deteriorates as the discrepancy grows. We also consider different padding strategies to reduce the divergence between training data and test cases. Padding to the maximum length does not improve length generalization; however, performance increases when the padding is replaced with text via the group strategy, indicating its effectiveness.
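For reference, the edit-distance numbers reported here are consistent with a length-normalized Levenshtein distance (0 = identical, 1 = maximally different); below is a sketch of that standard metric, with the exact normalization being an assumption on our part:

```python
def normalized_edit_distance(a, b):
    """Levenshtein distance between two sequences, normalized by
    the longer length so that 0 = identical and 1 = fully different."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))  # DP row for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / max(m, n)

print(normalized_edit_distance("abcd", "abcd"))  # 0.0
print(normalized_edit_distance("abcd", "abed"))  # 0.25
```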
<details>
<summary>x7.png Details</summary>

### Visual Description
## Dual-Axis Grouped Bar Chart: BLEU Score and Edit Distance by Text Length
### Overview
This image displays a grouped bar chart with dual vertical axes, comparing three different methods ("None", "Group", "Padding") across five discrete text lengths (2, 3, 4, 5, 6). The chart measures two distinct performance metrics simultaneously: BLEU Score (left axis) and Edit Distance (right axis). The bars for each method are visually distinguished by color and pattern.
### Components/Axes
* **Chart Type:** Grouped Bar Chart with Dual Y-Axes.
* **X-Axis:**
* **Label:** "Text Length"
* **Categories/Ticks:** 2, 3, 4, 5, 6
* **Primary Y-Axis (Left):**
* **Label:** "BLEU Score"
* **Scale:** Linear, from 0.0 to 1.0.
* **Ticks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
* **Secondary Y-Axis (Right):**
* **Label:** "Edit Distance"
* **Scale:** Linear, but **inverted** (0.00 at top, 1.00 at bottom).
* **Ticks:** 0.00, 0.20, 0.40, 0.60, 0.80, 1.00
* **Legend:**
* **Position:** Bottom-left corner of the plot area.
* **Items:**
1. **None:** Purple bar with diagonal hatching (\\).
2. **Group:** Red bar with cross-hatching (X).
3. **Padding:** Blue bar with diamond/checkerboard hatching.
* **Data Representation:** For each text length on the x-axis, three bars are grouped together, one for each method in the legend. The height of each bar corresponds to the BLEU Score (read from the left axis). The same bar's vertical position from the top corresponds to the Edit Distance (read from the inverted right axis).
### Detailed Analysis
**Data Point Extraction (Approximate Values):**
Values are estimated based on bar height/position relative to the axes. The dual-axis nature means each bar encodes two values.
| Text Length | Method | BLEU Score (Left Axis) | Edit Distance (Right Axis) |
| :--- | :--- | :--- | :--- |
| **2** | None | ~0.42 | ~0.58 |
| | Group | ~0.49 | ~0.51 |
| | Padding | ~0.41 | ~0.59 |
| **3** | None | ~0.55 | ~0.45 |
| | Group | ~0.72 | ~0.28 |
| | Padding | ~0.54 | ~0.46 |
| **4** | None | ~0.80 | ~0.20 |
| | Group | **~1.00** | **~0.00** |
| | Padding | ~0.80 | ~0.20 |
| **5** | None | ~0.62 | ~0.38 |
| | Group | ~0.62 | ~0.38 |
| | Padding | ~0.62 | ~0.38 |
| **6** | None | ~0.48 | ~0.52 |
| | Group | ~0.51 | ~0.49 |
| | Padding | ~0.47 | ~0.53 |
**Trend Verification:**
* **BLEU Score Trend (Left Axis):** For all methods, the score generally increases from length 2 to a peak at length 4, then decreases at lengths 5 and 6. The "Group" method shows the most dramatic peak.
* **Edit Distance Trend (Right Axis, Inverted):** Because the axis is inverted, a *lower* bar visually indicates a *higher* Edit Distance value. The trend is inverse to BLEU Score: Edit Distance is lowest (best) at length 4 and higher (worse) at lengths 2, 5, and 6. "Group" achieves the best (lowest) Edit Distance at length 4.
### Key Observations
1. **Peak Performance at Length 4:** All methods achieve their best combined performance (highest BLEU, lowest Edit Distance) at a text length of 4. The "Group" method reaches a perfect BLEU Score of 1.0 and an Edit Distance of 0.00 at this point.
2. **"Group" Method Superiority:** The "Group" method (red, cross-hatched) consistently outperforms or matches the other two methods across all text lengths for both metrics, with the most significant advantage at lengths 3 and 4.
3. **Convergence at Length 5:** At text length 5, all three methods show nearly identical performance for both BLEU Score (~0.62) and Edit Distance (~0.38).
4. **Performance Degradation with Length:** Beyond the optimal length of 4, performance for all methods degrades as text length increases to 5 and 6, indicated by falling BLEU Scores and rising Edit Distances.
5. **"None" and "Padding" Similarity:** The "None" (purple) and "Padding" (blue) methods exhibit very similar performance profiles across all text lengths, often with values within 0.01-0.02 of each other.
### Interpretation
This chart likely evaluates different text processing or generation strategies ("None", "Group", "Padding") on tasks where output quality is measured by both similarity to a reference (BLEU) and amount of correction needed (Edit Distance). The inverted Edit Distance axis cleverly aligns "better" performance (lower distance) with a higher visual position on the chart, making it intuitive to see that the tallest bars represent the best overall results.
The data suggests a "sweet spot" at a text length of 4, where the task is neither too short (lacking context) nor too long (introducing complexity or error propagation). The "Group" method's exceptional performance at this length implies it is particularly effective at leveraging structure or context within sequences of this moderate length. The convergence of all methods at length 5 and their subsequent decline at length 6 indicates that for longer sequences, the choice of method becomes less critical, and the inherent difficulty of the task dominates. The near-identical performance of "None" and "Padding" suggests that simple padding offers no significant advantage over doing nothing for this specific task and metric set.
</details>
Figure 6: Performance of text length generalization across various padding strategies. Group strategies contribute to length generalization.
### 6.2 Reasoning Step Generalization
Reasoning step generalization investigates whether models can extrapolate to reasoning chains requiring a different number of steps $k$ from those observed during training, a common setting in multi-step reasoning tasks.
Experiment settings. Similar to text length generalization, we first pre-train the LLM with reasoning step $k=2$ , and evaluate on data with reasoning step $k=1$ or $k=3$ .
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Exact Match Percentage vs. Data Percentage for Two k-Values
### Overview
The image is a line chart comparing the performance of two models or configurations, labeled `k=1` and `k=2`, as a function of the amount of data used. The performance metric is "Exact Match (%)". The chart demonstrates a clear inverse relationship between the two series as the data percentage increases.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Data Percentage"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Tick Marks:** 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.
* **Y-Axis:**
* **Label:** "Exact Match (%)"
* **Scale:** Linear, ranging from 0 to approximately 95 (the top grid line is at 80, but data exceeds it).
* **Tick Marks:** 0, 20, 40, 60, 80.
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Series 1:** `k=1` - Represented by a solid blue line with circular markers.
* **Series 2:** `k=2` - Represented by a dashed orange line with square markers.
### Detailed Analysis
**Data Series: k=1 (Blue Line, Circles)**
* **Trend:** Shows a strong, consistent upward trend. Performance improves monotonically as the data percentage increases.
* **Data Points (Approximate):**
* (0.0, 0%)
* (0.1, ~32%)
* (0.2, ~54%)
* (0.3, ~65%)
* (0.4, ~70%)
* (0.5, ~80%)
* (0.6, ~88%)
* (0.7, ~91%)
* (0.8, ~92%)
* (0.9, ~93%)
* (1.0, ~94%)
**Data Series: k=2 (Orange Dashed Line, Squares)**
* **Trend:** Shows an initial high performance that peaks early, followed by a steep, consistent decline. Performance degrades significantly as more data is added beyond a certain point.
* **Data Points (Approximate):**
* (0.0, ~92%)
* (0.1, ~93%) *[Peak]*
* (0.2, ~91%)
* (0.3, ~87%)
* (0.4, ~70%) *[Crossover point with k=1]*
* (0.5, ~39%)
* (0.6, ~11%)
* (0.7, ~4%)
* (0.8, ~1%)
* (0.9, ~0%)
* (1.0, ~0%)
### Key Observations
1. **Inverse Relationship:** The two series exhibit a near-perfect inverse relationship. As `k=1` performance rises, `k=2` performance falls.
2. **Crossover Point:** The lines intersect at a Data Percentage of approximately **0.4**, where both achieve an Exact Match of about **70%**.
3. **Performance Ceiling/Floor:** `k=1` approaches a performance ceiling near 94% with full data. `k=2` approaches a performance floor near 0% with full data.
4. **Early Peak for k=2:** The `k=2` configuration achieves its maximum performance with only 10% of the data.
5. **Steep Degradation:** The decline for `k=2` is particularly steep between 0.4 and 0.6 Data Percentage, dropping from ~70% to ~11%.
### Interpretation
This chart illustrates a fundamental trade-off, likely related to model complexity or capacity, parameterized by `k`.
* **`k=1`** appears to represent a **high-capacity or data-hungry model**. It starts with no capability (0% at 0% data) but effectively learns and generalizes as it is exposed to more data, showing a classic learning curve that plateaus as it approaches the dataset's limit.
* **`k=2`** appears to represent a **low-capacity or highly constrained model**. It performs exceptionally well on very little data (possibly due to strong inductive biases or memorization of a small set), but it cannot scale. As the data volume increases, the model becomes overwhelmed, fails to generalize, and its performance collapses. This is a potential sign of **underfitting** in the face of increasing data complexity or **catastrophic interference**.
The crossover at 40% data is critical. It suggests that for small datasets, the constrained `k=2` approach is superior. However, for any application where more than 40% of the available data can be utilized, the `k=1` approach is decisively better. The choice between them depends entirely on the expected data regime.
</details>
(a) Reasoning step. From k=2 to k=1
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Exact Match Percentage vs. Data Percentage for k=2 and k=3
### Overview
The image is a line chart comparing the performance of two models or configurations, labeled "k=2" and "k=3", across varying amounts of data. The chart plots the "Exact Match (%)" on the vertical axis against the "Data Percentage" on the horizontal axis. The blue line (k=3) shows a positive correlation, while the orange line (k=2) shows a strong negative correlation.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis (Horizontal):**
* **Label:** "Data Percentage"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Tick Marks:** Labeled at every 0.1 increment (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0).
* **Y-Axis (Vertical):**
* **Label:** "Exact Match (%)"
* **Scale:** Linear, ranging from 0 to 80.
* **Tick Marks:** Labeled at every 10% increment (0, 10, 20, 30, 40, 50, 60, 70, 80).
* **Legend:**
* **Position:** Bottom-right corner of the chart area.
* **Series 1:** Blue line with circular markers, labeled "k=3".
* **Series 2:** Orange line with square markers, labeled "k=2".
* **Grid:** Light gray gridlines are present for both major x and y ticks.
### Detailed Analysis
**Data Series: k=3 (Blue Line, Circle Markers)**
* **Trend:** The line starts near 0% and exhibits a slow, gradual increase until approximately 0.5 Data Percentage. After this point, the slope increases dramatically, showing a steep, near-linear rise to its peak.
* **Approximate Data Points:**
* Data % 0.0: ~0%
* Data % 0.1: ~0%
* Data % 0.2: ~2%
* Data % 0.3: ~6%
* Data % 0.4: ~10%
* Data % 0.5: ~9% (slight dip)
* Data % 0.6: ~12%
* Data % 0.7: ~29%
* Data % 0.8: ~49%
* Data % 0.9: ~63%
* Data % 1.0: ~73%
**Data Series: k=2 (Orange Line, Square Markers)**
* **Trend:** The line starts at its maximum value and exhibits a steep, consistent decline until approximately 0.5 Data Percentage. After this point, the decline halts, and the line fluctuates at a low level before dropping to near zero at the end.
* **Approximate Data Points:**
* Data % 0.0: ~80%
* Data % 0.1: ~54%
* Data % 0.2: ~35%
* Data % 0.3: ~23%
* Data % 0.4: ~17%
* Data % 0.5: ~12%
* Data % 0.6: ~12%
* Data % 0.7: ~16%
* Data % 0.8: ~19%
* Data % 0.9: ~16%
* Data % 1.0: ~0%
### Key Observations
1. **Inverse Relationship:** The two series demonstrate a clear inverse relationship. As the data percentage increases, the performance of k=3 improves significantly, while the performance of k=2 degrades severely.
2. **Crossover Point:** The lines intersect between Data Percentage 0.5 and 0.6, at an Exact Match value of approximately 12%. This is the point where both configurations perform equally.
3. **Performance Extremes:** k=2 achieves its best performance (~80%) with minimal data (0.0), while k=3 achieves its best performance (~73%) with the full dataset (1.0).
4. **Mid-Range Behavior:** Between 0.5 and 0.9 Data Percentage, k=2 shows a minor recovery and plateau (hovering between ~12% and ~19%) before its final drop, while k=3 enters its phase of rapid improvement.
### Interpretation
This chart likely illustrates a fundamental trade-off in a machine learning or information retrieval context, possibly related to the "k" in a k-nearest neighbors algorithm or a similar parameter.
* **k=2 (Low k):** This configuration appears to be highly specialized or "overfitted" to very small amounts of data. It performs exceptionally well when data is scarce but fails to generalize or improve—and in fact deteriorates—as more data is introduced. The steep initial drop suggests it is highly sensitive to noise or irrelevant information present in larger datasets.
* **k=3 (Higher k):** This configuration shows the classic behavior of a model that benefits from more data. It starts poorly but leverages additional data to build a more robust and accurate representation, leading to strong performance at high data percentages. The slight dip at 0.5 could indicate a transitional phase or a point where the added data introduces temporary complexity before being effectively utilized.
* **The Crossover:** The intersection point is critical. It defines the data regime where one should switch from using k=2 to k=3 to maximize performance. For projects with less than ~55% of the available data, k=2 is superior. For projects with more than ~55% of the data, k=3 is the clear choice.
* **Underlying Principle:** The data suggests that the optimal value of the parameter `k` is not static but is dependent on the volume of available training data. A smaller `k` may be preferable in low-data scenarios, while a larger `k` is necessary to harness the power of large datasets.
</details>
(b) Reasoning step. From k=2 to k=3
Figure 7: SFT performances for reasoning step generalization.
Findings. As showcased in Figure 7, CoT reasoning cannot generalize to data requiring a different number of reasoning steps. We then reduce the distribution discrepancy by gradually increasing the ratio of unseen data during pre-training while keeping the dataset size constant, and evaluate performance on both datasets. As the figure shows, performance on the target dataset increases with this ratio; at the same time, the LLMs can no longer generalize to the original training dataset because its share of the training data shrinks. The trend is similar when testing generalization across different step counts, which matches intuition and directly validates our hypothesis.
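The ratio-controlled mixing protocol above can be sketched as follows; the function is our own illustration of building a fixed-size pre-training set with a given fraction of unseen-distribution data:

```python
import random

def mix_datasets(original, unseen, unseen_ratio, size, rng):
    """Sample a fixed-size training set in which a fraction
    `unseen_ratio` of examples comes from the unseen distribution
    and the rest from the original distribution."""
    n_unseen = round(size * unseen_ratio)
    n_original = size - n_unseen
    return rng.sample(unseen, n_unseen) + rng.sample(original, n_original)

rng = random.Random(0)
mixed = mix_datasets(list(range(100)), list(range(100, 200)),
                     unseen_ratio=0.3, size=10, rng=rng)
print(len(mixed), sum(x >= 100 for x in mixed))  # 10 3
```

Sweeping `unseen_ratio` from 0 to 1 while holding `size` fixed reproduces the trade-off observed here: gains on the target distribution come at the cost of the original one.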
## 7 Format Generalization
Format generalization assesses the robustness of CoT reasoning to surface-level variations in test queries. This dimension is especially crucial for determining whether models have internalized flexible, transferable reasoning strategies or remain reliant on the specific templates and phrasings encountered during training.
Format Alignment Score. We introduce a metric for measuring prompt similarity:
**Definition 7.1 (Format Alignment Score)**
*For a training prompt distribution $P_{\text{train}}$ and a test prompt $p_{\text{test}}$:*
$$
\mathrm{FAS}(p_{\text{test}})=\max_{p\in P_{\text{train}}}\cos\bigl(\varphi(p),\varphi(p_{\text{test}})\bigr) \tag{14}
$$
*where $\varphi$ is a prompt embedding function.*
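Equation 14 reduces to a nearest-neighbor search in embedding space. A minimal sketch, with toy 2-D vectors standing in for the output of the embedding function φ:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def format_alignment_score(test_emb, train_embs):
    """Max cosine similarity between the test prompt embedding and
    any training prompt embedding (Equation 14)."""
    return max(cosine(test_emb, e) for e in train_embs)

train = [[1.0, 0.0], [0.0, 1.0]]
print(round(format_alignment_score([1.0, 1.0], train), 4))  # 0.7071
```

In practice `train_embs` would come from encoding the training prompts with any sentence-embedding model; the toy vectors above are only for illustration.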
<details>
<summary>x10.png Details</summary>

### Visual Description
## Grouped Bar Chart: Edit Distance vs. Noise Level by Operation Type
### Overview
This is a grouped bar chart illustrating the relationship between increasing noise levels (as a percentage) and the resulting "Edit Distance" for four different categories of operations: All, Insertion, Deletion, and Modify. The chart includes error bars on each bar, indicating variability or confidence intervals around the mean edit distance measurements.
### Components/Axes
* **Chart Type:** Grouped bar chart with error bars.
* **X-Axis:** Labeled **"Noise Level (%)"**. It has six discrete categories: **5, 10, 15, 20, 25, 30**.
* **Y-Axis:** Labeled **"Edit Distance"**. The scale runs from **0.0 to 0.8**, with major tick marks at intervals of 0.2.
* **Legend:** Positioned in the **top-left corner** of the chart area. It defines four data series with distinct colors and patterns:
* **All:** Purple bars with diagonal stripes (\\).
* **Insertion:** Red bars with diagonal stripes (//).
* **Deletion:** Blue bars with a cross-hatch pattern (XX).
* **Modify:** Green bars with a dotted pattern (..).
### Detailed Analysis
Data values are approximate, read from the chart's visual alignment with the y-axis. Error bars represent the uncertainty range.
**At 5% Noise Level:**
* **All (Purple):** ~0.20 (Error bar range: ~0.18 to 0.22)
* **Insertion (Red):** ~0.25 (Error bar range: ~0.23 to 0.27)
* **Deletion (Blue):** ~0.12 (Error bar range: ~0.10 to 0.14)
* **Modify (Green):** ~0.20 (Error bar range: ~0.18 to 0.22)
**At 10% Noise Level:**
* **All (Purple):** ~0.35 (Error bar range: ~0.33 to 0.37)
* **Insertion (Red):** ~0.45 (Error bar range: ~0.43 to 0.47)
* **Deletion (Blue):** ~0.24 (Error bar range: ~0.22 to 0.26)
* **Modify (Green):** ~0.37 (Error bar range: ~0.35 to 0.39)
**At 15% Noise Level:**
* **All (Purple):** ~0.48 (Error bar range: ~0.46 to 0.50)
* **Insertion (Red):** ~0.60 (Error bar range: ~0.58 to 0.62)
* **Deletion (Blue):** ~0.35 (Error bar range: ~0.33 to 0.37)
* **Modify (Green):** ~0.50 (Error bar range: ~0.48 to 0.52)
**At 20% Noise Level:**
* **All (Purple):** ~0.58 (Error bar range: ~0.56 to 0.60)
* **Insertion (Red):** ~0.73 (Error bar range: ~0.71 to 0.75)
* **Deletion (Blue):** ~0.43 (Error bar range: ~0.41 to 0.45)
* **Modify (Green):** ~0.62 (Error bar range: ~0.60 to 0.64)
**At 25% Noise Level:**
* **All (Purple):** ~0.65 (Error bar range: ~0.63 to 0.67)
* **Insertion (Red):** ~0.80 (Error bar range: ~0.78 to 0.82)
* **Deletion (Blue):** ~0.50 (Error bar range: ~0.48 to 0.52)
* **Modify (Green):** ~0.71 (Error bar range: ~0.69 to 0.73)
**At 30% Noise Level:**
* **All (Purple):** ~0.71 (Error bar range: ~0.69 to 0.73)
* **Insertion (Red):** ~0.87 (Error bar range: ~0.85 to 0.89)
* **Deletion (Blue):** ~0.56 (Error bar range: ~0.54 to 0.58)
* **Modify (Green):** ~0.79 (Error bar range: ~0.77 to 0.81)
### Key Observations
1. **Universal Upward Trend:** For all four categories (All, Insertion, Deletion, Modify), the Edit Distance increases monotonically as the Noise Level increases from 5% to 30%.
2. **Operation Sensitivity Hierarchy:** There is a clear and consistent hierarchy in the magnitude of Edit Distance across all noise levels:
* **Insertion (Red)** consistently has the **highest** Edit Distance.
* **Deletion (Blue)** consistently has the **lowest** Edit Distance.
* **Modify (Green)** and **All (Purple)** fall in between, with Modify generally slightly higher than All.
3. **Divergence with Noise:** The gap between the operation types widens as noise increases. At 5% noise, the spread between the highest (Insertion) and lowest (Deletion) is ~0.13. At 30% noise, this spread increases to ~0.31.
4. **Error Bar Consistency:** The error bars are relatively small and consistent in size across all data points, suggesting stable measurements with low variance at each condition.
### Interpretation
This chart demonstrates the impact of data noise on the performance or cost of different edit operations, measured by "Edit Distance." The key finding is that **insertion operations are disproportionately affected by noise compared to deletion or modification operations.**
* **What the data suggests:** The system or model being evaluated becomes less accurate (higher edit distance) as input data becomes noisier. However, it is particularly vulnerable to errors that manifest as insertions. This could imply that the system has a bias towards adding extraneous information when confused by noise, or that its mechanism for handling insertions is less robust.
* **Relationship between elements:** The "All" category represents an aggregate metric. Its value being between the extremes of Insertion and Deletion/Modify confirms it is a composite measure. The fact that "Modify" tracks closely with but is slightly higher than "All" suggests modification errors are a significant component of the overall error profile.
* **Notable trends/anomalies:** The most significant trend is the linear-like increase for all series. There are no sudden jumps or plateaus, indicating a predictable, proportional relationship between noise and error within the tested range. The lack of any crossover in the hierarchy of the bars (Insertion always on top, Deletion always on bottom) is a strong, unambiguous result. The widening gap suggests that noise doesn't just increase errors uniformly; it specifically exacerbates the system's weakness in handling insertions.
</details>
(a) Format generalization. Performance under various perturbation methods.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: BLEU Score vs. Noise Level for Different Methods
### Overview
This image is a line chart comparing the performance of four different methods ("None", "Prompt", "Transformation", "Element") as measured by BLEU Score across increasing levels of noise. The chart demonstrates how each method's performance degrades as the input noise increases from 10% to 90%.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Noise Level (%)". The axis markers are at intervals of 10, starting at 10 and ending at 90.
* **Y-Axis (Vertical):** Labeled "BLEU Score". The axis scale ranges from 0.0 to 1.0, with markers at intervals of 0.2 (0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries, each with a colored line sample, marker symbol, and label:
1. **None:** Light blue, diamond marker, dotted line.
2. **Prompt:** Purple, square marker, solid line.
3. **Transformation:** Orange-red, circle marker, dashed line.
4. **Element:** Teal/green, triangle marker, dash-dot line.
### Detailed Analysis
The chart plots four distinct data series. Below is an analysis of each, including approximate data points extracted from the visual plot.
**1. Series: "None" (Light Blue, Diamond, Dotted)**
* **Trend:** Perfectly horizontal line. This series shows no degradation in performance with increasing noise.
* **Data Points (Approximate):** The BLEU Score remains constant at **1.0** for all Noise Levels from 10% to 90%.
**2. Series: "Prompt" (Purple, Square, Solid)**
* **Trend:** Linearly decreasing trend. The line slopes downward steadily from left to right.
* **Data Points (Approximate):**
* Noise 10%: BLEU Score ≈ **0.95**
* Noise 20%: BLEU Score ≈ **0.85**
* Noise 30%: BLEU Score ≈ **0.78**
* Noise 40%: BLEU Score ≈ **0.72**
* Noise 50%: BLEU Score ≈ **0.65**
* Noise 60%: BLEU Score ≈ **0.60**
* Noise 70%: BLEU Score ≈ **0.55**
* Noise 80%: BLEU Score ≈ **0.48**
* Noise 90%: BLEU Score ≈ **0.42**
**3. Series: "Transformation" (Orange-Red, Circle, Dashed)**
* **Trend:** Linearly decreasing trend, steeper than the "Prompt" series.
* **Data Points (Approximate):**
* Noise 10%: BLEU Score ≈ **0.88**
* Noise 20%: BLEU Score ≈ **0.78**
* Noise 30%: BLEU Score ≈ **0.65**
* Noise 40%: BLEU Score ≈ **0.58**
* Noise 50%: BLEU Score ≈ **0.48**
* Noise 60%: BLEU Score ≈ **0.40**
* Noise 70%: BLEU Score ≈ **0.30**
* Noise 80%: BLEU Score ≈ **0.22**
* Noise 90%: BLEU Score ≈ **0.15**
**4. Series: "Element" (Teal/Green, Triangle, Dash-Dot)**
* **Trend:** Sharply decreasing, non-linear trend. The line drops rapidly initially and then flattens near zero.
* **Data Points (Approximate):**
* Noise 10%: BLEU Score ≈ **0.65**
* Noise 20%: BLEU Score ≈ **0.42**
* Noise 30%: BLEU Score ≈ **0.20**
* Noise 40%: BLEU Score ≈ **0.12**
* Noise 50%: BLEU Score ≈ **0.05**
* Noise 60%: BLEU Score ≈ **0.02**
* Noise 70%: BLEU Score ≈ **0.01**
* Noise 80%: BLEU Score ≈ **0.00**
* Noise 90%: BLEU Score ≈ **0.00**
### Key Observations
1. **Baseline Performance:** The "None" method serves as a perfect baseline, maintaining a BLEU Score of 1.0 regardless of noise. This suggests it represents an ideal or uncorrupted reference.
2. **Performance Hierarchy:** At all noise levels, the performance order from best to worst is consistent: "None" > "Prompt" > "Transformation" > "Element".
3. **Varying Resilience:** The methods show dramatically different resilience to noise. "Prompt" degrades the most gracefully, while "Element" is highly sensitive, losing most of its performance by 40% noise.
4. **Crossover Point:** The "Prompt" and "Transformation" lines do not cross; "Prompt" maintains a consistent performance advantage over "Transformation" across the entire noise spectrum.
5. **Convergence to Zero:** The "Element" method's performance effectively reaches zero (BLEU Score ≈ 0.0) at noise levels of 80% and above.
### Interpretation
This chart likely evaluates the robustness of different natural language processing or machine translation techniques against input corruption (noise). The "BLEU Score" is a standard metric for evaluating machine-generated text against a human reference.
* **"None"** probably represents the evaluation of the original, uncorrupted text against itself, hence the perfect score. It acts as the control.
* The downward trends for the other three methods demonstrate that introducing noise (e.g., typos, word swaps, deletions) degrades the quality of their output.
* The **"Prompt"** method appears to be the most robust technique among those tested. Its linear, shallow decline suggests it has mechanisms to handle or ignore a significant amount of input noise.
* The **"Transformation"** method is moderately robust but consistently underperforms the "Prompt" method.
* The **"Element"** method is the least robust. Its sharp initial decline indicates it is highly dependent on precise, clean input and fails catastrophically as noise increases. This could represent a more brittle, rule-based, or fine-grained processing approach.
The key takeaway is that the choice of method has a profound impact on system performance in noisy, real-world conditions. The "Prompt" approach demonstrates superior stability, making it potentially more reliable for applications where input quality cannot be guaranteed.
</details>
(b) Format generalization. Performance vs. various applied perturbation areas.
Figure 8: Performance of format generalization.
Experiment settings. To systematically probe this, we introduce four distinct perturbation modes that simulate real-world scenarios: (i) insertion, where a noise token is inserted before an original token; (ii) deletion, where an original token is deleted; (iii) modification, where an original token is replaced with a noise token; and (iv) hybrid, which combines multiple perturbations. Each mode is applied to tokens independently with probability $p$, enabling us to quantify the model’s resilience to increasing degrees of prompt distribution shift.
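The three elementary modes can be sketched as token-level operations applied independently with probability `p`; this is our own illustration (the hybrid mode simply composes calls to it):

```python
import random

def perturb(tokens, mode, p, vocab, rng=random):
    """Apply a format perturbation to a token sequence.
    mode: 'insertion' | 'deletion' | 'modification'
    p: per-token perturbation probability; vocab: noise tokens."""
    out = []
    for tok in tokens:
        hit = rng.random() < p
        if mode == "insertion":
            if hit:
                out.append(rng.choice(vocab))  # noise token before the original
            out.append(tok)
        elif mode == "deletion":
            if not hit:
                out.append(tok)  # keep token only if not deleted
        elif mode == "modification":
            out.append(rng.choice(vocab) if hit else tok)
    return out

print(perturb([1, 2, 3], "insertion", p=1.0, vocab=[0]))  # [0, 1, 0, 2, 0, 3]
```

With `p=0` every mode is the identity; with `p=1` deletion empties the sequence, which bounds the range of distribution shift the sweep over `p` explores.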
Findings. As shown in Figure 8a, CoT reasoning is easily disrupted by format changes: insertion, deletion, modification, and the hybrid mode all create a format discrepancy that harms correctness. Among them, deletion affects performance only slightly, whereas insertion is the most influential. We further divide the query into several sections: elements, transformations, and prompt tokens. As shown in Figure 8b, the elements and transformations play an important role in the format, whereas changes to other tokens rarely affect the results.
## 8 Temperature and Model Size
Temperature and model size generalization explores how variations in sampling temperature and model capacity influence the stability and robustness of CoT reasoning. For rigorous evaluation, we further investigate whether different choices of temperature and model size significantly affect our results.
Experiment settings. We explore the impact of different temperatures on the validity of the presented results, adopting the same setting as in the transformation generalization.
Findings. As illustrated in Figure 9a, LLMs tend to generate consistent and reliable CoT reasoning across a broad range of temperature settings (e.g., from 1e-5 up to 1), provided the values remain within a suitable range. This stability is maintained even when the models are evaluated under a variety of distribution shifts.
(Figure 9a, x12.png: two heatmaps of BLEU score and edit distance versus temperature, from 1e-5 to 10.0, for the CMP, POOD, and OOD scenarios. Both metrics are stable for temperatures up to 1.0 and degrade sharply at 5.0 and above; at low temperatures, CMP outperforms POOD, which outperforms OOD.)
(a) Influences of various temperatures.
(Figure 9b, x13.png: line chart of Exact Match (%) versus SFT ratio (×10⁻⁴) for model sizes 68K, 589K, 4.8M, 35M, and 543M. Every model eventually reaches 100% exact match, but larger models require larger SFT ratios before performance takes off.)
(b) Influences of various sizes.
Figure 9: Temperature and model size. The findings hold under different temperatures and model sizes.
Experiment settings. We further examine the influence of model size by employing the same experimental configuration as used in the novel relation SFT study. In particular, we first pretrain models of different sizes using the transformation $f_1∘ f_1$ , and subsequently perform SFT on $f_2∘ f_2$ while varying the SFT ratios.
Findings. Figure 9b shows the accuracy of models of different sizes under different SFT ratios; the results closely match those of our default model size across all evaluated settings and configurations.
## 9 Discussion and Implication
Our investigation, conducted through the controlled environment of DataAlchemy, reveals that the apparent reasoning prowess of Chain-of-Thought (CoT) is largely a brittle mirage. The findings across task, length, and format generalization experiments converge on a conclusion: CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
While our experiments utilized models trained from scratch in a controlled environment, the principles uncovered are extensible to large-scale pre-trained models. We summarize the implications for practitioners as follows.
Guard Against Over-reliance and False Confidence. CoT should not be treated as a “plug-and-play” module for robust reasoning, especially in high-stakes domains like medicine, finance, or legal analysis. The ability of LLMs to produce “fluent nonsense” —plausible but logically flawed reasoning chains—can be more deceptive and damaging than an outright incorrect answer, as it projects a false aura of dependability. Sufficient auditing from domain experts is indispensable.
Prioritize Out-of-Distribution (OOD) Testing. Standard validation practices, where the test set closely mirrors the training set, are insufficient to gauge the true robustness of a CoT-enabled system. Practitioners must implement rigorous adversarial and OOD testing that systematically probes for vulnerabilities across task, length, and format variations.
Recognize Fine-Tuning as a Patch, Not a Panacea. Our results show that Supervised Fine-Tuning (SFT) can quickly “patch” a model’s performance on a new, specific data distribution. However, this should not be mistaken for achieving true generalization. It simply expands the model’s “in-distribution” bubble slightly. Relying on SFT to fix every OOD failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability.
## 10 Conclusion
In this paper, we critically examine the CoT reasoning of LLMs through the lens of data distribution, revealing that the perceived structured reasoning capability largely arises from inductive biases shaped by in-distribution training data. We propose a controlled environment, DataAlchemy, allowing systematic probing of CoT reasoning along three crucial dimensions: task structure, reasoning length, and query format. Empirical findings consistently demonstrate that CoT reasoning effectively reproduces reasoning patterns closely aligned with training distributions but suffers significant degradation when faced with distributional deviations. Such observations reveal the inherent brittleness and superficiality of current CoT reasoning capabilities. We provide insights that emphasize real-world implications for both practitioners and researchers.
## References
- Bentham et al. (2024) O. Bentham, N. Stringham, and A. Marasovic. Chain-of-thought unfaithfulness as disguised accuracy. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=ydcrP55u2e. Reproducibility Certification.
- Budnikov et al. (2025) M. Budnikov, A. Bykova, and I. P. Yamshchikov. Generalization potential of large language models. Neural Computing and Applications, 37(4):1973–1997, 2025.
- Chen et al. (2025a) Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025a.
- Chen et al. (2025b) Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, C. Denison, J. Schulman, A. Somani, P. Hase, M. Wagner, F. Roger, et al. Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410, 2025b.
- Cho et al. (2024) H. Cho, J. Cha, P. Awasthi, S. Bhojanapalli, A. Gupta, and C. Yun. Position coupling: Improving length generalization of arithmetic transformers using task structure. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=5cIRdGM1uG.
- Garg et al. (2022) S. Garg, D. Tsipras, P. S. Liang, and G. Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems, 35:30583–30598, 2022.
- Guo et al. (2025) D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Imani et al. (2023) S. Imani, L. Du, and H. Shrivastava. Mathprompter: Mathematical reasoning using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37–42, 2023.
- Jaech et al. (2024) A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- Kojima et al. (2022) T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Lanham et al. (2023) T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
- Li et al. (2025a) H. Li, S. Lu, P.-Y. Chen, X. Cui, and M. Wang. Training nonlinear transformers for chain-of-thought inference: A theoretical generalization analysis. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=n7n8McETXw.
- Li et al. (2025b) Y. Li, Z. Lai, W. Bao, Z. Tan, A. Dao, K. Sui, J. Shen, D. Liu, H. Liu, and Y. Kong. Visual large language models for generalized and specialized applications. arXiv preprint arXiv:2501.02765, 2025b.
- Ling et al. (2023) Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memisevic, and H. Su. Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems, 36:36407–36433, 2023.
- Mirzadeh et al. (2024) I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.
- Papineni et al. (2002) K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- Radford et al. (2019) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Shen et al. (2025) Z. Shen, H. Yan, L. Zhang, Z. Hu, Y. Du, and Y. He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025.
- Shojaee et al. (2025) P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025.
- Song et al. (2025) J. Song, Z. Xu, and Y. Zhong. Out-of-distribution generalization via composition: a lens through induction heads in transformers. Proceedings of the National Academy of Sciences, 122(6):e2417182122, 2025.
- Tang et al. (2023) X. Tang, Z. Zheng, J. Li, F. Meng, S.-C. Zhu, Y. Liang, and M. Zhang. Large language models are in-context semantic reasoners rather than symbolic reasoners. arXiv preprint arXiv:2305.14825, 2023.
- Team et al. (2025) K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
- Team (2024) Q. Team. QwQ: Reflect deeply on the boundaries of the unknown. Hugging Face, 2024.
- Ting et al. (2025) L. P.-Y. Ting, C. Zhao, Y.-H. Zeng, Y. J. Lim, and K.-T. Chuang. Beyond rag: Reinforced reasoning augmented generation for clinical notes. arXiv preprint arXiv:2506.05386, 2025.
- Wang et al. (2024) Q. Wang, Y. Wang, Y. Wang, and X. Ying. Can in-context learning really generalize to out-of-distribution tasks? arXiv preprint arXiv:2410.09695, 2024.
- Wang et al. (2023) X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
- Wang et al. (2025a) Y. Wang, F.-C. Chang, and P.-Y. Wu. Chain-of-thought prompting for out-of-distribution samples: A latent-variable study. arXiv e-prints, pages arXiv–2504, 2025a.
- Wang et al. (2025b) Y. Wang, F.-C. Chang, and P.-Y. Wu. A theoretical framework for ood robustness in transformers using gevrey classes. arXiv preprint arXiv:2504.12991, 2025b.
- Wei et al. (2022) J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- Xu et al. (2024) J. Xu, H. Fei, L. Pan, Q. Liu, M.-L. Lee, and W. Hsu. Faithful logical reasoning via symbolic chain-of-thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13326–13365, 2024.
- Yang et al. (2024) J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12):5635–5662, 2024.
- Yang et al. (2023) L. Yang, Y. Song, X. Ren, C. Lyu, Y. Wang, J. Zhuo, L. Liu, J. Wang, J. Foster, and Y. Zhang. Out-of-distribution generalization in natural language processing: Past, present, and future. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4533–4559, 2023.
- Yao et al. (2023) S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.
- Yao et al. (2025) X. Yao, R. Ren, Y. Liao, and Y. Liu. Unveiling the mechanisms of explicit cot training: How chain-of-thought enhances reasoning generalization. arXiv e-prints, pages arXiv–2502, 2025.
- Yeo et al. (2025) E. Yeo, Y. Tong, M. Niu, G. Neubig, and X. Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025.
- Yu et al. (2023) Z. Yu, L. He, Z. Wu, X. Dai, and J. Chen. Towards better chain-of-thought prompting strategies: A survey. arXiv preprint arXiv:2310.04959, 2023.
- Yujian and Bo (2007) L. Yujian and L. Bo. A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095, 2007.
- Zhang et al. (2024a) X. Zhang, C. Du, T. Pang, Q. Liu, W. Gao, and M. Lin. Chain of preference optimization: Improving chain-of-thought reasoning in llms. Advances in Neural Information Processing Systems, 37:333–356, 2024a.
- Zhang et al. (2024b) Y. Zhang, H. Wang, S. Feng, Z. Tan, X. Han, T. He, and Y. Tsvetkov. Can llm graph reasoning generalize beyond pattern memorization? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2289–2305, 2024b.
- Zhang et al. (2023) Z. Zhang, A. Zhang, M. Li, and A. Smola. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5NTt8GFjUHkr.
- Zhang et al. (2024c) Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024, 2024c.
- Zhao et al. (2025) C. Zhao, Z. Tan, C.-W. Wong, X. Zhao, T. Chen, and H. Liu. Scale: Towards collaborative content analysis in social science with large language model agents and human intervention. arXiv preprint arXiv:2502.10937, 2025.
- Zhao et al. (2023) W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
## Appendix A Proof of Theorems
### A.1 Proof of CoT Generalization Bound
*Proof.*
Let $f_θ$ be a model trained on samples from the distribution $D_train$ using a loss function $\ell(f_θ(x),y)$ that is $Λ$ -Lipschitz and bounded. The expected test risk is given by
$$
R_test(f_θ)=\mathbb{E}_{(x,y)∼D_test}\left[\ell(f_θ(x),y)\right]. \tag{15}
$$
We can decompose the test risk as
$$
R_test(f_θ)=R_train(f_θ)+\left(R_test(f_θ)-R_train(f_θ)\right). \tag{16}
$$
To bound the discrepancy between $R_test$ and $R_train$ , we invoke a standard result from statistical learning theory. Given that $\ell$ is $Λ$ -Lipschitz and the discrepancy measure $Δ(D_train,D_test)$ is an integral probability metric (e.g., Wasserstein-1 distance), we have
$$
\left|R_test(f_θ)-R_train(f_θ)\right|≤Λ·Δ(D_train,D_test). \tag{17}
$$
Therefore, the test risk satisfies
$$
R_test(f_θ)≤ R_train(f_θ)+Λ·Δ(D_train,D_test). \tag{18}
$$ We next account for the generalization gap between the empirical training risk $\hat{R}_train(f_θ)$ and the expected training risk $R_train(f_θ)$ . By applying a concentration inequality (e.g., Hoeffding’s inequality), with probability at least $1-δ$ , we have
$$
R_train(f_θ)≤\hat{R}_train(f_θ)+O\left(\sqrt{\frac{\log(1/δ)}{n}}\right), \tag{19}
$$
where $n$ is the number of training samples. Combining the above, we obtain that with high probability,
$$
R_test(f_θ)≤\hat{R}_train(f_θ)+Λ·Δ(D_train,D_test)+O\left(\sqrt{\frac{\log(1/δ)}{n}}\right). \tag{20}
$$
This concludes the proof. ∎
### A.2 Proof of Task Generalization Failure Threshold
We establish the exponential decay bound through a probabilistic analysis of reasoning failure modes in the presence of task generalization complexity.
Let $Ω$ denote the sample space of all possible reasoning configurations, and let $C∈Ω$ represent a specific configuration. We define the following events: $A_i$ as the event that element $a_i$ is novel, i.e., $a_i∉E^i_train$ ; $F_j$ as the event that transformation $f_j$ is novel, i.e., $f_j∉F_train$ ; and $Q$ as the event that the transformation sequence $(f_1,f_2,…,f_k)$ is novel, i.e., $(f_1,f_2,…,f_k)∉P_train$ .
Here we make the assumption that the reasoning failures induced by novel arguments, functions, and patterns contribute independently to the overall failure probability and hence we model the success probability as a product of component-wise success rates:
$$
P(correct\mid C)=P_0∏_{i=1}^{m}ρ_a^{I[A_i]}∏_{j=1}^{n}ρ_f^{I[F_j]}\,ρ_p^{I[Q]}\,ρ_c^{C_T}
$$
where $P_0∈(0,1]$ represents the baseline success probability when all components are within the training distribution, and $ρ_a,ρ_f,ρ_p,ρ_c∈(0,1)$ are the degradation factors associated with novel arguments, functions, patterns, and task-specific complexity, respectively.
$$
\ln P(correct\mid C)=\ln P_0+∑_{i=1}^{m}I[A_i]\lnρ_a+∑_{j=1}^{n}I[F_j]\lnρ_f+I[Q]\lnρ_p+C_T\lnρ_c \tag{21}
$$
For notational convenience, we define the positive constants:
$$
ξ_a:=-\lnρ_a>0,\quad ξ_f:=-\lnρ_f>0,\quad ξ_p:=-\lnρ_p>0,\quad ξ_c:=-\lnρ_c>0
$$
hence we have:
$$
\ln P(correct\mid C)=\ln P_0-ξ_a∑_{i=1}^{m}I[A_i]-ξ_f∑_{j=1}^{n}I[F_j]-ξ_pI[Q]-ξ_cC_T \tag{22}
$$
Lemma: Relationship to TGC. The expression in equation above can be bounded in terms of $TGC(C)$ as follows:
$$
\ln P(correct|C)≤\ln P_0-δ·TGC(C) \tag{23}
$$
where $δ=\min(\frac{ξ_a}{α},\frac{ξ_f}{β},\frac{ξ_p}{γ},ξ_c)>0$ .
Proof of Lemma: From the definition of $TGC(C)$ in Eq. (11), we have:
$$
TGC(C)=α∑_{i=1}^{m}I[A_i]+β∑_{j=1}^{n}I[F_j]+γI[Q]+C_T \tag{24}
$$
By the definition of $δ$ , each term in Eq. (22) satisfies:
$$
ξ_a∑_{i=1}^{m}I[A_i]≥δα∑_{i=1}^{m}I[A_i] \tag{25}
$$
$$
ξ_f∑_{j=1}^{n}I[F_j]≥δβ∑_{j=1}^{n}I[F_j] \tag{26}
$$
$$
ξ_pI[Q]≥δγI[Q] \tag{27}
$$
$$
ξ_cC_T≥δ C_T \tag{28}
$$
Summing these inequalities establishes Eq. (23).
We now define the threshold $τ:=\frac{\ln P_0}{δ}$ . From Eq. (23), when $TGC(C)>τ$ , we have:
$$
\ln P(correct\mid C)≤\ln P_0-δ·TGC(C)=δ(τ-TGC(C))=-δ(TGC(C)-τ) \tag{29}
$$
Exponentiating both sides yields the desired bound: $P(correct\mid C)≤ e^{-δ(TGC(C)-τ)}$. ∎
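As a numeric sanity check, the product model and the derived bound can be instantiated with illustrative constants (every value below is made up for the check, not estimated from experiments):

```python
import math

# Numeric sanity check of the bound P(correct|C) <= exp(-delta * (TGC(C) - tau)).
# All constants below are illustrative assumptions, not fitted values.
P0 = 0.9                                          # baseline success probability
rho_a, rho_f, rho_p, rho_c = 0.6, 0.5, 0.7, 0.8   # degradation factors
alpha, beta, gamma = 1.0, 1.0, 1.0                # TGC weights
m_novel, n_novel, q_novel, C_T = 2, 1, 1, 3       # novel args/fns, novel pattern, complexity

# Exact log success probability from the product model (Eq. 22):
logP = (math.log(P0) + m_novel * math.log(rho_a) + n_novel * math.log(rho_f)
        + q_novel * math.log(rho_p) + C_T * math.log(rho_c))

# delta, TGC, and the threshold tau:
xi_a, xi_f, xi_p, xi_c = (-math.log(r) for r in (rho_a, rho_f, rho_p, rho_c))
delta = min(xi_a / alpha, xi_f / beta, xi_p / gamma, xi_c)
tgc = alpha * m_novel + beta * n_novel + gamma * q_novel + C_T
tau = math.log(P0) / delta

print(math.exp(logP) <= math.exp(-delta * (tgc - tau)))  # True: the bound holds
```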
### A.3 Proof of Length Extrapolation Bound
*Proof.*
Consider a transformer model $f_θ$ processing sequences of length $L$. The model implicitly learns position-dependent representations through positional encodings $PE(i)∈ℝ^d$ for position $i∈\{1,…,L\}$ and attention patterns $A_{ij}=softmax\left(\frac{Q_iK_j^T}{\sqrt{d}}\right)$. During training on fixed length $L_train$, the model learns a specific distribution:
$$
p_train(h)=p(h\mid L=L_train) \tag{32}
$$
where $h=\{h_1,…,h_L\}$ represents hidden states. For sequences of length $L≠ L_train$ , we encounter distribution shift in two forms: (1) positional encoding mismatch, where the model has never seen positions $i>L_train$ if $L>L_train$ , and (2) attention pattern disruption, where the learned attention patterns are calibrated for length $L_train$ . The KL divergence between training and test distributions can be bounded:
$$
D_KL(p_test\|p_train)∝|L-L_train|^2 \tag{33}
$$ This quadratic relationship arises from linear accumulation of positional encoding errors and quadratic growth in attention pattern misalignment due to pairwise interactions. Let $E(L)$ be the prediction error at length $L$ . We decompose it as:
$$
E(L)=E_inherent(L)+E_shift(L) \tag{34}
$$
where $E_inherent(L)=E_0$ is the inherent model error (constant) and $E_shift(L)$ is the error due to distribution shift. The distribution shift error follows from the Central Limit Theorem. As the error accumulates over sequence positions, the total shift error converges to:
$$
E_shift(L)=(1-E_0)·\left(1-\exp\left(-\frac{(L-L_train)^2}{2σ^2}\right)\right) \tag{35}
$$ This form ensures that $E_shift(L_train)=0$ (no shift at training length) and $\lim_{|L-L_train|→∞}E_shift(L)=1-E_0$ (maximum error bounded by 1). The width parameter $σ$ depends on:
$$
σ=σ_0·\sqrt{\frac{d}{L_train}} \tag{36}
$$
where $σ_0$ is a model-specific constant, $d$ is the model dimension, and the $\sqrt{d/L_train}$ factor captures the concentration of measure in high dimensions. Therefore, the total error follows:
$$
E(L)=E_0+(1-E_0)·\left(1-\exp\left(-\frac{(L-L_train)^2}{2σ^2}\right)\right) \tag{37}
$$ This Gaussian form naturally emerges from the accumulation of position-dependent errors and matches the experimental observation of near-zero error at $L=L_train$ with a symmetric increase in both directions. ∎
## Appendix B Experiment Details
We fine-tune a GPT-2–style decoder-only Transformer with a vocabulary size of 10,000. The model supports a maximum context length of 256 tokens. The hidden dimension is 32, the number of Transformer layers is 4, and the number of attention heads is 4. Each block includes a GELU-activated feed-forward sublayer with width $4×d_{model}$.
The model is trained using the AdamW optimizer in mixed precision (FP16). The default learning rate is $3×10^{-3}$, and the schedule follows a cosine decay with a 10% warm-up ratio. Training is conducted for 10 epochs, using a batch size of 1024. A weight decay of 0.01 is applied, and gradient norms are clipped at 1.0.
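The schedule described above (peak learning rate $3×10^{-3}$, cosine decay, 10% warm-up) can be sketched as follows; the total step count is illustrative:

```python
import math

# Sketch of the learning-rate schedule described above: linear warm-up over
# the first 10% of steps, then cosine decay. The peak LR (3e-3) and warm-up
# ratio come from the text; the total step count is an illustrative choice.
PEAK_LR = 3e-3
WARMUP_RATIO = 0.1

def lr_at(step, total_steps):
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)  # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(100, 1000))   # peak LR at the end of warm-up
print(lr_at(1000, 1000))  # decays to ~0 at the end of training
```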
## Appendix C Illustration of Datasets
Below are examples of the transformations $f_1$ and $f_2$:
```
Transformation[F1]: A A F Q [F1] <answer> N N S D
Transformation[F2]: A A L P [F2] <answer> A L P A
```
Aside from single transformations, transformations can be composed arbitrarily:
```
Transformation[F1F2]: A C I A [F1] [F2] <think>
                      N P V N [F2] <answer>
                      P V N N

Transformation[F2F2]: N O V S [F2] [F2] <think>
                      O V S N [F2] <answer>
                      V S N O
```
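From the printed pairs, $f_1$ behaves like a ROT13 letter substitution (A→N, F→S, Q→D) and $f_2$ like a cyclic left rotation by one position. The following is a minimal sketch under that inference; the paper's exact definitions may differ:

```python
# Sketch of the two transformations as inferred from the printed examples
# (an inference, not the authors' stated definition):
# f1 ~ ROT13 letter substitution, f2 ~ cyclic left rotation by one position.

def f1(tokens):
    """Shift each letter forward by 13 positions in the alphabet (mod 26)."""
    return [chr((ord(t) - ord("A") + 13) % 26 + ord("A")) for t in tokens]

def f2(tokens):
    """Rotate the sequence left by one position."""
    return tokens[1:] + tokens[:1]

def compose(tokens, *fns):
    """Apply transformations left to right, mirroring [F1] [F2] chains."""
    for fn in fns:
        tokens = fn(tokens)
    return tokens

print(compose(list("AAFQ"), f1))      # ['N', 'N', 'S', 'D']
print(compose(list("ACIA"), f1, f2))  # ['P', 'V', 'N', 'N']
```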
We use exact match, edit distance, and BLEU score to measure the discrepancy between the generated tokens and the labels. For examples with more than one transformation, we can further measure the discrepancy for the reasoning chain and the final answer separately.
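A minimal sketch of two of these metrics, exact match and a normalized Levenshtein edit distance over token sequences (BLEU is typically computed with an external library and is omitted here):

```python
# Sketch of two discrepancy metrics: exact match and a normalized Levenshtein
# edit distance over token sequences (0 = identical, 1 = fully different).

def exact_match(pred, ref):
    return float(pred == ref)

def edit_distance(pred, ref):
    """Levenshtein distance divided by the longer sequence length."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))  # single-row dynamic programming table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, n, 1)

print(exact_match(list("NNSD"), list("NNSD")))    # 1.0
print(edit_distance(list("NNSD"), list("NNSQ")))  # 0.25
```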
## Appendix D Additional Experimental Results
### D.1 Additional Qualitative Analysis
#### D.1.1 Coincidence Caused by Orthogonal Transformations
The following case shows that, even though the transformation is different, a model trained on transformation $f_2∘ f_1$ can still produce the correct answer through incorrect reasoning:
```
Prompt:    ' A A A B [F1] [F2] < answer >'
Generated: ' B A A A [F1] < answer > O N N N '
Expected:  ' O N N N '
```
#### D.1.2 Correct Reasoning but Incorrect Final Answer
The following case shows a model pretrained on the union of the three transformations $f_1∘ f_2$, $f_2∘ f_1$, and $f_2∘ f_2$, tested on $f_1∘ f_1$:
```
Prompt:    ' A A A D [R1] [R1] < think >'
Generated: ' N N N Q [R1] < answer > N N Q N '
Expected:  ' N N N Q [R1] < answer > A A A D '
```
#### D.1.3 Failure to Generalize to Novel Elements
The following case shows that a model trained on the element set $a_i∈[A,M]$ cannot generalize to unseen elements such as N or O:
```
Prompt:    ' N N N O [F1] [F1] < think >'
Generated: ' R V Q S [F1] < answer > E I D F '
Expected:  ' A A A B [F1] < answer > N N N O '
```
#### D.1.4 LLM reproduces CoT reasoning at seen lengths
The following case shows that a model trained under $f_1∘ f_1$ tries to reproduce the length seen in its training data by adding tokens to the reasoning chain, even when prompted with the seen transformation $f_1$:
```
Prompt:    ' A A B D [f1] < answer >'
Generated: ' N O A Z N N O Q [f1] < answer > A A B D '
Expected:  ' N N O Q '
```