2506.22058
# Lost at the Beginning of Reasoning
> Equal contribution. Correspondence to: Baohao Liao ( ).
## Abstract
Recent advances in large language models (LLMs) have significantly improved complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection, and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored, and recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction: errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across various state-of-the-art open- and closed-source reasoning models. Leveraging this insight, we propose an efficient sampling strategy that uses a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing accuracy. Our work highlights the central role of the first reasoning step in generating a high-quality reasoning trajectory, thereby enabling significantly more efficient sampling.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Reasoning Trace Pruning and Evaluation Framework
### Overview
The image is a multi-panel technical diagram illustrating a method for evaluating and optimizing the reasoning traces of an AI model. It depicts a process where candidate reasoning steps are generated, evaluated by a reward model, and pruned to improve efficiency. The diagram includes flowcharts, a problem example, comparative bar charts, and performance metrics.
### Components/Axes
The diagram is divided into several distinct regions:
1. **Top Left - Reasoning Trace Structure:**
* A vertical flowchart labeled "Reasoning steps" on the left y-axis.
* Steps are represented by colored circles: `t_1` (blue), `t_2` (purple), `t_3` (pink), `...`, `t_T` (yellow).
* These steps lead to a final green square labeled `c_T` ("Final conclusion").
* The entire sequence is enclosed in a dashed box titled "Reasoning trace".
2. **Top Center - Candidate Evaluation:**
* A problem statement in a yellow box: "Aya walks 9 km each morning. [...] If she walks at s+1 km/h, how many minutes will the total be?"
* Below, a section titled "Candidate reasoning traces" lists three candidate first reasoning steps (`t_1`):
* Candidate 1 (Blue box): "Treat the whole outing as just distance over speed with no fixed stop. [...] This leads to s=5, but ignores the café stop." -> Marked with a red **X**.
* Candidate 2 (Blue box): "The café time is fixed in both scenarios, [...] The speed is therefore s=3, and the café stop is t=60 minutes." -> Marked with a green **✓**.
* Candidate 3 (Blue box): "Read the 9 km in 4 hours as a base speed of about 2.25. [...] This suggests s=2.25, but again misses the café stop." -> Marked with a red **X**.
* Arrows point from candidates to a "Reward model" (icon of a gavel).
* A pair of scissors labeled "Early pruning" cuts off the incorrect candidates.
3. **Top Right - Complete Trace:**
* A box titled "Complete reasoning trace" containing a detailed, correct solution to the problem.
* Text: "The café time is fixed in both scenarios, [...] The base speed is therefore s=3, and the café stop is t=60 minutes. Alternatively, instead of comparing directly, set up the equations [...] From 9/s+t=4 and 9/(s+3)+t=2.5, subtract to get s=3, hence t=1h. [...] Her total time is boxed (195 minutes)."
4. **Bottom Left - Similarity Graph:**
* A line graph with the y-axis labeled "Similarity(c_T, t_i)" and the x-axis labeled "Reasoning step t_i".
* The curve shows a steep decline in similarity between the final conclusion (`c_T`) and early reasoning steps (`t_1`, `t_2`), leveling off for later steps (`t_3`, `t_T`).
* Data points are colored to match the reasoning steps above (blue, purple, pink, yellow).
5. **Bottom Center - Performance Bar Charts:**
* Two grouped bar charts.
* **Left Chart ("Accuracy"):** Compares "Correct t_1" (blue hatched bar) vs. "Incorrect t_1" (orange solid bar). The "Correct t_1" bar is significantly taller.
* **Right Chart ("#Tokens"):** Compares "Correct t_1" (blue hatched bar) vs. "Incorrect t_1" (orange solid bar). The "Incorrect t_1" bar is taller, indicating more tokens used.
6. **Bottom Right - Summary Metrics:**
* Two sets of horizontal bars comparing "Maj@N" (blue) and "Pruned" (green).
* **"Accuracy" set:** The "Pruned" bar is slightly shorter than the "Maj@N" bar.
* **"Number of tokens" set:** The "Pruned" bar is dramatically shorter than the "Maj@N" bar. A bracket underneath indicates "70% LESS".
### Detailed Analysis
* **Process Flow:** The diagram illustrates a pipeline: 1) Generate multiple candidate reasoning paths (`t_1`). 2) Use a reward model to score them. 3) Prune incorrect paths early. 4) Complete only the promising trace.
* **Problem Example:** The math problem serves as a concrete test case. The correct candidate (`t_1`) correctly identifies the fixed café time as a key constraint, while incorrect candidates ignore it or misinterpret the base speed.
* **Similarity Trend:** The graph shows that the initial reasoning step (`t_1`) has the highest similarity to the final conclusion (`c_T`), after which similarity drops sharply, suggesting the first step sets the overall direction while later steps explore content that diverges from the final answer.
* **Performance Data:**
* Starting with a correct first step (`t_1`) leads to higher final accuracy.
* Starting with an incorrect first step leads to a longer, more token-heavy reasoning trace (likely due to backtracking or errors).
* The "Pruned" method achieves accuracy comparable to the "Maj@N" (likely Majority Vote at N) baseline while using **70% fewer tokens**.
### Key Observations
1. **Critical First Step:** The correctness of the initial reasoning step (`t_1`) is highly predictive of final accuracy and efficiency.
2. **Efficiency Gain:** The primary benefit of the pruning method is a massive reduction in computational cost (token usage), not a significant increase in accuracy.
3. **Visual Coding:** Colors are used consistently to link elements: blue for `t_1`/correct, orange for incorrect, and green for the final conclusion/pruned method.
4. **Symbolic Language:** Icons (gavel, scissors) and marks (✓, X) provide immediate visual feedback on evaluation and pruning actions.
### Interpretation
This diagram presents a method for making AI reasoning more efficient and reliable. The core insight is that not all reasoning paths are equal; by evaluating and pruning poor initial steps early, the system can avoid wasting computation on futile lines of thought.
The data suggests that the quality of the "first thought" is crucial. An incorrect initial assumption (`t_1`) cascades into a longer, less accurate process. The pruning mechanism acts as a filter, preserving only the most promising reasoning threads.
The **70% reduction in tokens** is the standout result. It demonstrates that significant efficiency gains are possible without a major sacrifice in accuracy. This has practical implications for reducing the cost and latency of complex AI reasoning tasks. The framework essentially trades a small amount of accuracy (as seen in the slightly lower "Pruned" accuracy bar) for a large gain in efficiency, which is often a favorable trade-off in real-world applications.
The similarity graph provides a diagnostic insight: the first step of a reasoning trace is the most aligned with the final answer, and subsequent steps diverge from it. This supports the strategy of focusing evaluation and pruning efforts on the early, high-leverage steps (`t_1`, `t_2`) rather than later ones.
</details>
Figure 1: Overview of our observation and efficient sampling. The first reasoning step $t_1$ heavily shapes the entire reasoning trajectory: a strong first step typically yields accurate solutions with fewer tokens (bottom left). Building on this observation, we propose to generate multiple candidate first steps, evaluate them with a reward model, and discard weaker candidates early (top right). This method maintains accuracy while substantially reducing token consumption by 70% (bottom right).
## 1 Introduction
Large language models (LLMs) have demonstrated remarkable performance across a variety of reasoning tasks, ranging from mathematical problem solving to multi-hop question answering (Hestness et al., 2017; Kaplan et al., 2020; Hoffmann et al., 2022). More recently, the advent of reasoning-oriented LLMs capable of performing long chain-of-thought (long-CoT) reasoning at test time has led to substantial advancements in these domains (Brown et al., 2020; Hurst et al., 2024; Anthropic, 2025; Team et al., 2024; Guo et al., 2025; Yang et al., 2025a; Wen et al., 2025; He et al., 2025). A widely held hypothesis attributes this progress to the models' ability to backtrack, self-reflect, and self-correct, effectively leveraging contextual feedback to iteratively refine their responses.
However, recent studies suggest that long-CoT reasoning can also introduce inefficiencies. Models often "overthink" by producing unnecessarily extended reasoning processes to solve problems (Chiang & Lee, 2024; Zhang et al., 2024a; Wang et al., 2025b; Liao et al., 2025b; a). This observation raises questions about the model's capacity for backtracking, self-reflection, and self-correction. It suggests that LLMs may lack awareness of the information they have already processed, leading to redundant or inefficient reasoning. Moreover, Liu et al. (2024a) demonstrate that LLMs are prone to the "lost-in-the-middle" phenomenon, wherein information located in the middle of a long context is often overlooked. While their analysis is conducted in the context of information retrieval, it remains an open question whether similar positional biases affect long CoT as well.
In this work, we introduce a novel and previously underexplored perspective on long-CoT reasoning: many reasoning failures in long-CoT LLMs stem not from errors made mid-chain, but rather from flaws at the very beginning of reasoning. Our experiments demonstrate that the first reasoning step has the most significant influence on the final prediction. When this first step is incorrect, the model is considerably more likely to arrive at an incorrect final answer (a 40% accuracy drop), highlighting the limited self-correction capabilities of current long-CoT LLMs. Notably, this phenomenon is consistently observed across five open- and closed-source long-CoT LLM families.
Motivated by this insight, we propose an efficient early pruning algorithm that exploits the predictive power of the first reasoning step. Specifically, by evaluating the quality of the first step, we identify and discard less promising reasoning traces early, continuing generation only for the more promising ones. This approach significantly reduces inference cost. Across five open-source long-CoT LLM families and five challenging mathematical, scientific, and programming benchmarks, our method maintains accuracy while reducing the inference budget by up to 70%. Our results show that the first step is not just the beginning of reasoning, but a key factor that influences both accuracy and efficiency, making it an important focus for future reasoning models.
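The sample-score-prune loop can be sketched as follows. This is a minimal illustration, not the paper's actual interface: the names `gen_first_step`, `score`, and `continue_trace` are hypothetical placeholders for the sampler, the reward model, and the full-trace generator, and the sample/keep counts are arbitrary defaults.

```python
from collections import Counter

def prune_and_answer(prompt, gen_first_step, score, continue_trace, n=16, keep=4):
    """Early-pruning sketch: sample n candidate first steps, keep only the
    top-`keep` by reward-model score, complete only those traces, and
    majority-vote over the surviving answers."""
    candidates = [gen_first_step(prompt) for _ in range(n)]
    # Reward-model scoring: a higher score marks a more promising first step.
    survivors = sorted(candidates, key=score, reverse=True)[:keep]
    # Only the survivors are continued into full reasoning traces.
    answers = [continue_trace(prompt, t1) for t1 in survivors]
    return Counter(answers).most_common(1)[0][0]
```

Because the discarded candidates never generate full traces, the token budget scales with `keep` rather than `n`, which is where the savings come from.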
Contributions. Our main contributions are as follows: (1) To the best of our knowledge, we are the first to empirically establish a strong positive correlation between the first reasoning step and the final prediction across various open- and closed-source long-CoT LLM families (§ 3); (2) Inspired by this observation, we propose an efficient early pruning algorithm that halts generation for less promising initial steps, thereby improving inference efficiency while maintaining accuracy (§ 4); (3) Both the observation and the proposed efficient sampling method are extensively validated on various long-CoT LLMs across different reasoning tasks, with the necessary control experiments to disentangle confounding factors.
## 2 Related work
Lost in the middle. Liu et al. (2024a) introduced the "lost in the middle" effect, demonstrating that LLMs tend to overlook information in the middle of long contexts, performing better when relevant content appears at the beginning or end. This positional bias is evident across tasks like arithmetic reasoning (Shen et al., 2023; Liao & Monz, 2024), multiple-choice QA (Zheng et al., 2024; Pezeshkpour & Hruschka, 2023), text evaluation (Wang et al., 2024; Shi et al., 2024), passage ranking (Zhang et al., 2024b), and instructional prompt positioning (Liu et al., 2024b; Chen et al., 2024b). Additionally, studies have documented primacy and recency biases, where models disproportionately allocate attention to the first or final tokens, independent of their semantic relevance (Xiao et al., 2024; Qin et al., 2023; Barbero et al., 2025). While previous studies have primarily examined positional biases in external context, we investigate whether analogous biases emerge in the internal reasoning trajectories of long chain-of-thought models. Different from attention-level analyses that focus on how the first input token shapes representations, our work shows that the first generated reasoning step greatly influences subsequent reasoning and final outcomes.
Efficient test-time reasoning. Test-time scaling methods aim to improve the accuracyâcompute trade-off by adapting sampling and aggregation. One line of work increases self-consistency efficiency by reducing sample counts (Li et al., 2024; Wan et al., 2025; Aggarwal et al., 2023; Xue et al., 2023), while another shortens chain-of-thought depth via fine-tuning or inference-only optimizations (Chen et al., 2024a; Luo et al., 2025; Hou et al., 2025; Fu et al., 2025a; Yang et al., 2025b). These methods, however, still rely on generating full reasoning traces. DeepConf (Fu et al., 2025b) instead uses local confidence to filter low-quality traces and terminate generation early. Our method takes a different focus: we assess the quality of the initial reasoning step, which strongly shapes subsequent reasoning, and prune weak starts before long traces unfold.
## 3 Lost at the beginning of reasoning
Motivated by the finding of Liu et al. (2024a), which demonstrates that query-relevant information is more impactful when positioned at either the beginning or end of an LLM's context window, we first investigate whether a similar positional effect exists in long-CoT reasoning (§ 3.1). Our analysis reveals that the first reasoning step has a strong impact on the final conclusion. To validate this observation, we further perform two ablation studies, confirming the critical role of the first step in determining the model's final prediction (§ 3.2 and § 3.3).
Notation. Let $p$ represent the input prompt, consisting of both a system instruction and a user query. A reasoning model $M$ produces a sequence of CoT reasoning steps $t=[t_1,t_2,...,t_T]$, followed by a final conclusion $c_T$, such that the complete model output is given by $t \oplus c_T = M(p)$, where $\oplus$ denotes concatenation. In models such as DeepSeek-R1 (Guo et al., 2025) and Qwen3 (Team, 2025), the input-output format adheres to the following:
$$
p \;\texttt{<think>}\; t_1, t_2, \ldots, t_T \;\texttt{</think>}\; c_T
$$
The final prediction $q_T$ is then derived by applying an extraction function $g$ to the conclusion, i.e., $q_T=g(c_T)$ , where $g$ may, for example, extract values enclosed within \boxed{}.
The conclusion $c_T$ can be interpreted as a summary of the essential reasoning steps leading to the final prediction. This raises an interesting question: Is there a positional bias in how reasoning steps contribute to the conclusion? In other words, do certain steps have a disproportionately greater influence on $c_T$ than others?
### 3.1 Similarity between reasoning steps and the final conclusion
To understand how intermediate reasoning contributes to the final outcome, we measure the semantic similarity between each reasoning step $\{t_i\}_{i=1}^{T}$ and the final conclusion $c_T$. This analysis reveals whether the reasoning process gradually aligns with the final answer or diverges along the way.
Experimental setup. We evaluate 60 questions from AIME24 and AIME25 (MAA Committees, 2025) using DeepSeek-R1-Distill-Qwen-7B (abbreviated as DS-R1-Qwen-7B in the remainder of this paper) (Guo et al., 2025), Qwen3-8B (Yang et al., 2025a), Claude-3.7-Sonnet with thinking (Anthropic, 2025), GPT-OSS-20B (Agarwal et al., 2025), and Magistral-Small (Rastogi et al., 2025). For reproducibility, the exact model identifiers are provided in Appendix B. Generations are produced with temperature=1.0, top_p=0.9, min_p=0.05, and max_tokens=32768; for Claude-3.7-Sonnet, only max_tokens is set. All subsequent experiments adopt this hyperparameter configuration.
Segmentation of reasoning steps. We define a reasoning step as a complete logical leap or self-contained unit (Xiong et al., 2025), and segment reasoning traces with GPT-5. By default, we use GPT-5 mini for step segmentation; for GPT-OSS-20B, we instead use GPT-5, as the mini variant is incompatible. To complement this setup, we also employ heuristic segmentation based on reasoning switch keywords (Wang et al., 2025a), with details provided in Appendix A.
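Keyword-based segmentation can be sketched as below. The keyword list shown here is a hypothetical example for illustration only; the actual list follows Wang et al. (2025a) and is given in Appendix A.

```python
import re

# Illustrative reasoning-switch keywords (hypothetical examples; the
# paper's actual list is in Appendix A, following Wang et al., 2025a).
SWITCH_KEYWORDS = ("Wait", "Alternatively", "Hmm", "Let me reconsider")

def segment_heuristic(trace: str) -> list[str]:
    """Split a reasoning trace into steps at reasoning-switch keywords.

    A new step starts wherever one of the switch keywords appears;
    the split uses a zero-width lookahead so no text is discarded.
    """
    pattern = r"(?=\b(?:" + "|".join(re.escape(k) for k in SWITCH_KEYWORDS) + r")\b)"
    parts = [p.strip() for p in re.split(pattern, trace)]
    return [p for p in parts if p]
```

Because the lookahead is zero-width, every character of the trace ends up in exactly one step, which keeps token counts per step consistent with the original trace.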
Similarity computation. We compute semantic similarity between each step $t_i$ and the conclusion $c_T$ by taking the cosine similarity of their embeddings obtained from all-MiniLM-L6-v2 (Reimers & Gurevych, 2019; Wang et al., 2020). To avoid inflated similarity from problem restatement, we use GPT-5 mini to remove question-overlap text at the beginning of traces. As a robustness check, we also report results with SPLADE similarity (Formal et al., 2021) in Appendix C, confirming that our findings are not specific to dense embeddings. Since traces vary in length, similarity curves are interpolated to a fixed number of steps (either the average or maximum length) for visualization.
This setup allows us to capture how reasoning trajectories semantically converge toward, or deviate from, the final answer across different models.
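The similarity and interpolation arithmetic can be sketched as follows. In the paper the embeddings come from all-MiniLM-L6-v2; here we assume the vectors are already computed and show only the cosine and resampling steps.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def interpolate(curve, n_points):
    """Linearly resample a per-step similarity curve to a fixed number of
    points (n_points >= 2), so traces of different lengths can be averaged
    for visualization."""
    if len(curve) == 1:
        return [curve[0]] * n_points
    out = []
    for j in range(n_points):
        x = j * (len(curve) - 1) / (n_points - 1)  # position in original curve
        i = min(int(x), len(curve) - 2)
        frac = x - i
        out.append(curve[i] * (1 - frac) + curve[i + 1] * frac)
    return out
```

Resampling every trace to the same length (the average or maximum step count, as in Figure 2) is what makes the per-model curves directly comparable.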
Result. Figure 2 shows that the first reasoning step exhibits the highest similarity to the final conclusion, after which similarity drops sharply. Beyond the initial few steps, similarity stabilizes at a lower level, with only minor fluctuations across the remainder of the reasoning process. These results suggest that the first step $t_1$ is most closely aligned with the final conclusion and likely sets the overall direction of the reasoning. Subsequent steps appear to introduce exploratory or redundant content that deviates from the final answer. Additional results using SPLADE similarities (Figure C.2) show the same trend, confirming that this pattern is robust across similarity metrics. Taken together, these findings show that the first reasoning step plays a disproportionately important role in shaping the final conclusion.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Line Chart: Similarity vs. Reasoning Step for Various AI Models
### Overview
The image displays a line chart comparing the performance of five different AI models. The chart plots a "Similarity" metric against the number of "Reasoning steps." All models show a general downward trend in similarity as the number of reasoning steps increases, though the rate of decline and final values vary significantly.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Y-Axis:**
* **Label:** `Similarity(c_T, t_i)`
* **Scale:** Linear, ranging from 0.50 to 0.85.
* **Ticks:** Major ticks at 0.05 intervals (0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85).
* **X-Axis:**
* **Label:** `Reasoning step t_i (GPT5)`
* **Scale:** Linear, ranging from 0 to 30.
* **Ticks:** Major ticks at intervals of 5 (0, 5, 10, 15, 20, 25, 30).
* **Legend:** Located in the top-right quadrant of the chart area. It contains five entries, each with a unique color, line style, and marker shape:
1. **DS-R1-Qwen-7B:** Blue line with circle markers.
2. **Qwen3-8B:** Orange line with diamond markers.
3. **Claude-3.7-Sonnet:** Green line with square markers.
4. **GPT-OSS-20B:** Purple line with upward-pointing triangle markers.
5. **Magistral-Small:** Brown line with downward-pointing triangle markers.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **DS-R1-Qwen-7B (Blue, Circles):**
* **Trend:** Starts very high, experiences a steep initial decline, then fluctuates with a general downward drift.
* **Key Points:** Step 0: ~0.84, Step 5: ~0.67, Step 10: ~0.60, Step 15: ~0.57, Step 20: ~0.52, Step 25: ~0.53.
2. **Qwen3-8B (Orange, Diamonds):**
* **Trend:** Starts the highest, declines sharply and consistently until around step 12, then shows a slight recovery before ending.
* **Key Points:** Step 0: ~0.86, Step 5: ~0.65, Step 10: ~0.55, Step 15: ~0.57, Step 20: ~0.53 (line ends near step 18).
3. **Claude-3.7-Sonnet (Green, Squares):**
* **Trend:** Starts lower than the top two, declines more gradually, and exhibits the most stable performance in the latter half, even showing a slight upward trend after step 15.
* **Key Points:** Step 0: ~0.75, Step 5: ~0.68, Step 10: ~0.63, Step 15: ~0.61, Step 20: ~0.64, Step 25: ~0.60, Step 30: ~0.61.
4. **GPT-OSS-20B (Purple, Up-Triangles):**
* **Trend:** Starts the lowest, declines steadily to a minimum around step 25, then shows a sharp recovery.
* **Key Points:** Step 0: ~0.64, Step 5: ~0.58, Step 10: ~0.54, Step 15: ~0.52, Step 20: ~0.53, Step 25: ~0.48 (lowest point on chart), Step 30: ~0.54.
5. **Magistral-Small (Brown, Down-Triangles):**
* **Trend:** Starts high, declines, and then fluctuates in a middle range before the line ends early.
* **Key Points:** Step 0: ~0.79, Step 5: ~0.64, Step 10: ~0.65, Step 15: ~0.60, Step 20: ~0.57 (line ends near step 20).
### Key Observations
* **Initial Performance:** At step 0, Qwen3-8B and DS-R1-Qwen-7B have the highest similarity scores (>0.84), while GPT-OSS-20B is the lowest (~0.64).
* **Rate of Decline:** Qwen3-8B and DS-R1-Qwen-7B show the steepest initial drops. Claude-3.7-Sonnet has the most gradual decline.
* **Stability:** Claude-3.7-Sonnet demonstrates the most stable performance after step 15, maintaining a similarity between 0.60 and 0.65.
* **Anomaly/Recovery:** GPT-OSS-20B is the only model to show a significant recovery trend, increasing from its low of ~0.48 at step 25 to ~0.54 at step 30.
* **Data Range:** The chart captures data up to different steps for different models. Qwen3-8B and Magistral-Small lines terminate before step 20 and step 25, respectively, while others extend to step 30.
### Interpretation
This chart likely visualizes how closely each model's intermediate reasoning aligns with its final conclusion as reasoning proceeds (with steps segmented by GPT-5). The `Similarity(c_T, t_i)` metric measures how similar the reasoning step `t_i` is to the final conclusion `c_T`.
* **Performance Implication:** Models that start with higher similarity (Qwen3-8B, DS-R1-Qwen-7B) may have stronger initial coherence but are more susceptible to "drift" or degradation over extended reasoning. Claude-3.7-Sonnet, while starting lower, appears more robust for longer reasoning chains.
* **Model Comparison:** The data suggests a trade-off between peak initial performance and sustained performance. For tasks requiring very long reasoning, Claude-3.7-Sonnet might be more reliable. The recovery of GPT-OSS-20B is intriguing and could indicate a different architectural approach or a point where the model "resets" or finds a new stable state.
* **Underlying Question:** The chart addresses a core challenge in AI: maintaining fidelity and coherence over long, multi-step processes. The variance between models highlights different capabilities and potential failure modes in complex reasoning tasks.
</details>
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Similarity vs. Reasoning Step for Various AI Models
### Overview
The image displays a line chart comparing the similarity metric, denoted as Similarity(C_T, t_i), across five different AI models over a series of reasoning steps (t_i). The chart illustrates how the similarity between some reference (C_T) and the model's output at step t_i changes as the reasoning process progresses. All models show a general downward trend in similarity as the number of reasoning steps increases, though the rate of decline and final values vary.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Y-Axis:**
* **Label:** `Similarity(C_T, t_i)`
* **Scale:** Linear, ranging from 0.4 to 0.9.
* **Major Ticks:** 0.4, 0.5, 0.6, 0.7, 0.8, 0.9.
* **X-Axis:**
* **Label:** `Reasoning step t_i (Heuristic)`
* **Scale:** Linear, ranging from 0 to approximately 55.
* **Major Ticks:** 0, 10, 20, 30, 40, 50.
* **Legend:** Positioned in the top-right quadrant of the chart area. It contains five entries, each with a unique color and marker symbol:
1. **DS-R1-Qwen-7B:** Blue line with circle markers.
2. **Qwen3-8B:** Orange line with diamond markers.
3. **Claude-3.7-Sonnet:** Green line with square markers.
4. **GPT-OSS-20B:** Purple line with upward-pointing triangle markers.
5. **Magistral-Small:** Brown line with downward-pointing triangle markers.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **DS-R1-Qwen-7B (Blue, Circles):**
* **Trend:** Starts very high, drops steeply within the first 5 steps, then fluctuates with a gradual downward drift.
* **Key Points:** Starts ~0.86 at step 0. Drops to ~0.65 by step 5. Fluctuates between ~0.55 and ~0.60 from steps 10-20. Ends its visible series around step 22 at ~0.56.
2. **Qwen3-8B (Orange, Diamonds):**
* **Trend:** Begins at the highest point, experiences a sharp initial decline, then stabilizes into a fluctuating pattern around a lower mean.
* **Key Points:** Starts at the chart's peak, ~0.88 at step 0. Plummets to ~0.60 by step 5. From steps 10-25, it oscillates roughly between 0.48 and 0.55. Its last visible point is near step 25 at ~0.54.
3. **Claude-3.7-Sonnet (Green, Squares):**
* **Trend:** Starts moderately high, declines more gradually than the first two models, and maintains a relatively higher similarity plateau in the mid-range before declining further.
* **Key Points:** Starts ~0.76 at step 0. Descends to ~0.65 by step 5 and holds near that level until step 10. Shows a local peak ~0.62 around step 18. Ends its visible series around step 24 at ~0.56.
4. **GPT-OSS-20B (Purple, Up-Triangles):**
* **Trend:** Begins at a lower initial similarity, drops quickly, and then follows a shallow, fluctuating decline, often being the lowest or among the lowest series.
* **Key Points:** Starts ~0.65 at step 0. Falls to ~0.50 by step 8. From steps 10-50, it mostly fluctuates in the 0.40-0.50 band, with a slight upward trend in the final steps (40-50), ending near ~0.54.
5. **Magistral-Small (Brown, Down-Triangles):**
* **Trend:** Starts high, drops very sharply to become the lowest series, and exhibits the most volatile, jagged pattern with significant oscillations throughout.
* **Key Points:** Starts ~0.78 at step 0. Crashes to ~0.40 by step 6. Shows extreme volatility, with deep troughs (e.g., ~0.36 at step 27) and sharp peaks (e.g., ~0.52 at step 42). It is the only series extending past step 50, ending at ~0.55.
### Key Observations
* **Universal Initial Drop:** All five models exhibit their highest similarity at step 0, followed by a precipitous decline within the first 5-10 reasoning steps.
* **Divergence in Mid-Range:** After the initial drop (post step ~10), the models diverge. Claude-3.7-Sonnet generally maintains the highest similarity, while Magistral-Small and GPT-OSS-20B often occupy the lower range.
* **Volatility:** Magistral-Small displays the most unstable behavior, with large, frequent swings in similarity. In contrast, Claude-3.7-Sonnet shows the smoothest trajectory after its initial decline.
* **Convergence at Extremes:** Despite different paths, the final data points for the series that extend to the right side of the chart (GPT-OSS-20B and Magistral-Small) converge into a similar range (~0.50-0.55) by step 50.
* **Series Length:** The data series for DS-R1-Qwen-7B, Qwen3-8B, and Claude-3.7-Sonnet terminate before step 30, while GPT-OSS-20B and Magistral-Small continue to step 50 and beyond.
### Interpretation
This chart likely visualizes a metric assessing how closely a model's reasoning at a given step (t_i) aligns with the final conclusion (C_T). The declining pattern suggests:
1. **Divergence from the "Truth":** The universal sharp initial drop indicates that the very first steps of reasoning taken by these models are the most similar to the final reference. As the models generate more intermediate steps, their reasoning paths diverge significantly from the reference path (C_T). This could imply that the heuristic reasoning process introduces noise or alternative pathways not present in the reference.
2. **Model "Confidence" or "Focus":** The sustained higher similarity of Claude-3.7-Sonnet might suggest its intermediate reasoning steps remain more consistently aligned with the final outcome's logic. Conversely, the high volatility of Magistral-Small could indicate a less stable or more exploratory reasoning process, where steps frequently deviate and then correct course.
3. **The "Heuristic" Nature:** The x-axis label "(Heuristic)" is critical. It implies the reasoning steps are not necessarily ground-truth steps but are generated by a heuristic process. The declining similarity may therefore measure the drift of this heuristic process from an ideal path over time.
4. **Practical Implication:** For tasks requiring long chains of reasoning, this data suggests that monitoring similarity to a reference at early steps may not be predictive of later steps. The models' behaviors become highly individualized and less aligned with the reference as the process extends. The eventual convergence of some models at step 50 might indicate a return to a more aligned state, but only after considerable deviation.
</details>
Figure 2: Cosine similarity between the embeddings of the $i$ -th reasoning step $t_i$ and the final conclusion $c_T$ , using the average number of reasoning steps for interpolation. The reasoning steps are segmented either by GPT-5 (left) or by heuristic rules (right). See Figure C.1 for results based on the maximum number of reasoning steps used for interpolation.
Given the strong alignment between early reasoning stepsâparticularly the firstâand the final conclusion, we hypothesize that the first step may significantly influence whether the reasoning model can arrive at a correct prediction.
### 3.2 Correlation between the first reasoning step and the final prediction
Given that the first reasoning step closely resembles the final conclusion, we investigate whether the essential reasoning required for the final prediction is already encapsulated in the first step. To this end, we analyze the prediction when conditioned solely on the first reasoning step. Specifically, we compute $c_1=M(p{<think{>}}t_1{</think{>}})$ , and derive the corresponding prediction $q_1=g(c_1)$ , which we compare against the ground truth $a$ . Based on this comparison, we categorize each first reasoning step as either first correct (if $q_1=a$ ) or first incorrect (otherwise).
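This conditioning can be sketched as below; `generate` and `extract` are hypothetical placeholders standing in for the model call $M$ and the extraction function $g$ from the paper's notation.

```python
def first_step_prediction(prompt: str, first_step: str, generate, extract):
    """Compute q_1 by closing the think block right after t_1 and letting
    the model produce a conclusion c_1 from it.

    `generate` stands in for the model call M, and `extract` for the
    answer-extraction function g.
    """
    truncated = f"{prompt}<think>{first_step}</think>"  # p <think> t_1 </think>
    conclusion = generate(truncated)                    # c_1
    return extract(conclusion)                          # q_1 = g(c_1)
```

Comparing `q_1` against the ground truth `a` then labels each first step as "first correct" or "first incorrect".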
Experimental setup. To better analyze the correlation, we sample 64 CoT traces per question using the same datasets as in § 3.1. We exclude questions for which all 64 CoT traces yield either correct or incorrect predictions, as these are considered too easy or too difficult, respectively, leaving 38 questions for DS-R1-Qwen-7B and 37 for Qwen3-8B. For each remaining question and its corresponding first reasoning step $t_1$, we obtain the initial prediction $q_1$ as described above. While GPT-5 provides more reliable segmentation, it is costly and difficult to reproduce; we therefore adopt the heuristic segmentation method in all subsequent experiments, which is shown in § 3.1 to yield results comparable to GPT-5 segmentation. To better visualize the final outcomes, we categorize the questions into three groups based on the pass@1 accuracy of the final prediction $q_T$ (i.e., the average accuracy across the 64 CoT traces), corresponding to the intervals (0, 33%], (33%, 67%], and (67%, 100%). A higher pass@1 indicates a simpler question. This grouping allows us to assess whether our observations hold consistently across varying levels of question difficulty.
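The filtering and difficulty grouping described above can be sketched as follows (function names are ours, not the paper's):

```python
def pass_at_1(outcomes):
    """pass@1 of one question: average correctness over its sampled CoT traces."""
    return sum(outcomes) / len(outcomes)

def difficulty_bucket(p):
    """Map a pass@1 value to its difficulty group; saturated questions are dropped."""
    if p == 0.0 or p == 1.0:
        return None  # all traces incorrect/correct: question excluded
    if p <= 1 / 3:
        return "(0, 33%]"
    if p <= 2 / 3:
        return "(33%, 67%]"
    return "(67%, 100%)"

# A question solved by 20 of 64 traces falls into the hardest bucket.
bucket = difficulty_bucket(pass_at_1([True] * 20 + [False] * 44))
```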
Result. As shown in Figure 3 (Left), the commonly assumed self-correction capability of reasoning models appears to be overstated. When the first reasoning step is incorrect, the model's final prediction is also likely to be incorrect. On average, final prediction accuracy drops by 40% when the first reasoning step is incorrect, with the most substantial decrease (43%) occurring for difficult questions (the (0, 33%] group) and a notable decline (33%) even for easier questions (the (67%, 100%) group). In addition, we compute the Pearson correlation between the correctness of the first prediction $q_1$ and the final prediction $q_T$ over all questions. The coefficient $r = 0.60$ with p-value $p = 0.0$ indicates a moderately strong positive correlation. All these results underscore the pivotal role of the first reasoning step in steering the model toward a correct final answer, particularly in more complex instances where recovery from early mistakes is more challenging. Extending this analysis to DeepSeek and Qwen models of different sizes yields consistent trends: final accuracy remains substantially higher when the first step is correct, and the accuracy gap persists as model scale increases (Figure D.2).
Figure 3 (Right) further illustrates the model's tendency toward overthinking. Even when the first reasoning step is sufficient to arrive at the correct answer, the model continues to generate a substantial number of additional reasoning tokens, comparable in length to those generated following an incorrect first step. Both findings are also observed for Qwen3-8B in Figure D.1, reinforcing this pattern across models.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Grouped Bar Chart: Accuracy by Pass@1 of q_T
### Overview
This is a grouped bar chart comparing the accuracy percentage of two conditions ("First correct" and "First incorrect") across four categories defined by the "Pass@1 of q_T" metric. The chart visually demonstrates how initial correctness correlates with subsequent accuracy.
### Components/Axes
* **Chart Type:** Grouped bar chart.
* **Y-Axis:** Labeled "Accuracy (%)". Scale runs from 0 to 100 in increments of 20.
* **X-Axis:** Labeled "Pass@1 of q_T". Contains four categorical groups:
1. `(0, 33%]`
2. `(33%, 67%]`
3. `(67%, 100%]`
4. `Overall`
* **Legend:** Located in the top-right corner of the chart area.
* **Blue bar with diagonal hatching (\\):** "First correct"
* **Solid orange bar:** "First incorrect"
* **Data Labels:** Numerical accuracy values are printed directly above each bar.
### Detailed Analysis
The chart presents the following data points for each category:
**1. Category: (0, 33%]**
* **First correct (Blue, hatched):** 56.7%
* **First incorrect (Orange, solid):** 14.0%
* **Trend:** A large disparity exists. The "First correct" condition has an accuracy over 4 times higher than the "First incorrect" condition.
**2. Category: (33%, 67%]**
* **First correct (Blue, hatched):** 80.2%
* **First incorrect (Orange, solid):** 43.9%
* **Trend:** Both accuracies increase compared to the previous category. The gap remains substantial, with "First correct" being nearly double the accuracy of "First incorrect".
**3. Category: (67%, 100%]**
* **First correct (Blue, hatched):** 97.2%
* **First incorrect (Orange, solid):** 63.9%
* **Trend:** This category shows the highest accuracies for both conditions. "First correct" approaches near-perfect accuracy. The absolute gap between the two conditions is the largest here (33.3 percentage points).
**4. Category: Overall**
* **First correct (Blue, hatched):** 79.7%
* **First incorrect (Orange, solid):** 39.7%
* **Trend:** The "Overall" performance aggregates the previous categories. The "First correct" accuracy (79.7%) is very close to the value in the (33%, 67%] range, while the "First incorrect" accuracy (39.7%) is lower than its value in the same range.
### Key Observations
1. **Consistent Performance Gap:** In every category, the "First correct" condition yields significantly higher accuracy than the "First incorrect" condition.
2. **Positive Correlation with Pass@1:** For both conditions, accuracy increases as the "Pass@1 of q_T" metric increases from the lowest to the highest bracket.
3. **Non-Linear Aggregation:** The "Overall" accuracy for "First incorrect" (39.7%) is lower than its accuracy in the middle (33%, 67%] bracket (43.9%), suggesting the low accuracy in the (0, 33%] bracket heavily influences the aggregate.
4. **Peak Performance:** The highest observed accuracy is 97.2% for "First correct" in the (67%, 100%] bracket.
### Interpretation
The data strongly suggests that the "Pass@1 of q_T" metric is a powerful predictor of accuracy. A higher Pass@1 score is associated with better performance for both conditions. More critically, the condition of being "First correct" is itself a dominant factor, consistently leading to much higher accuracy than being "First incorrect," regardless of the Pass@1 bracket.
The "Overall" figures indicate that across all cases, an initial correct response ("First correct") is associated with an accuracy of approximately 80%, while an initial incorrect response ("First incorrect") halves that likelihood to about 40%. This implies a strong path-dependency or momentum effect: starting correctly sets a trajectory for sustained accuracy, while starting incorrectly makes recovery to a correct answer much less probable. The chart does not reveal causation: it is unclear if being "First correct" *causes* higher subsequent accuracy or if both are symptoms of an underlying factor like task ease or model capability, but the correlation is stark and consistent.
</details>
<details>
<summary>x5.png Details</summary>

### Visual Description
## Grouped Bar Chart: Token Count After First Step by Pass@1 Accuracy
### Overview
This is a grouped bar chart comparing the average number of tokens generated after the first step for two outcomes ("First correct" and "First incorrect") across different ranges of Pass@1 accuracy for a variable denoted as `q_T`. The chart illustrates a relationship between initial accuracy and the length of subsequent generation.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:** Labeled "Number of tokens after first step (k)". The scale runs from 0 to 16 in increments of 2.
* **X-Axis:** Labeled "Pass@1 of `q_T`". It contains four categorical groups:
1. `(0, 33%]`
2. `(33%, 67%]`
3. `(67%, 100%)`
4. `Overall`
* **Legend:** Positioned in the top-right corner of the chart area.
* **Blue, diagonally hatched bar:** "First correct"
* **Solid orange bar:** "First incorrect"
### Detailed Analysis
The chart presents data for four categories on the x-axis. For each category, two bars are shown side-by-side.
**1. Category: (0, 33%]**
* **First correct (Blue hatched):** The bar reaches a value of **12.6**.
* **First incorrect (Orange solid):** The bar is taller, reaching a value of **14.5**.
* **Trend:** In this lowest accuracy bracket, both outcomes result in relatively high token counts, with incorrect attempts generating approximately 1.9k more tokens on average.
**2. Category: (33%, 67%]**
* **First correct (Blue hatched):** The bar height is **8.3**.
* **First incorrect (Orange solid):** The bar height is **11.4**.
* **Trend:** Both values decrease compared to the first category. The gap between correct and incorrect remains significant (~3.1k tokens).
**3. Category: (67%, 100%)**
* **First correct (Blue hatched):** The bar is the shortest in the chart, at **4.7**.
* **First incorrect (Orange solid):** The bar height is **7.4**.
* **Trend:** This highest accuracy bracket shows the lowest token counts for both outcomes. The difference between correct and incorrect is ~2.7k tokens.
**4. Category: Overall**
* **First correct (Blue hatched):** The bar height is **8.2**.
* **First incorrect (Orange solid):** The bar height is **11.3**.
* **Trend:** The overall averages reflect the pattern seen in the individual brackets: incorrect first attempts lead to longer subsequent generations (~3.1k more tokens).
### Key Observations
1. **Consistent Disparity:** In every single category, the "First incorrect" bar is taller than the "First correct" bar. This indicates that when the first step is incorrect, the model generates more tokens in the subsequent step.
2. **Inverse Relationship with Accuracy:** There is a clear inverse relationship between the Pass@1 accuracy of `q_T` and the number of tokens generated after the first step. As accuracy increases (moving right on the x-axis), the token count for *both* correct and incorrect first steps decreases.
3. **Gap Variation:** The absolute difference in token count between "First incorrect" and "First correct" varies across brackets (~1.9k, ~3.1k, and ~2.7k from the lowest to the highest accuracy bracket), with the overall gap at ~3.1k. The relative difference remains substantial throughout.
4. **Overall Averages:** The "Overall" category provides a summary, showing that across all accuracy levels, a first incorrect step leads to an average of 11.3k tokens, while a first correct step leads to 8.2k tokens.
### Interpretation
The data suggests a strong correlation between the quality of an initial step (measured by Pass@1 accuracy) and the efficiency of the subsequent generation process. Two primary inferences can be drawn:
1. **Error Recovery Cost:** When the model's first step is incorrect, it appears to engage in a longer "recovery" or "correction" process, generating significantly more tokens. This could represent the model backtracking, exploring alternative paths, or producing more verbose justifications to arrive at a correct final answer.
2. **Confidence and Efficiency:** Higher initial accuracy (`q_T`) is associated with greater efficiency in the following step. This could mean that when the model is more confident or correct from the start, it requires less additional text to complete the task. The process becomes more direct and less verbose.
The chart does not provide causal information. It is unknown whether a longer subsequent generation *causes* a correct outcome or is merely *associated* with an incorrect first step. However, the consistent pattern across all accuracy brackets strongly implies that the length of the generation following the first step is a meaningful indicator of the initial step's correctness and the overall difficulty or uncertainty of the problem for the model.
</details>
Figure 3: Accuracy and number of tokens on DS-R1-Qwen-7B. Left: The relationship between the accuracy of the final prediction ($q_T$) and the correctness of the prediction based solely on the first reasoning step ($q_1$) across different difficulty levels. If $q_1$ is incorrect, $q_T$ is more likely incorrect. Right: The number of tokens used for the final prediction after the first reasoning step $t_1$, i.e., the number of tokens used for $[t_2, t_3, \dots, t_T]$. Even when $q_1$ is correct, the model still consumes a large number of tokens in the following reasoning steps, i.e., overthinking.
### 3.3 Minor perturbation to the correct first step leads to significant loss
Building on our findings in § 3.2, which demonstrate a positive correlation between the model's first and final predictions, we further investigate the significance of the first reasoning step by introducing minor perturbations. Specifically, we slightly alter an initially correct reasoning step and provide it as input to the model to assess whether it can recover from such errors.
Experimental setup. Unlike § 3.2, where we analyze the correctness of the first reasoning step $t_1$, here we treat the final correct conclusion $c_T$ (which satisfies $q_T = g(c_T) = a$) as the new first reasoning step, denoted $t^\prime_1$. This choice ensures that the step contains all the reasoning necessary to arrive at the correct answer, which cannot be guaranteed for $t_1$: as illustrated in Figure 3 (Left), an initially correct reasoning step can still lead to an incorrect final prediction. To construct $t^\prime_1$, we apply the following perturbations to $c_T$ (see Appendix G for an example): (1) we remove not only the explicit answer formatting (e.g., \boxed{a}) but also any surrounding sentences that may directly disclose or repeat the final answer; (2) the resulting text from (1) is treated as the correct version of $t^\prime_1$ (serving as our baseline); (3) we generate an incorrect version by replacing the correct answer $a$ in the remaining reasoning with $a \pm 1$ or $a \pm 10$ (AIME answers are integers in the range $[0, 999]$).
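Perturbation (3) amounts to a simple answer substitution; a sketch (the regex-based replacement is our illustration, as the paper does not specify the exact mechanism):

```python
import re

def perturb_answer(step_text: str, answer: int, delta: int) -> str:
    """Replace the correct answer in the reasoning text with answer + delta,
    where delta is drawn from {+-1, +-10}. \\b keeps digits inside larger
    numbers untouched."""
    return re.sub(rf"\b{answer}\b", str(answer + delta), step_text)

t1_correct = "Summing the cases gives 204, so the count is 204."
t1_incorrect = perturb_answer(t1_correct, answer=204, delta=1)
```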
Table 1: Perturbation experiments. Reported accuracy (%) with correct vs. incorrect first step. Even minor perturbations cause significant drops.
| Model | Correct $t^\prime_1$ | Incorrect $t^\prime_1$ |
| --- | --- | --- |
| DS-R1-Qwen-1.5B | 95.4 | 64.4 |
| DS-R1-Qwen-7B | 94.8 | 28.5 |
| DS-R1-Qwen-32B | 100.0 | 85.8 |
| Qwen3-1.7B | 96.0 | 46.6 |
| Qwen3-8B | 71.4 | 37.0 |
| Qwen3-30B-A3B | 100.0 | 74.7 |
| Qwen3-235B-A22B | 100.0 | 78.7 |
These perturbations are minimal: they preserve the core reasoning structure and alter only the final prediction in the incorrect variant. We then combine the prompt $p$ with the modified first reasoning step $t^\prime_1$ and input it to the model as $M(p\,\texttt{<think>}\,t^\prime_1\,\text{Alternatively})$, appending the keyword "Alternatively" so that the model continues reasoning, to assess subsequent reasoning behavior.
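Assembling the model input is plain string concatenation; a sketch (the exact whitespace around the appended keyword is an assumption):

```python
def continuation_input(prompt: str, t1_perturbed: str) -> str:
    """Build the input p <think> t'_1 "Alternatively", so that the model
    resumes reasoning after the injected (possibly perturbed) first step."""
    return f"{prompt}<think>{t1_perturbed} Alternatively"

inp = continuation_input("Solve the competition problem.", "We first try n = 4.")
```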
Result. As shown in Table 1, we make two key observations: (1) Smaller models rarely reach 100% accuracy even when the first reasoning step is correct, suggesting that they may revise or deviate from their initial reasoning. In contrast, larger models (e.g., DS-R1-32B) consistently achieve 100% accuracy given a correct first step, indicating greater stability. (2) There is a substantial drop in accuracy when the first reasoning step is incorrect, highlighting that even minor errors early in the reasoning process can significantly affect the final prediction. These findings further indicate that the LLM's ability to self-correct has been overestimated.
In this section, we observe that the reasoning model is particularly vulnerable at the initial stage of the reasoning process; an error in the first step can propagate and substantially degrade the final prediction. Can we develop a method to identify and retain more promising first reasoning steps while discarding suboptimal ones to enhance the overall generation efficiency?
## 4 Early pruning with hint from first step
In this section, we propose an efficient and straightforward sampling method to identify a promising first reasoning step. By doing so, we can terminate the generation process early when a suboptimal first step is detected, thereby reducing unnecessary computational overhead.
### 4.1 Problem definition
In contrast to the notation introduced in § 3, we incorporate a random seed $Δ$ to introduce stochasticity into the sampling process. Specifically, a sampled trace is computed as $t \oplus c_T = M(p, Δ)$. By varying the random seed $Δ^n$, we obtain diverse generations, denoted as $t^n \oplus c^n_T = M(p, Δ^n)$, where $t^n = [t^n_1, t^n_2, \dots, t^n_T]$. In prior experiments, we sampled 64 CoT traces per question using 64 distinct random seeds $\{Δ^n\}_{n=1}^{64}$.
A widely adopted technique for reasoning tasks is majority voting or self-consistency generation (Wang et al., 2022). To promote exploration of the reasoning space, models are typically sampled at high temperatures, resulting in diverse outputs. Majority voting then serves to aggregate these outputs by selecting the most frequent final prediction. Formally, majority voting over $K$ samples is defined as:
$$
q_{\mathrm{maj}@K} = \operatorname{mode}\bigl(\{q^n_T\}_{n=1}^{K}\bigr), \quad \text{where } q^n_T = g(c^n_T)
$$
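Majority voting itself is a one-liner over the extracted final predictions; a minimal sketch:

```python
from collections import Counter

def majority_vote(final_predictions):
    """mode({q_T^n}): the most frequent final prediction among the K samples."""
    return Counter(final_predictions).most_common(1)[0][0]

answer = majority_vote(["42", "17", "42", "42", "9"])
```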
However, for models generating long CoT traces, majority voting becomes computationally expensive, as it requires independently sampling $N$ complete reasoning paths. In this section, we propose a more efficient approach that samples only $M$ full traces, where $M < N$, while maintaining majority-voting performance comparable to that obtained with $N$ samples.
### 4.2 Methodology
In § 3, we demonstrated a strong positive correlation between the first reasoning step and the final prediction. This observation motivates a method that identifies the top $M$ most promising first reasoning steps out of a total of $N$ , and continues generation only for these selected $M$ candidates, while discarding the remaining $(N-M)$ .
Let a reasoning model generate $N$ candidate first reasoning steps $\{t_1^1, t_1^2, \dots, t_1^N\}$ from a prompt $p$ with different random seeds $\{Δ^n\}_{n=1}^{N}$. Each $t_1^n$ is the first reasoning step of a full reasoning trajectory. We define a scoring function $r: t_1^n \to \mathbb{R}$ that estimates the promise of a first step, e.g., the rating from a reward model. We then select the top $M$ first steps based on their scores:
$$
R_{\mathrm{top}} = \operatorname{TopM}\bigl(\{r(t_1^n)\}_{n=1}^{N}\bigr)
$$
Only the selected $M$ first steps $\{t_1^n \mid n \in R_{\mathrm{top}}\}$ are used for further multi-step generation. The remaining $(N-M)$ are discarded. Since the first step typically requires only a small number of tokens to generate, this selection efficiently reduces computation by avoiding full sampling for the less promising $(N-M)$ candidates.
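Selecting the top-$M$ first steps reduces to an argsort over reward scores; a sketch with hypothetical scores:

```python
def top_m_indices(scores, m):
    """Indices of the M highest-scoring first steps among the N candidates;
    only these continue to full generation, the rest are discarded."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]

# Hypothetical reward-model scores r(t_1^n) for N = 8 candidate first steps.
scores = [0.91, 0.40, 0.75, 0.88, 0.12, 0.66, 0.95, 0.50]
keep = top_m_indices(scores, m=3)
```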
<details>
<summary>x6.png Details</summary>

### Visual Description
## Multi-Panel Line Chart: Model Accuracy Comparison (Maj@K)
### Overview
The image is a grid of 20 line charts arranged in 4 rows and 5 columns. It compares the performance of two methods, "Vanilla N" and "TopM," across four different AI models and five different evaluation datasets. The performance metric is "Maj@K accuracy (%)" plotted against the parameter "K." The charts show how accuracy changes as K increases from 1 to 64.
### Components/Axes
* **Chart Type:** Multi-panel line chart (4x5 grid).
* **Y-Axis (All Charts):** Label: `Maj@K accuracy (%)`. Scale varies per row (model) and column (dataset), but all are linear percentage scales.
* **X-Axis (All Charts):** Label: `K`. Discrete markers at values: 1, 4, 8, 16, 32, 64. A vertical dashed black line is consistently placed at K=16 across all charts.
* **Legend:** Located in the top-left chart ("Average" for the first model). It defines two data series:
* `Vanilla N`: Orange line with circular markers.
* `TopM`: Blue line with diamond markers.
* **Row Labels (Model Names):** Positioned on the far right of each row, written in red text.
* Row 1: `DS-R1-Qwen-1.5B`
* Row 2: `DS-R1-Qwen-32B`
* Row 3: `Qwen3-8B`
* Row 4: `SW-OR1-7B-Preview`
* **Column Labels (Dataset Names):** Positioned at the top of each column.
* Column 1: `Average`
* Column 2: `AIME24`
* Column 3: `AIME25`
* Column 4: `HMMT`
* Column 5: `GPQA`
### Detailed Analysis
Data points are estimated from the grid lines. The vertical dashed line at K=16 is a key reference point.
**Row 1: DS-R1-Qwen-1.5B**
* **Average:** Both lines rise sharply from K=1 to K=8. TopM (blue) consistently leads Vanilla N (orange). At K=16, TopM ~40%, Vanilla N ~38%. Both plateau near 40% by K=64.
* **AIME24:** Steep rise to K=8. TopM leads. At K=16, TopM ~60%, Vanilla N ~58%. Both plateau near 60%.
* **AIME25:** Similar trend. At K=16, TopM ~40%, Vanilla N ~38%. Plateau near 40%.
* **HMMT:** Lower overall accuracy. At K=16, TopM ~25%, Vanilla N ~22%. Plateau near 25%.
* **GPQA:** At K=16, TopM ~40%, Vanilla N ~38%. Plateau near 40%.
**Row 2: DS-R1-Qwen-32B**
* **Average:** Very high accuracy. At K=16, TopM ~70%, Vanilla N ~68%. Both plateau near 70%.
* **AIME24:** At K=16, TopM ~85%, Vanilla N ~83%. Plateau near 85%.
* **AIME25:** At K=16, TopM ~70%, Vanilla N ~68%. Plateau near 70%.
* **HMMT:** At K=16, TopM ~55%, Vanilla N ~52%. Plateau near 55%.
* **GPQA:** At K=16, TopM ~64%, Vanilla N ~63%. Plateau near 64%.
**Row 3: Qwen3-8B**
* **Average:** At K=16, TopM ~80%, Vanilla N ~78%. Plateau near 80%.
* **AIME24:** At K=16, TopM ~85%, Vanilla N ~83%. Plateau near 85%.
* **AIME25:** At K=16, TopM ~70%, Vanilla N ~68%. Plateau near 70%.
* **HMMT:** At K=16, TopM ~60%, Vanilla N ~55%. Plateau near 60%.
* **GPQA:** At K=16, TopM ~62.5%, Vanilla N ~61%. Plateau near 62.5%.
**Row 4: SW-OR1-7B-Preview**
* **Average:** At K=16, TopM ~57%, Vanilla N ~55%. Plateau near 57%.
* **AIME24:** At K=16, TopM ~75%, Vanilla N ~73%. Plateau near 75%.
* **AIME25:** **Anomaly:** TopM dips at K=32 (~56%) before recovering. At K=16, TopM ~60%, Vanilla N ~58%. Both end near 60%.
* **HMMT:** At K=16, TopM ~40%, Vanilla N ~38%. Plateau near 40%.
* **GPQA:** **Anomaly:** Both lines show a peak at K=16 followed by a decline. At K=16, TopM ~53%, Vanilla N ~52%. By K=64, both drop to ~50%.
### Key Observations
1. **Consistent Superiority of TopM:** In 19 out of 20 charts, the blue "TopM" line is at or above the orange "Vanilla N" line for all values of K.
2. **Diminishing Returns:** Accuracy for both methods increases rapidly from K=1 to K=8 or K=16, then plateaus. The vertical dashed line at K=16 often marks the beginning of this plateau.
3. **Model Performance Hierarchy:** The DS-R1-Qwen-32B (Row 2) and Qwen3-8B (Row 3) models achieve the highest absolute accuracy scores across most datasets. The DS-R1-Qwen-1.5B (Row 1) and SW-OR1-7B-Preview (Row 4) show lower overall performance.
4. **Dataset Difficulty:** HMMT appears to be the most challenging dataset, yielding the lowest accuracy percentages for all models. AIME24 and AIME25 show similar, mid-range difficulty.
5. **Notable Anomalies:**
* **SW-OR1-7B-Preview on AIME25:** The TopM series shows an unusual dip at K=32.
* **SW-OR1-7B-Preview on GPQA:** Both methods show a clear performance peak at K=16 followed by a significant decline as K increases to 64, which is contrary to the plateau trend seen elsewhere.
### Interpretation
This visualization demonstrates the effectiveness of the "TopM" method over "Vanilla N" for improving the Maj@K accuracy of various language models on mathematical and reasoning benchmarks (AIME, HMMT, GPQA). The consistent lead of the blue line suggests TopM is a more robust strategy for leveraging multiple samples (K).
The plateau after K=16 indicates that for these tasks and models, generating more than 16-32 candidate solutions provides little to no additional benefit for majority voting accuracy. This is a crucial efficiency insight, suggesting computational resources can be optimized by limiting K.
The performance hierarchy between models is clear, with larger or more specialized models (like DS-R1-Qwen-32B) dominating. The anomalies in the SW-OR1-7B-Preview model's results, particularly the decline on GPQA at high K, may indicate issues with sample quality or diversity for that specific model-dataset combination, where additional samples introduce more noise than signal. The charts provide a clear, comparative snapshot of method efficacy and scaling behavior across different model architectures and task difficulties.
</details>
Figure 4: Majority voting accuracy with different number of samplings for four LLMs. The vertical dashed line denotes the smallest $M$ whose accuracy is equal to or larger than the accuracy of $N=64$ .
### 4.3 Experiments
Setup. We evaluate five families of reasoning models, DS-R1-Qwen (Guo et al., 2025), Qwen3 (Yang et al., 2025a), Skywork-OR1 (SW-OR1) (He et al., 2025), Magistral (Rastogi et al., 2025), and GPT-OSS (Agarwal et al., 2025), on five challenging reasoning benchmarks spanning mathematics, science, and programming: AIME24, AIME25 (MAA Committees, 2025), HMMT Feb 2025 (Balunović et al., 2025), the GPQA Diamond set (Rein et al., 2024), and LiveCodeBench (Jain et al., 2024; latest release, containing problems from Jan–Apr 2025), consisting of 30, 30, 30, 198, and 175 problems, respectively. For decoding, we adopt each model's recommended temperature and top_p, with a maximum generation length of 32K tokens.
We consider values of $N$ and $M$ from the set $\{1, 2, 4, 8, 16, 32, 64\}$. For early pruning, we fix $N = 64$ and select the top $M$ most promising first-step candidates using reward scores predicted by a process reward model (PRM), Qwen2.5-Math-PRM-7B (Zhang et al., 2025). When $M = 64$, the accuracy is identical to that of $N = 64$, since all candidates are chosen. For the PRM, a "step" is defined as a segment of text ending with `\n\n`, and each step receives an individual reward score. The trajectory $t^n_1$ contains multiple such steps, and we use the score of its final step as the overall reward $r(t^n_1)$. Notably, scoring $t^n_1$ with a PRM is inexpensive: $t^n_1$ is short, and the computation costs roughly as much as generating a single token with input $p \oplus t^n_1$.
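The last-step reward extraction can be sketched as follows; the toy scorer is a stand-in for Qwen2.5-Math-PRM-7B, which we cannot call here:

```python
def last_step_reward(t1_text, score_step):
    """Split t_1^n into PRM 'steps' on blank lines and use the reward of
    the final segment as the overall score r(t_1^n)."""
    steps = [s for s in t1_text.split("\n\n") if s.strip()]
    return score_step(steps[-1])

# Toy scorer standing in for the PRM (purely illustrative): longer steps
# get a higher reward, capped at 1.0.
toy_scorer = lambda step: min(1.0, len(step) / 40)

t1 = "First, set up the equation.\n\nThen solve for x to get x = 4."
reward = last_step_reward(t1, toy_scorer)
```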
Unlike the definition used in § 3, where $t^n_1$ terminates upon generating the keyword "Alternatively", we redefine $t^n_1$ in this section to have a fixed token length, $\mathrm{len}(t^n_1)$. The previous definition made it difficult to control the computational budget for generating first steps, as the trigger phrase might not appear or could occur late in the sequence. By fixing the token length, we achieve precise control over the generation budget when sampling $N$ candidate first steps. By default, $\mathrm{len}(t^n_1) = 512$.
#### 4.3.1 Main results
In Figure 4, we analyze the performance as $M$ varies. We find that selecting the top 16 first reasoning steps from 64 candidates and continuing generation from them achieves accuracy on par with, or even exceeding, conventional sampling with $N=64$. This trend is consistent across diverse LLMs and benchmarks. Interestingly, for certain cases, such as HMMT on DS-R1-Qwen-1.5B, and AIME24, AIME25, and GPQA on DS-R1-Qwen-32B, using as few as $M \le 8$ suffices to match the performance obtained with all 64 samples.
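The dashed vertical line in Figure 4 marks the smallest $M$ matching the full-budget accuracy; finding it is a simple scan (the accuracy values below are made up for illustration):

```python
def smallest_matching_m(acc_by_m, full_budget=64):
    """Smallest M whose maj@M accuracy >= the accuracy at the full budget N."""
    target = acc_by_m[full_budget]
    for m in sorted(acc_by_m):
        if acc_by_m[m] >= target:
            return m

# Hypothetical maj@M accuracies for one model/benchmark pair.
acc = {1: 30.0, 4: 36.5, 8: 39.0, 16: 40.2, 32: 40.1, 64: 40.0}
m_star = smallest_matching_m(acc)
```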
Table 2: Early pruning accuracy and efficiency. We select the $M=16$ first steps with the highest reward scores out of $N=64$ candidate first steps. The tokens spent on the 48 discarded first steps are included in the cost of the early pruning method. Early pruning maintains accuracy, and sometimes even improves it, while requiring less than 30% of the original inference budget.
| Model | Sampling | AIME24 | AIME25 | HMMT | GPQA | Avg. | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DS-R1-Qwen-1.5B | $N=64$ | 60.0 | 40.0 | 23.3 | 41.4 | 41.2 | ×1.00 |
| | $N=16$ | 57.0 | 37.6 | 22.4 | 39.4 | 39.1 (-2.1) | ×0.25 |
| | $M=16$ | 60.0 | 40.0 | 26.7 | 39.4 | 41.5 (+0.3) | ×0.28 |
| DS-R1-Qwen-32B | $N=64$ | 86.7 | 73.3 | 56.7 | 64.1 | 70.2 | ×1.00 |
| | $N=16$ | 85.4 | 70.9 | 52.6 | 63.9 | 68.2 (-2.0) | ×0.25 |
| | $M=16$ | 86.7 | 73.3 | 56.7 | 63.6 | 70.1 (-0.1) | ×0.29 |
| Qwen3-8B | $N=64$ | 83.3 | 70.0 | 56.7 | 61.1 | 67.8 | ×1.00 |
| | $N=16$ | 79.8 | 69.0 | 52.3 | 61.3 | 65.6 (-2.2) | ×0.25 |
| | $M=16$ | 83.3 | 70.0 | 60.0 | 62.1 | 68.9 (+1.1) | ×0.28 |
| SW-OR1-7B | $N=64$ | 80.0 | 60.0 | 40.0 | 50.5 | 57.6 | ×1.00 |
| | $N=16$ | 75.7 | 58.0 | 37.9 | 51.9 | 55.9 (-1.7) | ×0.25 |
| | $M=16$ | 80.0 | 60.0 | 40.0 | 52.5 | 57.9 (+0.3) | ×0.29 |
| Magistral-Small | $N=64$ | 86.7 | 83.3 | 76.7 | 70.2 | 79.2 | ×1.00 |
| | $N=16$ | 87.1 | 82.6 | 71.1 | 69.2 | 77.5 (-1.7) | ×0.25 |
| | $M=16$ | 86.7 | 83.3 | 73.3 | 70.2 | 78.4 (-0.8) | ×0.28 |
| GPT-OSS-20B | $N=64$ | 86.7 | 83.3 | 80.0 | 73.2 | 80.8 | ×1.00 |
| | $N=16$ | 85.3 | 81.7 | 73.3 | 72.7 | 78.3 (-2.5) | ×0.25 |
| | $M=16$ | 86.7 | 83.3 | 80.0 | 73.2 | 80.8 (-0.0) | ×0.27 |
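The cost column can be approximated from token counts: all $N$ candidates pay for the 512-token first step, but only the selected $M$ traces are completed. A sketch (the 10k-token average trace length is an assumed figure for illustration):

```python
def relative_cost(n, m, first_step_tokens, avg_trace_tokens):
    """Token cost of early pruning relative to vanilla N-sample decoding.
    All n candidates generate the first step; only m continue to full length."""
    pruned = n * first_step_tokens + m * (avg_trace_tokens - first_step_tokens)
    return pruned / (n * avg_trace_tokens)

cost = relative_cost(n=64, m=16, first_step_tokens=512, avg_trace_tokens=10_000)
```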
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Charts: Model Performance Comparison (pass@K)
### Overview
The image displays three horizontally arranged line charts comparing the performance of two methods ("Vanilla N" and "TopM") across three different language models. The metric is "pass@K," plotted against the parameter "K." All charts show an increasing trend for both methods as K increases.
### Components/Axes
* **Titles (Top-Center of each subplot):**
* Left Chart: `DS-R1-Qwen-32B`
* Middle Chart: `Qwen3-8B`
* Right Chart: `GPT-OSS-20B`
* **Y-Axis Label (Leftmost chart):** `pass@K`
* **X-Axis Label (Centered below all charts):** `K`
* **Legend (Bottom-right corner of the leftmost chart):**
* `Vanilla N`: Orange line with plus (`+`) markers.
* `TopM`: Blue line with circle (`o`) markers.
* **Axis Scales:**
* **X-Axis (K):** Linear scale from 1 to 15, with major ticks at every odd number (1, 3, 5, 7, 9, 11, 13, 15).
* **Y-Axis (pass@K):** Linear scales differ per chart:
* DS-R1-Qwen-32B: ~48 to ~63
* Qwen3-8B: ~42 to ~55
* GPT-OSS-20B: ~40 to ~61
### Detailed Analysis
**1. DS-R1-Qwen-32B (Left Chart)**
* **Trend:** Both lines show a steep initial rise that gradually flattens. The `TopM` (blue) line is consistently above the `Vanilla N` (orange) line.
* **Approximate Data Points:**
* **K=1:** Vanilla N ≈ 49, TopM ≈ 49 (nearly identical start).
* **K=3:** Vanilla N ≈ 55, TopM ≈ 56.
* **K=7:** Vanilla N ≈ 59, TopM ≈ 62.
* **K=15:** Vanilla N ≈ 63, TopM ≈ 63 (converge at the highest K).
**2. Qwen3-8B (Middle Chart)**
* **Trend:** Similar logarithmic growth pattern. The performance gap between `TopM` and `Vanilla N` is smaller here than in the first chart.
* **Approximate Data Points:**
* **K=1:** Vanilla N ≈ 43, TopM ≈ 44.
* **K=5:** Vanilla N ≈ 49, TopM ≈ 51.
* **K=11:** Vanilla N ≈ 53, TopM ≈ 54.
* **K=15:** Vanilla N ≈ 55, TopM ≈ 55 (very close convergence).
**3. GPT-OSS-20B (Right Chart)**
* **Trend:** Consistent upward trend. `TopM` maintains a clear and steady lead over `Vanilla N` across the entire range of K.
* **Approximate Data Points:**
* **K=1:** Vanilla N ≈ 40, TopM ≈ 43.
* **K=5:** Vanilla N ≈ 53, TopM ≈ 55.
* **K=9:** Vanilla N ≈ 57, TopM ≈ 59.
* **K=15:** Vanilla N ≈ 61, TopM ≈ 61 (converge at the final point).
### Key Observations
1. **Universal Superiority of TopM:** In all three models and at nearly every measured value of K, the `TopM` method yields a higher `pass@K` score than the `Vanilla N` method.
2. **Diminishing Returns:** The rate of improvement for `pass@K` slows as K increases for both methods, following a characteristic logarithmic curve.
3. **Convergence at High K:** For the DS-R1-Qwen-32B and Qwen3-8B models, the performance of the two methods becomes nearly indistinguishable at the highest measured K value (K=15). For GPT-OSS-20B, they converge at the final point.
4. **Model-Specific Scaling:** The absolute `pass@K` values and the size of the performance gap between methods vary by model. The DS-R1-Qwen-32B model shows the largest initial gap, while the Qwen3-8B model shows the smallest.
### Interpretation
This data demonstrates the effectiveness of the `TopM` method over the `Vanilla N` baseline for improving the pass@K metric across diverse model architectures and sizes. The `pass@K` metric typically measures the probability that at least one of K generated samples is correct. The consistent advantage of `TopM` suggests it is a more reliable sampling or ranking strategy for generating correct outputs, especially when the budget for attempts (K) is low to moderate.
The convergence at high K indicates that given enough attempts, the simpler `Vanilla N` method can eventually match the performance of the more sophisticated `TopM` method. Therefore, the primary value of `TopM` lies in **efficiency**âachieving higher success rates with fewer generations, which translates to reduced computational cost and latency. The variation across models implies that the benefit of `TopM` may depend on the underlying model's capabilities or training, being most pronounced in the DS-R1-Qwen-32B model in this comparison.
</details>
Figure 5: The pass rate on LiveCodeBench, where we set $N=16$ for early pruning (TopM).
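The pass@K numbers in Figure 5 follow the usual definition: the probability that at least one of K sampled completions passes. A minimal sketch of the standard unbiased estimator (the sample counts below are illustrative, not the paper's evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of
    k samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:  # fewer than k incorrect samples: every draw succeeds
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average over a benchmark: one (n, c) pair per problem (hypothetical).
results = [(16, 8), (16, 0), (16, 16)]
avg_pass_at_5 = sum(pass_at_k(n, c, 5) for n, c in results) / len(results)
```

The estimator avoids the bias of simply taking the first k samples by averaging over all size-k subsets of the n generations.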
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Charts: Average maj@K Accuracy vs. K for Different Models and Sequence Lengths
### Overview
The image contains two side-by-side line charts. Both charts plot "Average maj@K accuracy (%)" on the y-axis against "K" on the x-axis. The left chart compares the performance of different model architectures and sizes. The right chart compares performance for different input sequence lengths (`len(t1)`). Both charts show accuracy increasing with K and then plateauing.
### Components/Axes
**Common Elements:**
* **X-axis:** Label: `K`. Ticks and values: `1`, `4`, `8`, `16`, `32`, `64`. The scale is logarithmic.
* **Y-axis:** Label: `Average maj@K accuracy (%)`. Ticks and values: `50`, `55`, `60`, `65`, `70`.
* **Chart Type:** Line chart with markers at data points.
* **Grid:** Light horizontal grid lines are present at the major y-axis ticks.
**Left Chart - Model Comparison:**
* **Legend:** Located in the bottom-right quadrant of the chart area.
* `Vanilla N` - Orange line with downward-pointing triangle markers.
* `Qwen2.5-Math-PRM-7B` - Light blue line with circle markers.
* `Qwen2.5-Math-PRM-72B` - Dark blue line with square markers.
* `Qwen2.5-Math-RM-72B` - Pink line with diamond markers.
* **Annotation:** A vertical dashed black line is drawn at `K=16`.
**Right Chart - Sequence Length Comparison:**
* **Legend:** Located in the bottom-right quadrant of the chart area.
* `Vanilla N` - Orange line with downward-pointing triangle markers.
* `len(t1)=256` - Dark blue line with square markers.
* `len(t1)=512` - Light blue line with circle markers.
* `len(t1)=1024` - Pink line with diamond markers.
### Detailed Analysis
**Left Chart (Model Comparison):**
* **Trend Verification:** All four lines show a steep upward slope from K=1 to K=8, a gentler slope from K=8 to K=16, and then plateau with minimal change from K=16 to K=64.
* **Data Points (Approximate Values):**
* **Vanilla N (Orange, Triangle):** K=1: ~50%, K=4: ~62%, K=8: ~67%, K=16: ~70%, K=32: ~71%, K=64: ~72%.
* **Qwen2.5-Math-PRM-7B (Light Blue, Circle):** K=1: ~47%, K=4: ~68%, K=8: ~70%, K=16: ~72%, K=32: ~72%, K=64: ~72%.
* **Qwen2.5-Math-PRM-72B (Dark Blue, Square):** K=1: ~46%, K=4: ~65%, K=8: ~70%, K=16: ~72%, K=32: ~72%, K=64: ~72%.
* **Qwen2.5-Math-RM-72B (Pink, Diamond):** K=1: ~44%, K=4: ~63%, K=8: ~69%, K=16: ~72%, K=32: ~72%, K=64: ~72%.
* **Key Observations:** At K=1, the `Vanilla N` model has the highest accuracy (~50%). By K=4, the `Qwen2.5-Math-PRM-7B` model overtakes it. From K=16 onward, all three Qwen models converge to nearly identical accuracy (~72%), slightly outperforming `Vanilla N` (~71-72%). The vertical line at K=16 highlights the point where performance largely stabilizes.
**Right Chart (Sequence Length Comparison):**
* **Trend Verification:** All four lines follow the same general trend: a sharp rise from K=1 to K=8, followed by a plateau. The `len(t1)=1024` line shows a slight peak at K=32 before settling.
* **Data Points (Approximate Values):**
* **Vanilla N (Orange, Triangle):** K=1: ~51%, K=4: ~62%, K=8: ~67%, K=16: ~70%, K=32: ~71%, K=64: ~72%.
* **len(t1)=256 (Dark Blue, Square):** K=1: ~51%, K=4: ~66%, K=8: ~65%, K=16: ~70%, K=32: ~71%, K=64: ~72%.
* **len(t1)=512 (Light Blue, Circle):** K=1: ~48%, K=4: ~66%, K=8: ~69%, K=16: ~72%, K=32: ~72%, K=64: ~72%.
* **len(t1)=1024 (Pink, Diamond):** K=1: ~51%, K=4: ~63%, K=8: ~69%, K=16: ~72%, K=32: ~73%, K=64: ~72%.
* **Key Observations:** At K=1, performance is similar across all sequence lengths (~48-51%). At K=4, the `len(t1)=256` and `len(t1)=512` conditions show a slight advantage. By K=16 and beyond, the longer sequence lengths (`512` and `1024`) achieve marginally higher accuracy (~72-73%) compared to the shorter `256` length and the `Vanilla N` baseline (~71-72%). The `len(t1)=1024` condition shows the highest single point at K=32 (~73%).
### Interpretation
The data demonstrates two key findings related to the "maj@K" (majority vote at K) evaluation metric:
1. **Model Architecture and Scale:** The left chart suggests that specialized math models (Qwen2.5-Math variants) can surpass a baseline (`Vanilla N`) when using majority voting, but only after a sufficient number of samples (K≥4). The performance of the largest models (PRM-72B and RM-72B) is very similar, indicating potential diminishing returns from scaling beyond 72B parameters for this specific task and metric. The convergence of all Qwen models at K≥16 implies that with enough samples, the advantage of model size or specific training (PRM vs. RM) becomes less critical for achieving high majority-vote accuracy.
2. **Input Sequence Length:** The right chart indicates that providing longer input contexts (`len(t1)=512` or `1024`) can yield a small but consistent accuracy benefit over shorter contexts (`256`) when K is large (K≥16). This suggests that for problems requiring majority voting, a richer initial context helps the model generate a set of solutions where the correct answer is more frequently represented, making the majority vote more reliable. The peak at K=32 for the longest sequence is an interesting anomaly that might warrant further investigation.
**Overall Implication:** To maximize accuracy using the maj@K strategy, one should use a capable, specialized model (like the Qwen2.5-Math series) with a sufficiently long input context, and set K to at least 16. Increasing K beyond 32 provides minimal additional benefit, representing a trade-off between computational cost and performance gain.
</details>
Figure 6: Average accuracy across AIME24, AIME25 and HMMT on DS-R1-Qwen-32B. Left: Comparison of reward signals derived from different reward models. The choice of reward model has minimal impact on overall performance. Right: Effect of varying the length of the first reasoning step. Using a very short first step, like 256 tokens, leads to suboptimal performance, likely because it provides insufficient reasoning context to effectively evaluate the quality of the step.
Table 2 reports the detailed accuracy and token consumption of different methods. When using $M=16$ , early pruning consistently matches the accuracy of $N=64$ across a range of LLMs and benchmarks, while substantially outperforming $N=16$ under a comparable token budget. Notably, for Qwen3-8B, early pruning even yields a 1.1% improvement in average accuracy. Importantly, these gains come at less than 30% of the inference cost of majority voting with $N=64$ , underscoring the strong efficiency advantage of our method.
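Majority voting (maj@K), the metric reported in Table 2, marks a problem as solved when the most frequent final answer among K sampled completions matches the gold answer. A minimal sketch, assuming final answers have already been extracted from the completions (the sample answers are hypothetical):

```python
from collections import Counter

def maj_at_k(answers, gold, k):
    """Majority vote over the first k sampled answers: correct iff the
    most frequent answer among them equals the gold answer.
    Ties are broken by first occurrence (Counter insertion order)."""
    votes = Counter(answers[:k])
    majority, _ = votes.most_common(1)[0]
    return majority == gold

samples = ["42", "41", "42", "42", "7", "42"]
maj_at_k(samples, "42", 4)  # "42" wins 3-1 among the first 4 samples
```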
For the code generation benchmark, LiveCodeBench, where majority voting is not applicable, we present the pass rate in Figure 5. Early pruning consistently surpasses standard sampling given the same number of complete samples.
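The early-pruning loop evaluated above can be sketched end to end. This is an illustrative outline rather than the paper's implementation; `generate`, `score`, and `vote` are hypothetical stand-ins for the reasoning model, the reward model, and the final aggregation (majority vote or pass check):

```python
def early_pruning(problem, generate, score, vote, n=16, m=4, first_len=512):
    """Sample n candidate first steps, keep the top-m by reward-model
    score, and complete only those m trajectories."""
    # 1) Draw n short prefixes: the candidate first reasoning steps.
    firsts = [generate(problem, max_tokens=first_len) for _ in range(n)]
    # 2) Score each prefix with the reward model; keep the m best.
    ranked = sorted(firsts, key=lambda t: score(problem, t), reverse=True)
    kept = ranked[:m]
    # 3) Continue generation only from the kept prefixes, then aggregate.
    completions = [generate(problem, prefix=t) for t in kept]
    return vote(completions)
```

The savings come from step 3: only m of the n sampled trajectories are ever completed, while the discarded n-m candidates cost only `first_len` tokens each.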
#### 4.3.2 Ablation studies
Here we further validate our default settings.
Choice of reward model. In Figure 6 (Left), we evaluate two additional reward models: a larger PRM, Qwen2.5-Math-PRM-72B, and an outcome reward model, Qwen2.5-Math-RM-72B (Yang et al., 2024). The results indicate that the choice of reward model has minimal impact on performance. Notably, the smaller 7B PRM achieves comparable results, highlighting the efficiency of our approach.
Length of first step. In Figure 6 (Right), we examine the impact of varying the length of the first reasoning step. We observe that the shortest first step (i.e., $len(t^n_1)=256$ ) leads to degraded performance. We hypothesize that shorter $t^n_1$ sequences lack sufficient reasoning content, making them less informative for reliable reward model evaluation. Nevertheless, setting $len(t^n_1)\geq 512$ tokens yields consistently better performance than vanilla sampling.
Table 3: Average maj@K for early pruning, with first step defined by length or phrase.
| Model | Length | Phrase |
| --- | --- | --- |
| DS-R1-Qwen-1.5B | 41.5 | 43.8 |
| DS-R1-Qwen-32B | 70.1 | 70.4 |
| Qwen3-8B | 68.9 | 68.2 |
| SW-OR1-7B | 57.9 | 57.7 |
Effect of first step split. Table 3 examines how defining the first step influences early pruning performance. The heuristic approach follows the method described in § 3. Overall, both definitions yield comparable results. Nevertheless, we recommend using the token-count-based definition, as it provides a clearer way to manage the token budget across candidate first steps. Moreover, trigger phrases that signal step boundaries may vary across LLMs.
Table 4: Time spent by early pruning.
| Model | Vanilla sampling | Early pruning |
| --- | --- | --- |
| DS-R1-Qwen-1.5B | $\times$ 1.00 | $\times$ 0.27 |
| Qwen3-8B | $\times$ 1.00 | $\times$ 0.37 |
Overhead from reward model. Relative to vanilla sampling, early pruning requires scoring the first step with a reward model, which introduces additional overhead. To ensure a fair comparison, we avoid using extra GPUs for deploying the reward model. Instead, our procedure is as follows: (1) load the reasoning model to generate candidate first steps and then offload it; (2) load the reward model on the same GPU to evaluate these steps and offload it; and (3) reload the reasoning model to continue generation from the selected first steps. The timing results are reported in Table 4. Notably, early pruning remains efficient both in terms of token usage and runtime, since evaluating the first step with the reward model is inexpensive, comparable to computing embeddings for the short first steps.
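The three-stage load/offload procedure can be sketched generically: run phases in sequence, releasing each model before the next is loaded so that the reasoning model and the reward model never occupy GPU memory at the same time. This is a simplified illustration, not the actual serving code; each phase is a hypothetical `(load_fn, step_fn)` pair wrapping model loading and generation or scoring:

```python
import gc

def run_phases(phases, state=None):
    """Execute phases sequentially, freeing each model before the next
    one is loaded. Each phase is (load_fn, step_fn): load_fn returns a
    model, step_fn(model, state) returns the next state."""
    for load_fn, step_fn in phases:
        model = load_fn()            # e.g. load the reasoner or reward model
        state = step_fn(model, state)
        del model                    # offload before loading the next model
        gc.collect()                 # with CUDA, also torch.cuda.empty_cache()
    return state
```

With three phases (reasoner generates candidate first steps, reward model scores them, reasoner resumes from the selected prefixes), the peak memory footprint matches that of the largest single model.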
## 5 Conclusion
In this paper, we empirically demonstrate that the first reasoning step plays a critical role in determining the final outcome of a model's reasoning process. Errors introduced early can significantly degrade overall performance. Motivated by this observation, we propose an efficient sampling strategy that identifies and retains high-quality first reasoning steps, reducing inference computation by up to 70% across three model families. These findings suggest that improving or exploiting the very first step is a promising direction for building more accurate and efficient reasoning LLMs.
## Acknowledgements
This research was partly supported by the Netherlands Organization for Scientific Research (NWO) under project number VI.C.192.080, 024.004.022, NWA.1389.20.183, and KICH3.LTP.20.006, and the European Union under grant agreements No. 101070212 (FINDHR) and No. 101201510 (UNITE). We are very grateful for the computation support from eBay. Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of their respective employers, funders and/or granting authorities.
## References
- Agarwal et al. (2025) Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.
- Aggarwal et al. (2023) Pranjal Aggarwal, Aman Madaan, Yiming Yang, et al. Let's sample step by step: Adaptive-consistency for efficient reasoning and coding with llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396, 2023.
- Anthropic (2025) Anthropic. Claude 3.7 sonnet system card. 2025. URL https://api.semanticscholar.org/CorpusID:276612236.
- Balunović et al. (2025) Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/.
- Barbero et al. (2025) Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token? arXiv preprint arXiv:2504.02732, 2025.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Chen et al. (2024a) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024a.
- Chen et al. (2024b) Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, and Maarten de Rijke. The SIFo benchmark: Investigating the sequential instruction following ability of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1691–1706. Association for Computational Linguistics, 2024b. URL https://aclanthology.org/2024.findings-emnlp.92/.
- Chiang & Lee (2024) Cheng-Han Chiang and Hung-yi Lee. Over-reasoning and redundant calculation of large language models. In Yvette Graham and Matthew Purver (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 161–169, St. Julian's, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-short.15/.
- Formal et al. (2021) Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. Splade: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2288–2292, 2021.
- Formal et al. (2022) Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. From distillation to hard negative sampling: Making sparse neural ir models more effective, 2022. URL https://arxiv.org/abs/2205.04733.
- Fu et al. (2025a) Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025a. URL https://openreview.net/forum?id=wpK4IMJfdX.
- Fu et al. (2025b) Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. arXiv preprint arXiv:2508.15260, 2025b.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- He et al. (2025) Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog.
- Hestness et al. (2017) Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Hou et al. (2025) Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296, 2025.
- Hurst et al. (2024) Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Li et al. (2024) Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. In ICLR, 2024.
- Liao & Monz (2024) Baohao Liao and Christof Monz. 3-in-1: 2d rotary adaptation for efficient finetuning, efficient batching and composability. arXiv preprint arXiv:2409.00119, 2024.
- Liao et al. (2025a) Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, and Caiming Xiong. Fractured chain-of-thought reasoning. arXiv preprint arXiv:2505.12992, 2025a.
- Liao et al. (2025b) Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient LLM reasoning. arXiv preprint arXiv:2501.19324, 2025b.
- Liu et al. (2024a) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024a. doi: 10.1162/tacl_a_00638. URL https://aclanthology.org/2024.tacl-1.9/.
- Liu et al. (2024b) Yijin Liu, Xianfeng Zeng, Chenze Shao, Fandong Meng, and Jie Zhou. Instruction position matters in sequence generation with large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 11652–11663, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.693. URL https://aclanthology.org/2024.findings-acl.693/.
- Luo et al. (2025) Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025.
- MAA Committees (2025) MAA Committees. AIME Problems and Solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2025.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- Pezeshkpour & Hruschka (2023) Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
- Qin et al. (2023) Guanghui Qin, Yukun Feng, and Benjamin Van Durme. The NLP task effectiveness of long-range transformers. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 3774–3790. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.eacl-main.273. URL https://aclanthology.org/2023.eacl-main.273/.
- Rastogi et al. (2025) Abhinav Rastogi, Albert Q Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, et al. Magistral. arXiv preprint arXiv:2506.10910, 2025.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1908.10084.
- Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
- Shen et al. (2023) Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, and Yi Zhang. Positional description matters for transformers arithmetic. arXiv preprint arXiv:2311.14737, 2023.
- Shi et al. (2024) Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic investigation of position bias in pairwise comparative assessments by llms. arXiv preprint arXiv:2406.07791, 2024.
- Team et al. (2024) Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Team (2025) Qwen Team. Qwen3, April 2025. URL https://qwenlm.github.io/blog/qwen3/.
- Wan et al. (2025) Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3613–3635, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.184. URL https://aclanthology.org/2025.naacl-long.184/.
- Wang et al. (2024) Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9440–9450. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.511/.
- Wang et al. (2020) Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2002.10957.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wang et al. (2025a) Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, et al. Thoughts are all over the place: On the underthinking of o1-like llms. arXiv preprint arXiv:2501.18585, 2025a.
- Wang et al. (2025b) Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. Chain-of-probe: Examining the necessity and accuracy of CoT step-by-step. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 2586–2606. Association for Computational Linguistics, 2025b. ISBN 979-8-89176-195-7. URL https://aclanthology.org/2025.findings-naacl.140/.
- Wen et al. (2025) Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond. arXiv preprint arXiv:2503.10460, 2025.
- Xiao et al. (2024) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.
- Xiong et al. (2025) Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, and Sainbayar Sukhbaatar. Stepwiser: Stepwise generative judges for wiser reasoning. arXiv preprint arXiv:2508.19229, 2025.
- Xue et al. (2023) Mingfeng Xue, Dayiheng Liu, Wenqiang Lei, Xingzhang Ren, Baosong Yang, Jun Xie, Yidan Zhang, Dezhong Peng, and Jiancheng Lv. Dynamic voting for efficient reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 3085–3104, 2023.
- Yang et al. (2024) An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
- Yang et al. (2025a) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.
- Yang et al. (2025b) Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Zheng Lin, Li Cao, and Weiping Wang. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895, 2025b.
- Zhang et al. (2024a) Yusen Zhang, Sarkar Snigdha Sarathi Das, and Rui Zhang. Verbosity $\neq$ veracity: Demystify verbosity compensation behavior of large language models. arXiv preprint arXiv:2411.07858, 2024a.
- Zhang et al. (2025) Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025.
- Zhang et al. (2024b) Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, and Zhangyang Wang. Found in the middle: How language models use long contexts better via plug-and-play positional encoding. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
- Zheng et al. (2024) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2024.
## Appendix A Keywords for Reasoning Step Segmentation
Good reasoning-switching keywords should signal a clear switch in reasoning; they should neither occur too frequently, which often indicates only minor digressions, nor so rarely that they fail as reliable delimiters. We collect a list of reasoning-switching keywords from prior work (Muennighoff et al., 2025; Hou et al., 2025). To assess which keywords could serve this role, we compute their average frequency per question in model outputs. After considering the frequencies and manually checking the segmentation results, we found that "alternatively" strikes a balance for DS-R1-Qwen-7B and Qwen3-8B, appearing often enough to capture step boundaries while still reflecting genuine shifts in reasoning. We use both "alternatively" and "wait" for Claude-3.7-Sonnet, GPT-OSS-20B and Magistral-Small.
Table A.1: Average frequency per question of discourse markers in model outputs.
| Model | but | wait | alternatively | hmm | hold on | let me confirm | however |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DS-R1-Qwen-7B | 82 | 72 | 14 | 7 | 3 | $<$ 1 | 10.1 |
| Qwen3-8B | 91 | 66 | 20 | 6 | 1 | $<$ 1 | 10.4 |
| Claude-3.7-Sonnet | 28.5 | 22.6 | 0.2 | 6.1 | 0.7 | 0.0 | 0.9 |
| Magistral-Small | 47.1 | 54.8 | 3.0 | 0.0 | 0.9 | 0.2 | 1.4 |
| GPT-OSS-20B | 171.6 | 60.5 | 4.9 | 1.2 | 0.7 | 0.0 | 5.9 |
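The per-question frequencies in Table A.1 can be computed with a sketch like the following; the keyword list mirrors the table's columns, and the exact matching rules used in the paper may differ:

```python
import re
from collections import Counter

KEYWORDS = ["but", "wait", "alternatively", "hmm", "hold on",
            "let me confirm", "however"]

def keyword_frequencies(outputs):
    """Average occurrences of each switch keyword per model output,
    matched case-insensitively on word boundaries."""
    totals = Counter()
    for text in outputs:
        low = text.lower()
        for kw in KEYWORDS:
            totals[kw] += len(re.findall(rf"\b{re.escape(kw)}\b", low))
    return {kw: totals[kw] / len(outputs) for kw in KEYWORDS}

def split_on(text, keyword="alternatively"):
    """Segment a reasoning trace at each occurrence of the keyword."""
    return re.split(rf"\b{keyword}\b", text, flags=re.IGNORECASE)
```

`split_on` illustrates the phrase-based segmentation: each match of the chosen trigger keyword starts a new reasoning step.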
## Appendix B Model Identifiers
For reproducibility, we provide the exact model identifiers used in our experiments. The models are: DeepSeek-R1-Distill-Qwen-7B (deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), Qwen3-8B (Qwen/Qwen3-8B), Claude 3.7 (claude-3-7-sonnet-20250219, accessed via the Claude API), GPT-OSS-20B (openai/gpt-oss-20b), and Magistral-Small (mistralai/Magistral-Small-2509).
## Appendix C Additional Similarity Analyses
We report the cosine similarity between the embedding of the $i$ -th reasoning step $t_i$ and the final conclusion $c_T$ , with each trace's similarity curve interpolated to a common length given by the maximum number of reasoning steps (Figure C.1).
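A minimal sketch of this computation, assuming precomputed step embeddings (e.g., from a sentence-embedding model) and linear interpolation onto a common grid:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_curve(step_embs, conclusion_emb, num_points):
    """Cosine similarity of each step embedding to the conclusion,
    resampled by linear interpolation onto a fixed-size grid so that
    traces with different numbers of steps can be averaged."""
    sims = [cosine(e, conclusion_emb) for e in step_embs]
    xs = np.linspace(0.0, 1.0, num=len(sims))      # original step positions
    grid = np.linspace(0.0, 1.0, num=num_points)   # common grid
    return np.interp(grid, xs, sims)
```

Averaging the resampled curves across questions then yields a single per-model trajectory like those plotted in Figure C.1.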
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Similarity Score vs. Reasoning Step for Five AI Models
### Overview
The image is a line chart plotting a similarity metric against reasoning steps for five different large language models. The chart illustrates how the similarity between a model's output at a given reasoning step (`t_i`) and a reference output (`C_T`) changes as the reasoning process progresses. The reference appears to be associated with "GPT5".
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** `Reasoning step t_i (GPT5)`
* **Scale:** Linear scale from 0 to approximately 180.
* **Major Tick Marks:** 0, 25, 50, 75, 100, 125, 150, 175.
* **Y-Axis (Vertical):**
* **Label:** `Similarity(C_T, t_i)`
* **Scale:** Linear scale from 0.50 to 0.85.
* **Major Tick Marks:** 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85.
* **Legend:**
* **Position:** Top-right corner of the chart area.
* **Entries (with associated line colors):**
1. `DS-R1-Qwen-7B` (Blue line)
2. `Qwen3-8B` (Orange line)
3. `Claude-3.7-Sonnet` (Green line)
4. `GPT-OSS-20B` (Purple line)
5. `Magistral-Small` (Brown line)
### Detailed Analysis
The chart displays five distinct data series, each representing a model's similarity trajectory. All series begin at step 0 with relatively high similarity scores and show a general downward trend as reasoning steps increase, though with different rates of decay and patterns of fluctuation.
1. **DS-R1-Qwen-7B (Blue):**
* **Trend:** Steep, consistent decline from the start, followed by a more gradual decrease and stabilization.
* **Data Points (Approximate):** Starts at ~0.84 (step 0). Drops sharply to ~0.65 by step 25. Continues a steady decline to ~0.55 by step 75. From step 75 to 125, it fluctuates between ~0.52 and ~0.55, ending near 0.53 at step 125.
2. **Qwen3-8B (Orange):**
* **Trend:** The steepest initial decline, followed by a plateau and a slight late rise.
* **Data Points (Approximate):** Starts highest at ~0.86 (step 0). Plummets to ~0.70 by step 25. Continues a strong decline to ~0.55 by step 75. It then stabilizes, fluctuating between ~0.54 and ~0.57 from step 75 to 125. Shows a slight upward trend from step 125 to its endpoint near step 140, reaching ~0.58.
3. **Claude-3.7-Sonnet (Green):**
* **Trend:** Initial decline, followed by a significant mid-chart rise and subsequent fall.
* **Data Points (Approximate):** Starts at ~0.75 (step 0). Declines to ~0.65 by step 25. Fluctuates around 0.60-0.65 until step 75. Then, it exhibits a notable rise, peaking at ~0.66 around step 95. After this peak, it declines again with high volatility, ending near 0.60 at step 135.
4. **GPT-OSS-20B (Purple):**
* **Trend:** Starts the lowest, shows a steady decline with moderate fluctuations, and extends the furthest along the x-axis.
* **Data Points (Approximate):** Starts at ~0.64 (step 0). Declines to ~0.58 by step 25. Continues a gradual, fluctuating descent to a low of ~0.48 around step 160. From step 160 to 180, it shows a recovery trend, rising back to ~0.54.
5. **Magistral-Small (Brown):**
* **Trend:** Moderate initial decline, followed by a long, relatively stable plateau.
* **Data Points (Approximate):** Starts at ~0.79 (step 0). Drops to ~0.68 by step 25. Declines more slowly to ~0.60 by step 75. From step 75 to 115, it remains remarkably stable, hovering tightly around 0.60-0.61. The line ends at approximately step 115.
### Key Observations
* **Universal Initial Decay:** All five models show a marked decrease in similarity to the reference within the first 25-50 reasoning steps.
* **Divergent Mid-Chart Behavior:** After the initial decay, model behaviors diverge significantly. Claude-3.7-Sonnet uniquely rises in the middle, GPT-OSS-20B continues a slow decline, and Magistral-Small plateaus.
* **Final Similarity Range:** By their respective endpoints, the models' similarity scores cluster in a lower range (approximately 0.48 to 0.64) compared to their starting points (0.64 to 0.86).
* **Volatility:** The green (Claude) and purple (GPT-OSS) lines exhibit the most high-frequency fluctuation, suggesting more variable similarity at each step. The brown (Magistral) line is the smoothest during its plateau phase.
### Interpretation
This chart likely visualizes a study on the internal consistency of different AI models' reasoning chains; the "(GPT5)" in the x-axis label indicates that the reasoning steps were segmented by GPT-5. The `Similarity(C_T, t_i)` metric quantifies how closely a model's intermediate reasoning step `t_i` aligns with its final conclusion `C_T`.
The data suggests that as models engage in longer reasoning processes (more steps), their intermediate steps become less similar to the final reference output. This could indicate:
1. **Reasoning Divergence:** Models may explore different logical paths or incorporate more model-specific knowledge as reasoning progresses, moving away from the reference's "thought process."
2. **Error Accumulation:** Small deviations early in the chain may compound, leading to greater dissimilarity later.
3. **Model-Specific Strategies:** The distinct trajectories (e.g., Claude's mid-rise, Magistral's plateau) imply different internal mechanisms for maintaining consistency or recovering alignment during extended reasoning. The model that starts with the lowest similarity (GPT-OSS-20B) also shows the capacity for late-stage recovery, which is a notable anomaly.
In essence, the chart provides a diagnostic view of how different AI architectures maintain (or lose) alignment with their eventual conclusion over a long reasoning trajectory, which is critical for understanding reliability and interpretability in complex, multi-step tasks.
</details>
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Similarity vs. Reasoning Step for Five AI Models
### Overview
The image is a line chart comparing the performance of five different AI models. It plots a similarity metric against the number of heuristic reasoning steps. The chart shows that for all models, the similarity metric generally decreases as the number of reasoning steps increases, with significant fluctuations and some late-stage recovery for certain models.
### Components/Axes
* **Chart Type:** Multi-line chart.
* **X-Axis:** Labeled "Reasoning step t_i (Heuristic)". It is a linear scale ranging from 0 to 120, with major tick marks at intervals of 20 (0, 20, 40, 60, 80, 100, 120).
* **Y-Axis:** Labeled "Similarity(C_T, t_i)". It is a linear scale ranging from 0.4 to 0.9, with major tick marks at intervals of 0.1 (0.4, 0.5, 0.6, 0.7, 0.8, 0.9).
* **Legend:** Positioned in the top-right corner of the chart area. It lists five models with corresponding colored lines:
* **DS-R1-Qwen-7B** (Blue line)
* **Qwen3-8B** (Orange line)
* **Claude-3.7-Sonnet** (Green line)
* **GPT-OSS-20B** (Purple line)
* **Magistral-Small** (Brown line)
### Detailed Analysis
The analysis below describes the visual trend for each data series before listing approximate key data points. All values are approximate based on visual inspection.
1. **DS-R1-Qwen-7B (Blue Line):**
* **Trend:** Starts very high, experiences a sharp initial decline, then continues a general downward trend with moderate fluctuations. It ends before step 60.
* **Key Points:** Starts near (0, ~0.85). Drops to ~0.7 by step 5. Fluctuates between ~0.6 and ~0.7 until step 30. Declines to a low of ~0.52 around step 55 before a slight uptick to end near (58, ~0.56).
2. **Qwen3-8B (Orange Line):**
* **Trend:** Starts the highest, plummets rapidly, then fluctuates in a lower band before showing a distinct upward trend in the later steps (60-80).
* **Key Points:** Starts near (0, ~0.88). Drops sharply to ~0.6 by step 10. Fluctuates between ~0.5 and ~0.6 until step 60. Begins a clear upward trend, reaching ~0.54 by step 75.
3. **Claude-3.7-Sonnet (Green Line):**
* **Trend:** Starts moderately high, declines with notable fluctuations, maintaining a relatively higher similarity than most other models in the mid-range (steps 20-50).
* **Key Points:** Starts near (0, ~0.75). Drops to ~0.68 by step 5. Fluctuates between ~0.6 and ~0.65 until step 30. Shows a local peak of ~0.62 around step 45 before declining to end near (58, ~0.55).
4. **GPT-OSS-20B (Purple Line):**
* **Trend:** Starts lower than the first three, declines sharply, then fluctuates at a low level for the remainder of the chart, showing a slight upward trend in the final quarter.
* **Key Points:** Starts near (0, ~0.65). Drops to ~0.5 by step 10. Fluctuates primarily between ~0.4 and ~0.5 from step 20 onward. Shows a slight increase towards the end, reaching ~0.52 near step 115.
5. **Magistral-Small (Brown Line):**
* **Trend:** Starts high, experiences the most severe initial drop, and then exhibits high volatility (large up-and-down swings) throughout the rest of the chart, with a general slight upward trend in the final third.
* **Key Points:** Starts near (0, ~0.78). Plummets to a low of ~0.4 by step 15. Recovers to ~0.5 by step 25. Drops again to the chart's absolute minimum of ~0.36 around step 58. From step 60 onward, it shows high volatility but a general upward trend, ending near (115, ~0.55).
### Key Observations
* **Universal Initial Drop:** All five models show a steep decline in the similarity metric within the first 10-15 reasoning steps.
* **Divergence in Mid-Range:** After the initial drop, the models' paths diverge. Claude-3.7-Sonnet (green) and DS-R1-Qwen-7B (blue) generally maintain higher similarity values in the 20-50 step range compared to the others.
* **High Volatility:** The Magistral-Small (brown) line is the most volatile, with large, frequent swings in value.
* **Late-Stage Recovery:** Qwen3-8B (orange) and Magistral-Small (brown) show a noticeable upward trend in similarity after step 60, suggesting a potential change in behavior or convergence in later reasoning stages.
* **Data Range:** The DS-R1-Qwen-7B and Claude-3.7-Sonnet data series terminate before step 60, while the others extend to nearly step 120.
### Interpretation
This chart likely visualizes how the intermediate reasoning steps of different large language models (LLMs) diverge from the final conclusion (`C_T`) as they perform more heuristically segmented reasoning steps (`t_i`). The "Similarity" metric quantifies this alignment.
* **What the data suggests:** The universal sharp initial drop indicates that the models' intermediate steps quickly become less similar to their eventual conclusion. The subsequent fluctuations and different levels suggest that models have distinct "reasoning signatures" or stability profiles. Models like Claude-3.7-Sonnet and DS-R1-Qwen-7B appear to maintain more consistent reasoning (higher similarity) over a medium number of steps.
* **How elements relate:** The x-axis (reasoning steps) is the independent variable, representing computational effort or depth of thought. The y-axis (similarity) is the dependent variable, measuring consistency or faithfulness. The legend identifies the specific model architectures being compared.
* **Notable anomalies:** The late-stage upward trends for Qwen3-8B and Magistral-Small are particularly interesting. They could indicate that after a period of divergence, these models' reasoning begins to re-converge towards the reference point, or that they enter a different phase of problem-solving. The extreme volatility of Magistral-Small might imply less stable or more exploratory reasoning processes.
* **Peircean investigative reading:** The chart is an indexical sign of model behavior: it points directly to a measurable property (similarity decay). It is also a symbolic sign, requiring knowledge of AI metrics to interpret. The patterns suggest that "reasoning" in these models is not a linear, stable process but a dynamic one with phases of divergence, exploration, and potential re-convergence, varying significantly by model.
</details>
Figure C.1: Cosine similarity between the embeddings of the i-th reasoning step $t_i$ and the final conclusion $c_T$ , using the maximum number of reasoning steps for interpolation. The reasoning steps are segmented either by GPT-5 (left) or by heuristic rules (right).
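The similarity curves in Figure C.1 can be reproduced from precomputed step embeddings. The following is a minimal sketch, assuming dense embeddings are already available as plain float vectors (the embedding model itself is not invoked here); `interpolate_to_max` resamples traces of different lengths onto a common grid, in the spirit of the interpolation mentioned in the caption:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_curve(step_embeddings, conclusion_embedding):
    """Similarity(c_T, t_i) for each reasoning step embedding t_i."""
    return [cosine_similarity(e, conclusion_embedding) for e in step_embeddings]

def interpolate_to_max(curve, max_steps):
    """Linearly resample a per-step similarity curve onto a grid of
    max_steps points, so traces of different lengths can share one axis."""
    if len(curve) == 1:
        return [curve[0]] * max_steps
    out = []
    for j in range(max_steps):
        # position of grid point j in the original [0, len(curve)-1] index space
        pos = j * (len(curve) - 1) / (max_steps - 1)
        lo = int(pos)
        hi = min(lo + 1, len(curve) - 1)
        frac = pos - lo
        out.append(curve[lo] * (1 - frac) + curve[hi] * frac)
    return out
```

Traces from all models, interpolated to the longest trace's step count, can then be plotted against a common x-axis.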
In addition, we report SPLADE similarity (Formal et al., 2021), which provides a sparse lexical-semantic relevance signal by expanding tokens into weighted lexical features. We compute SPLADE similarity using the naver/splade-cocondenser-ensembledistil model (Formal et al., 2022), following standard practice. As shown in Figure C.2, the trends closely mirror those observed under cosine similarity. These results serve as complementary checks, confirming that our findings are not tied to dense embedding models and hold across both dense and sparse similarity measures.
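Once SPLADE weights have been extracted (in practice from the model's log(1 + ReLU(logits)) vocabulary expansion), the similarity itself is a sparse inner product. A minimal sketch, assuming the token-to-weight maps have already been computed; the token strings and weights below are purely illustrative:

```python
def splade_similarity(weights_a, weights_b):
    """Sparse inner product of two SPLADE lexical-weight vectors,
    represented as token -> weight mappings. Only tokens present in
    both expansions contribute to the score."""
    shared = set(weights_a) & set(weights_b)
    return sum(weights_a[t] * weights_b[t] for t in shared)

# Toy expansions for a reasoning step and a conclusion (hypothetical weights)
step = {"triangle": 1.2, "angle": 0.5, "proof": 0.9}
conclusion = {"angle": 2.0, "proof": 1.0, "answer": 0.3}
score = splade_similarity(step, conclusion)  # 0.5*2.0 + 0.9*1.0 = 1.9
```

Because the expansions are sparse, only the overlapping vocabulary entries are touched, which is what makes SPLADE scoring cheap relative to dense similarity over full embeddings.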
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: Similarity Across Reasoning Steps for Various AI Models
### Overview
This image is a line chart comparing the "Similarity" metric (denoted as \(C_T, t_i\)) across sequential "Reasoning steps" (denoted as \(t_i\)) for five different AI models. The chart illustrates how the similarity between each model's intermediate steps and its final conclusion changes as the reasoning process progresses; the "(GPT5)" in the x-axis label indicates that the steps were segmented by GPT-5.
### Components/Axes
* **Y-Axis:** Labeled **"Similarity(\(C_T, t_i\))"**. The scale ranges from 30 to 60, with major tick marks at intervals of 5 (30, 35, 40, 45, 50, 55, 60).
* **X-Axis:** Labeled **"Reasoning step \(t_i\) (GPT5)"**. The scale ranges from 0 to 30, with major tick marks at intervals of 5 (0, 5, 10, 15, 20, 25, 30).
* **Legend:** Positioned in the top-right quadrant of the chart area. It contains five entries, each with a colored line segment and a marker:
1. **DS-R1-Qwen-7B:** Blue line with circular markers.
2. **Qwen3-8B:** Orange line with diamond markers.
3. **Claude-3.7-Sonnet:** Green line with square markers.
4. **GPT-OSS-20B:** Purple line with upward-pointing triangle markers.
5. **Magistral-Small:** Brown line with downward-pointing triangle markers.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
All series show a general downward trend in similarity as reasoning steps increase, with varying rates of decline and stabilization patterns.
1. **DS-R1-Qwen-7B (Blue, Circles):**
* **Trend:** Starts very high, drops steeply until step ~10, then stabilizes with minor fluctuations.
* **Approximate Points:** (1, 60), (2, 53), (3, 47), (4, 43), (5, 40), (6, 38), (7, 37), (8, 36), (9, 36), (10, 36), (11, 36), (12, 36), (13, 36), (14, 36), (15, 35), (16, 34), (17, 34), (18, 34), (19, 34), (20, 34), (21, 34), (22, 35), (23, 35).
2. **Qwen3-8B (Orange, Diamonds):**
* **Trend:** Starts at the highest point, drops very steeply until step ~12, then continues a slower decline.
* **Approximate Points:** (1, 60), (2, 53), (3, 48), (4, 45), (5, 42), (6, 40), (7, 38), (8, 37), (9, 36), (10, 35), (11, 34), (12, 33), (13, 32), (14, 32), (15, 32), (16, 32).
3. **Claude-3.7-Sonnet (Green, Squares):**
* **Trend:** Starts lower than the first two, drops quickly until step ~7, then shows a slight recovery and stabilizes at a level higher than the others after step 15.
* **Approximate Points:** (1, 42), (2, 38), (3, 36), (4, 35), (5, 36), (6, 34), (7, 33), (8, 32), (9, 33), (10, 34), (11, 35), (12, 35), (13, 35), (14, 36), (15, 36), (16, 37), (17, 36), (18, 37), (19, 37), (20, 38), (21, 39), (22, 39), (23, 38), (24, 37), (25, 36), (26, 36), (27, 37), (28, 38), (29, 37), (30, 36), (31, 36), (32, 36).
4. **GPT-OSS-20B (Purple, Up-Triangles):**
* **Trend:** Starts the lowest, declines steadily to a minimum around step 25, then shows a sharp increase.
* **Approximate Points:** (1, 41), (2, 37), (3, 37), (4, 35), (5, 34), (6, 34), (7, 33), (8, 34), (9, 33), (10, 32), (11, 31), (12, 30), (13, 29), (14, 31), (15, 32), (16, 31), (17, 32), (18, 34), (19, 33), (20, 32), (21, 30), (22, 29), (23, 29), (24, 28), (25, 28), (26, 28), (27, 31), (28, 35), (29, 35).
5. **Magistral-Small (Brown, Down-Triangles):**
* **Trend:** Starts high, drops steeply, then exhibits more volatility (ups and downs) compared to other series before stabilizing.
* **Approximate Points:** (1, 48), (2, 43), (3, 40), (4, 38), (5, 35), (6, 35), (7, 36), (8, 37), (9, 38), (10, 37), (11, 36), (12, 35), (13, 34), (14, 34), (15, 35), (16, 36), (17, 36), (18, 36).
### Key Observations
1. **Universal Initial Drop:** All models show their highest similarity at the first reasoning step, followed by a significant decline. This suggests the first reasoning step is most aligned with the final conclusion, and divergence increases with more reasoning.
2. **Convergence Zone:** Between steps 5 and 15, most models' similarity scores converge into a narrower band (approximately 32-38), indicating a phase where model behaviors become more similar relative to their starting points.
3. **Divergence After Step 15:** Post step 15, the models separate again. Claude-3.7-Sonnet (Green) maintains and even slightly increases its similarity, becoming the highest. GPT-OSS-20B (Purple) continues to drop to the lowest point before a late rebound. DS-R1-Qwen-7B (Blue) and Magistral-Small (Brown) stabilize in the middle.
4. **Outlier Behavior:** GPT-OSS-20B's sharp increase after step 25 is a notable anomaly compared to the general trends of other models.
5. **Model Grouping:** The two Qwen-based models (DS-R1-Qwen-7B and Qwen3-8B) start identically high but separate slightly, with Qwen3-8B ending lower. Claude-3.7-Sonnet demonstrates the most resilient similarity over long reasoning chains.
### Interpretation
This chart likely visualizes how each large language model's (LLM's) intermediate reasoning steps diverge from its final conclusion `C_T`, with steps segmented by GPT-5 (as indicated in the x-axis label). The "Similarity" metric quantifies how closely each step's representation aligns with that of the conclusion.
* **What the data suggests:** Multi-step reasoning causes models to develop distinct "thought paths." The first step is closest to the eventual conclusion; as models generate more steps, their unique architectures, training data, and inference algorithms lead them to explore different reasoning branches, reducing similarity to the conclusion.
* **How elements relate:** The x-axis represents the progression of a complex task. The y-axis measures alignment with the final conclusion. The different lines represent competing models. The chart answers: "Which models stay close to their eventual conclusion for longer, and when do they fundamentally diverge?"
* **Notable implications:** Claude-3.7-Sonnet's sustained higher similarity might indicate a more stable reasoning style. The late surge in GPT-OSS-20B's similarity could suggest a "correction" mechanism or a convergence on a solution after an initial exploratory phase. The steep initial drop for all models highlights how far intermediate reasoning can stray from the eventual answer in multi-step deduction or generation tasks.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Chart: Similarity vs. Reasoning Step for Various AI Models
### Overview
This image is a line chart comparing the performance of five different AI models. It plots a "Similarity" metric against the number of "Reasoning steps" taken. All models exhibit a sharp initial decline in similarity, followed by varying patterns of fluctuation and stabilization over subsequent steps.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **Y-Axis (Vertical):**
* **Label:** `Similarity(C_T, t_i)`
* **Scale:** Linear, ranging from 20 to 70.
* **Major Ticks:** 20, 30, 40, 50, 60, 70.
* **X-Axis (Horizontal):**
* **Label:** `Reasoning step t_i (Heuristic)`
* **Scale:** Linear, ranging from 0 to approximately 55.
* **Major Ticks:** 0, 10, 20, 30, 40, 50.
* **Legend (Top-Right Corner):** Contains five entries, each associating a model name with a specific line color and marker shape.
1. **DS-R1-Qwen-7B:** Blue line with circle markers.
2. **Qwen3-8B:** Orange line with diamond markers.
3. **Claude-3.7-Sonnet:** Green line with square markers.
4. **GPT-OSS-20B:** Purple line with upward-pointing triangle markers.
5. **Magistral-Small:** Brown line with downward-pointing triangle markers.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **DS-R1-Qwen-7B (Blue, Circles):**
* **Trend:** Starts highest, plummets steeply, then fluctuates in the mid-range.
* **Points:** Starts at ~70 (step 0). Drops to ~40 by step 5. Fluctuates between ~30 and ~40 from steps 5 to 25. Data series ends around step 25.
2. **Qwen3-8B (Orange, Diamonds):**
* **Trend:** Starts very high, drops sharply, then shows a slight recovery trend in later steps.
* **Points:** Starts at ~68 (step 0). Drops to ~30 by step 5. Fluctuates between ~25 and ~30 from steps 5 to 20. Shows a slight upward trend to ~30 by step 25.
3. **Claude-3.7-Sonnet (Green, Squares):**
* **Trend:** Starts moderately high, drops, then maintains the most stable and highest plateau among all models after the initial drop.
* **Points:** Starts at ~45 (step 0). Drops to ~40 by step 5. Remains relatively stable, fluctuating between ~35 and ~40 from steps 5 to 25. Data series ends around step 25.
4. **GPT-OSS-20B (Purple, Up-Triangles):**
* **Trend:** Starts moderately, drops, fluctuates at a lower level, then shows a distinct upward trend in the final segment.
* **Points:** Starts at ~43 (step 0). Drops to ~32 by step 5. Fluctuates between ~20 and ~30 from steps 5 to 45. Shows a clear upward trend from step 45, reaching ~38 by step 50.
5. **Magistral-Small (Brown, Down-Triangles):**
* **Trend:** Starts moderately, experiences the deepest drop, fluctuates at the lowest level for most of the chart, and ends with an upward trend.
* **Points:** Starts at ~48 (step 0). Drops sharply to ~20 by step 7. Fluctuates between ~20 and ~25 from steps 7 to 45. Shows an upward trend from step 45, reaching ~36 by step 55 (the furthest data point on the X-axis).
### Key Observations
* **Universal Initial Drop:** All five models show a dramatic decrease in the `Similarity(C_T, t_i)` metric within the first 5-7 reasoning steps.
* **Post-Drop Stratification:** After the initial drop, a clear performance hierarchy emerges and persists for most of the chart: Claude-3.7-Sonnet (green) maintains the highest similarity, followed by DS-R1-Qwen-7B (blue), then Qwen3-8B (orange) and GPT-OSS-20B (purple) in the middle, with Magistral-Small (brown) consistently at the lowest level until the final steps.
* **Late-Stage Divergence:** In the later steps (beyond step 40), GPT-OSS-20B and Magistral-Small exhibit a notable upward trend, while the other models' data series have ended.
* **Data Range Variance:** The models are not measured over the same number of steps: the DS-R1-Qwen-7B, Qwen3-8B, and Claude-3.7-Sonnet series end around step 25, GPT-OSS-20B extends to step 50, and Magistral-Small extends the furthest, to step 55.
### Interpretation
The chart likely visualizes how the intermediate reasoning steps of different AI models diverge from the final conclusion (`C_T`) as they perform more heuristically segmented reasoning steps (`t_i`). The "Similarity" metric is presumably a measure of consistency or alignment.
The data suggests that **all models undergo a significant shift or "reasoning drift" very early in the process**. The subsequent stratification indicates that different models have different capacities to maintain similarity (or resist drift) during extended reasoning. Claude-3.7-Sonnet appears most robust in this specific test.
The late upward trend for GPT-OSS-20B and Magistral-Small is intriguing. It could indicate a form of "re-convergence" or a cyclical pattern in their reasoning processes over very long chains, or it might be an artifact of the specific heuristic steps being measured. The varying end points for each model's data series suggest the experiment or evaluation was not uniform in length for all models, which is a critical context for comparing their full performance curves.
</details>
Figure C.2: SPLADE similarity between the representations of the i-th reasoning step $t_i$ and the final conclusion $c_T$ , using the maximum number of reasoning steps for interpolation. The reasoning steps are segmented either by GPT-5 (left) or by heuristic rules (right).
## Appendix D Correlation between first reasoning step and final prediction
In the main text, we reported results on DS-R1-Qwen-7B. Here we extend the analysis to Qwen3-8B (Figure D.1) and further include DeepSeek and Qwen3 models of different sizes (Figure D.2). These additional results verify that the correlation between the first reasoning step and the final prediction persists as model size scales.
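This predictive power of the first reasoning step is what motivates the first-step filtering described in the abstract: sample several candidate first steps, score them, and continue reasoning only from the best ones. A minimal sketch, where an arbitrary scalar scoring callable stands in for the reward model (the actual reward model and its interface are not shown here):

```python
def select_first_steps(candidates, score, keep_top=1):
    """Rank candidate first reasoning steps by a scalar score and keep
    the top ones; full reasoning trajectories are then generated only
    from these, discarding the rest to save inference cost."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:keep_top]
```

In practice `score` would be a learned reward model applied to (question, first step) pairs; for the sketch, any callable returning a comparable scalar works.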
<details>
<summary>x13.png Details</summary>

### Visual Description
## Grouped Bar Chart: Accuracy by Pass@1 of q_T
### Overview
This is a grouped bar chart comparing the accuracy percentage (y-axis) of two conditions, "First correct" and "First incorrect", across four categories defined by the "Pass@1 of q_T" metric (x-axis). The chart demonstrates how accuracy varies with the Pass@1 score and differs between the two conditions.
### Components/Axes
* **Chart Type:** Grouped vertical bar chart.
* **Y-Axis:**
* **Label:** "Accuracy (%)"
* **Scale:** Linear scale from 0 to approximately 90, with major gridlines at intervals of 20 (0, 20, 40, 60, 80).
* **X-Axis:**
* **Label:** "Pass@1 of q_T"
* **Categories (from left to right):**
1. `(0, 33%]`
2. `(33%, 67%]`
3. `(67%, 100%)`
4. `Overall`
* **Legend:**
* **Position:** Top-right corner of the chart area.
* **Entry 1:** "First correct" - Represented by a blue bar with diagonal hatching (stripes running from top-left to bottom-right).
* **Entry 2:** "First incorrect" - Represented by a solid orange bar.
### Detailed Analysis
The chart presents paired data for each of the four x-axis categories. The values are displayed as data labels on top of each bar.
**1. Category: (0, 33%]**
* **First correct (Blue, hatched):** 16.7%
* **First incorrect (Orange, solid):** 11.9%
* **Trend:** This category has the lowest accuracy for both conditions. The "First correct" condition is 4.8 percentage points higher than "First incorrect."
**2. Category: (33%, 67%]**
* **First correct (Blue, hatched):** 55.6%
* **First incorrect (Orange, solid):** 50.6%
* **Trend:** A significant increase in accuracy for both conditions compared to the first category. "First correct" remains higher by 5.0 percentage points.
**3. Category: (67%, 100%)**
* **First correct (Blue, hatched):** 89.8%
* **First incorrect (Orange, solid):** 84.9%
* **Trend:** This category shows the highest accuracy for both conditions. The gap between the two conditions is 4.9 percentage points.
**4. Category: Overall**
* **First correct (Blue, hatched):** 68.5%
* **First incorrect (Orange, solid):** 56.7%
* **Trend:** This represents the aggregate performance. The "First correct" condition outperforms "First incorrect" by a substantial margin of 11.8 percentage points.
### Key Observations
1. **Consistent Performance Gap:** In every category, the "First correct" condition yields higher accuracy than the "First incorrect" condition. The gap is relatively consistent (~5 percentage points) within the three specific Pass@1 ranges but widens significantly in the "Overall" aggregate.
2. **Positive Correlation with Pass@1:** There is a clear, strong positive trend: as the Pass@1 of q_T increases (moving from left to right on the x-axis), the accuracy for both conditions increases dramatically.
3. **Highest and Lowest Points:** The highest accuracy is 89.8% for "First correct" in the (67%, 100%) range. The lowest accuracy is 11.9% for "First incorrect" in the (0, 33%] range.
4. **Overall Performance:** The "Overall" accuracy for "First correct" (68.5%) is higher than the accuracy for the middle category (33%, 67%], suggesting the dataset may be weighted toward higher Pass@1 scores.
### Interpretation
The data suggests a strong relationship between the Pass@1 metric of q_T and the system's accuracy. A higher Pass@1 score is a reliable predictor of higher accuracy, regardless of whether the first attempt was correct or incorrect.
The consistent advantage of the "First correct" condition implies that an initial correct answer has a positive carry-over effect, leading to better overall performance on subsequent tasks or evaluations within the same context. This could indicate a "confidence" or "momentum" effect in the model's reasoning process.
The significant widening of the performance gap in the "Overall" metric (11.8 points vs. ~5 points in the binned categories) is a key finding. It suggests that the benefits of a correct first attempt are not uniform but are particularly pronounced when aggregated across all difficulty levels (as represented by the Pass@1 ranges). This could mean that for easier problems (high Pass@1), the first attempt's correctness matters less, but for the full spectrum of problems, it becomes a major differentiating factor.
**Potential Anomaly/Note:** The "Overall" accuracy for "First correct" (68.5%) is notably higher than the accuracy for the (33%, 67%] bin (55.6%). This is mathematically possible only if a large proportion of the test samples fall into the highest-performing (67%, 100%) bin, pulling the weighted average up. This indicates the evaluation dataset is skewed towards questions where the model has a high Pass@1 probability.
</details>
<details>
<summary>x14.png Details</summary>

### Visual Description
## Grouped Bar Chart: Token Count After First Step by Pass@1 Performance
### Overview
This is a grouped bar chart comparing the average number of tokens generated after the first step (in thousands, denoted as 'k') for two conditions: "First correct" and "First incorrect" responses. The data is segmented into four categories based on the Pass@1 metric of a variable denoted as `q_T`. The chart illustrates how the token count varies with the model's initial performance.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:** Labeled "Number of tokens after first step (k)". The scale runs from 0 to 18, with major tick marks every 2 units (0, 2, 4, ..., 18).
* **X-Axis:** Labeled "Pass@1 of `q_T`". It contains four categorical groups:
1. `(0, 33%]`
2. `(33%, 67%]`
3. `(67%, 100%)`
4. `Overall`
* **Legend:** Positioned in the top-right corner of the chart area.
* **First correct:** Represented by blue bars with diagonal hatching (\\).
* **First incorrect:** Represented by solid orange bars.
* **Data Labels:** The exact numerical value (in thousands) is printed above each bar.
### Detailed Analysis
The chart presents the following data points for each category:
**1. Category: (0, 33%]**
* **First correct (Blue, Hatched):** 15.4k tokens.
* **First incorrect (Orange, Solid):** 15.8k tokens.
* *Trend:* Both values are the highest in the chart. The "First incorrect" bar is slightly taller than the "First correct" bar.
**2. Category: (33%, 67%]**
* **First correct (Blue, Hatched):** 13.2k tokens.
* **First incorrect (Orange, Solid):** 13.9k tokens.
* *Trend:* Both values are lower than in the (0, 33%] category. The "First incorrect" bar remains taller than the "First correct" bar.
**3. Category: (67%, 100%)**
* **First correct (Blue, Hatched):** 9.5k tokens.
* **First incorrect (Orange, Solid):** 10.0k tokens.
* *Trend:* This category has the lowest token counts in the chart. The "First incorrect" bar is still taller than the "First correct" bar.
**4. Category: Overall**
* **First correct (Blue, Hatched):** 11.4k tokens.
* **First incorrect (Orange, Solid):** 12.5k tokens.
* *Trend:* This represents the aggregate average. The "First incorrect" bar is taller than the "First correct" bar.
### Key Observations
* **Consistent Pattern:** In every single category, across all performance brackets and the overall average, the "First incorrect" condition results in a higher average number of tokens after the first step than the "First correct" condition.
* **Inverse Relationship with Performance:** There is a clear inverse relationship between the Pass@1 performance bracket and the token count. As the Pass@1 score increases (moving from left to right on the x-axis), the number of tokens after the first step decreases for both "correct" and "incorrect" conditions.
* **Magnitude of Difference:** The gap between "incorrect" and "correct" token counts is relatively consistent, ranging from approximately 0.4k to 1.1k tokens across the categories.
* **Lowest Resource Usage:** The (67%, 100%) performance bracket shows the most efficient (lowest) token usage after the first step for both conditions.
### Interpretation
The data suggests a strong correlation between a model's initial correctness and its subsequent computational efficiency (measured by token generation).
* **Efficiency of Correctness:** When a model's first step is correct, it tends to generate fewer tokens to complete the task. This implies that a correct initial reasoning step sets a more efficient, direct path to the solution, requiring less subsequent exploration or correction.
* **Cost of Errors:** An incorrect first step is associated with generating more tokens. This likely reflects the model needing to engage in more extensive "backtracking," self-correction, or alternative path exploration to recover from the initial error, thereby consuming more resources.
* **Performance Bracket Insight:** The decreasing token count with higher Pass@1 scores indicates that models or tasks where the model is more likely to be correct on the first try are also inherently less token-intensive. This could be because these are simpler tasks, or because the model's stronger performance allows for more concise reasoning.
* **Overall Implication:** The chart provides quantitative evidence that the quality of the initial reasoning step is a critical determinant of overall efficiency. Encouraging or ensuring correct first steps could lead to significant reductions in computational cost (token usage) for language model inference. The consistent pattern across all brackets reinforces the robustness of this relationship.
</details>
Figure D.1: Accuracy and number of tokens on Qwen3-8B. Left: The relationship between the accuracy of the final prediction ( $q_T$ ) and the correctness of the prediction solely based on the first reasoning step ( $q_1$ ) across different difficulty levels. If $q_1$ is incorrect, $q_T$ is more likely incorrect. Right: The number of tokens used for the final prediction after the first reasoning step $t_1$ , i.e., the number of tokens used for $[t_2,t_3,...,t_T]$ . Even when $q_1$ is correct, the model still consumes a large number of tokens for the following reasoning steps, i.e., overthinking.
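The "Overall" bars in Figure D.1 (left) are sample-weighted averages of the per-bin accuracies, which is why Overall can exceed the middle bin when most problems fall in the easiest bin. A minimal sketch with hypothetical bin counts (the actual per-bin sample sizes are not reported here):

```python
def overall_accuracy(bin_accuracies, bin_counts):
    """Sample-weighted average of per-bin accuracies (in %)."""
    total = sum(bin_counts)
    return sum(a * n for a, n in zip(bin_accuracies, bin_counts)) / total

# With most samples in the easiest (67%, 100%) bin, the overall average
# lands well above the middle bin's accuracy (hypothetical counts).
acc = overall_accuracy([16.7, 55.6, 89.8], [10, 30, 60])  # about 72%, above 55.6%
```

Under these assumed counts the skew toward high-Pass@1 problems alone lifts the aggregate, without any per-bin accuracy changing.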
<details>
<summary>x15.png Details</summary>

### Visual Description
## Grouped Bar Chart: Model Accuracy by First Step Correctness
### Overview
This image displays a grouped bar chart comparing the accuracy (in percentage) of five different AI models under two conditions: when the model's first step is correct versus when it is incorrect. The chart visually demonstrates the performance impact of initial step correctness across models of varying sizes.
### Components/Axes
* **Chart Type:** Grouped vertical bar chart.
* **X-Axis (Horizontal):**
* **Label:** "Models"
* **Categories (from left to right):** DS-R1-1.5B, DS-R1-32B, Qwen3-1.7B, Qwen3-30B-A3B, Qwen3-235B-A22B.
* **Y-Axis (Vertical):**
* **Label:** "Accuracy (%)"
* **Scale:** Linear scale from 0 to 100, with major tick marks every 10 units (0, 10, 20, ..., 100).
* **Legend:**
* **Position:** Centered at the top of the chart area.
* **Series 1:** "Correct first step" - Represented by blue bars with a diagonal hatching pattern (///).
* **Series 2:** "Incorrect first step" - Represented by solid orange bars.
* **Data Labels:** The exact accuracy percentage is printed above each bar.
### Detailed Analysis
The chart presents paired data for each of the five models. For every model, the "Correct first step" condition yields significantly higher accuracy than the "Incorrect first step" condition.
**Data Points by Model (Left to Right):**
1. **DS-R1-1.5B**
* Correct first step (Blue, hatched): **92.7%**
* Incorrect first step (Orange, solid): **31.7%**
* *Visual Trend:* The blue bar is substantially taller than the orange bar, indicating a large performance drop when the first step is incorrect.
2. **DS-R1-32B**
* Correct first step (Blue, hatched): **90.2%**
* Incorrect first step (Orange, solid): **46.0%**
* *Visual Trend:* Similar pattern to the first model, with a large gap between the two conditions.
3. **Qwen3-1.7B**
* Correct first step (Blue, hatched): **95.2%**
* Incorrect first step (Orange, solid): **52.3%**
* *Visual Trend:* This model achieves the highest accuracy in the "Correct first step" condition among all models shown. The gap remains large.
4. **Qwen3-30B-A3B**
* Correct first step (Blue, hatched): **91.0%**
* Incorrect first step (Orange, solid): **73.0%**
* *Visual Trend:* The gap between the two bars begins to narrow compared to the previous models. The "Incorrect first step" accuracy is notably higher here.
5. **Qwen3-235B-A22B**
* Correct first step (Blue, hatched): **89.9%**
* Incorrect first step (Orange, solid): **79.0%**
* *Visual Trend:* This model shows the smallest performance gap between the two conditions. The "Incorrect first step" accuracy is the highest of all models for that condition.
### Key Observations
1. **Universal Performance Drop:** All five models exhibit lower accuracy when their first step is incorrect compared to when it is correct.
2. **Magnitude of Drop Varies:** The performance penalty for an incorrect first step is most severe for the smaller DS-R1-1.5B model (a drop of ~61 percentage points) and least severe for the largest Qwen3-235B-A22B model (a drop of ~10.9 percentage points).
3. **Peak Performance:** The highest single accuracy value is 95.2% (Qwen3-1.7B, Correct first step).
4. **Lowest Performance:** The lowest single accuracy value is 31.7% (DS-R1-1.5B, Incorrect first step).
5. **Inverse Trend in "Incorrect" Condition:** While "Correct first step" accuracy is relatively stable across models (ranging from ~90% to ~95%), the "Incorrect first step" accuracy shows a clear upward trend as we move from left to right across the models (from 31.7% to 79.0%).
### Interpretation
The data strongly suggests that **the correctness of a model's initial reasoning step is a critical determinant of its final answer accuracy.** A faulty start severely compromises performance across all tested architectures.
Furthermore, the chart reveals a potential correlation between **model scale/complexity and robustness to initial errors.** The models on the right (particularly the Qwen3 series, especially the largest Qwen3-235B-A22B) maintain much higher accuracy even when starting with an incorrect step. This implies that larger or more advanced models may have a greater capacity to recover from early mistakes or employ alternative reasoning pathways, making them less brittle. The smaller DS-R1-1.5B model appears highly dependent on getting the first step right.
In a technical context, this underscores the importance of prompt engineering or chain-of-thought techniques that guide models toward a correct initial step, especially for smaller models. It also highlights a key advantage of scaling model size: improved error tolerance during multi-step reasoning tasks.
</details>
Figure D.2: More results from Qwen and DeepSeek models of different sizes. We report the relationship between the accuracy of the final prediction ( $q_T$ ) and the correctness of the first reasoning step ( $q_1$ ). If $q_1$ is incorrect, $q_T$ is more likely to be incorrect.
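The per-model accuracy gaps visible in the chart can be recomputed directly from the reported values; a minimal sketch (the numbers are transcribed from the Figure D.2 bar chart, and the `accuracy_drop` helper is our own illustration):

```python
# Accuracy (%) with a correct vs. incorrect first reasoning step,
# transcribed from the Figure D.2 bar chart.
accuracy = {
    "DS-R1-1.5B":      (92.7, 31.7),
    "DS-R1-32B":       (90.2, 46.0),
    "Qwen3-1.7B":      (95.2, 52.3),
    "Qwen3-30B-A3B":   (91.0, 73.0),
    "Qwen3-235B-A22B": (89.9, 79.0),
}

def accuracy_drop(model: str) -> float:
    """Percentage-point drop in accuracy when the first step is incorrect."""
    correct, incorrect = accuracy[model]
    return round(correct - incorrect, 1)
```

The drop shrinks with model scale, from 61.0 points for DS-R1-1.5B down to 10.9 points for Qwen3-235B-A22B, matching the chart's narrowing bars.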
## Appendix E Prompt Examples
The template for using the DeepSeek-R1 model is provided below. We apply the same structure to other open-source thinking models, with slight adjustments for model-specific input and output formats.
Prompt example for DeepSeek-R1: <|begin_of_sentence|>Please reason step by step, and put your final answer within \boxed{}. <|User|>[Question]<|Assistant|><think>
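As a concrete illustration, the template above can be assembled programmatically; a minimal sketch (the `build_prompt` helper is our own, not part of the paper's code):

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the DeepSeek-R1 template quoted above."""
    return (
        "<|begin_of_sentence|>Please reason step by step, and put your final "
        "answer within \\boxed{}. <|User|>" + question + "<|Assistant|><think>"
    )
```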
## Appendix F Example of an Incorrect First Step
Figure F.1 illustrates a case where the initial reasoning step is incorrect.
Example of incorrect first step
Question: Four unit squares form a $2\times 2$ grid. Each of the 12 unit line segments forming the sides of the squares is colored either red or blue in such a way that each unit square has 2 red sides and 2 blue sides. One example is shown below (red is solid, blue is dashed). Find the number of such colorings. [asy] size(4cm); defaultpen(linewidth(1.2)); draw((0, 0) -- (2, 0) -- (2, 1)); draw((0, 1) -- (1, 1) -- (1, 2) -- (2, 2)); draw((0, 0) -- (0, 1), dotted); draw((1, 0) -- (1, 1) -- (2, 1) -- (2, 2), dotted); draw((0, 1) -- (0, 2) -- (1, 2), dotted); [/asy] First reason: Alright, so I have this problem about coloring the sides of a 2x2 grid of squares. Each square has four sides, and each of these sides is colored either red or blue. The condition is that each individual square must have exactly two red sides and two blue sides. The question is asking for the number of such colorings. First, let me visualize the grid. It's a 2x2 grid, so there are four squares arranged in a larger square. Each square has four sides, and each side is shared with another square except for the outermost sides. But in the problem, it says there are 12 unit line segments forming the sides of the squares. Wait, let me check that: a 2x2 grid has 3 horizontal lines and 3 vertical lines, each of length 2 units. So, each horizontal line is divided into two segments, right? So, the total number of horizontal segments is 3 rows times 2 segments per row, which is 6, and similarly, the vertical segments are also 3 columns times 2 segments per column, which is another 6. So, 6 + 6 = 12 segments in total. Okay, that makes sense. Each of these 12 segments can be colored either red or blue. However, the constraint is that each of the four squares must have exactly two red sides and two blue sides. So, even though each segment can be colored independently, we have these local constraints for each square. 
I think a good way to approach this problem is by considering each square and the possible colorings for its sides, then seeing how they can fit together without violating the constraints. However, since the coloring of one square's side affects the adjacent squares, this could get complicated. Maybe I can model this as a graph where each segment is a node, and the squares impose constraints on the nodes (i.e., exactly two red and two blue edges around each square). But I'm not too familiar with graph theory, so maybe that's not the best approach. Mistakes in first reasoning step: Mistake 1 incorrectly assumes that in a $2\times 2$ grid, each corner square has two shared and two unshared sides, without properly accounting for edge distribution. Mistake 2 introduces unnecessary abstraction by treating segments as "nodes," which obscures the natural structure: the segments themselves are the edges whose colors we choose, and each square simply places a rule on its four incident edges. Working directly with those four constraints is far cleaner than introducing an extra level of abstraction.
Figure F.1: Example of incorrect first step.
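The counting in the quoted first step (12 unit segments) and the problem's answer can be checked exhaustively, since there are only $2^{12}=4096$ colorings; a minimal brute-force sketch (our own verification, not part of the paper):

```python
from itertools import product

def count_colorings() -> int:
    """Count 2x2-grid colorings in which every unit square has exactly
    2 red and 2 blue sides (1 = red, 0 = blue)."""
    # h[i][j]: horizontal segment on grid line i (0..2), column j (0..1)
    # v[i][j]: vertical segment in row i (0..1), on grid line j (0..2)
    # 3 lines x 2 segments each, horizontally and vertically: 6 + 6 = 12.
    total = 0
    for bits in product((0, 1), repeat=12):
        h = (bits[0:2], bits[2:4], bits[4:6])
        v = (bits[6:9], bits[9:12])
        if all(
            h[r][c] + h[r + 1][c] + v[r][c] + v[r][c + 1] == 2
            for r in range(2)
            for c in range(2)
        ):
            total += 1
    return total

print(count_colorings())  # -> 82
```

The enumeration confirms both the 12-segment count from the quoted reasoning and that 82 colorings satisfy the per-square constraint.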
## Appendix G Examples for Minor Perturbation
Original final conclusion
To solve the problem, we are given triangle $ABC$ with side lengths $AB=5$ , $BC=9$ , and $AC=10$ . […] Step 4: Compute $AP$ . Since $P$ lies on line segment $AD$ , and $D$ lies outside the circle, the point $P$ lies between $A$ and $D$ . Therefore: $AP=DA-DP=\frac{325}{22}-\frac{2025}{286}.$ Find a common denominator: $\frac{325}{22}=\frac{4225}{286}\Rightarrow AP=\frac{4225-2025}{286}=\frac{2200}{286}=\frac{100}{13}.$ Final Step: Compute $m+n$ . The reduced form of $AP$ is $\frac{100}{13}$ , so: $m=100, n=13\Rightarrow m+n={\color{orange}\boxed{113}}.$ Final Answer: \boxed{113}
New correct first reason
To solve the problem, we are given triangle $ABC$ with side lengths $AB=5$ , $BC=9$ , and $AC=10$ . […] Step 4: Compute $AP$ . Since $P$ lies on line segment $AD$ , and $D$ lies outside the circle, the point $P$ lies between $A$ and $D$ . Therefore: $AP=DA-DP=\frac{325}{22}-\frac{2025}{286}.$ Find a common denominator: $\frac{325}{22}=\frac{4225}{286}\Rightarrow AP=\frac{4225-2025}{286}=\frac{2200}{286}=\frac{100}{13}.$ Final Step: Compute $m+n$ . The reduced form of $AP$ is $\frac{100}{13}$ , so: $m=100, n=13\Rightarrow m+n={\color{orange}113}.$
New incorrect first reason
To solve the problem, we are given triangle $ABC$ with side lengths $AB=5$ , $BC=9$ , and $AC=10$ . […] Step 4: Compute $AP$ . Since $P$ lies on line segment $AD$ , and $D$ lies outside the circle, the point $P$ lies between $A$ and $D$ . Therefore: $AP=DA-DP=\frac{325}{22}-\frac{2025}{286}.$ Find a common denominator: $\frac{325}{22}=\frac{4225}{286}\Rightarrow AP=\frac{4225-2025}{286}=\frac{2200}{286}=\frac{100}{13}.$ Final Step: Compute $m+n$ . The reduced form of $AP$ is $\frac{100}{13}$ , so: $m=100, n=13\Rightarrow m+n={\color{orange}103}.$
Figure G.1: Illustration of a targeted perturbation to create correct and incorrect first reasoning steps, as discussed in § 3.3. The modified context is highlighted in orange.
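The targeted perturbation in Figure G.1 amounts to a small string edit on the final conclusion: unwrap the boxed answer to obtain a correct first reason, or swap in a wrong value to obtain an incorrect one. A simplified sketch (the `make_first_reasons` helper and its regex are our own illustration, not the paper's code):

```python
import re

def make_first_reasons(conclusion: str, wrong_answer: str):
    """Derive correct/incorrect first-reason variants from a final conclusion.

    The correct variant keeps the original answer but drops the \\boxed{...}
    wrapper; the incorrect variant additionally substitutes a wrong answer.
    """
    m = re.search(r"\\boxed\{([^{}]*)\}", conclusion)
    if m is None:
        raise ValueError("no \\boxed{...} answer found")
    correct = conclusion.replace(m.group(0), m.group(1))
    incorrect = conclusion.replace(m.group(0), wrong_answer)
    return correct, incorrect
```

Applied to the conclusion above, the correct variant ends in `113` and the incorrect one in `103`, mirroring the orange edits in Figure G.1.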