# Lost at the Beginning of Reasoning
> Equal contribution. Correspondence to: Baohao Liao (b.liao@uva.nl).
Abstract
Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection, and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored, and recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction: errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across various state-of-the-art open- and closed-source reasoning models. Leveraging this insight, we propose an efficient sampling strategy that uses a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing accuracy. Our work highlights the central role of the first reasoning step in generating a high-quality reasoning trajectory, thereby enabling significantly more efficient sampling.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Reasoning Trace & Reward Model
### Overview
The image depicts a diagram illustrating a reasoning trace process, likely within a large language model (LLM) context. It shows how candidate reasoning steps are evaluated by a reward model, pruned, and ultimately lead to a final conclusion. The diagram also includes a visualization of similarity between the final conclusion and intermediate reasoning steps, alongside a comparison of accuracy and token usage with and without pruning.
### Components/Axes
The diagram is segmented into several key areas:
* **Reasoning Trace (Left):** A flow diagram showing reasoning steps (t1 to tT) leading to a final conclusion (CT).
* **Candidate Reasoning Traces & Reward Model (Center):** Illustrates candidate reasoning steps being evaluated by a reward model (represented by a chef icon). Arrows indicate acceptance (green checkmark) or rejection (red X) of steps.
* **Similarity Graph (Bottom-Left):** A line graph plotting "Similarity (CT, t_i)" against "Reasoning step t_i".
* **Accuracy vs. #Tokens (Bottom-Center):** Two bar charts comparing "Accuracy" and "#Tokens" for "Correct t1" and "Incorrect t1", as well as "Maj@N" and "Pruned" scenarios.
* **Maj@N vs Pruned (Bottom-Right):** A visual comparison showing a reduction in token usage ("70% LESS") when using "Pruned" reasoning traces.
The axes for the similarity graph are:
* X-axis: Reasoning step t_i
* Y-axis: Similarity (CT, t_i)
The axes for the bar charts are:
* X-axis: Correct t1, Incorrect t1, Maj@N, Pruned
* Y-axis: Accuracy, #Tokens
### Detailed Analysis or Content Details
**Reasoning Trace:**
The reasoning trace shows a series of steps (t1, t2, t3, …, tT) represented as circles connected by arrows, culminating in a final conclusion (CT) represented as a rectangular block.
**Candidate Reasoning Traces & Reward Model:**
Three candidate reasoning steps (t1) are presented:
1. "Treat the whole outing as just distance over speed with no fixed stop. [âŠ] This leads to s=5, but ignores the cafĂ© stop." (Rejected - Red X)
2. "The cafĂ© time is fixed in both scenarios. [âŠ] The speed is therefore s=3, and the cafĂ© stop is t=60 minutes." (Accepted - Green Checkmark)
3. "Read the 9 km in 4 hours as a base speed of about 2.25. [âŠ] This suggests s=2.25, but again misses the cafĂ© stop." (Rejected - Red X)
The complete reasoning trace (rightmost column) provides the final solution: "The cafĂ© time is fixed in both scenarios. [âŠ] The base speed is therefore s=3, and the cafĂ© stop is t=60 minutes. Alternatively, instead of comparing directly, set up the equations [âŠ] [âŠ] From 9/s+t=4 and 9/(s+3)+t=2.5, subtract to get s=3, hence t=1h. [âŠ] Her total time is boxed(195 minutes)."
**Similarity Graph:**
The similarity graph shows a decreasing trend. The line starts at approximately 1.0 (high similarity) at t1 and slopes downward, reaching approximately 0.2-0.3 at tT.
**Accuracy vs. #Tokens Bar Charts:**
* **Accuracy:** "Correct t1" has a significantly higher accuracy (approximately 0.9) than "Incorrect t1" (approximately 0.3). "Mai@N" has an accuracy of approximately 0.8, while "Pruned" has an accuracy of approximately 0.7.
* **#Tokens:** "Correct t1" uses approximately 50 tokens, "Incorrect t1" uses approximately 40 tokens. "Mai@N" uses approximately 60 tokens, while "Pruned" uses approximately 20 tokens.
**Mai@N vs Pruned:**
The diagram highlights that pruning reduces the number of tokens by "70% LESS".
### Key Observations
* The reward model effectively filters out incorrect reasoning steps.
* Similarity between intermediate reasoning steps and the final conclusion decreases as the reasoning process progresses.
* Pruning significantly reduces token usage with a slight decrease in accuracy.
* Correct reasoning steps lead to higher accuracy.
### Interpretation
This diagram illustrates a method for improving the efficiency and accuracy of reasoning processes in LLMs. The reward model acts as a gatekeeper, eliminating unproductive reasoning paths early on ("Early pruning"). This pruning process reduces computational cost (measured in tokens) without significantly sacrificing accuracy. The similarity graph suggests that initial reasoning steps are more closely aligned with the final conclusion, and subsequent steps diverge as the model refines its understanding. The comparison between "Maj@N" and "Pruned" demonstrates the trade-off between accuracy and efficiency, suggesting that pruning can be a valuable technique for optimizing LLM performance. The example problem (Aya walks 9 km each morning…) provides a concrete context for understanding the reasoning process. The use of visual metaphors (chef for reward model, scissors for pruning) enhances the clarity of the diagram.
</details>
Figure 1: Overview of our observation and efficient sampling. The first reasoning step $t_{1}$ heavily shapes the entire reasoning trajectory: a strong first step typically yields accurate solutions with fewer tokens (bottom left). Building on this observation, we propose to generate multiple candidate first steps, evaluate them with a reward model, and discard weaker candidates early (top right). This method maintains accuracy while substantially reducing token consumption by 70% (bottom right).
1 Introduction
Large language models (LLMs) have demonstrated remarkable performance across a variety of reasoning tasks, ranging from mathematical problem solving to multi-hop question answering (Hestness et al., 2017; Kaplan et al., 2020; Hoffmann et al., 2022). More recently, the advent of reasoning-oriented LLMs capable of performing long chain-of-thought (long-CoT) reasoning at test time has led to substantial advancements in these domains (Brown et al., 2020; Hurst et al., 2024; Anthropic, 2025; Team et al., 2024; Guo et al., 2025; Yang et al., 2025a; Wen et al., 2025; He et al., 2025). A widely held hypothesis attributes this progress to the models' ability to backtrack, self-reflect, and self-correct, effectively leveraging contextual feedback to iteratively refine their responses.
However, recent studies suggest that long-CoT reasoning can also introduce inefficiencies. Models often "overthink" by producing unnecessarily extended reasoning processes to solve problems (Chiang & Lee, 2024; Zhang et al., 2024a; Wang et al., 2025b; Liao et al., 2025b; a). This observation raises questions about the model's capacity for backtracking, self-reflection, and self-correction. It suggests that LLMs may lack awareness of the information they have already processed, leading to redundant or inefficient reasoning. Moreover, Liu et al. (2024a) demonstrate that LLMs are prone to the "lost-in-the-middle" phenomenon, wherein information located in the middle of a long context is often overlooked. While their analysis is conducted in the context of information retrieval, it remains an open question whether similar positional biases affect long CoT as well.
In this work, we introduce a novel and previously underexplored perspective on long-CoT reasoning: many reasoning failures in long-CoT LLMs stem not from errors made mid-chain, but rather from flaws at the beginning of reasoning. Our experiments demonstrate that the first reasoning step has the most significant influence on the final prediction. When this first step is incorrect, the model is considerably more likely to arrive at an incorrect final answer (a 40% accuracy drop), highlighting the limited self-correction capabilities of current long-CoT LLMs. Notably, this phenomenon is consistently observed across five open- and closed-source long-CoT LLM families.
Motivated by this insight, we propose an efficient early pruning algorithm that exploits the predictive power of the first reasoning step. Specifically, by evaluating the quality of the first step, we identify and discard less promising reasoning traces early, continuing generation only for the more promising ones. This approach significantly reduces inference cost. Across five open-source long-CoT LLM families and five challenging mathematical, scientific, and programming benchmarks, our method maintains accuracy while reducing the inference budget by up to 70%. Our results show that the first step is not just the beginning of reasoning but a key factor influencing both accuracy and efficiency, making it an important focus for future reasoning models.
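As a sketch, the early pruning loop can be written in a few lines; `generate_first_step`, `reward_model`, and `continue_trace` are hypothetical stand-ins for sampling a first step, scoring it, and resuming generation, and the exact selection rule is detailed in § 4:

```python
def prune_first_steps(prompt, generate_first_step, reward_model, continue_trace,
                      n_candidates=8, keep=2):
    """Sample n_candidates first reasoning steps, keep the top-`keep` by
    reward-model score, and continue generation only for those survivors."""
    candidates = [generate_first_step(prompt) for _ in range(n_candidates)]
    # Rank candidate first steps by reward-model score, best first.
    ranked = sorted(candidates, key=reward_model, reverse=True)
    survivors = ranked[:keep]  # discard weak starts early, saving their tokens
    return [continue_trace(prompt, t1) for t1 in survivors]
```

Because only `keep` of the `n_candidates` traces are ever completed, the token budget spent on full generations shrinks roughly by the pruning ratio.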
Contributions. Our main contributions are as follows: (1) To the best of our knowledge, we are the first to empirically establish a strong positive correlation between the first reasoning step and the final prediction across various open- and closed-source long-CoT LLM families (§ 3); (2) Inspired by this observation, we propose an efficient early pruning algorithm that halts generation for less promising initial steps, thereby improving inference efficiency while maintaining accuracy (§ 4); (3) Both the observation and the proposed efficient sampling method are extensively validated on various long-CoT LLMs across different reasoning tasks, with control experiments to disentangle confounding factors.
2 Related work
Lost in the middle. Liu et al. (2024a) introduced the "lost in the middle" effect, demonstrating that LLMs tend to overlook information in the middle of long contexts, performing better when relevant content appears at the beginning or end. This positional bias is evident across tasks like arithmetic reasoning (Shen et al., 2023; Liao & Monz, 2024), multiple-choice QA (Zheng et al., 2024; Pezeshkpour & Hruschka, 2023), text evaluation (Wang et al., 2024; Shi et al., 2024), passage ranking (Zhang et al., 2024b), and instructional prompt positioning (Liu et al., 2024b; Chen et al., 2024b). Additionally, studies have documented primacy and recency biases, where models disproportionately allocate attention to the first or final tokens, independent of their semantic relevance (Xiao et al., 2024; Qin et al., 2023; Barbero et al., 2025). While previous studies have primarily examined positional biases in external context, we investigate whether analogous biases emerge in the internal reasoning trajectories of long chain-of-thought models. Different from attention-level analyses that focus on how the first input token shapes representations, our work shows that the first generated reasoning step greatly influences subsequent reasoning and final outcomes.
Efficient test-time reasoning. Test-time scaling methods aim to improve the accuracyâcompute trade-off by adapting sampling and aggregation. One line of work increases self-consistency efficiency by reducing sample counts (Li et al., 2024; Wan et al., 2025; Aggarwal et al., 2023; Xue et al., 2023), while another shortens chain-of-thought depth via fine-tuning or inference-only optimizations (Chen et al., 2024a; Luo et al., 2025; Hou et al., 2025; Fu et al., 2025a; Yang et al., 2025b). These methods, however, still rely on generating full reasoning traces. DeepConf (Fu et al., 2025b) instead uses local confidence to filter low-quality traces and terminate generation early. Our method takes a different focus: we assess the quality of the initial reasoning step, which strongly shapes subsequent reasoning, and prune weak starts before long traces unfold.
3 Lost at the beginning of reasoning
Motivated by the finding of Liu et al. (2024a), which demonstrates that query-relevant information is more impactful when positioned at either the beginning or end of an LLM's context window, we first investigate whether a similar positional effect exists in long-CoT reasoning (§ 3.1). Our analysis reveals that the first reasoning step has a disproportionately large impact on the final conclusion. To validate this observation, we further perform two ablation studies, confirming the critical role of the first step in determining the model's final prediction (§ 3.2 and § 3.3).
Notation. Let $p$ represent the input prompt, consisting of both a system instruction and a user query. A reasoning model $\mathcal{M}$ produces a sequence of CoT reasoning steps $t=[t_{1},t_{2},...,t_{T}]$ , followed by a final conclusion $c_{T}$ , such that the complete model output is given by $t\oplus c_{T}=\mathcal{M}(p)$ , where $\oplus$ means concatenation. In models such as DeepSeek-R1 (Guo et al., 2025) and Qwen3 (Team, 2025), the input-output format adheres to the following:
$$
p\ \mathrm{{<}think{>}}\ t_{1},t_{2},\ldots,t_{T}\ \mathrm{{<}/think{>}}\ c_{T}
$$
The final prediction $q_{T}$ is then derived by applying an extraction function $g$ to the conclusion, i.e., $q_{T}=g(c_{T})$ , where $g$ may, for example, extract values enclosed within \boxed{}.
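For illustration, a minimal version of such an extraction function $g$ might look as follows (an assumed implementation, not the paper's exact one; nested braces inside \boxed{} would require extra handling):

```python
import re

def extract_boxed(conclusion):
    """A sketch of g: return the content of the last \\boxed{...} in the
    conclusion, or None if no boxed value is present."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", conclusion)
    return matches[-1] if matches else None
```

Taking the last match follows the convention that the final boxed value is the model's answer when a trace restates intermediate boxed results.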
The conclusion $c_{T}$ can be interpreted as a summary of the essential reasoning steps leading to the final prediction. This raises an interesting question: Is there a positional bias in how reasoning steps contribute to the conclusion? In other words, do certain steps have a disproportionately greater influence on $c_{T}$ than others?
3.1 Similarity between reasoning steps and the final conclusion
To understand how intermediate reasoning contributes to the final outcome, we measure the semantic similarity between each reasoning step $\{t_{i}\}_{i=1}^{T}$ and the final conclusion $c_{T}$. This analysis reveals whether the reasoning process gradually aligns with the final answer or diverges along the way.
Experimental setup. We evaluate 60 questions from AIME24 and AIME25 (MAA Committees, 2025) using DeepSeek-R1-Distill-Qwen-7B (abbreviated as DS-R1-Qwen-7B in the remainder of this paper) (Guo et al., 2025), Qwen3-8B (Yang et al., 2025a), Claude-3.7-Sonnet with thinking (Anthropic, 2025), GPT-OSS-20B (Agarwal et al., 2025), and Magistral-Small (Rastogi et al., 2025). For reproducibility, the exact model identifiers are provided in Appendix B. Generations are produced with temperature=1.0, top_p=0.9, min_p=0.05, and max_tokens=32768; for Claude-3.7-Sonnet, only max_tokens is set. All subsequent experiments adopt this hyperparameter configuration.
Segmentation of reasoning steps. We define a reasoning step as a complete logical leap or self-contained unit (Xiong et al., 2025), and segment reasoning traces with GPT-5. By default, we use GPT-5 mini for step segmentation; for GPT-OSS-20B, we instead use GPT-5, as the mini variant is incompatible. To complement this setup, we also employ heuristic segmentation based on reasoning switch keywords (Wang et al., 2025a), with details provided in Appendix A.
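As an illustration, keyword-based segmentation can be approximated by splitting a trace wherever a reasoning-switch word begins a new thought (the keyword list below is illustrative only; the actual heuristic is detailed in Appendix A):

```python
import re

# Illustrative switch words; the real keyword list may differ (see Appendix A).
SWITCH_WORDS = ["Wait", "Alternatively", "Hmm", "Let me double-check", "But"]

def segment_heuristic(trace):
    """Split a reasoning trace into steps at positions where a
    reasoning-switch keyword starts a new thought."""
    # Zero-width lookahead keeps the keyword at the start of each new segment.
    pattern = r"(?=\b(?:" + "|".join(map(re.escape, SWITCH_WORDS)) + r")\b)"
    return [s.strip() for s in re.split(pattern, trace) if s.strip()]
```

Splitting on a lookahead rather than on the keyword itself preserves the switch word inside its step, which keeps each segment a self-contained unit.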
Similarity computation. We compute semantic similarity between each step $t_{i}$ and the conclusion $c_{T}$ by taking the cosine similarity of their embeddings obtained from all-MiniLM-L6-v2 (Reimers & Gurevych, 2019; Wang et al., 2020). To avoid inflated similarity from problem restatement, we use GPT-5 mini to remove question-overlap text at the beginning of traces. As a robustness check, we also report results with SPLADE similarity (Formal et al., 2021) in Appendix C, confirming that our findings are not specific to dense embeddings. Since traces vary in length, similarity curves are interpolated to a fixed number of steps (either the average or maximum length) for visualization.
This setup allows us to capture how reasoning trajectories semantically converge toward, or deviate from, the final answer across different models.
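As an illustration, the two mechanical pieces of this pipeline, cosine similarity and curve interpolation, can be sketched in plain Python (placeholder vectors stand in for the all-MiniLM-L6-v2 embeddings; a sketch, not the exact implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def interpolate_curve(values, n_points):
    """Linearly resample a per-step similarity curve to a fixed number of
    points so that traces of different lengths can be averaged."""
    if len(values) == 1:
        return [values[0]] * n_points
    out = []
    for k in range(n_points):
        x = k * (len(values) - 1) / (n_points - 1)  # position in original curve
        i = min(int(x), len(values) - 2)
        frac = x - i
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return out
```

Each trace yields a curve `[cosine(e(t_i), e(c_T)) for i in 1..T]`, which is resampled to a common length before averaging across questions.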
Result. Figure 2 shows that the first reasoning step exhibits the highest similarity to the final conclusion, after which similarity drops sharply. Beyond the initial few steps, similarity stabilizes at a lower level, with only minor fluctuations across the remainder of the reasoning process. These results suggest that the first step $t_{1}$ is most closely aligned with the final conclusion and likely sets the overall direction of the reasoning. Subsequent steps appear to introduce exploratory or redundant content that deviates from the final answer. Additional results using SPLADE similarities (Figure C.2) show the same trend, confirming that this pattern is robust across similarity metrics. Taken together, these findings show that the first reasoning step plays a disproportionately important role in shaping the final conclusion.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Line Chart: Similarity of Reasoning Steps
### Overview
The image presents a line chart illustrating the similarity (Similarity(C_T, t_i)) between reasoning steps (Reasoning step t_i (GPT-5)) for five different language models: DS-R1-Qwen-7B, Qwen3-8B, Claude-3.7-Sonnet, GPT-OSS-20B, and Magistral-Small. The chart displays how similarity changes as the reasoning step number increases from 0 to approximately 32.
### Components/Axes
* **X-axis:** Reasoning step t_i (GPT-5), ranging from 0 to 32.
* **Y-axis:** Similarity(C_T, t_i), ranging from 0.50 to 0.85.
* **Legend:** Located in the top-right corner, identifying each line with a corresponding color:
* DS-R1-Qwen-7B (Blue)
* Qwen3-8B (Orange)
* Claude-3.7-Sonnet (Green)
* GPT-OSS-20B (Purple)
* Magistral-Small (Brown)
### Detailed Analysis
Here's a breakdown of each line's trend and approximate data points:
* **DS-R1-Qwen-7B (Blue):** The line starts at approximately 0.83 at step 0, then decreases steadily to around 0.53 at step 10. It fluctuates between approximately 0.53 and 0.56 from step 10 to 25, with a dip to around 0.51 at step 27, and ends at approximately 0.54 at step 32.
* **Qwen3-8B (Orange):** This line begins at approximately 0.81 at step 0, decreasing to around 0.58 at step 10. It then rises slightly to around 0.61 at step 15, before decreasing again to approximately 0.56 at step 20. It fluctuates between 0.56 and 0.59 from step 20 to 32, ending at approximately 0.58.
* **Claude-3.7-Sonnet (Green):** The line starts at approximately 0.66 at step 0, decreasing to around 0.61 at step 5. It then increases to a peak of approximately 0.67 at step 12, before fluctuating between approximately 0.61 and 0.64 from step 15 to 32, ending at approximately 0.62.
* **GPT-OSS-20B (Purple):** This line begins at approximately 0.65 at step 0, decreasing to around 0.54 at step 10. It remains relatively stable between approximately 0.53 and 0.56 from step 10 to 32, ending at approximately 0.54.
* **Magistral-Small (Brown):** The line starts at approximately 0.65 at step 0, decreasing to around 0.57 at step 5. It then decreases further to approximately 0.54 at step 10, and fluctuates between approximately 0.54 and 0.58 from step 10 to 32, ending at approximately 0.56.
### Key Observations
* All models exhibit a decreasing trend in similarity during the initial reasoning steps (0-10).
* Claude-3.7-Sonnet maintains the highest similarity scores throughout the reasoning process, although it also experiences a decrease initially.
* DS-R1-Qwen-7B and GPT-OSS-20B show the lowest similarity scores, particularly after step 10.
* Qwen3-8B and Magistral-Small exhibit similar behavior, with fluctuating similarity scores after the initial decrease.
* The similarity scores tend to stabilize after approximately 15 reasoning steps for most models.
### Interpretation
The chart suggests that the reasoning processes of these language models diverge as the number of reasoning steps increases. The initial decrease in similarity indicates that the models are exploring different paths or focusing on different aspects of the problem. The stabilization of similarity scores after a certain number of steps suggests that the models are converging towards a more consistent understanding or solution.
The higher similarity scores of Claude-3.7-Sonnet may indicate a more robust or coherent reasoning process compared to the other models. The lower similarity scores of DS-R1-Qwen-7B and GPT-OSS-20B could suggest that these models are more prone to exploring diverse or potentially less relevant reasoning paths.
The fluctuations in similarity scores for Qwen3-8B and Magistral-Small might reflect the models' ability to adapt and refine their reasoning based on the information gathered during each step. The chart provides valuable insights into the reasoning dynamics of different language models and can be used to assess their strengths and weaknesses in complex problem-solving tasks.
</details>
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Similarity of Reasoning Steps
### Overview
This image presents a line chart illustrating the similarity between reasoning steps (represented by *t<sub>i</sub>*) and a target (*C<sub>T</sub>*) for five different language models: DS-R1-Qwen-7B, Qwen3-8B, Claude-3.7-Sonnet, GPT-OSS-20B, and Magistral-Small. The chart tracks how this similarity changes as the reasoning step number increases from 0 to approximately 52.
### Components/Axes
* **X-axis:** "Reasoning step *t<sub>i</sub>* (Heuristic)", ranging from 0 to 52.
* **Y-axis:** "Similarity(*C<sub>T</sub>*, *t<sub>i</sub>*)", ranging from 0.4 to 0.9.
* **Legend:** Located in the top-right corner, identifying each line with a specific model name and color.
* DS-R1-Qwen-7B (Blue)
* Qwen3-8B (Orange)
* Claude-3.7-Sonnet (Green)
* GPT-OSS-20B (Purple)
* Magistral-Small (Brown)
### Detailed Analysis
Here's a breakdown of each line's trend and approximate data points, verifying color consistency with the legend:
* **DS-R1-Qwen-7B (Blue):** The line starts at approximately 0.85 at step 0, rapidly decreases to around 0.55 by step 10, then fluctuates between 0.45 and 0.55 for the remainder of the steps, showing some oscillation.
* **Qwen3-8B (Orange):** This line begins at approximately 0.88 at step 0, drops sharply to around 0.5 by step 10, and then exhibits a more erratic pattern, oscillating between approximately 0.4 and 0.6. It ends at around 0.55 at step 52.
* **Claude-3.7-Sonnet (Green):** Starts at approximately 0.75 at step 0, decreases to around 0.6 by step 10, and then remains relatively stable, fluctuating between 0.58 and 0.65 for the majority of the steps. It ends at approximately 0.62 at step 52.
* **GPT-OSS-20B (Purple):** Begins at approximately 0.7 at step 0, declines to around 0.45 by step 10, and then fluctuates between approximately 0.4 and 0.5, with some peaks reaching around 0.55. It ends at approximately 0.48 at step 52.
* **Magistral-Small (Brown):** Starts at approximately 0.65 at step 0, decreases to around 0.45 by step 10, and then gradually increases to approximately 0.55 by step 52, showing a slight upward trend in the later steps.
### Key Observations
* All models exhibit a significant drop in similarity during the initial reasoning steps (0-10).
* Claude-3.7-Sonnet maintains the highest similarity scores throughout the reasoning process, remaining consistently above 0.58.
* GPT-OSS-20B consistently shows the lowest similarity scores, generally staying below 0.5.
* Qwen3-8B and DS-R1-Qwen-7B show the most volatile behavior, with significant fluctuations in similarity scores.
* Magistral-Small shows a slight increasing trend in similarity towards the end of the reasoning process.
### Interpretation
The chart suggests that the initial reasoning steps are the most divergent for all models, indicating a rapid shift away from the target concept. Claude-3.7-Sonnet demonstrates the most consistent alignment with the target throughout the reasoning process, suggesting a more stable and focused reasoning approach. GPT-OSS-20B, conversely, appears to drift away from the target more quickly and remains less aligned. The fluctuations observed in Qwen-8B and DS-R1-Qwen-7B could indicate a more exploratory or iterative reasoning process, where the model revisits and refines its understanding of the target. The slight increase in similarity for Magistral-Small towards the end suggests a potential convergence or refinement of its reasoning as it progresses.
The metric "Similarity(*C<sub>T</sub>*, *t<sub>i</sub>*)" likely represents a measure of how closely the model's internal representation at reasoning step *t<sub>i</sub>* aligns with the target concept *C<sub>T</sub>*. A higher similarity score indicates a stronger alignment, while a lower score suggests a greater divergence. This data could be used to evaluate the effectiveness and stability of different language models in performing reasoning tasks.
</details>
Figure 2: Cosine similarity between the embeddings of the $i$ -th reasoning step $t_{i}$ and the final conclusion $c_{T}$ , using the average number of reasoning steps for interpolation. The reasoning steps are segmented either by GPT-5 (left) or by heuristic rules (right). See Figure C.1 for results based on the maximum number of reasoning steps used for interpolation.
Given the strong alignment between early reasoning steps, particularly the first, and the final conclusion, we hypothesize that the first step may significantly influence whether the reasoning model can arrive at a correct prediction.
3.2 Correlation between the first reasoning step and the final prediction
Given that the first reasoning step closely resembles the final conclusion, we investigate whether the essential reasoning required for the final prediction is already encapsulated in the first step. To this end, we analyze the prediction when conditioned solely on the first reasoning step. Specifically, we compute $c_{1}=\mathcal{M}(p\mathrm{{<}think{>}}t_{1}\mathrm{{<}/think{>}})$ , and derive the corresponding prediction $q_{1}=g(c_{1})$ , which we compare against the ground truth $a$ . Based on this comparison, we categorize each first reasoning step as either first correct (if $q_{1}=a$ ) or first incorrect (otherwise).
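This conditioning can be sketched as follows; `model` and `extract` are hypothetical stand-ins for the LLM call and the extraction function $g$, and the exact prompt formatting depends on the model family:

```python
def first_step_prompt(prompt, t1):
    """Build the input p <think> t1 </think>, forcing the model to produce
    a conclusion conditioned only on the first reasoning step."""
    return f"{prompt} <think> {t1} </think>"

def categorize_first_step(model, extract, prompt, t1, answer):
    """Label t1 as 'first correct' if q1 = g(c1) matches the ground truth a,
    and 'first incorrect' otherwise."""
    c1 = model(first_step_prompt(prompt, t1))  # c1 = M(p <think> t1 </think>)
    q1 = extract(c1)                           # q1 = g(c1)
    return "first correct" if q1 == answer else "first incorrect"
```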
Experimental setup. To better analyze the correlation, we sample 64 CoT traces per question using the same datasets as in § 3.1. We exclude questions for which all 64 CoT traces result in either correct or incorrect predictions, as these are considered too easy or too difficult, respectively, yielding 38 questions for DS-R1-Qwen-7B and 37 for Qwen3-8B. For each remaining question and its corresponding first reasoning step $t_{1}$, we obtain the initial prediction $q_{1}$ as previously described. While GPT-5 provides more reliable segmentation, it is costly and difficult to reproduce. We therefore adopt the heuristic segmentation method in all subsequent experiments, which is shown in § 3.1 to yield results comparable to GPT-5 segmentation. To better visualize the final outcomes, we categorize the questions into three groups based on the pass@1 accuracy of the final prediction $q_{T}$, i.e., the average accuracy across the 64 CoT traces, corresponding to the intervals (0, 33%], (33%, 67%], and (67%, 100%). A higher pass@1 indicates an easier question. This grouping allows us to assess whether our observations hold consistently across varying levels of question difficulty.
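For concreteness, the pass@1 computation and the difficulty grouping amount to the following sketch (the handling of excluded questions reflects our reading of the filtering described above):

```python
def pass_at_1(correct_flags):
    """pass@1 as used here: average correctness over the sampled CoT traces."""
    return sum(correct_flags) / len(correct_flags)

def difficulty_bucket(p):
    """Assign a question to one of the three difficulty groups by its pass@1.
    Questions whose traces are all correct or all incorrect are excluded
    from the analysis (returns None)."""
    if p <= 0.0 or p >= 1.0:
        return None
    if p <= 1 / 3:
        return "(0, 33%]"
    if p <= 2 / 3:
        return "(33%, 67%]"
    return "(67%, 100%)"
```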
Result. As shown in Figure 3 (Left), the commonly assumed self-correction capability of reasoning models appears to be overstated. When the first reasoning step is incorrect, the model's final prediction is also likely to be incorrect. On average, final prediction accuracy drops by 40% when the first reasoning step is incorrect, with the most substantial decrease (43%) occurring for difficult questions (0–33% range) and a notable decline (33%) even for easier questions (67–100% range). In addition, we compute the Pearson correlation between the correctness of the first prediction $q_{1}$ and the final prediction $q_{T}$ over all questions. The coefficient $r=0.60$ and p-value $p=0.0$ indicate a moderately strong positive correlation. All these results underscore the pivotal role of the first reasoning step in steering the model toward a correct final answer, particularly in more complex instances where recovery from early mistakes is more challenging. Extending this analysis to DeepSeek and Qwen models of different sizes yields consistent trends: final accuracy remains substantially higher when the first step is correct, and the accuracy gap persists as model scale increases (Figure D.2).
Figure 3 (Right) further illustrates the model's tendency toward overthinking. Even when the first reasoning step is sufficient to arrive at the correct answer, the model continues to generate a substantial number of additional reasoning tokens, comparable in length to those generated following an incorrect first step. Both findings are also observed for Qwen3-8B in Figure D.1, reinforcing this pattern across models.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: Accuracy vs. Pass@1 of q_T
### Overview
This bar chart displays the accuracy of "First correct" and "First incorrect" predictions across different ranges of Pass@1 of q_T. The x-axis represents the Pass@1 of q_T ranges, and the y-axis represents the accuracy in percentage. The chart uses paired bars for each Pass@1 range, with blue bars representing "First correct" and orange bars representing "First incorrect". An "Overall" accuracy is also presented.
### Components/Axes
* **X-axis Title:** Pass@1 of q_T
* **X-axis Categories:** (0, 33%], (33%, 67%], (67%, 100%), Overall
* **Y-axis Title:** Accuracy (%)
* **Y-axis Scale:** 0 to 100, with increments of 20.
* **Legend:**
* Blue (hatched): First correct
* Orange: First incorrect
### Detailed Analysis
The chart consists of four sets of paired bars, plus an overall accuracy comparison.
* **(0, 33%] Pass@1 of q_T:**
* First correct: Approximately 56.7% accuracy.
* First incorrect: Approximately 14.0% accuracy.
* **(33%, 67%] Pass@1 of q_T:**
* First correct: Approximately 80.2% accuracy.
* First incorrect: Approximately 43.9% accuracy.
* **(67%, 100%] Pass@1 of q_T:**
* First correct: Approximately 97.2% accuracy.
* First incorrect: Approximately 63.9% accuracy.
* **Overall:**
* First correct: Approximately 79.7% accuracy.
* First incorrect: Approximately 39.7% accuracy.
The "First correct" bars consistently show a higher accuracy than the "First incorrect" bars across all Pass@1 ranges. The accuracy of "First correct" increases as the Pass@1 of qÏ increases.
### Key Observations
* There is a strong positive correlation between Pass@1 of q_T and the accuracy of "First correct" predictions.
* The accuracy of "First incorrect" predictions remains relatively low across all Pass@1 ranges.
* The largest difference in accuracy between "First correct" and "First incorrect" occurs in the (67%, 100%] Pass@1 range.
* The overall accuracy of "First correct" is 79.7%, while the overall accuracy of "First incorrect" is 39.7%.
### Interpretation
The data suggests that the final answer is far more likely to be correct when the prediction based on the first reasoning step is correct, and that this holds at every difficulty level. Pass@1 of $q_T$ serves as a difficulty proxy: higher values indicate easier questions, on which both groups achieve higher accuracy. The persistent gap between "First correct" and "First incorrect" across all ranges indicates that the model rarely recovers from an incorrect first step, so the quality of the first step is a strong predictor of the final outcome. The overall accuracies (79.7% vs. 39.7%) summarize this effect across all difficulty levels.
</details>
<details>
<summary>x5.png Details</summary>

### Visual Description
## Bar Chart: Number of Tokens After First Step vs. Pass@1 of $q_T$
### Overview
This bar chart compares the number of tokens generated after the first step for "First correct" and "First incorrect" scenarios, categorized by the Pass@1 of $q_T$, i.e., the fraction of sampled traces whose final prediction is correct. The chart displays the average number of tokens (in thousands, 'k') generated in each category.
### Components/Axes
* **X-axis:** Pass@1 of $q_T$, with categories: "(0, 33%]", "(33%, 67%]", "(67%, 100%]", and "Overall".
* **Y-axis:** Number of tokens after first step (k), ranging from 0 to 16.
* **Legend:**
* "First correct" - represented by a dark blue, hatched bar.
* "First incorrect" - represented by a light orange, hatched bar.
* **Data Labels:** Each bar has a numerical label indicating the corresponding value.
### Detailed Analysis
The chart consists of four sets of paired bars, one for each Pass@1 of $q_T$ category.
* **(0, 33%]:**
* "First correct": Approximately 12.6 k tokens.
* "First incorrect": Approximately 14.5 k tokens.
* **(33%, 67%]:**
* "First correct": Approximately 8.3 k tokens.
* "First incorrect": Approximately 11.4 k tokens.
* **(67%, 100%]:**
* "First correct": Approximately 4.7 k tokens.
* "First incorrect": Approximately 7.4 k tokens.
* **Overall:**
* "First correct": Approximately 8.2 k tokens.
* "First incorrect": Approximately 11.3 k tokens.
### Key Observations
* The "First incorrect" bars are consistently higher than the "First correct" bars across all Pass@1 of qÏ categories.
* As the Pass@1 of $q_T$ increases (from (0, 33%] to (67%, 100%]), the number of tokens generated for both "First correct" and "First incorrect" decreases.
* The largest difference between "First correct" and "First incorrect" occurs in the (33%, 67%] category, at approximately 3.1 k tokens.
* The smallest difference between "First correct" and "First incorrect" occurs in the (0, 33%] category, at approximately 1.9 k tokens.
### Interpretation
The data suggests that when the model's initial prediction is less accurate (lower Pass@1 of $q_T$), it tends to generate more tokens before either arriving at the correct answer or determining that the initial attempt was incorrect. This is reflected in the higher token counts for "First incorrect" in the lower Pass@1 ranges. The decreasing token count as Pass@1 increases indicates that more accurate initial predictions lead to faster convergence to a solution. The overall difference between correct and incorrect first steps suggests that the model requires more tokens when it initially makes an incorrect prediction, even when considering all cases. This could be due to the model needing to explore more possibilities or correct its course after an initial misstep. The chart provides insight into the efficiency of the model's reasoning process based on the quality of its initial predictions.
</details>
Figure 3: Accuracy and number of tokens on DS-R1-Qwen-7B. Left: The relationship between the accuracy of the final prediction ($q_{T}$) and the correctness of the prediction based solely on the first reasoning step ($q_{1}$) across different difficulty levels. If $q_{1}$ is incorrect, $q_{T}$ is more likely incorrect. Right: The number of tokens used for the final prediction after the first reasoning step $t_{1}$, i.e., the number of tokens used for $[t_{2},t_{3},...,t_{T}]$. Even when $q_{1}$ is correct, the model still consumes a large number of tokens for the following reasoning steps, i.e., overthinking.
3.3 Minor perturbation to the correct first step leads to significant loss
Building on our findings in § 3.2, which demonstrate a positive correlation between the model's first and final predictions, we further investigate the significance of the first reasoning step by introducing minor perturbations. Specifically, we slightly alter an initially correct reasoning step and provide it as input to the model to assess whether it can recover from such errors.
Experimental setup. Unlike § 3.2, where we analyze the correctness of the first reasoning step $t_{1}$, here we treat the final correct conclusion $c_{T}$, which satisfies $q_{T}=g(c_{T})=a$, as the new first reasoning step, denoted $t^{\prime}_{1}$. This choice ensures that the step contains all the reasoning necessary to arrive at the correct answer, which cannot be guaranteed for $t_{1}$: as illustrated in Figure 3 (Left), an initially correct reasoning step can still lead to an incorrect final prediction. To construct $t^{\prime}_{1}$, we apply the following perturbations to $c_{T}$ (see Appendix G for an example): (1) we remove not only the explicit answer formatting (e.g., \boxed{a}) but also any surrounding sentences that may directly disclose or repeat the final answer; (2) the resulting text from (1) is treated as the correct version of $t^{\prime}_{1}$ (serving as our baseline); (3) we generate an incorrect version by replacing the correct answer $a$ in the remaining reasoning with $a\pm 1$ or $a\pm 10$. The answer to an AIME question is an integer in the range $[0, 999]$.
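Perturbation step (3) can be sketched as below. This is a minimal illustration rather than the exact implementation, and it assumes the answer appears as a standalone integer token in the reasoning text:

```python
import random
import re

def perturb_answer(step_text, answer, offsets=(1, -1, 10, -10)):
    """Replace the correct integer answer with a nearby wrong value (a±1 or a±10)."""
    wrong = answer + random.choice(offsets)
    # Word boundaries keep e.g. 204 from matching inside 1204.
    return re.sub(rf"\b{answer}\b", str(wrong), step_text)
```

Because the substitution touches only the stated answer, the surrounding reasoning structure is preserved, matching the "minimal perturbation" requirement above.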
Table 1: Perturbation experiments. Reported accuracy (%) with correct vs. incorrect first step. Even minor perturbations cause significant drops.
| Model | Correct $t^{\prime}_{1}$ | Incorrect $t^{\prime}_{1}$ |
| --- | --- | --- |
| DS-R1-Qwen-1.5B | 95.4 | 64.4 |
| DS-R1-Qwen-7B | 94.8 | 28.5 |
| DS-R1-Qwen-32B | 100.0 | 85.8 |
| Qwen3-1.7B | 96.0 | 46.6 |
| Qwen3-8B | 71.4 | 37.0 |
| Qwen3-30B-A3B | 100.0 | 74.7 |
| Qwen3-235B-A22B | 100.0 | 78.7 |
These perturbations are minimal, as they preserve the core reasoning structure while only altering the final prediction in the incorrect variant. We then combine the prompt $p$ with the modified first reasoning step $t^{\prime}_{1}$ and input it to the model as $\mathcal{M}(p\mathrm{{<}think{>}}t^{\prime}_{1}\mathrm{Alternatively})$ to assess subsequent reasoning behavior.
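The input construction above can be sketched as follows; this is a minimal illustration, and the exact chat template is model-specific:

```python
def build_perturbed_input(p, t1_prime):
    # M(p <think> t'_1 "Alternatively"): open the think block, inject the
    # (possibly perturbed) first step, and append the connector word that
    # nudges the model to continue with a new line of reasoning.
    return f"{p}<think>{t1_prime}\nAlternatively"
```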
Result. As shown in Table 1, we make two key observations: (1) Smaller models rarely reach 100% accuracy even when the first reasoning step is correct, suggesting that they may revise or deviate from their initial reasoning. In contrast, larger models (e.g., DS-R1-Qwen-32B) consistently achieve 100% accuracy given a correct first step, indicating greater stability. (2) There is a substantial drop in accuracy when the first reasoning step is incorrect, highlighting that even minor errors early in the reasoning process can significantly affect the final prediction. These findings further indicate that the LLM's ability to self-correct has been overestimated.
In this section, we observe that the reasoning model is particularly vulnerable at the initial stage of the reasoning process; an error in the first step can propagate and substantially degrade the final prediction. Can we develop a method to identify and retain more promising first reasoning steps while discarding suboptimal ones to enhance the overall generation efficiency?
4 Early pruning with hint from first step
In this section, we propose an efficient and straightforward sampling method to identify a promising first reasoning step. By doing so, we can terminate the generation process early when a suboptimal first step is detected, thereby reducing unnecessary computational overhead.
4.1 Problem definition
In contrast to the notation introduced in § 3, we incorporate a random seed $\epsilon$ to introduce stochasticity into the sampling process. Specifically, a sampled trace is computed as $t\oplus c_{T}=\mathcal{M}(p,\epsilon)$ . By varying the random seed $\epsilon^{n}$ , we obtain diverse generations, denoted as $t^{n}\oplus c^{n}_{T}=\mathcal{M}(p,\epsilon^{n})$ , where $t^{n}=[t^{n}_{1},t^{n}_{2},...,t^{n}_{T}]$ . In prior experiments, we sampled 64 CoT traces per question using 64 distinct random seeds $\{\epsilon^{n}\}_{n=1}^{64}$ .
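The seeded sampling $\mathcal{M}(p,\epsilon^{n})$ can be sketched generically; `generate` is a hypothetical stand-in for one full model call (prompt plus seed to complete trace), not an API of any particular library:

```python
def sample_diverse_traces(generate, prompt, n):
    # One full CoT trace per random seed eps^n; varying the seed under
    # high-temperature decoding yields diverse generations.
    return [generate(prompt, seed) for seed in range(n)]
```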
A widely adopted technique for reasoning tasks is majority voting or self-consistency generation (Wang et al., 2022). To promote exploration of the reasoning space, models are typically sampled at high temperatures, resulting in diverse outputs. Majority voting then serves to aggregate these outputs by selecting the most frequent final prediction. Formally, majority voting over $K$ samples is defined as:
$$
q_{\text{maj@}K}=\text{mode}(\{q^{n}_{T}\}_{n=1}^{K})\quad\text{where}\quad q^{n}_{T}=g(c^{n}_{T})
$$
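The voting rule above amounts to taking the mode of the extracted answers; a minimal sketch, where `predictions` stands in for $\{q^{n}_{T}\}_{n=1}^{K}$:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent final prediction, i.e. mode({q_T^n})."""
    # most_common(1) yields [(value, count)] for the top item; ties are
    # broken by first-seen order, which suffices for a sketch.
    return Counter(predictions).most_common(1)[0][0]
```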
However, for models generating long CoT traces, majority voting becomes computationally expensive, as it requires independently sampling $N$ complete reasoning paths. In this section, we propose a more efficient approach that completes only $M$ full traces, where $M<N$, while maintaining majority-voting performance comparable to that with $N$ samples.
4.2 Methodology
In § 3, we demonstrated a strong positive correlation between the first reasoning step and the final prediction. This observation motivates a method that identifies the top $M$ most promising first reasoning steps out of a total of $N$ , and continues generation only for these selected $M$ candidates, while discarding the remaining $(N-M)$ .
Let a reasoning model generate $N$ candidate first reasoning steps $\{t_{1}^{1},t_{1}^{2},...,t_{1}^{N}\}$ from a prompt $p$ with different random seeds $\{\epsilon^{n}\}_{n=1}^{N}$. Each $t_{1}^{n}$ is the first reasoning step of a full reasoning trajectory. We define a scoring function $r: t_{1}^{n}\to\mathbb{R}$ that estimates the promise of a first step, e.g., the rating from a reward model. We then select the top $M$ first steps based on their scores:
$$
\mathcal{R}_{\text{top}}=\text{TopM}(\{r(t_{1}^{n})\}_{n=1}^{N})
$$
Only the selected $M$ first steps $\{t_{1}^{n}\mid n\in\mathcal{R}_{\text{top}}\}$ are used for further multi-step generation. The remaining $(N-M)$ are discarded. Since the first step typically requires only a small number of tokens to generate, this selection approach efficiently reduces computation by avoiding full sampling for the less promising $(N-M)$ candidates.
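The selection procedure can be sketched as follows; `reward` is a stand-in for the scoring function $r(\cdot)$ (in the paper, a process reward model), and the names are illustrative:

```python
def top_m_indices(scores, m):
    """Indices of the m highest-scoring candidates (the set R_top)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:m])

def early_prune(first_steps, reward, m):
    """Score each candidate first step and keep only the top m of them."""
    scores = [reward(t1) for t1 in first_steps]
    keep = top_m_indices(scores, m)
    # Preserve original ordering among the survivors.
    return [t1 for i, t1 in enumerate(first_steps) if i in keep]
```

Only the surviving candidates are continued to full trajectories, so the cost of the discarded $(N-M)$ candidates is limited to their short first steps plus one scoring pass each.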
<details>
<summary>x6.png Details</summary>

### Visual Description
## Chart: maj@K Accuracy vs. K for Different Models
### Overview
The image presents a grid of line charts comparing the maj@K (majority-voting) accuracy of two sampling configurations, "Vanilla N" and "TopM", across varying values of K (1, 4, 8, 16, 32, 64). The charts are arranged in a 4x5 grid: each row corresponds to one model (DS-R1-Qwen-1.5B, DS-R1-Qwen-32B, Qwen3-8B, SW-OR1-7B-Preview), the first column shows the "Average" across benchmarks, and the subsequent columns show the individual benchmarks AIME24, AIME25, HMMT, and GPQA.
### Components/Axes
* **X-axis:** K (values: 1, 4, 8, 16, 32, 64). Labeled as "K".
* **Y-axis:** maj@K accuracy (%). Labeled as "maj@K accuracy (%)". The scale varies between charts, ranging from approximately 10% to 80%.
* **Legend:** Located in the top-left corner of each chart.
* Vanilla N (Orange Line)
* TopM (Blue Line)
* **Charts:** Arranged in a 4x5 grid, one row per model:
* Row 1: DS-R1-Qwen-1.5B
* Row 2: DS-R1-Qwen-32B
* Row 3: Qwen3-8B
* Row 4: SW-OR1-7B-Preview
### Detailed Analysis or Content Details
**Average Charts:**
* **Top Row:** TopM (blue) starts at approximately 30% and increases steadily to around 40% as K increases. Vanilla N (orange) starts at approximately 30% and increases to around 35%.
* **Second Row:** TopM (blue) starts at approximately 65% and remains relatively stable around 70% as K increases. Vanilla N (orange) starts at approximately 50% and increases to around 70%.
* **Third Row:** TopM (blue) starts at approximately 50% and increases to around 60% as K increases. Vanilla N (orange) starts at approximately 45% and increases to around 60%.
* **Bottom Row:** TopM (blue) starts at approximately 50% and remains relatively stable around 55% as K increases. Vanilla N (orange) starts at approximately 50% and remains relatively stable around 50%.
**AIME24 Charts:**
* **Top Row:** Both TopM (blue) and Vanilla N (orange) start at approximately 55% and increase to around 65% as K increases.
* **Second Row:** TopM (blue) starts at approximately 75% and remains relatively stable around 80% as K increases. Vanilla N (orange) starts at approximately 65% and increases to around 80%.
* **Third Row:** TopM (blue) starts at approximately 75% and remains relatively stable around 80% as K increases. Vanilla N (orange) starts at approximately 65% and increases to around 80%.
* **Bottom Row:** TopM (blue) starts at approximately 75% and remains relatively stable around 80% as K increases. Vanilla N (orange) starts at approximately 65% and increases to around 80%.
**AIME25 Charts:**
* **Top Row:** Both TopM (blue) and Vanilla N (orange) start at approximately 35% and increase to around 45% as K increases.
* **Second Row:** TopM (blue) starts at approximately 65% and remains relatively stable around 75% as K increases. Vanilla N (orange) starts at approximately 55% and increases to around 75%.
* **Third Row:** TopM (blue) starts at approximately 65% and remains relatively stable around 75% as K increases. Vanilla N (orange) starts at approximately 55% and increases to around 75%.
* **Bottom Row:** TopM (blue) starts at approximately 65% and remains relatively stable around 75% as K increases. Vanilla N (orange) starts at approximately 55% and increases to around 75%.
**HMMT Charts:**
* **Top Row:** Both TopM (blue) and Vanilla N (orange) start at approximately 15% and increase to around 25% as K increases.
* **Second Row:** TopM (blue) starts at approximately 40% and increases to around 50% as K increases. Vanilla N (orange) starts at approximately 30% and increases to around 50%.
* **Third Row:** TopM (blue) starts at approximately 35% and increases to around 45% as K increases. Vanilla N (orange) starts at approximately 30% and increases to around 45%.
* **Bottom Row:** TopM (blue) starts at approximately 30% and increases to around 35% as K increases. Vanilla N (orange) starts at approximately 25% and increases to around 35%.
**GPQA Charts:**
* **Top Row (DS-R1-Qwen-1.5B):** Vanilla N (orange) starts at approximately 30% and increases to around 35% as K increases. TopM (blue) starts at approximately 35% and remains relatively stable around 40%.
* **Second Row (DS-R1-Qwen-32B):** Vanilla N (orange) starts at approximately 55% and increases to around 62% as K increases. TopM (blue) starts at approximately 60% and remains relatively stable around 64%.
* **Third Row (Qwen3-8B):** Vanilla N (orange) starts at approximately 55% and increases to around 62% as K increases. TopM (blue) starts at approximately 60% and remains relatively stable around 64%.
* **Bottom Row (SW-OR1-7B-Preview):** Vanilla N (orange) starts at approximately 48% and increases to around 54% as K increases. TopM (blue) starts at approximately 50% and remains relatively stable around 54%.
### Key Observations
* Generally, maj@K accuracy increases with K across all models and datasets.
* TopM consistently matches or outperforms Vanilla N, especially at lower values of K.
* The HMMT dataset shows the lowest overall accuracy compared to other datasets.
* The AIME24 dataset exhibits the highest accuracy, with some models approaching 80%.
* The performance differences between models are more pronounced at lower values of K.
### Interpretation
The charts illustrate how the number of aggregated samples K affects majority-voting (maj@K) accuracy for different models across benchmarks. Accuracy rises with K and then saturates, the typical behavior of self-consistency. The advantage of TopM over Vanilla N is most pronounced at small K: pruning to the highest-reward first steps raises the average quality of the completed traces, so fewer full samples are needed to match the accuracy of vanilla sampling with K=64. The dataset-specific variations reflect differing benchmark difficulty; the low accuracy on HMMT suggests it is the most challenging of the benchmarks shown.
</details>
Figure 4: Majority voting accuracy with different number of samplings for four LLMs. The vertical dashed line denotes the smallest $M$ whose accuracy is equal to or larger than the accuracy of $N=64$ .
4.3 Experiments
Setup. We evaluate five families of reasoning models, DS-R1-Qwen (Guo et al., 2025), Qwen3 (Yang et al., 2025a), Skywork-OR1 (SW-OR1) (He et al., 2025), Magistral (Rastogi et al., 2025), and GPT-OSS (Agarwal et al., 2025), on five challenging reasoning benchmarks spanning mathematics, science, and programming: AIME24, AIME25 (MAA Committees, 2025), HMMT Feb 2025 (Balunović et al., 2025), the GPQA Diamond set (Rein et al., 2024), and LiveCodeBench (Jain et al., 2024; latest release, containing problems from Jan-Apr 2025), consisting of 30, 30, 30, 198, and 175 problems, respectively. For decoding, we adopt each model's recommended temperature and top_p, with a maximum generation length of 32K tokens.
We consider values of $N$ and $M$ from the set $\{1,2,4,8,16,32,64\}$. For early pruning, we fix $N=64$ and select the top $M$ most promising first-step candidates using reward scores predicted by a process reward model (PRM), Qwen2.5-Math-PRM-7B (Zhang et al., 2025). When $M=64$, the accuracy is identical to that of $N=64$, since all candidates are retained. For the PRM, a "step" is defined as a segment of text ending with "\n\n", and each step receives an individual reward score. The trajectory $t^{n}_{1}$ contains multiple such steps, and we use the score of its final step as the overall reward $r(t^{n}_{1})$. Notably, scoring $t^{n}_{1}$ with a PRM is inexpensive: $t^{n}_{1}$ is short, and the computation is comparable to generating a single token given the input $p\oplus t^{n}_{1}$.
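The reward extraction can be sketched as below; `score_steps` is a hypothetical stand-in for a PRM forward pass that returns one score per step (the paper uses Qwen2.5-Math-PRM-7B):

```python
def split_prm_steps(trace):
    # A PRM "step" is a text segment delimited by a blank line ("\n\n").
    return [s for s in trace.split("\n\n") if s.strip()]

def first_step_reward(trace, score_steps):
    # r(t_1^n): score every segment of the first reasoning step with the
    # PRM, then use the score of the final segment as the overall reward.
    scores = score_steps(split_prm_steps(trace))
    return scores[-1]
```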
Unlike the definition used in § 3, where $t^{n}_{1}$ terminates upon generating the keyword "Alternatively", we redefine $t^{n}_{1}$ in this section to have a fixed token length, $\text{len}(t^{n}_{1})$. The previous definition made it difficult to control the computational budget for generating first steps, as the trigger phrase might not appear or could occur late in the sequence. By fixing the token length, we achieve precise control over the generation budget when sampling $N$ candidate first steps. By default, $\text{len}(t^{n}_{1})=512$.
4.3.1 Main results
In Figure 4, we analyze the performance as $M$ varies. We find that selecting the top 16 first reasoning steps from 64 candidates and continuing generation from them achieves accuracy on par with, or even exceeding, conventional sampling with $N=64$. This trend is consistent across diverse LLMs and benchmarks. Interestingly, for certain cases, such as HMMT on DS-R1-Qwen-1.5B, and AIME24, AIME25, and GPQA on DS-R1-Qwen-32B, using as few as $M\le 8$ suffices to match the performance obtained with all 64 samples.
Table 2: Early pruning accuracy and efficiency. We select the $M=16$ first steps with the highest reward scores out of $N=64$ candidate first steps. The number of tokens used for the 48 discarded first steps is included in the cost of the early pruning method. Early pruning maintains, and sometimes even improves, accuracy while requiring less than 30% of the original inference budget.
| Model | Method | AIME24 | AIME25 | HMMT | GPQA | Avg. | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DS-R1-Qwen-1.5B | $N=64$ | 60.0 | 40.0 | 23.3 | 41.4 | 41.2 | $\times$ 1.00 |
| | $N=16$ | 57.0 | 37.6 | 22.4 | 39.4 | 39.1 (-2.1) | $\times$ 0.25 |
| | $M=16$ | 60.0 | 40.0 | 26.7 | 39.4 | 41.5 (+0.3) | $\times$ 0.28 |
| DS-R1-Qwen-32B | $N=64$ | 86.7 | 73.3 | 56.7 | 64.1 | 70.2 | $\times$ 1.00 |
| | $N=16$ | 85.4 | 70.9 | 52.6 | 63.9 | 68.2 (-2.0) | $\times$ 0.25 |
| | $M=16$ | 86.7 | 73.3 | 56.7 | 63.6 | 70.1 (-0.1) | $\times$ 0.29 |
| Qwen3-8B | $N=64$ | 83.3 | 70.0 | 56.7 | 61.1 | 67.8 | $\times$ 1.00 |
| | $N=16$ | 79.8 | 69.0 | 52.3 | 61.3 | 65.6 (-2.2) | $\times$ 0.25 |
| | $M=16$ | 83.3 | 70.0 | 60.0 | 62.1 | 68.9 (+1.1) | $\times$ 0.28 |
| SW-OR1-7B | $N=64$ | 80.0 | 60.0 | 40.0 | 50.5 | 57.6 | $\times$ 1.00 |
| | $N=16$ | 75.7 | 58.0 | 37.9 | 51.9 | 55.9 (-1.7) | $\times$ 0.25 |
| | $M=16$ | 80.0 | 60.0 | 40.0 | 52.5 | 57.9 (+0.3) | $\times$ 0.29 |
| Magistral-Small | $N=64$ | 86.7 | 83.3 | 76.7 | 70.2 | 79.2 | $\times$ 1.00 |
| | $N=16$ | 87.1 | 82.6 | 71.1 | 69.2 | 77.5 (-1.7) | $\times$ 0.25 |
| | $M=16$ | 86.7 | 83.3 | 73.3 | 70.2 | 78.4 (-0.8) | $\times$ 0.28 |
| GPT-OSS-20B | $N=64$ | 86.7 | 83.3 | 80.0 | 73.2 | 80.8 | $\times$ 1.00 |
| | $N=16$ | 85.3 | 81.7 | 73.3 | 72.7 | 78.3 (-2.5) | $\times$ 0.25 |
| | $M=16$ | 86.7 | 83.3 | 80.0 | 73.2 | 80.8 (-0.0) | $\times$ 0.27 |
<details>
<summary>x7.png Details</summary>

### Visual Description
## Chart: pass@K vs. K for Different Models
### Overview
The image presents three line charts, each depicting the relationship between 'pass@K' (y-axis) and 'K' (x-axis) for a different language model: DS-R1-Qwen-32B, Qwen3-8B, and GPT-OSS-20B. Each chart compares two data series: "Vanilla N" and "TopM". The charts visually demonstrate how the 'pass@K' metric changes as the value of 'K' increases for each model and sampling method.
### Components/Axes
* **X-axis Label:** "K" - ranging from 1 to 15, with markers at integer values.
* **Y-axis Label:** "pass@K" - ranging from approximately 40 to 65, with markers at intervals of 5.
* **Chart Titles:**
* "DS-R1-Owen-32B" (Top-left chart)
* "Qwen3-8B" (Top-center chart)
* "GPT-OSS-20B" (Top-right chart)
* **Legend:** Located in the top-left corner of each chart.
* "Vanilla N" - represented by an orange line with a circular marker.
* "TopM" - represented by a blue line with a circular marker.
### Detailed Analysis or Content Details
**Chart 1: DS-R1-Qwen-32B**
* **Vanilla N:** The line slopes upward, starting at approximately 50 at K=1, reaching around 58 at K=7, and plateauing around 62-63 from K=9 onwards. Approximate data points: (1, 50), (3, 53), (5, 56), (7, 58), (9, 61), (11, 62), (13, 62), (15, 63).
* **TopM:** The line also slopes upward, but starts higher at approximately 53 at K=1, reaches around 63 at K=7, and plateaus around 64-65 from K=9 onwards. Approximate data points: (1, 53), (3, 58), (5, 61), (7, 63), (9, 64), (11, 64), (13, 65), (15, 65).
**Chart 2: Qwen3-8B**
* **Vanilla N:** The line slopes upward, starting at approximately 43 at K=1, reaching around 54 at K=7, and plateauing around 56-57 from K=9 onwards. Approximate data points: (1, 43), (3, 47), (5, 51), (7, 54), (9, 56), (11, 56), (13, 57), (15, 57).
* **TopM:** The line slopes upward, starting at approximately 42 at K=1, reaching around 55 at K=7, and plateauing around 57-58 from K=9 onwards. Approximate data points: (1, 42), (3, 46), (5, 50), (7, 55), (9, 57), (11, 57), (13, 58), (15, 58).
**Chart 3: GPT-OSS-20B**
* **Vanilla N:** The line slopes upward, starting at approximately 41 at K=1, reaching around 52 at K=7, and plateauing around 55-56 from K=9 onwards. Approximate data points: (1, 41), (3, 45), (5, 49), (7, 52), (9, 55), (11, 55), (13, 56), (15, 56).
* **TopM:** The line slopes upward, starting at approximately 44 at K=1, reaching around 56 at K=7, and plateauing around 58-60 from K=9 onwards. Approximate data points: (1, 44), (3, 48), (5, 52), (7, 56), (9, 58), (11, 59), (13, 60), (15, 60).
### Key Observations
* In all three charts, "TopM" consistently outperforms "Vanilla N" across all values of K.
* The performance gains from "Vanilla N" to "TopM" appear to diminish as K increases, with the lines flattening out and converging at higher K values.
* DS-R1-Qwen-32B generally achieves the highest 'pass@K' values, followed by GPT-OSS-20B, and then Qwen3-8B.
* The rate of improvement in 'pass@K' with increasing K is steepest at lower K values (1-7) for all models and sampling methods.
### Interpretation
The charts demonstrate the impact of the two sampling configurations ("Vanilla N" and "TopM") on the 'pass@K' metric for three language models. 'pass@K' is the probability that at least one of K sampled solutions is correct (here, passes all test cases). The consistent outperformance of "TopM" suggests that continuing generation only from the highest-reward first steps yields higher success rates than vanilla sampling with the same number of completed traces.
The diminishing returns observed at higher K values indicate that beyond a certain point, increasing the number of considered outputs does not significantly improve performance. This suggests that the most informative outputs are concentrated within the top K possibilities, and exploring beyond that point yields diminishing benefits.
The differences in overall 'pass@K' values between the models suggest varying levels of capability on this benchmark: DS-R1-Qwen-32B appears strongest, followed by GPT-OSS-20B and Qwen3-8B. The charts provide insight into the trade-off between sampling budget and performance, which can inform the choice of configuration for specific applications.
</details>
Figure 5: The pass rate on LiveCodeBench, where we set $N=16$ for early pruning (TopM).
<details>
<summary>x8.png Details</summary>

### Visual Description
## Chart: maj@K Accuracy vs. K for Different Models
### Overview
The image presents two line charts comparing the Average maj@K accuracy of different configurations across varying values of K. The left chart compares "Vanilla N" against early pruning with reward signals from three reward models: Qwen2.5-Math-PRM-7B, Qwen2.5-Math-PRM-72B, and Qwen2.5-Math-RM-72B. The right chart compares "Vanilla N" against early pruning with different first-step lengths (len(t1) = 256, 512, 1024).
### Components/Axes
Both charts share the following components:
* **X-axis:** Labeled "K", with values ranging from 1 to 64. The scale is logarithmic, with markers at 1, 4, 8, 16, 32, and 64.
* **Y-axis:** Labeled "Average maj@K accuracy (%)", with a scale ranging from 48% to 72%.
* **Legends:** Located in the top-right corner of each chart, identifying the different data series by color.
The left chart's legend includes:
* Vanilla N (Orange)
* Qwen2.5-Math-PRM-7B (Blue)
* Qwen2.5-Math-PRM-72B (Purple)
* Qwen2.5-Math-RM-72B (Pink)
The right chart's legend includes:
* Vanilla N (Orange)
* len(t1)=256 (Blue)
* len(t1)=512 (Light Blue)
* len(t1)=1024 (Pink)
### Detailed Analysis or Content Details
**Left Chart:**
* **Vanilla N:** Starts at approximately 48% at K=1, rises to around 68% at K=8, plateaus around 71% from K=16 to K=64.
* **Qwen2.5-Math-PRM-7B:** Starts at approximately 52% at K=1, rises sharply to around 68% at K=8, and plateaus around 71% from K=16 to K=64.
* **Qwen2.5-Math-PRM-72B:** Starts at approximately 52% at K=1, rises sharply to around 70% at K=8, and plateaus around 72% from K=16 to K=64.
* **Qwen2.5-Math-RM-72B:** Starts at approximately 52% at K=1, rises sharply to around 70% at K=8, and plateaus around 72% from K=16 to K=64.
* A vertical dashed line is present at K=8.
**Right Chart:**
* **Vanilla N:** Starts at approximately 48% at K=1, rises to around 68% at K=8, plateaus around 71% from K=16 to K=64.
* **len(t1)=256:** Starts at approximately 54% at K=1, rises sharply to around 68% at K=8, and plateaus around 71% from K=16 to K=64.
* **len(t1)=512:** Starts at approximately 54% at K=1, rises sharply to around 70% at K=8, and plateaus around 72% from K=16 to K=64.
* **len(t1)=1024:** Starts at approximately 54% at K=1, rises sharply to around 70% at K=8, and plateaus around 72% from K=16 to K=64.
### Key Observations
* In both charts, accuracy generally increases with K up to a point (around K=8-16), after which it plateaus.
* In the left chart, Qwen2.5-Math-PRM-72B and Qwen2.5-Math-RM-72B slightly outperform Qwen2.5-Math-PRM-7B, and all three reward-based variants outperform Vanilla N, particularly at lower K values.
* In the right chart, len(t1)=512 and len(t1)=1024 perform comparably and slightly better than len(t1)=256, particularly at small K.
* The vertical dashed line at K=8 likely marks the smallest number of retained samples whose accuracy matches that of K=64.
### Interpretation
The data suggests that increasing the reward model size (from 7B to 72B parameters in the left chart) has only a minor effect, while the first-step length (right chart) matters more at small K: len(t1)=256 lags behind the longer settings. The plateauing of accuracy at higher K values indicates that beyond roughly K=8-16, additional samples do not significantly improve performance.
The close performance of the 7B and 72B reward models suggests that a small PRM suffices for scoring first steps, which keeps the pruning overhead low. Similarly, the improvement from longer first steps indicates that the reward model needs sufficient reasoning context to evaluate a step reliably. The "Vanilla N" model serves as a baseline, and the other curves demonstrate the benefit of reward-based early pruning under different reward models and first-step lengths.
</details>
Figure 6: Average accuracy across AIME24, AIME25 and HMMT on DS-R1-Qwen-32B. Left: Comparison of reward signals derived from different reward models. The choice of reward model has minimal impact on overall performance. Right: Effect of varying the length of the first reasoning step. Using a very short first step, like 256 tokens, leads to suboptimal performance, likely because it provides insufficient reasoning context to effectively evaluate the quality of the step.
Table 2 reports the detailed accuracy and token consumption of different methods. When using $M=16$ , early pruning consistently matches the accuracy of $N=64$ across a range of LLMs and benchmarks, while substantially outperforming $N=16$ under a comparable token budget. Notably, for Qwen3-8B, early pruning even yields a 1.1% improvement in average accuracy. Importantly, these gains come at less than 30% of the inference cost of majority voting with $N=64$ , underscoring the strong efficiency advantage of our method.
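The pruning step itself reduces to ranking the $N$ candidate first steps by their reward score and continuing decoding only from the top $M$. A minimal sketch in Python, where `reward_fn` is a stand-in for the PRM scoring call:

```python
def prune_first_steps(first_steps, reward_fn, m):
    """Keep the m highest-scoring candidate first steps.

    first_steps: candidate first reasoning steps (strings).
    reward_fn:   scores one first step; stands in for the PRM call.
    m:           number of candidates to continue decoding from.
    """
    return sorted(first_steps, key=reward_fn, reverse=True)[:m]

# Toy usage: string length stands in for the reward score.
kept = prune_first_steps(["a", "abcd", "ab", "abc"], len, m=2)
print(kept)  # ['abcd', 'abc']
```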
For the code generation benchmark, LiveCodeBench, where majority voting is not applicable, we present the pass rate in Figure 5. Early pruning consistently surpasses standard sampling given the same number of complete samples.
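For reference, pass rates on such benchmarks are usually computed with the unbiased pass@k estimator over $n$ samples of which $c$ pass; a short sketch, assuming the standard HumanEval-style protocol:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total with c correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # 0.25
```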
4.3.2 Ablation studies
Here we further validate our default settings.
Choice of reward model. In Figure 6 (Left), we evaluate two additional reward models: a larger PRM, Qwen2.5-Math-PRM-72B, and an outcome reward model, Qwen2.5-Math-RM-72B (Yang et al., 2024). The results indicate that the choice of reward model has minimal impact on performance. Notably, the smaller 7B PRM achieves comparable results, highlighting the efficiency of our approach.
Length of first step. In Figure 6 (Right), we examine the impact of varying the length of the first reasoning step. We observe that the shortest first step (i.e., $\text{len}(t^{n}_{1})=256$ ) leads to degraded performance. We hypothesize that shorter $t^{n}_{1}$ sequences lack sufficient reasoning content, making them less informative for reliable reward model evaluation. Nevertheless, setting $\text{len}(t^{n}_{1}) \geq 512$ tokens yields consistently better performance than vanilla sampling.
Table 3: Average maj@K for early pruning, with first step defined by length or phrase.
| Model | Length | Phrase |
| --- | --- | --- |
| DS-R1-Qwen-1.5B | 41.5 | 43.8 |
| DS-R1-Qwen-32B | 70.1 | 70.4 |
| Qwen3-8B | 68.9 | 68.2 |
| SW-OR1-7B | 57.9 | 57.7 |
Effect of first step split. Table 3 examines how defining the first step influences early pruning performance. The heuristic approach follows the method described in § 3. Overall, both definitions yield comparable results. Nevertheless, we recommend using the token-count-based definition, as it provides a clearer way to manage the token budget across candidate first steps. Moreover, trigger phrases that signal step boundaries may vary across LLMs.
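The two first-step definitions compared in Table 3 can be sketched as follows; whitespace tokens stand in for the model tokenizer, and the trigger list is illustrative:

```python
def split_first_step(trace, max_tokens=512, triggers=("alternatively", "wait")):
    """Two ways to delimit the first reasoning step of a trace.

    Length-based: cut after a fixed token budget (whitespace tokens
    stand in for the model tokenizer here).
    Phrase-based: cut at the first trigger phrase signaling a switch
    in reasoning.
    """
    tokens = trace.split()
    by_length = " ".join(tokens[:max_tokens])

    lower = trace.lower()
    cut = min((lower.find(t) for t in triggers if t in lower), default=len(trace))
    by_phrase = trace[:cut].rstrip()
    return by_length, by_phrase

step_len, step_phrase = split_first_step(
    "Compute the sum directly. Wait, maybe factor first.", max_tokens=4
)
print(step_len)     # Compute the sum directly.
print(step_phrase)  # Compute the sum directly.
```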
Table 4: Time spent by early pruning.
| Model | Vanilla sampling | Early pruning |
| --- | --- | --- |
| DS-R1-Qwen-1.5B | $\times$ 1.00 | $\times$ 0.27 |
| Qwen3-8B | $\times$ 1.00 | $\times$ 0.37 |
Overhead from reward model. Relative to vanilla sampling, early pruning requires scoring the first step with a reward model, which introduces additional overhead. To ensure a fair comparison, we avoid using extra GPUs for deploying the reward model. Instead, our procedure is as follows: (1) load the reasoning model to generate candidate first steps and then offload it; (2) load the reward model on the same GPU to evaluate these steps and offload it; and (3) reload the reasoning model to continue generation from the selected first steps. The timing results are reported in Table 4. Notably, early pruning remains efficient both in terms of token usage and runtime, since evaluating the first step with the reward model is inexpensive: comparable to computing embeddings for the short first steps.
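The three-stage schedule can be sketched as below; `load`, `offload`, `generate`, and `score` are hypothetical stand-ins for the actual serving calls, not a specific API:

```python
def early_pruning_schedule(prompt, n, m, load, offload, generate, score):
    """Single-GPU schedule: generate n candidate first steps, score
    them with the reward model on the same GPU, then continue only
    from the m best. All four callables are hypothetical stand-ins."""
    reasoner = load("reasoning-model")               # (1) candidate first steps
    firsts = [generate(reasoner, prompt) for _ in range(n)]
    offload(reasoner)

    rm = load("reward-model")                        # (2) score on the same GPU
    kept = sorted(firsts, key=lambda s: score(rm, s), reverse=True)[:m]
    offload(rm)

    reasoner = load("reasoning-model")               # (3) continue from survivors
    return [generate(reasoner, prompt + s) for s in kept]
```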
5 Conclusion
In this paper, we empirically demonstrate that the first reasoning step plays a critical role in determining the final outcome of a model's reasoning process. Errors introduced early can significantly degrade overall performance. Motivated by this observation, we propose an efficient sampling strategy that identifies and retains high-quality first reasoning steps, reducing inference computation by up to 70% across three model families. These findings suggest that improving or exploiting the very first step is a promising direction for building more accurate and efficient reasoning LLMs.
Acknowledgements
This research was partly supported by the Netherlands Organization for Scientific Research (NWO) under project number VI.C.192.080, 024.004.022, NWA.1389.20.183, and KICH3.LTP.20.006, and the European Union under grant agreements No. 101070212 (FINDHR) and No. 101201510 (UNITE). We are very grateful for the computation support from eBay. Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of their respective employers, funders and/or granting authorities.
References
- Agarwal et al. (2025) Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.
- Aggarwal et al. (2023) Pranjal Aggarwal, Aman Madaan, Yiming Yang, et al. Let's sample step by step: Adaptive-consistency for efficient reasoning and coding with llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396, 2023.
- Anthropic (2025) Anthropic. Claude 3.7 sonnet system card. 2025. URL https://api.semanticscholar.org/CorpusID:276612236.
- Balunović et al. (2025) Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/.
- Barbero et al. (2025) Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token? arXiv preprint arXiv:2504.02732, 2025.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. (2024a) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024a.
- Chen et al. (2024b) Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, and Maarten de Rijke. The SIFo benchmark: Investigating the sequential instruction following ability of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1691–1706. Association for Computational Linguistics, 2024b. URL https://aclanthology.org/2024.findings-emnlp.92/.
- Chiang & Lee (2024) Cheng-Han Chiang and Hung-yi Lee. Over-reasoning and redundant calculation of large language models. In Yvette Graham and Matthew Purver (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 161–169, St. Julian's, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-short.15/.
- Formal et al. (2021) Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. Splade: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2288–2292, 2021.
- Formal et al. (2022) Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. From distillation to hard negative sampling: Making sparse neural ir models more effective, 2022. URL https://arxiv.org/abs/2205.04733.
- Fu et al. (2025a) Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025a. URL https://openreview.net/forum?id=wpK4IMJfdX.
- Fu et al. (2025b) Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. arXiv preprint arXiv:2508.15260, 2025b.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- He et al. (2025) Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog.
- Hestness et al. (2017) Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Hou et al. (2025) Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296, 2025.
- Hurst et al. (2024) Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Li et al. (2024) Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. In ICLR, 2024.
- Liao & Monz (2024) Baohao Liao and Christof Monz. 3-in-1: 2d rotary adaptation for efficient finetuning, efficient batching and composability. arXiv preprint arXiv:2409.00119, 2024.
- Liao et al. (2025a) Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, and Caiming Xiong. Fractured chain-of-thought reasoning. arXiv preprint arXiv:2505.12992, 2025a.
- Liao et al. (2025b) Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient LLM reasoning. arXiv preprint arXiv:2501.19324, 2025b.
- Liu et al. (2024a) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024a. doi: 10.1162/tacl_a_00638. URL https://aclanthology.org/2024.tacl-1.9/.
- Liu et al. (2024b) Yijin Liu, Xianfeng Zeng, Chenze Shao, Fandong Meng, and Jie Zhou. Instruction position matters in sequence generation with large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 11652–11663, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.693. URL https://aclanthology.org/2024.findings-acl.693/.
- Luo et al. (2025) Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025.
- MAA Committees (2025) MAA Committees. AIME Problems and Solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2025.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- Pezeshkpour & Hruschka (2023) Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
- Qin et al. (2023) Guanghui Qin, Yukun Feng, and Benjamin Van Durme. The NLP task effectiveness of long-range transformers. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 3774–3790. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.eacl-main.273. URL https://aclanthology.org/2023.eacl-main.273/.
- Rastogi et al. (2025) Abhinav Rastogi, Albert Q Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, et al. Magistral. arXiv preprint arXiv:2506.10910, 2025.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1908.10084.
- Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
- Shen et al. (2023) Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, and Yi Zhang. Positional description matters for transformers arithmetic. arXiv preprint arXiv:2311.14737, 2023.
- Shi et al. (2024) Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic investigation of position bias in pairwise comparative assessments by llms. arXiv preprint arXiv:2406.07791, 2024.
- Team et al. (2024) Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Team (2025) Qwen Team. Qwen3, April 2025. URL https://qwenlm.github.io/blog/qwen3/.
- Wan et al. (2025) Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3613–3635, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.184. URL https://aclanthology.org/2025.naacl-long.184/.
- Wang et al. (2024) Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9440–9450. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.511/.
- Wang et al. (2020) Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2002.10957.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wang et al. (2025a) Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, et al. Thoughts are all over the place: On the underthinking of o1-like llms. arXiv preprint arXiv:2501.18585, 2025a.
- Wang et al. (2025b) Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. Chain-of-probe: Examining the necessity and accuracy of CoT step-by-step. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 2586–2606. Association for Computational Linguistics, 2025b. ISBN 979-8-89176-195-7. URL https://aclanthology.org/2025.findings-naacl.140/.
- Wen et al. (2025) Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond. arXiv preprint arXiv:2503.10460, 2025.
- Xiao et al. (2024) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.
- Xiong et al. (2025) Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, and Sainbayar Sukhbaatar. Stepwiser: Stepwise generative judges for wiser reasoning. arXiv preprint arXiv:2508.19229, 2025.
- Xue et al. (2023) Mingfeng Xue, Dayiheng Liu, Wenqiang Lei, Xingzhang Ren, Baosong Yang, Jun Xie, Yidan Zhang, Dezhong Peng, and Jiancheng Lv. Dynamic voting for efficient reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 3085–3104, 2023.
- Yang et al. (2024) An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
- Yang et al. (2025a) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.
- Yang et al. (2025b) Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Zheng Lin, Li Cao, and Weiping Wang. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895, 2025b.
- Zhang et al. (2024a) Yusen Zhang, Sarkar Snigdha Sarathi Das, and Rui Zhang. Verbosity $\neq$ veracity: Demystify verbosity compensation behavior of large language models. arXiv preprint arXiv:2411.07858, 2024a.
- Zhang et al. (2025) Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025.
- Zhang et al. (2024b) Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, and Zhangyang Wang. Found in the middle: How language models use long contexts better via plug-and-play positional encoding. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
- Zheng et al. (2024) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2024.
Appendix A Keywords for Reasoning Step Segmentation
Good reasoning-switching keywords should signal a clear switch in reasoning, while occurring neither too frequently, since frequent markers often indicate only minor digressions, nor too rarely to serve as reliable delimiters. We collect a list of reasoning-switching keywords from prior work (Muennighoff et al., 2025; Hou et al., 2025). To assess which keywords could serve this role, we compute their average frequency per question in model outputs. After considering the frequencies and manually checking the segmentation results, we found that "alternatively" strikes a balance for DS-R1-Qwen-7B and Qwen3-8B, appearing often enough to capture step boundaries while still reflecting genuine shifts in reasoning. We use both "alternatively" and "wait" for Claude-3.7-Sonnet, GPT-OSS-20B and Magistral-Small.
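The frequency statistic can be computed as below; we assume whole-word matching over lower-cased outputs, which may differ in detail from the exact scripts:

```python
import re
from collections import Counter

def marker_frequency(outputs, markers=("but", "wait", "alternatively", "hmm")):
    """Average per-question frequency of discourse markers in model
    outputs, used to pick a reasoning-switch delimiter. Counts are
    whole-word, case-insensitive matches."""
    totals = Counter()
    for text in outputs:
        for m in markers:
            totals[m] += len(re.findall(rf"\b{re.escape(m)}\b", text.lower()))
    return {m: totals[m] / len(outputs) for m in markers}

freqs = marker_frequency(["Wait, but wait.", "Alternatively, try x. But no."])
print(freqs["wait"])  # 1.0
```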
Table A.1: Average frequency per question of discourse markers in model outputs.
| Model | but | wait | alternatively | hmm | hold on | let me confirm | however |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DS-R1-Qwen-7B | 82 | 72 | 14 | 7 | 3 | $<$ 1 | 10.1 |
| Qwen3-8B | 91 | 66 | 20 | 6 | 1 | $<$ 1 | 10.4 |
| Claude-3.7-Sonnet | 28.5 | 22.6 | 0.2 | 6.1 | 0.7 | 0.0 | 0.9 |
| Magistral-Small | 47.1 | 54.8 | 3.0 | 0.0 | 0.9 | 0.2 | 1.4 |
| GPT-OSS-20B | 171.6 | 60.5 | 4.9 | 1.2 | 0.7 | 0.0 | 5.9 |
Appendix B Model Identifiers
For reproducibility, we provide the exact model identifiers used in our experiments. The models are: DeepSeek-R1-Distill-Qwen-7B (deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), Qwen3-8B (Qwen/Qwen3-8B), Claude 3.7 (claude-3-7-sonnet-20250219, accessed via the Claude API), GPT-OSS-20B (openai/gpt-oss-20b), and Magistral-Small (mistralai/Magistral-Small-2509).
Appendix C Additional Similarity Analyses
We report the cosine similarity between the embedding of the $i$ -th reasoning step $t_{i}$ and the final conclusion $c_{T}$ , interpolated using the maximum number of reasoning steps (Figure C.1).
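Because traces contain different numbers of steps, each per-step similarity curve is resampled onto a common axis before averaging. A minimal numpy sketch, assuming linear interpolation:

```python
import numpy as np

def interpolate_curve(similarities, max_steps):
    """Resample one trace's per-step similarity curve onto a common
    axis of max_steps points, so traces of different lengths can be
    averaged position-by-position."""
    x_old = np.linspace(0.0, 1.0, num=len(similarities))
    x_new = np.linspace(0.0, 1.0, num=max_steps)
    return np.interp(x_new, x_old, similarities)

curve = interpolate_curve([1.0, 0.0], max_steps=3)
print(curve)  # values 1.0, 0.5, 0.0
```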
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Similarity of Reasoning Steps
### Overview
This line chart depicts the similarity, Similarity(c_T, t_i), between each reasoning step and the final conclusion over a range of steps (0 to 175) for five different language models: DS-R1-Qwen-7B, Qwen3-8B, Claude-3.7-Sonnet, GPT-OSS-20B, and Magistral-Small. The chart illustrates how the similarity changes as the reasoning process progresses.
### Components/Axes
* **X-axis:** Reasoning step t_i (GPT-5 segmentation), ranging from 0 to approximately 175.
* **Y-axis:** Similarity(c_T, t_i), ranging from 0.50 to 0.85.
* **Legend:** Located in the top-right corner, identifying each line with its corresponding model name and size.
* DS-R1-Qwen-7B (Blue)
* Qwen3-8B (Orange)
* Claude-3.7-Sonnet (Green)
* GPT-OSS-20B (Purple)
* Magistral-Small (Brown)
### Detailed Analysis
Here's a breakdown of each line's trend and approximate data points:
* **DS-R1-Qwen-7B (Blue):** The line starts at approximately 0.83 at step 0 and decreases rapidly to around 0.60 by step 25. It then fluctuates between approximately 0.58 and 0.65 until step 150, after which it shows a slight increase, ending at approximately 0.62 at step 175.
* **Qwen3-8B (Orange):** This line begins at approximately 0.81 at step 0 and declines to around 0.63 by step 25. It continues to decrease, reaching a low of approximately 0.55 around step 75. It then fluctuates between approximately 0.55 and 0.62 until step 175, ending at approximately 0.58.
* **Claude-3.7-Sonnet (Green):** Starting at approximately 0.80 at step 0, this line decreases to around 0.67 by step 25. It remains relatively stable, fluctuating between approximately 0.63 and 0.68 from step 50 to step 175, ending at approximately 0.65.
* **GPT-OSS-20B (Purple):** This line starts at approximately 0.66 at step 0 and decreases steadily to around 0.54 by step 50. It continues to decline, reaching a minimum of approximately 0.50 around step 150. It then shows a slight increase, ending at approximately 0.53 at step 175.
* **Magistral-Small (Brown):** Beginning at approximately 0.78 at step 0, this line decreases to around 0.63 by step 25. It fluctuates between approximately 0.60 and 0.67 from step 50 to step 175, ending at approximately 0.63.
### Key Observations
* All models exhibit a decreasing trend in similarity during the initial reasoning steps (0-25).
* Claude-3.7-Sonnet maintains the highest similarity scores throughout the reasoning process, with relatively low fluctuation.
* GPT-OSS-20B consistently shows the lowest similarity scores, and experiences the most significant decline.
* DS-R1-Qwen-7B, Qwen3-8B, and Magistral-Small show similar patterns of decline and fluctuation, with Qwen3-8B generally exhibiting slightly lower similarity than the other two.
### Interpretation
The chart suggests that as the reasoning process progresses, the consistency or similarity of the reasoning steps decreases for all the evaluated language models. This could indicate that the models diverge in their thought processes as they tackle more complex reasoning tasks. The relatively stable high similarity of Claude-3.7-Sonnet might suggest a more consistent and focused reasoning approach compared to the other models. The lower similarity and greater decline observed in GPT-OSS-20B could indicate a more exploratory or less focused reasoning process. The initial high similarity across all models suggests that they start with a similar understanding of the problem, but their approaches diverge as they proceed. The fluctuations in similarity for most models could represent moments of insight, correction, or exploration within the reasoning process. The data suggests that the model architecture and size play a role in the consistency of reasoning, with larger models not necessarily exhibiting more consistent reasoning.
</details>
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Similarity vs. Reasoning Step
### Overview
This line chart depicts the similarity, Similarity(c_T, t_i), of several language models over reasoning steps. The x-axis represents the reasoning step, obtained from the heuristic segmentation, ranging from 0 to approximately 120. The y-axis represents the similarity score, ranging from approximately 0.4 to 0.9. Five different language models are compared: DS-R1-Qwen-7B, Qwen3-8B, Claude-3.7-Sonnet, GPT-OSS-20B, and Magistral-Small.
### Components/Axes
* **X-axis:** "Reasoning step t_i (Heuristic)" - Ranges from 0 to 120.
* **Y-axis:** "Similarity(c_T, t_i)" - Ranges from 0.4 to 0.9.
* **Legend (Top-Right):**
* DS-R1-Qwen-7B (Blue)
* Qwen3-8B (Orange)
* Claude-3.7-Sonnet (Green)
* GPT-OSS-20B (Purple)
* Magistral-Small (Grey)
### Detailed Analysis
Here's a breakdown of each line's trend and approximate data points.
* **DS-R1-Qwen-7B (Blue):** The line starts at approximately 0.73 at step 0, rapidly decreases to around 0.58 by step 10, then plateaus and fluctuates between approximately 0.52 and 0.62 from step 40 to 120.
* **Qwen3-8B (Orange):** This line begins at approximately 0.75 at step 0, decreases sharply to around 0.55 by step 10, then gradually declines to approximately 0.50 by step 60. It then fluctuates between approximately 0.50 and 0.58 from step 60 to 120.
* **Claude-3.7-Sonnet (Green):** Starts at approximately 0.72 at step 0, decreases to around 0.60 by step 10, and then maintains a relatively stable level between approximately 0.58 and 0.65 from step 20 to 120.
* **GPT-OSS-20B (Purple):** Begins at approximately 0.70 at step 0, drops to around 0.45 by step 20, and then fluctuates between approximately 0.42 and 0.55 from step 20 to 120. This line exhibits the most volatility.
* **Magistral-Small (Grey):** Starts at approximately 0.71 at step 0, decreases rapidly to around 0.45 by step 20, and then fluctuates between approximately 0.40 and 0.55 from step 20 to 120. It shows a similar pattern to GPT-OSS-20B, but with slightly lower overall similarity scores.
All lines show an initial decrease in similarity as the reasoning step increases.
### Key Observations
* All models exhibit a decline in similarity as reasoning steps increase, suggesting that the initial coherence or alignment with the target concept diminishes with further reasoning.
* Claude-3.7-Sonnet demonstrates the most stable similarity score over the reasoning steps, indicating a more consistent performance.
* GPT-OSS-20B and Magistral-Small show the most significant fluctuations in similarity, suggesting a less predictable reasoning process.
* DS-R1-Qwen-7B and Qwen3-8B have similar trajectories, with Qwen3-8B generally exhibiting slightly lower similarity scores.
### Interpretation
The chart suggests that the initial stages of reasoning (steps 0-20) are where the models are most aligned with the target concept (represented by the initial high similarity scores). As reasoning progresses, this alignment degrades for all models, but at different rates and with varying degrees of stability. The stability of Claude-3.7-Sonnet could indicate a more robust reasoning process or a better ability to maintain coherence over extended reasoning chains. The volatility of GPT-OSS-20B and Magistral-Small might suggest that these models are more prone to exploring diverse or tangential lines of reasoning, leading to fluctuations in similarity. The differences in trajectories between the models could be attributed to variations in their architectures, training data, or reasoning mechanisms. The metric "Similarity(CÏ, t)" likely measures the alignment of the model's reasoning process (at step t) with a target concept (CÏ). The heuristic nature of the reasoning step (tᔹ) suggests that it's not a direct measure of computational steps but rather a representation of the logical progression of reasoning.
</details>
Figure C.1: Cosine similarity between the embeddings of the i-th reasoning step $t_{i}$ and the final conclusion $c_{T}$ , using the maximum number of reasoning steps for interpolation. The reasoning steps are segmented either by GPT-5 (left) or by heuristic rules (right).
In addition, we report SPLADE similarity (Formal et al., 2021), which provides a sparse lexical-semantic relevance signal by expanding tokens into weighted lexical features. We compute SPLADE similarity using the naver/splade-cocondenser-ensembledistil model (Formal et al., 2022), following standard practice. As shown in Figure C.2, the trends closely mirror those observed under cosine similarity. These results serve as complementary checks, confirming that our findings are not tied to dense embedding models and hold across both dense and sparse similarity measures.
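Unlike dense cosine similarity, a SPLADE-style score is a dot product between sparse term-weight maps. A toy sketch with made-up weights (real weights come from the SPLADE encoder):

```python
def splade_similarity(doc_a, doc_b):
    """SPLADE-style sparse relevance: each text is a sparse map from
    vocabulary terms to non-negative weights, and the score is the
    dot product over shared terms. Weights here are illustrative;
    in practice they come from the SPLADE encoder's expansion."""
    return sum(w * doc_b.get(term, 0.0) for term, w in doc_a.items())

step = {"prime": 1.2, "factor": 0.8, "sum": 0.3}
conclusion = {"prime": 0.9, "sum": 0.5, "answer": 1.1}
print(splade_similarity(step, conclusion))  # 1.2*0.9 + 0.3*0.5 = 1.23
```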
<details>
<summary>x11.png Details</summary>

### Visual Description
\n
## Line Chart: Similarity vs. Reasoning Step
### Description
This line chart visualizes the relationship between similarity scores and reasoning steps. The x-axis represents the reasoning step (from 0 to 7), and the y-axis represents the similarity score (ranging from 0.0 to 1.0).
### Data
The chart displays two lines:
* **Train Data:** This line shows the similarity scores for the training dataset. It generally starts at a higher similarity score (around 0.8) at step 0 and gradually decreases as the reasoning step increases, eventually leveling off around 0.2-0.3 by step 7. There are some fluctuations along the way.
* **Test Data:** This line represents the similarity scores for the test dataset. It begins at a lower similarity score (around 0.4) at step 0 compared to the train data. It also decreases with increasing reasoning steps, but at a slower rate, ending around 0.1-0.2 at step 7. It exhibits more pronounced fluctuations than the train data.
### Key Observations
* The train data consistently exhibits higher similarity scores than the test data across all reasoning steps.
* Both train and test data show a negative correlation between reasoning step and similarity score â as the reasoning step increases, the similarity score tends to decrease.
* The test data demonstrates greater variability in similarity scores compared to the train data.
### Potential Implications
This suggests that the model performs better on the training data, as expected, and that the similarity between the model's reasoning and the ground truth decreases as the reasoning process becomes more complex (i.e., more reasoning steps are involved). The higher variability in the test data could indicate that the model is less robust or generalizes less effectively to unseen examples.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Chart: Similarity vs. Reasoning Step
### Overview
This image presents a line chart illustrating the similarity (Similarity(c_T, t)) of different language models over reasoning steps (t). The chart compares the performance of DS-R1-Qwen-7B, Qwen-3-8B, Claude-3.7-Sonnet, GPT-OSS-20B, and Magistral-Small as the number of reasoning steps increases.
### Components/Axes
* **X-axis:** Reasoning step t_i (Heuristic). Scale ranges from approximately 0 to 55.
* **Y-axis:** Similarity(c_T, t). Scale ranges from approximately 20 to 70.
* **Legend:** Located in the top-right corner, identifying each line with a corresponding color:
* DS-R1-Qwen-7B (Blue)
* Qwen-3-8B (Orange)
* Claude-3.7-Sonnet (Green)
* GPT-OSS-20B (Purple)
* Magistral-Small (Brown)
### Detailed Analysis
Here's a breakdown of each line's trend and approximate data points:
* **DS-R1-Qwen-7B (Blue):** The line starts at approximately 68 at step 0, rapidly decreases to around 35 by step 5, then fluctuates between approximately 30 and 40 for steps 5 to 45. It then shows an upward trend, reaching approximately 40 at step 55.
* **Qwen-3-8B (Orange):** Starts at approximately 60 at step 0, quickly drops to around 25 by step 5, and remains relatively stable between approximately 22 and 30 for steps 5 to 45. It shows a slight increase to approximately 30 at step 55.
* **Claude-3.7-Sonnet (Green):** Begins at approximately 65 at step 0, decreases to around 38 by step 5, and maintains a relatively constant level between approximately 34 and 40 for steps 5 to 55.
* **GPT-OSS-20B (Purple):** Starts at approximately 62 at step 0, declines to around 28 by step 5, and fluctuates between approximately 25 and 35 for steps 5 to 55.
* **Magistral-Small (Brown):** Starts at approximately 55 at step 0, drops to around 22 by step 5, and remains relatively stable between approximately 20 and 30 for steps 5 to 55.
### Key Observations
* All models exhibit a significant drop in similarity within the first 5 reasoning steps.
* Claude-3.7-Sonnet maintains the highest similarity scores throughout most of the reasoning steps, remaining relatively stable.
* Magistral-Small consistently shows the lowest similarity scores.
* DS-R1-Qwen-7B shows the most significant increase in similarity towards the end of the reasoning steps (between steps 45 and 55).
* Qwen-3-8B and GPT-OSS-20B show similar trends, with relatively low and stable similarity scores.
### Interpretation
The chart suggests that all the evaluated language models initially lose similarity as they progress through reasoning steps. This could indicate that the models diverge from their initial state as they engage in more complex thought processes. However, the models differ in how they maintain or recover similarity.
Claude-3.7-Sonnet's stability suggests it is more robust in maintaining a consistent internal representation during reasoning. Magistral-Small's consistently low scores might indicate a weaker ability to maintain coherence or relevance during reasoning. The increase in similarity for DS-R1-Qwen-7B towards the end could suggest that it converges on a solution or a more stable state after an initial period of exploration.
The metric "Similarity(c_T, t)" likely represents a measure of how closely the model's representation at reasoning step *t* aligns with the final conclusion *c_T*. The rapid initial drop suggests that the models quickly move away from their initial state, while the subsequent behavior reveals their ability to maintain coherence or converge on a solution. The differences between the models highlight their varying strengths and weaknesses in reasoning tasks.
</details>
Figure C.2: SPLADE similarity between the representations of the i-th reasoning step $t_{i}$ and the final conclusion $c_{T}$ , using the maximum number of reasoning steps for interpolation. The reasoning steps are segmented either by GPT-5 (left) or by heuristic rules (right).
Appendix D Correlation between first reasoning step and final prediction
In the main text, we reported results on DS-R1-Qwen-7B. Here we extend the analysis to Qwen3-8B (Figure D.1) and further include DeepSeek and Qwen3 models of different sizes (Figure D.2). These additional results verify that the correlation between the first reasoning step and the final prediction persists as model size scales.
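The conditional accuracies reported in this appendix reduce to grouping reasoning traces by the correctness of their first step and averaging final correctness within each group. A minimal sketch with synthetic records (the field names are hypothetical, not the paper's data format):

```python
def accuracy_by_first_step(records):
    """Split traces by whether the first reasoning step was correct,
    and return the final-prediction accuracy within each group."""
    groups = {True: [], False: []}
    for r in records:
        groups[r["first_step_correct"]].append(r["final_correct"])
    return {k: sum(v) / len(v) for k, v in groups.items() if v}

# Synthetic example: 4 traces with a correct first step, 4 without.
records = [
    {"first_step_correct": True,  "final_correct": True},
    {"first_step_correct": True,  "final_correct": True},
    {"first_step_correct": True,  "final_correct": True},
    {"first_step_correct": True,  "final_correct": False},
    {"first_step_correct": False, "final_correct": False},
    {"first_step_correct": False, "final_correct": False},
    {"first_step_correct": False, "final_correct": True},
    {"first_step_correct": False, "final_correct": False},
]
acc = accuracy_by_first_step(records)  # {True: 0.75, False: 0.25}
```

A large gap between the two groups, as in the synthetic numbers above, is the signature pattern shown in Figures D.1 and D.2.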
<details>
<summary>x13.png Details</summary>

### Visual Description
## Bar Chart: Accuracy vs. Pass@1 of q_T
### Overview
This bar chart compares the accuracy of "First correct" and "First incorrect" predictions across different ranges of Pass@1 of q_T. Pass@1 of q_T is represented on the x-axis, and Accuracy (%) is represented on the y-axis. The chart displays accuracy for both correct and incorrect first predictions for three specific ranges of Pass@1 of q_T, as well as an overall accuracy.
### Components/Axes
* **X-axis:** Pass@1 of q_T, with categories: (0, 33%], (33%, 67%], (67%, 100%], and Overall.
* **Y-axis:** Accuracy (%), ranging from 0 to 90, with increments of 10.
* **Legend:** Located in the top-right corner.
* "First correct" - represented by a dark blue color with a diagonal hatch pattern.
* "First incorrect" - represented by a light orange color with a diagonal hatch pattern.
### Detailed Analysis
The chart consists of paired bars for each Pass@1 of q_T category, representing the accuracy of the first correct and first incorrect predictions.
* **(0, 33%]**:
* First correct: Approximately 16.7% accuracy. (Dark Blue Bar)
* First incorrect: Approximately 11.9% accuracy. (Orange Bar)
* **(33%, 67%]**:
* First correct: Approximately 55.6% accuracy. (Dark Blue Bar)
* First incorrect: Approximately 50.6% accuracy. (Orange Bar)
* **(67%, 100%]**:
* First correct: Approximately 89.8% accuracy. (Dark Blue Bar)
* First incorrect: Approximately 84.9% accuracy. (Orange Bar)
* **Overall**:
* First correct: Approximately 68.5% accuracy. (Dark Blue Bar)
* First incorrect: Approximately 56.7% accuracy. (Orange Bar)
The "First correct" bars are consistently higher than the "First incorrect" bars across all categories. The trend shows that as Pass@1 of q_T increases, the accuracy of "First correct" predictions also increases significantly.
### Key Observations
* The accuracy gap between "First correct" and "First incorrect" is most pronounced in the (67%, 100%] Pass@1 of q_T range.
* The overall accuracy for "First correct" is 68.5%, while for "First incorrect" it is 56.7%.
* The lowest accuracy values are observed in the (0, 33%] Pass@1 of q_T range.
### Interpretation
The data suggests a strong positive correlation between Pass@1 of q_T and the accuracy of the first correct prediction. Higher values of Pass@1 of q_T indicate a higher probability of selecting the correct answer as the first attempt, leading to significantly improved accuracy. The consistent outperformance of "First correct" over "First incorrect" indicates that the model is more likely to identify the correct answer when it has a higher confidence level (as reflected by Pass@1 of q_T). The overall accuracy values provide a general performance benchmark, while the category-specific data reveals how performance varies across different confidence levels. The large jump in accuracy between the (33%, 67%] and (67%, 100%] ranges suggests a threshold effect, where exceeding a certain confidence level dramatically improves prediction accuracy. This could be due to the model's internal decision-making process, where a higher confidence score triggers a more reliable prediction mechanism.
</details>
<details>
<summary>x14.png Details</summary>

### Visual Description
## Bar Chart: Number of Tokens After First Step vs. Pass@1 of q_T
### Overview
This bar chart compares the number of tokens generated after the first step for "First correct" and "First incorrect" predictions, categorized by the Pass@1 metric of q_T. The chart displays the data for three q_T ranges: (0, 33%], (33%, 67%], and (67%, 100%], as well as an overall average.
### Components/Axes
* **X-axis:** Pass@1 of q_T, with categories: (0, 33%], (33%, 67%], (67%, 100%], and "Overall".
* **Y-axis:** Number of tokens after first step (k), ranging from 0 to 18, with increments of 2.
* **Legend:** Located in the top-right corner.
* "First correct" - represented by a blue, diagonally-striped bar.
* "First incorrect" - represented by an orange, diagonally-striped bar.
### Detailed Analysis
The chart consists of four groups of paired bars, one for each Pass@1 category and one for the overall average.
* **(0, 33%]**:
* "First correct": Approximately 15.4 k tokens.
* "First incorrect": Approximately 15.8 k tokens.
* **(33%, 67%]**:
* "First correct": Approximately 13.2 k tokens.
* "First incorrect": Approximately 13.9 k tokens.
* **(67%, 100%]**:
* "First correct": Approximately 9.5 k tokens.
* "First incorrect": Approximately 10.0 k tokens.
* **Overall**:
* "First correct": Approximately 11.4 k tokens.
* "First incorrect": Approximately 12.5 k tokens.
The "First incorrect" bars are consistently higher than the "First correct" bars across all categories, though the difference varies.
### Key Observations
* The number of tokens generated after the first step decreases as the Pass@1 of q_T increases for both "First correct" and "First incorrect" predictions.
* The difference between "First correct" and "First incorrect" is most pronounced in the (0, 33%] category, with "First incorrect" being slightly higher.
* The overall "First incorrect" value (12.5 k) is higher than the overall "First correct" value (11.4 k).
### Interpretation
The data suggests that when the model's initial prediction is incorrect (as indicated by a lower Pass@1 score), it tends to generate more tokens in the subsequent step. This could indicate that the model attempts to "recover" from an incorrect initial prediction by generating more text. The decreasing number of tokens as Pass@1 increases suggests that more accurate initial predictions require less subsequent generation. The overall trend confirms that, on average, incorrect first steps lead to more tokens being generated. This could be a useful metric for evaluating and improving the model's prediction accuracy and efficiency. The chart demonstrates a clear inverse relationship between initial prediction accuracy (Pass@1) and the amount of text generated after the first step.
</details>
Figure D.1: Accuracy and number of tokens on Qwen3-8B. Left: The relationship between the accuracy of the final prediction ( $q_{T}$ ) and the correctness of the prediction solely based on the first reasoning step ( $q_{1}$ ) across different difficulty levels. If $q_{1}$ is incorrect, $q_{T}$ is more likely incorrect. Right: The number of tokens used for the final prediction after the first reasoning step $t_{1}$ , i.e., the number of tokens used for $[t_{2},t_{3},...,t_{T}]$ . Even when $q_{1}$ is correct, the model still consumes a large number of tokens for the subsequent reasoning steps, i.e., it overthinks.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Bar Chart: Model Accuracy - First Step Correctness
### Overview
This bar chart compares the accuracy of several models (DS-R1-1.5B, DS-R1-32B, Qwen3-1.7B, Qwen3-30B-A3B, and Qwen3-235B-A22B) based on whether their first step is correct or incorrect. Accuracy is measured in percentage (%). Each model has two bars representing "Correct first step" and "Incorrect first step".
### Components/Axes
* **X-axis:** Models - DS-R1-1.5B, DS-R1-32B, Qwen3-1.7B, Qwen3-30B-A3B, Qwen3-235B-A22B
* **Y-axis:** Accuracy (%) - Scale ranges from 0 to 100, with increments of 10.
* **Legend:**
* Blue (hashed pattern): Correct first step
* Orange: Incorrect first step
### Detailed Analysis
The chart consists of five groups of two bars, one blue and one orange, for each model.
* **DS-R1-1.5B:**
* Correct first step: Approximately 92.7%
* Incorrect first step: Approximately 31.7%
* **DS-R1-32B:**
* Correct first step: Approximately 90.2%
* Incorrect first step: Approximately 46.0%
* **Qwen3-1.7B:**
* Correct first step: Approximately 95.2%
* Incorrect first step: Approximately 52.3%
* **Qwen3-30B-A3B:**
* Correct first step: Approximately 91.0%
* Incorrect first step: Approximately 73.0%
* **Qwen3-235B-A22B:**
* Correct first step: Approximately 89.9%
* Incorrect first step: Approximately 79.0%
For each model, the blue bar (Correct first step) is significantly higher than the orange bar (Incorrect first step). The blue bars generally range between 89.9% and 95.2%, while the orange bars range between 31.7% and 79.0%.
### Key Observations
* Qwen3-1.7B exhibits the highest accuracy for the "Correct first step" (95.2%).
* DS-R1-1.5B exhibits the lowest accuracy for the "Incorrect first step" (31.7%).
* Qwen3-30B-A3B and Qwen3-235B-A22B retain the highest accuracy when the first step is incorrect (73.0% and 79.0%, respectively).
* There is a clear and consistent trend: all models perform significantly better when the first step is correct.
### Interpretation
The data demonstrates a strong correlation between the correctness of the first step and the overall accuracy of the models. Models with a higher percentage of correct first steps also exhibit higher overall accuracy. This suggests that the initial step is crucial for the success of these models. The differences in accuracy between models, particularly in the "Incorrect first step" category, could be due to variations in model architecture, training data, or other factors. The larger models (Qwen3-30B-A3B and Qwen3-235B-A22B) retain substantially higher accuracy after an incorrect first step, suggesting that increased model size improves the ability to recover from a poor start. The significant gap between the blue and orange bars for each model highlights the importance of ensuring the first step is correct to maximize accuracy.
</details>
Figure D.2: More results from Qwen and DeepSeek models of different sizes. We report the relationship between the accuracy of the final prediction ( $q_{T}$ ) and the correctness of the first reasoning step ( $q_{1}$ ). If $q_{1}$ is incorrect, $q_{T}$ is more likely to be incorrect.
Appendix E Prompt Examples
The template for using the DeepSeek-R1 model is provided below. We apply the same structure to other open-source thinking models, with slight adjustments for model-specific input and output formats.
Prompt example for DeepSeek-R1: <|begin_of_sentence|>Please reason step by step, and put your final answer within \boxed{}. <|User|>[Question]<|Assistant|><think>
Appendix F Example of an Incorrect First Step
Figure F.1 illustrates a case where the initial reasoning step is incorrect.
Example of incorrect first step
Question: Four unit squares form a $2\times 2$ grid. Each of the 12 unit line segments forming the sides of the squares is colored either red or blue in such a way that each unit square has 2 red sides and 2 blue sides. One example is shown below (red is solid, blue is dashed). Find the number of such colorings. [asy] size(4cm); defaultpen(linewidth(1.2)); draw((0, 0) -- (2, 0) -- (2, 1)); draw((0, 1) -- (1, 1) -- (1, 2) -- (2,2)); draw((0, 0) -- (0, 1), dotted); draw((1, 0) -- (1, 1) -- (2, 1) -- (2, 2), dotted); draw((0, 1) -- (0, 2) -- (1, 2), dotted); [/asy] First reason: Alright, so I have this problem about coloring the sides of a 2x2 grid of squares. Each square has four sides, and each of these sides is colored either red or blue. The condition is that each individual square must have exactly two red sides and two blue sides. The question is asking for the number of such colorings. First, let me visualize the grid. It's a 2x2 grid, so there are four squares arranged in a larger square. Each square has four sides, and each side is shared with another square except for the outermost sides. But in the problem, it says there are 12 unit line segments forming the sides of the squares. Wait, let me check that: a 2x2 grid has 3 horizontal lines and 3 vertical lines, each of length 2 units. So, each horizontal line is divided into two segments, right? So, the total number of horizontal segments is 3 rows times 2 segments per row, which is 6, and similarly, the vertical segments are also 3 columns times 2 segments per column, which is another 6. So, 6 + 6 = 12 segments in total. Okay, that makes sense. Each of these 12 segments can be colored either red or blue. However, the constraint is that each of the four squares must have exactly two red sides and two blue sides. So, even though each segment can be colored independently, we have these local constraints for each square.
I think a good way to approach this problem is by considering each square and the possible colorings for its sides, then seeing how they can fit together without violating the constraints. However, since the coloring of one square's side affects the adjacent squares, this could get complicated. Maybe I can model this as a graph where each segment is a node, and the squares impose constraints on the nodes (i.e., exactly two red and two blue edges around each square). But I'm not too familiar with graph theory, so maybe that's not the best approach. Mistakes in first reasoning step: Mistake 1 incorrectly assumes that in a $2\times 2$ grid, each corner square has two shared and two unshared sides, without properly accounting for edge distribution. Mistake 2 introduces unnecessary abstraction by treating segments as "nodes," which obscures the natural structure: the segments themselves are the edges whose colours we choose, and each square simply places a rule on its four incident edges. Working directly with those four constraints is far cleaner than introducing an extra level of abstraction.
Figure F.1: Example of incorrect first step.
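The counting in Figure F.1's first step (6 horizontal + 6 vertical = 12 segments) can be checked exhaustively: with only $2^{12}=4096$ colorings, a brute-force scan over all assignments is trivial. The sketch below (our own check, not part of the paper) uses one bit per segment and tests the two-red-sides constraint on each of the four unit squares.

```python
from itertools import product

def count_colorings() -> int:
    """Count red/blue colorings of the 12 unit segments of a 2x2 grid
    such that every unit square has exactly 2 red sides (red = 1)."""
    total = 0
    for bits in product((0, 1), repeat=12):
        # h[y][x]: horizontal segment at height y (0..2), column x (0..1)
        # v[x][y]: vertical segment at column x (0..2), row y (0..1)
        h = [bits[0:2], bits[2:4], bits[4:6]]
        v = [bits[6:8], bits[8:10], bits[10:12]]
        # Square (i, j): bottom h[i][j], top h[i+1][j],
        #                left v[j][i], right v[j+1][i]
        if all(h[i][j] + h[i + 1][j] + v[j][i] + v[j + 1][i] == 2
               for i in range(2) for j in range(2)):
            total += 1
    return total
```

Running `count_colorings()` yields 82, which matches a transfer-style hand count over the four shared (interior-adjacent) edges.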
Appendix G Examples for Minor Perturbation
Original final conclusion
To solve the problem, we are given triangle $ABC$ with side lengths $AB=5$ , $BC=9$ , and $AC=10$ . […] Step 4: Compute $AP$ . Since $P$ lies on line segment $AD$ , and $D$ lies outside the circle, the point $P$ lies between $A$ and $D$ . Therefore: $AP=DA-DP=\frac{325}{22}-\frac{2025}{286}.$ Find a common denominator: $\frac{325}{22}=\frac{4225}{286}\Rightarrow AP=\frac{4225-2025}{286}=\frac{2200}{286}=\frac{100}{13}.$ Final Step: Compute $m+n$ . The reduced form of $AP$ is $\frac{100}{13}$ , so: $m=100,\quad n=13\Rightarrow m+n={\color[rgb]{1,.5,0}\boxed{113}}.$ Final Answer: \boxed{113}
New correct first reason
To solve the problem, we are given triangle $ABC$ with side lengths $AB=5$ , $BC=9$ , and $AC=10$ . […] Step 4: Compute $AP$ . Since $P$ lies on line segment $AD$ , and $D$ lies outside the circle, the point $P$ lies between $A$ and $D$ . Therefore: $AP=DA-DP=\frac{325}{22}-\frac{2025}{286}.$ Find a common denominator: $\frac{325}{22}=\frac{4225}{286}\Rightarrow AP=\frac{4225-2025}{286}=\frac{2200}{286}=\frac{100}{13}.$ Final Step: Compute $m+n$ . The reduced form of $AP$ is $\frac{100}{13}$ , so: $m=100,\quad n=13\Rightarrow m+n={\color[rgb]{1,.5,0}113}.$
New incorrect first reason
To solve the problem, we are given triangle $ABC$ with side lengths $AB=5$ , $BC=9$ , and $AC=10$ . […] Step 4: Compute $AP$ . Since $P$ lies on line segment $AD$ , and $D$ lies outside the circle, the point $P$ lies between $A$ and $D$ . Therefore: $AP=DA-DP=\frac{325}{22}-\frac{2025}{286}.$ Find a common denominator: $\frac{325}{22}=\frac{4225}{286}\Rightarrow AP=\frac{4225-2025}{286}=\frac{2200}{286}=\frac{100}{13}.$ Final Step: Compute $m+n$ . The reduced form of $AP$ is $\frac{100}{13}$ , so: $m=100,\quad n=13\Rightarrow m+n={\color[rgb]{1,.5,0}103}.$
Figure G.1: Illustration of a targeted perturbation to create correct and incorrect first reasoning steps, as discussed in § 3.3. The modified context is highlighted in orange.
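The perturbation illustrated in Figure G.1 amounts to swapping the final numeric answer while leaving the rest of the conclusion untouched (the incorrect variant also drops the `\boxed` wrapper). A minimal regex-based sketch of the numeral swap; the helper name is hypothetical and this only replaces the last integer in the string:

```python
import re

def perturb_final_answer(conclusion: str, new_answer: str) -> str:
    """Replace the last integer in a conclusion string, e.g. to turn a
    correct first reasoning step into an incorrect one (113 -> 103)."""
    matches = list(re.finditer(r"\d+", conclusion))
    if not matches:
        return conclusion  # nothing to perturb
    last = matches[-1]
    return conclusion[:last.start()] + new_answer + conclusion[last.end():]

original = "The reduced form of AP is 100/13, so m + n = 113."
perturbed = perturb_final_answer(original, "103")
# perturbed == "The reduced form of AP is 100/13, so m + n = 103."
```

Editing only the final answer keeps the surface form and length of the reasoning step essentially identical, so any downstream effect can be attributed to the answer's correctness rather than to stylistic changes.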