# The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis
Abstract
Test-time scaling via explicit reasoning trajectories significantly boosts large language model (LLM) performance but often triggers overthinking. To explore this, we analyze reasoning through two lenses: Reasoning Length Dynamics, which reveals a compensatory trade-off between thinking and answer content length that eventually leads to thinking redundancy, and Reasoning Semantic Dynamics, which identifies semantic convergence and repetitive oscillations. These dynamics uncover an instance-specific Reasoning Completion Point (RCP), beyond which computation continues without further performance gain. Since the RCP varies across instances, we propose a Reasoning Completion Point Detector (RCPD), an inference-time early-exit method that identifies the RCP by monitoring the rank dynamics of termination tokens (e.g., </think>). Across AIME and GPQA benchmarks using Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% while preserving accuracy, offering a principled approach to efficient test-time scaling.
Zihao Wei 1,2, Liang Pang 1,$\dagger$, Jiahao Liu, Wenjie Shi, Jingcheng Deng 1,2, Shicheng Xu 1,2, Zenghao Duan 1,2, Fei Sun 1, Huawei Shen 1, Xueqi Cheng 1
1 State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
{weizihao22z, pangliang}@ict.ac.cn
1 Introduction
Test-time scaling via explicit reasoning trajectories has significantly enhanced the problem-solving capabilities of Large Language Models (LLMs) DeepSeek-AI et al. (2025); Yang et al. (2025a); Team et al. (2025). While longer trajectories are generally associated with improved performance, recent studies reveal that this relationship is not strictly linear. Instead, excessive reasoning often leads to diminishing returns or even performance degradation, a phenomenon widely referred to as overthinking Ghosal et al. (2025); Han et al. (2025).
Figure 1: Overview of Reasoning Dynamics and RCP. The top panel summarizes Reasoning Length Dynamics, where content length decreases as thinking length grows under a thinking-content compensation regime until reaching the RCP. The bottom panel summarizes Reasoning Semantic Dynamics, where the latent semantic trajectory transitions from broad exploration to a stable neighborhood with repetitive oscillations, with the onset of convergence aligning with the RCP. The top and bottom panels are defined in § 2 and § 3, respectively.
This non-monotonic scaling behavior Wu et al. (2025b) motivates a fundamental question: how does each step of the reasoning process incrementally influence the final outcome? Despite its importance, such a fine-grained analysis of the intermediate reasoning steps remains insufficiently explored. By quantifying the functional impact of each step on the final answer, we conduct a study of reasoning dynamics to uncover the internal mechanisms of overthinking and identify the regime where further reasoning tokens become redundant. To this end, we analyze the reasoning dynamics of LLMs from two complementary perspectives.
First, we explicitly characterize the inference as the interaction between the thinking process and the content process. Motivated by recent findings on the non-linear relationship between reasoning length and accuracy Su et al. (2025); Wu et al. (2025a), we investigate their Reasoning Length Dynamics and uncover a phenomenon we term thinking-content compensation, as shown in Figure 1. Initially, the thinking process and content generation exhibit a compensatory relationship: as thinking trajectories expand, the subsequent content becomes increasingly concise. However, this dynamic eventually transitions into a saturation phase. In this regime, although the model has acquired sufficient information to formulate a correct answer, the thinking process continues to lengthen while the content remains unchanged, leading to redundancy.
Second, to uncover the underlying mechanism driving these surface-level length patterns, we further analyze the trajectory of high-dimensional representations in the semantic space Li et al. (2025). In terms of Reasoning Semantic Dynamics, we identify semantic path convergence, as shown in Figure 1. We observe that these representations initially exhibit significant fluctuations, reflecting active exploration of the solution space. Subsequently, the inference path enters a convergence regime where the semantic representations converge toward a stable neighborhood, exhibiting repetitive oscillations within that localized space. This suggests that the solution hypothesis has stabilized.
Bridging these two dynamics suggests a structural explanation for overthinking: it arises when the model continues to generate reasoning tokens after the process has converged. This motivates a principled boundary between necessary reasoning and redundant computation after convergence. We call this boundary the Reasoning Completion Point (RCP). At the RCP, thinking-content compensation ends and the semantic trajectory begins to converge. This boundary is latent and instance-dependent, so mitigating overthinking requires detecting the RCP online during decoding rather than relying on a fixed reasoning length budget. Based on this formalization, we propose a Reasoning Completion Point Detector (RCPD), which monitors the rank of the special token designed to terminate the thinking process (e.g., </think>) to detect the RCP during decoding without additional computational overhead. Detecting the RCP enables early truncation of redundant reasoning, reducing token consumption while maintaining accuracy.
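The rank signal that RCPD monitors is cheap to read off the decoder's next-token logits at every step. The following is a minimal sketch with synthetic logits; the vocabulary size, the token id assigned to </think>, and the logit values are all illustrative, not taken from any real model.

```python
import numpy as np

def termination_rank(logits: np.ndarray, end_think_id: int) -> int:
    """Rank of the end-of-thinking token among next-token logits (1 = top)."""
    order = np.argsort(-logits)  # token ids sorted by descending logit
    return int(np.where(order == end_think_id)[0][0]) + 1

# Synthetic decoding step over a toy vocabulary of 32 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=32)
logits[3] = 9.0  # a content token currently dominates
logits[7] = 8.0  # the termination token (</think>, id 7 here) is close behind
rank = termination_rank(logits, end_think_id=7)
print(rank)  # -> 2
```

A detector can then watch this rank fall toward 1 as decoding proceeds, without any extra forward passes.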
2 Reasoning Length Dynamics
Many reasoning-specialized LLMs expose an intermediate thinking process delimited by special tokens (e.g., <think> and </think>) and then transition to a content process that produces the final answer DeepSeek-AI et al. (2025); Yang et al. (2025a). This section asks a simple but diagnostic question: as we allocate more steps to the thinking process, how do the resulting content length and answer correctness change? We study this via an intervention that forces the model to stop thinking at controlled steps and immediately begin generating content.
2.1 Dynamics Setup: Stepwise Truncation
We formalize LLM generation as a sequential coupling of two distinct phases: a thinking process $t$ and a content process $y$ . Even when the raw trajectory $t$ is withheld in proprietary models due to safety or competitive considerations Jaech et al. (2024); Google DeepMind (2025), it remains the indispensable structural precursor that conditions the final output. Formally, the generation follows the joint distribution $P(y,t|x)=P(t|x)P(y|x,t)$ , where $x$ denotes the input prompt.
To analyze the reasoning process at a granular level, we decompose both the thinking trajectory $t$ and the content trajectory $y$ into discrete reasoning steps $s$ . For a trajectory consisting of raw tokens, we segment it into steps, $t=(s_{1},...,s_{K})$ , where each step $s_{k}$ represents a complete sentence segmented via NLTK (Bird and Loper, 2004). This mapping allows us to intervene on the thinking budget by defining the thinking prefix up to step $k$ as $t_{1:k}=(s_{1},...,s_{k})$ .
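The segmentation above can be sketched as follows. The paper uses NLTK's sentence tokenizer; this sketch substitutes a simple regex splitter so it runs without NLTK's data files, and the example trajectory text is invented for illustration.

```python
import re

def split_steps(trajectory: str) -> list[str]:
    """Segment a thinking trajectory t into steps (s_1, ..., s_K).

    Stand-in for NLTK's sent_tokenize: split after sentence-final
    punctuation followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", trajectory.strip())
    return [p for p in parts if p]

thinking = ("Let x be the unknown. Then 2x + 3 = 11. "
            "So x = 4. Check: 2*4 + 3 = 11, which matches.")
steps = split_steps(thinking)  # t = (s_1, ..., s_K)
K = len(steps)                 # K = 4 steps for this toy trajectory
```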
To explore how each step incrementally influences the outcome, we implement a stepwise truncation protocol. For each $k∈\{1,...,K\}$ , we truncate the thinking trajectory immediately after $s_{k}$ and force-inject the end-of-thinking delimiter (e.g., </think>). This injected delimiter serves as an explicit control signal that terminates the internal thinking process and triggers the transition to content generation:
$$
y_{k}\sim P(y\mid x,t_{1:k}). \tag{1}
$$
Under this protocol, we monitor two core length measures quantified by reasoning steps: (1) Thinking Length ( $L_{t}$ ): the cumulative number of allowed reasoning steps in the thinking phase, $L_{t}=k$ ; (2) Content Length ( $L_{c}$ ): the number of steps in the content process $y_{k}=(s^{y}_{1},...,s^{y}_{L_{c}})$ , where $L_{c}$ is the count of steps generated in response to the truncated thinking prefix $t_{1:k}$ .
By monitoring the evolution of $L_{t}$ , $L_{c}$ , and the corresponding answer correctness across $k$ , we implement a dynamical analysis of the reasoning process and its transition toward redundancy.
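The stepwise truncation protocol can be sketched as below. `generate` is a stub standing in for a real LLM call, so its output and the prompt format are illustrative only; a real implementation would sample $y_k \sim P(y \mid x, t_{1:k})$ from the model after force-injecting the delimiter.

```python
END_THINK = "</think>"  # end-of-thinking delimiter

def generate(prompt: str) -> str:
    # Stub: a real model would sample the content continuation here.
    return "The answer is 4."

def truncate_and_answer(x: str, steps: list[str], k: int) -> str:
    """Stop thinking after step k and force the transition to content."""
    t_prefix = " ".join(steps[:k])  # thinking prefix t_{1:k}
    return generate(f"{x}\n<think>{t_prefix}{END_THINK}")

steps = ["Try small cases.", "A pattern emerges.", "Verify the formula."]
lengths = []
for k in range(1, len(steps) + 1):
    y_k = truncate_and_answer("Solve the puzzle.", steps, k)
    lengths.append((k, len(y_k.split())))  # (L_t, crude proxy for L_c)
```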
(a) AIME24 Question 19 Length Dynamics
(b) AIME25 Question 17 Length Dynamics
Figure 2: Two-stage reasoning dynamics separated by RCP: the Pre-RCP Active Reasoning Stage and the Post-RCP Converged Reasoning Stage. The vertical dashed line indicates the RCP boundary. Additional examples are provided in Appendix Figure 7.
2.2 Observations
Figure 2 provides a representative visualization of thinking-content compensation and the eventual stabilization pattern.
Observation 1: Reasoning Compensation
In the early stage, shorter thinking forces the model to offload more reasoning into the content to maintain performance, resulting in increased content length.
In the early region of Figure 2, this effect appears directly as a separation between correct and incorrect samples. In Figure 2(a) around $k \approx 50$ , many incorrect answers cluster at very short content, whereas correct answers at the same $k$ often require substantially longer content. Figure 2(b) exhibits the same pattern: in the early region, incorrect answers concentrate near the bottom band, while correct answers extend to much larger content steps. These examples show that when the thinking budget is insufficient, maintaining correctness frequently requires expanding the content, and failures often manifest as terse outputs.
Observation 2: Thinking Redundancy
Once thinking is sufficient, the content contracts to a concise form and accuracy stabilizes; any further reasoning is ineffective and manifests solely as an expansion of the thinking phase.
In the late region of Figure 2, content length collapses into a narrow band and stabilizes after the dashed line, remaining constant even as $k$ increases by over 100 steps (Figures 2(a), 2(b)). This confirms that additional thinking budget in this stage no longer alters content volume but primarily accumulates as redundancy.
3 Reasoning Semantic Dynamics
While Section 2 establishes a macroscopic trade-off between thinking and content length, these metrics are mere proxies that do not distinguish between active hypothesis refinement and futile repetition. To uncover the mechanistic trigger of overthinking, we shift from surface-level length statistics to fine-grained semantic-level evolution, namely Reasoning Semantic Dynamics.
Following the conceptualization of reasoning as a trajectory in latent semantic space Cho et al. (2025), we probe the model’s internal evolution through its intermediate states. Since thinking tokens serve as intermediate computational steps, the most direct manifestation of the model’s current state is the answer it produces when forced to terminate. By tracking the semantic trajectory across truncation steps, we can trace the stabilization of the model’s emerging conclusion and map its search path.
Consequently, this section investigates a pivotal question: as the thinking process extends, how does the induced content evolve semantically, and when does it reach a state of terminal convergence?
3.1 Dynamics Setup: Semantic Lens
To explore the internal evolution of thoughts, we apply a semantic lens to the induced content distribution generated via the stepwise truncation protocol. For a fixed thinking prefix $t_{1:k}$ , the semantics of a single continuation can be highly sensitive to decoding stochasticity. To mitigate this instability and obtain a robust representation of the model's hypothesis at each step $k$ , we independently sample $M$ continuations, denoted $\{y_{k}^{(i)}\}_{i=1}^{M}$ , following Eq. 1.
Each continuation is mapped to a latent space using the Qwen3 Embedding model Zhang et al. (2025), yielding vectors $v_{k}^{(i)}=\mathcal{E}(y_{k}^{(i)})∈\mathbb{R}^{d}$ . To visualize these dynamics in a consistent coordinate system, we perform Principal Component Analysis (PCA) on the pooled set of embeddings over all truncation steps $k$ and samples $i$ . We project each embedding to a 2D point $\tilde{v}_{k}^{(i)}∈\mathbb{R}^{2}$ , and the resulting semantic trajectory is traced by the per-step centroid:
$$
\tilde{\mu}_{k}=\frac{1}{M}\sum_{i=1}^{M}\tilde{v}_{k}^{(i)}. \tag{2}
$$
The evolution of $\tilde{\mu}_{k}$ captures how the content semantics mature as $L_{t}$ increases. We use the 2D view for interpretability.
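The projection and centroid computation can be sketched as follows. PCA is implemented directly via SVD on centered data (a stand-in for a library routine), and random vectors stand in for the Qwen3 Embedding outputs; the dimensions are illustrative.

```python
import numpy as np

def pca_project(X: np.ndarray, dim: int = 2) -> np.ndarray:
    """Project rows of X onto the top `dim` principal components."""
    Xc = X - X.mean(axis=0)                    # center the pooled embeddings
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:dim].T

# Toy setup: K truncation steps, M sampled continuations per step,
# d-dimensional vectors standing in for the embedding model's output.
K, M, d = 5, 8, 16
rng = np.random.default_rng(0)
V = rng.normal(size=(K, M, d))                 # v_k^{(i)}
V2 = pca_project(V.reshape(K * M, d)).reshape(K, M, 2)
mu_tilde = V2.mean(axis=1)                     # per-step centroids (Eq. 2)
```

Pooling embeddings over all $k$ and $i$ before fitting PCA is what keeps every step's projection in one consistent coordinate system.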
Figure 3: Semantic trajectory showing the transition from Pre-RCP Active Exploration to Post-RCP Reasoning Convergence. The dashed line indicates the RCP boundary. Additional examples are provided in Appendix Figure 8.
Figure 4: Semantic convergence residual over thinking steps. $\mathcal{D}_{\text{global}}(k)$ declines and then approaches a low plateau. The vertical dashed line indicates the RCP boundary; the inset zooms into the late-step region for readability.
High-Dimensional Convergence Metric.
The 2D PCA projection is only for visualization. For quantitative analysis, we operationalize convergence via a semantic proxy distribution in a higher-dimensional PCA space that retains substantially more information than the 2D view. We project embeddings to an $r$ -dimensional PCA space ( $r\gg 2$ ), obtain $z_{k}^{(i)}∈\mathbb{R}^{r}$ , and approximate the induced distribution by a Gaussian $Q_{k}=\mathcal{N}(\boldsymbol{\mu}_{k},\Sigma_{k})$ :
$$
\boldsymbol{\mu}_{k}=\frac{1}{M}\sum_{i=1}^{M}z_{k}^{(i)},\qquad\Sigma_{k}=\mathrm{Cov}\!\left(\{z_{k}^{(i)}\}_{i=1}^{M}\right)+\lambda I, \tag{3}
$$
where $\lambda I$ is a small ridge term for numerical stability. We estimate a terminal reference $Q_{∞}$ by pooling samples from a short tail window of the final truncation steps and computing its mean and covariance. We then measure the distance to the converged semantic regime by the global convergence residual
$$
\mathcal{D}_{\text{global}}(k)\triangleq D_{\text{KL}}(Q_{k}\parallel Q_{\infty}), \tag{4}
$$
which will later serve as a key ingredient for formalizing convergence.
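Since both $Q_{k}$ and $Q_{\infty}$ are Gaussian, the KL divergence in Eq. 4 has a closed form. The sketch below computes it with numpy; the sample counts, dimensionality, and ridge value are illustrative.

```python
import numpy as np

def gaussian_kl(mu_k, cov_k, mu_inf, cov_inf):
    """Closed-form D_KL(N(mu_k, cov_k) || N(mu_inf, cov_inf)) (Eq. 4)."""
    r = mu_k.shape[0]
    inv_inf = np.linalg.inv(cov_inf)
    diff = mu_inf - mu_k
    _, logdet_k = np.linalg.slogdet(cov_k)
    _, logdet_inf = np.linalg.slogdet(cov_inf)
    return 0.5 * (np.trace(inv_inf @ cov_k)
                  + diff @ inv_inf @ diff - r + logdet_inf - logdet_k)

def fit_gaussian(z, lam=1e-3):
    """Mean and ridge-regularized covariance of step samples (Eq. 3)."""
    return z.mean(axis=0), np.cov(z, rowvar=False) + lam * np.eye(z.shape[1])

rng = np.random.default_rng(0)
z_k = rng.normal(size=(32, 4))    # z_k^{(i)} in the r-dim PCA space
z_inf = rng.normal(size=(64, 4))  # pooled tail-window samples for Q_inf
d_global = gaussian_kl(*fit_gaussian(z_k), *fit_gaussian(z_inf))
```

The ridge term $\lambda I$ keeps both covariances invertible even when $M$ is small relative to $r$.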
3.2 Observations
Figure 3 provides a representative semantic view of the length-level dynamics in § 2; we refer to the resulting transition as semantic path convergence.
Observation 3: Semantic Exploration
During the initial phase, the semantics within the reasoning process undergo rapid shifts, indicating that the model is actively exploring feasible solution paths.
In the left region of Figure 3, the projected induced samples are sparsely scattered, with large gaps between neighboring points, and the per-step mean $\tilde{\mu}_{k}$ makes large jumps across truncation steps, suggesting that the model is still actively revising its semantic hypothesis. Figure 4 then quantifies this instability: $\mathcal{D}_{\text{global}}(k)$ varies widely in the early stage, indicating that the induced content distribution can shift markedly between adjacent truncation steps. This semantic instability also explains the length-level outcome bifurcation in Figure 2: at the same moderate $k$ , some runs compensate by generating longer content and reach the correct hypothesis, while others collapse to short, incorrect outputs.
Observation 4: Reasoning Convergence
As reasoning progresses, the semantics gradually converge toward a specific neighborhood, exhibiting repetitive oscillations within that localized space.
The right region in Figure 3 makes the convergence picture explicit: once the trajectory crosses the dashed line, the induced samples concentrate inside the 95% attraction basin around the convergence center, and $\tilde{\mu}_{k}$ largely stays within that localized area. To make this stable-neighborhood claim testable, Appendix D fits a 95% confidence ellipse in the 2D PCA plane and finds that 97.3% of post-transition points remain inside the basin.
Figure 4 provides the corresponding high-dimensional signal: after the turning point marked by the dashed line, $\mathcal{D}_{\text{global}}(k)$ approaches a low plateau, indicating that the induced content distribution is already close to its terminal state. Notably, this non-zero floor reflects an intrinsic entropy gap between the instantaneous state $Q_{k}$ and the aggregate convergence basin $Q_{∞}$ , suggesting a dynamic equilibrium within the basin rather than collapse to a single point. This also aligns with the right region in Figure 2, where thinking continues to grow while the content length stays almost unchanged. In this regime, extra steps mainly accumulate as redundant thinking and rarely alter the final answer.
These two observations together explain why more thinking can stop helping: once the content semantics have converged, additional thinking steps occur after the induced content distribution has effectively stabilized, making overthinking largely redundant.
Together with the length lens in § 2, the semantic trajectory suggests a shared transition from an early stage of active semantic change to a late regime of semantic stationarity.
4 Reasoning Completion Point
The empirical observations in previous sections suggest that LLM reasoning is not a monolithic process but a bipartite evolution. By synthesizing thinking-content compensation and semantic path convergence, we identify a fundamental two-stage transition in the model’s reasoning: Pre-RCP: Active Reasoning and Post-RCP: Converged Reasoning.
To formalize these stages, we first establish two quantitative metrics to capture the stability of the reasoning output. Consistent with the observations in § 2, as thinking length increases, the compensation effect eventually saturates. Content length $L_{c}$ ceases to change materially, and the induced content stabilizes. We formalize this step-to-step content change as:
$$
\Delta_{\text{content}}(k)\triangleq\left|L_{c}(k)-L_{c}(k-1)\right|, \tag{5}
$$
where $L_{c}(k)$ is measured at truncation step $k$ following § 2.1. Content stabilization occurs when $\Delta_{\text{content}}(k)$ drops below a small threshold $\epsilon_{c}$ and remains near zero thereafter.
Simultaneously, building upon the semantic path analysis in § 3, the induced semantics transition from high-drift exploration to a stable neighborhood. We quantify this through the global convergence residual $\mathcal{D}_{\text{global}}(k)$, which stabilizes at a low plateau as thinking progresses. We operationalize the entry into this convergence region as $\mathcal{D}_{\text{global}}(k)\leq\epsilon_{D}$, where $\epsilon_{D}$ is set by the empirical tail fluctuations used to estimate $Q_{\infty}$.
Based on these metrics, we categorize the reasoning process into two distinct stages:
Pre-RCP: Active Reasoning Stage.
$$
\mathcal{D}_{\text{global}}(k)>\epsilon_{D}\quad\lor\quad\Delta_{\text{content}}(k)>\epsilon_{c}. \tag{6}
$$
The model is in a state of high-drift exploration; the semantic path continues to evolve, and the induced content remains volatile, frequently expanding to compensate for insufficient thinking. Computation in this stage is typically essential for answer maturation.
Post-RCP: Converged Reasoning Stage.
$$
\mathcal{D}_{\text{global}}(k)\leq\epsilon_{D}\quad\land\quad\Delta_{\text{content}}(k)\leq\epsilon_{c}. \tag{7}
$$
The semantic state stabilizes within a converged neighborhood, and the content length reaches a plateau. Additional compute beyond this point yields diminishing returns, often manifesting as redundant rationalization or "overthinking."
The critical boundary separating these two regimes is the Reasoning Completion Point (RCP). Formally, we define the RCP as the earliest truncation step $k_{\text{RCP}}$ where both length stabilization and semantic convergence are simultaneously achieved:
$$
k_{\text{RCP}}\triangleq\min\left\{k:\begin{aligned} &\Delta_{\text{content}}(k)\leq\epsilon_{c}\\
\land\ &\mathcal{D}_{\text{global}}(k)\leq\epsilon_{D}\end{aligned}\right\}. \tag{8}
$$
The thresholds $\epsilon_{c}$ and $\epsilon_{D}$ are empirically determined by the tail fluctuations in the late-step window. Under this unified formalization, the RCP serves as an instance-specific "stopping criterion" that respects the intrinsic complexity of the prompt. Beyond $k_{\text{RCP}}$ , additional thinking is largely redundant as it no longer materially alters the induced content’s length or conceptual essence. Consequently, mitigating overthinking reduces to an online detection problem: identifying the transition $k_{\text{RCP}}$ during decoding to terminate the thinking phase at the precise moment of reasoning maturity.
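Given per-step traces of $L_{c}(k)$ and $\mathcal{D}_{\text{global}}(k)$, Eq. 8 reduces to a first-hit search; a minimal sketch, with the function name assumed:

```python
def find_rcp(content_lengths, d_global, eps_c, eps_D):
    """Eq. 8: earliest truncation step k at which both the length condition
    |L_c(k) - L_c(k-1)| <= eps_c and the semantic condition
    D_global(k) <= eps_D hold simultaneously."""
    for k in range(1, len(content_lengths)):
        if (abs(content_lengths[k] - content_lengths[k - 1]) <= eps_c
                and d_global[k] <= eps_D):
            return k
    return None  # the trajectory never left the Pre-RCP regime
```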
<details>
<summary>2508.17627v2/x6.png Details</summary>

### Visual Description
Two stacked panels over Thinking Step k (0–200), with a vertical marker at the Pre-RCP/Post-RCP transition around step 100. Top panel: Generation Accuracy (red) dips early, rises sharply between steps 50 and 100, and stabilizes near 1.0; Answer Emergence (purple dashed) spikes around step 50 and stabilizes near 0.95. Bottom panel: the rank of the </think> token (green) starts near 1000, fluctuates, and drops toward 0 after the transition.
</details>
Figure 5: Top panel: Accuracy stabilizes around answer emergence. Bottom panel: The rank of </think> ( $R_{k}$ ) drops precipitously at answer emergence; this drop serves as a signature of convergence.
5 Online Overthinking Elimination
The key to eliminating overthinking lies in precise detection of the RCP. This section follows a "distill-to-detect" pipeline: we first leverage our reasoning-dynamics findings (§ 2 and § 3) to extract gold RCP labels from full reasoning trajectories; to eliminate overthinking in practice, these insights must then be translated into an online detector capable of real-time monitoring during inference. Specifically, we use the extracted offline data to train a tree-based model that identifies the latent transition into convergence. The resulting rules enable rapid RCP detection during standard decoding, effectively truncating redundant computation without sacrificing accuracy.
5.1 Offline RCP Extraction
In the “distill” stage of our pipeline, we start from the formal RCP criterion in Eq. 8 and extract instance-level annotations from full thinking trajectories. For each instance, we generate a complete reasoning trajectory and perform controlled truncation over candidate sentence boundaries to identify the earliest boundary at which reasoning has effectively converged, denoted $k_{\text{RCP}}$. This procedure is inherently offline, as it depends on access to complete reasoning trajectories and auxiliary semantic probes that are unavailable during standard decoding. Accordingly, we treat the resulting $k_{\text{RCP}}$ annotations as gold supervision.
A key empirical observation from these labeled traces is that $k_{\text{RCP}}$ typically coincides with the first emergence of the final answer in the reasoning trajectories. While such answer emergence marks the onset of the Post-RCP convergence regime, it is not a reliable target for online detection. Answer surface forms vary widely across instances, including numbers, units, natural language, and multiple-choice letters, and may be rephrased, rendering string-based criteria brittle. To replace this unobservable event with an intrinsic proxy, we examine the model’s next-token distributions around $k_{\text{RCP}}$ and find a consistent transition in the rank $R_{k}$ of the thinking-termination delimiter </think>. Specifically, $R_{k}$ drops sharply as the answer first emerges, reflecting an increasing preference to terminate thinking. This delimiter-rank dynamic is internal to the model and largely invariant to answer format, making it the key distilled feature used for online RCP detection.
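The rank $R_{k}$ is directly computable from the next-token logits at a sentence boundary; a minimal sketch (the `end_think_id` index of the </think> token is an assumed input):

```python
import numpy as np

def delimiter_rank(logits, end_think_id):
    """Rank R_k of the thinking-termination token in the next-token
    distribution (rank 0 = most likely token): the number of tokens
    whose logit strictly exceeds the delimiter's logit."""
    return int(np.sum(logits > logits[end_think_id]))
```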
5.2 Online RCP Detection
Offline extraction reveals an online-accessible signature of convergence. When the final answer first emerges, the rank $R_{k}$ of the thinking-termination token </think> drops sharply (Figure 5). This motivates a practical reformulation of online RCP detection: given the rank observations available up to the current sentence boundary, decide whether decoding has reached the Reasoning Completion Point and can stop.
We instantiate this idea as a lightweight Reasoning Completion Point Detector (RCPD). At each sentence boundary, RCPD queries the next-token distribution and extracts the current $R_{k}$ together with a short history of preceding rank values. Using offline-extracted $k_{\text{RCP}}$ as supervision on Math-500 Lightman et al. (2023), we label boundaries and fit an interpretable decision tree whose splits capture the implicit stopping criterion in terms of rank dynamics. We then distill the tree into deterministic stopping rules (exact thresholds and window sizes in Appendix C).
During inference, we apply the same rules online; once triggered, we force </think> to terminate the thinking phase and immediately transition to answer generation. This completes the distill-to-detect pipeline and yields a practical stopping strategy for eliminating Post-RCP overthinking.
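A minimal sketch of such an online stopping rule; the single rank threshold and window here are illustrative stand-ins for the distilled decision-tree rules of Appendix C:

```python
from collections import deque

class RCPDetector:
    """Trigger termination once the </think> rank stays low for
    `window` consecutive sentence boundaries (illustrative rule)."""
    def __init__(self, rank_threshold=10, window=3):
        self.rank_threshold = rank_threshold
        self.history = deque(maxlen=window)

    def observe(self, rank_k):
        """Feed the current rank R_k; returns True when the stopping rule
        fires (i.e., force </think> and begin answer generation)."""
        self.history.append(rank_k)
        return (len(self.history) == self.history.maxlen
                and all(r <= self.rank_threshold for r in self.history))
```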
5.3 Experimental Evaluation
We evaluate on AIME24, AIME25 MAA (2025), and GPQA-D Rein et al. (2023), using Qwen3 (8B/14B/30B-A3B) and DeepSeek-R1-Distill-8B. We report efficiency measured by the average number of generated tokens (Tok), Accuracy (Acc), and the relative performance-cost ratio (RPCR), which measures the accuracy gain over the No-Think baseline per generated token. For readability, we report RPCR with two decimal places. For ease of comparison, we show the Compression Rate (CR; relative to full decoding) as an integer percentage in parentheses next to Tok. Metric definitions and token accounting details are provided in Appendix B. For autoregressive generation, Tok directly measures generation length. We compare against Full Reasoning, Budget Force (BF), No-Think, and DEER; we also report S-GRPO, which requires additional LLM training.
| Method | AIME24 Tok $\downarrow$ (CR% $\downarrow$ ) | AIME24 Acc $\uparrow$ | AIME24 RPCR $\uparrow$ | AIME25 Tok $\downarrow$ (CR% $\downarrow$ ) | AIME25 Acc $\uparrow$ | AIME25 RPCR $\uparrow$ | GPQA-D Tok $\downarrow$ (CR% $\downarrow$ ) | GPQA-D Acc $\uparrow$ | GPQA-D RPCR $\uparrow$ | Avg. Tok $\downarrow$ (CR% $\downarrow$ ) | Avg. Acc $\uparrow$ | Avg. RPCR $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Based on Qwen3-8B | | | | | | | | | | | | |
| Full | 15435 (100%) | 72.22 | 33.84 | 17828 (100%) | 63.33 | 23.68 | 9514 (100%) | 60.10 | 23.09 | 14259 (100%) | 65.22 | 27.22 |
| No-Think | 7271 (47%) | 19.99 | 0.00 | 5036 (28%) | 21.11 | 0.00 | 2687 (28%) | 38.13 | 0.00 | 4998 (35%) | 26.40 | 0.00 |
| BF | 10373 (67%) | 58.88 | 37.49 | 11772 (66%) | 55.56 | 29.26 | 3962 (42%) | 55.56 | 43.99 | 8702 (61%) | 56.66 | 34.77 |
| DEER | 13952 (90%) | 72.22 | 37.44 | 16628 (93%) | 67.78 | 28.07 | 9085 (95%) | 59.60 | 23.63 | 13222 (93%) | 66.53 | 30.35 |
| RCPD (Ours) | 9958 (65%) | 72.22 | 52.45 | 10067 (56%) | 63.33 | 41.94 | 4130 (43%) | 64.65 | 64.21 | 8052 (56%) | 66.73 | 50.09 |
| S-GRPO* | 8810 (57%) | 77.30 | 65.05 | - | - | - | 5271 (55%) | 55.40 | 32.76 | - | - | - |
| Based on Qwen3-14B | | | | | | | | | | | | |
| Full | 13350 (100%) | 75.56 | 35.79 | 16711 (100%) | 70.00 | 27.26 | 7711 (100%) | 64.31 | 13.81 | 12591 (100%) | 69.96 | 27.54 |
| No-Think | 6294 (47%) | 27.78 | 0.00 | 3533 (21%) | 24.44 | 0.00 | 2733 (35%) | 53.66 | 0.00 | 4186 (33%) | 35.29 | 0.00 |
| BF | 8974 (67%) | 54.44 | 29.71 | 10546 (63%) | 56.67 | 30.56 | 3932 (51%) | 62.29 | 21.95 | 7817 (62%) | 57.80 | 28.80 |
| DEER | 12265 (92%) | 72.22 | 36.23 | 14387 (86%) | 70.00 | 31.67 | 7274 (94%) | 64.48 | 14.87 | 11309 (90%) | 68.90 | 29.72 |
| RCPD (Ours) | 8799 (66%) | 73.33 | 51.77 | 9987 (60%) | 68.89 | 44.51 | 3709 (48%) | 67.17 | 36.42 | 7498 (60%) | 69.80 | 46.03 |
| S-GRPO* | 8932 (67%) | 77.90 | 56.11 | - | - | - | 4537 (59%) | 60.60 | 15.30 | - | - | - |
| Based on Qwen3-30B-A3B | | | | | | | | | | | | |
| Full | 13449 (100%) | 82.22 | 41.30 | 16457 (100%) | 74.44 | 31.06 | 7510 (100%) | 66.50 | 25.17 | 12472 (100%) | 74.38 | 33.56 |
| No-Think | 5667 (42%) | 26.67 | 0.00 | 3995 (24%) | 23.33 | 0.00 | 2637 (35%) | 47.60 | 0.00 | 4099 (33%) | 32.53 | 0.00 |
| BF | 11147 (83%) | 74.44 | 42.85 | 12708 (77%) | 65.56 | 33.23 | 4617 (61%) | 64.81 | 37.28 | 9490 (76%) | 68.27 | 37.66 |
| DEER | 12326 (92%) | 80.00 | 43.27 | 15891 (97%) | 67.78 | 27.97 | 7361 (98%) | 67.85 | 27.51 | 11860 (95%) | 71.87 | 33.17 |
| RCPD (Ours) | 11229 (83%) | 82.22 | 49.47 | 12670 (77%) | 74.44 | 40.34 | 4469 (60%) | 68.69 | 47.19 | 9457 (76%) | 75.11 | 45.02 |
| Based on DeepSeek-R1-Distill-8B | | | | | | | | | | | | |
| Full | 21339 (100%) | 76.67 | 26.56 | 23237 (100%) | 70.00 | 21.04 | 11537 (100%) | 58.73 | 17.86 | 18704 (100%) | 68.47 | 22.49 |
| BF | 16202 (76%) | 72.22 | 32.24 | 19740 (85%) | 62.22 | 20.83 | 8909 (77%) | 56.22 | 20.31 | 14950 (80%) | 63.55 | 24.85 |
| DEER | 21105 (99%) | 72.22 | 24.75 | 23129 (100%) | 67.77 | 20.17 | 11207 (97%) | 58.25 | 17.95 | 18480 (99%) | 66.08 | 21.47 |
| RCPD (Ours) | 16981 (80%) | 76.67 | 33.38 | 19602 (84%) | 67.77 | 23.80 | 9814 (85%) | 58.25 | 20.50 | 15465 (83%) | 67.56 | 26.61 |
Table 1: Experimental results across various types of reasoning models. Methods are categorized into inference-time and training-based; S-GRPO is the training-based baseline. CR (relative to full decoding) is shown as an integer percentage in parentheses next to Tok, and RPCR (accuracy gain over the No-Think baseline per generated token) is reported with two decimal places. DeepSeek-R1-Distill-8B lacks a No-Think mode; its RPCR is computed against the Qwen3-8B No-Think baseline. S-GRPO* indicates that the result is taken directly from the original paper, while “-” indicates the result is not reported there.
As shown in Table 1, results align with Reasoning Dynamics, indicating that extending the thinking process into the Post-RCP convergence regime offers no accuracy gains. RCPD exploits this by substantially reducing token generation compared to full decoding while maintaining competitive performance. Crucially, as an inference-time method without fine-tuning, RCPD rivals the training-based S-GRPO baseline; specifically, it outperforms S-GRPO on GPQA-D despite underperforming on AIME24. This demonstrates that full decoding often involves redundant computation well beyond the necessary RCP.
To ensure a rigorous evaluation, we compare with the BF baseline under an iso-computational setting. BF exhibits significant degradation, underscoring the limitation of ignoring Reasoning Dynamics. By enforcing a static budget, BF often truncates the process during Pre-RCP, before the answer has converged. In contrast, RCPD adapts to instance-specific complexity, allowing reasoning to proceed until it enters Post-RCP convergence.
Finally, qualitative analyses further support the Post-RCP interpretation. Appendix F shows that once the model enters Post-RCP convergence, it often exhibits cognitive inertia: subsequent tokens repeatedly rationalize the first-emerged answer rather than correcting it, and may even drift into redundant loops. RCPD truncates this regime by design, explaining why early stopping can improve both efficiency and robustness.
6 Related Work
Based on the efficient reasoning taxonomy established in Sui et al. (2025); Wang et al. (2025), we categorize related work into three classes, following and extending previous research: Post-training Based Methods, Prompt-based Methods, and Early Exit Methods.
Post-training Based Methods.
These methods leverage supervised fine-tuning with variable-length chain-of-thought (CoT) data Yu et al. (2024); Kang et al. (2025); Xia et al. (2025); Ma et al. (2025), integrate length rewards in reinforcement learning Kimi Team et al. (2025); Luo et al. (2025); Aggarwal and Welleck (2025), or employ latent space reasoning methods Hao et al. (2024); Cheng and Durme (2024); Shen et al. (2025). While existing approaches reduce reasoning tokens through various training strategies, they require extensive data and retraining. In contrast, our method requires no training.
Prompt-based Methods.
Approaches in this category Han et al. (2024); Muennighoff et al. (2025); Lee et al. (2025); OpenAI et al. (2025) use prompts that instruct the model to generate concise CoT, thereby reducing unnecessary reasoning steps Kang et al. (2025); Peng et al. (2025); Huang et al. (2025). Although such prompting can yield more concise reasoning chains, it has significant limitations: the quality and flexibility of the prompt design directly affect model performance, and in the face of diverse and complex reasoning tasks, static prompts often fail to adapt to input variability, producing reasoning chains that lack sufficient depth and precision.
Early Exit Methods.
These methods intervene in the reasoning process by prematurely terminating generation to mitigate redundancy. Existing approaches primarily fall into two categories. The first relies on external monitoring mechanisms or additional model training. For instance, methods utilizing auxiliary classifiers Liu and Wang (2025), bandit controllers Sun et al. (2025), or difficulty estimators Pu et al. (2025) require extra components, while S-GRPO Dai et al. (2025) necessitates extra training of the LLM to learn early termination strategies. These approaches inevitably introduce additional training costs or inference latency. In contrast, a more recent line of work focuses on an overhead-free paradigm that leverages the model’s intrinsic states without extra components. DEER Yang et al. (2025b) exemplifies this by enabling early exiting without computational overhead. Our study aligns with this efficient paradigm.
7 Conclusion
In this work, we reveal the Reasoning Dynamics of LLMs, characterized by Thinking-Content Compensation and Semantic Path Convergence. These dynamics shape the reasoning process into an active reasoning phase and a converged reasoning phase separated by the RCP. We demonstrate that a simple method can effectively detect this point to mitigate overthinking. Our findings identify post-RCP redundancy as the primary cause of overthinking, highlighting reliable self-termination as a cornerstone for efficient test-time scaling.
8 Limitations
Our analysis and evaluation focus on reasoning models and are conducted primarily on the Qwen3 and DeepSeek-R1 families, spanning multiple model sizes and benchmarks. We prioritize these open-weight systems because they expose explicit thinking delimiters and the decoding signals needed for controlled interventions and RCPD. While these models exhibit consistent Reasoning Dynamics, we did not run the full suite of RCP analyses across the broader ecosystem of models, such as closed-source APIs, tool-augmented agents, and multimodal models. Establishing how universal RCP-like phase transitions are beyond these settings remains an important direction.
9 Potential Risks
The proposed RCPD method primarily focuses on enhancing the efficiency of reasoning processes in LLMs. As such, there are no inherent risks associated with its implementation or deployment. The method does not introduce any new computational mechanisms that could adversely affect the model’s operation. Additionally, RCPD aims to mitigate overthinking in LLMs, which, in turn, reduces unnecessary resource consumption and computational overhead.
References
- P. Aggarwal and S. Welleck (2025) L1: controlling how long A reasoning model thinks with reinforcement learning. CoRR abs/2503.04697. External Links: Link, Document, 2503.04697 Cited by: §6.
- S. Bird and E. Loper (2004) NLTK: the natural language toolkit. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 21-26, 2004 - Poster and Demonstration, External Links: Link Cited by: §2.1.
- J. Cheng and B. V. Durme (2024) Compressed chain of thought: efficient reasoning through dense representations. CoRR abs/2412.13171. External Links: Link, Document, 2412.13171 Cited by: §6.
- D. Cho, A. B. Z. Zhang, B. Fehri, S. Wang, R. Chunara, R. Song, and H. Cai (2025) Correct reasoning paths visit shared decision pivots. CoRR abs/2509.21549. External Links: Link, Document, 2509.21549 Cited by: §3.
- M. Dai, C. Yang, and Q. Si (2025) S-GRPO: early exit via reinforcement learning in reasoning models. CoRR abs/2505.07686. External Links: Link, Document, 2505.07686 Cited by: §6.
- DeepSeek-AI, D. Guo, D. Yang, H. Zhang, et al. (2025) DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. CoRR abs/2501.12948. External Links: Link, Document, 2501.12948 Cited by: §1, §2.
- S. S. Ghosal, S. Chakraborty, A. Reddy, Y. Lu, M. Wang, D. Manocha, F. Huang, M. Ghavamzadeh, and A. S. Bedi (2025) Does thinking more always help? mirage of test-time scaling in reasoning models. External Links: 2506.04210, Link Cited by: §1.
- Google DeepMind (2025) Gemini 3 pro model card. Model Card Google DeepMind. External Links: Link Cited by: §2.1.
- J. Han, Y. Huang, Y. Liao, Z. Jiang, X. Lu, H. Zhao, X. Wang, G. Zhou, S. Jiang, J. Liang, W. Zhou, Z. Sun, F. Yu, and Y. Xiao (2025) Your models have thought enough: training large reasoning models to stop overthinking. CoRR abs/2509.23392. External Links: Link, Document, 2509.23392 Cited by: §1.
- T. Han, Z. Wang, C. Fang, S. Zhao, S. Ma, and Z. Chen (2024) Token-budget-aware LLM reasoning. CoRR abs/2412.18547. External Links: Link, Document, 2412.18547 Cited by: §6.
- S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian (2024) Training large language models to reason in a continuous latent space. CoRR abs/2412.06769. External Links: Link, Document, 2412.06769 Cited by: §6.
- J. Huang, B. Lin, G. Feng, J. Chen, D. He, and L. Hou (2025) Efficient reasoning for large reasoning language models via certainty-guided reflection suppression. CoRR abs/2508.05337. External Links: Link, Document, 2508.05337 Cited by: §6.
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, et al. (2024) OpenAI o1 system card. CoRR abs/2412.16720. External Links: Link, Document, 2412.16720 Cited by: §2.1.
- Y. Kang, X. Sun, L. Chen, and W. Zou (2025) C3oT: generating shorter chain-of-thought without compromising effectiveness. In AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, T. Walsh, J. Shah, and Z. Kolter (Eds.), pp. 24312–24320. External Links: Link, Document Cited by: §6, §6.
- Kimi Team, A. Du, B. Gao, B. Xing, et al. (2025) Kimi k1.5: scaling reinforcement learning with llms. CoRR abs/2501.12599. External Links: Link, Document, 2501.12599 Cited by: §6.
- A. Lee, E. Che, and T. Peng (2025) How well do llms compress their own chain-of-thought? A token complexity approach. CoRR abs/2503.01141. External Links: Link, Document, 2503.01141 Cited by: §6.
- M. Z. Li, K. K. Agrawal, A. Ghosh, K. K. Teru, A. Santoro, G. Lajoie, and B. A. Richards (2025) Tracing the representation geometry of language models from pretraining to post-training. CoRR abs/2509.23024. External Links: Link, Document, 2509.23024 Cited by: §1.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. arXiv preprint arXiv:2305.20050. Cited by: §5.2.
- X. Liu and L. Wang (2025) Answer convergence as a signal for early stopping in reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 17907–17918. External Links: Link, Document, ISBN 979-8-89176-332-6 Cited by: §6.
- H. Luo, L. Shen, H. He, Y. Wang, S. Liu, W. Li, N. Tan, X. Cao, and D. Tao (2025) O1-pruner: length-harmonizing fine-tuning for o1-like reasoning pruning. CoRR abs/2501.12570. External Links: Link, Document, 2501.12570 Cited by: §6.
- X. Ma, G. Wan, R. Yu, G. Fang, and X. Wang (2025) CoT-valve: length-compressible chain-of-thought tuning. CoRR abs/2502.09601. External Links: Link, Document, 2502.09601 Cited by: §6.
- MAA (2025) AIME problems and solutions. External Links: Link Cited by: §5.3.
- N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. J. Candès, and T. Hashimoto (2025) S1: simple test-time scaling. CoRR abs/2501.19393. External Links: Link, Document, 2501.19393 Cited by: §6.
- OpenAI, S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, et al. (2025) Gpt-oss-120b & gpt-oss-20b model card. External Links: 2508.10925, Link Cited by: §6.
- K. Peng, L. Ding, Y. Ouyang, M. Fang, and D. Tao (2025) Revisiting overthinking in long chain-of-thought from the perspective of self-doubt. CoRR abs/2505.23480. External Links: Link, Document, 2505.23480 Cited by: §6.
- X. Pu, M. Saxon, W. Hua, and W. Y. Wang (2025) THOUGHTTERMINATOR: benchmarking, calibrating, and mitigating overthinking in reasoning models. CoRR abs/2504.13367. External Links: Link, Document, 2504.13367 Cited by: §6.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) GPQA: a graduate-level google-proof q&a benchmark. External Links: 2311.12022, Link Cited by: §5.3.
- Z. Shen, H. Yan, L. Zhang, Z. Hu, Y. Du, and Y. He (2025) CODI: compressing chain-of-thought into continuous space via self-distillation. CoRR abs/2502.21074. External Links: Link, Document, 2502.21074 Cited by: §6.
- J. Su, J. Healey, P. Nakov, and C. Cardie (2025) Between underthinking and overthinking: an empirical study of reasoning length and correctness in llms. CoRR abs/2505.00127. External Links: Link, Document, 2505.00127 Cited by: §1.
- Y. Sui, Y. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, H. Chen, and X. B. Hu (2025) Stop overthinking: A survey on efficient reasoning for large language models. CoRR abs/2503.16419. External Links: Link, Document, 2503.16419 Cited by: §6.
- R. Sun, W. Cheng, D. Li, H. Chen, and W. Wang (2025) Stop when enough: adaptive early-stopping for chain-of-thought reasoning. CoRR abs/2510.10103. External Links: Link, Document, 2510.10103 Cited by: §6.
- M. L. Team, A. Gui, B. Li, B. Tao, et al. (2025) LongCat-flash-thinking technical report. CoRR abs/2509.18883. External Links: Link, Document, 2509.18883 Cited by: §1.
- R. Wang, H. Wang, B. Xue, J. Pang, S. Liu, Y. Chen, J. Qiu, D. F. Wong, H. Ji, and K. Wong (2025) Harnessing the reasoning economy: A survey of efficient reasoning for large language models. CoRR abs/2503.24377. External Links: Link, Document, 2503.24377 Cited by: §6.
- Y. Wu, Y. Wang, T. Du, S. Jegelka, and Y. Wang (2025a) When more is less: understanding chain-of-thought length in llms. CoRR abs/2502.07266. External Links: Link, Document, 2502.07266 Cited by: §1.
- Y. Wu, Y. Wang, Z. Ye, T. Du, S. Jegelka, and Y. Wang (2025b) When more is less: understanding chain-of-thought length in llms. External Links: 2502.07266, Link Cited by: §1.
- H. Xia, Y. Li, C. T. Leong, W. Wang, and W. Li (2025) TokenSkip: controllable chain-of-thought compression in llms. CoRR abs/2502.12067. External Links: Link, Document, 2502.12067 Cited by: §6.
- A. Yang, A. Li, B. Yang, B. Zhang, et al. (2025a) Qwen3 technical report. External Links: 2505.09388, Link Cited by: §1, §2.
- C. Yang, Q. Si, Y. Duan, Z. Zhu, C. Zhu, Z. Lin, L. Cao, and W. Wang (2025b) Dynamic early exit in reasoning models. CoRR abs/2504.15895. External Links: Link, Document, 2504.15895 Cited by: §6.
- P. Yu, J. Xu, J. Weston, and I. Kulikov (2024) Distilling system 2 into system 1. CoRR abs/2407.06023. External Links: Link, Document, 2407.06023 Cited by: §6.
- Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou (2025) Qwen3 embedding: advancing text embedding and reranking through foundation models. CoRR abs/2506.05176. External Links: Link, Document, 2506.05176 Cited by: §3.1.
Appendix A Additional Experiment
Due to space constraints, we report the Qwen3-32B results in Table 4.
Appendix B Additional Evaluation Details
This section complements the main experimental description by specifying the evaluation protocol and baseline implementations used for early-exit comparisons. Specifically, our evaluation utilizes AIME24 and AIME25, which each contain 30 problems, and GPQA-D, which comprises 198 problems.
Decoding Configuration and Token Accounting.
We use each model’s official chat template and its delimiter-based reasoning format (e.g., <think>... </think>). Unless otherwise stated, we follow the recommended nucleus-sampling settings (Temperature $=0.6$ , Top- $p=0.95$ ) and set the maximum generation length to 32,768 tokens so that behaviors are not driven by hard token caps.
We report Tok as the average number of generated tokens per instance (including both thinking and content). To measure the efficiency trade-off, we report the relative performance–cost ratio (RPCR), defined as the accuracy gain over the No-Think baseline per generated token:
$$
\text{RPCR}=\frac{\text{Acc}-\text{Acc}_{\text{No-Think}}}{\text{Tok}}\times 10^{4}
$$
For readability, we report RPCR with two decimal places. Additionally, the Compression Rate (CR), defined as the token ratio relative to full decoding (i.e., $\mathrm{Tok}/\mathrm{Tok}_{\text{Full}}$ ), is shown as an integer percentage in parentheses next to Tok.
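Both metrics are straightforward to compute; as a sanity check, the sketch below (function names are ours) reproduces the Table 1 entry for RCPD on AIME24 with Qwen3-8B:

```python
def rpcr(acc, acc_no_think, tok):
    """Relative performance-cost ratio: accuracy gain over the No-Think
    baseline per generated token, scaled by 1e4 and reported with two
    decimal places."""
    return round((acc - acc_no_think) / tok * 1e4, 2)

def compression_rate(tok, tok_full):
    """Token ratio relative to full decoding, as an integer percentage."""
    return round(tok / tok_full * 100)

# Qwen3-8B on AIME24 (Table 1): RCPD vs. the No-Think baseline.
print(rpcr(72.22, 19.99, 9958))       # -> 52.45, matching Table 1
print(compression_rate(9958, 15435))  # -> 65, matching "65%"
```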
Baselines.
We compare RCPD against representative inference-time strategies:
- Full: standard decoding without early stopping.
- No-Think: prompting the model to skip explicit reasoning and answer directly.
- BF (Budget Force): static truncation that forces </think> after a fixed thinking budget.
- DEER: a trigger-based early-exit method that terminates upon reflective cue words and a confidence threshold on the final answer format.
- S-GRPO: a training-based early-exit policy (we report results directly from the original paper when available).
Iso-Compute Calibration for BF.
To make BF a strong baseline, we evaluate it under an oracle iso-computational setting: for each (model, benchmark), we retrospectively select the BF budget so that its average token usage matches RCPD. This gives BF the advantage of using a globally tuned budget that is not available at deployment time.
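Under this oracle setting, the calibration reduces to a nearest-match selection over the budget sweep; a sketch with assumed data structures:

```python
def calibrate_bf_budget(avg_tokens_by_budget, target_tokens):
    """Retrospectively pick the BF thinking budget whose measured average
    token usage is closest to RCPD's for a given (model, benchmark) pair.
    `avg_tokens_by_budget` maps each candidate budget to its measured
    average token usage (an assumed input format)."""
    return min(avg_tokens_by_budget,
               key=lambda b: abs(avg_tokens_by_budget[b] - target_tokens))
```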
From Static Budgets to Adaptive Stopping.
Figure 6 visualizes a BF sweep on AIME24 (Qwen3-32B). Accuracy improves with more thinking only up to a narrow saturation region, while token usage continues to grow. Consequently, a single fixed budget cannot reliably stop near the saturation point across instances, whereas RCPD adapts to online convergence signals and thus achieves a more consistent compute–accuracy trade-off.
Impact of Temporal Consistency on Stopping Robustness.
Table 2 compares RCPD with a simplified baseline that triggers stopping as soon as the </think> token enters the top-5 of the next-token distribution at a sentence boundary (denoted as </think> -5). While this single-threshold rule confirms that delimiter rank is a meaningful convergence proxy, RCPD performs better by incorporating short-horizon temporal patterns (Appendix C) that reduce sensitivity to noise.
<details>
<summary>2508.17627v2/x7.png Details</summary>

Line chart: accuracy vs. thinking length (tokens). Budget Force (blue line) rises from 0.200 at 0 tokens to 0.567 (1K–2K), 0.667 (4K), 0.733 (8K–16K), and 0.800 (32K), with a plateau between 8K and 16K. RCPD (purple star) achieves the highest accuracy, 0.833, at roughly 8K tokens; DEER (red triangle) reaches 0.767 at roughly 12K tokens.
</details>
(a) Relationship between thinking length and accuracy
<details>
<summary>2508.17627v2/x8.png Details</summary>

Line chart: answer content length vs. thinking length (tokens). Budget Force (blue line) starts at 4,463 content tokens with no thinking, peaks at 5,803 (1K), fluctuates through mid budgets (3,563 at 2K; 4,668 at 4K; 4,151 at 8K), then drops sharply to 1,741 (12K), 1,244 (16K), and 1,227 (32K). RCPD (purple star) yields 1,195 content tokens at about 8K thinking tokens; DEER (red triangle) yields 1,379 at about 12K.
</details>
(b) Relationship between thinking length and content length
Figure 6: Static truncation (Budget Force) cannot reliably trade compute for accuracy, while RCPD stops near the saturation point and reduces tokens with minimal accuracy loss.
Appendix C Implementation Details of RCPD Stopping Criteria
As discussed in §3.3, to balance computational efficiency with detection accuracy, we distill the decision boundaries mined from the probe dataset into a set of deterministic, symbolic rules. Generation terminates immediately once the rank trajectory of the </think> token, denoted $R_{t}$, satisfies any of the following conditions.
To facilitate reproducibility and future research, we will make our code and data publicly available upon acceptance.
1. Immediate Saturation.
This condition captures scenarios where the model exhibits overwhelming certainty. A rank of $R_{t}\leq 5$ indicates that the reasoning process has effectively collapsed into a definitive conclusion, rendering further generation redundant.
$$
\mathcal{C}_{\text{sat}}:R_{t}\leq 5 \tag{9}
$$
2. Progressive Convergence.
This rule corresponds to the “ladder pattern” observed in successful deductions. It requires the token’s probability to rise monotonically through a staircase of rank thresholds (i.e., rank values descending toward the top) over a fixed window, representing directed convergence toward the solution.
$$
\begin{split}\mathcal{C}_{\text{conv}}:\bigwedge_{k=0}^{3}(R_{t-k}\leq\theta_{k}),\\
\text{where }\boldsymbol{\theta}=[10,50,100,1000]\end{split} \tag{10}
$$
3. Sustained Plateau.
To distinguish between true convergence and stochastic noise, this condition detects whether the model has reached an “entropic floor.” Even if the rank does not hit the absolute peak ($R_{t}\approx 1$), a sustained presence within a low-rank region signals that no new information is being generated. We define this via two sub-conditions:
- Short-term Stability (Noise Filtering): Ensures high confidence is maintained over a narrow window.
$$
\mathcal{C}_{\text{short}}:\forall k\in\{0,1,2\},R_{t-k}\leq 20 \tag{11}
$$
- Long-term Persistence (Sampling Inertia): Detects a prolonged plateau where the model effectively loops or stalls in a high-probability region.
$$
\mathcal{C}_{\text{long}}:\forall k\in\{0,\dots,5\},R_{t-k}\leq 50 \tag{12}
$$
The final stop signal $S_{t}$ is triggered by the disjunction of these conditions:
$$
S_{t}=\mathcal{C}_{\text{sat}}\lor\mathcal{C}_{\text{conv}}\lor(\mathcal{C}_{\text{short}}\lor\mathcal{C}_{\text{long}}) \tag{13}
$$
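The full stop signal reduces to a few symbolic checks over the recent rank history of the delimiter token. A minimal sketch (our own helper, not the authors' released implementation):

```python
def rcpd_stop(ranks):
    """Stop signal S_t (Eqs. 9-13) over the rank trajectory of the
    </think> token, checked at sentence boundaries. ranks[-1] is the
    current rank R_t; earlier entries are R_{t-1}, R_{t-2}, ..."""
    R = lambda k: ranks[-1 - k]  # R_{t-k}
    n = len(ranks)

    sat = R(0) <= 5                                              # Eq. 9: immediate saturation
    theta = [10, 50, 100, 1000]                                  # Eq. 10: ladder thresholds
    conv = n >= 4 and all(R(k) <= theta[k] for k in range(4))
    short = n >= 3 and all(R(k) <= 20 for k in range(3))         # Eq. 11: short-term stability
    long_ = n >= 6 and all(R(k) <= 50 for k in range(6))         # Eq. 12: long-term persistence
    return sat or conv or short or long_                         # Eq. 13: disjunction

# The delimiter entering the top-5 triggers an immediate stop:
assert rcpd_stop([3200, 880, 4])
# A ladder-shaped descent satisfies progressive convergence:
assert rcpd_stop([900, 95, 40, 8])
# Noisy mid-range ranks do not stop generation:
assert not rcpd_stop([400, 150, 600, 90])
```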
Appendix D Quantifying Semantic Convergence
We quantify the stability of the semantic trajectory after the RCP by fitting a 95% confidence ellipse in the 2D PCA space (Figure 3). Let $\hat{k}$ denote the RCP step, and let $\tilde{\mathbf{v}}_{k}^{(i)}$ denote the PCA-projected embedding of the $i$-th sampled continuation at step $k$. We collect all post-RCP points into $\mathcal{C}=\{\tilde{\mathbf{v}}_{k}^{(i)}:k\geq\hat{k}\}$ and estimate their sample mean $\boldsymbol{\mu}$ and covariance $\Sigma$.
For any point $\mathbf{x}$ in this PCA plane, we measure its semantic deviation from the converged region using the squared Mahalanobis distance:
$$
D_{M}(\mathbf{x})^{2}=(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) \tag{14}
$$
We then define the 95% confidence ellipse (semantic basin) as the level set:
$$
\mathcal{B}_{0.95}=\left\{\mathbf{x}\mid D_{M}(\mathbf{x})^{2}\leq\chi^{2}_{2}(0.95)\right\}, \tag{15}
$$
where $\chi^{2}_{2}(0.95)\approx 5.991$ is the 95th percentile of the $\chi^{2}$ distribution with 2 degrees of freedom. Under a locally elliptical (approximately Gaussian) stationary regime in the PCA plane, $D_{M}(\mathbf{x})^{2}$ follows $\chi^{2}_{2}$, so $\mathcal{B}_{0.95}$ contains 95% of the probability mass of the converged distribution.
Finally, we compute the fraction of post-RCP points that remain within this basin. In Figure 3, 97.3% of subsequent trajectory points lie inside $\mathcal{B}_{0.95}$ , indicating that after the RCP the semantic path no longer drifts but fluctuates within a compact, steady neighborhood.
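The in-basin fraction can be estimated directly from the projected points. A sketch (our own helper; it hardcodes the fixed threshold $\chi^{2}_{2}(0.95)\approx 5.991$ from Eq. 15 rather than computing the quantile):

```python
import numpy as np

def in_basin_fraction(points):
    """Fraction of 2D PCA points inside the 95% confidence ellipse,
    i.e., squared Mahalanobis distance <= chi^2_2(0.95) ~= 5.991
    (Eqs. 14-15), with mean and covariance estimated from the points."""
    X = np.asarray(points, dtype=float)
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - mu
    d2 = np.einsum('ij,jk,ik->i', d, inv_cov, d)  # squared Mahalanobis distances
    return float((d2 <= 5.991).mean())

# Sanity check: points drawn from a 2D Gaussian should fall inside
# the 95% ellipse roughly 95% of the time.
rng = np.random.default_rng(0)
pts = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 0.5]], size=2000)
print(round(in_basin_fraction(pts), 2))
```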
| Model | Method | Acc |
| --- | --- | --- |
| Qwen3-8B | </think> -5 | 61.11 |
| Qwen3-8B | RCPD | 75.11 |
| Qwen3-14B | </think> -5 | 65.72 |
| Qwen3-14B | RCPD | 69.79 |
| Qwen3-30B-A3B | </think> -5 | 73.84 |
| Qwen3-30B-A3B | RCPD | 75.11 |
| Qwen3-32B | </think> -5 | 73.26 |
| Qwen3-32B | RCPD | 73.37 |
Table 2: Static stopping by a single rank threshold (</think> -5) is competitive but consistently worse than RCPD, indicating that simple temporal patterns improve robustness.
Appendix E Additional Visualizations
To demonstrate that the same phenomena consistently appear across datasets, we provide additional examples that could not fit in the main text.
<details>
<summary>2508.17627v2/x9.png Details</summary>

Scatter plot of thinking length (Lt) vs. content length (Lc) for sampled continuations, colored by answer correctness (red = False, green = True), with a dashed vertical line marking the RCP (purple star). Before the RCP, correct and incorrect answers are mixed and content lengths vary widely; after the RCP, answers are predominantly correct with short, stable content lengths.
</details>
(a) AIME24 Question 2 Length Dynamics
<details>
<summary>2508.17627v2/x10.png Details</summary>

Scatter plot of thinking length (Lt) vs. content length (Lc), colored by answer correctness (red = False, green = True), with a dashed vertical line at Lt = 200 marking the RCP (purple star at roughly (210, 20)). Pre-RCP, incorrect answers cluster at low Lt while correct answers span content lengths up to 500; post-RCP, correct answers dominate and form a flat band at low content lengths (0–100) across Lt = 200–350.
</details>
(b) AIME24 Question 9 Length Dynamics
Figure 7: Supplementary examples illustrating the transition from the Pre-RCP Active Reasoning Stage to the Post-RCP Converged Reasoning Stage (complementing Figure 2).
<details>
<summary>2508.17627v2/x11.png Details</summary>

2D latent-space (PCA) trajectory of reasoning steps, colored by thinking step k (0–160). A red star marks the RCP (optimal stop); a blue X marks the convergence center, surrounded by a shaded 95% confidence attraction basin. Pre-RCP (active reasoning) points are dispersed across the plane, while post-RCP (converged reasoning) points cluster tightly within the basin.
</details>
(a) AIME24 Question 1 Semantic Trajectory
<details>
<summary>2508.17627v2/x12.png Details</summary>

2D latent-space (PCA) trajectory colored by thinking step k (0–700). Pre-RCP points (low k) are scattered with no dominant cluster; post-RCP points (high k) form a dense cluster around the RCP (red star) and the convergence center (blue X), inside the shaded 95% confidence attraction basin.
</details>
</details>
(b) AIME24 Question 30 Semantic Trajectory
Figure 8: Additional semantic trajectory examples (supplementary to Figure 3).
Using the same 95% confidence ellipse protocol as in Appendix D, we quantify post-RCP semantic stability for these additional cases. For AIME24 Question 1 (Figure 8(a)), 94.3% of post-RCP points lie inside the 95% basin, and for AIME24 Question 30 (Figure 8(b)), the ratio is 96.5%. The consistently high in-basin fractions support that semantic convergence after the RCP is not instance-specific but recurs across different problems.
| Setting | Original Answer | Injected Answer | Other |
| --- | --- | --- | --- |
| Injection: 16 (Original: 12) | 11 | 49 | 4 |
| Injection: 12 (Original: 16) | 11 | 51 | 2 |
Table 3: Impact of Modified Intermediate Answer on Final Reasoning Outcome. The results demonstrate a strong confirmation bias: in both settings, the model overwhelmingly adopts the injected intermediate answer rather than recovering the correct original answer.
| Method | Tok (CR) | Acc | RPCR | Tok (CR) | Acc | RPCR | Tok (CR) | Acc | RPCR | Tok (CR) | Acc | RPCR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | 11955 (100%) | 82.22 | 40.90 | 16878 (100%) | 65.56 | 28.31 | 7357 (100%) | 69.53 | 25.35 | 12063 (100%) | 72.44 | 31.87 |
| No-Think | 5731 (48%) | 33.33 | 0.00 | 5472 (32%) | 17.78 | 0.00 | 2277 (31%) | 50.88 | 0.00 | 4493 (37%) | 33.99 | 0.00 |
| BF | 10071 (84%) | 78.89 | 45.24 | 11772 (70%) | 55.56 | 32.09 | 4137 (56%) | 64.98 | 34.08 | 8660 (72%) | 66.48 | 37.52 |
| DEER | 12002 (100%) | 81.11 | 39.81 | 14905 (88%) | 66.67 | 32.80 | 6841 (93%) | 69.70 | 27.51 | 11249 (93%) | 72.49 | 34.23 |
| RCPD | 10062 (84%) | 82.22 | 48.59 | 10917 (65%) | 66.67 | 44.78 | 4094 (56%) | 71.21 | 49.66 | 8358 (69%) | 73.37 | 47.12 |
Table 4: Additional results on Qwen3-32B. Each group of three columns reports Tok (CR%), Acc, and $10^{4}\times$ RPCR (two decimals) for one benchmark; the final group averages over the three benchmarks.
Appendix F Examples of Overthinking
This section presents three typical cases of overthinking in model reasoning: one involves repeated verification of an answer that is already correct (Table 6), while the other two involve repeated derivation of speculative answers (Tables 5 and 7). These examples illustrate how overthinking wastes computation without improving the solution.
To further assess the effectiveness of the model’s post-RCP validation mechanism, we conduct a counterfactual modification experiment. We manually edit the initial tentative answer in the reasoning trace shown in Table 7. Concretely, we replace “…perhaps the number of intersections is 12…” with “…perhaps the number of intersections is 16…”. After injecting this incorrect intermediate conclusion, we run 64 independent reasoning trials. As shown in Table 3, the model exhibits a pronounced confirmation bias: only 11 trials recover the correct answer (12), whereas 49 of the 64 converge to the injected value (16). The same pattern holds under the symmetric setting that flips the injection from 16 to 12, where the model again predominantly adopts the newly injected answer. Overall, these results indicate that post-RCP generation seldom performs genuine re-verification and instead tends to rationalize earlier tokens.
User: Please reason step by step, and put your final answer within $\boxed{}$. Find the number of rectangles that can be formed inside a fixed regular dodecagon ( $12$ -gon) where each side of the rectangle lies on either a side or a diagonal of the dodecagon. The diagram below shows three of those rectangles. Assistant: ... Therefore, returning to the dodecagon, with 6 diameters, the number of rectangles would be C(6,2)=15. But the problem says that the diagram shows three of those rectangles, and asks for the total number. Maybe 15 is the answer? ... Therefore, our formula seems valid. Therefore, returning to the dodecagon. If there are 6 diameters, then number of rectangles is C(6,2) = 15. But the problem says "three of those rectangles" are shown in the diagram. Maybe the diagram is just an example, and the answer is indeed 15? ... For example, choosing 0-6 and 1-7 gives a rectangle with vertices 0,1,6,7. Choosing 0-6 and 2-8 gives rectangle 0,2,6,8. Etc. Each of these should be unique. Therefore, total number is 15. ... Therefore, the rectangle must be inscribed in the same circle. Therefore, must have diagonals as diameters. Therefore, 15. ... Therefore, I think the answer is 15. **Final Answer** $15$
Table 5: As shown in the table, the answer 15 was repeatedly derived through a speculative combinatorial formula. The assistant applied the same formula $\binom{6}{2}=15$ multiple times, based on the assumption of 6 diameters, without rigorous validation or alternative approaches; for conciseness, only representative instances of this repetitive reasoning are included. This circular derivation illustrates a clear case of overthinking, resulting in unnecessary computational effort.
User: Please reason step by step, and put your final answer within $\boxed{}$. Every morning Aya goes for a $9$ -kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop. Assistant: ... The walking time is 3 hours, coffee shop time is 24 minutes. So total time is 3 hours + 24 minutes. Let me convert 3 hours to minutes: 180 minutes + 24 minutes = 204 minutes. ... So 204 minutes. Wait, but let me check again. Wait, 3 hours is 180 minutes. 180 + 24 = 204. So 204 minutes. Is that the answer? ... Then adding 24 minutes gives 204 minutes. Therefore, the answer is 204. Alternatively, convert 24 minutes to 0.4 hours, total time 3.4 hours, which is 3 hours 24 minutes. Either way, the problem asks for the number of minutes. So 204 minutes. **Final Answer** $204$
Table 6: As shown in the table, the answer 204 was repeatedly verified multiple times during the problem-solving process. The reasoning process involved 20 instances of repeated verification, and for the sake of brevity, only the most representative instances are included in the table. This repetitive verification process led to a significant waste of computational resources, which is a typical case of overthinking.
User: Please reason step by step, and put your final answer within $\boxed{}$. Define $f(x)=||x|-\tfrac{1}{2}|$ and $g(x)=||x|-\tfrac{1}{4}|$ . Find the number of intersections of the graphs of $y=4g(f(\sin(2\pi x)))\quad\text{and}\quad x=4g(f(\cos(3\pi y))).$ Assistant: ... This is a complex equation, but perhaps we can analyze it by considering the possible values. Let me note that both x and y are in [0,1], and the functions are continuous. Therefore, the number of solutions is finite, and we can try to find them by considering the number of times the curves cross. Alternatively, since both functions are symmetric and periodic, perhaps the number of intersections is 12, or some other number. ... let me consider the following: each time the function F(x) completes an oscillation, it crosses the function G(y) multiple times. Since F(x) has two oscillations and G(y) has three, the total number of intersections could be 2 * 3 * 2 = 12, but this is just a guess. ... Given the complexity, perhaps the number of intersections is 12, as each oscillation of F(x) (2 oscillations) intersects each oscillation of G(y) (3 oscillations) twice, giving 2 * 3 * 2 = 12. ... Given the complexity, it’s hard to say, but if each oscillation intersects twice, then 2 * 3 * 2 = 12 might be the answer. ... Given that, and considering that (1,1) is one of them, I’ll go with 12 as the answer. **Final Answer** $12$
Table 7: As shown in the table, the answer 12 was repeatedly speculated multiple times during the problem-solving process. The reasoning involved several instances of similar speculation based on oscillatory patterns and multiplicative guesses, and for brevity, only representative excerpts are included. This repetitive speculation process led to a waste of computational resources, exemplifying overthinking.
Appendix G Declaration of AI Use
In accordance with the ACL Policy on AI Writing Assistance, we openly disclose the use of AI tools in the preparation of this manuscript.
Writing Assistance
We utilized Gemini 3-Pro solely as a language editing tool. Its application was strictly limited to paraphrasing, checking grammar, and polishing the original text written by the authors to enhance readability and clarity. The AI tool was not used to generate new scientific concepts, formulate hypotheses, or interpret the results. The authors have reviewed all AI-assisted text and retain full responsibility for the accuracy, originality, and integrity of the content.
Experimental Models
It is important to distinguish the writing assistant from the models studied in this work. The methodologies, experiments, and results presented in this paper are based on the Qwen3 and DeepSeek-R1 large language models. These models served exclusively as the subjects of our research and the foundation for our experimental framework; they were not involved in the writing or drafting of this manuscript.