# Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
**Authors**: Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng
> University of Virginia
wlchen@virginia.edu, liqianp@google.com
## Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal “overthinking,” leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens—tokens whose internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-Diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that the deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@$n$, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@$n$ matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
Figure 1: Comparison of correlations between accuracy and proxies for thinking effort. The plots illustrate the relationship between model performance and two inference-time measures of thinking effort on GPT-OSS-120B-medium across AIME 2024/2025, HMMT 2025, and GPQA-Diamond. (Left) Output token count exhibits a moderate negative correlation (average $r=-0.544$), suggesting that output length is an unreliable indicator of performance. (Right) In contrast, our proposed deep-thinking ratio demonstrates a strong positive correlation with accuracy (average $r=0.828$).
## 1 Introduction
Large language models (LLMs) have achieved remarkable reasoning capabilities by generating explicit thought traces, most notably through the Chain-of-Thought (CoT) paradigm [wei2022chain-d1a]. Prior works have shown that increasing the number of reasoning tokens generated can generally boost task performance [jaech2024openai, guo2025deepseek, anthropic2025claude3-7, anthropic2025claude4, oai2025o3mini, yang2025qwen3, team2025kimi, zhong2024evaluation], motivating methods that encourage longer and more elaborate thinking traces [muennighoff2025s1, balachandran2025inference-time-7c9, yeo2025demystifying-b6f].
However, a growing body of evidence suggests that token counts are unreliable indicators of model performance during inference, as longer reasoning does not consistently translate into higher accuracy [wu2025when-905, aggarwal2025optimalthinkingbench-3bf, sui2025stop-ced, su2025between-f85]. Empirical studies reveal inverted-U relationships between CoT length and performance [wu2025when-905], as well as inverse-scaling behaviors in which longer reasoning traces systematically degrade performance [gema2025inverse-bad]. Excessive reasoning may reflect overthinking, wherein models amplify flawed heuristics or fixate on irrelevant details [feng2025what-321]. Consequently, relying on length as a metric for reasoning quality not only encourages verbosity over clarity but also wastes computational resources on uninformative tokens. Though recent work has attempted to assess the semantic structure of CoTs (e.g., by representing reasoning traces as graphs), such approaches often rely on costly auxiliary parsing or external annotations [feng2025what-321]. Addressing these limitations requires more principled and efficient methods for measuring thinking effort that can distinguish effective reasoning from uninformative generation.
In this work, we introduce deep-thinking ratio (DTR) as a direct measure of inference-time thinking effort. Instead of relying on surface-level features like output length, we focus on how individual tokens are produced internally. We posit that when a token prediction stabilizes in early layers, subsequent depth-wise modifications entail relatively low computational effort, resembling less thinking. In contrast, token predictions that undergo sustained revision in deeper layers before converging reflect greater thinking [chuang2023dola-0c6]. We operationalize this idea by projecting intermediate-layer hidden states into the vocabulary space and comparing each layer’s prediction distribution to the final-layer distribution. Tokens whose distributions do not converge until deeper layers are identified as deep-thinking tokens. By counting the proportion of deep-thinking tokens in a generated sequence, we obtain DTR, which provides a simple, mechanistically grounded measure of thinking effort, requiring neither task-specific heuristics nor external structural annotations.
Across four challenging mathematical and scientific reasoning benchmarks—AIME 2024, AIME 2025, HMMT 2025, and GPQA [aops2024aime1, aops2024aime2, aops2025aime1, aops2025aime2, hmmt2025, rein2024gpqa]—and a range of reasoning-focused language models, including the GPT-OSS, DeepSeek-R1, and Qwen3 families [openai2025gpt-oss-120b-a33, guo2025deepseek, yang2025qwen3], we demonstrate that measuring deep-thinking tokens yields strong correlations with task accuracy, substantially higher than those obtained using length-based or confidence-based baselines. Furthermore, we show that deep-thinking tokens can be leveraged for parallel inference scaling: preferentially selecting and aggregating responses with higher DTR achieves performance comparable to or better than standard consensus-based methods, while requiring only half the compute cost. Our contributions are summarized as follows:
- We introduce deep-thinking ratio (DTR)—a measure that counts the ratio of deep-thinking tokens in a sequence whose predictions undergo sustained revision in deeper layers before converging—as a new lens for characterizing inference-time thinking effort.
- We empirically show that, across multiple reasoning benchmarks and model families, DTR of a generated sequence exhibits strong positive correlations with task accuracy, outperforming length-based and confidence-based baselines significantly.
- We introduce Think@$n$, a test-time scaling strategy that preferentially selects and aggregates samples with higher DTR. By halting unpromising generations early based on DTR estimated from short prefixes, Think@$n$ matches or surpasses standard self-consistency with approximately half the inference cost.
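To preview the idea behind Think@$n$ in code: rank parallel samples by DTR, keep the high-DTR ones, and aggregate by majority vote. The sketch below is our illustrative reading of this paragraph; the keep fraction and the exact selection rule are our assumptions, not the paper's procedure.

```python
from collections import Counter

def think_at_n(samples: list[tuple[str, float]], keep_frac: float = 0.5) -> str:
    """Illustrative Think@n-style aggregation (keep_frac is an assumed
    parameter): rank (answer, DTR) pairs by DTR, keep the top fraction,
    and return the majority answer among the kept samples."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    return Counter(ans for ans, _ in kept).most_common(1)[0][0]
```

With `keep_frac=0.5`, half of the sampled generations are discarded before voting, which is where the compute saving relative to plain self-consistency comes from.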
Figure 2: Heatmap of thought: We plot the Jensen–Shannon divergence (JSD) values between the distributions of the last (36th) layer and intermediate layers for an answer sequence from GPT-OSS-120B-high. Functional and templated words (e.g., “and”, “is”, “boxed”, “<|return|>”) often converge at relatively shallow layers; completions after operators (e.g., “+”, “=”) and answer tokens/symbols (e.g., “13”, “(D)”) do not settle until deeper layers. Interestingly, the answer token “13” gradually surfaces in earlier layers after its first appearance.
## 2 Measuring Deep-Thinking Ratio
### 2.1 Preliminaries
We consider an autoregressive language model $f_{\theta}$ composed of $L$ transformer layers, hidden dimension $d$ , and vocabulary $V$ . Given a prefix sequence $y_{<t}$ , the forward pass at generation step $t$ produces a sequence of residual stream states $\{h_{t,l}\}_{l=1}^{L}$ , where $h_{t,l}\in\mathbb{R}^{d}$ denotes the hidden state after layer $l$ . The final-layer output $h_{t,L}$ is projected by the language modeling head (i.e., the unembedding matrix) $W_{U}\in\mathbb{R}^{|V|\times d}$ to produce logits over the vocabulary.
Prior research on early exiting [teerapittayanon2016branchynet, elbayad2019depth, schuster2022confident, din2024jump, belrose2023eliciting] has demonstrated that, without specialized auxiliary training, applying the language modeling head directly to intermediate-layer hidden states yields meaningful predictive distributions [nostalgebraist2020lens, kao2020bert]. Building on this line of work, we project intermediate-layer hidden states into the vocabulary space using the same unembedding matrix $W_{U}$ . For each intermediate layer $l\in\{1,\ldots,L-1\}$ , we compute the logit vector $z_{t,l}$ and probability distribution $p_{t,l}$ as
$$
\displaystyle z_{t,l}=W_{U}h_{t,l},\qquad p_{t,l}=\mathrm{softmax}(z_{t,l}) \tag{1}
$$
The model’s final-layer distribution is denoted by $p_{t,L}$ .
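Eq. (1) amounts to reusing the language-modeling head at every layer. As a minimal NumPy sketch (the array shapes and function name are our own illustration, not the authors' implementation):

```python
import numpy as np

def layerwise_distributions(hidden_states: np.ndarray, W_U: np.ndarray) -> np.ndarray:
    """Project each layer's hidden state h_{t,l} into vocabulary space (Eq. 1).

    hidden_states: (L, d) residual-stream states for one generation step t.
    W_U:           (|V|, d) unembedding matrix (language-modeling head).
    Returns an (L, |V|) array whose l-th row is p_{t,l} = softmax(W_U h_{t,l}).
    """
    logits = hidden_states @ W_U.T                    # z_{t,l} = W_U h_{t,l}
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)  # row-wise softmax
```

The final row recovers the model's ordinary next-token distribution $p_{t,L}$; the earlier rows are the intermediate-layer distributions it is compared against.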
### 2.2 Deep-Thinking Tokens
Figure 3: Illustration of our method for identifying deep-thinking tokens. Consider a model with 10 layers and depth fraction $\rho=0.8$: the token at generation step $t$ is classified as a deep-thinking token because its JSD with the final-layer distribution first falls below the threshold $g$ only once it reaches the late-settling regime.
**Algorithm 1**: Computing Deep-Thinking Ratio (DTR)

**Input**: Autoregressive LM $f_{\theta}$ with $L$ layers and unembedding matrix $W_{U}$; input prompt $x$; threshold $g$; depth fraction $\rho$

**Output**: $\mathrm{DTR}(S)$ of the generated sequence $S$

$C\leftarrow 0$ ; // deep-thinking token count

$S\leftarrow\emptyset$ ; // generated sequence

$y_{t}\leftarrow\mathtt{[BOS]}$ ; // initialize with start token

**while** $y_{t}\neq\mathtt{[EOS]}$ **do**

- Sample $y_{t}\sim p_{t,L}\!\left(f_{\theta}(\cdot\mid x,S)\right)$ ; $S\leftarrow(S,y_{t})$ ;
- **for** $l\leftarrow 1$ **to** $L$ **do** $p_{t,l}\leftarrow\mathrm{softmax}(W_{U}h_{t,l})$ ; $D_{t,l}\leftarrow\mathrm{JSD}(p_{t,L},p_{t,l})$ ; **end for**
- $c_{t}\leftarrow\min\{l:\min_{j\leq l}D_{t,j}\leq g\}$ ;
- **if** $c_{t}\geq\lceil\rho L\rceil$ **then** $C\leftarrow C+1$ ; **end if**

**end while**

**return** $C/|S|$ ;
We posit that inference-time thinking effort for a token manifests as the continued evolution of its predictive distributions (i.e., $p_{t,l}$ ) across LM layers. Tokens whose distributions stabilize earlier correspond to less additional thinking, while those that stabilize later correspond to more extended internal thinking. In other words, simple tokens stabilize early with shallow computation, whereas difficult tokens requiring more thinking exhibit distributional shifts in deeper layers with more computation. To illustrate this, we show a motivating example of answering a GPQA [rein2024gpqa] question in Figure 2.
To quantify this behavior, we measure how long a token’s predictive distribution continues to change before settling, operationalized as the layer at which the intermediate distribution becomes sufficiently close to the final-layer distribution. Specifically, for each generation step $t$ and layer $l$ , we compute the Jensen–Shannon divergence (JSD) between the intermediate-layer distribution $p_{t,l}$ and the final-layer distribution $p_{t,L}$ :
$$
\displaystyle D_{t,l}\coloneqq\operatorname{JSD}\!\left(p_{t,L}\,\|\,p_{t,l}\right)=H\!\left(\frac{p_{t,L}+p_{t,l}}{2}\right)-\tfrac{1}{2}H(p_{t,L})-\tfrac{1}{2}H(p_{t,l}), \tag{2}
$$
where $H(\cdot)$ denotes Shannon entropy. By construction, $D_{t,L}=0$ . A trajectory $l\mapsto D_{t,l}$ that approaches zero only at later layers indicates prolonged distributional revision (think more), whereas early convergence indicates that the model settles on its final prediction with fewer subsequent updates (think less). We employ JSD due to its symmetry and boundedness, following [chuang2023dola-0c6]. We explore other distance metrics in Appendix A.
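For concreteness, Eq. (2) can be implemented in a few lines. We use base-2 logarithms here so that the JSD is bounded in $[0,1]$; the paper does not state its log base, so this is an assumption of the sketch:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy H(p) in bits (0 log 0 taken as 0)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence of Eq. (2):
    H((p + q)/2) - H(p)/2 - H(q)/2.
    Symmetric in p and q, and bounded in [0, 1] with base-2 logs."""
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)
```

Identical distributions give a JSD of 0 and disjoint ones give 1, which is what makes a fixed threshold such as $g=0.5$ meaningful across tokens.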
To enforce a strict notion of settling, we compute:
$$
\displaystyle\bar{D}_{t,l}=\min_{j\leq l}D_{t,j}. \tag{3}
$$
We define the settling depth $c_{t}$ as the first layer at which $\bar{D}_{t,l}$ falls below a fixed threshold $g$ :
$$
\displaystyle c_{t}=\min\left\{l\in\{1,\ldots,L\}:\bar{D}_{t,l}\leq g\right\}. \tag{4}
$$
We then define a deep-thinking regime using a depth fraction $\rho\in(0,1)$ , with
$$
\displaystyle\mathcal{L}_{\text{deep-thinking}}=\left\{l:l\geq\left\lceil\rho\times L\right\rceil\right\}. \tag{5}
$$
A token is classified as a deep-thinking token (i.e., requiring more layer computations and more thinking effort to become sufficiently close to the final-layer distribution) if $c_{t}\in\mathcal{L}_{\text{deep-thinking}}$ . An illustration is shown in Figure 3.
Finally, for a generated sequence $S$ of length $T$ , we define the deep-thinking ratio, $\mathrm{DTR}(S)$ , for the sequence as the proportion of tokens that settle in the late regime:
$$
\displaystyle\mathrm{DTR}(S)=\frac{1}{T}\sum_{t=1}^{T}\mathbb{1}\!\left[c_{t}\in\mathcal{L}_{\text{deep-thinking}}\right]. \tag{6}
$$
A higher DTR indicates that a larger fraction of tokens undergo extended computation for distributional revision before stabilizing. We note that our proposed method does not imply that early-settling tokens are suboptimal; rather, it provides a depth-wise characterization of inference-time thinking effort that complements the surface-level token length measure. We show the overall algorithm of DTR in Algorithm 1. We also provide qualitative examples in Appendix E.
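Given the full divergence matrix $D_{t,l}$ for a sequence, Equations (3)-(6) reduce to a running minimum, a threshold test, and an average. A minimal NumPy sketch (our own code, with 1-indexed layers as in the paper):

```python
import numpy as np

def settling_depth(D_t: np.ndarray, g: float) -> int:
    """Settling depth c_t (Eq. 4): the first layer l at which the running
    minimum min_{j<=l} D_{t,j} (Eq. 3) falls to or below the threshold g.
    D_t holds D_{t,1}, ..., D_{t,L}; since D_{t,L} = 0 by construction,
    a settling layer always exists. Returns a 1-indexed layer."""
    running_min = np.minimum.accumulate(D_t)
    return int(np.argmax(running_min <= g)) + 1  # first index where True

def deep_thinking_ratio(D: np.ndarray, g: float = 0.5, rho: float = 0.85) -> float:
    """DTR(S) (Eq. 6): the fraction of tokens whose settling depth lies in
    the deep-thinking regime {l : l >= ceil(rho * L)} (Eq. 5).
    D is a (T, L) matrix of per-token, per-layer JSDs for a T-token sequence."""
    _, L = D.shape
    boundary = int(np.ceil(rho * L))
    depths = np.array([settling_depth(row, g) for row in D])
    return float((depths >= boundary).mean())
```

With $g=0.5$, $\rho=0.8$, and a 10-layer model as in Figure 3, a token whose divergences first drop below 0.5 at layer 8 satisfies $c_t \geq \lceil 0.8 \times 10 \rceil = 8$ and is counted as deep-thinking.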
## 3 Deep-Thinking Ratio Reflects Task Accuracy More Reliably
We empirically evaluate whether our distributional distance-based measurement provides a more faithful and robust characterization of inference-time thinking effort than surface-level, length-based proxies (i.e., token counts).
#### Models.
We evaluate eight variants of reasoning LLMs from three model families: GPT-OSS-20B and GPT-OSS-120B (each with low, medium, and high reasoning levels) [openai2025gpt-oss-120b-a33], DeepSeek-R1-70B [guo2025deepseek] (for brevity, we use DeepSeek-R1-70B to refer to Llama-3.3-70B-Instruct distilled on DeepSeek-R1-generated samples; https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B), and Qwen3-30B-Thinking [yang2025qwen3]. These models are known for their strong long-CoT capability in mathematical and complex reasoning, and they span multiple parameter scales for comprehensive coverage.
#### Tasks.
We focus on reasoning-intensive benchmarks where scaling CoT-style computation at inference time plays a central role. We adopt four benchmarks widely used in recent evaluations of LLM reasoning capabilities [xai2025grok4, openai2025gpt5, balunovic2025matharena], including three competition-level mathematical problem sets, AIME 2024 [aops2024aime1, aops2024aime2], AIME 2025 [aops2025aime1, aops2025aime2], and HMMT 2025 [hmmt2025], as well as the diamond set of GPQA [rein2024gpqa], which consists of challenging graduate-level scientific questions.
#### Decoding settings.
Following [gema2025inverse-bad], we prompt models to reason step by step using a fixed, neutral instruction, without specifying a reasoning budget or explicitly encouraging longer deliberation. This setup allows each model to naturally allocate inference-time computation on a per-instance basis, avoiding confounds introduced by externally imposed token budgets or budget-conditioning prompts. Following standard practice in natural overthinking analyses [gema2025inverse-bad], we sample multiple responses for each question (25 responses per question in our experiments). Across these samples, models naturally exhibit variation in reasoning length and internal computation patterns. We use the developer-recommended sampling parameters for all tested models: temperature = 1.0 and top-p = 1.0 for the GPT-OSS series; temperature = 0.6 and top-p = 0.95 for DeepSeek-R1-70B and Qwen3-30B-Thinking.
For each sampled response, we record intermediate-layer hidden states, obtain their projected probability distributions, and compute DTR as described in Section 2. We uniformly set the settling threshold $g=0.5$ and the depth fraction $\rho=0.85$ to define the deep-thinking regime. We also analyze different values of $g$ and $\rho$; the results are provided in Section 3.2. The reported statistics are averaged over 30 random seeds across decoding runs.
### 3.1 Results
To quantify the relationship between inference-time thinking effort and task performance, we measure the association between thinking-effort scores and answer accuracy by computing the Pearson correlation coefficient. Specifically, we conduct a binned analysis following [gema2025inverse-bad]: we partition sampled sequences into quantile bins (i.e., 5 bins) based on their DTR (Equation 6) and compute the average accuracy within each bin.
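The binned analysis can be sketched as follows; the helper names and the exact binning details are our assumptions, since the text specifies only quantile binning into 5 bins:

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

def binned_dtr_accuracy_correlation(dtr, correct, n_bins: int = 5) -> float:
    """Partition samples into DTR quantile bins, then correlate per-bin
    mean DTR with per-bin accuracy, mirroring the 5-bin analysis.
    dtr:     per-sample DTR scores.
    correct: per-sample 0/1 correctness indicators."""
    dtr = np.asarray(dtr, float)
    correct = np.asarray(correct, float)
    edges = np.quantile(dtr, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, dtr, side="right") - 1, 0, n_bins - 1)
    xs = [dtr[bins == b].mean() for b in range(n_bins)]
    ys = [correct[bins == b].mean() for b in range(n_bins)]
    return pearson_r(xs, ys)
```

A strongly positive return value corresponds to the dark-green entries for DTR in Table 1.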
Table 1: Pearson correlations between task accuracy and different inference-time measures, including length-based and confidence-based baselines, across eight model variants and four reasoning benchmarks. Correlation values are color-coded: strong positive correlations ( $0.5\sim 1$ ) are shown in dark green, weak positive correlations ( $0\sim 0.5$ ) in light green, weak negative correlations ( $-0.5\sim 0$ ) in light orange, and strong negative correlations ( $-1\sim-0.5$ ) in dark orange.
| Model | Token Length | Reverse Token Length | Log Probability | Negative Perplexity | Negative Entropy | Self-Certainty | DTR (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AIME 2025 | | | | | | | |
| OSS-120B-low | 0.504 | -0.504 | 0.872 | 0.453 | 0.863 | 0.803 | 0.930 |
| OSS-120B-medium | -0.365 | 0.365 | 0.817 | 0.246 | 0.822 | 0.815 | 0.862 |
| OSS-120B-high | -0.961 | 0.961 | 0.705 | 0.552 | 0.711 | 0.728 | 0.796 |
| OSS-20B-low | -0.689 | 0.689 | 0.579 | 0.849 | 0.665 | 0.275 | 0.373 |
| OSS-20B-medium | -0.757 | 0.757 | 0.616 | -0.677 | 0.637 | 0.097 | 0.161 |
| OSS-20B-high | -0.385 | 0.385 | 0.455 | -0.795 | 0.550 | 0.489 | 0.610 |
| DeepSeek-R1-70B | -0.973 | 0.973 | 0.961 | 0.955 | 0.946 | 0.899 | 0.974 |
| Qwen3-30B-Thinking | -0.663 | 0.663 | -0.008 | -0.035 | 0.154 | 0.828 | 0.855 |
| AIME 2024 | | | | | | | |
| OSS-120B-low | -0.166 | 0.166 | 0.897 | 0.682 | 0.869 | 0.741 | 0.840 |
| OSS-120B-medium | -0.680 | 0.680 | 0.795 | -0.293 | 0.908 | 0.924 | 0.533 |
| OSS-120B-high | -0.755 | 0.755 | 0.700 | -0.275 | 0.593 | 0.654 | 0.905 |
| OSS-20B-low | -0.655 | 0.655 | 0.548 | -0.342 | 0.667 | 0.584 | 0.730 |
| OSS-20B-medium | -0.827 | 0.827 | 0.195 | -0.150 | 0.440 | 0.252 | -0.192 |
| OSS-20B-high | -0.989 | 0.989 | 0.809 | 0.262 | 0.921 | 0.855 | 0.824 |
| DeepSeek-R1-70B | -0.987 | 0.987 | -0.037 | 0.223 | 0.067 | 0.287 | 0.430 |
| Qwen3-30B-Thinking | -0.869 | 0.869 | -0.857 | -0.720 | -0.680 | -0.246 | -0.657 |
| GPQA-Diamond | | | | | | | |
| OSS-120B-low | 0.682 | -0.682 | 0.984 | 0.172 | 0.995 | 0.996 | 0.976 |
| OSS-120B-medium | -0.340 | 0.340 | 0.973 | 0.316 | 0.985 | 0.981 | 0.823 |
| OSS-120B-high | -0.970 | 0.970 | 0.854 | 0.501 | 0.813 | 0.885 | 0.845 |
| OSS-20B-low | -0.602 | 0.602 | 0.984 | 0.235 | 0.991 | 0.917 | 0.935 |
| OSS-20B-medium | -0.847 | 0.847 | 0.914 | 0.468 | 0.911 | 0.889 | 0.718 |
| OSS-20B-high | -0.794 | 0.794 | 0.879 | 0.461 | 0.902 | 0.915 | 0.992 |
| DeepSeek-R1-70B | -0.930 | 0.930 | 0.068 | -0.133 | -0.165 | -0.532 | 0.885 |
| Qwen3-30B-Thinking | -0.634 | 0.634 | 0.589 | 0.865 | 0.711 | 0.943 | 0.828 |
| HMMT 2025 | | | | | | | |
| OSS-120B-low | 0.871 | -0.871 | 0.761 | 0.629 | 0.695 | 0.884 | 0.305 |
| OSS-120B-medium | -0.793 | 0.793 | 0.706 | 0.045 | 0.618 | 0.631 | 0.926 |
| OSS-120B-high | -0.967 | 0.967 | 0.750 | 0.503 | 0.728 | 0.754 | 0.972 |
| OSS-20B-low | -0.634 | 0.634 | -0.695 | 0.549 | -0.359 | -0.489 | 0.689 |
| OSS-20B-medium | -0.668 | 0.668 | 0.447 | 0.336 | 0.424 | 0.331 | 0.247 |
| OSS-20B-high | -0.352 | 0.352 | 0.537 | 0.994 | 0.831 | 0.628 | 0.932 |
| DeepSeek-R1-70B | -0.866 | 0.866 | 0.879 | 0.889 | 0.858 | 0.905 | 0.902 |
| Qwen3-30B-Thinking | -0.950 | 0.950 | -0.803 | -0.762 | -0.801 | 0.745 | 0.911 |
| Average | -0.594 | 0.594 | 0.527 | 0.219 | 0.571 | 0.605 | 0.683 |
We compare DTR against the following baselines, covering length-based proxies and confidence-based measures that are commonly adopted to assess generation quality.
#### Token count.
The total number of tokens generated in the model’s output reasoning traces. This measure is widely framed as a direct proxy for test-time compute, and underlies many empirical studies of inference-time scaling [jaech2024openai, guo2025deepseek, anthropic2025claude3-7, anthropic2025claude4, oai2025o3mini, yang2025qwen3, team2025kimi, zhong2024evaluation].
#### Reverse token count.
As a complementary baseline, we additionally consider reverse token count, defined as the negative of the total number of generated tokens for each response. This transformation is included to account for the frequently observed inverse relationship between reasoning length and accuracy in LLM overthinking [wu2025when-905, gema2025inverse-bad].
#### Log probability.
Following the notation in Section ˜ 2, let a generated sequence $S=(y_{1},\dots,y_{T})$ . At generation step $t$ , the model’s output prediction distribution (at final-layer $L$ ) over the vocabulary $\mathcal{V}$ is denoted by $p_{t,L}(\cdot)$ . We compute the average log-probability of the sampled tokens:
$$
\displaystyle\mathrm{LogProb}(S)\;=\;\frac{1}{T}\sum_{t=1}^{T}\log p_{t,L}(y_{t}) \tag{7}
$$
Higher values indicate that the model assigns higher likelihood to its own generation and are commonly interpreted as higher confidence.
#### Negative perplexity.
Perplexity is defined as the exponentiated negative average log-probability:
$$
\displaystyle\mathrm{PPL}(S)\;=\;\exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log p_{t,L}(y_{t})\right) \tag{8}
$$
We report negative perplexity $-\mathrm{PPL}(S)$ so that larger values correspond to higher confidence.
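Given the final-layer log-probabilities of the sampled tokens (as returned by most inference APIs), Equations ˜ 7 and ˜ 8 reduce to a few lines; a minimal sketch:

```python
import math

def avg_log_prob(token_logprobs):
    """Eq. (7): average log-probability of the sampled tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def neg_perplexity(token_logprobs):
    """Eq. (8), negated so that larger values mean higher confidence."""
    return -math.exp(-avg_log_prob(token_logprobs))
```

Both scores order sequences identically (negative perplexity is a monotone transform of average log-probability), but they can bin differently under the quantile protocol because the transform is nonlinear.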
#### Negative entropy.
To incorporate information from the full prediction distribution over $\mathcal{V}$ rather than only the sampled token, we compute the average entropy:
$$
\displaystyle\mathrm{Ent}(S)\;=\;\frac{1}{T}\sum_{t=1}^{T}H(p_{t,L}),\;\;\;H(p_{t,L})=-\sum_{v\in\mathcal{V}}p_{t,L}(v)\log p_{t,L}(v) \tag{9}
$$
We report negative entropy $-\mathrm{Ent}(S)$ , where larger values indicate more peaked distributions and thus greater model confidence.
#### Self-Certainty.
We also include Self-Certainty [kang2025scalable-de3], a distributional confidence metric based on the idea that higher confidence corresponds to prediction distributions that are further from the uniform distribution $u$ , which represents maximum uncertainty. Formally, self-certainty is defined as the average Kullback-Leibler (KL) divergence between $u(v)=1/|\mathcal{V}|$ and $p_{t,L}$ :
$$
\displaystyle\mathrm{Self}\text{-}\mathrm{Certainty}(S)\;=\;\frac{1}{T}\sum_{t=1}^{T}\mathrm{KL}\!\left(u\,\|\,p_{t,L}\right)\;=\;-\frac{1}{T|\mathcal{V}|}\sum_{t=1}^{T}\sum_{v\in\mathcal{V}}\log\!\big(|\mathcal{V}|\,p_{t,L}(v)\big) \tag{10}
$$
For all baselines, correlations are computed using the same protocol, where sequences are ranked and binned by token count (or its negation) or confidence scores.
Table ˜ 1 reports the correlation between task accuracy and the different measures across eight model variants and four benchmarks. Token count exhibits predominantly orange-colored values ( $r<0$ ), with mean $r=-0.59$, indicating that longer generations are associated with lower performance, in line with recent reports of inverse scaling and overthinking. Extended reasoning traces may be symptomatic of redundant, misguided, or error-amplifying deliberation. These results underscore the unreliability of surface-level length features as proxies for effective problem solving. Reversing token count yields positive correlations of identical magnitude, but the improvement is purely post hoc: it reflects the empirical regularity that shorter responses tend to be more accurate in these regimes. Reverse token count therefore serves only as a statistical adjustment rather than capturing a principled notion of computation or thinking effort.
Compared to token count, the confidence-based measures (log probability, negative perplexity, negative entropy, and self-certainty) exhibit moderately positive correlations, with per-measure means ranging from $r=0.219$ to $0.605$, as reflected by the predominance of green-colored values. This indicates that model confidence captures partial information about correctness. However, their behavior is heterogeneous across models and benchmarks: while certain configurations achieve strong positive correlations, others deteriorate to weak or even negative associations. This inconsistency suggests that confidence signals may conflate correctness with other factors such as overconfidence, and therefore do not reliably reflect inference-time compute effort or problem-solving effectiveness.
In contrast, our proposed DTR demonstrates the strongest and most stable relationship with task performance, achieving the highest average correlation of $r=0.683$ and outperforming both reverse token count and Self-Certainty, the best-performing confidence-based baseline. DTR remains positive across nearly all models and benchmarks, exhibiting the fewest orange-colored values (2 of the 32 model–benchmark settings tested). Collectively, these results show that computing DTR over output sequences provides a more faithful and robust characterization of successful reasoning outcomes than token volume alone or confidence-based alternatives.
### 3.2 Effect of Settling Thresholds and Depth Fractions
<details>
<summary>x4.png Details</summary>

Line chart of Accuracy (Pass@1) against Deep-Thinking Ratio for three settling thresholds $g\in\{0.25,0.5,0.75\}$, each with a shaded variability band. The $g=0.5$ curve shows the clearest upward trend (annotated 0.962); the $g=0.25$ curve rises initially and then plateaus (annotated 0.820); the $g=0.75$ curve is nearly flat (annotated $r=0.012$).
</details>
(a) Effect of different settling threshold $g$ .
<details>
<summary>x5.png Details</summary>

Line chart of Accuracy (Pass@1) against Deep-Thinking Ratio for four depth fractions $\rho\in\{0.8,0.85,0.9,0.95\}$, with error bars. All four curves trend upward with similar slopes (annotated values of 0.979, 0.947, 0.962, and 0.916), indicating consistent positive relationships across $\rho$ settings.
</details>
(b) Effect of different depth fraction $\rho$ .
Figure 4: Effect of hyper-parameters on thinking effort measurement and accuracy profiles. We analyze the impact of hyper-parameters by sweeping the settling threshold $g$ and the depth fraction $\rho$. (a) Varying $g$ has a stronger effect on the correlation: a permissive threshold ( $g=0.25$ ) yields flatter trends, whereas $g=0.5$ provides the most robust positive signal. (b) Varying $\rho$ shifts the range of thinking effort scores but maintains consistently positive slopes. Overall, stricter criteria (higher $g$, lower $\rho$) reduce the range of DTR, with $(g,\rho)=(0.5,0.85)$ offering an ideal balance between stability and correlation.
We conduct an analysis to understand how our two key hyper-parameters, the settling threshold $g$ and the late-settling depth fraction $\rho$, affect the measured thinking effort and its correlation with task performance. Figure ˜ 4 illustrates the accuracy profiles across varying thinking efforts (i.e., average late-settling token ratios), obtained with $g\in\{0.25,0.5,0.75\}$ and $\rho\in\{0.8,0.85,0.9,0.95\}$. We fix $\rho=0.85$ when sweeping $g$, and fix $g=0.5$ when sweeping $\rho$. We report results on GPQA-D using GPT-OSS-20B with reasoning level set to high.
We draw the following observations: (1) The magnitude of the measured sequence-level thinking effort is directly governed by the strictness of these parameters. Specifically, both Figures ˜ 4(a) and 4(b) show that imposing stricter criteria, i.e., a higher settling threshold $g$ or a lower depth fraction $\rho$, reduces the average late-settling token ratio. This is mechanistically consistent: a higher $g$ requires intermediate predictions to remain distributionally far from the final output until deeper layers in the late regime before a token is considered settled, while a lower $\rho$ restricts the definition of the late regime to a narrower band of deeper layers. Both conditions filter out more candidates, so fewer tokens are classified as late-settling and the overall range of thinking effort scores is lower.
(2) The settling threshold $g$ has a more pronounced impact on the correlation between thinking effort and accuracy than the depth fraction $\rho$. As shown in Figure ˜ 4(b), varying $\rho$ shifts the range of late-settling ratios according to its strictness but maintains a consistent, positive slope across all settings, indicating that the metric is relatively robust to the specific definition of the late layers. In contrast, Figure ˜ 4(a) reveals that the choice of $g$ matters more: a softer threshold of $g=0.25$ yields a flatter trend with a lower correlation value, suggesting that it may be overly permissive, admitting tokens that involve little computational effort and diminishing the measurement’s ability to distinguish high-quality trajectories. Conversely, thresholds of $g=0.5$ and $g=0.75$ exhibit more robust positive correlations with accuracy.
(3) Overall, when the criteria are overly restrictive ( $g=0.75$ and $\rho\in\{0.9,0.95\}$ ), the trends, while still positive, appear slightly less stable, likely due to the filtering of informative, computation-heavy tokens. Among the tested configurations, $(g,\rho)=(0.5,0.85)$ strikes an ideal balance, yielding a reliable trend with high correlation values.
## 4 Deep-Thinking Tokens Enable Efficient Test-Time Scaling
Repeated sampling is a popular strategy for scaling test-time compute, in parallel to generating long CoT [brown2024large-581, gupta2025test-time-19d, saad-falcon2024archon-cb5, stroebl2024inference-4ca, saad-falcon2025shrinking-bf7]. It improves accuracy by aggregating multiple independently generated samples per problem at the cost of increased inference budget. In this section, we explore whether our proposed DTR measure can be leveraged to preferentially select and aggregate higher-quality samples towards better performance.
Table 2: Comparison of task accuracy and average inference cost (k tokens) under different aggregation methods, across four reasoning benchmarks. The reported cost reductions ( $\Delta$ %) are shown relative to Cons@ $n$ . Think@ $n$ achieves the best overall performance while reducing inference cost by approximately 50%. Methods with ${\dagger}$ adopt a prefix length of 50 to determine early stopping.
| Method | AIME 25 Acc | AIME 25 Cost ( $\Delta$ %) | AIME 24 Acc | AIME 24 Cost ( $\Delta$ %) | HMMT 25 Acc | HMMT 25 Cost ( $\Delta$ %) | GPQA-D Acc | GPQA-D Cost ( $\Delta$ %) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OSS-120B-medium | | | | | | | | |
| Cons@ $n$ | 92.7 | 307.6 (–) | 92.7 | 235.1 (–) | 80.0 | 355.6 (–) | 73.8 | 93.5 (–) |
| Mean@ $n$ | 80.0 | 307.6 (–) | 81.6 | 235.1 (–) | 62.6 | 355.6 (–) | 69.9 | 93.5 (–) |
| Long@ $n$ | 86.7 | 307.6 (–) | 86.7 | 235.1 (–) | 73.3 | 355.6 (–) | 73.2 | 93.5 (–) |
| Short@ $n$ | 87.3 | 255.7 (-17%) | 88.0 | 200.9 (-15%) | 77.3 | 290.4 (-18%) | 73.3 | 84.4 (-10%) |
| Self-Certainty@ $n$ † | 87.3 | 150.6 (-51%) | 91.3 | 119.3 (-49%) | 78.0 | 177.0 (-50%) | 76.0 | 47.9 (-49%) |
| Think@ $n$ † | 94.7 | 155.4 (-49%) | 93.3 | 121.3 (-48%) | 80.0 | 181.9 (-49%) | 74.7 | 48.8 (-48%) |
| Qwen3-4B-Thinking | | | | | | | | |
| Cons@ $n$ | 86.7 | 1073.1 (–) | 93.3 | 950.1 (–) | 63.3 | 1275.7 (–) | 67.8 | 410.6 (–) |
| Mean@ $n$ | 81.2 | 1073.1 (–) | 86.3 | 950.1 (–) | 55.7 | 1275.7 (–) | 66.9 | 410.6 (–) |
| Long@ $n$ | 85.3 | 1073.1 (–) | 86.7 | 950.1 (–) | 52.7 | 1275.7 (–) | 66.7 | 410.6 (–) |
| Short@ $n$ | 90.0 | 983.6 (-8%) | 90.0 | 871.0 (-8%) | 63.3 | 1165.7 (-9%) | 68.2 | 382.9 (-7%) |
| Self-Certainty@ $n$ † | 86.7 | 548.9 (-49%) | 90.0 | 480.9 (-49%) | 63.3 | 641.4 (-50%) | 68.2 | 206.6 (-50%) |
| Think@ $n$ † | 90.0 | 537.5 (-50%) | 93.3 | 482.2 (-49%) | 66.7 | 641.4 (-50%) | 69.7 | 206.8 (-50%) |
<details>
<summary>x6.png Details</summary>

Two scatter plots of accuracy against inference cost (tokens), averaged over the four benchmarks, for OSS-120B-medium (left) and Qwen3-4B-Thinking (right), comparing Think@ $n$, Cons@ $n$, Self-Certainty@ $n$, Short@ $n$, Long@ $n$, and Mean@ $n$. In both panels Think@ $n$ attains the highest accuracy at roughly half the cost of Cons@ $n$, while Mean@ $n$ is the least accurate.
</details>
Figure 5: Comparison of the trade-off between task accuracy and inference cost (tokens) with different aggregation methods. Accuracy is averaged across all four datasets (AIME 24/25, HMMT 25, GPQA-D). Our Think@ $n$ method achieves the best overall Pareto-optimal performance. It matches or exceeds the accuracy of Cons@n with approximately half the inference cost, while Self-Certainty@ $n$ is notably less efficient.
Table 3: Impact of prefix length ( $\ell_{\text{prefix}}$ ) on Think@ $n$ performance and inference cost for AIME 2025. Using a short prefix of 50 tokens to estimate DTR outperforms using longer ones, and is comparable to full sequence (all) while providing significant cost savings. We also report Pass@1 and Cons@ $n$ for reference. Subscripts denote the standard deviation across 10 trials.
| | Accuracy | Cost (k tokens) |
| --- | --- | --- |
| Pass@1 | $80.0_{4.2}$ | 6.4 |
| Cons@ $n$ | $90.0_{2.5}$ | 307.6 |
| Think@ $n$ (prefix length) | | |
| 50 | $94.7_{1.6}$ | 155.4 |
| 100 | $92.0_{1.6}$ | 154.1 |
| 500 | $92.7_{1.3}$ | 153.2 |
| 1000 | $92.7_{1.3}$ | 177.4 |
| 2000 | $92.0_{1.3}$ | 198.8 |
| all | $94.0_{0.3}$ | 307.6 |
#### Experimental setups.
We follow the best-of- $n$ (BoN) evaluation protocol commonly adopted in recent test-time scaling studies [fu2025deep]. For each problem, we sample $n$ responses using identical decoding settings and compare the following aggregation methods:
- Cons@ $n$ : standard self-consistency [wang2023selfconsistency], which performs majority voting over all $n$ sampled responses;
- Mean@ $n$ : the average accuracy of all $n$ samples, reflecting a baseline with no preferential aggregation;
- Long@ $n$ and Short@ $n$ : majority voting over the longest/shortest $\eta$ percent of the $n$ samples, ranked by token count [hassid2025don, agarwal2025first];
- Self-Certainty@ $n$ : majority voting over the highest-scoring $\eta$ percent of the $n$ samples, ranked by Self-Certainty score (the best-performing baseline in Section ˜ 3);
- Think@ $n$ : majority voting over the highest-scoring $\eta$ percent of the $n$ samples, ranked by DTR $(\cdot)$ .

All methods operate on the same pool of $n$ samples. We set $n=48$ and $\eta=50\%$. More analyses are provided in Appendix ˜ C. Results are averaged across 10 trials.
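The subset-voting scheme shared by Short@ $n$, Self-Certainty@ $n$, and Think@ $n$ reduces to ranking candidates by a per-sample score and majority-voting over the top fraction; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def vote_top_eta(answers, scores, eta=0.5):
    """Majority-vote over the eta fraction of samples with the highest
    scores (DTR for Think@n, Self-Certainty for Self-Certainty@n,
    negative length for Short@n)."""
    k = max(1, int(len(answers) * eta))
    ranked = sorted(zip(scores, answers), key=lambda t: t[0], reverse=True)
    top = [a for _, a in ranked[:k]]
    return Counter(top).most_common(1)[0][0]
```

Setting `eta=1.0` recovers standard self-consistency (Cons@ $n$ ), since the vote then runs over the full sample pool regardless of scores.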
#### Results.
We report the results in Table ˜ 2. To compare efficiency, we explicitly account for early stopping for Short@ $n$ , Self-Certainty@ $n$ , and Think@ $n$ , which aggregate only a subset of samples. Specifically, we report the average per-problem inference cost, measured as the total number of generated tokens, under the following protocols.
For Cons@ $n$ and Mean@ $n$ , the inference cost is the sum of token counts across all $n$ sampled responses (i.e., $\sum_{i=1}^{n}|S_{i}|$ ), corresponding to full decoding without early stopping. For Short@ $n$ , we rank samples by length and select the shortest $\eta\times n$ samples. The inference cost is the sum of token counts of the selected samples, plus an early-stopping overhead of $\ell_{\text{longest\_short}}\times\eta\times n$ , where $\ell_{\text{longest\_short}}$ denotes the length of the longest sample among the selected shortest subset. This term accounts for partially generated samples that are terminated once the subset finishes (i.e., bounded by $\ell_{\text{longest\_short}}$ ). The inference cost for Long@ $n$ equals that of Cons@ $n$ and Mean@ $n$ , as selecting the longest samples requires full decoding. For Think@ $n$ , samples are ranked by DTR computed from a fixed prefix. Let $\ell_{\text{prefix}}$ denote the number of prefix tokens used to estimate $\mathrm{DTR}(S[:\ell_{\text{prefix}}])$ . The inference cost is the total token count of the top $\eta\times n$ ranked samples, plus a fixed prefix overhead of $\ell_{\text{prefix}}\times\eta\times n$ , which reflects the cost of generating all candidates up to the point of early termination. Self-Certainty@ $n$ follows the same cost computation as Think@ $n$ , differing only in that samples are ranked by $\mathrm{Self\text{-}Certainty}(S[:\ell_{\text{prefix}}])$ rather than $\mathrm{DTR}(S[:\ell_{\text{prefix}}])$ .
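The cost accounting can be sketched as follows; function names are illustrative. Note the sketch charges the prefix overhead to each unselected candidate, which coincides with the $\ell_{\text{prefix}}\times\eta\times n$ overhead term at the paper's setting of $\eta=50\%$ :

```python
def cost_full(lengths):
    """Cons@n / Mean@n / Long@n: every candidate is fully decoded."""
    return sum(lengths)

def cost_short(lengths, eta=0.5):
    """Short@n: shortest eta*n samples decoded fully; the remaining
    candidates are cut off at the length of the longest selected one."""
    n, k = len(lengths), int(len(lengths) * eta)
    selected = sorted(lengths)[:k]
    return sum(selected) + max(selected) * (n - k)

def cost_prefix(lengths, selected_idx, prefix_len=50):
    """Think@n / Self-Certainty@n: all candidates decode prefix_len
    tokens for scoring; only the selected ones continue to completion."""
    kept = sum(lengths[i] for i in selected_idx)
    return kept + prefix_len * (len(lengths) - len(selected_idx))
```

The prefix-ranked methods win on cost because the rejected candidates stop after `prefix_len` tokens, whereas Short@ $n$ must keep decoding every candidate until the selected subset finishes.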
Table ˜ 3 reports a preliminary ablation on AIME 25 that varies $\ell_{\text{prefix}}$ . We find that using only $\ell_{\text{prefix}}=50$ tokens achieves higher accuracy than longer prefixes and matches the performance obtained using the full sequence, while significantly reducing inference cost. Accordingly, we fix $\ell_{\text{prefix}}=50$ for all experiments in Table ˜ 2.
As shown, Cons@ $n$ incurs the highest inference cost due to full decoding of every candidate, while providing a strong accuracy baseline. Mean@ $n$ has the same cost as Cons@ $n$ but is the worst-performing method overall. Under early stopping, Short@ $n$ achieves modest cost savings relative to Cons@ $n$ , yet consistently underperforms it in accuracy. Long@ $n$ degrades performance further compared to Short@ $n$ without offering any cost savings. This indicates that length-based heuristics remain a coarse proxy for reasoning quality and often fail to reliably identify high-quality samples, leading to suboptimal aggregation. Self-Certainty@ $n$ substantially reduces inference cost by enabling early stopping on short prefixes, but nonetheless underperforms both Cons@ $n$ and Think@ $n$ on three of the four evaluated benchmarks. In contrast, Think@ $n$ consistently matches or exceeds the accuracy of Cons@ $n$ while requiring approximately half the inference cost. This Pareto-optimal behavior is most evident in the averaged results shown in Figure ˜ 5, where Think@ $n$ achieves the best overall accuracy-cost trade-off. In sum, these results demonstrate that DTR provides a more informative and reliable selection signal, enabling efficient parallel scaling of inference compute.
## 5 Related Work
### 5.1 Relationship between CoT Length and Performance
The paradigm of test-time scaling has largely operated on the assumption that allocating more computation, typically manifested as longer CoT sequences, boosts reasoning performance [wei2022chain-d1a, guo2025deepseek, muennighoff2025s1]. Recent empirical studies have highlighted nuances to the universality of this “longer is better” heuristic [feng2025what-321, wu2025when-905]. gema2025inverse-bad identify inverse-scaling regimes where increased reasoning length systematically degrades accuracy across diverse tasks, particularly when models are prone to distraction. Similarly, wu2025when-905 characterize the relationship between CoT length and accuracy as an “inverted-U” curve, suggesting that an optimal length exists beyond which performance deteriorates due to factors like error accumulation.
Several works exploit these observations by favoring conciseness. Hassid et al. (2025) demonstrate that the shortest reasoning chains among sampled candidates are often the most accurate, proposing inference-time length-based voting for efficient generation. Closely related, Agarwal et al. (2025) introduce a training-free strategy that selects the first completed trace in parallel decoding, reducing token usage while maintaining accuracy. On the training side, Shrivastava et al. (2025) propose Group Filtered Policy Optimization (GFPO), which explicitly curbs length inflation in RL via rejection sampling that filters out longer responses, demonstrating that models can think less without sacrificing performance. Our work aligns with these perspectives in confirming that raw token count is an unreliable proxy for effective reasoning effort, but diverges by proposing a mechanistic internal signal rather than relying on surface-level brevity heuristics.
### 5.2 Leveraging Internal Information in LLMs
A rich line of work has investigated how LMs internally represent and manipulate information across layers, and how internal states can be exploited. Central to this direction is the observation that intermediate representations in LMs often encode meaningful signals before reaching the final layer. Early evidence for this view comes from the logit lens (nostalgebraist, 2020), which projects intermediate hidden states directly into the vocabulary space using the model’s unembedding matrix—a technique we adopt in our work. The results reveal that autoregressive transformers form coarse guesses about the next token that are iteratively refined across layers. Subsequent analyses (Belrose et al., 2023) introduce learned, layer-specific affine transformations that better align intermediate representations with the final prediction space, enabling more interpretable token predictions at shallower layers.
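The logit-lens projection described above admits a compact sketch. The snippet below is a toy illustration with random weights, not an implementation against a real model, and it omits the final layer norm that practical logit-lens implementations typically apply before the unembedding.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, W_U):
    """Project each layer's hidden state at one token position into the
    vocabulary space via the shared unembedding matrix W_U, yielding a
    next-token distribution per layer."""
    return softmax(hidden_states @ W_U)

# Toy setup: 3 layers, a 4-dimensional residual stream, 8-word vocabulary.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))      # per-layer hidden states for one token
W_U = rng.normal(size=(4, 8))    # unembedding matrix
layer_probs = logit_lens(h, W_U)  # one distribution per layer
```

Comparing `layer_probs[l]` against `layer_probs[-1]` for each layer `l` is exactly the kind of depth-wise comparison our method builds on.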
Beyond model probing, Chuang et al. (2023) exploit the empirical finding that factual knowledge in LMs is often more salient in particular layers. By contrasting logits from higher and lower layers, they propose a decoding method (DoLa) that amplifies factual signals and improves factuality. More recently, Vilas et al. (2025) introduce latent-trajectory signals characterizing the temporal evolution of hidden states across generated reasoning traces to predict correctness. While that work examines the sequential dimension of representations, ours focuses on the depth-wise evolution of predictions across layers for individual tokens.
Complementary interpretability works also revisit how LLMs utilize depth at inference time. Gupta et al. (2025) show that early layers tend to favor high-frequency, generic token guesses, which are subsequently refined into contextually appropriate predictions. Csordás et al. (2025) suggest that later layers primarily perform fine-grained distributional refinement rather than introducing fundamentally new transformations, raising questions about the efficiency of depth utilization in modern LLMs. These findings reinforce the view that internal predictions may stabilize before the final layer, aligning with our motivation. Overall, our goal is not to modify or construct internal states to develop new methods for improving model capabilities. Instead, we leverage natural, unaltered internal representations as a proxy for measuring computational effort, which implicitly reflects thinking effort in LLMs.
## 6 Conclusion
We introduced deep-thinking ratio (DTR) as a novel measure of inference-time reasoning effort in LLMs. By tracking depth-wise stabilization of token predictions, DTR provides a more reliable signal of effective reasoning than surface-level proxies such as token length or confidence. Building on this insight, we proposed Think@ $n$ , a test-time scaling strategy that leverages DTR for early selection and aggregation, achieving comparable or better performance than standard self-consistency while substantially reducing inference cost. Together, our results suggest that measuring how models think internally, rather than how long they think, is a promising direction. Future work may leverage this insight to explore how effective reasoning is characterized—shifting the focus from generating longer chains of thought to inducing deeper, more computationally intensive reasoning, and potentially enabling more reliable and efficient reasoning models.
## Acknowledgements
We thank Congchao Wang and colleagues from Google AIR for their valuable support. We also thank Yu-Min Tseng from Virginia Tech and members of Meng-Lab at UVA for their helpful discussion.
## References
## Appendix A Comparison of Different Distance Metrics for DTR
Our method (Section 2) adopts the Jensen–Shannon divergence (JSD) to quantify the discrepancy between intermediate-layer and final-layer predictions when computing DTR. Alternative notions of distance are possible; here we explore two additional metrics: Kullback–Leibler divergence (KLD) and cosine similarity. The results are presented in Figure 6.
#### Kullback–Leibler divergence.
By replacing JSD with KLD in Equation 2, we compute the divergence between the final-layer distribution $p_{t,L}$ and the intermediate-layer distribution $p_{t,l}$ as
$$
\displaystyle D^{\text{KL}}_{t,l}=\mathrm{KL}(p_{t,L}\,\|\,p_{t,l}) \tag{11}
$$
#### Cosine similarity.
We replace the distributional comparison defined in Section 2.2 with a representation-space measure using cosine similarity. Instead of projecting intermediate-layer hidden states into the vocabulary space via the shared unembedding matrix $W_{U}$ (Equation 1), we directly compute the cosine similarity between the intermediate-layer hidden state $h_{t,l}$ and the final-layer hidden state $h_{t,L}$ . The distance is defined as
$$
\displaystyle D^{\text{cos}}_{t,l}=1-\frac{\langle h_{t,l},h_{t,L}\rangle}{\|h_{t,l}\|\|h_{t,L}\|} \tag{12}
$$
For both KLD and cosine similarity, we then apply the same configurations as in Section 2.2 to identify deep-thinking tokens and compute KLD-based and cosine-based DTR.
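The three distance metrics can be sketched as follows; these are minimal reference implementations over dense probability vectors and hidden-state vectors, with names chosen to mirror the notation above.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q): asymmetric, and unbounded when q places near-zero
    mass where p does not (Equation 11 uses p = final, q = intermediate)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log): symmetric and bounded
    above by log 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def cosine_distance(h_l, h_L):
    """1 - cosine similarity between intermediate- and final-layer
    hidden states (Equation 12)."""
    return 1.0 - float(np.dot(h_l, h_L)) / (
        np.linalg.norm(h_l) * np.linalg.norm(h_L))
```

The boundedness and symmetry of JSD, visible directly in `js_divergence`, are what make it the most stable of the three choices in our experiments.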
#### Results.
<details>
<summary>x7.png Details</summary>

### Visual Description
Line chart titled “Cosine Similarity”: Accuracy (Pass@1, roughly 0.775–0.850) versus DTR (roughly 0.060–0.078) on AIME 25, with a fitted trend line, confidence band, and annotated correlation r = 0.633.
</details>
(a) Cosine similarity as the distance metric on AIME 25.
<details>
<summary>x8.png Details</summary>

### Visual Description
Line chart titled “Kullback-Leibler Divergence”: Accuracy (Pass@1, roughly 0.72–0.85) versus DTR (roughly 0.345–0.390) on AIME 25, with a confidence band and annotated correlation r = -0.698.
</details>
(b) KL divergence as the distance metric on AIME 25.
<details>
<summary>x9.png Details</summary>

### Visual Description
Line chart titled “Jensen-Shannon Divergence”: Accuracy (Pass@1, roughly 0.68–0.86) versus DTR (roughly 0.145–0.180) on AIME 25, with a confidence band and annotated correlation r = 0.869. Accuracy rises with DTR before plateauing at the upper end of the range.
</details>
(c) JS divergence as the distance metric on AIME 25.
<details>
<summary>x10.png Details</summary>

### Visual Description
Line chart titled “Cosine Similarity”: Accuracy (Pass@1, roughly 0.56–0.68) versus DTR (roughly 0.060–0.078) on HMMT 25, with a confidence band and annotated correlation r = 0.172.
</details>
(d) Cosine similarity as the distance metric on HMMT 25.
<details>
<summary>x11.png Details</summary>

### Visual Description
Line chart titled “Kullback-Leibler Divergence”: Accuracy (Pass@1, roughly 0.60–0.68) versus DTR (roughly 0.37–0.41) on HMMT 25, with a confidence band and annotated correlation r = 0.409. The relationship is non-monotonic.
</details>
(e) KL divergence as the distance metric on HMMT 25.
<details>
<summary>x12.png Details</summary>

### Visual Description
Line chart titled “Jensen-Shannon Divergence”: Accuracy (Pass@1, roughly 0.56–0.68) versus DTR (roughly 0.135–0.180) on HMMT 25, with a confidence band and annotated correlation r = 0.895. Accuracy increases steadily with DTR across the range.
</details>
(f) JS divergence as the distance metric on HMMT 25.
Figure 6: Comparison of correlation between accuracy and deep-thinking ratio (DTR) using different distance metrics (cosine similarity, KL divergence, and JS divergence) on AIME 25 (top row) and HMMT 25 (bottom row).
We report the correlation results of KLD-based and cosine-based DTR, compared with our main JSD-based DTR method, on AIME 25 and HMMT 25 using OSS-120B-medium. Across both datasets, JSD-based DTR consistently achieves the strongest positive correlation with accuracy ( $r$ = 0.869 on AIME 25; $r$ = 0.895 on HMMT 25), justifying its use in our definition of DTR in Section 2. In contrast, cosine-based DTR exhibits substantially weaker and less stable correlations ( $r$ = 0.633 on AIME 25 and only $r$ = 0.172 on HMMT 25). KLD-based DTR shows similarly inconsistent behavior, with a negative correlation on AIME 25 ( $r$ = -0.698) and only a modest positive correlation on HMMT 25 ( $r$ = 0.409). This inconsistency may stem from the asymmetric and numerically unstable nature of KLD: early-layer predictions tend to be high-entropy and relatively flat, assigning probability mass to many tokens that are later driven to near-zero values. Consequently, KLD can become artificially small, making the measure highly sensitive to these near-zero probabilities.
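A toy numeric illustration (not drawn from the paper's data) of this sensitivity: against a sharply peaked final-layer distribution, KLD swings by orders of magnitude as the early layer's probability on the eventual top token shrinks toward zero, while JSD stays bounded by log 2.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

V = 100_000
p_final = np.full(V, 1e-9)              # final layer: nearly all mass on token 0
p_final[0] = 1.0 - 1e-9 * (V - 1)

def early_layer(top_mass):
    """Flat, high-entropy early-layer distribution that assigns
    `top_mass` to the token the final layer eventually settles on."""
    q = np.full(V, (1.0 - top_mass) / (V - 1))
    q[0] = top_mass
    return q

# KL(final || early) grows without bound as the early layer's mass on
# the eventual top token shrinks; JSD stays below log 2 in both cases.
kl_a, kl_b = kl(p_final, early_layer(1e-3)), kl(p_final, early_layer(1e-8))
jsd_a, jsd_b = jsd(p_final, early_layer(1e-3)), jsd(p_final, early_layer(1e-8))
```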
## Appendix B DTR Under Different GPT-OSS Reasoning Levels
Figure 7 illustrates how DTR varies in different reasoning-level configurations (i.e., low, medium, and high) of the GPT-OSS-120B model. We observe an interesting and consistent trend on both AIME 25 and GPQA-D: although the underlying model weights remain identical and only the system prompt differs, lower reasoning-level configurations exhibit higher DTR values, whereas higher reasoning-level configurations yield systematically smaller DTR while achieving better task accuracy.
A potential explanation is that higher reasoning levels may redistribute computation from depth to sequence length, effectively flattening per-token, layer-wise computation. Models with higher reasoning levels require less deep revision for each individual token but instead generate longer reasoning chains with more forward passes, resulting in greater total effective compute and improved task performance. Since DTR is defined as the proportion of deep-thinking tokens (i.e., averaged over the total number of generated tokens), longer sequences increase the denominator in the DTR calculation and thus produce smaller values. This also suggests DTR might not be directly comparable across different models or model modes.
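The denominator effect can be seen with a two-line computation; the token counts below are hypothetical.

```python
# DTR = (# deep-thinking tokens) / (# generated tokens).  A similar
# amount of deep revision spread over chains of different lengths
# yields very different ratios.
deep_tokens = 300
dtr_low_level = deep_tokens / 2_000    # low reasoning level: short chain
dtr_high_level = deep_tokens / 10_000  # high reasoning level: long chain
```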
<details>
<summary>x13.png Details</summary>

### Visual Description
Two line charts (AIME 25, left; GPQA-D, right) plotting Accuracy (Pass@1) against DTR for the low, medium, and high reasoning-level configurations, with confidence bands. Lower reasoning levels sit at higher DTR values, while higher reasoning levels achieve higher accuracy at lower DTR.
</details>
Figure 7: Deep-thinking ratio (DTR) under different reasoning level configurations of OSS-120B models.
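The quantity plotted above can be made concrete with a toy sketch. Following the abstract's definition (deep-thinking tokens are those whose internal predictions are still revised in deeper layers before convergence), the snippet below computes a DTR from per-layer top-1 predictions; the function name, the `deep_frac` threshold, and the revision criterion are illustrative assumptions, not the paper's exact procedure.

```python
def deep_thinking_ratio(layer_preds, deep_frac=0.5):
    """Toy DTR: layer_preds is a list over token positions; each entry
    holds the top-1 token id decoded from every layer's hidden state
    (e.g., via a logit lens). A token counts as deep-thinking if its
    top prediction is still revised within the deeper `deep_frac`
    fraction of layers (assumed criterion)."""
    deep = 0
    for preds in layer_preds:
        start = int(len(preds) * (1 - deep_frac))  # first "deep" layer
        if len(set(preds[start:])) > 1:  # prediction changes late
            deep += 1
    return deep / len(layer_preds)

# Hypothetical data: 4 tokens, 8 layers each (token ids are made up).
preds = [
    [5, 5, 5, 5, 5, 5, 5, 5],   # settled early  -> shallow
    [3, 3, 7, 7, 9, 9, 2, 2],   # revised late   -> deep
    [1, 1, 1, 4, 4, 4, 4, 4],   # settled by mid -> shallow
    [8, 8, 8, 8, 8, 6, 6, 6],   # revised late   -> deep
]
print(deep_thinking_ratio(preds))  # 0.5
```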
## Appendix C Additional Analysis of Think@ $\bm{n}$
Here we provide additional analysis of how Think@ $n$ behaves when varying (i) the number of sampled responses $n$ and (ii) the retained top- $\eta$ percentage used for voting.
#### Effect of the number of samples $\bm{n}$ .
Figure 8(a) compares Think@ $n$ against Cons@ $n$ (i.e., self-consistency) as $n$ increases ( $n\in\{16,32,48\}$ ). Think@ $n$ improves monotonically with larger $n$ , and its advantage over Cons@ $n$ becomes more pronounced. Sampling more responses makes the cluster of correct answers larger and more likely to appear; Think@ $n$ exploits this enlarged candidate pool by preferentially selecting better samples, leading to stronger performance gains over Cons@ $n$ .
#### Effect of top- $\bm{\eta}$ percentage.
Figure 8(b) evaluates Think@ $n$ under different top- $\eta$ percentages ( $\eta\in\{25\%,50\%,75\%\}$ ). Performance peaks at $\eta$ =50% and decreases for both a smaller fraction ( $\eta$ =25%) and a larger fraction ( $\eta$ =75%). This suggests a trade-off: selecting too few samples reduces voting robustness, leaving fewer strong candidates to stabilize the majority vote, whereas selecting too many admits lower-quality samples that dilute the benefit of Think@ $n$ . Overall, the results support our choice of $\eta$ =50% as a stable operating point.
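The procedure analyzed above can be sketched in a few lines: rank the $n$ sampled responses by DTR, keep the top- $\eta$ fraction, and majority-vote over the retained answers. This is a minimal illustration with made-up (answer, DTR) pairs; the real pipeline scores actual generations.

```python
from collections import Counter

def think_at_n(samples, eta=0.5):
    """samples: list of (answer, dtr) pairs for n generations.
    Keep the top-eta fraction by deep-thinking ratio, then
    majority-vote over the retained answers."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    k = max(1, int(len(ranked) * eta))   # number of samples retained
    kept = [ans for ans, _ in ranked[:k]]
    return Counter(kept).most_common(1)[0][0]

# Hypothetical pool of n = 8 sampled (answer, DTR) pairs.
pool = [("293", 0.19), ("207", 0.139), ("293", 0.18), ("293", 0.21),
        ("207", 0.12), ("293", 0.17), ("100", 0.11), ("207", 0.13)]
print(think_at_n(pool, eta=0.5))  # 293
```

With $\eta$ =100% the same function reduces to plain self-consistency (Cons@ $n$ ), since every sample enters the vote.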
<details>
<summary>x14.png Details</summary>

### Visual Description
\n
## Line Chart: Accuracy vs. Number of Samples
### Overview
This image presents a line chart illustrating the relationship between the number of samples (n) and accuracy for two different methods: "Think@n" and "Cons@n". The chart displays accuracy on the y-axis and the number of samples on the x-axis.
### Components/Axes
* **X-axis Title:** "Number of Samples n"
* **Y-axis Title:** "Accuracy"
* **X-axis Markers:** 16, 32, 48
* **Y-axis Scale:** Ranges from approximately 0.900 to 0.945, with gridlines at 0.900, 0.915, 0.930, and 0.945.
* **Legend:** Located in the top-right corner.
* "Think@n" - Represented by a blue line with circular markers.
* "Cons@n" - Represented by a teal line with circular markers.
### Detailed Analysis
**Think@n (Blue Line):**
The blue line representing "Think@n" shows a generally upward trend.
* At n = 16, Accuracy ≈ 0.900.
* At n = 32, Accuracy ≈ 0.930.
* At n = 48, Accuracy ≈ 0.945.
**Cons@n (Teal Line):**
The teal line representing "Cons@n" also shows an upward trend, but it plateaus after n = 32.
* At n = 16, Accuracy ≈ 0.900.
* At n = 32, Accuracy ≈ 0.930.
</details>
(a) Comparison of different numbers of samples $n$ .
<details>
<summary>x15.png Details</summary>

### Visual Description
\n
## Line Chart: Accuracy vs. Top η (%)
### Overview
This image presents a line chart comparing the accuracy of two methods, "Think@n" and "Cons@n", across different values of "Top η (%)". The chart displays how accuracy changes as the percentage of top results considered (η) varies from 25% to 75%.
### Components/Axes
* **X-axis:** "Top η (%)" with markers at 25, 50, and 75.
* **Y-axis:** "Accuracy" with a scale ranging from approximately 0.900 to 0.945.
* **Data Series 1:** "Think@n" represented by a solid blue line with circular markers.
* **Data Series 2:** "Cons@n" represented by a dashed grey line.
* **Legend:** Located in the top-right corner, detailing the mapping of colors/line styles to the methods.
### Detailed Analysis
* **Think@n (Blue Line):** The line exhibits an upward trend initially, then a downward trend.
* At Top η (%) = 25, Accuracy ≈ 0.900.
* At Top η (%) = 50, Accuracy ≈ 0.945. This is the peak accuracy for this method.
* At Top η (%) = 75, Accuracy ≈ 0.930.
* **Cons@n (Grey Dashed Line):** The line is approximately horizontal, indicating a relatively constant accuracy across the range of Top η (%).
### Key Observations
* "Think@n" demonstrates a significant increase in accuracy as Top η (%) increases from 25% to 50%, but then decreases slightly at 75%.
* "Cons@n" maintains a roughly constant accuracy level throughout the entire range of Top η (%), as it does not use the top-η filter.
* At its peak (η = 50%), "Think@n" outperforms the "Cons@n" baseline.
### Interpretation
The data suggest that "Think@n" is sensitive to the fraction of top samples retained. It achieves its highest accuracy when keeping the top 50% of samples, balancing the quality of the retained samples against the robustness of the majority vote. Beyond this point, retaining more samples (up to 75%) leads to a slight decrease in accuracy, likely because lower-quality generations are admitted into the vote.
The roughly constant accuracy of "Cons@n" is expected, since self-consistency votes over all samples and is unaffected by the top-η filter. The gap at η = 50% indicates that DTR-based filtering, when properly tuned, improves over plain majority voting; the chart highlights the importance of tuning the parameter η to get the most out of "Think@n".
</details>
(b) Comparison of different top- $\eta$ percentages.
Figure 8: Analysis of Think@ $n$ with different numbers of samples $n$ and top- $\eta$ percentages. (a) As $n$ increases, Think@ $n$ consistently benefits from larger candidate pools and exhibits a widening performance gap over Cons@ $n$ . (b) Performance peaks at $\eta$ =50%, while overly aggressive filtering and overly permissive selection both degrade accuracy.
## Appendix D Prompts
We provide the prompts adopted in our experiments for math tasks (AIME 2024, AIME 2025, HMMT 2025) in Table 4 and for GPQA in Table 5.
Table 4: Inference prompt for math tasks (AIME 2024, AIME 2025, HMMT 2025).
Prompt for AIME 2024, AIME 2025, HMMT 2025
Please reason step by step, and put your final answer within \boxed{}.
Table 5: Inference prompt for GPQA.
Prompt for GPQA
You will be given a multiple choice question with different choices such as (A), (B), (C), (D). Think step by step before giving a final answer to this question. Always finish your answer with ‘The final answer is \boxed{(X)}.’, where X is the correct answer choice. If none of the options match, choose the closest option as the final answer.
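A prompt that pins the answer to a fixed pattern makes grading mechanical. The sketch below, our illustration rather than the paper's evaluation code, pulls the choice letter out of the last boxed pattern in a model output:

```python
import re

def extract_choice(output: str):
    """Return the letter X from the last '\\boxed{(X)}' occurrence
    in a model output, or None if the pattern never appears."""
    matches = re.findall(r"\\boxed\{\(([A-Z])\)\}", output)
    return matches[-1] if matches else None

# A compliant GPQA answer ends like this:
print(extract_choice(r"... The final answer is \boxed{(C)}."))  # C
```

Taking the last match guards against the model restating the pattern mid-reasoning before its final answer.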
## Appendix E Qualitative Examples
We present an example question from the AIME 2025 dataset along with its ground-truth answer (Table 6), and two outputs from OSS-120B- medium: one incorrect (Table 7) and one correct (Table 8). Notably, the incorrect output is substantially more verbose (27,724 tokens) and exhibits a lower DTR (13.9%), whereas the correct output is much more concise (3,725 tokens) and achieves a higher DTR (19.0%).
Table 6: An example question and its ground-truth answer from AIME 2025.
Example question from AIME 2025.
Circle $\omega_{1}$ with radius 6 centered at point $A$ is internally tangent at point $B$ to circle $\omega_{2}$ with radius 15. Points $C$ and $D$ lie on $\omega_{2}$ such that $\overline{BC}$ is a diameter of $\omega_{2}$ and $\overline{BC}\perp\overline{AD}$ . The rectangle $EFGH$ is inscribed in $\omega_{1}$ such that $\overline{EF}\perp\overline{BC}$ , $C$ is closer to $\overline{GH}$ than to $\overline{EF}$ , and $D$ is closer to $\overline{FG}$ than to $\overline{EH}$ , as shown. Triangles $\triangle DGF$ and $\triangle CHG$ have equal areas. The area of rectangle $EFGH$ is $\frac{m}{n}$ , where $m$ and $n$ are relatively prime positive integers. Find $m+n$ .
Ground truth answer.
293
Table 7: An incorrect output of OSS-120B- medium to the example question.
Incorrect output with DTR = 13.9% and token count = 27,724.
We need to interpret geometry. Setup: Circle $\omega_{1}$ radius 6 centered at A, internally tangent to larger circle $\omega_{2}$ radius 15 at point B. So B is common point where $\omega_{1}$ touches inside $\omega_{2}$ . Since internal tangency, distance AB = 15-6 = 9, with B on line AB. Let’s set coordinate system. Place A at origin? maybe easier: Let $\omega_{1}$ centered at (0,0) radius 6. Then B is on circle $\omega_{1}$ and also on $\omega_{2}$ . Since $\omega_{2}$ larger radius 15 and centered at O? Let’s denote O is center of $\omega_{2}$ . Distance OA = 9 (since AB = 9) and AB =9? Wait AB = distance between centers = 15-6 = 9 yes. B lies on line OA, outside? Actually internal tangency: smaller inside larger; they touch at point on the line connecting centers outward from smaller to larger. So if we set A at (0,0) and O at (9,0) (so O is to the right), then B is point on both circles on line AO to the right of A? The point outward from A towards O at distance 6 from A, i.e., B is at (6,0). Indeed distance from O to B: O(9,0) to B(6,0) distance 3 which is radius 15? Wait radius of $\omega_{2}$ is 15, O to B should be 15. But O(9,0) to B(6,0) distance 3, not 15. Something wrong. Let’s set O to be (0,0) maybe easier. Place $\omega_{2}$ centered at O (0,0), radius 15. $\omega_{1}$ radius 6 internally tangent at B; its center A is distance 15-6 = 9 from O, along some direction. Choose B as point of tangency on line OA outward from A to O? Actually internal tangency: small circle inside larger, touching at one point B where the circles meet. The line of centers passes through B. The distance from O to B is 15 (radius). The distance from A to B is 6. Since O to A is 9. B is on the line OA beyond A? Wait O to A is 9, radius of small is 6, so going from O to A to B is along same line: O—A—B direction? Since B is on circumference of both circles. The distance from O to B = 15 = OA + AB? OA=9, AB=6 gives OA+AB=15 so yes B is outward from O through A. 
So O at origin, A at distance 9 from O in direction e.g., positive x. Let O at (0,0), A at (9,0). Then B is further along x axis at O to B distance 15, so B at (15,0). Check: distance from A (9,0) to B (15,0) =6 correct. Good. Now BC is a diameter of $\omega_{2}$ , meaning B to C passes through O (center of $\omega_{2}$ ). Since B is at (15,0), the opposite point on $\omega_{2}$ is at C at (-15,0) (diameter through O). So C is at (-15,0). So line BC is along x-axis. Given BC $\perp$ AD, so AD is perpendicular to x-axis, meaning AD is vertical line through A? Wait AD segment connects A (9,0) to D point on $\omega_{2}$ , such that AD is perpendicular to BC i.e., vertical line through A (x=9). Since AD must intersect $\omega_{2}$ at point D. So D is point where vertical line x=9 meets $\omega_{2}$ (radius 15). Equation $x^{2}+y^{2}=225$ . Substituting $x=9\Rightarrow 81+y^{2}=225\Rightarrow y^{2}=144\Rightarrow y=\pm 12$ . Which one? Since orientation maybe D is above? Not specified. We’ll consider D at (9,12) (upper) maybe. Could also be lower but rectangle orientation may pick one. Now rectangle EFGH is inscribed in $\omega_{1}$ (center A (9,0) radius 6). The rectangle is oriented not necessarily axis-aligned? We have EF parallel to GH, FG parallel to EH. Given EF $\perp$ BC (BC horizontal), so EF (and GH) are vertical? Wait EF is perpendicular to BC, which is horizontal, so EF is vertical. So one side of rectangle is vertical. Therefore rectangle sides are aligned with vertical and horizontal: EF vertical, FG horizontal etc. So rectangle is axis-aligned (vertical/horizontal) in the coordinate system where BC is horizontal. Thus rectangle sides are parallel to the axes (since BC horizontal, EF vertical). Good. Now C is closer to GH than to EF. GH is opposite side of EF: rectangle with vertical sides EF (left) and GH (right) maybe? Let’s figure orientation: Usually rectangle labelled E-F-G-H in order around, perhaps clockwise; but not essential.
If EF is vertical, then FG is horizontal, GH vertical, HE horizontal. So EF left side, GH right side (if rectangle sits to the right of EF). Not sure. C is at (-15,0) far left of both. "C is closer to GH than to EF" meaning distance from C to line GH is less than distance to line EF. Since C is far left, that would mean GH is left side, EF right side? Wait distance from C to a vertical line is horizontal distance. If GH is left side (more left), its distance to C is smaller than distance to EF (right side). That matches C being closer to GH. So GH is left side of rectangle (more left), EF right side. Good. Similarly, D is closer to FG than to EH. Since FG is top side (horizontal), EH bottom side (horizontal). D is (9,12) high point; distance to top side FG smaller than to bottom side EH (makes sense if rectangle is near top). So FG is top side, EH bottom side. So rectangle EFGH likely positioned with upper side near D, left side near C. … Omitted for illustrative purposes … Seems no integer factorization. Could factor with quadratics with irrational coefficients, not helpful. Thus x may be irrational; thus k not integer rational. Therefore area is irrational? Conflict with problem statement. Given time is over, I think maybe my earlier assumption that rectangle is axis-aligned might be wrong; perhaps rectangle rotated, leading to rational area. Given complexity, I think answer maybe m+n = something like 207? Let’s guess? Not good. Given limited time, I will output placeholder. assistantfinal $207$
Table 8: A correct output of OSS-120B- medium to the example question.
Correct output with DTR = 19.0% and token count = 3,725.
We need to parse geometry. Let’s define coordinates. We have two circles: $\omega_{2}$ radius 15, $\omega_{1}$ radius 6 centered at A, internally tangent at point B. So A is inside $\omega_{2}$ , distance AB = radius difference = 15 - 6 = 9. B is the point of tangency, along line AB. Let’s set coordinate system: Let AB be vertical? Choose B at origin maybe easier. Let’s position $\omega_{2}$ centered at O (call it O2) maybe O? Actually $\omega_{2}$ ’s center is some point O (call it O). $\omega_{1}$ ’s center A is at distance 9 from O, with B being point on both circles along line O-A. Let’s place O at (0,0). Let AB be along positive x-axis: place B at (15,0) maybe? Then $\omega_{2}$ radius 15 centered at O (0,0). If B is point of tangency where $\omega_{1}$ touches $\omega_{2}$ internally, then B lies on $\omega_{2}$ at distance 15 from O. So B at (15,0). Then A is located on line OB, inside, distance 6 from B (since radius of $\omega_{1}$ is 6). And O to A distance is 9 (since OA = 15-6 = 9). So O (0,0), B (15,0). Then A is at (9,0) (since O to A is 9 along same line). Yes A at (9,0). Then $\omega_{1}$ radius 6 centered at (9,0). Good. Now Points C and D lie on $\omega_{2}$ such that BC is a diameter of $\omega_{2}$ . So BC passes through center O and length 30 (diameter). Since B is at (15,0), the opposite end of diameter is point C located such that O is midpoint: O = (0,0) is midpoint between B and C. So C = (-15,0). Indeed because B (15,0) opposite across origin yields C (-15,0). So BC is horizontal line (x-axis). Good. … Omitted for illustrative purposes … *Triangle $DGF$ :* Its base $GF$ has length $w$ ; the vertical distance from $D(9,12)$ to the line $y=h/2$ (the line containing $GF$ ) is $12-\frac{h}{2}$ . Hence $[DGF]=\frac{1}{2}w\Bigl(12-\frac{h}{2}\Bigr).$ *Triangle $CHG$ :* Side $HG$ is vertical with length $h$ ; the horizontal distance from $C(-15,0)$ to the line $x=9-\frac{w}{2}$ (the line containing $HG$ ) is $24-\frac{w}{2}$ . 
Hence $[CHG]=\frac{1}{2}h\Bigl(24-\frac{w}{2}\Bigr).$ The problem states $[DGF]=[CHG]$ . Cancelling the factor $\frac{1}{2}$ gives $w\Bigl(12-\frac{h}{2}\Bigr)=h\Bigl(24-\frac{w}{2}\Bigr).$ The terms $-\frac{wh}{2}$ appear on both sides and cancel, leaving $12w=24h\quad\Longrightarrow\quad w=2h.$ — Solving for $w$ and $h$ Substituting (2) into (1): $(2h)^{2}+h^{2}=144\;\Longrightarrow\;5h^{2}=144\;\Longrightarrow\;h^{2}=\frac{144}{5}.$ Hence $h=\frac{12}{\sqrt{5}}=\frac{12\sqrt{5}}{5},\qquad w=2h=\frac{24}{\sqrt{5}}=\frac{24\sqrt{5}}{5}.$ The area of rectangle $EFGH$ is $\text{Area}=wh=2h^{2}=2\left(\frac{144}{5}\right)=\frac{288}{5}.$ Thus $m=288,\;n=5$ (relatively prime), and $m+n=288+5=293.$ $\boxed{293}$
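The closing algebra of the correct output can be verified mechanically with exact rationals; from $w=2h$ and $w^{2}+h^{2}=144$ we get $5h^{2}=144$ and area $=wh=2h^{2}$:

```python
from fractions import Fraction

# From the derivation: w = 2h and w^2 + h^2 = 144, so 5h^2 = 144.
h_sq = Fraction(144, 5)
area = 2 * h_sq                 # area = w*h = 2h^2 = 288/5
m, n = area.numerator, area.denominator  # Fraction is auto-reduced
print(m + n)  # 293
```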