# Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
**Authors**: Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng
> University of Virginia
Corresponding authors: wlchen@virginia.edu, liqianp@google.com
Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal “overthinking,” leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens—tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@$n$, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@$n$ matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Line Charts: Accuracy vs. Token Count and Deep-Thinking Ratio
### Overview
The image contains two side-by-side line charts comparing model performance metrics. The left chart plots accuracy against token count, while the right chart plots accuracy against a "Deep-Thinking Ratio." Both charts include four data series with distinct colors and correlation coefficients (r-values) indicating relationships between variables.
### Components/Axes
**Left Chart (Token Count vs. Accuracy):**
- **X-axis**: Token Count (2500–10000, increments of 2500)
- **Y-axis**: Accuracy (Pass@1) (0.5–0.8, increments of 0.1)
- **Legend**: Located at the bottom, mapping colors to benchmarks:
- Blue: AIME 25
- Green: AIME 24
- Red: HMMT 25
- Yellow: GPQA-D
- **Correlation Labels**: Overlaid on lines with r-values:
- AIME 25: r = -0.704
- AIME 24: r = -0.407
- HMMT 25: r = -0.783
- GPQA-D: r = -0.284
- **Title**: "Avg Correlation r = -0.544"
**Right Chart (Deep-Thinking Ratio vs. Accuracy):**
- **X-axis**: Deep-Thinking Ratio (0.135–0.180, increments of 0.015)
- **Y-axis**: Accuracy (Pass@1) (0.5–0.8, increments of 0.1)
- **Legend**: Same as left chart (bottom placement)
- **Correlation Labels**: Overlaid on lines with r-values:
- AIME 25: r = 0.862
- AIME 24: r = 0.715
- HMMT 25: r = 0.941
- GPQA-D: r = 0.795
- **Title**: "Avg Correlation r = 0.828"
### Detailed Analysis
**Left Chart Trends:**
1. **AIME 25 (Blue)**: Starts at ~0.8 accuracy at 2500 tokens, declines steadily to ~0.75 at 10,000 tokens (r = -0.704).
2. **AIME 24 (Green)**: Peaks at ~0.78 at 5000 tokens, then declines to ~0.72 at 10,000 tokens (r = -0.407).
3. **HMMT 25 (Red)**: Starts at ~0.65 at 2500 tokens, declines sharply to ~0.5 at 10,000 tokens (r = -0.783).
4. **GPQA-D (Yellow)**: Flat line at ~0.7 across all token counts (r = -0.284).
**Right Chart Trends:**
1. **AIME 25 (Blue)**: Increases from ~0.68 at 0.135 ratio to ~0.82 at 0.180 ratio (r = 0.862).
2. **AIME 24 (Green)**: Rises from ~0.7 to ~0.78, plateauing at higher ratios (r = 0.715).
3. **HMMT 25 (Red)**: Steep upward trend from ~0.55 to ~0.68 (r = 0.941).
4. **GPQA-D (Yellow)**: Gradual increase from ~0.65 to ~0.72 (r = 0.795).
### Key Observations
- **Negative Correlation (Left Chart)**: Accuracy on every benchmark except GPQA-D declines as token count increases; HMMT 25 shows the steepest decline (r = -0.783).
- **Positive Correlation (Right Chart)**: Accuracy on all four benchmarks improves with higher deep-thinking ratios; HMMT 25 shows the strongest relationship (r = 0.941).
- **GPQA-D**: Accuracy is roughly flat across token counts (left chart) but improves moderately with the deep-thinking ratio (right chart).
### Interpretation
Raw token count is an unreliable predictor of accuracy: HMMT 25 and AIME 25 degrade as generations grow longer, while GPQA-D stays flat. In contrast, accuracy rises consistently with the deep-thinking ratio on all four benchmarks. The contrast in average correlation magnitudes (-0.544 vs. +0.828) indicates that the deep-thinking ratio is a far more reliable predictor of accuracy than raw token count.
</details>
Figure 1: Comparison of correlations between accuracy and proxies for thinking effort. The plots illustrate the relationship between model performance and two inference-time measures of thinking effort on GPT-OSS-120B-medium across AIME 2024/2025, HMMT 2025, and GPQA-Diamond. (Left) Output token count exhibits a moderate negative correlation (average $r=-0.544$), suggesting that output length is an unreliable indicator of performance. (Right) In contrast, our proposed deep-thinking ratio demonstrates a strong positive correlation with accuracy (average $r=0.828$).
1 Introduction
Large language models (LLMs) have achieved remarkable reasoning capabilities by generating explicit thought traces, most notably through the Chain-of-Thought (CoT) paradigm [wei2022chain-d1a]. Prior works have shown that increasing the number of reasoning tokens generated can generally boost task performance [jaech2024openai, guo2025deepseek, anthropic2025claude3-7, anthropic2025claude4, oai2025o3mini, yang2025qwen3, team2025kimi, zhong2024evaluation], motivating methods that encourage longer and more elaborate thinking traces [muennighoff2025s1, balachandran2025inference-time-7c9, yeo2025demystifying-b6f].
However, a growing body of evidence suggests that token counts are unreliable indicators of model performance during inference, as longer reasoning does not consistently translate into higher accuracy [wu2025when-905, aggarwal2025optimalthinkingbench-3bf, sui2025stop-ced, su2025between-f85]. Empirical studies reveal inverted-U relationships between CoT length and performance [wu2025when-905], as well as inverse-scaling behaviors in which longer reasoning traces systematically degrade performance [gema2025inverse-bad]. Excessive reasoning may reflect overthinking, wherein models amplify flawed heuristics or fixate on irrelevant details [feng2025what-321]. Consequently, relying on length as a metric for reasoning quality not only encourages verbosity over clarity but also wastes computational resources on uninformative tokens. Though recent work has attempted to assess the semantic structure of CoTs (e.g., by representing reasoning traces as graphs), such approaches often rely on costly auxiliary parsing or external annotations [feng2025what-321]. Addressing these limitations requires more principled and efficient methods for measuring thinking effort that can distinguish effective reasoning from uninformative generation.
In this work, we introduce deep-thinking ratio (DTR) as a direct measure of inference-time thinking effort. Instead of relying on surface-level features like output length, we focus on how individual tokens are produced internally. We posit that when a token prediction stabilizes in early layers, subsequent depth-wise modifications entail relatively low computational effort, resembling less thinking. In contrast, token predictions that undergo sustained revision in deeper layers before converging reflect greater thinking [chuang2023dola-0c6]. We operationalize this idea by projecting intermediate-layer hidden states into the vocabulary space and comparing each layer’s prediction distribution to the final-layer distribution. Tokens whose distributions do not converge until deeper layers are identified as deep-thinking tokens. By counting the proportion of deep-thinking tokens in a generated sequence, we obtain DTR, which provides a simple, mechanistically grounded measure of thinking effort, requiring neither task-specific heuristics nor external structural annotations.
Across four challenging mathematical and scientific reasoning benchmarks—AIME 2024, AIME 2025, HMMT 2025, and GPQA [aops2024aime1, aops2024aime2, aops2025aime1, aops2025aime2, hmmt2025, rein2024gpqa]—and a range of reasoning-focused language models, including the GPT-OSS, DeepSeek-R1, and Qwen3 families [openai2025gpt-oss-120b-a33, guo2025deepseek, yang2025qwen3], we demonstrate that measuring deep-thinking tokens yields strong correlations with task accuracy. The achieved correlation is substantially higher than those obtained using length-based or confidence-based baselines. Furthermore, we show that deep-thinking tokens can be leveraged for parallel inference scaling, where preferentially selecting and aggregating responses with higher DTR achieves performance comparable to or better than standard consensus-based methods, while requiring only half the compute cost. Our contributions are summarized as follows:
- We introduce deep-thinking ratio (DTR)—a measure that counts the ratio of deep-thinking tokens in a sequence whose predictions undergo sustained revision in deeper layers before converging—as a new lens for characterizing inference-time thinking effort.
- We empirically show that, across multiple reasoning benchmarks and model families, DTR of a generated sequence exhibits strong positive correlations with task accuracy, outperforming length-based and confidence-based baselines significantly.
- We introduce Think@$n$, a test-time scaling strategy that preferentially selects and aggregates samples with higher DTR. By halting unpromising generations early based on DTR estimated from short prefixes, Think@$n$ matches or surpasses standard self-consistency at approximately half the inference cost.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Heatmap: Distribution of Terms Across Layers
### Overview
The image is a heatmap visualizing values for 15 tokens of a generated answer sequence (x-axis) across 35 layers (y-axis). The color intensity ranges from dark brown (0.0) to light yellow (1.0), indicating the magnitude of the value for each token-layer combination.
### Components/Axes
- **Y-axis (i-th Layer)**: Labeled "i-th Layer" with values from 1 to 35, representing discrete layers.
- **X-axis (Tokens)**: Labeled with the following token segments (left to right):
1. "A and B = 8 + 5 = 13"
2. "fun"
3. "boxed"
4. "13"
5. "in"
6. "July"
7. "Thus"
8. "the correct choice is"
9. "D"
10. "13 ."
11. "The final answer is"
12. "$boxed"
13. "13"
14. "D"
15. "$return>"
- **Color Legend**: Located on the right, with a gradient from dark brown (0.0) to light yellow (1.0). The legend is labeled with numerical values (0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
### Detailed Analysis
- **X-axis Categories**:
- "A and B = 8 + 5 = 13": High values (light yellow) in layers 33–35, with a peak at layer 33.
- "boxed": High values in layers 33–35, with a peak at layer 33.
- "13": Moderate values (orange) in layers 15–25, with a peak at layer 25.
- "in", "July", "Thus": Low values (dark brown) across most layers, with slight increases in layers 25–30.
- "the correct choice is": High values in layers 33–35, with a peak at layer 33.
- "D": High values in layers 33–35, with a peak at layer 33.
- "13 .": Moderate values in layers 15–25, with a peak at layer 25.
- "The final answer is": High values in layers 33–35, with a peak at layer 33.
- "$boxed": High values in layers 33–35, with a peak at layer 33.
- "$return>": High values in layers 33–35, with a peak at layer 33.
- **Y-axis Layers**:
- Layers 1–10: Dominated by dark brown (low values) across most categories.
- Layers 15–25: Moderate values (orange) in "13", "13 .", and "D".
- Layers 33–35: High values (light yellow) in "A and B = 8 + 5 = 13", "boxed", "the correct choice is", "D", "$boxed", and "$return>".
### Key Observations
1. **Peaks in Top Layers**: The highest values (light yellow) are concentrated in layers 33–35 for categories like "A and B = 8 + 5 = 13", "boxed", "the correct choice is", "D", "$boxed", and "$return>".
2. **Moderate Values in Middle Layers**: Layers 15–25 show moderate values (orange) for "13", "13 .", and "D".
3. **Low Values in Bottom Layers**: Layers 1–10 are predominantly dark brown, indicating minimal activity or significance.
4. **Anomalies**: The category "13" has a distinct peak at layer 25, while "D" shows peaks at layers 25 and 33.
### Interpretation
Read together with the caption, the heatmap traces how each token's intermediate-layer prediction diverges from the final-layer prediction. Functional and templated tokens (e.g., "in", "Thus") settle at relatively shallow layers, whereas tokens completing arithmetic (e.g., "A and B = 8 + 5 = 13") and answer tokens/symbols (e.g., "13", "D") keep being revised until the deepest layers. The answer token "13" also begins to surface at earlier layers after its first appearance, suggesting that the model's conclusion gradually stabilizes over the course of generation.
</details>
Figure 2: Heatmap of thought: We plot the Jensen–Shannon divergence (JSD) values between the distributions of the last (36th) layer and intermediate layers for an answer sequence from GPT-OSS-120B-high. Functional and templated words (e.g., “and”, “is”, “boxed”, “<|return|>”) often converge at relatively shallow layers; completions after operators (e.g., “+”, “=”) and answer tokens/symbols (e.g., “13”, “(D)”) do not settle until deeper layers. Interestingly, the answer token “13” gradually surfaces in earlier layers after its first appearance.
2 Measuring Deep-Thinking Ratio
2.1 Preliminaries
We consider an autoregressive language model $f_{\theta}$ composed of $L$ transformer layers, hidden dimension $d$ , and vocabulary $V$ . Given a prefix sequence $y_{<t}$ , the forward pass at generation step $t$ produces a sequence of residual stream states $\{h_{t,l}\}_{l=1}^{L}$ , where $h_{t,l}∈\mathbb{R}^{d}$ denotes the hidden state after layer $l$ . The final-layer output $h_{t,L}$ is projected by the language modeling head (i.e., the unembedding matrix) $W_{U}∈\mathbb{R}^{|V|× d}$ to produce logits over the vocabulary.
Prior research on early exiting [teerapittayanon2016branchynet, elbayad2019depth, schuster2022confident, din2024jump, belrose2023eliciting] has demonstrated that, without specialized auxiliary training, applying the language modeling head directly to intermediate-layer hidden states effectively yields meaningful predictive distributions [nostalgebraist2020lens, kao2020bert]. Building on this line of works, we project intermediate-layer hidden states into the vocabulary space using the same unembedding matrix $W_{U}$ . For each intermediate layer $l∈\{1,...,L-1\}$ , we compute the logit vector $z_{t,l}$ and probability distribution $p_{t,l}$ as
$$
\displaystyle p_{t,l}=\mathrm{softmax}(z_{t,l}),\;\;\;z_{t,l} \displaystyle=W_{U}h_{t,l} \tag{1}
$$
The model’s final-layer distribution is denoted by $p_{t,L}$ .
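As a concrete sketch of the projection in Eq. (1), the intermediate-layer distributions can be obtained with a few lines of NumPy. The array names and shapes below (`hidden_states` of shape $(L, d)$, `W_U` of shape $(|V|, d)$) are illustrative assumptions rather than the paper's actual implementation:

```python
import numpy as np

def layerwise_distributions(hidden_states: np.ndarray, W_U: np.ndarray) -> np.ndarray:
    """Project each layer's hidden state through the unembedding matrix (Eq. 1).

    hidden_states: (L, d) residual-stream states h_{t,l} for one generation step t.
    W_U:           (|V|, d) unembedding (LM head) matrix.
    Returns the per-layer distributions p_{t,l}, shape (L, |V|).
    """
    logits = hidden_states @ W_U.T                    # z_{t,l} = W_U h_{t,l}
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)  # softmax over the vocabulary
```

In practice, the hidden states would come from a forward pass with intermediate activations exposed (e.g., via `output_hidden_states` in common transformer libraries).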
2.2 Deep-Thinking Tokens
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Deep-Thinking Regime with JSD Threshold Analysis
### Overview
The diagram illustrates a multi-layered "Deep-Thinking Regime" (1st to 10th layers) where each layer generates a probability distribution (`p1st` to `p10th`). A Jensen-Shannon Divergence (JSD) calculation compares the 10th layer's distribution (`p10th`) against all other layers (`pith`). Results are evaluated against a threshold of 0.5, with green checks (✓) for values below the threshold and red crosses (✗) for values above.
---
### Components/Axes
1. **Left Panel: Deep-Thinking Regime**
- Vertical stack of 10 layers (1st to 10th), labeled with their respective layer numbers.
- Each layer outputs a probability distribution (`p1st` to `p10th`), represented as histograms.
- Layers are color-coded: darker purple for higher layers (10th–8th), lighter purple for lower layers (7th–1st).
2. **Middle Panel: JSD Computation**
- Title: "Compute JSD(p₁₀ᵗʰ || pᵢᵗʰ)".
- Vertical axis lists `p10th` to `p1st` (top to bottom).
- Horizontal axis shows JSD values (0.00 to 0.96) with incremental markers (0.00, 0.08, 0.36, etc.).
- Dotted lines connect `p10th` to each `pith` for visual comparison.
3. **Right Panel: Threshold Evaluation**
- Title: "< Threshold 0.5?".
- Vertical axis lists JSD values (0.00, 0.08, 0.36, 0.76, 0.78, 0.82, 0.86, 0.85, 0.93, 0.96).
- Green checks (✓) for values < 0.5; red crosses (✗) for values ≥ 0.5.
---
### Detailed Analysis
1. **Layer Outputs (Left Panel)**
- All layers show distinct histogram distributions, with no explicit numerical values provided for individual bins.
2. **JSD Values (Middle Panel)**
- **p10th vs. p10th**: JSD = 0.00 (✓).
- **p10th vs. p9th**: JSD = 0.08 (✓).
- **p10th vs. p8th**: JSD = 0.36 (✓).
- **p10th vs. p7th**: JSD = 0.76 (✗).
- **p10th vs. p6th**: JSD = 0.78 (✗).
- **p10th vs. p5th**: JSD = 0.82 (✗).
- **p10th vs. p4th**: JSD = 0.86 (✗).
- **p10th vs. p3rd**: JSD = 0.85 (✗).
- **p10th vs. p2nd**: JSD = 0.93 (✗).
- **p10th vs. p1st**: JSD = 0.96 (✗).
3. **Threshold Evaluation (Right Panel)**
- Values < 0.5 (0.00, 0.08, 0.36) are marked with green checks (✓).
- Values ≥ 0.5 (0.76–0.96) are marked with red crosses (✗).
---
### Key Observations
1. **Trend in JSD Values**:
- Higher layers (10th, 9th, 8th) exhibit significantly lower JSD values compared to lower layers (7th–1st).
- JSD increases nearly monotonically as layers descend from 10th to 1st (with a slight dip at the 3rd layer).
2. **Threshold Compliance**:
- Only the top 3 layers (10th, 9th, 8th) meet the JSD threshold (< 0.5).
- All lower layers (7th–1st) exceed the threshold, indicating greater divergence from `p10th`.
3. **Distribution Similarity**:
- The 10th layer is perfectly aligned with itself (JSD = 0.00).
- The 9th and 8th layers show moderate similarity (JSD = 0.08, 0.36), while lower layers diverge sharply.
---
### Interpretation
In the context of the method, the figure traces how a single token's predictive distribution evolves with depth. The JSD to the final-layer distribution first drops below the threshold g = 0.5 at the 8th layer, so the token's settling depth is the 8th layer. With depth fraction ρ = 0.8 and L = 10, the deep-thinking regime comprises layers 8–10; since the settling depth falls inside this regime, the token is classified as a deep-thinking token, meaning its prediction keeps being revised until the late layers before converging. Tokens that instead settle at shallower layers would not qualify, reflecting less internal revision.
</details>
Figure 3: Illustration of our method for identifying deep-thinking tokens. Consider a model with 10 layers and depth fraction $\rho=0.8$. The token at generation step $t$ is classified as a deep-thinking token because its JSD with the final-layer distribution first falls below the threshold $g$ only once it reaches the late-settling regime (layers 8–10).
Algorithm 1: Computing Deep-Thinking Ratio (DTR)

**Input:** Autoregressive LM $f_{\theta}$ with $L$ layers and unembedding matrix $W_{U}$; input prompt $x$; threshold $g$; depth fraction $\rho$.
**Output:** $\mathrm{DTR}(S)$ of the generated sequence $S$.

1. $C\leftarrow 0$ // deep-thinking token count
2. $S\leftarrow\emptyset$ // generated sequence
3. $y_{t}\leftarrow\mathtt{[BOS]}$ // initialize with start token
4. **while** $y_{t}\neq\mathtt{[EOS]}$ **do**
5. &emsp;Sample $y_{t}\sim p_{t,L}\!\left(f_{\theta}(\cdot\mid x,S)\right)$; $S\leftarrow(S,y_{t})$
6. &emsp;**for** $l\leftarrow 1$ **to** $L$ **do**
7. &emsp;&emsp;$p_{t,l}\leftarrow\mathrm{softmax}(W_{U}h_{t,l})$; $D_{t,l}\leftarrow\mathrm{JSD}(p_{t,L},p_{t,l})$
8. &emsp;**end for**
9. &emsp;$c_{t}\leftarrow\min\{l:\min_{j\leq l}D_{t,j}\leq g\}$
10. &emsp;**if** $c_{t}\geq\lceil\rho L\rceil$ **then** $C\leftarrow C+1$
11. **end while**
12. **return** $C/|S|$
We posit that inference-time thinking effort for a token manifests as the continued evolution of its predictive distributions (i.e., $p_{t,l}$) across LM layers. Tokens whose distributions stabilize early require little additional thinking, while tokens whose distributions stabilize late require more extended internal thinking. In other words, simple tokens settle early with shallow computation, whereas difficult tokens exhibit distributional shifts in deeper layers, consuming more computation. To illustrate this, we show a motivating example on answering a GPQA [rein2024gpqa] question in Figure 2.
To quantify this behavior, we measure how long a token’s predictive distribution continues to change before settling, operationalized as the layer at which the intermediate distribution becomes sufficiently close to the final-layer distribution. Specifically, for each generation step $t$ and layer $l$ , we compute the Jensen–Shannon divergence (JSD) between the intermediate-layer distribution $p_{t,l}$ and the final-layer distribution $p_{t,L}$ :
$$
\displaystyle D_{t,l} \displaystyle\;\coloneqq\;\operatorname{JSD}\!\left(p_{t,L}\,\|\,p_{t,l}\right) \displaystyle=H\!\left(\frac{p_{t,L}+p_{t,l}}{2}\right)-\tfrac{1}{2}H(p_{t,L})-\tfrac{1}{2}H(p_{t,l}), \tag{2}
$$
where $H(·)$ denotes Shannon entropy. By construction, $D_{t,L}=0$. A trajectory $l\mapsto D_{t,l}$ that approaches zero only at later layers indicates prolonged distributional revision (think more), whereas early convergence indicates that the model settles on its final prediction with fewer subsequent updates (think less). We employ JSD due to its symmetry and boundedness, following [chuang2023dola-0c6]. We explore other distance metrics in Appendix A.
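Written out, the JSD of Eq. (2) reduces to three entropy evaluations. A small NumPy sketch (the function names are our own, not the paper's code) mirrors the formula term by term:

```python
import numpy as np

def shannon_entropy(p: np.ndarray) -> float:
    """H(p) = -sum_v p(v) log p(v), skipping zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def jsd(p_final: np.ndarray, p_layer: np.ndarray) -> float:
    """Jensen-Shannon divergence of Eq. (2): H((p+q)/2) - H(p)/2 - H(q)/2."""
    m = 0.5 * (p_final + p_layer)
    return shannon_entropy(m) - 0.5 * shannon_entropy(p_final) - 0.5 * shannon_entropy(p_layer)
```

With natural log, the JSD is symmetric and bounded in $[0, \log 2]$, which is what makes a fixed threshold $g$ meaningful across tokens and layers.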
To enforce a strict notion of settling, we compute:
$$
\displaystyle\bar{D}_{t,l}=\min_{j\leq l}D_{t,j}. \tag{3}
$$
We define the settling depth $c_{t}$ as the first layer at which $\bar{D}_{t,l}$ falls below a fixed threshold $g$ :
$$
\displaystyle c_{t}=\min\left\{l\in\{1,\ldots,L\}:\bar{D}_{t,l}\leq g\right\}. \tag{4}
$$
We then define a deep-thinking regime using a depth fraction $\rho∈(0,1)$ , with
$$
\displaystyle\mathcal{L}_{\text{deep-thinking}}=\left\{l:l\geq\left\lceil\rho\times L\right\rceil\right\}. \tag{5}
$$
A token is classified as a deep-thinking token (i.e., requiring more layer computations and more thinking effort to become sufficiently close to the final-layer distribution) if $c_{t}∈\mathcal{L}_{\text{deep-thinking}}$. An illustration is shown in Figure 3.
Finally, for a generated sequence $S$ of length $T$ , we define the deep-thinking ratio, $\mathrm{DTR}(S)$ , for the sequence as the proportion of tokens that settle in the late regime:
$$
\displaystyle\mathrm{DTR}(S)=\frac{1}{T}\sum_{t=1}^{T}\mathbb{1}\!\left[c_{t}\in\mathcal{L}_{\text{deep-thinking}}\right]. \tag{6}
$$
A higher DTR indicates that a larger fraction of tokens undergo extended computation for distributional revision before stabilizing. We note that our proposed method does not imply that early-settling tokens are suboptimal; rather, it provides a depth-wise characterization of inference-time thinking effort that complements the surface-level token-length measure. We show the overall algorithm of DTR in Algorithm 1. We also provide qualitative examples in Appendix E.
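Putting Eqs. (3)–(6) together, the sequence-level computation can be sketched as follows, assuming the per-token, per-layer JSD values have already been collected into a $(T, L)$ matrix (the names and data layout are our own illustrative choices):

```python
import numpy as np

def settling_depth(jsd_per_layer: np.ndarray, g: float) -> int:
    """First layer l (1-indexed) at which min_{j<=l} D_{t,j} <= g (Eqs. 3-4).

    The running minimum enforces the strict notion of settling; since the
    final layer has D_{t,L} = 0, a qualifying layer always exists.
    """
    running_min = np.minimum.accumulate(jsd_per_layer)
    return int(np.nonzero(running_min <= g)[0][0]) + 1

def deep_thinking_ratio(jsd_matrix: np.ndarray, g: float = 0.5, rho: float = 0.85) -> float:
    """DTR of a sequence (Eq. 6).

    jsd_matrix: (T, L) array with entry [t, l-1] = JSD(p_{t,L} || p_{t,l}).
    A token is deep-thinking if its settling depth c_t >= ceil(rho * L) (Eq. 5).
    """
    T, L = jsd_matrix.shape
    boundary = int(np.ceil(rho * L))
    deep = sum(settling_depth(jsd_matrix[t], g) >= boundary for t in range(T))
    return deep / T
```

The running minimum means a momentary dip below $g$ at a shallow layer counts as settling there, so only tokens whose divergence stays above the threshold until the late regime are counted as deep-thinking.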
3 Deep-Thinking Ratio Reflects Task Accuracy More Reliably
We empirically evaluate whether our distributional distance-based measurement provides a more faithful and robust characterization of inference-time thinking effort than surface-level, length-based proxies (i.e., token counts).
Models.
We evaluate eight variants of reasoning LLMs from three model families: GPT-OSS-20B and GPT-OSS-120B (each with low, medium, and high reasoning levels) [openai2025gpt-oss-120b-a33], DeepSeek-R1-70B [guo2025deepseek], and Qwen3-30B-Thinking [yang2025qwen3]. (For brevity, we use DeepSeek-R1-70B to refer to Llama-3.3-70B-Instruct distilled with DeepSeek-R1-generated samples: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B.) These models are known for their strong long-CoT capability in mathematical and complex reasoning, and span multiple parameter scales for comprehensive coverage.
Tasks.
We focus on reasoning-intensive benchmarks where scaling CoT-style computation at inference time plays a central role. We adopt four benchmarks widely used in recent evaluations of LLM reasoning capabilities [xai2025grok4, openai2025gpt5, balunovic2025matharena], including three competition-level mathematical problem sets, AIME 2024 [aops2024aime1, aops2024aime2], AIME 2025 [aops2025aime1, aops2025aime2], and HMMT 2025 [hmmt2025], as well as the diamond set of GPQA [rein2024gpqa], which consists of challenging graduate-level scientific questions.
Decoding settings.
Following [gema2025inverse-bad], we prompt models to reason step by step using a fixed, neutral instruction, without specifying a reasoning budget or explicitly encouraging longer deliberation. This setup allows each model to naturally allocate inference-time computation on a per-instance basis, avoiding confounds introduced by externally imposed token budgets or budget-conditioning prompts. Following standard practice in natural-overthinking analyses [gema2025inverse-bad], we sample multiple responses for each question (25 responses per question in our experiments). Across these samples, models naturally exhibit variation in reasoning length and internal computation patterns. We use the developer-recommended sampling parameters for all tested models: temperature = 1.0 and top-p = 1.0 for the GPT-OSS series; temperature = 0.6 and top-p = 0.95 for DeepSeek-R1-70B and Qwen3-30B-Thinking.
For each sampled response, we record intermediate-layer hidden states, obtain their projected probability distributions, and compute DTR as described in Section 2. We uniformly set the settling threshold $g=0.5$ and the depth fraction $\rho=0.85$ to define the deep-thinking regime; we analyze sensitivity to these values in Section 3.2. The reported statistics are averaged over 30 random seeds across decoding runs.
3.1 Results
To quantify the relationship between inference-time thinking effort and task performance, we measure the association between thinking-effort scores and answer accuracy via the Pearson correlation coefficient. Specifically, we conduct a binned analysis following [gema2025inverse-bad] by partitioning sampled sequences into quantile bins (5 bins) based on their DTR (Equation 6) and computing the average accuracy within each bin.
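For concreteness, the binned analysis can be sketched as below; this is an illustrative NumPy implementation under our own naming, not the authors' released code:

```python
import numpy as np

def binned_pearson(scores: np.ndarray, correct: np.ndarray, n_bins: int = 5) -> float:
    """Pearson r between per-bin mean effort score and per-bin accuracy.

    scores:  per-sample effort measure (e.g., DTR) for the sampled responses.
    correct: per-sample 0/1 answer correctness.
    Samples are partitioned into quantile bins by score, as in the binned analysis.
    """
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            xs.append(scores[mask].mean())   # mean effort score within the bin
            ys.append(correct[mask].mean())  # accuracy within the bin
    return float(np.corrcoef(np.asarray(xs), np.asarray(ys))[0, 1])
```

Quantile bins keep each bin roughly equally populated, so the correlation is not dominated by sparse tails of the score distribution.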
Table 1: Pearson correlations between task accuracy and different inference-time measures, including length-based and confidence-based baselines, across eight model variants and four reasoning benchmarks. Correlation values are color-coded: strong positive correlations ( $0.5\sim 1$ ) are shown in dark green, weak positive correlations ( $0\sim 0.5$ ) in light green, weak negative correlations ( $-0.5\sim 0$ ) in light orange, and strong negative correlations ( $-1\sim-0.5$ ) in dark orange.
| | Token Length | Reverse Token Length | Log Probability | Negative Perplexity | Negative Entropy | Self-Certainty | DTR (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AIME 2025 | | | | | | | |
| OSS-120B-low | 0.504 | -0.504 | 0.872 | 0.453 | 0.863 | 0.803 | 0.930 |
| OSS-120B-medium | -0.365 | 0.365 | 0.817 | 0.246 | 0.822 | 0.815 | 0.862 |
| OSS-120B-high | -0.961 | 0.961 | 0.705 | 0.552 | 0.711 | 0.728 | 0.796 |
| OSS-20B-low | -0.689 | 0.689 | 0.579 | 0.849 | 0.665 | 0.275 | 0.373 |
| OSS-20B-medium | -0.757 | 0.757 | 0.616 | -0.677 | 0.637 | 0.097 | 0.161 |
| OSS-20B-high | -0.385 | 0.385 | 0.455 | -0.795 | 0.550 | 0.489 | 0.610 |
| DeepSeek-R1-70B | -0.973 | 0.973 | 0.961 | 0.955 | 0.946 | 0.899 | 0.974 |
| Qwen3-30B-Thinking | -0.663 | 0.663 | -0.008 | -0.035 | 0.154 | 0.828 | 0.855 |
| AIME 2024 | | | | | | | |
| OSS-120B-low | -0.166 | 0.166 | 0.897 | 0.682 | 0.869 | 0.741 | 0.840 |
| OSS-120B-medium | -0.680 | 0.680 | 0.795 | -0.293 | 0.908 | 0.924 | 0.533 |
| OSS-120B-high | -0.755 | 0.755 | 0.700 | -0.275 | 0.593 | 0.654 | 0.905 |
| OSS-20B-low | -0.655 | 0.655 | 0.548 | -0.342 | 0.667 | 0.584 | 0.730 |
| OSS-20B-medium | -0.827 | 0.827 | 0.195 | -0.150 | 0.440 | 0.252 | -0.192 |
| OSS-20B-high | -0.989 | 0.989 | 0.809 | 0.262 | 0.921 | 0.855 | 0.824 |
| DeepSeek-R1-70B | -0.987 | 0.987 | -0.037 | 0.223 | 0.067 | 0.287 | 0.430 |
| Qwen3-30B-Thinking | -0.869 | 0.869 | -0.857 | -0.720 | -0.680 | -0.246 | -0.657 |
| GPQA-Diamond | | | | | | | |
| OSS-120B-low | 0.682 | -0.682 | 0.984 | 0.172 | 0.995 | 0.996 | 0.976 |
| OSS-120B-medium | -0.340 | 0.340 | 0.973 | 0.316 | 0.985 | 0.981 | 0.823 |
| OSS-120B-high | -0.970 | 0.970 | 0.854 | 0.501 | 0.813 | 0.885 | 0.845 |
| OSS-20B-low | -0.602 | 0.602 | 0.984 | 0.235 | 0.991 | 0.917 | 0.935 |
| OSS-20B-medium | -0.847 | 0.847 | 0.914 | 0.468 | 0.911 | 0.889 | 0.718 |
| OSS-20B-high | -0.794 | 0.794 | 0.879 | 0.461 | 0.902 | 0.915 | 0.992 |
| DeepSeek-R1-70B | -0.930 | 0.930 | 0.068 | -0.133 | -0.165 | -0.532 | 0.885 |
| Qwen3-30B-Thinking | -0.634 | 0.634 | 0.589 | 0.865 | 0.711 | 0.943 | 0.828 |
| HMMT 2025 | | | | | | | |
| OSS-120B-low | 0.871 | -0.871 | 0.761 | 0.629 | 0.695 | 0.884 | 0.305 |
| OSS-120B-medium | -0.793 | 0.793 | 0.706 | 0.045 | 0.618 | 0.631 | 0.926 |
| OSS-120B-high | -0.967 | 0.967 | 0.750 | 0.503 | 0.728 | 0.754 | 0.972 |
| OSS-20B-low | -0.634 | 0.634 | -0.695 | 0.549 | -0.359 | -0.489 | 0.689 |
| OSS-20B-medium | -0.668 | 0.668 | 0.447 | 0.336 | 0.424 | 0.331 | 0.247 |
| OSS-20B-high | -0.352 | 0.352 | 0.537 | 0.994 | 0.831 | 0.628 | 0.932 |
| DeepSeek-R1-70B | -0.866 | 0.866 | 0.879 | 0.889 | 0.858 | 0.905 | 0.902 |
| Qwen3-30B-Thinking | -0.950 | 0.950 | -0.803 | -0.762 | -0.801 | 0.745 | 0.911 |
| Average | -0.594 | 0.594 | 0.527 | 0.219 | 0.571 | 0.605 | 0.683 |
We compare our deep-thinking ratio against the following baselines, spanning length-based proxies and confidence-based measures commonly adopted to assess generation quality.
Token count.
The total number of tokens generated in the model’s output reasoning traces. This measure is widely framed as a direct proxy for test-time compute, and underlies many empirical studies of inference-time scaling [jaech2024openai, guo2025deepseek, anthropic2025claude3-7, anthropic2025claude4, oai2025o3mini, yang2025qwen3, team2025kimi, zhong2024evaluation].
Reverse token count.
As a complementary baseline, we additionally consider reverse token count, defined as the negative of the total number of generated tokens for each response. This transformation is included to account for the frequently observed inverse relationship between reasoning length and accuracy in LLM overthinking [wu2025when-905, gema2025inverse-bad].
Log probability.
Following the notation in Section ˜ 2, let $S=(y_{1},...,y_{T})$ denote a generated sequence. At generation step $t$ , the model’s output prediction distribution (at the final layer $L$ ) over the vocabulary $\mathcal{V}$ is denoted by $p_{t,L}(\cdot)$ . We compute the average log-probability of the sampled tokens:
$$
\displaystyle\mathrm{LogProb}(S)\;=\;\frac{1}{T}\sum_{t=1}^{T}\log p_{t,L}(y_{t}) \tag{7}
$$
Higher values indicate that the model assigns higher likelihood to its own generation and are commonly interpreted as higher confidence.
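To make the measure concrete, Eq. (7) can be computed directly from the per-step distributions. The snippet below is a minimal sketch with made-up toy probabilities:

```python
import math

def avg_log_prob(dists, tokens):
    # Average log-probability of the sampled tokens (Eq. 7):
    # (1/T) * sum_t log p_{t,L}(y_t)
    T = len(tokens)
    return sum(math.log(dists[t][tokens[t]]) for t in range(T)) / T

# Toy example: two generation steps over a three-token vocabulary.
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
tokens = [0, 1]  # sampled token ids
score = avg_log_prob(dists, tokens)  # (log 0.7 + log 0.8) / 2
```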
Negative perplexity.
Perplexity is defined as the exponentiated negative average log-probability:
$$
\displaystyle\mathrm{PPL}(S)\;=\;\exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log p_{t,L}(y_{t})\right) \tag{8}
$$
We report negative perplexity $-\mathrm{PPL}(S)$ so that larger values correspond to higher confidence.
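A minimal sketch of Eq. (8) on toy inputs (the distributions are made up); note that negative perplexity is a monotone transform of the average log-probability above:

```python
import math

def neg_perplexity(dists, tokens):
    # Negative perplexity (Eq. 8): -exp of the negative average
    # log-probability; higher (less negative) means more confident.
    T = len(tokens)
    avg_lp = sum(math.log(dists[t][tokens[t]]) for t in range(T)) / T
    return -math.exp(-avg_lp)
```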
Negative entropy.
To incorporate information from the full prediction distribution over $\mathcal{V}$ rather than only the sampled token, we compute the average entropy:
$$
\displaystyle\mathrm{Ent}(S)\;=\;\frac{1}{T}\sum_{t=1}^{T}H(p_{t,L}),\;\;\;H(p_{t,L})=-\sum_{v\in\mathcal{V}}p_{t,L}(v)\log p_{t,L}(v) \tag{9}
$$
We report negative entropy $-\mathrm{Ent}(S)$ , where larger values indicate more peaked distributions and thus greater model confidence.
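A minimal sketch of Eq. (9); unlike the previous two measures, it uses the full per-step distribution rather than only the sampled token:

```python
import math

def neg_avg_entropy(dists):
    # Negative average Shannon entropy (Eq. 9); larger values
    # correspond to more peaked (more confident) distributions.
    def H(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return -sum(H(p) for p in dists) / len(dists)
```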
Self-Certainty.
We also include Self-Certainty [kang2025scalable-de3], a distributional confidence metric based on the idea that higher confidence corresponds to prediction distributions that are further from the uniform distribution $u$ , which represents maximum uncertainty. Formally, self-certainty is defined as the average Kullback-Leibler (KL) divergence between $u(v)=1/|\mathcal{V}|$ and $p_{t,L}$ :
$$
\displaystyle\mathrm{Self}\text{-}\mathrm{Certainty}(S)\;=\;\frac{1}{T}\sum_{t=1}^{T}\mathrm{KL}\!\left(u\,\|\,p_{t,L}\right)\;=\;-\frac{1}{T|\mathcal{V}|}\sum_{t=1}^{T}\sum_{v\in\mathcal{V}}\log\!\big(|\mathcal{V}|\,p_{t,L}(v)\big) \tag{10}
$$
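A minimal sketch of Eq. (10), with a direct computation of $\mathrm{KL}(u\,\|\,p)$ included as a sanity check that the closed form matches:

```python
import math

def self_certainty(dists):
    # Eq. 10: average KL(u || p_t), written in closed form.
    T, V = len(dists), len(dists[0])
    return -sum(math.log(V * p[v]) for p in dists for v in range(V)) / (T * V)

def kl_from_uniform(p):
    # Direct KL(u || p) for a single distribution, to check the closed form.
    V = len(p)
    u = 1.0 / V
    return sum(u * math.log(u / pv) for pv in p)
```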
For all baselines, correlations are computed using the same protocol: sequences are ranked by token count (or its negation) or by confidence score, then binned.
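This ranking-and-binning protocol can be sketched as follows; the bin count and equal-size chunking here are illustrative assumptions, not the paper's exact configuration:

```python
def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def binned_correlation(scores, correct, n_bins=5):
    # Rank sequences by a score, split into equal-size bins, and
    # correlate per-bin mean score with per-bin accuracy.
    # (Any leftover sequences beyond n_bins * size are dropped.)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins
    xs, ys = [], []
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        xs.append(sum(scores[i] for i in idx) / len(idx))
        ys.append(sum(correct[i] for i in idx) / len(idx))
    return pearson_r(xs, ys)
```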
Table ˜ 1 reports the correlation between task accuracy and the different measurements across eight model variants and four benchmarks. As observed, token count exhibits predominantly negative values ( $r<0$ ), with mean $r=-0.59$ . This indicates that longer generations are associated with lower performance, aligning with recent reports of inverse scaling and overthinking: extended reasoning traces can be symptomatic of redundant, misguided, or error-amplifying deliberation. These results underscore the unreliability of surface-level length as a proxy for effective problem solving. Reversing token count yields a positive correlation of identical magnitude; however, the improvement is purely post hoc, reflecting the empirical regularity that, in these regimes, shorter responses tend to be more accurate. As such, reverse token count serves only as a statistical adjustment rather than capturing a principled notion of computation or thinking effort.
Compared to token count, the confidence-based measures (log probability, negative perplexity, negative entropy, and self-certainty) exhibit moderately positive correlations, with means ranging from $r=0.219$ to $0.605$ . This indicates that model confidence captures partial information about correctness. However, their behavior is relatively heterogeneous across models and benchmarks: while certain configurations achieve strong positive correlations, others deteriorate to weak or even negative associations. This inconsistency suggests that confidence signals may conflate other factors, such as overconfidence, and therefore do not reliably reflect inference-time compute or problem-solving effectiveness.
In contrast, our proposed DTR measurement demonstrates the strongest and most stable relationship with task performance, achieving the highest average correlation of $r=0.683$ and outperforming reverse token count and Self-Certainty, the strongest length-based and confidence-based baselines, respectively. DTR remains positive across nearly all models and benchmarks, exhibiting the fewest negative values (2 out of the 32 model–benchmark settings tested). Collectively, these results show that computing DTR over output sequences provides a more faithful and robust characterization of successful reasoning than token volume or confidence-based alternatives.
3.2 Effect of Settling Thresholds and Depth Fractions
<details>
<summary>x4.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs Deep-Thinking Ratio by Threshold
### Overview
The image shows a scatter plot comparing accuracy (Pass@1) against deep-thinking ratio for three different threshold values (g=2.5e-01, g=5.0e-01, g=7.5e-01). Each threshold is represented by a distinct color-coded data series with connecting lines and shaded confidence intervals. The plot includes correlation coefficients (r-values) for each series.
### Components/Axes
- **X-axis**: Deep-Thinking Ratio (0.24 to 0.48)
- **Y-axis**: Accuracy (Pass@1) (0.600 to 0.700)
- **Legend**:
- Blue circles: threshold g=2.5e-01
- Brown circles: threshold g=5.0e-01
- Cyan circles: threshold g=7.5e-01
- **Annotations**:
- r = 0.820 (blue series)
- r = 0.962 (brown series)
- r = 0.012 (cyan series)
### Detailed Analysis
1. **Blue Series (g=2.5e-01)**:
- Positioned at x ≈ 0.48
- Accuracy range: 0.635–0.650
- Correlation: r = 0.820 (strong positive trend)
- Data points show moderate clustering with slight upward trend
2. **Brown Series (g=5.0e-01)**:
- Positioned at x ≈ 0.32
- Accuracy range: 0.625–0.660
- Correlation: r = 0.962 (very strong positive trend)
- Tight clustering with clear upward trajectory
3. **Cyan Series (g=7.5e-01)**:
- Positioned at x ≈ 0.24
- Accuracy range: 0.620–0.665
- Correlation: r = 0.012 (negligible correlation)
- Widely scattered points with no discernible pattern
### Key Observations
- The brown series (g=5.0e-01) demonstrates the strongest linear relationship (r=0.962) between deep-thinking ratio and accuracy
- The cyan series (g=7.5e-01) shows essentially no correlation (r=0.012) despite having the highest deep-thinking ratio
- All series exhibit similar accuracy ranges (0.620–0.665), but with varying distributions
- Confidence intervals (shaded areas) are widest for the cyan series, indicating greater uncertainty
### Interpretation
The data suggests that threshold g=5.0e-01 optimally balances deep-thinking ratio and accuracy, with the strongest predictive relationship (r=0.962). The threshold g=7.5e-01 appears ineffective at correlating these metrics, while g=2.5e-01 shows moderate effectiveness. The similar accuracy ranges across thresholds imply that performance differences stem primarily from deep-thinking ratio distribution rather than absolute accuracy levels. The strong correlation in the brown series indicates a potential causal relationship worth investigating further, possibly representing an optimal operational threshold for the system being measured.
</details>
(a) Effect of different settling threshold $g$ .
<details>
<summary>x5.png Details</summary>

### Visual Description
## Scatter Plot: Relationship Between Deep-Thinking Ratio and Accuracy (Pass@1)
### Overview
The image is a scatter plot showing the relationship between **Deep-Thinking Ratio** (x-axis) and **Accuracy (Pass@1)** (y-axis). Four distinct data series are represented by different colors, each corresponding to a specific **depth fraction** (ρ). Each series includes a trend line and a correlation coefficient (r) indicating the strength of the relationship. The plot highlights how accuracy varies with the deep-thinking ratio across different depth fractions.
---
### Components/Axes
- **X-axis (Deep-Thinking Ratio)**: Ranges from 0.0 to 0.45, with increments of 0.1.
- **Y-axis (Accuracy (Pass@1))**: Ranges from 0.6 to 0.7, with increments of 0.025.
- **Legend**: Located in the **bottom-left corner**, mapping colors to depth fractions:
- **Blue**: depth fraction ρ = 8.0e-1
- **Red**: depth fraction ρ = 8.5e-1
- **Pink**: depth fraction ρ = 9.0e-1
- **Cyan**: depth fraction ρ = 9.5e-1
- **Trend Lines**: Each data series has a linear regression line with a labeled correlation coefficient (r).
---
### Detailed Analysis
#### Data Series and Trends
1. **Blue Series (ρ = 8.0e-1)**:
- **Data Points**:
- (0.0, 0.65), (0.1, 0.64), (0.2, 0.63), (0.3, 0.62), (0.4, 0.61)
- **Trend**: Slight downward slope (r = 0.979).
- **Observation**: Accuracy decreases marginally as the deep-thinking ratio increases.
2. **Red Series (ρ = 8.5e-1)**:
- **Data Points**:
- (0.2, 0.63), (0.25, 0.64), (0.3, 0.65), (0.35, 0.66)
- **Trend**: Strong upward slope (r = 0.962).
- **Observation**: Accuracy increases significantly with higher deep-thinking ratios.
3. **Pink Series (ρ = 9.0e-1)**:
- **Data Points**:
- (0.05, 0.64), (0.1, 0.65), (0.15, 0.65)
- **Trend**: Flat (r = 0.947).
- **Observation**: Accuracy remains stable across the range of deep-thinking ratios.
4. **Cyan Series (ρ = 9.5e-1)**:
- **Data Points**:
- (0.0, 0.62), (0.1, 0.63), (0.2, 0.64), (0.3, 0.65), (0.4, 0.66), (0.45, 0.67)
- **Trend**: Strong upward slope (r = 0.916).
- **Observation**: Accuracy increases consistently with higher deep-thinking ratios.
---
### Key Observations
- **Correlation Strength**:
- Blue (ρ = 8.0e-1) and Red (ρ = 8.5e-1) series show the strongest correlations (r = 0.979 and 0.962, respectively).
- Cyan (ρ = 9.5e-1) has the weakest correlation (r = 0.916), despite a clear upward trend.
- **Accuracy Trends**:
- Higher depth fractions (e.g., ρ = 9.5e-1) generally show higher accuracy at higher deep-thinking ratios.
- The Red series (ρ = 8.5e-1) exhibits the most pronounced improvement in accuracy with increasing deep-thinking ratio.
- **Overlap and Variability**:
- Data points for different series overlap in the lower-left quadrant (e.g., 0.0–0.1 deep-thinking ratio).
- The Cyan series (ρ = 9.5e-1) has the highest accuracy at the maximum deep-thinking ratio (0.45, 0.67).
---
### Interpretation
The data suggests that **higher depth fractions** (e.g., ρ = 9.5e-1) are associated with **greater accuracy** at higher deep-thinking ratios, though the relationship varies in strength. The **Red series (ρ = 8.5e-1)** demonstrates the most consistent and significant improvement in accuracy as the deep-thinking ratio increases, while the **Blue series (ρ = 8.0e-1)** shows a slight decline. The **Cyan series (ρ = 9.5e-1)**, despite a strong upward trend, has a weaker correlation, possibly due to higher variability or external factors.
The plot implies that **deep-thinking ratio** is a critical factor in determining accuracy, but its impact depends on the **depth fraction**. Depth fractions closer to 0.9 (e.g., ρ = 8.5e-1, 9.0e-1) show more predictable relationships, while extreme values (e.g., ρ = 8.0e-1, 9.5e-1) may introduce noise or diminishing returns. This could inform optimization strategies for balancing depth and computational efficiency in systems relying on deep-thinking processes.
</details>
(b) Effect of different depth fraction $\rho$ .
Figure 4: Effect of hyper-parameters on thinking effort measurement and accuracy profiles. We analyze the impact of hyper-parameters by sweeping the settling threshold $g$ and the depth fraction $\rho$ . (a) Varying $g$ has a greater impact on the correlation; a permissive threshold ( $g=0.25$ ) yields flatter trends, whereas $g=0.5$ provides the most robust positive signal. (b) Varying $\rho$ shifts the range of thinking effort scores but maintains consistent positive slopes. Overall, stricter criteria (higher $g$ , lower $\rho$ ) reduce the range of DTR, with $(g,\rho)=(0.5,0.85)$ offering an ideal balance between stability and correlation.
We conduct an analysis to understand how our two key hyper-parameters, the settling threshold $g$ and the late-settling depth fraction $\rho$ , affect the measured thinking effort and its correlation with task performance. Figure ˜ 4 illustrates the accuracy profiles across varying thinking efforts (i.e., average late-settling token ratios), derived with $g\in\{0.25,0.5,0.75\}$ and $\rho\in\{0.8,0.85,0.9,0.95\}$ . We fix $\rho=0.85$ when sweeping $g$ , and fix $g=0.5$ when sweeping $\rho$ . We report results on GPQA-D using GPT-OSS-20B with the high reasoning level.
We draw the following observations. (1) The magnitude of the measured sequence-level thinking effort is directly influenced by the strictness of these parameters. Specifically, both Figures ˜ 4(a) and 4(b) show that imposing stricter criteria (a higher settling threshold $g$ or a lower depth fraction $\rho$ ) reduces the average late-settling token ratio. This is mechanistically consistent: a higher $g$ demands that intermediate states remain distributionally farther from the final output until reaching the deeper layers of the late regime before a token is considered settled there, while a lower $\rho$ restricts the definition of the late regime to a narrower band of deeper layers. Both conditions naturally filter out more candidates, so fewer tokens are classified as late-settling and the overall range of thinking effort scores shrinks.
(2) The settling threshold $g$ has a more pronounced impact on the correlation between thinking effort and accuracy than the depth fraction $\rho$ . As shown in Figure ˜ 4(b), varying $\rho$ shifts the range of late-settling ratios due to the varying strictness but maintains a consistent, positive slope across all settings, indicating that the metric is relatively robust to the specific definition of the late layers. In contrast, Figure ˜ 4(a) reveals that the choice of $g$ matters more: a softer threshold of $g=0.25$ yields a flatter trend with a lower correlation value, suggesting that it may be overly permissive, admitting tokens that involve little computational effort and diminishing the measurement’s ability to distinguish high-quality trajectories. Conversely, thresholds of $g=0.5$ and $g=0.75$ exhibit more robust positive correlations with accuracy.
(3) Overall, when the criteria are overly restrictive ( $g=0.75$ and $\rho\in\{0.9,0.95\}$ ), the trends, while still positively correlated, appear slightly less stable, likely because informative high-computation tokens are filtered out. Among the tested configurations, $(g,\rho)=(0.5,0.85)$ strikes an ideal balance, yielding a reliable trend with high correlation values.
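The quantity being swept above can be sketched as follows. The settling rule and the mapping from $\rho$ to the late-regime boundary follow the definitions in Section 2 (not reproduced here), so `late_start` is left as an explicit parameter, and the divergence values below are made up for illustration:

```python
def deep_thinking_ratio(divergences, g, late_start):
    # divergences[t][l]: divergence (e.g., JSD) between the layer-l and
    # final-layer predictions for token t, layers ordered shallow -> deep.
    # A token "settles" at the first layer from which its divergence stays
    # below the threshold g; it counts as deep-thinking (late-settling)
    # if that only happens within the late regime (layers >= late_start).
    def settling_layer(ds):
        for l in range(len(ds)):
            if all(d < g for d in ds[l:]):
                return l
        return len(ds)  # unreachable when the final-layer divergence is 0
    deep = sum(1 for ds in divergences if settling_layer(ds) >= late_start)
    return deep / len(divergences)
```

Note that raising `g` lets tokens settle earlier, so fewer qualify as late-settling, matching the stricter-criterion behavior described above.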
4 Deep-Thinking Tokens Enable Efficient Test-Time Scaling
Repeated sampling is a popular strategy for scaling test-time compute, in parallel to generating long CoT [brown2024large-581, gupta2025test-time-19d, saad-falcon2024archon-cb5, stroebl2024inference-4ca, saad-falcon2025shrinking-bf7]. It improves accuracy by aggregating multiple independently generated samples per problem at the cost of increased inference budget. In this section, we explore whether our proposed DTR measure can be leveraged to preferentially select and aggregate higher-quality samples towards better performance.
Table 2: Comparison of task accuracy and average inference cost (k tokens) under different aggregation methods, across four reasoning benchmarks. The reported cost reductions ( $\Delta$ %) are shown relative to Cons@ $n$ . Think@ $n$ achieves the best overall performance while reducing inference cost by approximately 50%. Methods with ${\dagger}$ adopt a prefix length of 50 to determine early stopping.
| Method | AIME 25 Acc | AIME 25 Cost ( $\Delta$ %) | AIME 24 Acc | AIME 24 Cost ( $\Delta$ %) | HMMT 25 Acc | HMMT 25 Cost ( $\Delta$ %) | GPQA-D Acc | GPQA-D Cost ( $\Delta$ %) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OSS-120B-medium | | | | | | | | |
| Cons@ $n$ | 92.7 | 307.6 (–) | 92.7 | 235.1 (–) | 80.0 | 355.6 (–) | 73.8 | 93.5 (–) |
| Mean@ $n$ | 80.0 | 307.6 (–) | 81.6 | 235.1 (–) | 62.6 | 355.6 (–) | 69.9 | 93.5 (–) |
| Long@ $n$ | 86.7 | 307.6 (–) | 86.7 | 235.1 (–) | 73.3 | 355.6 (–) | 73.2 | 93.5 (–) |
| Short@ $n$ | 87.3 | 255.7 (-17%) | 88.0 | 200.9 (-15%) | 77.3 | 290.4 (-18%) | 73.3 | 84.4 (-10%) |
| Self-Certainty@ $n$ † | 87.3 | 150.6 (-51%) | 91.3 | 119.3 (-49%) | 78.0 | 177.0 (-50%) | 76.0 | 47.9 (-49%) |
| Think@ $n$ † | 94.7 | 155.4 (-49%) | 93.3 | 121.3 (-48%) | 80.0 | 181.9 (-49%) | 74.7 | 48.8 (-48%) |
| Qwen3-4B-Thinking | | | | | | | | |
| Cons@ $n$ | 86.7 | 1073.1 (–) | 93.3 | 950.1 (–) | 63.3 | 1275.7 (–) | 67.8 | 410.6 (–) |
| Mean@ $n$ | 81.2 | 1073.1 (–) | 86.3 | 950.1 (–) | 55.7 | 1275.7 (–) | 66.9 | 410.6 (–) |
| Long@ $n$ | 85.3 | 1073.1 (–) | 86.7 | 950.1 (–) | 52.7 | 1275.7 (–) | 66.7 | 410.6 (–) |
| Short@ $n$ | 90.0 | 983.6 (-8%) | 90.0 | 871.0 (-8%) | 63.3 | 1165.7 (-9%) | 68.2 | 382.9 (-7%) |
| Self-Certainty@ $n$ † | 86.7 | 548.9 (-49%) | 90.0 | 480.9 (-49%) | 63.3 | 641.4 (-50%) | 68.2 | 206.6 (-50%) |
| Think@ $n$ † | 90.0 | 537.5 (-50%) | 93.3 | 482.2 (-49%) | 66.7 | 641.4 (-50%) | 69.7 | 206.8 (-50%) |
<details>
<summary>x6.png Details</summary>

### Visual Description
## Scatter Plots: Model Performance vs. Cost
### Overview
The image contains two side-by-side scatter plots comparing model performance (accuracy) against computational cost (tokens) for two AI systems: **OSS-120B-medium** (left) and **Qwen3-4B-Thinking** (right). Each plot uses color-coded data points to represent different evaluation metrics or strategies, with labels like "Think@n" and "Self-Certainty@n".
---
### Components/Axes
- **X-axis**: Cost (tokens), logarithmic scale (1.5×10⁵ to 9×10⁵ tokens).
- **Y-axis**: Accuracy (0.72 to 0.85 range).
- **Legends**:
- **OSS-120B-medium**: Blue (Think@n), Green (Cons@n), Purple (Short@n), Pink (Long@n), Yellow (Self-Certainty@n), Cyan (Mean@n).
- **Qwen3-4B-Thinking**: Same color scheme as above.
- **Data Points**: Each point is labeled with its metric (e.g., "Think@n") and positioned at specific (cost, accuracy) coordinates.
---
### Detailed Analysis
#### OSS-120B-medium (Left Plot)
- **Think@n**: Highest accuracy (0.85) at lowest cost (1.5×10⁵ tokens).
- **Cons@n**: Second-highest accuracy (0.84) at moderate cost (2.2×10⁵ tokens).
- **Self-Certainty@n**: Accuracy 0.83 at 2×10⁵ tokens.
- **Short@n**: Accuracy 0.79 at 2.5×10⁵ tokens.
- **Long@n**: Accuracy 0.78 at 3×10⁵ tokens.
- **Mean@n**: Lowest accuracy (0.76) at 2.8×10⁵ tokens.
#### Qwen3-4B-Thinking (Right Plot)
- **Cons@n**: Highest accuracy (0.79) at 6.5×10⁵ tokens.
- **Think@n**: Accuracy 0.80 at 5×10⁵ tokens.
- **Self-Certainty@n**: Accuracy 0.78 at 6×10⁵ tokens.
- **Short@n**: Accuracy 0.77 at 7×10⁵ tokens.
- **Long@n**: Accuracy 0.76 at 8×10⁵ tokens.
- **Mean@n**: Lowest accuracy (0.75) at 7.5×10⁵ tokens.
---
### Key Observations
1. **Cost-Accuracy Tradeoff**:
- OSS-120B-medium achieves higher accuracy (0.76–0.85) at significantly lower costs (1.5–3×10⁵ tokens) compared to Qwen3-4B-Thinking (5–8×10⁵ tokens).
- Qwen3-4B-Thinking shows diminishing returns: higher costs correlate with lower accuracy (e.g., Long@n at 8×10⁵ tokens has 0.76 accuracy).
2. **Performance Variability**:
- "Mean@n" points (likely average performance) are consistently the lowest in accuracy for both models, suggesting heterogeneity in individual metric performance.
3. **Efficiency**:
- OSS-120B-medium’s "Think@n" and "Cons@n" strategies outperform Qwen3-4B-Thinking’s equivalents despite lower computational costs.
---
### Interpretation
The data highlights **OSS-120B-medium** as a more cost-effective solution for high-accuracy tasks, with its top-performing strategies ("Think@n", "Cons@n") achieving near-peak accuracy at minimal token expenditure. In contrast, **Qwen3-4B-Thinking** exhibits inefficiency, with higher costs yielding only marginal accuracy gains. The "Mean@n" points underscore systemic underperformance in average configurations, suggesting that optimal strategies (e.g., "Think@n") require deliberate design rather than default settings. These findings imply that model architecture and strategy selection are critical for balancing accuracy and resource efficiency in AI systems.
</details>
Figure 5: Comparison of the trade-off between task accuracy and inference cost (tokens) with different aggregation methods. Accuracy is averaged across all four datasets (AIME 24/25, HMMT 25, GPQA-D). Our Think@ $n$ method achieves the best overall Pareto-optimal performance. It matches or exceeds the accuracy of Cons@n with approximately half the inference cost, while Self-Certainty@ $n$ is notably less efficient.
Table 3: Impact of prefix length ( $\ell_{\text{prefix}}$ ) on Think@ $n$ performance and inference cost for AIME 2025. Using a short prefix of 50 tokens to estimate DTR outperforms using longer ones, and is comparable to the full sequence (all) while providing significant cost savings. We also report Pass@1 and Cons@ $n$ for reference. Accuracy is shown as mean ± standard deviation across 10 trials.
| Method | Accuracy | Cost (k tokens) |
| --- | --- | --- |
| Pass@1 | 80.0 ± 4.2 | 6.4 |
| Cons@ $n$ | 90.0 ± 2.5 | 307.6 |
| Think@ $n$ (prefix length) | | |
| 50 | 94.7 ± 1.6 | 155.4 |
| 100 | 92.0 ± 1.6 | 154.1 |
| 500 | 92.7 ± 1.3 | 153.2 |
| 1000 | 92.7 ± 1.3 | 177.4 |
| 2000 | 92.0 ± 1.3 | 198.8 |
| all | 94.0 ± 0.3 | 307.6 |
Experimental setups.
We follow the best-of-n (BoN) evaluation protocol commonly adopted in recent test-time scaling studies [fu2025deep]. For each problem, we sample $n$ responses using identical decoding settings, and compare the following aggregation methods. Cons@ $n$ : standard self-consistency [wang2023selfconsistency], which performs majority voting over all $n$ sampled responses. Mean@ $n$ : the average accuracy of the $n$ samples, reflecting a baseline with no preferential aggregation. Long@ $n$ and Short@ $n$ : majority voting over the longest/shortest $\eta$ percent of the $n$ samples, ranked by token count [hassid2025don, agarwal2025first]. Self-Certainty@ $n$ : majority voting over the highest-scoring $\eta$ percent of the $n$ samples, ranked by Self-Certainty score (the best-performing baseline in Section ˜ 3). Think@ $n$ : majority voting over the highest-scoring $\eta$ percent of the $n$ samples, ranked by DTR $(\cdot)$ . All methods operate on the same pool of $n$ samples. We set $n=48$ and $\eta=50\%$ . More analyses are provided in Appendix ˜ C. The results are averaged across 10 trials.
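The selection-and-voting step shared by these methods can be sketched as follows; `vote_top_eta` is a hypothetical helper name, and the ranking score determines which method it instantiates:

```python
from collections import Counter

def vote_top_eta(answers, scores, eta=0.5):
    # Majority vote over the top-eta fraction of samples ranked by score.
    # With DTR scores this corresponds to Think@n; with Self-Certainty
    # scores, to Self-Certainty@n; eta=1.0 recovers Cons@n.
    n = len(answers)
    k = max(1, round(eta * n))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return Counter(answers[i] for i in top).most_common(1)[0][0]
```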
Results.
We report the results in Table ˜ 2. To compare efficiency, we explicitly account for early stopping for Short@ $n$ , Self-Certainty@ $n$ , and Think@ $n$ , which aggregate only a subset of samples. Specifically, we report the average per-problem inference cost, measured as the total number of generated tokens, under the following protocols.
For Cons@ $n$ and Mean@ $n$ , the inference cost is defined as the sum of token counts across all $n$ sampled responses (i.e., $\sum_{i=1}^{n}|S_{i}|$ ), corresponding to full decoding without early stopping. For Short@ $n$ , we rank samples by their length and select the shortest $\eta× n$ samples. The inference cost is computed as the sum of the token counts of the selected samples, plus an early-stopping overhead of $\ell_{\text{longest\_short}}×(1-\eta)× n$ , where $\ell_{\text{longest\_short}}$ denotes the length of the longest sample among the selected shortest subset. This term accounts for partially generated samples that are terminated once subset generation completes (i.e., each bounded by $\ell_{\text{longest\_short}}$ ). The inference cost for Long@ $n$ is the same as for Cons@ $n$ and Mean@ $n$ , since selecting the longest samples requires full decoding. For Think@ $n$ , samples are ranked by DTR computed from a fixed prefix. Let $\ell_{\text{prefix}}$ denote the number of prefix tokens used to estimate $\mathrm{DTR}(S[:\ell_{\text{prefix}}])$ . The inference cost is defined as the total token count of the top $\eta× n$ ranked samples, plus a prefix overhead of $\ell_{\text{prefix}}×(1-\eta)× n$ for the candidates terminated after ranking. Self-Certainty@ $n$ follows the same cost computation as Think@ $n$ , differing only in that samples are ranked by $\mathrm{Self\text{-}Certainty}(S[:\ell_{\text{prefix}}])$ rather than $\mathrm{DTR}(S[:\ell_{\text{prefix}}])$ .
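A sketch of this cost accounting for Think@ $n$ , under our reading that the prefix overhead is charged to the candidates rejected after ranking (the selected samples' full lengths already include their prefixes):

```python
def think_at_n_cost(selected_lengths, n, eta, l_prefix):
    # Per-problem inference cost (tokens) for Think@n under early stopping:
    # full decoding of the selected top eta*n samples, plus l_prefix tokens
    # for each candidate terminated after prefix-based ranking.
    rejected = n - int(eta * n)
    return sum(selected_lengths) + l_prefix * rejected
```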
Table ˜ 3 reports a preliminary ablation on AIME 25 that varies $\ell_{\text{prefix}}$ . We find that using only $\ell_{\text{prefix}}=50$ tokens achieves higher accuracy than longer prefixes and matches the performance obtained using the full sequence, while significantly reducing inference cost. Accordingly, we fix $\ell_{\text{prefix}}=50$ for all experiments in Table ˜ 2.
As shown, Cons@ $n$ incurs the highest inference cost due to full decoding of every candidate, while providing a strong accuracy baseline. Mean@ $n$ has the same cost as Cons@ $n$ but performs worst among all methods. Under early stopping, Short@ $n$ achieves modest cost savings relative to Cons@ $n$ , yet consistently underperforms it in accuracy. Long@ $n$ degrades performance further compared to Short@ $n$ without offering any cost savings. This indicates that length-based heuristics remain a coarse proxy for reasoning quality and often fail to reliably identify high-quality samples, leading to suboptimal aggregations. Self-Certainty@ $n$ substantially reduces inference cost by enabling early stopping with short prefixes, but nonetheless underperforms both Cons@ $n$ and Think@ $n$ on three of the four evaluated benchmarks. In contrast, Think@ $n$ consistently matches or exceeds the accuracy of Cons@ $n$ while requiring approximately half the inference cost. The Pareto-optimal behavior is most evident in the averaged results shown in Figure ˜ 5, where Think@ $n$ achieves the best overall accuracy-cost trade-off. In sum, these results demonstrate that DTR provides a more informative and reliable selection signal, enabling efficient parallel scaling of inference compute.
5 Related Work
5.1 Relationship between CoT Length and Performance
The paradigm of test-time scaling has largely operated on the assumption that allocating more computation, typically manifested as longer CoT sequences, boosts reasoning performance [wei2022chain-d1a, guo2025deepseek, muennighoff2025s1]. Recent empirical studies have highlighted nuances to the universality of this “longer is better” heuristic [feng2025what-321, wu2025when-905]. gema2025inverse-bad identify inverse-scaling regimes where increased reasoning length systematically degrades accuracy across diverse tasks, particularly when models are prone to distraction. Similarly, wu2025when-905 characterize the relationship between CoT length and accuracy as an “inverted-U” curve, suggesting that an optimal length exists beyond which performance deteriorates due to factors like error accumulation.
Several works exploit these observations by favoring conciseness. hassid2025don demonstrated that the shortest reasoning chains among sampled candidates are often the most accurate, proposing inference-time length-based voting for efficient generation. Closely related, agarwal2025first introduced a training-free strategy that selects the first completed trace in parallel decoding, reducing token usage while maintaining accuracy. On the training side, shrivastava2025sample proposed Group Filtered Policy Optimization (GFPO) to explicitly curb length inflation in RL via rejection sampling that filters longer responses, demonstrating that models can think less without sacrificing performance. Our work aligns with these perspectives by confirming that raw token count is an unreliable proxy for effective reasoning effort, but diverges by proposing a mechanistic internal signal rather than relying on surface-level brevity heuristics.
5.2 Leveraging Internal Information in LLMs
A rich line of work has investigated how LMs internally represent and manipulate information across layers, and how internal states can be exploited. Central to this direction is the observation that intermediate representations in LMs often encode meaningful signals before reaching the final layer. Early evidence for this view was provided by nostalgebraist2020lens, which projects intermediate hidden states directly into the vocabulary space using the model’s unembedding matrix—a technique we adopt in our work. The results reveal that autoregressive transformers form coarse guesses about the next token that are iteratively refined across layers. Subsequent analyses [belrose2023eliciting] further introduce learned, layer-specific affine transformations that better align intermediate representations with the final prediction space, enabling more interpretable token predictions in shallower layers.
Beyond model probing, chuang2023dola-0c6 exploit the empirical finding that factual knowledge in LMs is often more salient in particular layers. By contrasting logits from higher and lower layers, they propose a decoding method that amplifies factual signals and improves factuality. More recently, vilas2025tracing-3dc introduce latent-trajectory signals characterizing the temporal evolution of hidden states across generated reasoning traces to predict correctness. While that work examines the sequential dimension of representations, ours focuses on the depth-wise evolution of predictions across layers for individual tokens.
Complementary interpretability works also revisit how LLMs utilize depth at inference. gupta2025how-6d8 shows that early layers tend to favor high-frequency, generic token guesses, which are subsequently refined into contextually appropriate predictions. csords2025do-6d4 suggest that later layers primarily perform fine-grained distributional refinement rather than introducing fundamentally new transformations, raising questions about the efficiency of depth utilization in modern LLMs. These findings reinforce the view that internal predictions may stabilize before the final layer, aligning with our motivations. Overall, our goal is not to modify or construct internal states to develop new methods aimed at improving model capabilities. Instead, we leverage natural, unaltered internal representations as a proxy for measuring model computational effort, which implicitly reflects thinking effort in LLMs.
6 Conclusion
We introduced deep-thinking ratio (DTR) as a novel measure of inference-time reasoning effort in LLMs. By tracking depth-wise stabilization of token predictions, DTR provides a more reliable signal of effective reasoning than surface-level proxies such as token length or confidence. Building on this insight, we proposed Think@ $n$ , a test-time scaling strategy that leverages DTR for early selection and aggregation, achieving comparable or better performance than standard self-consistency while substantially reducing inference cost. Together, our results suggest that measuring how models think internally, rather than how long they think, is a promising direction. Future work may leverage this insight to explore how effective reasoning is characterized—shifting the focus from generating longer chains of thought to inducing deeper, more computationally intensive reasoning, and potentially enabling more reliable and efficient reasoning models.
Acknowledgements
We thank Congchao Wang and colleagues from Google AIR for their valuable support. We also thank Yu-Min Tseng from Virginia Tech and members of Meng-Lab at UVA for their helpful discussion.
References
Appendix A Comparison of Different Distance Metrics for DTR
Our method (Section ˜ 2) adopts Jensen–Shannon divergence (JSD) to quantify the discrepancy between intermediate-layer and final-layer predictions and compute DTR. Alternative notions of distance are possible. Here we explore two additional metrics: Kullback–Leibler divergence (KLD) and cosine similarity. The results are presented in Figure ˜ 6.
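For reference, the JSD used by the main method can be sketched for discrete distributions as below (JSD is the symmetrized, bounded counterpart of the KLD explored next):

```python
import math

def jsd(p, q):
    # Jensen-Shannon divergence between two discrete distributions:
    # JSD(p, q) = 0.5*KL(p || m) + 0.5*KL(q || m), with m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```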
Kullback–Leibler divergence.
By replacing JSD with KLD in Equation 2, we compute the divergence between the final-layer distribution $p_{t,L}$ and the intermediate-layer distribution $p_{t,l}$ as
$$
\displaystyle D^{\text{KL}}_{t,l}=\mathrm{KL}(p_{t,L}\,\|\,p_{t,l}) \tag{11}
$$
Cosine similarity.
We replace the distributional comparison defined in Section 2.2 with a representation-space measure using cosine similarity. Instead of projecting intermediate-layer hidden states into the vocabulary space via the shared unembedding matrix $W_{U}$ (Equation 1), we directly compute the cosine similarity between the intermediate-layer hidden state $h_{t,l}$ and the final-layer hidden state $h_{t,L}$ . The distance is defined as
$$
\displaystyle D^{\text{cos}}_{t,l}=1-\frac{\langle h_{t,l},h_{t,L}\rangle}{\|h_{t,l}\|\|h_{t,L}\|} \tag{12}
$$
For both KLD and cosine similarity, we then apply the same configuration as in Section 2.2 to identify deep-thinking tokens and compute KLD-based DTR and cosine-based DTR.
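To make the three metrics concrete, the following sketch computes each distance for a single token position, using plain Python lists as toy stand-ins for the layer outputs (the distributions and hidden states below are illustrative, not taken from any model):

```python
import math

def kld(p, q):
    # KL(p || q); terms with p_i == 0 contribute nothing (0 * log 0 := 0)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: symmetrized KL against the mixture m
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def cosine_distance(u, v):
    # Eq. 12 analogue: 1 - cos(u, v), computed on hidden states directly,
    # without projecting through the unembedding matrix
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# toy vocabulary distributions: peaked final layer vs flat intermediate layer
p_final = [0.90, 0.05, 0.03, 0.02]
p_inter = [0.25, 0.25, 0.25, 0.25]
d_js = jsd(p_final, p_inter)   # symmetric and bounded
d_kl = kld(p_final, p_inter)   # Eq. 11 direction: KL(p_L || p_l), asymmetric

# toy hidden states for the cosine variant
h_final = [0.2, -1.1, 0.7]
h_inter = [0.1, -0.9, 0.8]
d_cos = cosine_distance(h_inter, h_final)
```

A token would then be flagged as deep-thinking whenever its distance remains large into deep layers, exactly as in the JSD-based definition; only the distance function changes.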
Results.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Cosine Similarity
### Overview
The chart illustrates the relationship between DTR (deep-thinking ratio) and Accuracy (Pass@1) using a blue line with shaded confidence intervals. A dashed trend line with a correlation coefficient (r = 0.633) is overlaid, indicating a moderate positive relationship. The data suggests a non-linear trend with a peak accuracy at a specific DTR value.
### Components/Axes
- **Title**: "Cosine Similarity" (top-center)
- **X-axis**: Labeled "DTR" with values ranging from 0.060 to 0.078 (increments of 0.006).
- **Y-axis**: Labeled "Accuracy (Pass@1)" with values from 0.775 to 0.850 (increments of 0.005).
- **Legend**: No explicit legend, but the blue line and shaded area represent the primary data series. The dashed line represents the trend line.
- **Data Points**: Blue circles mark specific DTR-accuracy pairs.
- **Shaded Area**: Light blue region around the blue line indicates variability or confidence intervals.
### Detailed Analysis
- **Data Points**:
- At DTR = 0.060: Accuracy ≈ 0.780
- At DTR = 0.066: Accuracy ≈ 0.790
- At DTR = 0.072: Accuracy ≈ 0.845 (peak)
- At DTR = 0.078: Accuracy ≈ 0.815
- **Trend Line**: Dashed line with r = 0.633, showing a moderate positive correlation between DTR and Accuracy.
- **Shaded Area**: The width of the shaded region varies, suggesting increasing uncertainty or variability in Accuracy as DTR increases.
### Key Observations
1. **Peak Accuracy**: The highest Accuracy (0.845) occurs at DTR = 0.072, followed by a decline at DTR = 0.078.
2. **Non-Linear Relationship**: Accuracy increases sharply from DTR = 0.060 to 0.072, then decreases, indicating a potential optimal DTR value.
3. **Moderate Correlation**: The trend line (r = 0.633) suggests a positive but not strongly linear relationship between DTR and Accuracy.
4. **Variability**: The shaded area widens as DTR increases, implying greater uncertainty in Accuracy predictions at higher DTR values.
### Interpretation
The chart demonstrates that increasing DTR improves Accuracy up to a critical point (DTR = 0.072), after which performance declines. This suggests an optimal DTR threshold for maximizing Accuracy. The moderate correlation (r = 0.633) indicates that while DTR influences Accuracy, other factors may also play a role. The shaded area highlights the model's sensitivity to DTR, with higher variability at larger DTR values. The peak at DTR = 0.072 could represent a balance between computational complexity and performance, warranting further investigation into the underlying mechanisms.
</details>
(a) Cosine similarity as the distance metric on AIME 25.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Kullback-Leibler Divergence
### Overview
The chart illustrates the relationship between DTR (x-axis) and Accuracy (Pass@1, y-axis) with a blue line and shaded confidence interval. A negative correlation (r = -0.698) is annotated, indicating an inverse relationship between the variables.
### Components/Axes
- **X-axis (DTR)**: Labeled "DTR" with values ranging from 0.345 to 0.390 in increments of 0.015.
- **Y-axis (Accuracy)**: Labeled "Accuracy (Pass@1)" with values from 0.72 to 0.84 in increments of 0.02.
- **Legend**: Positioned at the top, labeled "Pass@1" with a blue color matching the line and shaded area.
- **Shaded Area**: Represents a confidence interval (likely 95%) around the blue line.
### Detailed Analysis
- **Data Points**:
- (0.345, 0.83)
- (0.360, 0.83)
- (0.375, 0.84)
- (0.390, 0.73)
- **Trend**: The blue line starts at 0.83 accuracy for DTR=0.345, remains stable until DTR=0.375 (peak at 0.84), then sharply declines to 0.73 at DTR=0.390. The shaded area follows the line’s trajectory, widening slightly near the peak.
### Key Observations
1. **Negative Correlation**: The r = -0.698 value confirms a strong inverse relationship between DTR and accuracy.
2. **Peak Performance**: Accuracy peaks at DTR=0.375 (0.84), suggesting an optimal DTR value before performance degrades.
3. **Confidence Interval**: The shaded area indicates variability in accuracy, with the widest spread near the peak (DTR=0.375).
### Interpretation
The chart demonstrates that increasing DTR initially stabilizes or slightly improves accuracy, but beyond DTR=0.375, performance drops significantly. The negative correlation implies that higher DTR values may introduce noise or overfitting, reducing model reliability. The shaded confidence interval highlights uncertainty in accuracy estimates, particularly at the peak DTR value. This suggests a trade-off between model complexity (DTR) and generalization performance (Pass@1), with an optimal DTR range likely between 0.345 and 0.375.
</details>
(b) KL divergence as the distance metric on AIME 25.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Jensen-Shannon Divergence
### Overview
The chart illustrates the relationship between DTR (x-axis) and Accuracy (Pass@1) (y-axis) using a blue line with markers and a shaded confidence interval. A dashed reference line labeled "r = 0.869" is included for comparison.
### Components/Axes
- **X-axis (DTR)**: Ranges from 0.15 to 0.18 in increments of 0.01. Labels: 0.15, 0.16, 0.17, 0.18.
- **Y-axis (Accuracy (Pass@1))**: Ranges from 0.70 to 0.85 in increments of 0.05. Labels: 0.70, 0.75, 0.80, 0.85.
- **Legend**: Not explicitly labeled, but two distinct lines are present:
- **Blue solid line**: Represents the primary data series.
- **Dashed line**: Represents a reference line (r = 0.869).
### Detailed Analysis
- **Data Points**:
- At DTR = 0.15: Accuracy = 0.70 (blue marker).
- At DTR = 0.16: Accuracy = 0.80 (blue marker).
- At DTR = 0.17: Accuracy = 0.85 (blue marker).
- At DTR = 0.18: Accuracy = 0.85 (blue marker).
- **Trends**:
- The blue line slopes upward, indicating increasing accuracy with higher DTR values.
- The dashed line (r = 0.869) is a diagonal reference, suggesting a strong positive correlation between DTR and accuracy.
- The shaded area around the blue line represents variability or confidence intervals, widening slightly as DTR increases.
### Key Observations
- Accuracy improves significantly from 0.70 (DTR = 0.15) to 0.85 (DTR = 0.17–0.18).
- The highest accuracy (0.85) is observed at DTR = 0.17 and 0.18, with a slight plateau.
- The shaded region indicates uncertainty, with the widest spread between DTR = 0.15 and 0.16.
### Interpretation
The chart demonstrates a strong positive correlation (r = 0.869) between DTR and accuracy in the Jensen-Shannon Divergence metric. As DTR increases, accuracy consistently rises, peaking at DTR = 0.17–0.18. The shaded confidence interval suggests variability in the data, particularly at lower DTR values. This implies that optimizing DTR could be a critical factor in improving performance, though the plateau at higher DTR values may indicate diminishing returns or a saturation point. The dashed reference line reinforces the linear relationship, emphasizing the reliability of the trend.
</details>
(c) JS divergence as the distance metric on AIME 25.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Cosine Similarity vs Accuracy (Pass@1)
### Overview
The chart illustrates the relationship between DTR (x-axis) and Accuracy (Pass@1) (y-axis) using a blue line with markers and a shaded confidence interval. A dashed reference line labeled with a correlation coefficient (r = 0.172) is overlaid. The data shows a non-linear trend with a peak in accuracy at mid-DTR values.
### Components/Axes
- **X-axis (DTR)**: Ranges from 0.060 to 0.078 in increments of 0.006. Labeled "DTR" with gridlines.
- **Y-axis (Accuracy)**: Ranges from 0.56 to 0.68 in increments of 0.02. Labeled "Accuracy (Pass@1)" with gridlines.
- **Legend**: No explicit legend, but the blue line represents the primary data series. The dashed line is annotated with "r = 0.172".
- **Shaded Area**: Light blue region surrounding the blue line, indicating variability or confidence intervals.
### Detailed Analysis
- **Blue Line Data Points**:
- At DTR = 0.060: Accuracy ≈ 0.59 (uncertainty: ±0.015, based on shaded area).
- At DTR = 0.066: Accuracy ≈ 0.68 (peak).
- At DTR = 0.072: Accuracy ≈ 0.68 (slight plateau).
- At DTR = 0.078: Accuracy ≈ 0.63 (decline).
- **Dashed Line**: Represents a weak linear trend (r = 0.172), suggesting minimal correlation between DTR and accuracy.
### Key Observations
1. **Non-linear Trend**: Accuracy increases sharply until DTR = 0.066, then plateaus and declines.
2. **Shaded Variability**: The shaded area indicates significant uncertainty, especially at lower DTR values (e.g., ±0.015 at DTR = 0.060).
3. **Weak Correlation**: The low r-value (0.172) implies the dashed line does not strongly represent the data's trend.
### Interpretation
The data suggests that higher DTR values initially improve accuracy, but beyond a threshold (DTR ≈ 0.066), performance degrades. The shaded area highlights model instability or dataset variability. The weak correlation (r = 0.172) indicates that DTR alone poorly predicts accuracy, implying other factors may influence the relationship. The peak at DTR = 0.066 could represent an optimal parameter setting, while the decline at higher DTR values might signal overfitting or diminishing returns.
</details>
(d) Cosine similarity as the distance metric on HMMT 25.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Scatter Plot: Kullback-Leibler Divergence
### Overview
The image is a scatter plot titled "Kullback-Leibler Divergence," showing the relationship between DTR (deep-thinking ratio) and Accuracy (Pass@1). The plot includes a solid line with data points, a dashed trend line, and a shaded confidence interval. Key annotations include a correlation coefficient (r = 0.409) and axis labels.
---
### Components/Axes
- **X-axis (DTR)**: Labeled "DTR" with values ranging from 0.375 to 0.405 in increments of 0.005.
- **Y-axis (Accuracy)**: Labeled "Accuracy (Pass@1)" with values from 0.600 to 0.675 in increments of 0.005.
- **Legend**: Located in the top-left corner, with two entries:
- Solid blue line: "Pass@1" (data points).
- Dashed blue line: "Trend Line" (r = 0.409).
- **Shaded Area**: Light blue region surrounding the solid line, likely representing a confidence interval or variability.
---
### Detailed Analysis
1. **Data Points**:
- **DTR = 0.375**: Accuracy ≈ 0.605 (solid blue circle).
- **DTR = 0.390**: Accuracy ≈ 0.635 (solid blue circle).
- **DTR = 0.400**: Accuracy ≈ 0.630 (solid blue circle).
- **DTR = 0.405**: Accuracy ≈ 0.640 (solid blue circle).
2. **Trend Line**:
- Dashed blue line with a correlation coefficient **r = 0.409**, indicating a moderate positive relationship between DTR and Accuracy.
3. **Shaded Area**:
- Light blue region spans the range of data points, suggesting variability or uncertainty in the relationship.
---
### Key Observations
- The solid line shows a **non-linear trend**: Accuracy increases sharply from DTR 0.375 to 0.390, then plateaus slightly before rising again at 0.405.
- The dashed trend line (r = 0.409) indicates a **modest positive correlation**, with DTR explaining ~16.7% of the variance in Accuracy (r² ≈ 0.167).
- The shaded area highlights **high variability** in Accuracy at lower DTR values (e.g., 0.375–0.390) compared to higher DTR values (0.400–0.405).
---
### Interpretation
- **Relationship**: The moderate positive correlation (r = 0.409) implies that increasing DTR generally improves Accuracy, but the relationship is not strongly deterministic. Other factors likely influence Accuracy.
- **Optimal DTR**: Accuracy reaches a local maximum (0.635) at DTR = 0.390, dips slightly at 0.400, and attains its highest plotted value (0.640) at DTR = 0.405.
- **Confidence Interval**: The shaded area indicates uncertainty in the relationship, particularly at lower DTR values. This could reflect model instability or data sparsity in those regions.
- **Practical Implications**: While higher DTR values correlate with better performance, the weak correlation suggests trade-offs (e.g., computational cost, overfitting) may need consideration.
</details>
(e) KL divergence as the distance metric on HMMT 25.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Chart: Jensen-Shannon Divergence
### Overview
The chart illustrates the relationship between DTR (deep-thinking ratio, computed with Jensen-Shannon divergence) and accuracy (Pass@1) across four data points. A blue line connects the points, with a shaded confidence interval and a dashed trend line labeled with a correlation coefficient (r = 0.895).
### Components/Axes
- **X-axis (DTR)**: Ranges from 0.135 to 0.180 in increments of 0.015.
- **Y-axis (Accuracy (Pass@1))**: Ranges from 0.56 to 0.68 in increments of 0.02.
- **Legend**: No explicit legend, but the blue line and shaded area represent the primary data series. The dashed line represents the trend line.
- **Data Points**: Four blue circles marked at specific DTR values.
### Detailed Analysis
- **Data Points**:
- (0.135, 0.56)
- (0.150, 0.60)
- (0.165, 0.65)
- (0.180, 0.64)
- **Trend Line**: A dashed trend line annotated with r = 0.895, indicating a strong positive correlation between DTR and accuracy.
- **Confidence Interval**: A shaded blue area surrounds the blue line, widening as DTR increases, suggesting increasing uncertainty in accuracy estimates.
### Key Observations
1. **Positive Correlation**: Accuracy increases with DTR, as shown by the upward slope of the blue line and the high r-value (0.895).
2. **Confidence Interval Expansion**: The shaded area widens from DTR = 0.135 to 0.180, indicating greater variability in accuracy at higher DTR values.
3. **Slight Decline at High DTR**: The final data point (0.180, 0.64) shows a minor decrease in accuracy compared to the peak at DTR = 0.165 (0.65).
### Interpretation
The data suggests that higher DTR values are associated with improved accuracy (Pass@1), with a strong linear relationship (r = 0.895). However, the widening confidence interval implies that the relationship becomes less predictable at higher DTR values. The slight decline in accuracy at DTR = 0.180 may indicate a plateau or diminishing returns. Overall, under the JSD metric, DTR tracks accuracy closely across the plotted range.
</details>
(f) JS divergence as the distance metric on HMMT 25.
Figure 6: Comparison of correlation between accuracy and deep-thinking ratio (DTR) using different distance metrics (cosine similarity, KL divergence, and JS divergence) on AIME 25 (top row) and HMMT 25 (bottom row).
We report the correlation results of KLD-based and cosine-based DTR, compared with our main JSD-based DTR method, on AIME 25 and HMMT 25 using OSS-120B-medium. Across both datasets, JSD-based DTR consistently achieves the strongest positive correlation with accuracy ( $r$ = 0.869 on AIME 25; $r$ = 0.895 on HMMT 25), justifying its use in our definition of DTR in Section 2. In contrast, cosine-based DTR exhibits substantially weaker and less stable correlations ( $r$ = 0.633 on AIME 25 and only $r$ = 0.172 on HMMT 25). KLD-based DTR shows similarly inconsistent behavior, with a negative correlation on AIME 25 ( $r$ = -0.698) and a modest positive correlation on HMMT 25 ( $r$ = 0.409). This inconsistency may stem from the asymmetric and numerically unstable nature of KLD: early-layer predictions tend to be high-entropy and relatively flat, assigning probability mass to many tokens that are later driven to near-zero values. Consequently, KLD can become artificially small, making the measure highly sensitive to these vanishing probabilities.
Appendix B DTR Under Different GPT-OSS Reasoning Levels
Figure 7 illustrates how DTR varies across the different reasoning-level configurations (i.e., low, medium, and high) of the GPT-OSS-120B model. We observe an interesting and consistent trend on both AIME 25 and GPQA-D: although the underlying model weights remain identical and only the system prompt differs, lower reasoning-level configurations exhibit higher DTR values, whereas higher reasoning-level configurations yield systematically smaller DTR while achieving better task accuracy.
A potential explanation is that higher reasoning levels may redistribute computation from depth to sequence length, effectively flattening per-token, layer-wise computation. Models at higher reasoning levels require less deep revision for each individual token but instead generate longer reasoning chains with more forward passes, resulting in greater total effective compute and improved task performance. Since DTR is defined as the proportion of deep-thinking tokens (i.e., normalized by the total number of generated tokens), longer sequences increase the denominator in the DTR calculation and thus produce smaller values. This also suggests that DTR may not be directly comparable across different models or model modes.
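The denominator effect described above is easy to see with hypothetical token counts (the numbers below are illustrative, not measured):

```python
def dtr(num_deep_tokens, total_tokens):
    # deep-thinking ratio: fraction of generated tokens flagged as deep-thinking
    return num_deep_tokens / total_tokens

# hypothetical: a low reasoning level deeply revises 150 tokens over a short
# chain; a high level revises twice as many, but over a far longer chain
low_level = dtr(150, 1000)    # 0.15
high_level = dtr(300, 4000)   # 0.075: lower DTR despite more deep tokens
```

More absolute deep-thinking work can thus coexist with a smaller ratio, which is why DTR comparisons are most meaningful within a fixed model and configuration.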
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Charts: AIME 25 and GPQA-D Performance Analysis
### Overview
The image contains two side-by-side line charts comparing accuracy (Pass@1) against DTR (deep-thinking ratio) for the three GPT-OSS reasoning-level configurations: Low, Medium, and High. Each chart corresponds to a different benchmark ("AIME 25" and "GPQA-D"). The charts use color-coded lines with correlation coefficients (r-values) indicating relationships between DTR and accuracy.
### Components/Axes
- **Y-Axis**: Accuracy (Pass@1) ranging from 0.45 to 0.90 (AIME 25) and 0.64 to 0.76 (GPQA-D).
- **X-Axis**: DTR (deep-thinking ratio) ranging from 0.125 to 0.200 (AIME 25) and 0.12 to 0.21 (GPQA-D).
- **Legend**: Located at the bottom, with:
- **Blue**: Low reasoning level
- **Green**: Medium reasoning level
- **Red**: High reasoning level
- **Correlation Coefficients (r)**:
- AIME 25: High (0.769), Medium (0.849), Low (0.937)
- GPQA-D: High (0.839), Medium (0.871), Low (0.980)
### Detailed Analysis
#### AIME 25 Chart
- **High reasoning level (Red)**:
- Accuracy starts at ~0.90 (DTR=0.125) and decreases slightly to ~0.89 (DTR=0.200).
- Correlation (r=0.769) indicates a moderate positive relationship.
- **Medium reasoning level (Green)**:
- Accuracy rises from ~0.70 (DTR=0.15) to ~0.75 (DTR=0.175).
- Correlation (r=0.849) suggests a stronger linear relationship.
- **Low reasoning level (Blue)**:
- Accuracy increases from ~0.45 (DTR=0.125) to ~0.60 (DTR=0.200).
- Correlation (r=0.937) shows the strongest linear trend.
#### GPQA-D Chart
- **High reasoning level (Red)**:
- Accuracy remains stable (~0.76 at DTR=0.12 to ~0.76 at DTR=0.21).
- Correlation (r=0.839) indicates a moderate relationship.
- **Medium reasoning level (Green)**:
- Accuracy fluctuates between ~0.68 (DTR=0.15) and ~0.72 (DTR=0.18).
- Correlation (r=0.871) reflects a slightly stronger trend than the high level.
- **Low reasoning level (Blue)**:
- Accuracy rises from ~0.64 (DTR=0.12) to ~0.68 (DTR=0.21).
- Correlation (r=0.980) demonstrates the strongest linear dependency.
### Key Observations
1. **Level-Accuracy Relationship**:
- The high reasoning level consistently achieves the highest accuracy but shows the weakest within-level correlation with DTR.
- The low reasoning level has the lowest accuracy but the strongest within-level correlation with DTR.
2. **Trend Variability**:
- High-level lines are relatively flat, suggesting accuracy that is robust to DTR variation.
- Low-level lines exhibit steep slopes, indicating accuracy that is sensitive to DTR.
3. **Benchmark Differences**:
- GPQA-D generally shows higher accuracy across all reasoning levels compared to AIME 25.
- GPQA-D's low-level line has the highest r-value (0.980), suggesting near-perfect linear alignment.
### Interpretation
Within every reasoning level, higher DTR is associated with higher accuracy, consistent with the main finding that DTR tracks effective reasoning. Across levels, however, higher reasoning levels attain better accuracy at comparable or lower DTR, consistent with the explanation above that longer reasoning chains enlarge the DTR denominator. The near-perfect correlation (r=0.980) for GPQA-D's low reasoning level suggests DTR is an especially reliable signal when the model reasons briefly, whereas at higher levels accuracy is already high and varies little with DTR.
</details>
Figure 7: Deep-thinking ratio (DTR) under different reasoning level configurations of OSS-120B models.
Appendix C Additional Analysis of Think@ $\bm{n}$
Here we provide additional analysis on how Think@ $n$ behaves when varying (i) the number of sampled responses $n$ and (ii) the retained top- $\eta$ percentage used for voting.
Effect of the number of samples n.
Figure 8(a) compares Think@ $n$ against Cons@ $n$ (i.e., self-consistency) as $n$ increases ( $n\in\{16,32,48\}$ ). Think@ $n$ improves monotonically with larger $n$ , and its advantage over Cons@ $n$ becomes more pronounced. Sampling more responses makes the correct cluster of answers larger and more likely to appear; Think@ $n$ exploits this enlarged candidate pool by preferentially selecting better samples, leading to stronger performance gains over Cons@ $n$ .
Effect of top- $\bm{\eta}$ percentage.
Figure 8(b) evaluates Think@ $n$ under different top- $\eta$ percentages ( $\eta\in\{25\%,50\%,75\%\}$ ). Performance peaks at $\eta$ =50% and decreases for both a smaller fraction ( $\eta$ =25%) and a larger fraction ( $\eta$ =75%). This suggests a trade-off: selecting too few samples reduces voting robustness, leaving fewer strong candidates to stabilize the majority vote, whereas selecting too many admits lower-quality samples that dilute the benefit of Think@ $n$ . Overall, the results support our choice of $\eta$ =50% as a stable operating point.
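The selection-and-voting step can be sketched as follows. This is a minimal illustration of the top- $\eta$ filtering and majority vote only (it omits the paper's prefix-based early rejection); the sample format and DTR values are hypothetical:

```python
from collections import Counter

def think_at_n(samples, eta=0.5):
    """Think@n sketch: keep the top-eta fraction of samples ranked by DTR,
    then majority-vote over their final answers.
    `samples` is a list of (answer, dtr) pairs (hypothetical format)."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    k = max(1, int(len(ranked) * eta))          # size of the retained pool
    votes = Counter(ans for ans, _ in ranked[:k])
    return votes.most_common(1)[0][0]

# toy pool of n = 8 sampled responses: (final answer, DTR), illustrative values;
# the wrong answer "207" is more frequent but comes from low-DTR samples
pool = [("207", 0.10), ("207", 0.11), ("207", 0.12), ("207", 0.13),
        ("293", 0.19), ("293", 0.18), ("293", 0.17), ("510", 0.09)]
print(think_at_n(pool, eta=0.5))   # high-DTR samples carry the vote: "293"
print(think_at_n(pool, eta=1.0))   # eta=1.0 reduces to plain self-consistency: "207"
```

With $\eta$ =1.0 the method degenerates to Cons@ $n$ , which is why the choice of $\eta$ matters: filtering must be aggressive enough to exclude low-effort samples but permissive enough to keep the vote stable.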
<details>
<summary>x14.png Details</summary>

### Visual Description
## Line Graph: Accuracy vs. Number of Samples n
### Overview
The image is a line graph comparing the accuracy of two methods, "Think@n" and "Cons@n," across different numbers of samples (n). The x-axis represents the number of samples (16, 32, 48), and the y-axis represents accuracy (0.900 to 0.945). Two lines are plotted: a blue line for "Think@n" and a teal line for "Cons@n."
### Components/Axes
- **X-axis**: "Number of Samples n" with discrete values at 16, 32, and 48.
- **Y-axis**: "Accuracy" with a scale from 0.900 to 0.945.
- **Legend**: Located in the bottom-right corner, with:
- **Blue circles**: "Think@n"
- **Teal circles**: "Cons@n"
### Detailed Analysis
- **Think@n (Blue Line)**:
- At **n = 16**: Accuracy ≈ 0.900
- At **n = 32**: Accuracy ≈ 0.930
- At **n = 48**: Accuracy ≈ 0.945
- **Trend**: Steep upward slope, indicating a significant increase in accuracy with more samples.
- **Cons@n (Teal Line)**:
- At **n = 16**: Accuracy ≈ 0.900
- At **n = 32**: Accuracy ≈ 0.915
- At **n = 48**: Accuracy ≈ 0.928
- **Trend**: Gradual upward slope, with a slower rate of improvement compared to "Think@n."
### Key Observations
1. Both methods start with the same accuracy (0.900) at n = 16.
2. "Think@n" consistently outperforms "Cons@n" across all sample sizes.
3. The gap between the two lines widens as n increases, suggesting "Think@n" scales more effectively with larger sample budgets.
4. "Cons@n" shows a plateau in accuracy gains after n = 32, while "Think@n" continues to improve.
### Interpretation
The data demonstrates that "Think@n" achieves higher accuracy than "Cons@n" as the number of samples increases. This suggests that "Think@n" makes more effective use of a larger sample budget. The widening gap between the two lines implies that "Think@n" is preferable when higher accuracy justifies drawing more samples, while the slower improvement of "Cons@n" indicates that plain majority voting saturates earlier. The graph highlights the importance of method selection based on sample budget and accuracy requirements.
</details>
(a) Comparison of different number of samples $n$ .
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Graph: Accuracy vs Top η (%)
### Overview
The image is a line graph comparing two performance metrics ("Think@n" and "Cons@n") across varying percentages of "Top η (%)". The graph shows a clear trend in accuracy for "Think@n" with a peak at 50% η, while "Cons@n" remains constant.
### Components/Axes
- **X-axis**: "Top η (%)" with markers at 25%, 50%, and 75%.
- **Y-axis**: "Accuracy" scaled from 0.900 to 0.945 in increments of 0.005.
- **Legend**: Located in the bottom-right corner, with:
- **Blue circles**: "Think@n" (solid line).
- **Gray dashed line**: "Cons@n".
### Detailed Analysis
- **Think@n (Blue Line)**:
- At 25% η: Accuracy = 0.900.
- At 50% η: Accuracy peaks at 0.945.
- At 75% η: Accuracy drops to 0.930.
- **Cons@n (Gray Dashed Line)**:
- Constant accuracy of 0.930 across all η values.
### Key Observations
1. **Peak Performance**: "Think@n" achieves its highest accuracy (0.945) at 50% η.
2. **Decline at Higher η**: Accuracy decreases by 0.015 when η increases from 50% to 75%.
3. **Baseline Comparison**: "Cons@n" maintains a stable accuracy of 0.930, serving as a reference point.
### Interpretation
The data suggests that "Think@n" performs optimally at 50% η, with a significant drop in accuracy at higher η values. This implies a potential trade-off between η percentage and performance for "Think@n". The constant "Cons@n" line indicates a stable baseline, possibly representing a control or alternative method. The sharp decline after 50% η for "Think@n" raises questions about overfitting or diminishing returns at higher η values. The graph highlights the importance of balancing η selection to maximize accuracy while avoiding performance degradation.
</details>
(b) Comparison of different top- $\eta$ percentage.
Figure 8: Analysis of Think@ $n$ with different number of samples $n$ and top- $\eta$ percent. (a) As $n$ increases, Think@ $n$ consistently benefits from larger candidate pools and exhibits a widening performance gap over Cons@ $n$ at higher $n$ . (b) Performance peaks at $\eta$ =50%, while overly aggressive filtering and overly permissive selection could lead to degraded accuracy.
Appendix D Prompts
We provide the prompts adopted in our experiments for the math tasks (AIME 2024, AIME 2025, HMMT 2025) in Table 4 and for GPQA in Table 5.
Table 4: Inference prompt for math tasks (AIME 2024, AIME 2025, HMMT 2025).
Prompt for AIME 2024, AIME 2025, HMMT 2025
Please reason step by step, and put your final answer within \boxed{}.
Table 5: Inference prompt for GPQA.
Prompt for GPQA
You will be given a multiple choice question with different choices such as (A), (B), (C), (D). Think step by step before giving a final answer to this question. Always finish your answer with ‘The final answer is \boxed{(X)}.’, where X is the correct answer choice. If none of the options match, choose the closest option as the final answer.
Appendix E Qualitative Examples
We present an example question from the AIME 2025 dataset along with its ground-truth answer (Table 6), and two outputs from OSS-120B-medium: one incorrect (Table 7) and one correct (Table 8). Notably, the incorrect output is substantially more verbose (27,724 tokens) and exhibits a lower DTR (13.9%), whereas the correct output is much more concise (3,725 tokens) and achieves a higher DTR (19.0%).
Table 6: An example question and its ground-truth answer from AIME 2025.
Example question from AIME 2025.
Circle $\omega_{1}$ with radius 6 centered at point $A$ is internally tangent at point $B$ to circle $\omega_{2}$ with radius 15. Points $C$ and $D$ lie on $\omega_{2}$ such that $\overline{BC}$ is a diameter of $\omega_{2}$ and $\overline{BC}\perp\overline{AD}$ . The rectangle $EFGH$ is inscribed in $\omega_{1}$ such that $\overline{EF}\perp\overline{BC}$ , $C$ is closer to $\overline{GH}$ than to $\overline{EF}$ , and $D$ is closer to $\overline{FG}$ than to $\overline{EH}$ , as shown. Triangles $\triangle DGF$ and $\triangle CHG$ have equal areas. The area of rectangle $EFGH$ is $\frac{m}{n}$ , where $m$ and $n$ are relatively prime positive integers. Find $m+n$ .
Ground truth answer.
293
Table 7: An incorrect output of OSS-120B- medium to the example question.
Incorrect output with DTR = 13.9% and token count = 27,724.
We need to interpret geometry. Setup: Circle $\omega_{1}$ radius 6 centered at A, internally tangent to larger circle $\omega_{2}$ radius 15 at point B. So B is common point where $\omega_{1}$ touches inside $\omega_{2}$ . Since internal tangency, distance AB = 15-6 = 9, with B on line AB. Let’s set coordinate system. Place A at origin? maybe easier: Let $\omega_{1}$ centered at (0,0) radius 6. Then B is on circle $\omega_{1}$ and also on $\omega_{2}$ . Since $\omega_{2}$ larger radius 15 and centered at O? Let’s denote O is center of $\omega_{2}$ . Distance OA = 9 (since AB = 9) and AB =9? Wait AB = distance between centers = 15-6 = 9 yes. B lies on line OA, outside? Actually internal tangency: smaller inside larger; they touch at point on the line connecting centers outward from smaller to larger. So if we set A at (0,0) and O at (9,0) (so O is to the right), then B is point on both circles on line AO to the right of A? The point outward from A towards O at distance 6 from A, i.e., B is at (6,0). Indeed distance from O to B: O(9,0) to B(6,0) distance 3 which is radius 15? Wait radius of $\omega_{2}$ is 15, O to B should be 15. But O(9,0) to B(6,0) distance 3, not 15. Something wrong. Let’s set O to be (0,0) maybe easier. Place $\omega_{2}$ centered at O (0,0), radius 15. $\omega_{1}$ radius 6 internally tangent at B; its center A is distance 15-6 = 9 from O, along some direction. Choose B as point of tangency on line OA outward from A to O? Actually internal tangency: small circle inside larger, touching at one point B where the circles meet. The line of centers passes through B. The distance from O to B is 15 (radius). The distance from A to B is 6. Since O to A is 9. B is on the line OA beyond A? Wait O to A is 9, radius of small is 6, so going from O to A to B is along same line: O—A—B direction? Since B is on circumference of both circles. The distance from O to B = 15 = OA + AB? OA=9, AB=6 gives OA+AB=15 so yes B is outward from O through A. 
So O at origin, A at distance 9 from O in direction e.g., positive x. Let O at (0,0), A at (9,0). Then B is further along x axis at O to B distance 15, so B at (15,0). Check: distance from A (9,0) to B (15,0) =6 correct. Good. Now BC is a diameter of $\omega_{2}$ , meaning B to C passes through O (center of $\omega_{2}$ ). Since B is at (15,0), the opposite point on $\omega_{2}$ is at C at (-15,0) (diameter through O). So C is at (-15,0). So line BC is along x-axis. Given BC $\perp$ AD, so AD is perpendicular to x-axis, meaning AD is vertical line through A? Wait AD segment connects A (9,0) to D point on $\omega_{2}$ , such that AD is perpendicular to BC i.e., vertical line through A (x=9). Since AD must intersect $\omega_{2}$ at point D. So D is point where vertical line x=9 meets $\omega_{2}$ (radius 15). Equation $x^{2}+y^{2}=225$. Substituting $x=9$ => $81+y^{2}=225$ => $y^{2}=144$ => $y=\pm 12$. Which one? Since orientation maybe D is above? Not specified. We’ll consider D at (9,12) (upper) maybe. Could also be lower but rectangle orientation may pick one. Now rectangle EFGH is inscribed in $\omega_{1}$ (center A (9,0) radius 6). The rectangle is oriented not necessarily axis-aligned? We have EF parallel to GH, FG parallel to EH. Given EF $\perp$ BC (BC horizontal), so EF (and GH) are vertical? Wait EF is perpendicular to BC, which is horizontal, so EF is vertical. So one side of rectangle is vertical. Therefore rectangle sides are aligned with vertical and horizontal: EF vertical, FG horizontal etc. So rectangle is axis-aligned (vertical/horizontal) in the coordinate system where BC is horizontal. Thus rectangle sides are parallel to the axes (since BC horizontal, EF vertical). Good. Now C is closer to GH than to EF. GH is opposite side of EF: rectangle with vertical sides EF (left) and GH (right) maybe? Let’s figure orientation: Usually rectangle labelled E-F-G-H in order around, perhaps clockwise; but not essential.
If EF is vertical, then FG is horizontal, GH vertical, HE horizontal. So EF left side, GH right side (if rectangle sits to the right of EF). Not sure. C is at (-15,0) far left of both. "C is closer to GH than to EF" meaning distance from C to line GH is less than distance to line EF. Since C is far left, that would mean GH is left side, EF right side? Wait distance from C to a vertical line is horizontal distance. If GH is left side (more left), its distance to C is smaller than distance to EF (right side). That matches C being closer to GH. So GH is left side of rectangle (more left), EF right side. Good. Similarly, D is closer to FG than to EH. Since FG is top side (horizontal), EH bottom side (horizontal). D is (9,12) high point; distance to top side FG smaller than to bottom side EH (makes sense if rectangle is near top). So FG is top side, EH bottom side. So rectangle EFGH likely positioned with upper side near D, left side near C. … Omitted for illustrative purposes … Seems no integer factorization. Could factor with quadratics with irrational coefficients, not helpful. Thus x may be irrational; thus k not integer rational. Therefore area is irrational? Conflict with problem statement. Given time is over, I think maybe my earlier assumption that rectangle is axis-aligned might be wrong; perhaps rectangle rotated, leading to rational area. Given complexity, I think answer maybe m+n = something like 207? Let’s guess? Not good. Given limited time, I will output placeholder. assistantfinal $207$
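Both transcripts converge on the same coordinate placement before diverging in their treatment of the rectangle. As a sanity check on that shared setup (a minimal sketch; point names follow the transcripts, which place O=(0,0), A=(9,0), B=(15,0), C=(-15,0), D=(9,12)):

```python
from math import dist  # Euclidean distance, available since Python 3.8

# Placement both outputs arrive at: omega_2 centered at O=(0,0), radius 15;
# omega_1 centered at A=(9,0), radius 6; tangency point B=(15,0);
# C=(-15,0) is the antipode of B; D=(9,12) lies on omega_2 above A.
O, A, B, C, D = (0, 0), (9, 0), (15, 0), (-15, 0), (9, 12)

assert dist(O, A) == 15 - 6                           # internal tangency: |OA| = R - r
assert dist(A, B) == 6                                # B lies on omega_1
assert dist(O, B) == dist(O, C) == dist(O, D) == 15   # B, C, D lie on omega_2
assert dist(B, C) == 30                               # BC is a diameter
assert D[0] == A[0]                                   # AD is vertical, so AD is perpendicular to BC
print("coordinate setup consistent")
```

The check passes, so the incorrect output above fails not in its coordinate setup but in its later handling of the rectangle constraints.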
Table 8: A correct output of OSS-120B-medium to the example question.
Correct output with DTR = 19.0% and token count = 3,725.
We need to parse geometry. Let’s define coordinates. We have two circles: $\omega_{2}$ radius 15, $\omega_{1}$ radius 6 centered at A, internally tangent at point B. So A is inside $\omega_{2}$ , distance AB = radius difference = 15 - 6 = 9. B is the point of tangency, along line AB. Let’s set coordinate system: Let AB be vertical? Choose B at origin maybe easier. Let’s position $\omega_{2}$ centered at O (call it O2) maybe O? Actually $\omega_{2}$ ’s center is some point O (call it O). $\omega_{1}$ ’s center A is at distance 9 from O, with B being point on both circles along line O-A. Let’s place O at (0,0). Let AB be along positive x-axis: place B at (15,0) maybe? Then $\omega_{2}$ radius 15 centered at O (0,0). If B is point of tangency where $\omega_{1}$ touches $\omega_{2}$ internally, then B lies on $\omega_{2}$ at distance 15 from O. So B at (15,0). Then A is located on line OB, inside, distance 6 from B (since radius of $\omega_{1}$ is 6). And O to A distance is 9 (since OA = 15-6 = 9). So O (0,0), B (15,0). Then A is at (9,0) (since O to A is 9 along same line). Yes A at (9,0). Then $\omega_{1}$ radius 6 centered at (9,0). Good. Now Points C and D lie on $\omega_{2}$ such that BC is a diameter of $\omega_{2}$ . So BC passes through center O and length 30 (diameter). Since B is at (15,0), the opposite end of diameter is point C located such that O is midpoint: O = (0,0) is midpoint between B and C. So C = (-15,0). Indeed because B (15,0) opposite across origin yields C (-15,0). So BC is horizontal line (x-axis). Good. … Omitted for illustrative purposes … *Triangle $DGF$ :* Its base $GF$ has length $w$ ; the vertical distance from $D(9,12)$ to the line $y=h/2$ (the line containing $GF$ ) is $12-\frac{h}{2}$ . Hence $[DGF]=\frac{1}{2}w\Bigl(12-\frac{h}{2}\Bigr).$ *Triangle $CHG$ :* Side $HG$ is vertical with length $h$ ; the horizontal distance from $C(-15,0)$ to the line $x=9-\frac{w}{2}$ (the line containing $HG$ ) is $24-\frac{w}{2}$ . 
Hence $[CHG]=\frac{1}{2}h\Bigl(24-\frac{w}{2}\Bigr).$ The problem states $[DGF]=[CHG]$ . Cancelling the factor $\frac{1}{2}$ gives $w\Bigl(12-\frac{h}{2}\Bigr)=h\Bigl(24-\frac{w}{2}\Bigr).$ The terms $-\frac{wh}{2}$ appear on both sides and cancel, leaving $12w=24h\quad\Longrightarrow\quad w=2h.$ — Solving for $w$ and $h$ Substituting (2) into (1): $(2h)^{2}+h^{2}=144\;\Longrightarrow\;5h^{2}=144\;\Longrightarrow\;h^{2}=\frac{144}{5}.$ Hence $h=\frac{12}{\sqrt{5}}=\frac{12\sqrt{5}}{5},\qquad w=2h=\frac{24}{\sqrt{5}}=\frac{24\sqrt{5}}{5}.$ The area of rectangle $EFGH$ is $\text{Area}=wh=2h^{2}=2\left(\frac{144}{5}\right)=\frac{288}{5}.$ Thus $m=288,\;n=5$ (relatively prime), and $m+n=288+5=293.$ $\boxed{293}$
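The closing algebra of the correct transcript can be replayed with exact rationals. A small check (symbols $w$, $h$ as in the derivation; the inscribed-rectangle constraint $w^{2}+h^{2}=144$ comes from the corners of $EFGH$ lying on $\omega_{1}$ of radius 6):

```python
from fractions import Fraction as F

# From [DGF] = [CHG] the transcript derives w = 2h, and the inscribed-rectangle
# constraint gives w^2 + h^2 = 144, hence 5h^2 = 144.
h2 = F(144, 5)           # h^2
w2 = 4 * h2              # w^2 = (2h)^2
assert w2 + h2 == 144    # diagonal of EFGH equals the diameter of omega_1

# Both triangle areas reduce to 12h - h^2/2 once w = 2h, so the area condition
# holds identically; the rectangle area is wh = (2h)h = 2h^2.
area = 2 * h2
assert area == F(288, 5)
print(area.numerator + area.denominator)   # -> 293
```

The exact-arithmetic check reproduces the transcript's final answer $m+n=293$.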