# Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models
**Authors**: Zhenyuan Guo, Tong Chen, Wenlong Meng, Chen Gong, Xin Yu, Chengkun Wei, Wenzhi Chen
## Abstract
Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur a substantial memory footprint and computational overhead, bottlenecking LRMs’ efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncovers an interesting phenomenon: only some decision-critical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens and retains only their associated Key-Value (KV) cache states during inference, evicting the remaining redundant entries to optimize efficiency. Across six benchmarks, DynTS surpasses state-of-the-art KV cache compression methods, improving Pass@1 by $2.6\%$ under the same budget. Compared to vanilla Transformers, it reduces inference latency by $1.84–2.62\times$ and peak KV-cache memory footprint by $3.32–5.73\times$ without compromising LRMs’ reasoning performance. The code is available at https://github.com/Robin930/DynTS.
**Keywords**: KV Cache Compression, Efficient LRM, LLM
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Token Handling Methods Comparison
### Overview
The diagram compares five methods (Transformers, SnapKV, StreamingLLM, H2O, DynTS) for handling tokens in a sequence, highlighting how each method retains or processes tokens to generate a final answer. It includes a legend mapping colors to token types and a section showing predicted token importance to the answer.
### Components/Axes
- **Title**: "Methods" (top row) and "Tokens" (horizontal axis).
- **Legend** (right side):
- Gray: All Tokens
- Orange: High Importance Prefill Tokens
- Yellow: Attention Sink Tokens
- Blue: Local Tokens
- Green: Heavy-Hitter Tokens
- Red: Predicted Importance Tokens
- **Methods Rows** (left to right):
1. **Transformers**: All gray blocks (All Tokens).
2. **SnapKV**: Orange blocks in the "Observation Window" (highlighted by dashed lines).
3. **StreamingLLM**: Yellow (Attention Sink Tokens) and blue (Local Tokens).
4. **H2O**: Green (Heavy-Hitter Tokens) and blue (Local Tokens).
5. **DynTS**: Red (Predicted Importance Tokens) and blue (Local Tokens).
- **Predicted Importance Section** (bottom):
- Arrows point from red/blue blocks in each method to the "Answer" label, indicating token contribution to the final output.
### Detailed Analysis
- **Transformers**: Retains all tokens (gray blocks), no filtering.
- **SnapKV**: Focuses on orange blocks within the observation window (middle section of the sequence).
- **StreamingLLM**: Uses yellow (Attention Sink Tokens) and blue (Local Tokens), suggesting a focus on local context.
- **H2O**: Prioritizes green (Heavy-Hitter Tokens) and blue (Local Tokens), emphasizing critical tokens.
- **DynTS**: Highlights red (Predicted Importance Tokens) and blue (Local Tokens), with arrows showing their direct influence on the answer.
- **Legend Consistency**: Colors in each row match the legend (e.g., orange in SnapKV corresponds to "High Importance Prefill Tokens").
### Key Observations
- **Token Retention Strategy**: Methods vary from retaining all tokens (Transformers) to selective filtering (others).
- **Importance Indicators**: Red and blue tokens in DynTS are explicitly linked to the answer via arrows, suggesting dynamic importance prediction.
- **Color Coding**: Each method’s token types are visually distinct, aiding comparison.
### Interpretation
The diagram illustrates how different token-handling methods balance token retention and processing efficiency. Transformers retain all tokens, while others filter based on importance (e.g., SnapKV’s observation window, H2O’s heavy-hitter tokens). DynTS introduces a predictive layer, emphasizing tokens deemed critical for the answer. The use of color-coded tokens and directional arrows clarifies the flow from token selection to answer generation, highlighting the trade-offs between context retention and computational efficiency.
</details>
<details>
<summary>x2.png Details</summary>

### Visual Description
## Bar Chart: Model Performance Comparison
### Overview
The chart compares the accuracy and KV cache length of various language models (LMs) on a classification task. It uses grouped bars for accuracy (%) and a dashed line for KV cache length, with models listed on the x-axis.
### Components/Axes
- **X-axis**: Model names (Transformers, DynTS, Window StreamingLLM, SepLLM, H2O, SnapKV, R-KV)
- **Y-axis (left)**: Accuracy (%) ranging from 0 to 70
- **Y-axis (right)**: KV Cache Length ranging from 2k to 20k
- **Legend**:
- Gray bars: Accuracy (%)
- Blue dashed line: KV Cache Length
- **Positioning**:
- Legend: Top center
- Blue dashed line: Overlaid on bars, spanning all x-axis categories
### Detailed Analysis
- **Accuracy (%)**:
- Transformers: 63.6%
- DynTS: 63.5%
- Window StreamingLLM: 49.4%
- SepLLM: 51.6%
- H2O: 54.5%
- SnapKV: 58.8%
- R-KV: 59.8%
- **KV Cache Length**:
- Constant at ~10k across all models (blue dashed line)
### Key Observations
1. **Accuracy Variance**:
- Transformers and DynTS achieve the highest accuracy (63.6% and 63.5%, respectively), with a negligible 0.1% difference.
- Other models show significantly lower accuracy, with Window StreamingLLM at the lowest (49.4%).
2. **KV Cache Consistency**:
- All models maintain identical KV cache length (~10k), indicating no trade-off between cache efficiency and accuracy in this dataset.
3. **Performance Gradient**:
- Accuracy decreases from Transformers/DynTS to Window StreamingLLM, then gradually improves through H2O, SnapKV, and R-KV.
### Interpretation
The data suggests that KV cache length is not a limiting factor for accuracy in this benchmark, as all models maintain the same cache efficiency. The stark accuracy gap between Transformers/DynTS and other models implies architectural or training differences rather than resource constraints. The near-identical performance of Transformers and DynTS highlights potential optimization opportunities for newer models. Notably, the absence of a clear correlation between cache length and accuracy challenges assumptions about hardware-software co-design trade-offs in LLM deployment.
</details>
Figure 1: (Left) Comparison of token selection strategies across different KV cache eviction methods. In each row, colored blocks denote the retained high-importance tokens, while grey blocks represent the evicted tokens during LRM inference. (Right) The average reasoning performance and KV cache memory footprint on DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B across six reasoning benchmarks.
Recent advancements in Large Reasoning Models (LRMs) (Chen et al., 2025) have significantly strengthened the reasoning capabilities of Large Language Models (LLMs). Representative models such as DeepSeek-R1 (Guo et al., 2025), Gemini-3-Pro (DeepMind, 2025), and ChatGPT-5.2 (OpenAI, 2025) support a deep-thinking mode to strengthen reasoning on challenging mathematics, programming, and science tasks (Zhang et al., 2025b). During inference, these models spend a substantial number of intermediate thinking tokens on reflection, reasoning, and verification to derive the correct response (Feng et al., 2025). However, the thinking process incurs an immense KV cache memory footprint and attention-related computational cost, posing a critical deployment challenge in resource-constrained environments.
KV cache compression techniques aim to optimize the cache state by periodically evicting non-essential tokens (Shi et al., 2024; WEI et al., 2025; Liu et al., 2025b; Qin et al., 2025), typically guided by predefined token retention rules (Chen et al., 2024; Xiao et al., 2024; Devoto et al., 2024) or attention-based importance metrics (Zhang et al., 2023; Li et al., 2024; Choi et al., 2025). Nevertheless, incorporating them into the inference process of LRMs faces two key limitations: (1) methods designed for long-context prefilling are ill-suited to the short-prefill, long-decoding scenarios of LRMs; (2) methods tailored for long decoding struggle to match the reasoning performance of the full-KV baseline (SOTA $60.9\%$ vs. Full KV $63.6\%$, Fig. 1 Right). Specifically, in LRM inference, the model conducts an extensive reasoning process and then summarizes the reasoning content to derive the final answer (Minegishi et al., 2025). This implies that the correctness of the final answer relies on the thinking tokens within the preceding reasoning (Bogdan et al., 2025). However, existing compression methods cannot identify the tokens that are essential to the future answer, leading to a significant misalignment between the retained tokens and the critical thinking tokens and, in turn, to degraded reasoning performance.
To address this issue, we analyze the LRM’s generated content and study which tokens are most important for steering the model toward the final answer. Prior work points out that attention weights capture inter-token dependencies (Vaswani et al., 2017; Wiegreffe and Pinter, 2019; Bogdan et al., 2025) and can therefore serve as a metric for assessing token importance. Consequently, we decompose the generated content into a reasoning trace and a final answer, and then calculate the importance score of each thinking token in the trace by aggregating the attention weights from the answer to the thinking tokens. We find that only a small subset of thinking tokens ($\sim 20\%$ of tokens in the reasoning trace, see Section § 3.1) have significant scores, suggesting they may be critical for the final answer. To validate this hypothesis, we retain these tokens and prompt the model to directly generate the final answer. Experimental results show that the model maintains accuracy close to that of using the whole KV cache. This reveals a Pareto principle in LRMs (the Pareto principle, also known as the 80/20 rule, posits that $20\%$ of critical factors drive $80\%$ of the outcomes; in this paper, it implies that a small fraction of pivotal thinking tokens dictates the correctness of the model’s final response): only a small subset of decision-critical thinking tokens with high importance scores drives the model toward the final answer, while the remaining tokens contribute negligibly.
Based on the above insight, we introduce DynTS (Dynamic Thinking-Token Selection), a novel method for dynamically predicting and selecting decision-critical thinking tokens on-the-fly during decoding, as shown in Fig. 1 (Left). The key innovation of DynTS is the integration of a trainable, lightweight Importance Predictor at the final layer of LRMs, enabling the model to dynamically predict the importance of each thinking token to the final answer. By utilizing importance scores derived from sampled reasoning traces as supervision signals, the predictor learns to distinguish critical tokens from redundant ones. During inference, DynTS manages memory through a dual-window mechanism: generated tokens flow from a Local Window (which captures recent context) into a Selection Window (which stores long-term history). Once the KV cache reaches the budget, the system retains the KV cache of tokens with higher predicted importance scores in the Selection Window and of all tokens in the Local Window (Zhang et al., 2023; Chen et al., 2024). By evicting redundant KV cache entries, DynTS effectively reduces both system memory pressure and computational overhead. We also theoretically analyze the computational overhead introduced by the importance predictor and the savings from cache eviction, and derive a Break-Even Condition for net computational gain.
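The dual-window retention rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the eviction trigger, per-layer handling, and the treatment of question tokens are simplified, and `select_retained` is a hypothetical helper name. It assumes the budget exceeds the local-window size.

```python
def select_retained(scores, budget, local_window):
    """Sketch of a dual-window KV eviction policy.

    scores[i] is the predicted importance of token i. The most recent
    `local_window` tokens (Local Window) are always kept; the remaining
    budget is allocated to the highest-scoring older tokens (Selection
    Window). Assumes budget > local_window.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))                  # under budget: keep everything
    local = list(range(n - local_window, n))   # Local Window: recent context
    k = budget - local_window                  # slots left for older tokens
    selection = sorted(range(n - local_window),
                       key=lambda i: scores[i], reverse=True)[:k]
    return sorted(selection) + local           # indices of retained KV entries

# e.g. keep tokens 0 and 2 (highest scores) plus the last two local tokens
kept = select_retained([0.9, 0.1, 0.5, 0.2, 0.3, 0.8], budget=4, local_window=2)
```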
We then train the Importance Predictor on the MATH (Hendrycks et al., 2021) training set and evaluate DynTS on six other reasoning benchmarks. Fig. 1 (Right) reports the reasoning performance and KV cache length compared with SOTA KV cache compression methods. Our method reduces the KV cache memory footprint by $3.32–5.73\times$ without compromising reasoning performance relative to the full-cache Transformer baseline. Under the same budget, it achieves a $2.6\%$ accuracy improvement over the SOTA KV cache compression approach.
## 2 Preliminaries
#### Large Reasoning Model (LRM).
Unlike standard LLMs that directly generate answers, LRMs incorporate an intermediate reasoning process prior to producing the final answer (Chen et al., 2025; Zhang et al., 2025a; Sui et al., 2025). Given a user prompt $\mathbf{x}=(x_{1},\dots,x_{M})$ , the model's generated content is denoted $\mathbf{y}$ , which can be decomposed into a reasoning trace $\mathbf{t}$ and a final answer $\mathbf{a}$ . The trace is delimited by a start tag <think> and an end tag </think>. Formally, the model output is defined as:
$$
\mathbf{y}=[\texttt{<think>},\mathbf{t},\texttt{</think>},\mathbf{a}], \tag{1}
$$
where the trace $\mathbf{t}=(t_{1},\dots,t_{L})$ is composed of $L$ thinking tokens, and $\mathbf{a}=(a_{1},\dots,a_{K})$ is the answer composed of $K$ tokens. During autoregressive generation, the model first conducts a reasoning phase that produces thinking tokens $t_{i}$ , followed by an answer phase that generates answer tokens $a_{j}$ . This process is formally defined as:
$$
P(\mathbf{y}|\mathbf{x})=\underbrace{\prod_{i=1}^{L}P(t_{i}|\mathbf{x},\mathbf{t}_{<i})}_{\text{Reasoning Phase}}\cdot\underbrace{\prod_{j=1}^{K}P(a_{j}|\mathbf{x},\mathbf{t},\mathbf{a}_{<j})}_{\text{Answer Phase}} \tag{2}
$$
Since the length of the reasoning trace significantly exceeds that of the final answer ( $L\gg K$ ) (Xu et al., 2025), we focus on selecting critical thinking tokens in the reasoning trace to reduce memory and computational overhead.
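The decomposition in Eq. (1) can be made concrete with a small parsing sketch. This is an illustrative helper (the name `split_reasoning` is ours, not part of DynTS) that splits a raw generation into its reasoning trace and final answer using the `<think>`/`</think>` delimiters:

```python
import re

def split_reasoning(output: str):
    """Split raw model output y into (reasoning trace t, final answer a),
    per the <think>...</think> delimiters of Eq. (1)."""
    match = re.search(r"<think>(.*?)</think>(.*)", output, flags=re.DOTALL)
    if match is None:                     # no reasoning trace emitted
        return "", output.strip()
    return match.group(1).strip(), match.group(2).strip()

trace, answer = split_reasoning(
    "<think>2+2: add the units digits, 2+2=4.</think>The answer is 4."
)
```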
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Importance Score Analysis Across Question and Thinking Phases
### Overview
The image displays a dual-phase line chart comparing importance scores across two cognitive processes: "Question" (left) and "Thinking" (right). The chart tracks importance scores (y-axis) against sequential steps (x-axis) with distinct visual patterns in each phase. A red dashed line represents the mean score (0.126) and ratio (0.211), serving as a reference point for interpretation.
### Components/Axes
- **Y-Axis (Importance Score)**:
- Labeled "Importance Score" with a gradient from "Low" (bottom) to "High" (top).
- Scale ranges from 0 (low) to 1 (high), though no intermediate markers are visible.
- **X-Axis (Step)**:
- Labeled "Step" with numerical markers at 0, 200, 4000, 6000, 8000, 10000, and 12000.
- Divided into two sections: "Question" (0–2000 steps) and "Thinking" (2000–12000 steps).
- **Legend**:
- Positioned on the left, with blue representing "High" importance and white representing "Low" importance.
- **Red Dashed Line**:
- Labeled "Mean Score: 0.126; Ratio: 0.211" in red text, spanning both phases.
### Detailed Analysis
#### Question Phase (0–2000 Steps)
- **Visual Trend**:
- High variability with frequent sharp peaks (importance scores approaching 1) and troughs (scores near 0).
- Peaks occur at irregular intervals, suggesting episodic high-importance moments.
- **Key Data Points**:
- Multiple spikes exceed the red dashed line (mean score), indicating critical question-formation events.
#### Thinking Phase (2000–12000 Steps)
- **Visual Trend**:
- Lower overall variability compared to the "Question" phase, with most scores clustering below the red dashed line.
- Intermittent spikes (e.g., near 4000, 6000, and 12000 steps) suggest sporadic high-importance insights.
- **Key Data Points**:
- A prominent peak at ~6000 steps exceeds the mean score, potentially representing a pivotal realization.
- Final spike at 12000 steps aligns with the "Question" phase's pattern, possibly indicating a resolution or conclusion.
### Key Observations
1. **Phase Contrast**:
- The "Question" phase exhibits higher dynamic importance scores, while the "Thinking" phase is more stable but less intense.
2. **Mean Score Context**:
- The red dashed line (mean = 0.126) acts as a baseline, with most "Thinking" phase scores falling below it.
3. **Ratio Interpretation**:
- The ratio (0.211) likely reflects the proportion of steps with scores above the mean, though this requires domain-specific validation.
4. **Temporal Patterns**:
- Spikes in both phases occur at irregular intervals, suggesting non-linear cognitive processes.
### Interpretation
The chart illustrates the cognitive dynamics of problem-solving, where the "Question" phase is characterized by bursts of high-importance moments (e.g., formulating critical queries), while the "Thinking" phase involves sustained but lower-intensity processing with occasional breakthroughs. The mean score (0.126) and ratio (0.211) quantify the overall distribution, indicating that high-importance events are relatively rare but impactful. The final spike at 12000 steps may signify a resolution or synthesis of earlier insights, reinforcing the cyclical nature of cognitive work. The data underscores the importance of tracking both qualitative (spike patterns) and quantitative (mean/ratio) metrics to understand decision-making processes.
</details>
Figure 2: Importance scores of question tokens and thinking tokens in a reasoning trace, computed based on attention contributions to the answer. Darker colors indicate higher importance. The red dashed line shows the mean importance score, and the annotated ratio indicates the fraction of tokens with importance above the mean.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Ratio (%)
### Overview
The chart displays four data series representing accuracy percentages across varying ratios (2% to 50%). The y-axis shows accuracy (%), and the x-axis shows ratio (%). Four lines are plotted: "Full" (gray dashed), "Bottom" (blue squares), "Random" (green triangles), and "Top" (red circles). The legend is positioned in the upper-right corner.
### Components/Axes
- **X-axis (Ratio %)**: Labeled "Ratio (%)" with ticks at 2, 4, 6, 8, 10, 20, 30, 40, 50.
- **Y-axis (Accuracy %)**: Labeled "Accuracy (%)" with ticks at 60, 65, 70, 75, 80, 85, 90, 95.
- **Legend**: Located in the upper-right corner, with four entries:
- **Full**: Gray dashed line with "X" markers.
- **Bottom**: Blue squares.
- **Random**: Green triangles.
- **Top**: Red circles.
### Detailed Analysis
1. **Full (Gray Dashed Line)**:
- Maintains a flat trend at ~95% accuracy across all ratios.
- No significant variation observed.
2. **Top (Red Circles)**:
- Starts at ~88% accuracy at 2% ratio.
- Increases to ~92% by 10% ratio.
- Plateaus near 95% after 20% ratio.
3. **Random (Green Triangles)**:
- Begins at ~63% accuracy at 2% ratio.
- Dips slightly to ~62% at 4% ratio.
- Steadily rises to ~85% at 50% ratio.
4. **Bottom (Blue Squares)**:
- Starts at ~66% accuracy at 2% ratio.
- Drops to ~63% at 4% ratio.
- Gradually increases to ~80% at 50% ratio.
### Key Observations
- **Highest Accuracy**: "Full" and "Top" lines dominate, with "Full" being the most consistent.
- **Significant Growth**: "Random" and "Bottom" lines show gradual improvement as ratio increases, with "Random" surpassing "Bottom" after ~20% ratio.
- **Diminishing Returns**: "Top" line plateaus near 95% after 20% ratio, suggesting limited gains beyond this point.
- **Initial Dip**: Both "Random" and "Bottom" lines experience minor accuracy drops between 2% and 4% ratios.
### Interpretation
The data suggests that higher ratios generally correlate with improved accuracy, particularly for "Top" and "Random" series. The "Full" line likely represents a theoretical maximum or baseline accuracy. The "Top" line’s plateau indicates diminishing returns after 20% ratio, while the "Random" line’s steady rise implies that variability in data may enhance performance as more data is included. The "Bottom" line’s slower improvement could reflect a model or approach less sensitive to increased data volume. The initial dip in "Random" and "Bottom" lines at 4% ratio warrants further investigation into potential data quality or sampling issues at lower ratios.
</details>
<details>
<summary>x5.png Details</summary>

### Visual Description
## Radar Chart: Performance Comparison Across Datasets
### Overview
The image is a radar chart comparing four data series ("Full," "Bottom," "Random," "Top") across six labeled axes: AMC23, AIME25, GPQA-D, GAOKAO2023EN, AIME24, and MATH500. The radial axis ranges from 0 to 100. Each data series is represented by a distinct line and marker style, with shaded regions indicating variability or confidence intervals.
### Components/Axes
- **Axes**:
- AMC23 (top-left)
- AIME25 (top-right)
- GPQA-D (bottom-left)
- GAOKAO2023EN (bottom-center)
- AIME24 (bottom-right)
- MATH500 (top-center)
- **Legend**:
- **Full**: Gray star markers, solid line
- **Bottom**: Blue square markers, dashed line
- **Random**: Green triangle markers, dotted line
- **Top**: Red circle markers, bold line
- **Radial Scale**: 0–100, with tick marks at 20, 40, 60, 80, 100.
### Detailed Analysis
1. **AMC23**:
- **Full**: ~85 (gray star)
- **Bottom**: ~70 (blue square)
- **Random**: ~65 (green triangle)
- **Top**: ~90 (red circle)
2. **AIME25**:
- **Full**: ~75
- **Bottom**: ~60
- **Random**: ~55
- **Top**: ~80
3. **GPQA-D**:
- **Full**: ~50
- **Bottom**: ~40
- **Random**: ~35
- **Top**: ~60
4. **GAOKAO2023EN**:
- **Full**: ~70
- **Bottom**: ~55
- **Random**: ~50
- **Top**: ~85
5. **AIME24**:
- **Full**: ~90
- **Bottom**: ~75
- **Random**: ~65
- **Top**: ~95
6. **MATH500**:
- **Full**: ~80
- **Bottom**: ~60
- **Random**: ~55
- **Top**: ~90
### Key Observations
- **Top** (red) consistently achieves the highest scores across all datasets, with values ranging from 60 (GPQA-D) to 95 (AIME24).
- **Full** (gray) performs second-best, with scores between 50 (GPQA-D) and 90 (AIME24).
- **Random** (green) shows the lowest performance, with scores between 35 (GPQA-D) and 65 (AIME24).
- **Bottom** (blue) has intermediate scores, ranging from 40 (GPQA-D) to 75 (AIME24).
- The **Top** series demonstrates the most consistent dominance, particularly in AIME24 and MATH500.
### Interpretation
The chart suggests a hierarchical performance structure:
1. **Top** (red) outperforms all other methods across all datasets, indicating it may represent an optimal or gold-standard approach.
2. **Full** (gray) acts as a mid-tier performer, suggesting it is a robust but suboptimal solution.
3. **Random** (green) and **Bottom** (blue) underperform, with "Random" showing particularly weak results in GPQA-D and AIME25. This could imply that random selection or baseline methods are ineffective for these tasks.
4. The shaded regions (likely representing confidence intervals or variability) are narrowest for **Top**, indicating higher reliability in its performance metrics.
The data highlights a clear stratification of effectiveness, with **Top** methods consistently achieving ~20–30% higher scores than **Full**, and **Random** methods lagging by ~40–50% in critical datasets like GPQA-D and AIME25. This pattern underscores the importance of structured, non-random approaches in these evaluation contexts.
</details>
Figure 3: (Left) Reasoning performance trends as a function of the thinking-token retention ratio, where the $x$ -axis indicates the retention percentage and the $y$ -axis is the accuracy. (Right) Accuracy across all datasets when retaining $30\%$ of the thinking tokens.
#### Attention Mechanism.
The attention mechanism is a core component of Transformer-based LRMs, instantiated as Multi-Head Attention (Vaswani et al., 2017), Grouped-Query Attention (Ainslie et al., 2023), and their variants. To highlight the memory challenges in LRMs, we formulate the attention computation at the token level. Consider decoding step $t$ . Let $\mathbf{h}_{t}\in\mathbb{R}^{d}$ be the input hidden state of the current token. The model projects $\mathbf{h}_{t}$ into query, key, and value vectors:
$$
\mathbf{q}_{t}=\mathbf{W}_{Q}\mathbf{h}_{t},\quad\mathbf{k}_{t}=\mathbf{W}_{K}\mathbf{h}_{t},\quad\mathbf{v}_{t}=\mathbf{W}_{V}\mathbf{h}_{t}, \tag{3}
$$
where $\mathbf{W}_{Q},\mathbf{W}_{K},\mathbf{W}_{V}$ are learnable projection matrices. The query $\mathbf{q}_{t}$ attends to the keys of all preceding positions $j\in\{1,\dots,t\}$ . The attention weight $\alpha_{t,j}$ between the current token $t$ and a past token $j$ is:
$$
\alpha_{t,j}=\frac{\exp(e_{t,j})}{\sum_{i=1}^{t}\exp(e_{t,i})},\qquad e_{t,j}=\frac{\mathbf{q}_{t}^{\top}\mathbf{k}_{j}}{\sqrt{d_{k}}}. \tag{4}
$$
These scores represent the relevance of the current step to the $j$ -th token. Finally, the output of the attention head $\mathbf{o}_{t}$ is the weighted sum of all historical value vectors:
$$
\mathbf{o}_{t}=\sum_{j=1}^{t}\alpha_{t,j}\mathbf{v}_{j}. \tag{5}
$$
As Equation 5 implies, calculating $\mathbf{o}_{t}$ requires access to the entire sequence of past keys and values $\{\mathbf{k}_{j},\mathbf{v}_{j}\}_{j=1}^{t-1}$ . In standard implementations, these vectors are stored in the KV cache to avoid redundant computation (Vaswani et al., 2017; Pope et al., 2023). In LRM inference, the reasoning trace is exceptionally long, imposing a significant memory bottleneck and increasing computational overhead.
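Eqs. (3)–(5) can be traced with a minimal single-head decode step. This is a didactic NumPy sketch (no multi-head or RoPE details), showing why the cache grows by one key/value row per generated token:

```python
import numpy as np

def decode_step(h_t, W_Q, W_K, W_V, K_cache, V_cache):
    """One single-head attention decode step (Eqs. 3-5).

    K_cache/V_cache hold keys/values of all previous tokens and grow by one
    row per step -- the memory bottleneck discussed in the text."""
    d_k = W_K.shape[0]
    q_t = W_Q @ h_t                                 # Eq. (3): query projection
    K_cache = np.vstack([K_cache, W_K @ h_t])       # append current key
    V_cache = np.vstack([V_cache, W_V @ h_t])       # append current value
    e = K_cache @ q_t / np.sqrt(d_k)                # Eq. (4): scaled scores
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                            # softmax over j <= t
    o_t = alpha @ V_cache                           # Eq. (5): weighted values
    return o_t, K_cache, V_cache

rng = np.random.default_rng(0)
d, d_k = 8, 4
W_Q, W_K, W_V = (rng.standard_normal((d_k, d)) for _ in range(3))
K_cache, V_cache = np.empty((0, d_k)), np.empty((0, d_k))
for _ in range(3):  # three decode steps: cache grows to 3 entries
    o_t, K_cache, V_cache = decode_step(
        rng.standard_normal(d), W_Q, W_K, W_V, K_cache, V_cache)
```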
## 3 Observations and Insight
This section presents the observed sparsity of thinking tokens and the Pareto Principle in LRMs, serving as the basis for DynTS. Detailed experimental settings and additional results are provided in Appendix § B.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Diagram: Machine Learning Model Training and Inference Pipeline
### Overview
The image depicts a technical diagram illustrating the training and inference processes of a machine learning model, specifically a Large Reasoning Model (LRM) with an Importance Predictor (IP). The diagram is divided into two main sections: **Training** (left) and **Inference** (right), with distinct color-coded components and token flow visualization.
---
### Components/Axes
#### Training Section (Left)
- **Input Tokens**: Gray squares at the bottom, labeled as the starting point.
- **Large Reasoning Model (LRM)**: Blue rectangle processing input tokens.
- **Thinking Tokens**: Red squares generated by the LRM, representing intermediate reasoning steps.
- **Mean Squared Error Loss**: Orange gradient overlay on thinking tokens, indicating error calculation.
- **Importance Predictor (IP)**: Red rectangle analyzing token importance.
- **Answer**: Final output derived from processed tokens.
#### Inference Section (Right)
- **Current Token (X)**: Input token at the start of the inference pipeline.
- **LRM with IP**: Blue component processing tokens during inference.
- **Output (Y)**: Predicted next token.
- **Predicted Score**: Numerical values (e.g., 0.2, 0.5) indicating token importance.
- **KV Cache Budget**: Memory constraint for retaining tokens.
- **Steps**: Sequential processing steps (e.g., "Reach Budget," "Select Critical Tokens").
---
### Detailed Analysis
#### Training Section
1. **Input Tokens → LRM**: Input tokens are fed into the LRM, which generates **Thinking Tokens** (red squares).
2. **Error Calculation**: The **Mean Squared Error Loss** (orange gradient) is applied to the thinking tokens to measure prediction accuracy.
3. **Importance Prediction**: The **Importance Predictor (IP)** evaluates token relevance, prioritizing critical tokens for retention.
4. **Aggregation**: Tokens are aggregated to form the final **Answer**.
#### Inference Section
1. **Token Processing Flow**:
- **Current Token (X)** is processed by the **LRM with IP**, producing **Output (Y)** and a **Predicted Score**.
- Tokens are evaluated against a **Reach Budget** (pink shaded area) to determine criticality.
- **Select Critical Tokens**: Tokens with scores above thresholds (e.g., 0.5) are retained; others are evicted.
- **KV Cache Budget**: Limits the number of tokens retained in memory (e.g., tokens A, B, E, G are kept).
2. **Token Retention Logic**:
- Tokens like **A** (score: ∞) and **B** (score: ∞) are always retained.
- Lower-scoring tokens (e.g., **C**: 0.2, **D**: 0.1) are evicted to optimize memory usage.
- Critical tokens (e.g., **E**: 0.5, **G**: 0.4) are retained for subsequent steps.
---
### Key Observations
1. **Training Efficiency**: The IP reduces computational overhead by focusing on high-importance tokens during training.
2. **Inference Optimization**: The KV Cache Budget enforces memory constraints, prioritizing tokens with scores ≥ 0.5 for retention.
3. **Token Dynamics**: High-scoring tokens (e.g., **A**, **B**) dominate the cache, while lower-scoring tokens (e.g., **C**, **D**) are evicted early.
4. **Iterative Steps**: The inference process repeats across multiple steps, refining token selection and output predictions.
---
### Interpretation
- **Purpose**: The diagram demonstrates how the LRM with IP balances accuracy and efficiency by dynamically managing token importance during training and inference.
- **Critical Tokens**: Tokens with infinite scores (A, B) are deemed essential, possibly representing ground-truth or high-confidence predictions.
- **Memory Constraints**: The KV Cache Budget ensures the model operates within resource limits, evicting less critical tokens to maintain performance.
- **Trade-offs**: While retaining high-scoring tokens improves output quality, aggressive eviction of low-scoring tokens may risk losing nuanced context.
---
### Uncertainties
- Exact numerical thresholds for the **Reach Budget** and **KV Cache Budget** are not explicitly defined.
- The relationship between **Mean Squared Error Loss** and token importance scores requires further clarification.
- The role of **Thinking Tokens** in the training phase is abstracted; their exact function in error calculation is unspecified.
</details>
Figure 4: Overview of DynTS. (Left) Importance Predictor Training. The upper heatmap visualizes attention weights, where orange intensity represents the importance of thinking tokens to the answer. The lower part shows an LRM integrated with an Importance Predictor (IP) to learn these importance scores. (Right) Inference with KV Cache Selection. The model outputs the next token and a predicted importance score of the current token. When the cache budget is reached, the selection strategy retains the KV cache of question tokens, local tokens, and top-k thinking tokens based on the predicted importance score.
### 3.1 Sparsity for Thinking Tokens
Previous works (Bogdan et al., 2025; Zhang et al., 2023; Singh et al., 2024) have shown that attention weights (Eq. 4) serve as a reliable proxy for token importance. Building on this insight, we calculate an importance score for each question and thinking token by accumulating the attention they receive from all answer tokens. Formally, the importance scores are defined as:
$$
I_{x_{j}}=\sum_{i=1}^{K}\alpha_{a_{i},x_{j}},\qquad I_{t_{j}}=\sum_{i=1}^{K}\alpha_{a_{i},t_{j}}, \tag{6}
$$
where $I_{x_{j}}$ and $I_{t_{j}}$ denote the importance scores of the $j$ -th question token $x_{j}$ and thinking token $t_{j}$ . Here, $\alpha_{a_{i},x_{j}}$ and $\alpha_{a_{i},t_{j}}$ represent the attention weights from the $i$ -th answer token $a_{i}$ to the corresponding question or thinking token, and $K$ is the total number of answer tokens. We perform full autoregressive inference on LRMs to extract attention weights and compute token-level importance scores for both question and thinking tokens.
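Given an attention map over a generated sequence, Eq. (6) reduces to a column sum over the answer rows. A minimal sketch (the per-head/per-layer aggregation is abstracted away here; in practice weights would be averaged across heads):

```python
import numpy as np

def importance_scores(attn, answer_slice):
    """Eq. (6): importance of each context token = attention it receives,
    summed over all K answer tokens.

    attn[i, j] is the attention weight from generated token i to context
    token j (e.g. already averaged over heads)."""
    return attn[answer_slice].sum(axis=0)

# toy map: 2 answer tokens attending over 5 context tokens (rows sum to 1)
attn = np.array([[0.5, 0.1, 0.1, 0.2, 0.1],
                 [0.4, 0.0, 0.3, 0.2, 0.1]])
scores = importance_scores(attn, slice(0, 2))  # per-token importance I_j
```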
Observation. As illustrated in Fig. 2, the question tokens (left panel) exhibit consistently significant and dense importance scores. In contrast, the thinking tokens (right panel) display a highly sparse distribution. Despite the extensive reasoning trace (exceeding 12k tokens), only $21.1\%$ of thinking tokens exceed the mean importance score. This indicates that the vast majority of reasoning steps exert only a marginal influence on the final answer.
Analysis. Following attention-based methods (Cai et al., 2025; Li et al., 2024; Cai et al., 2024), tokens with higher importance scores intuitively correspond to decision-critical reasoning steps, which are essential for the model to generate the final answer. The low-importance tokens serve as syntactic scaffolding or intermediate states that become redundant as reasoning progresses (we report the ratio of content words in Appendix B.2). Consequently, we hypothesize that the model maintains reasoning performance close to that of the full token sequence even when it selectively retains only these critical thinking tokens.
### 3.2 Pareto Principle in LRMs
To validate the aforementioned hypothesis, we retain all question tokens while preserving only the top-$p\%$ of thinking tokens ranked by importance score, and prompt the model to directly generate the final answer.
Observation. As illustrated in Fig. 3 (Left), the importance-based top-$p\%$ selection strategy substantially outperforms both random- and bottom-selection baselines. Notably, the model recovers nearly its full performance (grey dashed line) when retaining only the $\sim 30\%$ of thinking tokens with the highest importance scores. Fig. 3 (Right) further confirms this trend across six diverse datasets, where the performance polygon under the top-$30\%$ retention strategy almost completely overlaps with that of the full thinking-token set.
Insights. These empirical results reveal a Pareto Principle in LRM reasoning: only a small subset of thinking tokens ($\sim 30\%$) with high importance scores serve as “pivotal nodes” that are critical for the model to output a final answer, while the remaining tokens contribute negligibly to the outcome. This finding provides strong empirical support for KV cache compression in LRMs, indicating that it is possible to reduce memory footprint and computational overhead without sacrificing performance.
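The retention experiment behind this observation can be sketched as follows (the function name and toy scores are illustrative; question tokens are assumed kept separately):

```python
import numpy as np

def top_p_thinking(I_t, p):
    """Indices of the top-p% thinking tokens by importance score,
    returned in sequential order."""
    k = max(1, int(round(len(I_t) * p / 100)))
    return np.sort(np.argsort(I_t)[-k:])

I_t = np.array([0.1, 0.9, 0.2, 0.7, 0.05, 0.4, 0.3, 0.8, 0.15, 0.6])
kept = top_p_thinking(I_t, 30)   # retain the top-30% of thinking tokens
```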
## 4 Dynamic Thinking-Token Selection
Building on the Pareto Principle in LRMs, critical thinking tokens can be identified via the importance score computed by Equation 6. However, this computation requires the attention weights from the answer tokens to the thinking tokens, which are inaccessible until the model completes the entire decoding stage. To address this limitation, we introduce an Importance Predictor that dynamically estimates the importance score of each thinking token at inference time. Furthermore, we design a decoding-time KV cache Selection Strategy that retains critical thinking tokens and evicts redundant ones. We refer to this approach as DynTS (Dynamic Thinking-Token Selection); an overview is illustrated in Fig. 4.
### 4.1 Importance Predictor
#### Integrate Importance Predictor in LRMs.
Transformer-based Large Language Models (LLMs) typically consist of stacked Transformer blocks followed by a language modeling head (Vaswani et al., 2017), where the output of the final block serves as a feature representation of the current token. Building on this architecture, we attach an additional lightweight MLP head, termed the Importance Predictor (Huang et al., 2024), to the final hidden state. It predicts the importance score of the current thinking token during inference, capturing its contribution to the final answer. Formally, we define the modified LRM as a mapping function $\mathcal{M}$ that processes the input sequence $\mathbf{x}_{\leq t}$ to produce a dual-output tuple comprising the next token $x_{t+1}$ and the importance score $s_{x_{t}}$ of the current token:
$$
\mathcal{M}(\mathbf{x}_{\leq t})\rightarrow(x_{t+1},s_{x_{t}}) \tag{7}
$$
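Under Eq. 7, the predictor is a small MLP head reading the final hidden state; a minimal sketch with randomly initialized weights (the class name and initialization are ours; the $d \to 2d \to d/2 \to 1$ shape follows Section 5):

```python
import numpy as np

class ImportancePredictor:
    """MLP head (d -> 2d -> d/2 -> 1) reading the final hidden state of
    the current token and emitting a scalar importance score s_{x_t}."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.02, (d, 2 * d))
        self.W2 = rng.normal(0.0, 0.02, (2 * d, d // 2))
        self.W3 = rng.normal(0.0, 0.02, (d // 2, 1))

    def __call__(self, h):
        z = np.maximum(h @ self.W1, 0.0)   # ReLU
        z = np.maximum(z @ self.W2, 0.0)
        return float(z @ self.W3)          # scalar score for the token

ip = ImportancePredictor(d=64)
score = ip(np.ones(64))   # one token's hidden state -> one score
```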
#### Predictor Training.
To obtain supervision signals for training, we prompt the LRMs based on the training dataset to generate complete sequences denoted as $\{x_{1\dots M},t_{1\dots L},a_{1\dots K}\}$ , filtering out incorrect or incomplete reasoning. Here, $x$ , $t$ , and $a$ represent the question, thinking, and answer tokens, respectively. Based on the observation in Section § 3, the thinking tokens significantly outnumber answer tokens ( $L\gg K$ ), and question tokens remain essential. Therefore, DynTS only focuses on predicting the importance of thinking tokens. By utilizing the attention weights from answer to thinking tokens, we derive the ground-truth importance score $I_{t_{i}}$ for each thinking token according to Equation 6. Finally, the Importance Predictor parameters can be optimized by minimizing the Mean Squared Error (MSE) loss (Wang and Bovik, 2009) as follows:
$$
\mathcal{L}_{\text{MSE}}=\frac{1}{L}\sum_{i=1}^{L}(I_{t_{i}}-s_{t_{i}})^{2}. \tag{8}
$$
To preserve the LRMs’ original performance, we freeze the backbone parameters and optimize the Importance Predictor exclusively. The trained model can then predict the importance of thinking tokens to the answer. This paper focuses on mathematical reasoning tasks: we optimize the Importance Predictor only on the MATH training set and validate it across six other benchmarks (see Section § 6.1).
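The training objective of Eq. 8 reduces to a plain MSE over the thinking tokens; a minimal sketch (names are illustrative):

```python
import numpy as np

def importance_mse(I_true, s_pred):
    """Mean squared error between ground-truth importance scores I_{t_i}
    and the predictor's outputs s_{t_i}, averaged over thinking tokens."""
    I_true = np.asarray(I_true, dtype=float)
    s_pred = np.asarray(s_pred, dtype=float)
    return float(np.mean((I_true - s_pred) ** 2))

loss = importance_mse([0.9, 0.1, 0.4], [0.8, 0.2, 0.4])
```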
### 4.2 KV Cache Selection
During LRMs’ inference, we establish a maximum KV cache budget $B$ , which is composed of a question window $W_{q}$ , a selection window $W_{s}$ , and a local window $W_{l}$ , formulated as $B=W_{q}+W_{s}+W_{l}$ . Specifically, the question window stores the KV caches of question tokens generated during the prefilling phase, i.e., the window size $W_{q}$ is equal to the number of question tokens $M$ ( $W_{q}=M$ ). Since these tokens are critical for the final answer (see Section § 3), we assign an importance score of $+\infty$ to these tokens, ensuring their KV caches are immune to eviction throughout the inference process.
In the subsequent decoding phase, we maintain a sequential stream of tokens. Newly generated KV caches and their corresponding importance scores are sequentially appended to the selection window ( $W_{s}$ ) and the local window ( $W_{l}$ ). Once the total token count reaches the budget limit $B$ , the critical token selection process is triggered, as illustrated in Fig. 4 (Right). Within the selection window, we retain the KV caches of the top- $k$ tokens with the highest scores and evict the remainder. Simultaneously, drawing inspiration from (Chen et al., 2024; Zhang et al., 2023; Zhao et al., 2024), we maintain the KV caches within the local window to ensure the overall coherence of the subsequently generated sequence. This inference process continues until decoding terminates.
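One eviction event of the selection strategy can be sketched as follows (a simplified illustration; real KV entries are per-layer tensors, and the names and toy scores are ours; question tokens carry $+\infty$ so they are never evicted):

```python
import numpy as np

def select_kv(scores, W_q, W_l, ratio):
    """Keep all question tokens, the local window, and the top-k scored
    tokens of the selection window; return the retained indices."""
    n = len(scores)
    n_sel = n - W_q - W_l                        # size of the selection window
    k = max(1, int(round(n_sel * ratio)))        # tokens retained there
    top = np.argsort(scores[W_q:n - W_l])[-k:] + W_q
    keep = sorted(set(range(W_q)) | set(top.tolist()) | set(range(n - W_l, n)))
    return keep

scores = np.array([np.inf, np.inf,               # 2 question tokens (immune)
                   0.3, 0.9, 0.1, 0.8, 0.2,      # 5 selection-window tokens
                   0.0, 0.0])                    # 2 local-window tokens
keep = select_kv(scores, W_q=2, W_l=2, ratio=0.4)
```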
## 5 Theoretical Overhead Analysis
In DynTS, the KV cache selection strategy reduces computational overhead by constraining cache length, while the importance predictor introduces a slight overhead. In this section, we theoretically analyze the trade-off between these two components and derive the Break-even Condition required to achieve net computational gains.
Notation. Let $\mathcal{M}_{\text{base}}$ be the vanilla LRM with $L$ layers and hidden dimension $d$, and $\mathcal{M}_{\text{opt}}$ be the LRM with the Importance Predictor (MLP: $d\to 2d\to d/2\to 1$). We define the prefill length as $M$ and the current decoding step as $i\in\mathbb{Z}^{+}$. For vanilla decoding, the effective KV cache length grows linearly as $S_{i}^{\text{base}}=M+i$. In contrast, DynTS evicts $K$ tokens via KV Cache Selection whenever the effective KV cache length reaches the budget $B$, yielding an effective length $S_{i}^{\text{opt}}=M+i-n_{i}\cdot K$, where $n_{i}=\max\left(0,\left\lfloor\frac{(M+i)-B}{K}\right\rfloor+1\right)$ denotes the number of cache-eviction events by step $i$. Using Floating-Point Operations (FLOPs) to quantify computational overhead, we establish the following theorem. The detailed proof is provided in Appendix A.
**Theorem 5.1 (Computational Gain)**
*Let $\Delta\mathcal{C}(i)$ be the reduction in FLOPs achieved by DynTS at decoding step $i$. The gain function is the difference between the savings from KV Cache Selection eviction events and the overhead introduced by the predictor:
$$
\Delta\mathcal{C}(i)=\underbrace{n_{i}\cdot 4LdK}_{\text{Eviction Saving}}-\underbrace{(6d^{2}+d)}_{\text{Predictor Overhead}}, \tag{9}
$$*
Based on the formulation above, we derive a critical corollary regarding the net computational gain.
**Corollary 5.2 (Break-even Condition)**
*To achieve a net computational gain ( $\Delta\mathcal{C}(i)>0$ ) at the $n_{i}$ -th eviction event, the eviction volume $K$ must satisfy the following inequality:
$$
K>\frac{6d^{2}+d}{n_{i}\cdot 4Ld}\approx\frac{1.5d}{n_{i}L} \tag{10}
$$*
This inequality provides a theoretical lower bound for the eviction volume $K$, demonstrating that the break-even point is determined by the model’s architecture (hidden dimension $d$ and layer count $L$).
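As a numeric sanity check of Corollary 5.2 (assuming, for illustration, $d=3584$ and $L=28$, a common 7B-scale configuration; this reproduces the threshold of 192 quoted in Section 6.2):

```python
def break_even_K(d, L, n_i=1):
    """Exact lower bound from Eq. 10 and its ~1.5d/(n_i*L) approximation."""
    exact = (6 * d ** 2 + d) / (n_i * 4 * L * d)
    approx = 1.5 * d / (n_i * L)
    return exact, approx

exact, approx = break_even_K(d=3584, L=28)
# With K = 900 evicted tokens per event, K > 192 holds comfortably.
```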
Table 1: Performance comparison of different methods on R1-Llama and R1-Qwen. We report the average Pass@1 and Throughput (TPS) across six benchmarks. “Transformers” denotes the full cache baseline, and “Window” represents the local window baseline.
| Method | AIME24 Pass@1 | TPS | AIME25 Pass@1 | TPS | AMC23 Pass@1 | TPS | GPQA-D Pass@1 | TPS | GK23EN Pass@1 | TPS | MATH500 Pass@1 | TPS | Avg. Pass@1 | TPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Llama | | | | | | | | | | | | | | |
| Transformers | 47.3 | 215.1 | 28.6 | 213.9 | 86.5 | 200.6 | 46.4 | 207.9 | 73.1 | 390.9 | 87.5 | 323.4 | 61.6 | 258.6 |
| Window | 18.6 | 447.9 | 14.6 | 441.3 | 59.5 | 409.4 | 37.6 | 408.8 | 47.0 | 622.6 | 58.1 | 590.5 | 39.2 | 486.7 |
| StreamingLLM | 20.6 | 445.8 | 16.6 | 445.7 | 65.0 | 410.9 | 37.8 | 407.4 | 53.4 | 624.6 | 66.1 | 592.1 | 43.3 | 487.7 |
| SepLLM | 30.0 | 448.2 | 20.0 | 445.1 | 71.0 | 414.1 | 39.7 | 406.6 | 61.4 | 635.0 | 74.5 | 600.4 | 49.4 | 491.6 |
| H2O | 38.6 | 426.2 | 22.6 | 423.4 | 82.5 | 396.1 | 41.6 | 381.5 | 67.5 | 601.8 | 82.7 | 573.4 | 55.9 | 467.1 |
| SnapKV | 39.3 | 438.2 | 24.6 | 436.3 | 80.5 | 406.9 | 41.9 | 394.1 | 68.7 | 615.7 | 83.1 | 584.5 | 56.3 | 479.3 |
| R-KV | 44.0 | 437.4 | 26.0 | 434.7 | 86.5 | 409.5 | 44.5 | 394.9 | 71.4 | 622.6 | 85.2 | 589.2 | 59.6 | 481.4 |
| DynTS (Ours) | 49.3 | 444.6 | 29.3 | 443.5 | 87.0 | 412.9 | 46.3 | 397.6 | 72.3 | 631.8 | 87.2 | 608.2 | 61.9 | 489.8 |
| R1-Qwen | | | | | | | | | | | | | | |
| Transformers | 52.0 | 357.2 | 35.3 | 354.3 | 87.5 | 376.2 | 49.0 | 349.4 | 77.9 | 593.7 | 91.3 | 517.3 | 65.5 | 424.7 |
| Window | 41.3 | 650.4 | 31.3 | 643.0 | 82.0 | 652.3 | 45.9 | 634.1 | 71.8 | 815.2 | 85.0 | 767.0 | 59.5 | 693.7 |
| StreamingLLM | 42.0 | 655.7 | 29.3 | 648.5 | 85.0 | 657.2 | 45.9 | 631.1 | 71.2 | 824.0 | 85.8 | 786.1 | 59.8 | 700.5 |
| SepLLM | 38.6 | 650.0 | 31.3 | 647.6 | 85.5 | 653.2 | 45.6 | 639.5 | 72.0 | 820.1 | 84.4 | 792.2 | 59.6 | 700.4 |
| H2O | 42.6 | 610.9 | 33.3 | 610.7 | 84.5 | 609.9 | 48.1 | 593.6 | 74.1 | 780.1 | 87.0 | 725.4 | 61.6 | 655.1 |
| SnapKV | 48.6 | 639.6 | 33.3 | 633.1 | 87.5 | 633.2 | 46.5 | 622.0 | 74.9 | 787.4 | 88.2 | 768.7 | 63.2 | 680.7 |
| R-KV | 44.0 | 639.5 | 32.6 | 634.7 | 85.0 | 636.8 | 47.2 | 615.1 | 75.8 | 792.8 | 88.8 | 765.5 | 62.2 | 680.7 |
| DynTS (Ours) | 52.0 | 645.6 | 36.6 | 643.0 | 88.5 | 646.0 | 48.1 | 625.7 | 76.4 | 788.5 | 90.0 | 779.5 | 65.3 | 688.1 |
## 6 Experiment
This section introduces the experimental settings, followed by the main results, ablation studies on retained tokens and hyperparameters, and an analysis of the Importance Predictor. For more detailed configurations and additional results, please refer to Appendices C and D.
### 6.1 Experimental Setup
Models and Datasets. We conduct experiments on two mainstream LRMs: R1-Qwen (DeepSeek-R1-Distill-Qwen-7B) and R1-Llama (DeepSeek-R1-Distill-Llama-8B) (Guo et al., 2025). To evaluate the performance and robustness of our method across diverse tasks, we select five mathematical reasoning datasets of varying difficulty: AIME24 (Zhang and Math-AI, 2024), AIME25 (Zhang and Math-AI, 2025), AMC23 (https://huggingface.co/datasets/math-ai/amc23), GK23EN (GAOKAO2023EN; https://huggingface.co/datasets/MARIO-Math-Reasoning/Gaokao2023-Math-En), and MATH500 (Hendrycks et al., 2021), along with the GPQA-D (GPQA-Diamond) (Rein et al., 2024) scientific question-answering dataset as evaluation benchmarks.
Implementation Details. (1) Training Settings: To train the Importance Predictor, we sample model-generated responses with correct answers from the MATH training set and calculate the importance scores of thinking tokens. We freeze the model backbone and optimize only the predictor (a 3-layer MLP), setting the number of training epochs to 15, the learning rate to $5\text{e-}4$, and the maximum sequence length to 18,000. (2) Inference Settings: Following (Guo et al., 2025), we set the maximum decoding steps to 16,384, the sampling temperature to 0.6, top-$p$ to 0.95, and top-$k$ to 20. We apply budget settings based on task difficulty. For challenging benchmarks (AIME24, AIME25, AMC23, and GPQA-D), we set the budget $B$ to 5,000 with a local window size of 2,000; for simpler tasks, the budget is set to 3,000 with a local window of 1,500 for R1-Qwen and 1,000 for R1-Llama. The token retention ratio in the selection window is set to 0.4 for R1-Qwen and 0.3 for R1-Llama. We generate 5 responses for each problem and report the average Pass@1 as the evaluation metric.
Baselines. Our approach focuses on compressing the KV cache by selecting critical tokens. Therefore, we compare our method against the state-of-the-art KV cache compressing approaches. These include StreamingLLM (Xiao et al., 2024), H2O (Zhang et al., 2023), SepLLM (Chen et al., 2024), and SnapKV (Li et al., 2024) (decode-time variant (Liu et al., 2025a)) for LLMs, along with R-KV (Cai et al., 2025) for LRMs. To ensure a fair comparison, all methods were set with the same token overhead and maximum budget. We also report results for standard Transformers and local window methods as evaluation baselines.
### 6.2 Main Results
Reasoning Accuracy. As shown in Table 1, our proposed DynTS consistently outperforms all other KV cache eviction baselines. On R1-Llama and R1-Qwen, DynTS achieves an average accuracy of $61.9\%$ and $65.3\%$, respectively, significantly surpassing the runner-up methods R-KV ($59.6\%$) and SnapKV ($63.2\%$). Notably, the overall reasoning capability of DynTS is on par with the full-cache Transformers baseline ($61.9\%$ vs. $61.6\%$ on R1-Llama, $65.3\%$ vs. $65.5\%$ on R1-Qwen). DynTS even outperforms Transformers on several challenging tasks, such as AIME24 on R1-Llama, where it improves accuracy by $2.0\%$, and AIME25 on R1-Qwen, where it improves accuracy by $1.3\%$.
Table 2: Ablation study on different token retention strategies in DynTS, where w.o. Q / T / L denotes the removal of Question tokens (Q), critical Thinking tokens (T), and Local window tokens (L), respectively. T-Random and T-Bottom represent strategies that select thinking tokens randomly and the tokens with the bottom-k importance scores, respectively.
| Method | AIME24 | AIME25 | AMC23 | GPQA-D | GK23EN | MATH500 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Llama | | | | | | | |
| DynTS | 49.3 | 29.3 | 87.0 | 46.3 | 72.3 | 87.2 | 61.9 |
| w.o. L | 40.6 | 23.3 | 86.5 | 46.3 | 72.0 | 85.5 | 59.0 |
| w.o. Q | 19.3 | 14.6 | 59.0 | 38.1 | 47.8 | 59.8 | 39.8 |
| w.o. T | 44.0 | 27.3 | 85.0 | 44.0 | 71.5 | 85.9 | 59.6 |
| T-Random | 24.6 | 16.0 | 59.5 | 37.4 | 51.7 | 63.9 | 42.2 |
| T-Bottom | 20.6 | 15.3 | 59.0 | 37.3 | 47.3 | 59.5 | 39.8 |
| R1-Qwen | | | | | | | |
| DynTS | 52.0 | 36.6 | 88.5 | 48.1 | 76.4 | 90.0 | 65.3 |
| w.o. L | 42.0 | 32.0 | 87.5 | 46.3 | 75.2 | 87.0 | 61.6 |
| w.o. Q | 46.0 | 36.0 | 86.0 | 43.9 | 75.1 | 89.0 | 62.6 |
| w.o. T | 47.3 | 34.6 | 85.5 | 49.1 | 75.1 | 89.2 | 63.5 |
| T-Random | 46.0 | 32.6 | 84.5 | 47.5 | 73.8 | 86.9 | 61.9 |
| T-Bottom | 38.0 | 30.0 | 80.0 | 44.3 | 69.8 | 83.3 | 57.6 |
Inference Efficiency. Referring to Table 1, DynTS achieves $1.9\times$ and $1.6\times$ speedups over standard Transformers on R1-Llama and R1-Qwen, respectively, across all benchmarks, while maintaining throughput comparable to other KV cache compression methods. As further shown in Figure 5, as the generated sequence length grows, standard Transformers suffer from linear accumulation in both memory footprint and compute overhead (GFLOPs), leading to continuous throughput degradation. In contrast, DynTS effectively bounds resource consumption. The distinctive sawtooth pattern illustrates our periodic compression mechanism, where the inflection points correspond to executions of KV Cache Selection that evict the KV pairs of non-essential thinking tokens. Consequently, the efficiency advantage grows as the decoding step increases: DynTS achieves a peak speedup of $4.51\times$, compresses the memory footprint to $0.19\times$, and reduces the compute overhead to $0.52\times$ compared to the full-cache baseline. The zoom-in view reveals that the computational cost drops below the baseline immediately after the first KV cache eviction. This confirms that our experimental settings satisfy the break-even condition ($K=900>\frac{1.5d}{n_{i}L}=192$) outlined in Corollary 5.2.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Performance Comparison of Transformers vs DynTS Across Decoding Steps
### Overview
The image presents three vertically stacked line charts comparing the performance of two models (Transformers and DynTS) across three metrics: Throughput (TPS), KV Memory (GB), and GFLOPs. Each chart tracks performance as decoding steps increase from 0 to 15k, with vertical dashed lines marking key evaluation points (2k, 5k, 7k, 10k, 12k, 15k). The charts use a dual-color scheme (gray for Transformers, red for DynTS) with shaded confidence intervals.
### Components/Axes
1. **Top Subplot (Throughput)**:
- **Y-axis**: Throughput (TPS) from 0 to 1250
- **X-axis**: Decoding Steps (0 to 15k)
- **Legend**: Top-right corner (gray = Transformers, red = DynTS)
- **Annotations**: Multiplier labels (e.g., "1.55x") on DynTS line at key points
2. **Middle Subplot (KV Memory)**:
- **Y-axis**: KV Memory (GB) from 0 to 40
- **X-axis**: Decoding Steps (0 to 15k)
- **Legend**: Same as top subplot
- **Annotations**: Multiplier labels (e.g., "0.58x") on DynTS line
3. **Bottom Subplot (GFLOPs)**:
- **Y-axis**: GFLOPs from 15 to 35
- **X-axis**: Decoding Steps (0 to 15k)
- **Legend**: Same as above
- **Inset**: Zoomed view of 4500-4900 decoding steps with "1.005x" annotation
### Detailed Analysis
1. **Throughput (TPS)**:
- Transformers (gray) show a steep initial decline, stabilizing near 200 TPS after 2k steps.
- DynTS (red) maintains higher throughput, with multipliers increasing from 1.55x (2k steps) to 4.51x (15k steps).
- Confidence intervals (shaded areas) narrow as decoding steps increase.
2. **KV Memory (GB)**:
- Both models show linear growth, but DynTS consistently uses less memory.
- Multipliers decrease from 0.58x (2k steps) to 0.19x (15k steps), indicating DynTS's memory efficiency improves over time.
3. **GFLOPs**:
- Both models exhibit linear scaling, but DynTS maintains higher computational efficiency.
- Multipliers decrease from 0.87x (2k steps) to 0.52x (15k steps), suggesting diminishing returns for DynTS's efficiency advantage.
- Inset reveals near-parity at 4800 steps (1.005x multiplier).
### Key Observations
1. **Performance Trends**:
- DynTS consistently outperforms Transformers in throughput (x1.55–4.51) and computational efficiency (x0.52–0.87).
- DynTS demonstrates superior memory efficiency (x0.19–0.58), with the gap widening at higher decoding steps.
2. **Anomalies**:
- The GFLOPs multiplier approaches 1.0 at 4800 steps, suggesting potential convergence in computational efficiency at mid-range decoding.
- Throughput confidence intervals for DynTS narrow significantly after 7k steps, indicating stabilized performance.
3. **Spatial Patterns**:
- All subplots share the same x-axis scale, enabling direct comparison of decoding step impacts.
- Vertical dashed lines create visual alignment across subplots for key evaluation points.
### Interpretation
The data demonstrates that DynTS offers a **multiplicative performance advantage** over Transformers across all metrics, with the most significant gains in throughput (up to 4.51x) and memory efficiency (down to 0.19x). While computational efficiency (GFLOPs) shows diminishing returns (0.52x at 15k steps), DynTS maintains a consistent edge. The near-parity at 4800 steps in the GFLOPs inset suggests potential optimization opportunities for mid-range decoding scenarios. The widening performance gap at higher decoding steps implies DynTS scales more effectively for long-context tasks, making it preferable for applications requiring both speed and resource efficiency.
</details>
Figure 5: Real-time throughput, memory, and compute overhead tracking over total decoding step. The inflection points in the sawtooth correspond to the steps where DynTS executes KV Cache Selection.
### 6.3 Ablation Study
Impact of Retained Tokens. As shown in Tab. 2, the full DynTS method outperforms all ablated variants, achieving the highest average accuracy on both R1-Llama ($61.9\%$) and R1-Qwen ($65.3\%$). This demonstrates that all retained tokens of DynTS are critical for the model to output the correct final answer. Moreover, we observe that the strategy for selecting thinking tokens plays a critical role in the model’s reasoning performance. When redundant tokens are retained instead (T-Random and T-Bottom strategies), performance drops significantly, even compared to completely removing thinking tokens ($59.6\%$ on R1-Llama and $63.5\%$ on R1-Qwen). This finding demonstrates the effectiveness of our Importance Predictor in identifying critical tokens. It also explains why existing KV cache compression methods hurt model performance: they inadvertently retain redundant tokens. Finally, the local window is crucial for preserving local linguistic coherence, which contributes to stable model performance.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Heatmap: Pass@1 Performance Comparison of R1-Llama and R1-Qwen Models
### Overview
The image presents a comparative heatmap analysis of two language models (R1-Llama and R1-Qwen) across varying local window sizes (500, 1000, 2000, 3000) and ratio parameters (0.1, 0.2, 0.3, 0.4, 0.5). Pass@1 metrics are visualized using a color gradient from 50 (light yellow) to 56 (dark blue), with numerical values embedded in each cell.
### Components/Axes
- **X-axis (Horizontal)**: Ratio (0.1, 0.2, 0.3, 0.4, 0.5)
- **Y-axis (Vertical)**: Local Window Size (500, 1000, 2000, 3000)
- **Legend**: Vertical colorbar on the right, labeled "Pass@1" with values 50–56
- **Model Labels**:
- Left section: **R1-Llama**
- Right section: **R1-Qwen**
### Detailed Analysis
#### R1-Llama Section
| Window Size | Ratio 0.1 | Ratio 0.2 | Ratio 0.3 | Ratio 0.4 | Ratio 0.5 |
|-------------|-----------|-----------|-----------|-----------|-----------|
| 3000 | 49.1 | 50.1 | 50.6 | 50.7 | 51.4 |
| 2000 | 49.5 | 51.7 | 52.8 | 52.5 | 50.9 |
| 1000 | 49.9 | 52.7 | 51.0 | 51.9 | 51.7 |
| 500 | 49.8 | 52.1 | 50.7 | 50.8 | 51.7 |
#### R1-Qwen Section
| Window Size | Ratio 0.1 | Ratio 0.2 | Ratio 0.3 | Ratio 0.4 | Ratio 0.5 |
|-------------|-----------|-----------|-----------|-----------|-----------|
| 3000 | 53.9 | 53.9 | 53.2 | 54.4 | 53.8 |
| 2000 | 52.4 | 51.9 | 54.6 | 56.3 | 53.7 |
| 1000 | 52.2 | 54.4 | 53.8 | 53.3 | 53.0 |
| 500 | 51.5 | 51.8 | 52.0 | 54.3 | 54.6 |
### Key Observations
1. **R1-Qwen Dominance**: R1-Qwen consistently outperforms R1-Llama across all configurations, with a maximum Pass@1 of **56.3** (2000 window size, 0.4 ratio) vs. R1-Llama's peak of **52.8**.
2. **Ratio Sensitivity**: Both models show improved performance with higher ratios, though R1-Qwen's gains are more pronounced (e.g., 51.5 → 54.6 for 500 window size).
3. **Window Size Tradeoffs**:
- R1-Llama's performance peaks at smaller window sizes (500–1000) but declines at 2000/3000.
- R1-Qwen maintains strong performance across all window sizes, with 2000 window size showing optimal results.
4. **Anomalies**:
- R1-Llama's 3000 window size at 0.1 ratio (49.1) is the lowest value, suggesting a configuration mismatch.
- R1-Qwen's 2000 window size at 0.4 ratio (56.3) stands out as the global maximum.
### Interpretation
The data demonstrates that R1-Qwen exhibits superior scalability and efficiency compared to R1-Llama, particularly in high-ratio scenarios. The heatmap reveals that:
- **R1-Qwen's robustness**: Maintains high Pass@1 across all window sizes, indicating better generalization.
- **R1-Llama's limitations**: Struggles with larger window sizes, possibly due to computational constraints or architectural inefficiencies.
- **Optimal configuration**: For R1-Qwen, the 2000 window size and 0.4 ratio yields the best results, suggesting a balance between context length and parameter utilization.
The color gradient visually reinforces these trends, with darker blues correlating to higher Pass@1 values. The embedded numerical values confirm the heatmap's accuracy, while the spatial arrangement allows direct comparison between models. This analysis highlights R1-Qwen as the more versatile model for applications requiring adaptability across varying input sizes and ratios.
</details>
Figure 6: The accuracy of R1-Llama and R1-Qwen across different local window sizes and selection window retention ratios.
Local Window & Retention Ratio. As shown in Fig. 6, we report the model’s reasoning performance across different configurations. The performance improves with a larger local window and a higher retention ratio within a reasonable range. These two settings respectively ensure local contextual coherence and an adequate number of thinking tokens. Setting either to overly small values leads to pronounced performance degradation. However, excessively large values introduce a higher proportion of non-essential tokens, which in turn negatively impacts model performance. Empirically, a local window size of approximately 2,000 and a retention ratio of 0.3–0.4 yield optimal performance. We further observe that R1-Qwen is particularly sensitive to the local window size. This may be caused by the Dual Chunk Attention introduced during the long-context pre-training stage (Yang et al., 2025), which biases attention toward tokens within the local window.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Model Performance Metrics
### Overview
The image contains two stacked line graphs. The top graph shows two metrics (MSE Loss and Kendall) over 400 steps, while the bottom graph displays multiple overlapping lines representing percentile-based overlap rates. Both graphs share the same x-axis ("Step") but have distinct y-axes.
### Components/Axes
**Top Graph:**
- **X-axis (Step):** 0 to 400 (linear scale)
- **Y-axis (Value):** 0 to 3 (linear scale)
- **Legend:** Top-right corner
- Blue line: MSE Loss
- Orange line: Kendall
**Bottom Graph:**
- **X-axis (Step):** 0 to 400 (linear scale)
- **Y-axis (Overlap Rate %):** 20 to 100 (linear scale)
- **Legend:** Top-right corner
- Purple: Top-20%
- Teal: Top-30%
- Blue: Top-40%
- Green: Top-50%
- Light blue: Top-60%
- Yellow: Top-70%
- Dark blue: Top-80%
- Light green: Top-90%
### Detailed Analysis
**Top Graph Trends:**
1. **MSE Loss (Blue):**
- Starts at ~3.0 at step 0
- Sharp decline to ~0.2 by step 50
- Minor fluctuations between 0.1-0.3 from step 100-400
- Peak value: 3.0 (step 0)
- Final value: ~0.2 (step 400)
2. **Kendall (Orange):**
- Starts at ~0.8 at step 0
- Small spike to ~1.2 at step 25
- Stabilizes at ~0.8-0.9 from step 50-400
- Peak value: 1.2 (step 25)
- Final value: ~0.85 (step 400)
**Bottom Graph Trends:**
- All lines show gradual upward trends until ~step 100, then flatten
- **Top-20% (Purple):**
- Starts at 20% (step 0)
- Rises to ~65% by step 100
- Final value: ~75% (step 400)
- **Top-90% (Light Green):**
- Starts at 90% (step 0)
- Rises to ~98% by step 100
- Final value: ~99% (step 400)
- All lines converge toward similar values by step 400
### Key Observations
1. MSE Loss shows rapid initial improvement, stabilizing at low values
2. Kendall metric remains relatively stable after initial volatility
3. Overlap rates demonstrate consistent improvement across all percentiles
4. Higher percentile lines (Top-70% to Top-90%) maintain >90% overlap throughout
5. All metrics show minimal change after step 200
### Interpretation
The data suggests a model training process with:
- **Rapid initial learning** (evidenced by MSE Loss drop from 3.0 to 0.2 in first 50 steps)
- **Stable performance** in later stages (both metrics plateau after step 100)
- **Consistent coverage improvement** across all performance percentiles
- **Diminishing returns** after step 200, as all metrics stabilize
The convergence of overlap rates toward similar values by step 400 implies the model achieves comparable performance across different evaluation thresholds. The Kendall metric's stability suggests consistent ranking performance, while the MSE Loss indicates successful error minimization. The overlap rate patterns may reflect the model's ability to maintain performance across different confidence intervals or evaluation criteria.
</details>
Figure 7: The top panel illustrates the convergence of the MSE Loss and the Kendall rank correlation coefficient over training steps. The bottom panel tracks the overlap rate of the top-$20\%$ ground-truth tokens within the top-$p\%$ ($p\in[20,90]$) predicted tokens.
Budget. We report the model’s reasoning performance and throughput under different budget settings in Fig. 8. As expected, as the KV budget increases, the accuracy of R1-Llama and R1-Qwen improves while throughput decreases. At the maximum evaluated budget of 5,000, DynTS delivers its strongest reasoning results ($53.0\%$ for R1-Llama and $56.3\%$ for R1-Qwen), minimizing the performance gap with the full-cache baseline.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama vs R1-Qwen Performance Across KV Budgets
### Overview
The image contains a dual-axis bar chart comparing the performance of two models (R1-Llama and R1-Qwen) across five KV Budget thresholds (2500–5000). Two metrics are measured: **Pass@1** (blue bars) and **Throughput (TPS)** (orange lines). The chart is split into two side-by-side panels, one for each model.
---
### Components/Axes
- **X-Axis**: KV Budget (2500, 3000, 3500, 4000, 4500, 5000)
- **Left Y-Axis (Pass@1)**: Scale 30–80 (percentage)
- **Right Y-Axis (Throughput)**: Scale 300–800 (TPS)
- **Legend**:
- Blue = Pass@1
- Orange = Throughput
- **Legend Position**: Top-right corner of the entire chart
- **Model Labels**:
- Left panel: R1-Llama
- Right panel: R1-Qwen
---
### Detailed Analysis
#### R1-Llama Panel
- **Pass@1 (Blue Bars)**:
- 2500 KV: 44.2
- 3000 KV: 50.4
- 3500 KV: 51.0
- 4000 KV: 50.8
- 4500 KV: 49.9
- 5000 KV: 53.0
- **Throughput (Orange Line)**:
- 2500 KV: 780
- 3000 KV: 720
- 3500 KV: 660
- 4000 KV: 550
- 4500 KV: 450
- 5000 KV: 400
#### R1-Qwen Panel
- **Pass@1 (Blue Bars)**:
- 2500 KV: 49.8
- 3000 KV: 52.6
- 3500 KV: 54.1
- 4000 KV: 54.3
- 4500 KV: 54.3
- 5000 KV: 56.3
- **Throughput (Orange Line)**:
- 2500 KV: 760
- 3000 KV: 700
- 3500 KV: 680
- 4000 KV: 650
- 4500 KV: 600
- 5000 KV: 600
---
### Key Observations
1. **Pass@1 Trends**:
- Both models show a **general upward trend** in Pass@1 as KV Budget increases, with minor fluctuations.
- R1-Qwen consistently outperforms R1-Llama across all KV Budgets (e.g., 56.3 vs. 53.0 at 5000 KV).
2. **Throughput Trends**:
- Both models exhibit a **steady decline** in Throughput as KV Budget increases.
- R1-Qwen maintains higher Throughput values than R1-Llama at equivalent KV Budgets (e.g., 600 vs. 400 TPS at 5000 KV).
3. **Trade-off Pattern**:
- Higher KV Budgets improve Pass@1 but reduce Throughput, suggesting a resource allocation trade-off.
- R1-Qwen demonstrates better efficiency, achieving higher Pass@1 with less throughput degradation.
---
### Interpretation
The data reveals a **performance-versus-efficiency trade-off** between the two models. R1-Qwen consistently achieves higher Pass@1 scores while maintaining superior Throughput across all KV Budgets, indicating it is more optimized for both accuracy and resource utilization. The decline in Throughput with increasing KV Budget suggests that larger budgets prioritize accuracy over computational speed. This pattern could reflect differences in model architecture, training data, or inference optimization strategies between the two models.
</details>
Figure 8: Accuracy and throughput across varying KV budgets.
### 6.4 Analysis of Importance Predictor
To validate that the Importance Predictor effectively learns the ground-truth thinking-token importance scores, we report the MSE loss and the Kendall rank correlation coefficient (Abdi, 2007) in the top panel of Fig. 7. Both metrics converge clearly as training proceeds. The decreasing MSE loss shows that the predictor fits the true importance scores, while the Kendall coefficient, which measures the consistency between the ground-truth and predicted rankings, indicates that the predictor successfully captures each thinking token's importance to the answer. Furthermore, we analyze the overlap rate of predicted critical thinking tokens, as shown in the bottom panel of Fig. 7. Notably, at the end of training, the overlap rate of critical tokens within the top $30\%$ of the predicted tokens exceeds $80\%$. This confirms that the Importance Predictor in DynTS effectively identifies the most pivotal tokens, ensuring the retention of essential thinking tokens even at high compression rates.
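For concreteness, the two ranking diagnostics above can be computed from a pair of score vectors as follows. This is a minimal numpy-only sketch: the function names and the pairwise $O(n^2)$ Kendall implementation are illustrative, not part of DynTS's codebase.

```python
import numpy as np

def kendall_tau(a, b):
    """Pairwise O(n^2) Kendall rank correlation between two score
    vectors (assumes no ties, as with continuous importance scores)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, s = len(a), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return s / (n * (n - 1) / 2)

def topk_overlap(true_scores, pred_scores, top_frac=0.3):
    """Fraction of the top-`top_frac` ground-truth tokens that also
    appear among the predictor's top-`top_frac` tokens."""
    k = max(1, int(len(true_scores) * top_frac))
    top_true = set(np.argsort(true_scores)[-k:])
    top_pred = set(np.argsort(pred_scores)[-k:])
    return len(top_true & top_pred) / k
```

A perfect predictor yields `kendall_tau == 1.0` and `topk_overlap == 1.0`; the result reported above corresponds to `topk_overlap(gt, pred, 0.3)` exceeding 0.8 at the end of training.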
## 7 Related Work
Recent work on KV cache compression has primarily targeted classical LLMs, applying eviction strategies based on attention scores or heuristic rules. One line of work addresses long-context pruning at the prefill stage, e.g., SnapKV (Li et al., 2024), PyramidKV (Cai et al., 2024), and AdaKV (Feng et al., 2024). However, these methods are ill-suited to the inference scenarios of LRMs, which feature short prefills followed by long decoding. Several strategies instead target the decoding phase: H2O (Zhang et al., 2023) leverages accumulated attention scores, StreamingLLM (Xiao et al., 2024) retains attention sinks and recent tokens, and SepLLM (Chen et al., 2024) preserves only separator tokens. More recently, targeting LRMs, Cai et al. (2025) introduced R-KV, which adds a similarity-based metric to evict redundant tokens, while RLKV (Du et al., 2025) uses reinforcement learning to retain critical reasoning heads. However, none of these methods accurately assess the contribution of intermediate tokens to the final answer; consequently, they risk erroneously evicting decision-critical tokens and compromising the model's reasoning performance.
## 8 Conclusion and Discussion
In this work, we investigated the relationship between reasoning traces and their final answers in LRMs. Our analysis revealed a Pareto Principle in LRMs: only the decision-critical thinking tokens (roughly $20\%$ of the reasoning trace) steer the model toward the final answer. Building on this insight, we proposed DynTS, a novel KV cache compression method. Departing from current strategies that rely on local attention scores for eviction, DynTS introduces a learnable Importance Predictor to estimate the contribution of the current token to the final answer and, based on the predicted score, retains only the pivotal KV cache entries. Empirical results on six datasets confirm that DynTS outperforms other SOTA baselines. We also discuss the limitations of DynTS and outline potential directions for future improvement; please refer to Appendix E for details.
## Impact Statement
This paper presents work aimed at advancing the field of KV cache compression. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. The primary impact of this research is to improve the memory and computational efficiency of LRM inference. By reducing memory requirements, our method helps lower the barrier to deploying powerful models on resource-constrained edge devices. We believe our work does not introduce specific ethical or societal risks beyond the general considerations inherent to advancing generative AI.
## References
- H. Abdi (2007) The kendall rank correlation coefficient. Encyclopedia of measurement and statistics 2, pp. 508–510. Cited by: §6.4.
- J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) Gqa: training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245. Cited by: §2.
- P. C. Bogdan, U. Macar, N. Nanda, and A. Conmy (2025) Thought anchors: which llm reasoning steps matter?. arXiv preprint arXiv:2506.19143. Cited by: §1, §1, §3.1.
- Z. Cai, W. Xiao, H. Sun, C. Luo, Y. Zhang, K. Wan, Y. Li, Y. Zhou, L. Chang, J. Gu, et al. (2025) R-kv: redundancy-aware kv cache compression for reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Cited by: §3.1, §6.1, §7.
- Z. Cai, Y. Zhang, B. Gao, Y. Liu, Y. Li, T. Liu, K. Lu, W. Xiong, Y. Dong, J. Hu, et al. (2024) Pyramidkv: dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069. Cited by: §3.1, §7.
- G. Chen, H. Shi, J. Li, Y. Gao, X. Ren, Y. Chen, X. Jiang, Z. Li, W. Liu, and C. Huang (2024) Sepllm: accelerate large language models by compressing one segment into one separator. arXiv preprint arXiv:2412.12094. Cited by: §1, §1, §4.2, §6.1, §7.
- Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che (2025) Towards reasoning era: a survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567. Cited by: §1, §2.
- D. Choi, J. Lee, J. Tack, W. Song, S. Dingliwal, S. M. Jayanthi, B. Ganesh, J. Shin, A. Galstyan, and S. B. Bodapati (2025) Think clearly: improving reasoning via redundant token pruning. arXiv preprint arXiv:2507.08806 4. Cited by: §1.
- G. DeepMind (2025) A new era of intelligence with gemini 3. Note: https://blog.google/products/gemini/gemini-3/#gemini-3-deep-think Cited by: §1.
- A. Devoto, Y. Zhao, S. Scardapane, and P. Minervini (2024) A simple and effective $L_2$ norm-based strategy for kv cache compression. arXiv preprint arXiv:2406.11430. Cited by: §1.
- W. Du, L. Jiang, K. Tao, X. Liu, and H. Wang (2025) Which heads matter for reasoning? rl-guided kv cache compression. arXiv preprint arXiv:2510.08525. Cited by: §7.
- S. Feng, G. Fang, X. Ma, and X. Wang (2025) Efficient reasoning models: a survey. arXiv preprint arXiv:2504.10903. Cited by: §1.
- Y. Feng, J. Lv, Y. Cao, X. Xie, and S. K. Zhou (2024) Ada-kv: optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550. Cited by: §7.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §B.1, §1, §6.1, §6.1.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: §1, §6.1.
- W. Huang, Z. Zhai, Y. Shen, S. Cao, F. Zhao, X. Xu, Z. Ye, Y. Hu, and S. Lin (2024) Dynamic-llava: efficient multimodal large language models via dynamic vision-language context sparsification. arXiv preprint arXiv:2412.00876. Cited by: §4.1.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pp. 611–626. Cited by: §B.1.
- Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen (2024) Snapkv: llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, pp. 22947–22970. Cited by: §1, §3.1, §6.1, §7.
- M. Liu, A. Palnitkar, T. Rabbani, H. Jae, K. R. Sang, D. Yao, S. Shabihi, F. Zhao, T. Li, C. Zhang, et al. (2025a) Hold onto that thought: assessing kv cache compression on reasoning. arXiv preprint arXiv:2512.12008. Cited by: §6.1.
- Y. Liu, J. Fu, S. Liu, Y. Zou, S. Zhang, and J. Zhou (2025b) KV cache compression for inference efficiency in llms: a review. In Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing, pp. 207–212. Cited by: §1.
- G. Minegishi, H. Furuta, T. Kojima, Y. Iwasawa, and Y. Matsuo (2025) Topology of reasoning: understanding large reasoning models through reasoning graph properties. arXiv preprint arXiv:2506.05744. Cited by: §1.
- OpenAI (2025) OpenAI. External Links: Link Cited by: §1.
- R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean (2023) Efficiently scaling transformer inference. Proceedings of machine learning and systems 5, pp. 606–624. Cited by: §2.
- Z. Qin, Y. Cao, M. Lin, W. Hu, S. Fan, K. Cheng, W. Lin, and J. Li (2025) Cake: cascading and adaptive kv cache eviction with layer preferences. arXiv preprint arXiv:2503.12491. Cited by: §1.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, Cited by: §6.1.
- L. Shi, H. Zhang, Y. Yao, Z. Li, and H. Zhao (2024) Keep the cost down: a review on methods to optimize llm’s kv-cache consumption. arXiv preprint arXiv:2407.18003. Cited by: §1.
- C. Singh, J. P. Inala, M. Galley, R. Caruana, and J. Gao (2024) Rethinking interpretability in the era of large language models. arXiv preprint arXiv:2402.01761. Cited by: §3.1.
- Y. Sui, Y. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, et al. (2025) Stop overthinking: a survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419. Cited by: §2.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §2, §2, §4.1.
- Z. Wang and A. C. Bovik (2009) Mean squared error: love it or leave it? a new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117. External Links: Document Cited by: §4.1.
- G. Wei, X. Zhou, P. Sun, T. Zhang, and Y. Wen (2025) Rethinking key-value cache compression techniques for large language model serving. Proceedings of Machine Learning and Systems 7. Cited by: §1.
- S. Wiegreffe and Y. Pinter (2019) Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 11–20. External Links: Link, Document Cited by: §1.
- G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024) Efficient streaming language models with attention sinks. Cited by: §C.3, §1, §6.1, §7.
- F. Xu, Q. Hao, Z. Zong, J. Wang, Y. Zhang, J. Wang, X. Lan, J. Gong, T. Ouyang, F. Meng, et al. (2025) Towards large reasoning models: a survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686. Cited by: §2.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §6.3.
- D. Zhang, Z. Li, M. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, X. Chen, Y. Zhang, et al. (2025a) From system 1 to system 2: a survey of reasoning large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
- Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, W. Hua, H. Wu, Z. Guo, Y. Wang, N. Muennighoff, et al. (2025b) A survey on test-time scaling in large language models: what, how, where, and how well?. arXiv preprint arXiv:2503.24235. Cited by: §1.
- Y. Zhang and T. Math-AI (2024) American invitational mathematics examination (aime) 2024. Cited by: §6.1.
- Y. Zhang and T. Math-AI (2025) American invitational mathematics examination (aime) 2025. Cited by: §6.1.
- Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710. Cited by: §1, §1, §3.1, §4.2, §6.1, §7.
- J. Zhao, Z. Fang, S. Li, S. Yang, and S. He (2024) Buzz: beehive-structured sparse kv cache with segmented heavy hitters for efficient llm inference. arXiv preprint arXiv:2410.23079. Cited by: §4.2.
## Appendix A Proof of Cumulative Computational Gain
**Definition A.1 (Predictor Overhead)**
*Let $\mathcal{M}_{\text{base}}$ be the vanilla LRM with $L$ layers and a hidden dimension $d$ , and $\mathcal{M}_{\text{opt}}$ be the LRM with Importance Predictor. The predictor is defined as a three-layer linear MLP with dimensions $d\to m_{1}\to m_{2}\to m_{3}$ . The computational cost per decode step is:
$$
\mathcal{C}_{\text{mlp}}=2(d\cdot m_{1}+m_{1}\cdot m_{2}+m_{2}\cdot m_{3}). \tag{11}
$$
Setting $m_{1}=2d$ , $m_{2}=d/2$ , and $m_{3}=1$ yields:
$$
\mathcal{C}_{\text{mlp}}=6d^{2}+d \tag{12}
$$*
**Definition A.2 (Effective KV Cache Length)**
*Let $M$ denote the length of the prefill sequence. At decode step $i$ , when the effective cache length reaches the budget $B$ , DynTS performs KV Cache Selection to evict $K$ redundant tokens. The effective KV cache length $S_{i}$ for the base and optimized models is given by:
$$
S_{i}^{\text{base}}=M+i,\quad S_{i}^{\text{opt}}=M+i-n_{i}\cdot K, \tag{13}
$$
where $n_{i}=\max\left(0,\left\lfloor\frac{(M+i)-B}{K}\right\rfloor+1\right)$ denotes the cumulative count of cache eviction events up to step $i$ .*
**Definition A.3 (LLM Overhead)**
*The computational overhead per step for a decoder-only transformer is composed of a static component $\mathcal{C}_{\text{static}}$ (independent of sequence length) and a dynamic attention component $\mathcal{C}_{\text{attn}}$ (linearly dependent on effective cache length). The static cost $\mathcal{C}_{\text{static}}$ for the backbone remains identical for both models. The self-attention cost for a single layer is $4\cdot d\cdot S_{i}$ (counting $Q\cdot K^{\top}$ and $\text{Softmax}\cdot V$ ). Across $L$ layers:
$$
\mathcal{C}_{\text{attn}}(S_{i})=4\cdot L\cdot d\cdot S_{i} \tag{14}
$$*
*Proof (Computational Gain).*
Let $\Delta\mathcal{C}(i)$ be the reduction in FLOPs achieved by DynTS at decoding step $i$ , which is defined as $\text{FLOPs}(\mathcal{M}_{\text{base}}(i))-\text{FLOPs}(\mathcal{M}_{\text{opt}}(i))$ :
$$
\Delta\mathcal{C}(i)=\left[\mathcal{C}_{\text{static}}(i)+\mathcal{C}_{\text{attn}}(S_{i}^{\text{base}})\right]-\left[\mathcal{C}_{\text{static}}(i)+\mathcal{C}_{\text{mlp}}+\mathcal{C}_{\text{attn}}(S_{i}^{\text{opt}})\right]. \tag{15}
$$
Eliminating the static term $\mathcal{C}_{\text{static}}$:
$$
\Delta\mathcal{C}(i)=\mathcal{C}_{\text{attn}}(S_{i}^{\text{base}})-\mathcal{C}_{\text{attn}}(S_{i}^{\text{opt}})-\mathcal{C}_{\text{mlp}}. \tag{16}
$$
Substituting $S_{i}^{\text{base}}$ , $S_{i}^{\text{opt}}$ and $\mathcal{C}_{\text{mlp}}$ :
$$
\begin{aligned}
\Delta\mathcal{C}(i)&=4\cdot L\cdot d\cdot(M+i)-4\cdot L\cdot d\cdot(M+i-n_{i}\cdot K)-\mathcal{C}_{\text{mlp}}\\
&=\underbrace{n_{i}\cdot 4LdK}_{\text{Eviction Saving}}-\underbrace{(6d^{2}+d)}_{\text{Predictor Overhead}}. \tag{17}
\end{aligned}
$$
This completes the proof. $\square$
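To make the trade-off in Eq. 17 concrete, the per-step gain can be evaluated numerically. The sketch below implements the cost model verbatim; the sizes used in the comments are illustrative placeholders, not DynTS's actual configuration.

```python
def delta_flops(i, M, B, K, L, d):
    """Per-step FLOPs reduction of DynTS over the vanilla model (Eq. 17)."""
    # n_i: cumulative number of eviction events by decode step i (Eq. 13)
    n_i = max(0, (M + i - B) // K + 1)
    c_mlp = 6 * d * d + d            # predictor overhead (Eq. 12)
    saving = n_i * 4 * L * d * K     # eviction saving (Eq. 17)
    return saving - c_mlp

# Before the cache first reaches the budget B, the gain is just the
# (negative) predictor overhead; once evictions accumulate, the saving
# term grows with n_i and eventually dominates.
# Illustrative sizes: d = 4096, L = 32, B = 4096, K = 128.
```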
## Appendix B Empirical Analysis and Observations
### B.1 Implementation Details
To calculate the importance of each thinking token to the final answer sequence, we first utilized vLLM (Kwon et al., 2023) to generate the complete reasoning trace. Following (Guo et al., 2025), we set the temperature to 0.6, top-p to 0.95, top-k to 20, and the maximum length to 16,384. To ensure sequence completeness, we sampled 5 times for each question and filtered out samples with incomplete reasoning traces. We then fed the full sequence into the model for a single forward pass to extract the attention-weight submatrices corresponding to the answer and thinking tokens. Finally, we aggregated these matrices across all layers and heads and summed along the answer dimension (rows); the resulting 1-D vector gives the importance score of each thinking token.
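The aggregation described above can be sketched as follows. This assumes the per-layer attention maps have already been extracted from the forward pass; the function name and tensor layout are illustrative.

```python
import numpy as np

def thinking_token_importance(attn_maps, think_slice, answer_slice):
    """Aggregate attention from answer tokens to thinking tokens into a
    1-D importance vector.

    attn_maps: list of per-layer arrays of shape (heads, seq, seq),
               where attn[h, q, k] is the weight of query q on key k.
    """
    importance = None
    for attn in attn_maps:
        # Submatrix of answer rows attending to thinking columns,
        # summed over heads and over the answer dimension (rows).
        sub = attn[:, answer_slice, think_slice].sum(axis=(0, 1))
        importance = sub if importance is None else importance + sub
    return importance
```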
Based on the calculated importance scores, we employ three selection strategies to retain critical thinking tokens: top-$p\%$, bottom-$p\%$, and random sampling, where $p\in\{2,4,6,8,10,20,30,40,50\}$. The retained tokens are concatenated with the original question to form the input sequence, which is processed by vLLM over 5 independent runs using the aforementioned configuration. We report the average Pass@1 across these runs as the final accuracy.
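A minimal sketch of the three selection strategies; the function name and interface are ours for illustration.

```python
import random

def select_thinking_tokens(scores, p, strategy="top"):
    """Return the indices of p% of thinking tokens to retain, under the
    'top', 'bottom', or 'random' strategy described above."""
    n = len(scores)
    k = max(1, n * p // 100)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    if strategy == "top":
        kept = order[:k]           # highest-importance tokens
    elif strategy == "bottom":
        kept = order[-k:]          # lowest-importance tokens
    elif strategy == "random":
        kept = random.sample(range(n), k)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(kept)            # preserve original token order
```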
### B.2 Ratio of Content Words
To investigate the distinctions among thinking tokens with varying importance scores, we employed spaCy to analyze the Part-of-Speech (POS) tag of each token. Specifically, we heuristically categorized nouns, verbs, adjectives, adverbs, and proper nouns as Content Words carrying substantive meaning, while treating other POS tags as Function Words with limited semantic information. The thinking tokens were sorted by importance score and then partitioned into ten equal parts. We report the ratio of Content Words and Function Words within each part in Fig. 9. Tokens with higher importance scores exhibit a significantly higher proportion of content words, suggesting that they encode the core semantic meaning. Conversely, tokens with lower scores are predominantly function words, which primarily serve as syntactic scaffolding or intermediate states maintaining sequence coherence. Consequently, once the full sentence is generated, removing these low-importance tokens has a negligible impact on overall comprehension.
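The decile bucketing behind this analysis can be sketched as below. POS tagging itself is delegated to spaCy in our setup; here the tags are taken as input so the bucketing logic stays self-contained (names are illustrative).

```python
# Content vs. function word ratio per importance decile.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

def content_ratio_by_decile(pos_tags, scores, n_bins=10):
    """pos_tags[i], scores[i]: POS tag and importance score of token i.
    Returns the content-word ratio (%) per importance decile,
    ordered from least to most important."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size, ratios = len(order) // n_bins, []
    for b in range(n_bins):
        # Last bin absorbs any remainder when len(order) % n_bins != 0.
        chunk = order[b * size:(b + 1) * size] if b < n_bins - 1 else order[b * size:]
        content = sum(pos_tags[i] in CONTENT_POS for i in chunk)
        ratios.append(100.0 * content / len(chunk))
    return ratios
```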
<details>
<summary>x11.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | AIME24
Horizontal stacked bars showing the ratio (%) of Content Words (red) vs. Function Words (gray) within each importance-score decile of thinking tokens, from the least important (90-100%) up to the most important (Top-10%). Legend top-right; function-word ratio = 100% − content-word ratio.
- **Content Words (Red)**: 90-100%: 31.0%; 80-90%: 30.8%; 70-80%: 30.7%; 60-70%: 30.8%; 50-60%: 31.1%; 40-50%: 32.2%; 30-40%: 34.1%; 20-30%: 36.9%; 10-20%: 40.3%; Top-10%: 45.4%.
- The content-word ratio rises steadily toward the most important deciles, while function words dominate the low-importance deciles, consistent with the analysis in §B.2.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | AIME25
Horizontal stacked bars showing the ratio (%) of Content Words (red) vs. Function Words (gray, hatched) per importance-score decile, from the least important (90-100%) up to the most important (Top-10%). Legend top-right; function-word ratio = 100% − content-word ratio.
- **Content Words (Red)**: rise monotonically from 29.3% (90-100%) to 44.3% (Top-10%); Function Words fall correspondingly from 70.7% to 55.7%.
- Function words dominate every decile on this benchmark, but the most important tokens (Top-10%) carry the highest share of content words.
</details>
<details>
<summary>x13.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | AMC23
Horizontal stacked bars showing the ratio (%) of Content Words (red) vs. Function Words (gray, hatched) per importance-score decile, from the least important (90-100%) up to the most important (Top-10%). Function-word ratio = 100% − content-word ratio.
- **Content Words (Red)**: 90-100%: 25.7%; 80-90%: 27.4%; 70-80%: 28.7%; 60-70%: 29.1%; 50-60%: 29.6%; 40-50%: 30.6%; 30-40%: 32.3%; 20-30%: 35.4%; 10-20%: 39.1%; Top-10%: 45.3%.
- The content-word ratio climbs steadily from 25.7% to 45.3% toward the most important decile, mirroring the trend on the other benchmarks.
</details>
<details>
<summary>x14.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | GPQA-D
Horizontal stacked bars showing the ratio (%) of Content Words (red) vs. Function Words (gray, hatched) per importance-score decile, from the least important (90-100%) up to the most important (Top-10%). Function-word ratio = 100% − content-word ratio.
- **Content Words (Red)**: 90-100%: 36.3%; 80-90%: 37.1%; 70-80%: 38.4%; 60-70%: 39.1%; 50-60%: 39.6%; 40-50%: 40.8%; 30-40%: 42.9%; 20-30%: 45.7%; 10-20%: 49.6%; Top-10%: 56.6%.
- Content words rise steadily with token importance, with the steepest jump between the 10-20% decile (49.6%) and the Top-10% decile (56.6%).
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | GK23EN
Horizontal stacked bars showing the ratio (%) of Content Words (red) vs. Function Words (gray) per importance-score decile, from the least important (90-100%) up to the most important (Top-10%); content-word percentages are labeled on the bars, and function-word ratio = 100% − content-word ratio.
- **Content Words (Red)**: 90-100%: 26.8%; 80-90%: 28.1%; 70-80%: 30.1%; 60-70%: 31.4%; 50-60%: 32.7%; 40-50%: 34.2%; 30-40%: 36.8%; 20-30%: 39.3%; 10-20%: 42.0%; Top-10%: 46.0%.
- The content-word ratio increases monotonically from 26.8% to 46.0% as token importance grows, with function words dominating the low-importance deciles.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | MATH500
Horizontal stacked bars showing the ratio (%) of Content Words (red) vs. Function Words (gray) per importance-score decile, from the least important (90-100%) up to the most important (Top-10%). Function-word ratio = 100% − content-word ratio.
- **Content Words (Red)**: 90-100%: 25.7%; 80-90%: 27.2%; 70-80%: 29.1%; 60-70%: 30.2%; 50-60%: 30.8%; 40-50%: 32.0%; 30-40%: 34.1%; 20-30%: 37.0%; 10-20%: 40.1%; Top-10%: 45.8%.
- The content-word ratio rises monotonically from 25.7% to 45.8% toward the most important decile, matching the pattern on the other R1-Llama benchmarks.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Bar Chart: R1-Qwen | AIME24
### Overview
The image is a horizontal bar chart comparing the distribution of "Content Words" (red) and "Function Words" (gray) across 10 percentile categories (90-100% to Top-10%). The x-axis represents the ratio (%) of each word type, while the y-axis lists the percentile ranges. The chart is sourced from AIME24 and labeled "R1-Qwen."
### Components/Axes
- **Title**: "R1-Qwen | AIME24" (top center).
- **X-Axis**: "Ratio (%)" with a scale from 0 to 100.
- **Y-Axis**: Percentile categories (90-100%, 80-90%, ..., Top-10%) listed in descending order.
- **Legend**:
- Red: Content Words (top-right).
- Gray (diagonal lines): Function Words (top-right).
- **Bars**:
- Red bars (Content Words) grow in length from 90-100% to 10-20%, then shorten slightly at Top-10%.
- Gray bars (Function Words) shorten across the categories, with the largest drop between 90-100% and 80-90%.
### Detailed Analysis
- **Content Words (Red)**:
- 90-100%: 19.0%
- 80-90%: 24.7%
- 70-80%: 29.1%
- 60-70%: 31.0%
- 50-60%: 32.8%
- 40-50%: 33.9%
- 30-40%: 35.6%
- 20-30%: 37.5%
- 10-20%: 39.4%
- Top-10%: 38.2%
- **Function Words (Gray)**:
- 90-100%: 81.0% (100% - 19.0%)
- 80-90%: 75.3% (100% - 24.7%)
- 70-80%: 70.9% (100% - 29.1%)
- 60-70%: 69.0% (100% - 31.0%)
- 50-60%: 67.2% (100% - 32.8%)
- 40-50%: 66.1% (100% - 33.9%)
- 30-40%: 64.4% (100% - 35.6%)
- 20-30%: 62.5% (100% - 37.5%)
- 10-20%: 60.6% (100% - 39.4%)
- Top-10%: 61.8% (100% - 38.2%)
### Key Observations
1. **Content Words** show a consistent upward trend from 90-100% (19.0%) to 10-20% (39.4%), peaking in the 10-20% range.
2. **Function Words** decrease monotonically across all categories, with the steepest drop between 90-100% and 80-90%.
3. The **Top-10%** category shows a slight dip in Content Words (38.2%) relative to the 10-20% range (39.4%), yet both far exceed the 90-100% category (19.0%), indicating a shift toward substantive content among the most important tokens.
4. The red and gray bars are inversely proportional, as their percentages sum to 100% in each category.
### Interpretation
The data suggests that as token importance increases (approaching the `Top-10%` most important tokens), the proportion of **Content Words** rises while **Function Words** fall. This implies that decision-critical tokens are dominated by substantive content rather than functional language. The slight dip in Content Words in the Top-10% (38.2% vs. 39.4% in 10-20%) may indicate that a small set of structurally pivotal function words ranks among the very most important tokens. Overall, the chart shows a clear inverse relationship: functional language is more prevalent among less important tokens.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Horizontal Bar Chart: R1-Qwen | AIME25
### Overview
The chart visualizes the distribution of **Content Words** (red) and **Function Words** (gray with diagonal stripes) across importance deciles (90-100% to Top-10%). Each bar shows the percentage of Content Words within a specific decile. Function Words are implied as the complementary portion but are not explicitly labeled with numerical values.
### Components/Axes
- **Title**: "R1-Qwen | AIME25" (top center).
- **X-Axis**: Labeled "Ratio (%)" with a scale from 0 to 100.
- **Y-Axis**: Categories ordered from top to bottom:
`90-100%`, `80-90%`, `70-80%`, `60-70%`, `50-60%`, `40-50%`, `30-40%`, `20-30%`, `10-20%`, `Top-10%`.
- **Legend**: Located on the right, with:
- **Red**: Content Words.
- **Gray (diagonal stripes)**: Function Words.
### Detailed Analysis
- **Content Words (Red Bars)**:
- `90-100%`: 20.9%
- `80-90%`: 26.2%
- `70-80%`: 29.7%
- `60-70%`: 31.3%
- `50-60%`: 32.2%
- `40-50%`: 33.5%
- `30-40%`: 34.9%
- `20-30%`: 35.9%
- `10-20%`: 37.5%
- `Top-10%`: 37.5%
- **Function Words (Gray Bars)**:
- No explicit numerical values are provided in the image. However, the gray bars visually represent the remaining percentage (100% - Content Words) for each category. For example:
- `90-100%`: ~79.1% (100% - 20.9%).
- `Top-10%`: ~62.5% (100% - 37.5%).
### Key Observations
1. **Content Words Trend**:
- Content Words increase monotonically as percentile ranges decrease (e.g., 20.9% in `90-100%` to 37.5% in `Top-10%`).
- The highest ratios occur in the lowest percentile categories (`10-20%` and `Top-10%`).
2. **Function Words Inference**:
- Function Words decrease as Content Words increase, suggesting an inverse relationship between the two categories.
3. **Symmetry in Top Categories**:
- `10-20%` and `Top-10%` share identical Content Word ratios (37.5%), indicating a potential threshold or saturation point.
### Interpretation
The data suggests that the most important tokens (e.g., `Top-10%`) are characterized by a greater proportion of **Content Words** relative to **Function Words**. This could imply:
- **Semantic Richness**: Low-importance deciles (e.g., `90-100%`) are dominated by functional language (e.g., prepositions, conjunctions), while high-importance tokens concentrate substantive content.
- **Importance Correlation**: The inverse relationship suggests that content density correlates with a token's influence on the final answer in the AIME25 benchmark.
- **Threshold Effect**: The plateau at 37.5% in the top two deciles (`10-20%` and `Top-10%`) could indicate diminishing returns or a cap on content-word share beyond this point.
The chart highlights a clear trend in which Content Words become more prevalent among more important tokens, potentially guiding optimizations for reasoning-trace compression in similar benchmarks.
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description
## Bar Chart: R1-Qwen | AMC23
### Overview
The chart compares the distribution of **Content Words** (red) and **Function Words** (gray with diagonal stripes) across importance deciles (90-100% to Top-10%). Each bar represents the ratio (%) of these word types within a specific decile.
### Components/Axes
- **Title**: "R1-Qwen | AMC23" (top-center).
- **Legend**:
- Red: Content Words.
- Gray (diagonal stripes): Function Words.
- **Y-Axis**: Importance deciles labeled as percentage ranges (90-100%, 80-90%, ..., Top-10%), listed top to bottom.
- **X-Axis**: "Ratio (%)" with a scale from 0 to 100.
- **Bars**: Horizontal bars for each category, positioned side-by-side under their respective performance tier.
### Detailed Analysis
- **Content Words (Red)**:
- 90-100%: 21.1%
- 80-90%: 24.6%
- 70-80%: 27.7%
- 60-70%: 30.1%
- 50-60%: 31.9%
- 40-50%: 33.5%
- 30-40%: 35.8%
- 20-30%: 37.5%
- 10-20%: 39.2%
- Top-10%: 39.6%
- **Function Words (Gray)**:
- 90-100%: 78.9% (100% - 21.1%)
- 80-90%: 75.4% (100% - 24.6%)
- 70-80%: 72.3% (100% - 27.7%)
- 60-70%: 69.9% (100% - 30.1%)
- 50-60%: 68.1% (100% - 31.9%)
- 40-50%: 66.5% (100% - 33.5%)
- 30-40%: 64.2% (100% - 35.8%)
- 20-30%: 62.5% (100% - 37.5%)
- 10-20%: 60.8% (100% - 39.2%)
- Top-10%: 60.4% (100% - 39.6%)
### Key Observations
1. **Inverse Relationship**: Moving toward the more important deciles (down the y-axis), the ratio of **Content Words** increases while **Function Words** decrease.
2. **Dominance of Function Words**: Function Words occupy the majority (>60%) in every decile, with the highest share in the least important decile (78.9% at 90-100%).
3. **Content Words Growth**: Content Words rise steadily from 21.1% (90-100%) to 39.6% (Top-10%), indicating that important tokens skew toward content-focused language.
### Interpretation
The data implies that low-importance tokens rely heavily on **Function Words** (e.g., prepositions, conjunctions) that provide grammatical structure, while the most important tokens (Top-10%) carry a greater share of **Content Words** (e.g., nouns, verbs) reflecting substantive content. The consistent dominance of Function Words across all deciles underscores their structural role in language, even as Content Words gain prominence among decision-critical tokens.
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
## Bar Chart: R1-Qwen | GPQA-D
### Overview
The chart compares the distribution of **Content Words** (red) and **Function Words** (gray with diagonal stripes) across importance deciles (90-100% to Top-10%). Each bar represents the percentage of words of each type within a specific decile, with values explicitly labeled on the bars.
### Components/Axes
- **X-axis**: "Ratio (%)" (0–100), representing the percentage of words in each frequency bin.
- **Y-axis**: Frequency ranges (90-100%, 80-90%, ..., Top-10%), ordered from highest to lowest frequency.
- **Legend**:
- Red: Content Words
- Gray (diagonal stripes): Function Words
- **Labels**:
- Chart title: "R1-Qwen | GPQA-D"
- Axis titles: "Ratio (%)" (x-axis), frequency ranges (y-axis)
- Data labels: Percentages on each bar (e.g., "30.4%" for Content Words in 90-100%).
### Detailed Analysis
- **Content Words (Red)**:
- 90-100%: 30.4%
- 80-90%: 36.4%
- 70-80%: 39.8%
- 60-70%: 42.0%
- 50-60%: 43.6%
- 40-50%: 44.7%
- 30-40%: 46.1%
- 20-30%: 46.9%
- 10-20%: 48.2%
- Top-10%: 47.9%
- **Function Words (Gray)**:
- Calculated as 100% minus Content Words for each bin (e.g., 69.6% in 90-100%, 63.6% in 80-90%).
### Key Observations
1. **Content Words** consistently increase with token importance, peaking at 48.2% in the 10-20% decile.
2. **Function Words** dominate the least important deciles (e.g., 69.6% at 90-100%) and decline toward the most important.
3. The **Top-10%** decile shows a slight dip in Content Words (47.9%) relative to the 10-20% decile (48.2%).
### Interpretation
The data suggests that the most important tokens (e.g., 10-20%, Top-10%) are more likely to be **Content Words** (semantically meaningful terms), while the least important tokens (e.g., 90-100%) are predominantly **Function Words** (grammatical/syntactic elements). This is consistent with the intuition that decision-critical tokens carry meaning while low-importance tokens play structural roles. The slight drop in Content Words in the Top-10% decile may reflect a handful of highly important function words or a natural tail-off. The chart highlights a clear inverse relationship between a token's importance rank and the function-to-content word ratio.
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
## Horizontal Bar Chart: R1-Qwen | GK23EN
### Overview
The chart compares the distribution of **Content Words** (red) and **Function Words** (gray) across importance deciles (90-100% to Top-10%). Each bar represents the percentage ratio of these word types within a specific decile. The x-axis shows the ratio (%), while the y-axis lists the deciles.
### Components/Axes
- **X-Axis**: Ratio (%) from 0 to 100, labeled "Ratio (%)".
- **Y-Axis**: Importance deciles (90-100%, 80-90%, ..., Top-10%), listed from the least important decile down to the most important.
- **Legend**: Located in the top-right corner, with red representing **Content Words** and gray representing **Function Words**.
- **Bars**: Horizontal bars for each category, with red bars (Content Words) consistently shorter than gray bars (Function Words) in all frequency ranges.
### Detailed Analysis
- **Content Words (Red)**:
- **90-100%**: 23.1%
- **80-90%**: 26.2%
- **70-80%**: 29.1%
- **60-70%**: 31.6%
- **50-60%**: 33.6%
- **40-50%**: 35.6%
- **30-40%**: 37.4%
- **20-30%**: 39.2%
- **10-20%**: 40.0%
- **Top-10%**: 38.6%
- **Function Words (Gray)**:
- Calculated as 100% minus Content Words for each category (e.g., 90-100%: 76.9%, 80-90%: 73.8%, etc.).
### Key Observations
1. **Content Words** peak at **40.0%** in the **10-20%** decile, then dip slightly to **38.6%** at **Top-10%**.
2. **Function Words** remain the majority everywhere, falling from **76.9%** in the least important decile (90-100%) to **61.4%** at Top-10%.
3. The **20-30%** and **10-20%** deciles show the highest proportions of Content Words, while the **90-100%** decile has the lowest (**23.1%**).
### Interpretation
The data suggests that **Function Words** (e.g., prepositions, conjunctions) dominate the less important tokens, while **Content Words** (e.g., nouns, verbs) become more prevalent among highly ranked tokens. The slight dip at **Top-10%** may indicate that a few structurally pivotal function words rank among the most important tokens. Overall, the chart shows a trade-off between token importance and word class, with functional language concentrated in the low-importance deciles.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Bar Chart: R1-Qwen | MATH500
### Overview
The chart compares the ratio of Content Words and Function Words across importance deciles (90-100% to Top-10%) for the R1-Qwen model on the MATH500 dataset. Content Words are represented by red bars, while Function Words use gray diagonal stripes. The x-axis shows percentages (0-100%), and the y-axis lists the deciles.
### Components/Axes
- **X-Axis**: Ratio (%) (0 to 100, linear scale)
- **Y-Axis**: Importance deciles (90-100%, 80-90%, ..., Top-10%)
- **Legend**:
- Red: Content Words
- Gray (diagonal stripes): Function Words
- **Data Labels**: Percentages embedded in red bars (e.g., "20.6%", "39.8%")
### Detailed Analysis
1. **Content Words (Red Bars)**:
- **90-100%**: 20.6%
- **80-90%**: 24.5%
- **70-80%**: 27.5%
- **60-70%**: 29.8%
- **50-60%**: 31.9%
- **40-50%**: 33.9%
- **30-40%**: 36.0%
- **20-30%**: 37.8%
- **10-20%**: 38.9%
- **Top-10%**: 39.8%
2. **Function Words (Gray Bars)**:
- Calculated as 100% minus Content Words for each tier:
- **90-100%**: 79.4%
- **80-90%**: 75.5%
- **70-80%**: 72.5%
- **60-70%**: 70.2%
- **50-60%**: 68.1%
- **40-50%**: 66.1%
- **30-40%**: 64.0%
- **20-30%**: 62.2%
- **10-20%**: 61.1%
- **Top-10%**: 60.2%
### Key Observations
- **Inverse Relationship**: Content Words increase monotonically (20.6% → 39.8%) from the least to the most important decile, while Function Words decrease correspondingly (79.4% → 60.2%).
- **Steepest Gradient**: The largest shift occurs between the 90-100% and 80-90% deciles (Content Words +3.9%, Function Words -3.9%).
- **Top-10% Peak**: Content Words reach their maximum (39.8%) among the most important tokens, suggesting a correlation between content-focused language and token importance.
### Interpretation
The data implies that the most important thinking tokens (Top-10%) rely heavily on Content Words, which may reflect domain-specific terms and substantive problem-solving steps. Function Words dominate the less important deciles, indicating generic or structural language use. This trend suggests that the tokens steering R1-Qwen's mathematical reasoning are content-dense, while the bulk of the trace consists of low-importance functional phrasing. The consistent inverse relationship across all deciles underscores that importance concentrates in content-rich tokens.
</details>
Figure 9: Proportion of content words versus function words in thinking tokens. Bars represent deciles sorted by importance score; e.g., the bottom bar indicates the ratio for the top-10% most important tokens.
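The decile analysis summarized in Figure 9 can be sketched in a few lines. This is a minimal illustration, not the paper's released code: the tiny function-word stoplist and the rank-based decile split are simplifying assumptions, and a real analysis would use a full POS tagger or stopword corpus.

```python
# Illustrative sketch: content-word ratio per importance decile.
# FUNCTION_WORDS is a deliberately small, hypothetical stoplist.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "to", "in", "and", "or", "but", "is", "are",
    "was", "were", "be", "on", "at", "by", "for", "with", "that", "this",
    "it", "as", "if", "so", "then", "we", "i", "you", "he", "she", "they",
}

def content_word_ratio_by_decile(tokens, scores, n_bins=10):
    """Rank tokens by importance score (descending) and report, per decile,
    the fraction that are content words (i.e., not in the stoplist)."""
    ranked = [t for _, t in sorted(zip(scores, tokens), reverse=True)]
    bin_size = max(1, len(ranked) // n_bins)
    ratios = []
    for i in range(n_bins):
        chunk = ranked[i * bin_size:(i + 1) * bin_size]
        if not chunk:
            break
        content = sum(1 for t in chunk if t.lower() not in FUNCTION_WORDS)
        ratios.append(content / len(chunk))
    return ratios  # ratios[0] covers the top-10% most important tokens
```

Under this scheme, the trend in Figure 9 would appear as `ratios[0]` (most important tokens) exceeding the later entries.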
<details>
<summary>x23.png Details</summary>

### Visual Description
## Line Chart: Importance Score Distribution Across Reasoning Steps
### Overview
The image displays a line chart divided into two distinct sections: a "Question" segment on the left and a "Thinking" segment on the right. The chart visualizes the distribution of "Importance Scores" across sequential "Reasoning Steps," with a red dashed line indicating a calculated mean score and ratio. The y-axis uses a gradient from light blue (low) to dark blue (high) to represent importance scores, while the x-axis spans from 0 to 7000 reasoning steps.
---
### Components/Axes
- **X-Axis (Reasoning Step)**:
- Labeled "Reasoning Step" with numerical markers at 0, 25, 50, 1000, 2000, 3000, 4000, 5000, 6000, and 7000.
- The "Question" segment spans 0–50 steps, while the "Thinking" segment spans 0–7000 steps.
- **Y-Axis (Importance Score)**:
- Labeled "Importance Score" with a gradient from light blue (low) to dark blue (high).
- A horizontal red dashed line labeled "Mean Score: 0.228; Ratio: 0.239" spans the entire chart.
- **Legend**:
- Located on the left, with a single entry: "Importance Score" (blue gradient).
- The red dashed line is explicitly labeled as the mean score and ratio.
---
### Detailed Analysis
#### "Question" Segment (0–50 Steps)
- **Structure**: A vertical bar chart with 50 bars, each representing a reasoning step.
- **Trend**:
- Bars vary in height, with some reaching near the top of the y-axis (high importance) and others near the bottom (low importance).
- No consistent pattern; distribution appears irregular.
- **Key Data Points**:
- Highest bars (dark blue) occur at steps ~10, 20, 30, and 40.
- Lowest bars (light blue) occur at steps ~5, 15, 25, and 45.
#### "Thinking" Segment (0–7000 Steps)
- **Structure**: A line chart with blue vertical lines (spikes) over a white background.
- **Trend**:
- Spikes are irregular, with some reaching the top of the y-axis (high importance) and others near the bottom (low importance).
- The red dashed line (mean score) remains consistently at ~0.228 across all steps.
- **Key Data Points**:
- Notable spikes occur at steps ~1000, 3000, 5000, and 7000.
- The mean score (0.228) is significantly lower than the peak values of individual spikes.
---
### Key Observations
1. **Divergent Distributions**:
- The "Question" segment shows a more concentrated distribution of importance scores, while the "Thinking" segment exhibits sporadic, high-intensity spikes.
2. **Mean Score Context**:
- The red dashed line (mean score: 0.228) is much lower than the peak values in both segments, suggesting most reasoning steps have low importance.
3. **Ratio Interpretation**:
- The ratio (0.239) is slightly higher than the mean score; it may denote the proportion of steps whose importance exceeds a threshold (e.g., the mean), though the metric is not explicitly defined in the chart.
---
### Interpretation
- **Data Implications**:
- The chart likely represents the importance of reasoning steps in a decision-making or problem-solving process. The "Question" segment may reflect initial, structured steps, while the "Thinking" segment captures exploratory or iterative reasoning.
- The low mean score (0.228) suggests that most steps are not critical, but the presence of spikes indicates moments of high cognitive demand or significance.
- **Anomalies**:
- The "Thinking" segment’s spikes at 1000, 3000, 5000, and 7000 steps may correspond to pivotal moments in the reasoning process (e.g., hypothesis testing, conclusion drawing).
- The "Question" segment’s irregular distribution implies variability in the importance of early steps, possibly due to contextual factors or question complexity.
- **Technical Relevance**:
- The chart could be used to optimize reasoning algorithms by identifying and prioritizing high-importance steps. The mean and ratio provide a baseline for evaluating overall efficiency.
---
### Spatial Grounding
- **Legend**: Positioned on the left, aligned with the y-axis.
- **Red Dashed Line**: Horizontally centered across the chart, spanning both segments.
- **Axis Labels**: Clearly labeled at the top (x-axis) and left (y-axis), with numerical markers for reference.
---
### Content Details
- **Textual Elements**:
- "Question" (top-left of the left segment).
- "Thinking" (top-center of the right segment).
- "Mean Score: 0.228; Ratio: 0.239" (red dashed line label).
- **Numerical Values**:
- X-axis: 0, 25, 50, 1000, 2000, 3000, 4000, 5000, 6000, 7000.
- Y-axis: Gradient from light blue (low) to dark blue (high).
---
### Final Notes
The chart provides a visual representation of reasoning step importance, highlighting both structured and exploratory phases. The mean score and ratio offer quantitative insights, while the irregular distributions suggest the need for further analysis to identify patterns or optimize processes.
</details>
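The mean score and ratio annotated in these importance-distribution charts can be reproduced with a short computation. This is a hedged sketch under the assumption that "Ratio" denotes the fraction of thinking steps whose importance exceeds the mean score; the figures do not define the metric explicitly.

```python
import statistics

def thinking_importance_stats(scores):
    """Mean per-step importance and the fraction of steps above the mean.

    Assumption: 'Ratio' in the charts is read as the share of thinking
    steps whose importance exceeds the mean score; the source figures
    do not define the metric explicitly.
    """
    mean = statistics.fmean(scores)
    ratio = sum(s > mean for s in scores) / len(scores)
    return mean, ratio
```

For a trace dominated by low-importance steps with sparse spikes, this yields a low mean with a ratio well under 0.5, matching the pattern the charts describe.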
<details>
<summary>x24.png Details</summary>

### Visual Description
## Line Chart: Importance Score Across Reasoning Steps
### Overview
The image is a line chart divided into two sections: "Question" (left) and "Thinking" (right). It visualizes the distribution of "Importance Scores" across "Reasoning Steps" (x-axis) with a vertical scale from "Low" to "High" (y-axis). A red dashed line labeled "Mean Score: 0.138; Ratio: 0.224" spans the "Thinking" section. The chart uses blue lines for data points and a legend in the top-left corner.
---
### Components/Axes
- **Y-Axis (Left)**:
- Label: "Importance Score"
- Scale: Gradient from light blue (Low) to dark blue (High).
- Position: Left edge of the chart.
- **X-Axis (Bottom)**:
- Label: "Reasoning Step"
- Scale: Linear from 0 to 12,000.
- Position: Bottom edge of the chart.
- **Legend**:
- Located in the top-left corner.
- Blue color corresponds to "Importance Score."
- **Title**:
- "Question" (left section) and "Thinking" (right section) are labeled at the top.
- **Red Dashed Line**:
- Label: "Mean Score: 0.138; Ratio: 0.224"
- Position: Horizontal across the "Thinking" section.
---
### Detailed Analysis
#### "Question" Section (Left)
- **Data**:
- A single vertical blue line dominates the leftmost part of the chart.
- The line peaks sharply at the start (x ≈ 0) and tapers off toward x ≈ 100.
- No other data points are visible in this section.
#### "Thinking" Section (Right)
- **Data**:
- Multiple blue spikes appear across the x-axis (0–12,000).
- Spikes are irregular, with varying heights and frequencies.
- Most spikes are below the red dashed line (mean score), but some exceed it.
- Notable clusters of spikes occur near x ≈ 4,000, 8,000, and 12,000.
- **Red Dashed Line**:
- Horizontal line at y ≈ 0.138 (mean score).
- Ratio of 0.224 is annotated in red text near the line.
---
### Key Observations
1. **High Importance in "Question" Phase**:
- The "Question" section shows a single, extreme spike, indicating a critical importance score at the start.
2. **Variable Importance in "Thinking" Phase**:
- The "Thinking" section has scattered spikes, suggesting inconsistent importance scores.
- The red dashed line (mean score: 0.138) acts as a reference, with most spikes below it.
3. **Ratio Interpretation**:
- The ratio of 0.224 may represent the proportion of "Thinking" steps with scores above the mean.
4. **Outliers**:
- Spikes near x ≈ 12,000 in the "Thinking" section are the highest in that phase.
---
### Interpretation
The chart suggests that the initial "Question" phase is highly impactful, with a single critical step dominating the importance score. In contrast, the "Thinking" phase exhibits lower average importance (mean score: 0.138) but includes sporadic high-value steps (e.g., near x ≈ 12,000). The ratio of 0.224 implies that only a small fraction of "Thinking" steps exceed the mean score, highlighting inefficiencies or variability in reasoning. The red dashed line serves as a benchmark, emphasizing that most "Thinking" steps are less impactful than the initial question. This could indicate a need to optimize reasoning processes to elevate the significance of later steps.
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Chart: Importance Score Across Reasoning Steps
### Overview
The image displays a line chart divided into two sections: "Question" (left) and "Thinking" (right). The chart visualizes the distribution of "Importance Score" across sequential "Reasoning Steps." The y-axis represents "Importance Score" with a gradient from light blue (low) to dark blue (high), while the x-axis spans "Reasoning Step" from 0 to 7000. A red dashed line labeled "Mean Score: 0.208; Ratio: 0.281" is overlaid on the "Thinking" section.
---
### Components/Axes
- **Y-Axis (Importance Score)**:
- Gradient scale from "Low" (light blue) to "High" (dark blue).
- No explicit numerical values, but the red dashed line provides a reference at ~0.208.
- **X-Axis (Reasoning Step)**:
- Linear scale from 0 to 7000.
- Divided into two sub-regions:
- **Question**: 0–100 (vertical bar chart).
- **Thinking**: 100–7000 (line chart).
- **Legend**:
- Implied by the y-axis gradient (no explicit legend box).
- Red dashed line represents the mean score and ratio.
---
### Detailed Analysis
#### Question Section (0–100)
- **Structure**: Vertical bars with varying heights.
- **Trend**:
- Bars are densely packed, with no clear pattern.
- Heights fluctuate, suggesting inconsistent importance scores.
- **Key Data**:
- No explicit numerical values, but the y-axis gradient indicates scores range from low to high.
#### Thinking Section (100–7000)
- **Structure**: Line chart with sharp spikes and troughs.
- **Trend**:
- Spikes (dark blue) indicate moments of high importance.
- Troughs (light blue) suggest periods of low importance.
- The red dashed line (mean score: 0.208) is consistently below most spikes, indicating most steps have low importance.
- **Key Data**:
- **Mean Score**: 0.208 (low, suggesting most steps are not critical).
- **Ratio**: 0.281 (possibly the proportion of steps with high importance, but context is unclear).
---
### Key Observations
1. **Question Section**:
- No clear trend; importance scores vary unpredictably.
- Limited to 100 steps, suggesting a focused or initial phase of reasoning.
2. **Thinking Section**:
- High variability in importance scores, with frequent spikes.
- The mean score (0.208) is significantly lower than the peak values, indicating most steps are not critical.
- The ratio (0.281) may imply that ~28% of steps have high importance, but this is speculative without further context.
---
### Interpretation
- **Data Implications**:
- The "Thinking" phase shows that while most reasoning steps are low in importance, there are critical moments (spikes) where importance surges. This could reflect the model's focus on specific insights or decisions.
- The low mean score (0.208) suggests that the majority of reasoning steps are not pivotal, which might indicate inefficiency or redundancy in the process.
- **Relationships**:
- The "Question" section (0–100) appears to be a precursor to the "Thinking" phase, with no direct correlation between the two.
- The red dashed line in the "Thinking" section acts as a benchmark, highlighting the disparity between average and peak importance.
- **Anomalies**:
- The abrupt transition from the "Question" to "Thinking" sections (100 steps) may indicate a shift in reasoning complexity or scope.
- The lack of a clear pattern in the "Question" section contrasts with the structured spikes in the "Thinking" phase, suggesting different cognitive processes.
---
### Notes on Data Extraction
- **Uncertainty**:
- Numerical values (e.g., mean score, ratio) are explicitly labeled but lack units or context for interpretation.
- The y-axis gradient is qualitative, making precise score comparisons difficult.
- **Missing Elements**:
- No explicit legend or colorbar for the y-axis gradient.
- No explanation of the "Ratio" metric (0.281) or its calculation method.
---
### Conclusion
The chart illustrates the dynamic nature of importance scores during reasoning, with the "Thinking" phase dominated by sporadic high-importance steps. The low mean score underscores the prevalence of low-importance steps, raising questions about the efficiency of the reasoning process. Further analysis would require clarification on the metrics and their implications.
</details>
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Chart: Importance Scores Across Reasoning Steps
### Overview
The image is a line chart divided into two sections: "Question" (left) and "Thinking" (right). It visualizes the distribution of "Importance Scores" across "Reasoning Steps" (x-axis) and "Importance Score" (y-axis). A red dashed line labeled with a "Mean Score" and "Ratio" is overlaid on the "Thinking" section.
### Components/Axes
- **Y-Axis (Left)**: "Importance Score" with a gradient from "Low" (light blue) to "High" (dark blue).
- **X-Axis (Right)**: "Reasoning Step" ranging from 0 to 14,000.
- **Legend**: A vertical color bar on the left, labeled "Importance Score" with a gradient from light blue (low) to dark blue (high).
- **Red Dashed Line**: Horizontal line in the "Thinking" section, labeled "Mean Score: 0.170; Ratio: 0.237".
### Detailed Analysis
- **Question Section (Left)**:
- A vertical line with dense, high-intensity spikes (dark blue) concentrated near the top of the y-axis.
- No numerical values or labels for individual data points.
- The x-axis for this section is not explicitly labeled but appears to span 0–200 reasoning steps.
- **Thinking Section (Right)**:
- A horizontal line with sparse, irregular spikes (dark blue) distributed across the x-axis (0–14,000).
- The red dashed line spans the entire "Thinking" section, indicating a mean importance score of **0.170** and a ratio of **0.237**.
- The ratio may denote the proportion of thinking steps whose importance exceeds the mean, though the metric is not labeled explicitly.
### Key Observations
1. **High Importance in "Question"**: The "Question" section shows a concentrated cluster of high importance scores, suggesting critical focus on the initial question phase.
2. **Variable Importance in "Thinking"**: The "Thinking" section has scattered, low-intensity spikes, indicating inconsistent importance across reasoning steps.
3. **Mean and Ratio**: The low mean score of 0.170 indicates that most thinking steps carry little importance; the ratio of 0.237 may denote the fraction of steps scoring above the mean.
### Interpretation
The chart implies that importance concentrates in the "Question" phase and in a sparse subset of "Thinking" steps. The dense cluster of high importance scores in the "Question" section suggests that the initial query strongly drives the model's output. In contrast, the "Thinking" phase exhibits lower and more variable importance: with a mean of only 0.170, most reasoning steps fall far below the peaks, and the ratio of 0.237 plausibly indicates that roughly a quarter of steps exceed the mean. This disparity suggests that only a minority of thinking tokens are decision-critical, while the remainder contribute little.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Chart: Question vs Thinking Importance Scores
### Overview
The image displays a dual-axis line chart comparing importance scores across reasoning steps. The left section ("Question") shows a vertical bar chart with importance scores, while the right section ("Thinking") presents a line plot with importance scores over reasoning steps. A red dashed horizontal line indicates a calculated mean score and ratio.
### Components/Axes
- **X-Axis (Horizontal):**
- Labeled "Reasoning Step"
- Scale: 0 to 5000 (discrete intervals)
- Subdivisions: Marked at 0, 1000, 2000, 3000, 4000, 5000
- **Y-Axis (Vertical):**
- Labeled "Importance Score"
- Scale: 0 (Low, white) to High (blue gradient)
- Color legend: Blue gradient from white (low) to dark blue (high)
- **Legend:**
- Position: Left side of the chart
- Indicates importance score gradient (no explicit labels beyond color)
- **Red Dashed Line:**
- Position: Horizontal across the entire chart
- Label: "Mean Score: 0.325; Ratio: 0.257" (centered text)
### Detailed Analysis
1. **Question Section (Left):**
- Vertical bars represent importance scores for discrete reasoning steps (0–40).
- Bars are densely packed, with most reaching the "High" end of the importance scale.
- No explicit numerical values, but visual density suggests high consistency in importance.
2. **Thinking Section (Right):**
- Line plot spans reasoning steps 0–5000.
- Importance scores fluctuate significantly, with sporadic spikes (e.g., ~1000, 2000, 4000 steps).
- Most scores cluster near the baseline (low importance), with occasional high peaks.
- Red dashed line intersects at:
- **Mean Score:** 0.325 (low on the importance scale)
- **Ratio:** 0.257 (likely the Thinking phase's mean importance relative to the Question phase's)
### Key Observations
- **High Importance in Question Phase:** The left section shows concentrated high-importance scores, indicating critical reasoning steps during the "Question" phase.
- **Volatile Importance in Thinking Phase:** The right section exhibits erratic spikes, suggesting sporadic high-importance moments during reasoning.
- **Mean vs. Ratio:** The mean score (0.325) is low, but the ratio (0.257) implies the Thinking phase's importance is a fraction of the Question phase's peak values.
- **Outliers:** Sharp spikes at ~1000, 2000, and 4000 steps in the Thinking phase may represent pivotal reasoning moments.
### Interpretation
The chart highlights a stark contrast between the structured, high-importance "Question" phase and the variable "Thinking" phase. The red dashed line contextualizes the Thinking phase's average importance (0.325) as a small fraction (25.7%) of the Question phase's peak values. This suggests that while the Question phase drives critical reasoning, the Thinking phase is characterized by intermittent bursts of insight rather than sustained importance. The ratio (0.257) quantifies this disparity, emphasizing the need for further analysis into why the Thinking phase exhibits such volatility despite lower average scores.
</details>
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: Question vs. Thinking Importance Scores
### Overview
The image displays a line chart comparing importance scores across two phases: "Question" (left) and "Thinking" (right). The x-axis represents "Reasoning Step" (0–7000), and the y-axis represents "Importance Score" (Low to High). A red dashed line labeled "Mean Score: 0.252; Ratio: 0.223" spans the "Thinking" phase, indicating a baseline metric.
### Components/Axes
- **X-Axis (Reasoning Step)**:
- Left ("Question"): 0–100 (discrete steps).
- Right ("Thinking"): 0–7000 (continuous scale).
- **Y-Axis (Importance Score)**:
- Gradient from blue (Low) to white (High).
- **Legend**:
- Positioned on the left, but no explicit legend labels are visible. Blue lines represent data points.
- **Red Dashed Line**:
- Horizontal line at y ≈ 0.252, labeled with mean score and ratio.
### Detailed Analysis
- **Question Phase (0–100)**:
- Vertical spikes dominate, with importance scores reaching the "High" range (near white).
- Approximately 20–30 peaks per 100 steps, suggesting frequent high-importance moments.
- **Thinking Phase (0–7000)**:
- Sporadic spikes, mostly below the red dashed line (y ≈ 0.252).
- Approximately 5–10 significant peaks per 1000 steps, with most values clustered near the baseline.
- **Red Dashed Line**:
- Mean score of 0.252 (low importance) and ratio of 0.223 (22.3% of "Question" phase peaks).
### Key Observations
1. **High Importance in "Question" Phase**:
- Peaks frequently exceed 0.8 on the importance scale, indicating critical reasoning steps.
2. **Low Baseline in "Thinking" Phase**:
- Most steps cluster near the 0.252 mean, with only rare spikes reaching higher values.
3. **Ratio Discrepancy**:
- The 0.223 ratio suggests the "Thinking" phase's importance is a fraction of the "Question" phase's peaks, despite the latter's lower frequency.
### Interpretation
The data implies that the "Question" phase contains concentrated, high-impact reasoning moments, while the "Thinking" phase is characterized by diffuse, lower-importance steps. The red dashed line acts as a reference, highlighting that even the mean importance in "Thinking" is minimal compared to the "Question" phase. This could reflect a cognitive process where problem identification (Question) drives critical decisions, whereas subsequent analysis (Thinking) involves less decisive, exploratory steps. The ratio of 0.223 underscores a potential inefficiency or divergence in importance distribution between phases.
</details>
Figure 10: Visualization of token importance scores for question and thinking tokens within DeepSeek-R1-Distill-Llama-8B reasoning traces across six samples from AIME24, AIME25, AMC23, GPQA-D, GAOKAO2023EN, and MATH500.
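The per-sample statistics annotated in these figures (a mean thinking-token importance and a thinking-to-question ratio) can be reproduced from raw per-token scores. The sketch below is illustrative only: the score arrays are made up, and the ratio definition (mean thinking importance divided by mean question importance) is our reading of the figures' annotation rather than a confirmed detail of the source.

```python
def importance_summary(question_scores, thinking_scores):
    """Mean thinking-token importance and its ratio to the question tokens.

    The "Ratio" annotation in the figures is read here as mean thinking
    importance divided by mean question importance; all score values are
    hypothetical.
    """
    mean_q = sum(question_scores) / len(question_scores)
    mean_t = sum(thinking_scores) / len(thinking_scores)
    return mean_t, mean_t / mean_q

# Toy trace: question tokens score uniformly high, thinking tokens are
# mostly low with one decision-critical spike.
q_scores = [0.9, 0.8, 0.85, 0.95]
t_scores = [0.1, 0.2, 0.9, 0.15, 0.1]
mean_t, ratio = importance_summary(q_scores, t_scores)
```

With these toy values the thinking mean lands well below the question mean, mirroring the low ratios (0.1–0.26) annotated across the six samples.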
<details>
<summary>x29.png Details</summary>

### Visual Description
## Line Chart: Importance Score Across Reasoning Steps
### Overview
The image is a line chart divided into two sections: "Question" (left) and "Thinking" (right). It visualizes the distribution of "Importance Score" across "Reasoning Steps" (0–12,000). A red dashed line labeled "Mean Score: 0.416; Ratio: 0.201" spans the chart horizontally. The y-axis ranges from "Low" to "High" importance scores, with blue vertical lines representing data points.
### Components/Axes
- **X-axis (Reasoning Step)**: Labeled "Reasoning Step" with a scale from 0 to 12,000.
- **Y-axis (Importance Score)**: Labeled "Importance Score" with a gradient from "Low" (bottom) to "High" (top).
- **Legend**: Located in the top-left corner, with a blue color representing "Importance Score."
- **Red Dashed Line**: Horizontal line labeled "Mean Score: 0.416; Ratio: 0.201," positioned at approximately 40% of the y-axis height.
### Detailed Analysis
- **Question Section (Left)**:
- Dense cluster of vertical blue lines concentrated near the left edge (0–500 reasoning steps).
- Scores drop sharply after 500 steps, with minimal activity beyond this point.
- **Thinking Section (Right)**:
- Sparse, scattered blue lines distributed across the entire x-axis (0–12,000).
- No clear clustering; scores vary widely, with some peaks and troughs.
- **Red Dashed Line**:
- Positioned at ~40% of the y-axis, indicating a moderate mean importance score.
- The ratio "0.201" likely expresses the "Thinking" phase's mean importance relative to the "Question" phase's.
### Key Observations
1. **Initial Focus on "Question"**: High importance scores are concentrated in the early reasoning steps (0–500), suggesting critical analysis occurs early.
2. **Divergence in "Thinking"**: Later steps (500–12,000) show lower and more variable importance scores, indicating less consistent or exploratory reasoning.
3. **Mean and Ratio**: The mean score of 0.416 (mid-range) and ratio of 0.201 imply that the "Thinking" phase carries only about a fifth of the "Question" phase's importance.
### Interpretation
The chart highlights a stark contrast between the "Question" and "Thinking" phases. The early "Question" phase is characterized by high, concentrated importance scores, likely reflecting structured or critical analysis. In contrast, the "Thinking" phase exhibits lower and more dispersed scores, suggesting a shift toward exploratory or less focused reasoning. The mean score of 0.416 and ratio of 0.201 indicate that while the average importance within the thinking trace is moderate, the phase as a whole carries only about 20% of the question phase's importance, pointing to potential inefficiencies or variability in later reasoning stages. This could imply that the model prioritizes initial problem framing over sustained, nuanced analysis.
</details>
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Chart: Importance Scores Across Question and Reasoning Steps
### Overview
The image displays a line chart comparing importance scores across two phases: "Question" (left) and "Thinking" (right). The y-axis represents "Importance Score" (0–1), while the x-axis tracks "Reasoning Step" (0–12,000). A red dashed line labeled "Mean Score: 0.346; Ratio: 0.205" spans the "Thinking" phase, dividing the chart visually.
---
### Components/Axes
- **Y-Axis (Left)**: "Importance Score" (0 = Low, 1 = High), with a gradient from light to dark blue.
- **X-Axis (Top)**: "Reasoning Step" (0–12,000), split into two segments:
- **Left Segment**: "Question" phase (0–100 steps).
- **Right Segment**: "Thinking" phase (0–12,000 steps).
- **Legend**: Positioned on the far left, with a blue gradient labeled "Importance Score."
- **Red Dashed Line**: Horizontal line at y=0.346, labeled with mean score and ratio.
---
### Detailed Analysis
#### Question Phase (0–100 steps)
- **Trend**:
- High initial scores (dark blue spikes) at steps 0–20.
- Sharp drop to low scores (light blue) around step 50.
- Gradual recovery to moderate scores (mid-blue) by step 100.
- **Key Data Points**:
- Peak at step 0: ~0.8.
- Trough at step 50: ~0.1.
- Recovery to ~0.4 by step 100.
#### Thinking Phase (0–12,000 steps)
- **Trend**:
- Initial moderate scores (~0.3–0.5) in steps 0–2,000.
- Fluctuating scores with sporadic spikes (up to ~0.7) and troughs (~0.1) throughout.
- Final sharp increase to high scores (~0.9–1.0) in the last 500 steps (11,500–12,000).
- **Key Data Points**:
- Mean score: 0.346 (red dashed line).
- Ratio: 0.205 (thinking-phase importance relative to the question phase).
- Final spike at step 12,000: ~1.0.
---
### Key Observations
1. **Question Phase**: Importance scores are volatile, with a clear dip at step 50 suggesting a critical transition point.
2. **Thinking Phase**:
- Scores are highly variable, with no consistent upward/downward trend.
- Final 500 steps show a dramatic rise, possibly indicating a resolution phase.
3. **Mean vs. Ratio**:
- The red dashed line marks the mean score (0.346), which sits low on the axis, suggesting most steps cluster around lower importance.
- The ratio (0.205) implies the "Thinking" phase's average importance is roughly a fifth of the "Question" phase's, highlighting a significant disparity.
---
### Interpretation
- **Phase Dynamics**: The "Question" phase exhibits structured volatility, while the "Thinking" phase reflects exploratory reasoning with sporadic insights (spikes) and uncertainty (troughs).
- **Final Spike**: The abrupt rise in the last steps may represent a critical insight or conclusion, contrasting with earlier variability.
- **Mean and Ratio**: The low mean score (0.346) and low ratio (0.205) suggest that most reasoning steps are relatively unimportant, with only a minority driving significant outcomes.
- **Implications**: The chart underscores the non-linear nature of reasoning, where critical steps (e.g., final insights) dominate despite earlier variability. The "Question" phase’s dip at step 50 might indicate a shift from initial assumptions to deeper analysis.
</details>
<details>
<summary>x31.png Details</summary>

### Visual Description
## Heatmap: Importance Scores Across Reasoning Steps
### Overview
The image is a heatmap visualizing the distribution of "Importance Scores" across "Reasoning Steps" in two distinct phases: "Question" (left) and "Thinking" (right). The y-axis represents "Importance Score" (ranging from "Low" to "High"), while the x-axis represents "Reasoning Step" (0 to 8000). A red dashed line labeled with "Mean Score: 0.475; Ratio: 0.232" spans the "Thinking" section, indicating a statistical summary of the data.
---
### Components/Axes
- **Y-Axis (Importance Score)**:
- Labeled "Importance Score" with a gradient from "Low" (bottom) to "High" (top).
- No explicit numerical scale, but the red dashed line suggests a normalized range (likely 0–1).
- **X-Axis (Reasoning Step)**:
- Labeled "Reasoning Step" with numerical markers at 0, 20, 40, 2000, 4000, 6000, and 8000.
- Divided into two regions: "Question" (0–40) and "Thinking" (40–8000).
- **Legend**:
- Not explicitly labeled, but the blue gradient implies a continuous scale of importance scores.
- The red dashed line represents a statistical summary (mean and ratio).
---
### Detailed Analysis
#### "Question" Section (0–40 Reasoning Steps)
- **Visual Trend**:
- High concentration of dark blue (high importance scores) in the first 20 steps, followed by a sharp drop in importance after step 20.
- The "Question" phase shows a bimodal distribution, with two distinct peaks in importance.
- **Key Data Points**:
- Steps 0–10: High importance (dark blue).
- Steps 10–20: Moderate importance (lighter blue).
- Steps 20–40: Low importance (white/empty).
#### "Thinking" Section (40–8000 Reasoning Steps)
- **Visual Trend**:
- Importance scores are highly variable, with sporadic peaks and troughs.
- A gradual decline in overall importance from step 40 to 8000, with occasional spikes (e.g., around 2000, 4000, and 6000).
- **Statistical Summary**:
- **Mean Score**: 0.475 (moderate importance).
- **Ratio**: 0.232 (likely the thinking phase's mean importance relative to the question phase's).
- **Notable Features**:
- The red dashed line (mean score) is positioned near the middle of the y-axis, indicating a balanced distribution of scores.
- The "Thinking" phase shows no clear pattern, suggesting randomness or complexity in importance allocation.
---
### Key Observations
1. **Concentration in "Question" Phase**:
- The first 20 steps (Question) dominate in importance, with a sharp decline afterward.
2. **Variability in "Thinking" Phase**:
- The "Thinking" phase lacks a consistent pattern, with importance scores fluctuating unpredictably.
3. **Statistical Summary**:
- The mean score (0.475) and ratio (0.232) suggest that only a small fraction of steps in the "Thinking" phase are highly important.
4. **Red Dashed Line**:
- Acts as a reference for average importance, highlighting the disparity between the "Question" and "Thinking" phases.
---
### Interpretation
The data suggests that the "Question" phase is critical, with a concentrated allocation of importance scores in the initial steps. This implies that the problem's core elements are addressed early, while the "Thinking" phase involves exploratory or iterative reasoning with less predictable importance. The mean score of 0.475 and ratio of 0.232 indicate that while some steps in the "Thinking" phase are highly important, the majority are not, reflecting a sparse or fragmented reasoning process. The absence of a clear trend in the "Thinking" phase may point to inefficiencies or the need for refinement in later stages of reasoning.
</details>
<details>
<summary>x32.png Details</summary>

### Visual Description
## Line Chart: Importance Score Across Reasoning Steps
### Overview
The image displays a line chart divided into two sections: "Question" (left) and "Thinking" (right). The chart visualizes the distribution of "Importance Score" across sequential "Reasoning Steps," with a red dashed line indicating a calculated mean score and ratio. The y-axis uses a gradient from blue (low) to white (high) to represent importance scores.
---
### Components/Axes
- **X-Axis (Reasoning Step)**:
- Labeled "Reasoning Step" with numerical markers at 0, 500, 1000, 1500, ..., 14,000.
- Divided into two regions by vertical dashed lines at **100** (end of "Question") and **14,000** (end of "Thinking").
- **Y-Axis (Importance Score)**:
- Labeled "Importance Score" with a gradient scale from **Low** (blue) to **High** (white).
- No explicit numerical scale, but the red dashed line is positioned at **0.347**.
- **Legend**:
- Located on the left, with a single entry: **Blue** (representing all data points).
- **Annotations**:
- Red dashed line labeled **"Mean Score: 0.347; Ratio: 0.226"** centered horizontally.
---
### Detailed Analysis
1. **"Question" Section (0–100 Reasoning Steps)**:
- High density of vertical blue lines, indicating frequent reasoning steps.
- Importance scores cluster near the top of the y-axis (high values), with occasional dips to mid-range scores.
- No clear trend, but variability suggests dynamic importance during this phase.
2. **"Thinking" Section (100–14,000 Reasoning Steps)**:
- Lower density of blue lines compared to the "Question" section.
- Importance scores are more dispersed, with many steps near the bottom of the y-axis (low values).
- A notable spike in high-importance steps occurs near the end (12,000–14,000), suggesting a concentrated effort or critical insight.
3. **Red Dashed Line**:
- Positioned at **y = 0.347**, representing the mean importance score.
- The ratio **0.226** likely indicates the "Thinking" phase's mean importance relative to the "Question" phase's (22.6%).
---
### Key Observations
- **High Importance in "Question" Phase**: The initial steps (0–100) show a higher concentration of critical reasoning, with most scores near the top of the y-axis.
- **Declining Importance in "Thinking" Phase**: The majority of steps in the "Thinking" phase have low importance scores; the phase's average importance is only about 22.6% of the question phase's.
- **Late-Stage Spike**: A sudden increase in high-importance steps near the end of the "Thinking" phase (12,000–14,000) may indicate a breakthrough or final synthesis.
---
### Interpretation
- **Efficiency of Reasoning**: The low mean score (0.347) suggests that most reasoning steps are not highly impactful, potentially indicating inefficiencies in the process.
- **Phase Dynamics**: The "Question" phase is characterized by high-impact steps, while the "Thinking" phase is dominated by lower-impact steps, with a late-stage surge possibly reflecting problem resolution.
- **Ratio Significance**: The ratio of 0.226 implies that only a minority of steps contribute meaningfully, highlighting the need for optimization in the "Thinking" phase to reduce low-value steps.
This chart underscores the disparity in reasoning quality between phases and suggests opportunities to refine the "Thinking" process for greater efficiency.
</details>
<details>
<summary>x33.png Details</summary>

### Visual Description
## Heatmap: Importance Scores Across Reasoning Steps
### Overview
The image displays a heatmap visualizing importance scores across reasoning steps, divided into two sections: "Question" (left) and "Thinking" (right). A red dashed line labeled with a mean score (0.604) and ratio (0.198) spans the chart horizontally. The y-axis represents importance scores (Low to High), while the x-axis tracks reasoning steps (0–7000).
### Components/Axes
- **Y-Axis (Importance Score)**:
- Labeled "Importance Score" with a gradient from white (Low) to blue (High).
- A horizontal red dashed line at ~0.604 marks the mean score.
- **X-Axis (Reasoning Step)**:
- Labeled "Reasoning Step" with a linear scale from 0 to 7000.
- **Legend**:
- Positioned on the left, with blue representing "High" importance and white representing "Low" importance.
- **Embedded Text**:
- "Mean Score: 0.604; Ratio: 0.198" in red text on the red dashed line.
### Detailed Analysis
- **Question Section (Left)**:
- A solid blue vertical bar spans the entire y-axis, indicating consistently high importance scores for the question.
- **Thinking Section (Right)**:
- A grid of vertical blue lines with varying opacity, suggesting fluctuating importance scores across reasoning steps.
- Most lines are faint (low importance), with occasional dense clusters of darker blue (high importance).
- **Red Dashed Line**:
- Spans the chart horizontally, with its label centered near reasoning step 3500.
- Annotated with "Mean Score: 0.604" (mean importance) and "Ratio: 0.198" (likely the thinking phase's importance relative to the question phase's).
### Key Observations
1. **Question Dominance**: The question section shows uniform high importance, contrasting with the variability in the thinking section.
2. **Thinking Variability**: The thinking section exhibits sporadic high-importance steps (dark blue lines) but no clear trend.
3. **Mean vs. Ratio**: The mean score (0.604) is moderate, but the low ratio (0.198) indicates that high-importance steps are rare in the thinking process.
4. **Red Line Placement**: The red dashed line’s central position suggests the average importance score is mid-range, but the ratio highlights a skewed distribution toward low importance.
### Interpretation
- **Data Implications**: The chart demonstrates that the model prioritizes the question over the reasoning process. While the question consistently receives high importance, the thinking steps are mostly low-importance, with only occasional critical steps (dark blue lines).
- **Trend Verification**: The thinking section's lack of a clear upward/downward trend confirms the sporadic nature of high-importance steps. The red line's central placement aligns with the mean score but masks the low ratio, which reflects a distribution skewed toward low importance.
- **Anomalies**: The stark contrast between the question’s uniformity and the thinking section’s variability suggests potential inefficiencies in reasoning step prioritization. The low ratio (0.198) implies most reasoning steps are deemed unimportant, which may warrant optimization.
</details>
<details>
<summary>x34.png Details</summary>

### Visual Description
## Bar Chart: Importance Score Distribution Across Question and Reasoning Steps
### Overview
The image is a bar chart comparing the importance scores of two sections: "Question" and "Thinking". The y-axis represents "Importance Score" (ranging from "Low" to "High"), while the x-axis represents "Reasoning Step" (0 to 4000). A red dashed line labeled "Mean Score: 1.086; Ratio: 0.125" spans the chart horizontally, serving as a reference for average importance.
### Components/Axes
- **Y-Axis (Importance Score)**:
- Labeled "Importance Score" with a gradient from light blue (low) to dark blue (high).
- No numerical scale provided, but the gradient implies a continuous range.
- **X-Axis (Reasoning Step)**:
- Labeled "Reasoning Step" with a linear scale from 0 to 4000.
- Divided into two regions:
- **Question**: 0–40 (solid blue bar).
- **Thinking**: 40–4000 (variable blue bars).
- **Legend**:
- Located on the left, with a single entry: "High" (dark blue) and "Low" (light blue).
- **Red Dashed Line**:
- Positioned horizontally across the chart, labeled with "Mean Score: 1.086; Ratio: 0.125".
### Detailed Analysis
- **Question Section**:
- A single solid blue bar spans the entire 0–40 range on the x-axis.
- The bar is uniformly dark blue, indicating a high importance score.
- **Thinking Section**:
- Composed of numerous vertical blue bars (likely representing individual reasoning steps).
- Bars vary in height, with some reaching the "High" end of the y-axis and others near the "Low" end.
- The red dashed line (mean score) intersects the "Thinking" section, suggesting an average importance score of 1.086.
- The "Ratio: 0.125" is annotated on the red line, though its reference value is not explicitly stated (e.g., ratio of mean to maximum, or mean to question score).
### Key Observations
1. **Question Importance**: The "Question" section has a consistently high importance score, as indicated by the solid dark blue bar.
2. **Thinking Variability**: The "Thinking" section shows significant variability in importance scores, with some steps rated highly and others low.
3. **Mean Score**: The red dashed line (mean score of 1.086) is slightly above the baseline (assuming a baseline of 1.0), indicating an average importance slightly above neutral.
4. **Ratio Ambiguity**: The "Ratio: 0.125" is unclear without additional context (e.g., is it the mean divided by the question's score, or the mean divided by the maximum possible score?).
### Interpretation
- The chart highlights that the **question itself is critically important**, while the **reasoning steps exhibit mixed importance**. The mean score of 1.086 suggests that, on average, reasoning steps are moderately important, but the variability (e.g., some steps rated highly, others low) indicates inconsistency in their perceived value.
- The "Ratio: 0.125" likely reflects a proportional relationship (e.g., the mean score is 12.5% of a reference value), but without explicit context, its significance remains speculative. This could imply that the average importance of reasoning steps is a small fraction of the question's importance or another metric.
- The red dashed line serves as a visual anchor, emphasizing that the average importance of reasoning steps is relatively low compared to the question's high score.
### Spatial Grounding
- **Legend**: Left side, aligned with the y-axis.
- **Red Dashed Line**: Centered horizontally, spanning the entire chart.
- **Question Section**: Leftmost 40 units on the x-axis.
- **Thinking Section**: Rightmost 3960 units on the x-axis.
### Content Details
- **Question**: Single bar (0–40) with dark blue color (high importance).
- **Thinking**: Multiple bars (40–4000) with varying blue shades (low to high importance).
- **Mean Score**: 1.086 (red dashed line).
- **Ratio**: 0.125 (annotated on the red line).
### Notable Trends
- The "Question" section is a single, uniform bar, while the "Thinking" section shows no clear trend (e.g., no gradual increase or decrease in importance).
- The red dashed line (mean) is positioned slightly above the baseline, suggesting a slight overall positive bias in reasoning step importance.
### Conclusion
The chart underscores the critical role of the question in the reasoning process, contrasted with the variable importance of individual reasoning steps. The mean score and ratio provide quantitative context but require further clarification to fully interpret their implications.
</details>
Figure 11: Visualization of token importance scores for question and thinking tokens within DeepSeek-R1-Distill-Qwen-7B reasoning traces across six samples from AIME24, AIME25, AMC23, GPQA-D, GAOKAO2023EN, and MATH500.
### B.3 Additional Results
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | AIME24
### Overview
The chart compares the accuracy (%) of four strategies ("Full", "Top", "Random", "Bottom") across varying ratios (%) from 2 to 50. The y-axis ranges from 30% to 65%, with the "Full" strategy consistently achieving the highest accuracy, while "Bottom" remains the lowest. The "Top" and "Random" strategies show distinct trends, with "Top" improving steadily and "Random" exhibiting a late surge.
### Components/Axes
- **X-axis**: Ratio (%) (2, 4, 6, 8, 10, 20, 30, 40, 50)
- **Y-axis**: Accuracy (%) (30–65%)
- **Legend**:
- Gray dashed line: "Full"
- Red solid line: "Top"
- Green solid line: "Random"
- Blue solid line: "Bottom"
- **Legend Position**: Top-right corner
### Detailed Analysis
1. **"Full" (Gray Dashed Line)**:
- Flat line at ~65% accuracy across all ratios.
- No variation observed; consistently the highest performer.
2. **"Top" (Red Solid Line)**:
- Starts at ~55% (ratio 2) and increases steadily to ~62% (ratio 50).
- Slope: ~0.14% accuracy gain per ratio increment.
3. **"Random" (Green Solid Line)**:
- Begins at ~28% (ratio 2), dips to ~27% (ratio 8), then rises sharply.
- Reaches ~48% at ratio 50, showing a shallow early dip followed by a late surge.
4. **"Bottom" (Blue Solid Line)**:
- Fluctuates between ~30–35% across all ratios.
- Slight upward trend (from ~30% at ratio 2 to ~37% at ratio 50).
### Key Observations
- **Outlier**: "Random" strategy underperforms initially but surpasses "Bottom" after ratio 20.
- **Trend**: "Top" shows the most significant improvement with increasing ratios.
- **Anomaly**: "Full" remains flat despite ratio changes, suggesting it is unaffected by the ratio parameter.
### Interpretation
The chart demonstrates that the "Full" strategy is optimal, maintaining peak accuracy regardless of ratio. The "Top" strategy improves predictably with higher ratios, making it a viable alternative if resource constraints exist. The "Random" strategy’s late surge suggests potential inefficiencies in early stages or hidden patterns in later ratios. "Bottom" consistently underperforms, indicating systemic limitations. The data implies that strategy selection should prioritize "Full" for maximum accuracy, with "Top" as a secondary option for scalable applications.
</details>
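The Top / Random / Bottom curves in these ablations correspond to retaining a fixed fraction of thinking tokens chosen by importance score. The snippet below is a minimal sketch of the three selection strategies, assuming per-token scores are already available; the function name, score values, and budget semantics are illustrative, not the paper's exact procedure.

```python
import random

def select_tokens(scores, ratio, strategy, seed=0):
    """Choose which thinking-token indices to retain under a budget.

    "top" keeps the highest-importance tokens, "bottom" the lowest,
    and "random" a uniform sample; `ratio` is the fraction retained.
    """
    k = max(1, int(len(scores) * ratio))
    # Indices sorted by ascending importance score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    if strategy == "top":
        return order[-k:]
    if strategy == "bottom":
        return order[:k]
    if strategy == "random":
        return random.Random(seed).sample(range(len(scores)), k)
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical importance scores for five thinking tokens.
scores = [0.05, 0.9, 0.1, 0.7, 0.2]
kept_top = select_tokens(scores, ratio=0.4, strategy="top")
kept_bottom = select_tokens(scores, ratio=0.4, strategy="bottom")
```

Under this sketch, "top" retains the two decision-critical tokens (indices 1 and 3) while "bottom" keeps the two least informative ones, which is why the red and blue curves diverge so sharply at small ratios.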
<details>
<summary>x36.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | AIME25
### Overview
The chart compares the accuracy (%) of four different strategies ("Full", "Bottom", "Random", "Top") across varying ratios (%) from 2 to 50. Accuracy is measured on the y-axis (30–50%), while the x-axis represents the ratio (%) in increments of 2. The legend is positioned in the top-right corner, with distinct markers for each strategy.
### Components/Axes
- **X-axis (Ratio %)**: Labeled "Ratio (%)", ranging from 2 to 50 in increments of 2.
- **Y-axis (Accuracy %)**: Labeled "Accuracy (%)", ranging from 30 to 50 in increments of 2.5.
- **Legend**: Located in the top-right corner, with four entries:
- **Full**: Gray stars (★)
- **Bottom**: Blue squares (■)
- **Random**: Green triangles (▲)
- **Top**: Red circles (●)
### Detailed Analysis
1. **Full (Gray Stars)**:
- Constant accuracy at **49.5%** across all ratios.
- No variation observed; represents a baseline or theoretical maximum.
2. **Bottom (Blue Squares)**:
- Starts at **34%** (ratio 2%), gradually increases to **38.5%** (ratio 50%).
- Minor fluctuations: dips to **33.5%** at ratio 4%, peaks at **38.5%** at ratio 50%.
3. **Random (Green Triangles)**:
- Highly variable, with no clear trend.
- Peaks at **40%** (ratio 40%), then drops to **39%** (ratio 50%).
- Lowest point at **30%** (ratio 20%).
4. **Top (Red Circles)**:
- Steady upward trend from **44%** (ratio 2%) to **49.5%** (ratio 50%).
- Consistent growth with minor plateaus (e.g., 48% at ratio 10%).
### Key Observations
- **Top strategy** demonstrates the most significant improvement, rising steadily to match the "Full" baseline at a 50% ratio.
- **Random strategy** exhibits erratic behavior, with a sharp dip at 20% ratio and a late surge at 40%.
- **Full strategy** remains static, suggesting it may represent an ideal or control scenario.
- **Bottom strategy** shows moderate growth but lags behind Top and Full.
### Interpretation
The data suggests that the **Top strategy** is the most effective, with accuracy increasing linearly as the ratio grows. The **Random strategy**’s volatility implies a lack of systematic improvement, possibly due to unstructured sampling. The **Full strategy**’s constant high accuracy may indicate an optimal or theoretical ceiling, while the **Bottom strategy** represents a middle-ground approach with gradual but limited gains. The anomalies in the Random series (e.g., the 20% ratio dip) could reflect outliers or contextual factors not captured in the chart. Overall, the chart highlights the importance of structured, incremental improvements (Top) over random or static approaches.
</details>
<details>
<summary>x37.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | AMC23
### Overview
The chart compares the accuracy of four methods (Full, Bottom, Random, Top) across varying ratios (2% to 50%) on the AMC23 benchmark. The y-axis represents accuracy (%), and the x-axis represents the ratio (%). The "Full" method serves as the benchmark, while the other methods show varying performance trends.
### Components/Axes
- **X-axis (Ratio %)**: Labeled "Ratio (%)", with markers at 2, 4, 6, 8, 10, 20, 30, 40, 50.
- **Y-axis (Accuracy %)**: Labeled "Accuracy (%)", with markers at 65, 70, 75, 80, 85, 90, 95.
- **Legend**: Located in the top-right corner, with four entries:
- **Full** (gray dashed line)
- **Bottom** (blue line)
- **Random** (green line)
- **Top** (red line)
### Detailed Analysis
1. **Full (Benchmark)**:
- A dashed gray line remains consistently at ~95% accuracy across all ratios.
- No significant variation; serves as the reference point.
2. **Top (Red Line)**:
- Starts at ~88% at 2%, increases steadily to ~95% by 10%, then plateaus.
- Reaches the benchmark (95%) at 10% ratio and maintains it.
3. **Random (Green Line)**:
- Begins at ~64% at 2%, dips to ~63% at 6%, then rises to ~85% at 50%.
- Shows a general upward trend with minor fluctuations.
4. **Bottom (Blue Line)**:
- Starts at ~65% at 2%, dips to ~63% at 6%, then increases to ~80% at 50%.
- Exhibits a gradual upward trend with a notable dip at 6%.
### Key Observations
- The **Full** method (AMC23) is the highest-performing, maintaining ~95% accuracy regardless of ratio.
- The **Top** method outperforms others, surpassing the benchmark at 10% ratio.
- **Random** and **Bottom** methods show improvement with higher ratios but remain below the benchmark.
- The **Random** method has a minor dip at 6% ratio (~63%), while the **Bottom** method dips slightly at the same point.
### Interpretation
The chart demonstrates that increasing the ratio improves accuracy for all non-benchmark methods, with **Top** being the most effective. The **Full** method (AMC23) acts as a static benchmark, highlighting the performance gap. The **Random** and **Bottom** methods show potential for improvement but require higher ratios to approach the benchmark. The dip in **Random** and **Bottom** at 6% ratio may indicate an anomaly or suboptimal configuration at that specific ratio. Overall, the data suggests that the **Top** method is the most efficient for achieving high accuracy, while the **Full** method represents the ideal target.
</details>
<details>
<summary>x38.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | GPQA-D
### Overview
The chart illustrates the relationship between "Ratio (%)" (x-axis) and "Accuracy (%)" (y-axis) across four distinct data series. The y-axis ranges from 36% to 44%, while the x-axis spans 2% to 50%. Four lines represent different conditions: "Full" (gray dashed), "Bottom" (blue solid), "Random" (green solid), and "Top" (red solid). Key trends include a sharp rise in the "Random" series at higher ratios and stability in the "Top" series.
### Components/Axes
- **X-axis**: "Ratio (%)" (2% to 50%, increments of 2%).
- **Y-axis**: "Accuracy (%)" (36% to 44%, increments of 2%).
- **Legend**:
- Gray dashed line: "Full"
- Blue solid line: "Bottom"
- Green solid line: "Random"
- Red solid line: "Top"
- **Placement**: Legend is centered-right; axes are labeled with clear titles.
### Detailed Analysis
1. **Top (Red Solid Line)**:
- Starts at ~42% at 2%.
- Peaks at ~44% by 10%, then fluctuates between 43.5%–44%.
- Remains stable above 43% for ratios ≥10%.
2. **Random (Green Solid Line)**:
- Begins at ~36% at 2%.
- Dips to ~35.5% at 8%.
- Sharp upward trend from 30% (37%) to 40% (42%).
- Ends at ~42% at 50%.
3. **Bottom (Blue Solid Line)**:
- Stays between 36%–38% across all ratios.
- Minor fluctuations (e.g., 37.5% at 2%, 36.5% at 8%).
4. **Full (Gray Dashed Line)**:
- Horizontal line at ~44% across all ratios.
### Key Observations
- The "Top" series maintains the highest accuracy, closely tracking the "Full" line.
- The "Random" series exhibits a significant accuracy jump at ratios ≥30%, surpassing the "Bottom" series.
- The "Bottom" series remains consistently the lowest performer.
- The "Full" line acts as a ceiling, suggesting maximum achievable accuracy.
### Interpretation
The chart demonstrates that the "Top" condition achieves near-optimal accuracy (44%), aligning with the "Full" condition. The "Random" series's sharp rise at higher ratios suggests a threshold effect, where increased ratio values unlock higher performance. The "Bottom" series's stability implies it represents a baseline or control group. The "Full" line's flat trajectory indicates it may represent a theoretical maximum or reference point. The divergence between "Random" and "Top" at lower ratios highlights the importance of structured conditions for optimal performance.
</details>
<details>
<summary>x39.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | GK23EN Accuracy vs. Ratio
### Overview
The chart compares the accuracy performance of four methods ("Full," "Bottom," "Random," "Top") across varying ratios (2% to 50%). Accuracy is measured on the y-axis (62%–74%), while the x-axis represents the ratio percentage. The "Full" method shows the highest baseline accuracy, while "Random" demonstrates the steepest improvement with increasing ratio.
### Components/Axes
- **X-axis (Ratio %)**: Increments of 2% from 2% to 50%.
- **Y-axis (Accuracy %)**: Range from 62% to 74%.
- **Legend**: Positioned on the right, with four entries:
- **Full**: Dashed gray line with star markers.
- **Bottom**: Solid blue line with square markers.
- **Random**: Solid green line with triangle markers.
- **Top**: Solid red line with circle markers.
### Detailed Analysis
1. **Full (Gray Stars)**:
- Starts at ~72% accuracy at 2% ratio.
- Remains relatively flat, peaking at ~73.5% by 50% ratio.
- Minimal fluctuation throughout.
2. **Bottom (Blue Squares)**:
- Begins at ~62% accuracy at 2% ratio.
- Gradual upward trend, reaching ~67.5% at 50% ratio.
- Linear progression with no sharp changes.
3. **Random (Green Triangles)**:
- Starts at ~62% accuracy at 2% ratio.
- Steep upward trajectory, reaching ~69.5% at 50% ratio.
- Accelerates significantly between 20% and 50% ratio.
4. **Top (Red Circles)**:
- Begins at ~71% accuracy at 2% ratio.
- Consistent upward trend, peaking at ~73.5% by 50% ratio.
- Slight dip at 40% ratio (~73%) before recovering.
### Key Observations
- **Top vs. Full**: The "Top" method closely matches the "Full" method's performance, suggesting comparable effectiveness despite potential differences in implementation.
- **Random's Surge**: The "Random" method shows the most dramatic improvement (7.5% gain from 2% to 50% ratio), indicating sensitivity to ratio increases.
- **Bottom's Steady Climb**: The "Bottom" method exhibits the slowest growth (5.5% gain), implying limited responsiveness to ratio changes.
- **Convergence at 50%**: "Bottom" and "Random" approach ~67–70% accuracy at the maximum ratio, while "Top" converges with "Full" near ~73.5%, suggesting diminishing returns for the weaker strategies.
### Interpretation
The data highlights trade-offs between method complexity and scalability:
- **"Full" Method**: Likely represents a resource-intensive baseline (e.g., full model training) with stable performance, ideal for high-stakes applications.
- **"Top" Method**: A near-optimal alternative to "Full," possibly optimized for efficiency without significant accuracy loss.
- **"Random" Method**: Demonstrates high potential for improvement with increased ratio, suggesting it may leverage stochastic processes (e.g., random sampling) that benefit from larger datasets.
- **"Bottom" Method**: Least responsive to ratio changes, possibly indicating a fundamental limitation in its design (e.g., fixed feature set).
The chart underscores the importance of method selection based on resource constraints and performance requirements. While "Full" and "Top" prioritize stability, "Random" offers a cost-effective path to higher accuracy with sufficient ratio investment.
</details>
<details>
<summary>x40.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | MATH500 Accuracy vs. Ratio
### Overview
The chart compares the accuracy performance of four methods ("Full", "Top", "Bottom", "Random") across varying ratios (2-50%) on the MATH500 dataset. Accuracy is measured in percentage, with the "Full" method serving as a reference benchmark.
### Components/Axes
- **X-axis**: Ratio (%) - Discrete markers at 2, 4, 6, 8, 10, 20, 30, 40, 50
- **Y-axis**: Accuracy (%) - Continuous scale from 76 to 90
- **Legend**:
- Gray dashed line: Full (90% constant)
- Red solid line: Top
- Blue solid line: Bottom
- Green solid line: Random
- **Title**: Positioned at top-center in bold text
### Detailed Analysis
1. **Top (Red Line)**:
- Starts at ~85% accuracy at 2% ratio
- Gradually increases to ~90% by 50% ratio
- Maintains steady upward slope throughout
2. **Random (Green Line)**:
- Begins at ~76% accuracy at 2% ratio
- Accelerates sharply after 30% ratio
- Reaches ~85% accuracy at 50% ratio
3. **Bottom (Blue Line)**:
- Starts at ~76% accuracy at 2% ratio
- Shows minimal growth until 30% ratio
- Accelerates slightly after 30%, reaching ~83% at 50%
4. **Full (Gray Dashed Line)**:
- Horizontal reference line at 90% accuracy
- All methods fall short of this benchmark
### Key Observations
- The "Top" method demonstrates the most consistent improvement, closing the gap with the "Full" benchmark by 50% ratio
- "Random" method exhibits a threshold effect, with negligible gains below 30% ratio but significant improvement afterward
- "Bottom" method shows the slowest progress, maintaining ~76-78% accuracy until 30% ratio
- All methods except "Full" display non-linear growth patterns
### Interpretation
The data suggests that the "Top" method most effectively scales with increasing ratio, achieving near-optimal performance close to the "Full" benchmark. The "Random" method's performance characteristics imply potential utility in high-ratio scenarios despite poor initial results. The "Bottom" method's stagnant early performance raises questions about its architectural efficiency. The "Full" line likely represents theoretical maximum accuracy, serving as an aspirational target for model development. The divergence between methods at lower ratios highlights the importance of ratio optimization in model selection for MATH500 tasks.
</details>
<details>
<summary>x41.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | AIME24
### Overview
The chart compares the accuracy (%) of four data series ("Full," "Bottom," "Random," "Top") across varying ratios (%) from 2% to 50%. The y-axis represents accuracy (30–70%), and the x-axis represents ratio (%) with non-linear spacing (2, 4, 6, 8, 10, 20, 30, 40, 50). The legend is positioned in the top-right corner, with colors matching the data series.
### Components/Axes
- **X-axis (Ratio %)**: Labeled "Ratio (%)" with markers at 2, 4, 6, 8, 10, 20, 30, 40, 50.
- **Y-axis (Accuracy %)**: Labeled "Accuracy (%)" with markers at 30, 40, 50, 60, 70.
- **Legend**: Top-right corner, with labels:
- Gray dashed line: "Full"
- Blue solid line: "Bottom"
- Green solid line: "Random"
- Red solid line: "Top"
### Detailed Analysis
1. **Top (Red Solid Line)**:
- Starts at ~55% accuracy at 2% ratio.
- Sharp upward trend to ~70% by 8% ratio.
- Plateaus near 70% from 10% to 50% ratio.
- Key data points:
- 2%: ~55%
- 4%: ~65%
- 6%: ~68%
- 8%: ~70%
- 10%–50%: ~70%
2. **Full (Gray Dashed Line)**:
- Flat line at ~70% accuracy across all ratios.
- No visible variation.
3. **Bottom (Blue Solid Line)**:
- Starts at ~40% accuracy at 2% ratio.
- Slight dip to ~38% at 4% ratio.
- Gradual rise to ~45% by 50% ratio.
- Key data points:
- 2%: ~40%
- 4%: ~38%
- 6%: ~39%
- 8%: ~39%
- 10%: ~39%
- 20%: ~40%
- 30%: ~42%
- 40%: ~43%
- 50%: ~45%
4. **Random (Green Solid Line)**:
- Highly fluctuating, with peaks and troughs.
- Peaks at ~38% (4% ratio) and ~35% (50% ratio).
- Lowest point at ~28% (20% ratio).
- Key data points:
- 2%: ~36%
- 4%: ~38%
- 6%: ~32%
- 8%: ~34%
- 10%: ~36%
- 20%: ~28%
- 30%: ~29%
- 40%: ~30%
- 50%: ~35%
### Key Observations
- **Top and Full** achieve the highest accuracy, with Top showing rapid improvement at lower ratios and Full maintaining consistent performance.
- **Bottom** demonstrates a steady but modest improvement as ratio increases.
- **Random** exhibits erratic behavior, with no clear trend and significant dips (e.g., 20% ratio).
- **Diminishing Returns**: Top and Full plateau near 70% accuracy after 8–10% ratio, suggesting limited gains beyond this point.
### Interpretation
The data suggests that higher ratios (Top/Full) correlate with improved accuracy, while Random’s inconsistency implies lower reliability. The sharp rise in Top’s accuracy at lower ratios (2%–8%) indicates efficiency gains in early stages, but performance stabilizes afterward. Bottom’s gradual improvement highlights potential for scalability, whereas Random’s volatility underscores the need for structured approaches. The plateau in Top and Full implies that beyond 10% ratio, additional resources yield minimal accuracy improvements, pointing to possible optimization opportunities.
</details>
<details>
<summary>x42.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | AIME25 Performance by Ratio
### Overview
The chart illustrates the accuracy performance of four different strategies (Full, Bottom, Random, Top) across varying ratios (2% to 50%) on the AIME25 dataset. Accuracy is measured on a y-axis (20-60%), while the x-axis represents the ratio of data used. Four distinct lines represent each strategy's performance trend.
### Components/Axes
- **X-axis (Ratio %)**: Labeled "Ratio (%)", with tick marks at 2, 4, 6, 8, 10, 20, 30, 40, 50.
- **Y-axis (Accuracy %)**: Labeled "Accuracy (%)", with tick marks at 20, 25, 30, 35, 40, 45, 50, 55, 60.
- **Legend**: Located in the top-right corner, with four entries:
- Gray dashed line: "Full"
- Blue solid line: "Bottom"
- Green solid line: "Random"
- Red solid line: "Top"
- **Data Points**: Markers (circles) at each ratio interval for all strategies.
### Detailed Analysis
1. **Top (Red Line)**:
- Starts at ~35% accuracy at 2% ratio.
- Sharp increase to ~50% at 4%, ~55% at 6%, ~58% at 8%, and ~60% at 10%.
- Plateaus near 60% from 10% to 50% ratio.
- **Key Data Points**:
- 2%: 35%, 4%: 50%, 6%: 55%, 8%: 58%, 10%: 60%, 50%: 60%.
2. **Bottom (Blue Line)**:
- Begins at ~42% at 2%, dips to ~38% at 4%, then stabilizes at ~39-40% until 10%.
- Gradual rise to ~44% at 50%.
- **Key Data Points**:
- 2%: 42%, 4%: 38%, 6%: 39%, 8%: 40%, 10%: 38%, 50%: 44%.
3. **Random (Green Line)**:
- Fluctuates between ~34% (2%) and ~36% (8%), then drops to ~22% at 30%.
- Recovers to ~30% at 50%.
- **Key Data Points**:
- 2%: 34%, 4%: 35%, 6%: 33%, 8%: 36%, 10%: 31%, 20%: 30%, 30%: 22%, 40%: 25%, 50%: 30%.
4. **Full (Gray Dashed Line)**:
- Constant at 60% across all ratios.
- **Key Data Points**:
- All ratios: 60%.
### Key Observations
- **Top Strategy**: Dominates performance, achieving near-maximum accuracy (60%) after 10% ratio.
- **Bottom Strategy**: Shows a U-shaped trend, with a mid-range dip followed by recovery.
- **Random Strategy**: Highly volatile, with a significant drop at 30% ratio, suggesting instability.
- **Full Strategy**: Acts as a benchmark, maintaining a constant 60% accuracy regardless of ratio.
### Interpretation
The data suggests that the **Top strategy** is the most effective, likely leveraging high-quality or prioritized data to achieve optimal results. The **Bottom strategy** performs moderately, with initial underperformance improving as more data is incorporated. The **Random strategy**'s erratic behavior indicates poor reliability, possibly due to overfitting or noise in the dataset. The **Full strategy**'s constant 60% accuracy may represent an upper bound or idealized performance, serving as a reference point. The sharp rise in Top strategy accuracy at lower ratios implies that prioritizing critical data points (e.g., top-ranked examples) is more impactful than using larger, unfiltered datasets. The Random strategy's dip at 30% ratio warrants further investigation, as it deviates significantly from other trends.
</details>
<details>
<summary>x43.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | AMC23
### Overview
The chart compares the accuracy (%) of four models (Full, Bottom, Random, Top) across varying "Ratio (%)" values (2-50). Accuracy is plotted on the y-axis (70-95%), with distinct trends observed for each model.
### Components/Axes
- **X-axis**: "Ratio (%)" with markers at 2, 4, 6, 8, 10, 20, 30, 40, 50.
- **Y-axis**: "Accuracy (%)" with markers at 70, 75, 80, 85, 90, 95.
- **Legend**: Located in the top-right corner, associating:
- Gray dashed line: Full
- Blue squares: Bottom
- Green triangles: Random
- Red circles: Top
### Detailed Analysis
1. **Top (Red Circles)**:
- Starts at ~87% accuracy at 2% ratio.
- Sharp upward trend to ~95% by 8% ratio.
- Plateaus near 95% for ratios ≥8%.
2. **Full (Gray Dashed Line)**:
- Consistently flat at ~95% accuracy across all ratios.
3. **Bottom (Blue Squares)**:
- Fluctuates between ~70-75% accuracy.
- Minor peaks at 4% (~74%), 10% (~74%), and 30% (~75%).
- Dips to ~72% at 8% and 20% ratios.
4. **Random (Green Triangles)**:
- Begins at ~70% at 2% ratio.
- Dips to ~69% at 10% ratio.
- Sharp rise to ~83% at 50% ratio.
### Key Observations
- **Top Model**: Rapid improvement in accuracy with low ratios (2%→8%), then stabilizes.
- **Full Model**: Maintains highest accuracy (~95%) regardless of ratio.
- **Random Model**: Poor performance at low ratios but improves significantly at 50%.
- **Bottom Model**: Stable but low accuracy (~70-75%) across all ratios.
### Interpretation
The data suggests:
1. **Top and Full Models** outperform others, with Top showing rapid gains at low ratios and Full maintaining consistency.
2. **Random Model**'s improvement at 50% ratio implies potential benefits from higher sampling or resource allocation.
3. **Bottom Model**'s stability suggests it may be less sensitive to ratio changes but underperforms compared to others.
4. The **Full Model**'s flat line indicates it might represent a baseline or optimal configuration unaffected by ratio adjustments.
The chart highlights trade-offs between model configurations and resource ratios, with Top and Full models being most effective for accuracy-critical applications.
</details>
<details>
<summary>x44.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | GPQA-D Accuracy vs. Ratio
### Overview
The chart compares the accuracy performance of four different configurations (Full, Bottom, Random, Top) across varying ratios (2% to 50%). Accuracy is measured on a y-axis (36%–50%), while the x-axis represents the ratio percentage. A gray dashed reference line at 50% accuracy is included for benchmarking.
### Components/Axes
- **X-axis**: Ratio (%) – Increments from 2% to 50% in 2% steps.
- **Y-axis**: Accuracy (%) – Scale from 36% to 50%.
- **Legend**: Located in the top-right corner, with four entries:
- **Full**: Gray dashed line (flat performance).
- **Bottom**: Blue line (lowest initial accuracy).
- **Random**: Green line (most volatile trend).
- **Top**: Red line (highest final accuracy).
- **Reference Line**: Gray dashed line at 50% accuracy.
### Detailed Analysis
1. **Full (Gray Dashed Line)**:
- Remains flat at ~48–50% accuracy across all ratios.
- No significant variation observed.
2. **Bottom (Blue Line)**:
- Starts at ~40% accuracy at 2% ratio.
- Dips to ~39% at 10% ratio.
- Gradually rises to ~42% at 50% ratio.
- Trend: Slight upward trajectory with minor fluctuations.
3. **Random (Green Line)**:
- Begins at ~38% accuracy at 2% ratio.
- Drops to ~36% at 10% ratio.
- Sharp upward spike to ~45% at 40% ratio.
- Continues rising to ~48% at 50% ratio.
- Trend: Highly volatile, with a dramatic increase in later ratios.
4. **Top (Red Line)**:
- Starts at ~48% accuracy at 2% ratio.
- Peaks at ~50% accuracy by 40% ratio.
- Slight dip to ~49.5% at 50% ratio.
- Trend: Steady upward climb with minor stabilization at higher ratios.
### Key Observations
- **Top** and **Full** configurations consistently outperform others, with **Top** reaching the 50% benchmark.
- **Random** shows the most significant improvement, surpassing **Bottom** and approaching **Full** at higher ratios (40%+).
- **Bottom** remains the lowest-performing configuration throughout.
- The **Random** configuration’s sharp rise at 40% ratio suggests a potential threshold effect or optimization at mid-to-high ratios.
### Interpretation
The data suggests that **Top** and **Full** configurations are optimized for high accuracy, with **Top** effectively matching the **Full** benchmark. The **Random** configuration’s volatility indicates inconsistent behavior, though it surpasses **Bottom** at higher ratios. The **Bottom** configuration’s flat trajectory implies limited adaptability. The 50% reference line highlights a performance ceiling, with only **Top** and **Full** approaching it. The sharp rise in **Random** at 40% ratio warrants further investigation into whether specific ratio thresholds unlock hidden efficiencies.
</details>
<details>
<summary>x45.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | GK23EN Accuracy vs. Ratio
### Overview
The chart compares the accuracy performance of four different configurations (Full, Bottom, Random, Top) across varying ratios (2% to 50%). Accuracy is measured on the y-axis (66%–78%), while the x-axis represents the ratio percentage. The legend is positioned at the top-right corner, with distinct line styles and markers for each configuration.
### Components/Axes
- **X-axis (Ratio %)**: Labeled "Ratio (%)", with ticks at 2, 4, 6, 8, 10, 20, 30, 40, 50.
- **Y-axis (Accuracy %)**: Labeled "Accuracy (%)", with ticks at 66, 68, 70, 72, 74, 76, 78.
- **Legend**: Located at the top-right, with:
- **Full**: Gray dashed line (flat line at ~78%).
- **Bottom**: Blue squares (line starts at ~66%, rises to ~70% by 50%).
- **Random**: Green triangles (line fluctuates between ~65%–67%, spikes to ~72% at 50%).
- **Top**: Red circles (line starts at ~70%, rises sharply to ~78% by 8%, then plateaus).
### Detailed Analysis
1. **Top (Red Circles)**:
- Starts at ~70% accuracy at 2% ratio.
- Sharp upward trend to ~78% by 8% ratio.
- Plateaus at ~78% from 8% to 50% ratio.
- **Key Trend**: Highest accuracy, with diminishing returns after 8%.
2. **Full (Gray Dashed Line)**:
- Flat line at ~78% accuracy across all ratios.
- **Key Trend**: Consistent performance, unaffected by ratio changes.
3. **Bottom (Blue Squares)**:
- Starts at ~66% at 2% ratio.
- Gradual increase to ~70% by 50% ratio.
- **Key Trend**: Slow, linear improvement with ratio.
4. **Random (Green Triangles)**:
- Fluctuates between ~65%–67% for ratios 2%–40%.
- Sharp spike to ~72% at 50% ratio.
- **Key Trend**: Unstable performance until 50%, where accuracy jumps abruptly.
### Key Observations
- **Top Configuration**: Dominates in accuracy, achieving ~78% with minimal ratio (8%).
- **Full Configuration**: Matches Top’s peak accuracy but maintains it consistently.
- **Random Configuration**: Exhibits erratic behavior, with a significant outlier at 50%.
- **Bottom Configuration**: Shows steady but suboptimal improvement.
### Interpretation
The data suggests that the **Top** and **Full** configurations are the most effective, with Top achieving peak accuracy at lower ratios and Full maintaining stability. The **Random** configuration’s abrupt spike at 50% may indicate a threshold effect or data anomaly, as its performance is otherwise inconsistent. The **Bottom** configuration demonstrates gradual improvement but remains the least accurate overall. The chart highlights the importance of ratio optimization for Top and Full configurations, while Random’s performance requires further investigation to determine reliability.
</details>
<details>
<summary>x46.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | MATH500
### Overview
The chart illustrates the relationship between "Ratio (%)" (x-axis) and "Accuracy (%)" (y-axis) across four distinct data series: Full, Bottom, Random, and Top. The y-axis ranges from 80% to 95%, while the x-axis spans from 2% to 50% in increments. The legend is positioned in the upper-right quadrant, with color-coded labels for each series.
### Components/Axes
- **X-axis (Ratio %)**: Labeled "Ratio (%)", with markers at 2, 4, 6, 8, 10, 20, 30, 40, and 50.
- **Y-axis (Accuracy %)**: Labeled "Accuracy (%)", with markers at 82, 84, 86, 88, 90, 92, and 94.
- **Legend**: Located in the upper-right corner, with the following mappings:
- **Full**: Gray dashed line (constant value).
- **Bottom**: Blue solid line.
- **Random**: Green solid line.
- **Top**: Red solid line.
### Detailed Analysis
1. **Full (Gray Dashed Line)**:
- Maintains a constant accuracy of **94%** across all ratios.
- Positioned at the top of the chart, unaffected by ratio changes.
2. **Top (Red Solid Line)**:
- Starts at **~82%** at 2% ratio.
- Sharp upward trend to **~94%** by 4% ratio.
- Plateaus near **94%** for ratios ≥4%.
- Key data points:
- 2%: ~82%
- 4%: ~88%
- 8%: ~92%
- 10%: ~93%
- 20%: ~94%
- 30%: ~94%
- 40%: ~94%
- 50%: ~94%
3. **Bottom (Blue Solid Line)**:
- Begins at **~82%** at 2% ratio.
- Gradual upward trend to **~86%** at 50% ratio.
- Key data points:
- 2%: ~82%
- 4%: ~82.5%
- 8%: ~83%
- 10%: ~83.2%
- 20%: ~84%
- 30%: ~84.5%
- 40%: ~85%
- 50%: ~86%
4. **Random (Green Solid Line)**:
- Stable at **~82%** until 40% ratio.
- Sudden jump to **~85%** at 50% ratio.
- Key data points:
- 2%: ~82%
- 4%: ~82.2%
- 8%: ~82.1%
- 10%: ~82%
- 20%: ~81.5%
- 30%: ~81.8%
- 40%: ~82.5%
- 50%: ~85%
### Key Observations
- **Top Series**: Dominates performance, achieving near-peak accuracy (94%) after a rapid initial increase.
- **Bottom Series**: Shows consistent but slower improvement compared to Top.
- **Random Series**: Exhibits minimal variation until a late-stage spike at 50% ratio.
- **Full Series**: Represents a theoretical upper bound, unaffected by ratio adjustments.
### Interpretation
The chart suggests that the "Top" strategy achieves the highest accuracy, rapidly closing the gap with the idealized "Full" baseline as the ratio grows. The "Random" series underperforms until a late-stage anomaly, and the "Bottom" series demonstrates steady but suboptimal growth. The "Full" line’s constancy implies it may represent a control or reference model, while the "Top" line’s sharp rise indicates a highly effective, ratio-sensitive approach. The "Random" series’ late jump at 50% could signal an outlier or a contextual factor not reflected in lower ratios.
</details>
Figure 12: Reasoning performance trends of Llama and Qwen models across six datasets as a function of thinking token retention ratio.
We present the evaluation results of R1-Llama (DeepSeek-R1-Distill-Llama-8B) and R1-Qwen (DeepSeek-R1-Distill-Qwen-7B) on the AIME24, AIME25, AMC23, GPQA-Diamond, GK23EN, and MATH-500 datasets. Figs. 10 and 11 illustrate how token importance scores vary over decoding steps for representative samples from each dataset, and Fig. 12 shows reasoning performance under different thinking-token retention strategies and retention ratios.
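The three retention strategies compared in Fig. 12 can be sketched as follows. This is a minimal illustration under the assumption that per-token importance scores are already available; the function and variable names are ours, not the paper's:

```python
import random

def retain_tokens(scores, ratio, strategy):
    """Return indices of thinking tokens to keep under a retention strategy.

    scores   -- per-token importance scores (higher = more important)
    ratio    -- fraction of tokens to retain, e.g. 0.1 for 10%
    strategy -- "top": highest-scoring tokens (decision-critical selection)
                "bottom": lowest-scoring tokens (control)
                "random": uniform random subset (control)
    """
    k = max(1, int(len(scores) * ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending by score
    if strategy == "top":
        kept = order[-k:]
    elif strategy == "bottom":
        kept = order[:k]
    elif strategy == "random":
        kept = random.sample(range(len(scores)), k)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(kept)  # preserve original token order

scores = [0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6]
print(retain_tokens(scores, 0.25, "top"))  # → [0, 3]
```

Sweeping `ratio` over {2%, 4%, ..., 50%} for each strategy and re-evaluating accuracy reproduces the experimental axes of Fig. 12.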
## Appendix C Detailed Experimental Setup
### C.1 Device and Environment
All experiments were conducted on a server equipped with 8 NVIDIA H800 GPUs. We use the Hugging Face transformers library and vLLM as the primary inference engines, and the veRL framework to train the importance predictor.
### C.2 Training of Importance Predictor
We generate 5 diverse reasoning traces for each question in the MATH training dataset and retain only the traces that lead to the correct answer as training data. This process yields 7,142 and 7,064 valid samples for R1-Qwen and R1-Llama, respectively. The importance predictor is implemented as an MLP with dimensions $(3584\to 7168\to 1792\to 1)$ for R1-Qwen and $(4096\to 8192\to 2048\to 1)$ for R1-Llama. We train for a total of 15 epochs using the veRL framework. The global batch size is set to 256, with a micro-batch size of 4 and gradient accumulation. Optimization uses the AdamW optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.95$, and a weight decay of $0.01$. The learning rate is initialized at $5\times 10^{-4}$ with a cosine decay schedule, and the gradient-clipping threshold is set to $1.0$.
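As a concrete reference, the importance predictor described above is a 3-layer MLP that maps a hidden state to a scalar score. A minimal PyTorch sketch with the R1-Qwen dimensions follows; the activation function is our assumption (the paper does not specify it):

```python
import torch
import torch.nn as nn

class ImportancePredictor(nn.Module):
    """MLP scoring head: hidden state -> scalar token-importance score.

    Dimensions follow the R1-Qwen setting (3584 -> 7168 -> 1792 -> 1);
    for R1-Llama, pass hidden_size=4096 to get (4096 -> 8192 -> 2048 -> 1).
    """
    def __init__(self, hidden_size=3584):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 2),
            nn.SiLU(),                               # activation is our assumption
            nn.Linear(hidden_size * 2, hidden_size // 2),
            nn.SiLU(),
            nn.Linear(hidden_size // 2, 1),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len)
        return self.net(hidden_states).squeeze(-1)

predictor = ImportancePredictor()
# Optimizer hyperparameters taken from the setup described above.
optimizer = torch.optim.AdamW(
    predictor.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.01,
)
scores = predictor(torch.randn(2, 16, 3584))
print(scores.shape)  # torch.Size([2, 16])
```

The cosine learning-rate decay and gradient clipping at 1.0 would be applied in the training loop (e.g., via a scheduler and `torch.nn.utils.clip_grad_norm_`).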
### C.3 Inference Setting
<details>
<summary>x47.png Details</summary>

### Visual Description
## Bar Chart: KV Cache Length Comparison (Transformers vs DynTS)
### Overview
The chart compares KV Cache Length (in 10³) between two models: Transformers (blue bars) and DynTS (red bars) across six datasets. Each bar pair includes a multiplier indicating the relative efficiency of DynTS compared to Transformers.
### Components/Axes
- **Y-axis**: KV Cache Length (10³), linear scale from 0 to 20.
- **X-axis**: Datasets: AIME24, AIME25, AMC23, GaoKao2023En, GPQA-D, MATH500.
- **Legend**:
- Blue = Transformers
- Red = DynTS (Ours)
- **Annotations**: Multipliers (e.g., "3.4x") above each bar pair, showing DynTS efficiency relative to Transformers.
### Detailed Analysis
1. **AIME24**:
- Transformers: ~17.0 (10³)
- DynTS: ~5.0 (10³)
- Multiplier: 3.4x
2. **AIME25**:
- Transformers: ~17.5 (10³)
- DynTS: ~5.0 (10³)
- Multiplier: 3.4x
3. **AMC23**:
- Transformers: ~17.0 (10³)
- DynTS: ~5.0 (10³)
- Multiplier: 3.3x
4. **GaoKao2023En**:
- Transformers: ~19.5 (10³)
- DynTS: ~5.0 (10³)
- Multiplier: 3.8x
5. **GPQA-D**:
- Transformers: ~17.0 (10³)
- DynTS: ~3.0 (10³)
- Multiplier: 5.6x
6. **MATH500**:
- Transformers: ~17.5 (10³)
- DynTS: ~3.0 (10³)
- Multiplier: 5.7x
### Key Observations
- **Transformers** consistently show higher KV Cache Lengths (16.5–19.5 × 10³) across all datasets.
- **DynTS** demonstrates significantly lower KV Cache Lengths (3.0–5.0 × 10³), with multipliers ranging from 3.3x to 5.7x.
- **GPQA-D** and **MATH500** exhibit the largest efficiency gains (5.6x and 5.7x), suggesting DynTS is particularly effective for these tasks.
- DynTS values drop below 4 × 10³ in GPQA-D and MATH500, while Transformers remain above 17 × 10³.
### Interpretation
The data highlights DynTS's superior efficiency in KV Cache Length compared to traditional Transformers. The varying multipliers indicate task-dependent performance: GPQA-D and MATH500 benefit most from DynTS optimizations, likely due to their complexity or structure. This suggests DynTS could reduce computational overhead in memory-intensive applications, though further analysis is needed to confirm causality. The scale disparity on the y-axis underscores DynTS's potential for resource-constrained environments.
</details>
(a) R1-Llama
<details>
<summary>x48.png Details</summary>

### Visual Description
## Bar Chart: KV Cache Length Comparison (Transformers vs DynTS)
### Overview
The chart compares KV Cache Length (in 10³ units) between two models: Transformers (blue bars) and DynTS (red bars) across six datasets. Each bar pair includes a multiplier indicating how many times larger the Transformer cache is compared to DynTS.
### Components/Axes
- **X-axis**: Datasets (AIME24, AIME25, AMC23, GaoKao2023En, GPQA-D, MATH500)
- **Y-axis**: KV Cache Length (10³ units), ranging from 0.0 to 20.0
- **Legend**: Top-center, with blue = Transformers, red = DynTS (Ours)
- **Annotations**: Multipliers (e.g., "3.4x") above each bar pair, indicating Transformer/DynTS ratio
### Detailed Analysis
| Dataset | Transformers (10³) | DynTS (10³) | Multiplier |
|-------------------|--------------------|-------------|------------|
| AIME24 | ~17.0 | ~5.0 | 3.4x |
| AIME25 | ~17.5 | ~5.0 | 3.4x |
| AMC23 | ~17.0 | ~5.0 | 3.3x |
| GaoKao2023En | ~19.0 | ~5.0 | 3.8x |
| GPQA-D | ~17.0 | ~3.1 | 5.5x |
| MATH500 | ~17.5 | ~3.1 | 5.7x |
### Key Observations
1. **Transformer Dominance**: Transformers consistently require 3–5.7x more KV Cache Length than DynTS across all datasets.
2. **Efficiency Gains**: DynTS achieves the highest efficiency (5.5–5.7x) in GPQA-D and MATH500, suggesting dataset-specific optimizations.
3. **Consistency**: Multipliers remain stable (3.3–3.8x) for most datasets except GPQA-D and MATH500, where efficiency gains spike.
### Interpretation
The data demonstrates that DynTS significantly reduces KV Cache Length compared to standard Transformers, with efficiency gains amplifying in complex reasoning tasks (GPQA-D, MATH500). This implies DynTS’s dynamic state management is particularly effective for multi-step reasoning, though the exact mechanisms (e.g., state pruning, attention optimization) would require deeper analysis. The near-identical Transformer cache sizes across datasets suggest uniform architectural overhead, while DynTS’s variable efficiency highlights its adaptability to task complexity.
</details>
(b) R1-Qwen
Figure 13: Comparison of average KV Cache length between standard Transformers and DynTS across six benchmarks. The arrows and annotations indicate the compression ratio achieved by our method on each dataset.
<details>
<summary>x49.png Details</summary>

### Visual Description
## Line Chart: Performance Comparison of Transformers vs DynTS Across Decoding Steps
### Overview
The image presents three vertically stacked line charts comparing the performance of two systems: "Transformers" (gray line) and "DynTS" (red line) across three metrics: Throughput (TPS), KV Memory (GB), and GFLOPs. The x-axis represents decoding steps (0–15k), with vertical dashed lines marking key intervals (2k, 5k, 7k, 10k, 12k, 15k). Annotations with multipliers (e.g., "1.39×") highlight performance differences at specific decoding steps.
---
### Components/Axes
1. **Top Subplot: Throughput (TPS)**
- **Y-axis**: Throughput (TPS) from 0 to 1250.
- **X-axis**: Decoding Steps (0–15k).
- **Legend**: Gray = Transformers, Red = DynTS.
- **Annotations**: Multipliers (e.g., "1.39×", "1.86×") on the red line at decoding steps 2k, 5k, 7k, 10k, 12k, and 15k.
2. **Middle Subplot: KV Memory (GB)**
- **Y-axis**: KV Memory (GB) from 0 to 15.
- **X-axis**: Decoding Steps (0–15k).
- **Legend**: Gray = Transformers, Red = DynTS.
- **Annotations**: Multipliers (e.g., "0.64×", "0.47×") on the red line at decoding steps 2k, 5k, 7k, 10k, 12k, and 15k.
3. **Bottom Subplot: GFLOPs**
- **Y-axis**: GFLOPs from 20 to 40.
- **X-axis**: Decoding Steps (0–15k).
- **Legend**: Gray = Transformers, Red = DynTS.
- **Annotations**: Multipliers (e.g., "0.85×", "0.74×") on the red line at decoding steps 2k, 5k, 7k, 10k, 12k, and 15k.
- **Inset**: Zoomed view of decoding steps 4500–4900, showing a "1.003×" multiplier at 4700 steps.
---
### Detailed Analysis
#### Throughput (TPS)
- **Transformers (gray)**: Starts at ~1200 TPS, declines sharply to ~250 TPS by 15k steps.
- **DynTS (red)**: Starts at ~1100 TPS, declines more gradually, maintaining higher values than Transformers at later steps. Multipliers increase from 1.39× (2k) to 3.74× (15k), indicating DynTS outperforms Transformers by ~3.7x at 15k steps.
#### KV Memory (GB)
- **Transformers (gray)**: Linear increase from ~0 to ~15 GB.
- **DynTS (red)**: Linear increase from ~0 to ~3 GB. Multipliers decrease from 0.64× (2k) to 0.20× (15k), showing DynTS uses ~80% less memory than Transformers at 15k steps.
#### GFLOPs
- **Transformers (gray)**: Linear increase from ~22.5 to ~40 GFLOPs.
- **DynTS (red)**: Linear increase from ~20 to ~25 GFLOPs. Multipliers decrease from 0.85× (2k) to 0.44× (15k), indicating DynTS incurs only ~44% of Transformers' compute cost at 15k steps (a ~56% reduction).
---
### Key Observations
1. **Throughput**: DynTS maintains higher throughput than Transformers at later decoding steps, with performance gains increasing over time.
2. **Memory Efficiency**: DynTS consistently uses less memory than Transformers, with efficiency improving as decoding steps increase.
3. **Compute Efficiency**: DynTS requires fewer GFLOPs than Transformers, and the gap widens as decoding progresses.
4. **Inset Anomaly**: At 4700 steps, DynTS's compute briefly reaches parity with Transformers (1.003×), likely just before a periodic selection step when the cache is at its largest; this is an outlier against the broader downward trend.
---
### Interpretation
The data suggests **DynTS improves throughput, memory, and compute efficiency simultaneously** relative to Transformers. The growing throughput multiplier (up to 3.74×) implies DynTS becomes increasingly effective on long decoding tasks, while its lower GFLOPs reflect the smaller attention context left after eviction. The inset shows a rare instant of near compute parity, but the gap reopens as decoding continues. Overall, DynTS appears tailored for long-generation scenarios where KV cache growth dominates the cost.
</details>
Figure 14: Real-time throughput, memory, and compute overhead tracking for R1-Qwen over total decoding steps. The results exhibit a trend consistent with R1-Llama, confirming the scalability of DynTS across different model architectures.
We implement DynTS and all baseline methods using the Hugging Face transformers library for KV cache compression. To ensure fairness, we use the effective KV cache length as the compression signal: whenever the cache size reaches the predefined budget, all methods are restricted to retaining an identical number of KV pairs. For SnapKV, H2O, and R-KV, we set local window sizes and retention ratios identical to those of our method. For SepLLM, we preserve the separator tokens and evict the earliest generated non-separator tokens until the total cache length matches ours. For StreamingLLM, we set the same sink token size, following (Xiao et al., 2024). We set the number of parallel generated sequences to 20. The generation hyperparameters are configured as follows: temperature $T=0.6$, top-$p=0.95$, top-$k=20$, and a maximum new token limit of 16,384. We conduct 5 independent sampling runs for all datasets. Ablation studies on the local window, retention ratio, and budget are conducted across four challenging benchmarks (AIME24, AIME25, AMC23, and GPQA-D) with all other configurations held fixed. Fig. 6 and 8 report the mean results across these datasets.
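For reference, the decoding hyperparameters above map onto the Hugging Face `generate()` API as follows (a minimal sketch; the model and inputs are omitted):

```python
# Decoding hyperparameters used for all methods (values from the setup above).
# Parameter names follow the Hugging Face transformers generate() API.
gen_kwargs = dict(
    do_sample=True,           # stochastic sampling, as implied by T/top-p/top-k
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    max_new_tokens=16_384,
    num_return_sequences=20,  # 20 parallel generated sequences
)

# Usage (model and tokenized inputs are placeholders, not provided here):
# outputs = model.generate(**inputs, **gen_kwargs)
```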
## Appendix D Additional Results
Inference Efficiency on R1-Qwen. Complementing the efficiency analysis of R1-Llama presented in the main text, Figure 14 illustrates the real-time throughput, memory footprint, and computational overhead for R1-Qwen. Consistent with previous observations, DynTS exhibits significant scalability advantages over the standard Transformer baseline as the sequence length increases, achieving a peak throughput speedup of $3.74\times$ while compressing the memory footprint to $0.20\times$ and reducing the cumulative computational cost (GFLOPs) to $0.44\times$ after the last KV cache selection step. The recurrence of the characteristic sawtooth pattern further validates the robustness of our periodic KV Cache Selection mechanism, demonstrating that it effectively bounds resource accumulation and delivers substantial efficiency gains across diverse LRM architectures by continuously evicting non-essential thinking tokens.
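The sawtooth pattern arises because the cache grows by one KV pair per decoded token until it reaches the budget, at which point a selection step evicts down to the retained fraction. A toy simulation of this dynamic (the budget, ratio, and step count below are illustrative, not the exact experimental values):

```python
def simulate_cache_length(total_steps, budget, keep_ratio):
    """Toy model of periodic KV cache selection: the cache grows by one
    KV pair per decoding step; once it reaches `budget`, a selection step
    keeps only a `keep_ratio` fraction of entries (the sawtooth drop)."""
    lengths = []
    cache = 0
    for _ in range(total_steps):
        cache += 1
        if cache >= budget:
            cache = int(budget * keep_ratio)  # eviction step
        lengths.append(cache)
    return lengths

trace = simulate_cache_length(total_steps=10_000, budget=3_000, keep_ratio=0.4)
peak = max(trace)  # bounded by the budget; a full cache would reach 10,000
```

The simulation shows why resource accumulation stays bounded: the peak cache length never exceeds the budget, regardless of how long decoding runs.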
KV Cache Compression Ratio. Figure 13 explicitly visualizes the reduction in KV Cache length achieved by DynTS across diverse reasoning tasks. By dynamically filtering out non-essential thinking tokens, our method drastically reduces the memory footprint compared to the full-cache Transformers baseline. For instance, on the MATH500 benchmark, DynTS achieves an impressive compression ratio of up to $5.7\times$ , reducing the average cache length from over 17,000 tokens to the constrained budget of 3,000. These results directly explain the memory and throughput advantages reported in the efficiency analysis, confirming that DynTS successfully maintains high reasoning accuracy with a fraction of the memory cost.
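Since KV cache memory scales linearly with the number of retained tokens, the length compression in Figure 13 maps directly onto memory savings. A back-of-the-envelope sketch (the model dimensions below are illustrative placeholders, not the actual R1 configurations):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative config; 17,000 vs 3,000 tokens matches the MATH500 example above.
full  = kv_cache_bytes(seq_len=17_000, n_layers=32, n_kv_heads=8, head_dim=128)
dynts = kv_cache_bytes(seq_len=3_000,  n_layers=32, n_kv_heads=8, head_dim=128)
# Memory is linear in retained length, so the ~5.7x length reduction
# translates directly into a ~5.7x memory reduction.
```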
<details>
<summary>x50.png Details</summary>

### Visual Description
## Line Chart: Model Performance Metrics
### Overview
The image contains two line charts stacked vertically. The top chart compares two metrics (MSE Loss and Kendall) over training steps, while the bottom chart shows multiple overlapping lines representing "Overlap Rate (%)" across different percentile thresholds (Top-20% to Top-90%) over the same step range.
### Components/Axes
**Top Chart:**
- **X-axis (Step):** Ranges from 0 to 400 in increments of 50.
- **Y-axis (Value):** Ranges from 0 to 15 in increments of 5.
- **Legend:**
- Blue line: MSE Loss
- Orange line: Kendall
**Bottom Chart:**
- **X-axis (Step):** Same as top chart (0–400).
- **Y-axis (Overlap Rate %):** Ranges from 20% to 100% in increments of 20%.
- **Legend:**
- Purple: Top-20%
- Dark blue: Top-30%
- Teal: Top-40%
- Light blue: Top-50%
- Green: Top-60%
- Yellow: Top-70%
- Lime: Top-80%
- Bright yellow: Top-90%
### Detailed Analysis
**Top Chart Trends:**
1. **MSE Loss (Blue):**
- Starts with a sharp spike (~15) at Step 0, dropping to ~0.5 by Step 50.
- Remains stable at ~0.2–0.3 from Step 100 onward.
2. **Kendall (Orange):**
- Begins with a smaller spike (~3) at Step 0, declining to ~0.1 by Step 50.
- Stabilizes at ~0.05–0.1 from Step 100 onward.
**Bottom Chart Trends:**
- All lines show gradual decline over steps, with higher percentile thresholds (e.g., Top-90%) starting and ending at lower overlap rates.
- **Key Patterns:**
- Top-20% (purple) starts at ~85% and drops to ~60% by Step 400.
- Top-90% (bright yellow) starts at ~95% and decreases to ~85%.
- Lines exhibit jagged fluctuations, especially in early steps (0–100).
### Key Observations
1. **Top Chart:** Both metrics stabilize after initial volatility, suggesting convergence in model performance.
2. **Bottom Chart:**
- Higher percentile thresholds (e.g., Top-90%) maintain higher overlap rates throughout training.
- Lower thresholds (e.g., Top-20%) show steeper declines, indicating greater divergence from top-performing segments.
- Overlap rates for all thresholds plateau between Steps 200–400.
### Interpretation
The data suggests:
1. **Model Convergence:** The top chart indicates that both MSE Loss and Kendall metrics stabilize after ~100 steps, implying the model reaches a steady state.
2. **Performance Segmentation:** The bottom chart reveals that the model's predictions for higher percentile thresholds (e.g., Top-90%) remain more consistent with the "top" data segments, while lower thresholds (e.g., Top-20%) diverge significantly. This could reflect:
- **Selectivity:** The model prioritizes high-confidence predictions (Top-90%) over uncertain ones (Top-20%).
- **Training Dynamics:** Early steps may involve aggressive exploration, causing lower thresholds to overfit or underfit, while later steps refine predictions for dominant patterns.
3. **Anomalies:** The sharp initial spikes in the top chart (Step 0–50) might indicate initialization effects or data preprocessing artifacts.
**Critical Insight:** The divergence between high and low percentile overlap rates highlights potential trade-offs between model confidence and generalization. Further analysis could explore whether the Top-20% segment represents outliers or edge cases requiring specialized handling.
</details>
Figure 15: Training dynamics of the Importance Predictor on R1-Qwen. The top panel displays the convergence of MSE Loss and Kendall correlation, while the bottom panel shows the overlap rate of the top-$20\%$ ground-truth tokens within the top-$p\%$ ($p\in[20,90]$) predicted tokens across training steps.
Importance Predictor Analysis for R1-Qwen. Complementing the findings on R1-Llama, Figure 15 depicts the learning trajectory of the Importance Predictor for the R1-Qwen model. The training process exhibits a similar convergence pattern: the MSE loss rapidly decreases and stabilizes, while the Kendall rank correlation coefficient steadily improves, indicating that the simple MLP architecture effectively captures the importance ranking of thinking tokens in R1-Qwen. Furthermore, the bottom panel highlights the high overlap rate between the predicted and ground-truth critical tokens; notably, the overlap rate for the top-40% of tokens exceeds 80% after approximately 200 steps. This high alignment confirms that the Importance Predictor can accurately identify pivotal tokens within R1-Qwen’s reasoning process, providing a reliable basis for the subsequent KV cache compression.
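The overlap-rate metric in the bottom panel of Figure 15 can be computed as follows: take the predictor's top-$p\%$ tokens and measure what fraction of the ground-truth top-$20\%$ tokens fall inside that set. A minimal sketch with synthetic scores (the real metric uses attention-derived ground-truth importance):

```python
def overlap_rate(true_scores, pred_scores, true_frac=0.20, pred_frac=0.40):
    """Fraction of ground-truth top-`true_frac` tokens recovered within
    the predicted top-`pred_frac` tokens."""
    n = len(true_scores)
    k_true = max(1, int(n * true_frac))
    k_pred = max(1, int(n * pred_frac))
    # Indices of the highest-scoring tokens under each scoring.
    top_true = set(sorted(range(n), key=lambda i: true_scores[i], reverse=True)[:k_true])
    top_pred = set(sorted(range(n), key=lambda i: pred_scores[i], reverse=True)[:k_pred])
    return len(top_true & top_pred) / k_true

# A perfect predictor recovers every ground-truth critical token:
scores = [0.9, 0.1, 0.8, 0.05, 0.3, 0.2, 0.7, 0.15, 0.4, 0.25]
assert overlap_rate(scores, scores) == 1.0
```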
Budget Impact Analysis over Benchmarks. Figure 16 illustrates the granular impact of KV budget constraints on reasoning performance and system throughput. Focusing on R1-Llama, we observe a consistent trade-off across all datasets: increasing the KV budget significantly boosts reasoning accuracy at the cost of linearly decreasing throughput. Specifically, on the challenging AIME24 benchmark, expanding the budget from 2,500 to 5,000 tokens improves Pass@1 accuracy from $40.0\%$ to $49.3\%$, while the throughput decreases from $\sim$600 to $\sim$445 tokens/s. This suggests that while a tighter budget accelerates inference, a larger budget is essential for solving complex problems requiring extensive context retention. Experimental results on R1-Qwen exhibit a highly similar trend, confirming that the performance characteristics of DynTS are model-agnostic. Overall, our method allows users to flexibly balance efficiency and accuracy based on specific deployment requirements.
<details>
<summary>x51.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama Performance Across Datasets and KV Budgets
### Overview
The image displays four side-by-side bar charts comparing the performance of the R1-Llama model across four datasets (AIME24, AIME25, AMC23, GPQA-D) at varying KV Budgets (2500–5000). Each chart shows two metrics: **Pass@1** (accuracy) and **Throughput (TPS)**. The charts use blue bars for Pass@1 and orange lines for Throughput, with legends positioned in the top-right corner of each panel.
---
### Components/Axes
- **X-Axis**: KV Budget (2500, 3000, 3500, 4000, 4500, 5000)
- **Y-Axes**:
- Left: Pass@1 (percentage, varies per panel)
- Right: Throughput (TPS, consistent scale across panels)
- **Legends**:
- Blue bars: Pass@1
- Orange lines: Throughput
- **Panel Titles**:
- Top-left: Dataset name (e.g., "R1-Llama | AIME24")
---
### Detailed Analysis
#### Panel 1: R1-Llama | AIME24
- **Pass@1**:
- 2500 KV: 40.0%
- 3000 KV: 44.7%
- 3500 KV: 45.3%
- 4000 KV: 42.0%
- 4500 KV: 39.3%
- 5000 KV: 49.3%
- **Throughput (TPS)**:
- 2500 KV: 500
- 3000 KV: 450
- 3500 KV: 400
- 4000 KV: 350
- 4500 KV: 300
- 5000 KV: 250
#### Panel 2: R1-Llama | AIME25
- **Pass@1**:
- 2500 KV: 20.0%
- 3000 KV: 24.7%
- 3500 KV: 29.3%
- 4000 KV: 28.0%
- 4500 KV: 28.0%
- 5000 KV: 29.3%
- **Throughput (TPS)**:
- 2500 KV: 500
- 3000 KV: 450
- 3500 KV: 400
- 4000 KV: 350
- 4500 KV: 300
- 5000 KV: 250
#### Panel 3: R1-Llama | AMC23
- **Pass@1**:
- 2500 KV: 79.0%
- 3000 KV: 86.5%
- 3500 KV: 84.0%
- 4000 KV: 87.0%
- 4500 KV: 87.0%
- 5000 KV: 87.0%
- **Throughput (TPS)**:
- 2500 KV: 500
- 3000 KV: 450
- 3500 KV: 400
- 4000 KV: 350
- 4500 KV: 300
- 5000 KV: 250
#### Panel 4: R1-Llama | GPQA-D
- **Pass@1**:
- 2500 KV: 37.9%
- 3000 KV: 45.8%
- 3500 KV: 45.1%
- 4000 KV: 46.3%
- 4500 KV: 45.5%
- 5000 KV: 46.4%
- **Throughput (TPS)**:
- 2500 KV: 500
- 3000 KV: 450
- 3500 KV: 400
- 4000 KV: 350
- 4500 KV: 300
- 5000 KV: 250
---
### Key Observations
1. **Pass@1 Trends**:
- Pass@1 generally increases with KV Budget, though some panels show minor fluctuations (e.g., AIME24 drops at 4000 KV).
- AMC23 achieves the highest Pass@1 (87.0% at 4000+ KV), while AIME25 has the lowest (29.3% at 5000 KV).
2. **Throughput Trends**:
- Throughput consistently decreases as KV Budget increases across all datasets.
- The decline is linear, with a ~50 TPS drop per 500 KV increment.
3. **Dataset Variability**:
- AMC23 shows the most stable Pass@1 improvement, while AIME25 exhibits the weakest performance.
- GPQA-D demonstrates moderate gains in Pass@1 but follows the same throughput trade-off.
---
### Interpretation
- **Accuracy-Throughput Trade-off**: Higher KV Budgets improve accuracy (Pass@1) but reduce computational efficiency (Throughput). This suggests a critical balance for real-world deployment.
- **Dataset-Specific Behavior**:
- AMC23’s high Pass@1 indicates better model alignment with this dataset, possibly due to task similarity or data quality.
- AIME25’s low Pass@1 may reflect dataset complexity or model limitations.
- **Scalability Insight**: The linear Throughput decline implies diminishing returns at higher KV Budgets, highlighting the need for optimization strategies (e.g., quantization, parallelization).
This analysis underscores the importance of dataset-specific tuning and resource allocation when deploying R1-Llama in production environments.
</details>
<details>
<summary>x52.png Details</summary>

### Visual Description
## Bar Chart: Performance Metrics Across Datasets (R1-Qwen)
### Overview
The image displays four bar charts comparing performance metrics (Pass@1 and Throughput) for the R1-Qwen model across four datasets: AIME24, AIME25, AMC23, and GPQA-D. Each chart uses a KV Budget (x-axis) ranging from 2500 to 5000, with Pass@1 (y-axis: 20–100) and Throughput (y-axis: 600–800 TPS) as metrics. An orange line represents Throughput trends, while blue bars show Pass@1 values.
---
### Components/Axes
- **X-axis**: KV Budget (2500, 3000, 3500, 4000, 4500, 5000)
- **Y-axis (Left)**: Pass@1 (%) (20–100)
- **Y-axis (Right)**: Throughput (TPS) (600–800)
- **Legend**:
- Blue bars: Pass@1
- Orange line: Throughput
- **Panel Titles**:
- R1-Qwen | AIME24
- R1-Qwen | AIME25
- R1-Qwen | AMC23
- R1-Qwen | GPQA-D
---
### Detailed Analysis
#### R1-Qwen | AIME24
- **Pass@1**: 42.7 (2500), 46.0 (3000), 42.0 (3500), 46.0 (4000), 48.0 (4500), 52.0 (5000)
- **Throughput**: 750 (2500), 700 (3000), 650 (3500), 600 (4000), 550 (4500), 500 (5000)
- **Trend**: Pass@1 fluctuates slightly, while Throughput decreases steadily.
#### R1-Qwen | AIME25
- **Pass@1**: 30.0 (2500), 33.3 (3000), 36.0 (3500), 34.0 (4000), 36.7 (5000)
- **Throughput**: 750 (2500), 700 (3000), 650 (3500), 600 (4000), 550 (5000)
- **Trend**: Pass@1 increases modestly, while Throughput declines linearly.
#### R1-Qwen | AMC23
- **Pass@1**: 82.0 (2500), 84.5 (3000), 90.5 (3500), 87.5 (4000), 87.0 (4500), 88.5 (5000)
- **Throughput**: 750 (2500), 700 (3000), 650 (3500), 600 (4000), 550 (5000)
- **Trend**: Pass@1 peaks at 3500 KV Budget, then stabilizes. Throughput decreases consistently.
#### R1-Qwen | GPQA-D
- **Pass@1**: 44.6 (2500), 46.7 (3000), 48.0 (3500), 47.8 (4000), 48.4 (4500), 48.2 (5000)
- **Throughput**: 750 (2500), 700 (3000), 650 (3500), 600 (4000), 550 (5000)
- **Trend**: Pass@1 increases gradually, while Throughput declines steadily.
---
### Key Observations
1. **Throughput Consistency**: All datasets show a linear decline in Throughput as KV Budget increases, indicating a trade-off between computational resources and efficiency.
2. **Pass@1 Variability**:
- **AMC23** achieves the highest Pass@1 (up to 90.5%), suggesting superior performance on this dataset.
- **AIME25** has the lowest Pass@1 (30–36.7%), indicating potential challenges in task-specific optimization.
3. **Stability in GPQA-D**: Pass@1 remains relatively stable (~44.6–48.2%) despite increasing KV Budget.
---
### Interpretation
- **Trade-off Analysis**: The consistent decline in Throughput across all datasets highlights a universal efficiency constraint as computational resources (KV Budget) grow.
- **Dataset-Specific Performance**:
- **AMC23**’s high Pass@1 suggests it may be better suited for tasks requiring accuracy, possibly due to larger or more structured data.
- **AIME25**’s low Pass@1 could reflect task complexity or insufficient model adaptation.
- **GPQA-D**’s stable Pass@1 implies a balanced performance, making it a candidate for applications prioritizing consistency over peak accuracy.
- **Optimization Insight**: For AIME25, increasing KV Budget beyond 3500 yields diminishing returns in Pass@1, suggesting resource allocation should prioritize lower budgets for this dataset.
The data underscores the need for dataset-specific optimization strategies to balance accuracy and efficiency in R1-Qwen deployments.
</details>
Figure 16: Impact of budget on Pass@1 and throughput for R1-Llama (top) and R1-Qwen (bottom) across AIME24, AIME25, AMC23, and GPQA-D datasets. The blue bars represent accuracy (left y-axis), and the orange lines represent throughput (right y-axis).
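Given measurements like those in Figure 16, deployment-time budget selection reduces to picking the smallest budget that meets an accuracy target. A sketch using the R1-Llama AIME24 readings from the figure:

```python
# (budget, Pass@1) pairs for R1-Llama on AIME24, read from Figure 16.
points = [(2500, 40.0), (3000, 44.7), (3500, 45.3),
          (4000, 42.0), (4500, 39.3), (5000, 49.3)]

def smallest_budget_meeting(points, target_pass1):
    """Smallest KV budget whose measured Pass@1 reaches the target
    (returns None if no budget qualifies)."""
    for budget, acc in sorted(points):
        if acc >= target_pass1:
            return budget
    return None
```

Because throughput falls monotonically with budget, the smallest qualifying budget is also the fastest one that meets the target.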
<details>
<summary>x53.png Details</summary>

### Visual Description
## Heatmap: Performance Metrics Across Models and Datasets
### Overview
The image displays four heatmaps comparing performance metrics (Pass@1) for the R1-Llama model across different datasets (AIME24, AIME25, AMC23, GPQA_D) under varying **Ratio** (0.1–0.5) and **Local Window Size** (500–3000). Each heatmap uses a color gradient (light to dark) to represent values, with a legend on the right indicating the scale.
### Components/Axes
- **X-axis (Ratio)**: 0.1, 0.2, 0.3, 0.4, 0.5
- **Y-axis (Local Window Size)**: 500, 1000, 2000, 3000
- **Legend**: Color gradient from light (low values) to dark (high values), labeled "Pass@1" with a range of 24–48.
- **Sections**:
1. **R1-Llama | AIME24**
2. **R1-Llama | AIME25**
3. **R1-Llama | AMC23**
4. **R1-Llama | GPQA_D**
### Detailed Analysis
#### R1-Llama | AIME24
- **Values**:
- 500: 41.3, 42.7, 45.3, 44.7, 42.7
- 1000: 39.3, 49.3, 45.3, 44.0, 46.0
- 2000: 44.7, 47.3, 49.3, 46.0, 43.3
- 3000: 40.0, 45.3, 41.3, 42.7, 46.7
- **Trend**: Values peak at **Ratio 0.3** and **Window Size 2000** (49.3).
#### R1-Llama | AIME25
- **Values**:
- 500: 26.7, 30.7, 24.0, 26.7, 26.7
- 1000: 27.3, 26.7, 29.3, 30.7, 26.7
- 2000: 28.7, 27.3, 28.0, 26.7, 26.7
- 3000: 26.7, 26.7, 26.7, 26.7, 26.7
- **Trend**: Highest value at **Ratio 0.3** and **Window Size 1000** (30.7).
#### R1-Llama | AMC23
- **Values**:
- 500: 85.5, 87.0, 87.3, 87.3, 88.0
- 1000: 84.0, 87.0, 86.0, 88.0, 86.0
- 2000: 87.5, 88.0, 86.0, 88.5, 86.5
- 3000: 86.5, 86.5, 89.0, 87.5, 86.5
- **Trend**: Highest value at **Ratio 0.3** and **Window Size 3000** (89.0).
#### R1-Llama | GPQA_D
- **Values**:
- 500: 44.1, 44.9, 44.9, 45.8, 44.2
- 1000: 45.8, 45.4, 47.4, 45.6, 46.8
- 2000: 45.5, 46.6, 46.5, 46.8, 45.5
- 3000: 46.2, 48.4, 46.3, 46.9, 45.8
- **Trend**: Highest value at **Ratio 0.2** and **Window Size 3000** (48.4).
### Key Observations
1. **AMC23 Dataset**:
- Values exceed the legend’s stated range (24–48), reaching **89.0**. This suggests either a miscalibrated legend or an outlier.
- Performance improves with larger window sizes (e.g., 3000) and mid-range ratios (0.3–0.4).
2. **AIME24 Dataset**:
- Consistent performance across ratios, with a peak at **Ratio 0.3** and **Window Size 2000** (49.3).
3. **GPQA_D Dataset**:
- Values cluster around 45–48, with a notable peak at **Ratio 0.2** and **Window Size 3000** (48.4).
4. **Legend Discrepancy**:
- The legend’s upper bound (48) does not align with the AMC23 data (89.0), indicating a potential error in the visualization.
### Interpretation
- **Model Performance**: R1-Llama shows varying effectiveness across datasets. AMC23 yields the highest Pass@1 scores, suggesting it is the most favorable for this model configuration.
- **Optimal Parameters**:
- For **AIME24** and **GPQA_D**, mid-range ratios (0.3–0.4) and larger window sizes (2000–3000) maximize performance.
- **AIME25** exhibits lower overall performance, with minimal improvement beyond **Ratio 0.3**.
- **Legend Issue**: The AMC23 data’s extreme values (e.g., 89.0) contradict the legend’s 24–48 range, raising questions about data normalization or visualization accuracy.
- **Trend Consistency**: Larger window sizes generally correlate with higher performance, but this is not universal (e.g., AIME25 shows no improvement beyond 1000).
This analysis highlights the importance of dataset-specific tuning for R1-Llama and underscores potential visualization inconsistencies in the provided heatmaps.
</details>
<details>
<summary>x54.png Details</summary>

### Visual Description
## Heatmap: Model Performance Comparison Across Datasets
### Overview
The image contains four heatmaps comparing the performance of the R1-Qwen model across different datasets (AIME24, AIME25, AMC23, GPQA_D) under varying local window sizes (500, 1000, 2000, 3000) and ratio values (0.1–0.5). Performance is measured using a metric labeled "Pass@1," with color intensity indicating higher/lower values.
---
### Components/Axes
1. **X-Axis (Ratio)**: Discrete values [0.1, 0.2, 0.3, 0.4, 0.5].
2. **Y-Axis (Local Window Size)**: Discrete values [500, 1000, 2000, 3000].
3. **Color Scale (Pass@1)**:
- AIME24: 40.0–52.0 (dark blue = highest).
- AIME25: 31.3–36.0 (dark blue = highest).
- AMC23: 85.0–88.5 (dark blue = highest).
- GPQA_D: 46.1–49.7 (dark blue = highest).
4. **Legend**: Positioned on the right of each heatmap, with a vertical color bar and numerical range.
---
### Detailed Analysis
#### AIME24
- **Key Values**:
- 500 window: 47.3 (0.1), 46.0 (0.2), 42.7 (0.3), 46.0 (0.4), 47.3 (0.5).
- 2000 window: 44.0 (0.1), 47.3 (0.2), 45.3 (0.3), **52.0** (0.4), 43.3 (0.5).
- 3000 window: 42.7 (0.1), 44.0 (0.2), 42.0 (0.3), 46.0 (0.4), 48.7 (0.5).
- **Trend**: Performance peaks at 2000 window size and 0.4 ratio (52.0).
#### AIME25
- **Key Values**:
- 500 window: 32.7 (0.1), 30.7 (0.2), 34.0 (0.3), 35.3 (0.4), 35.3 (0.5).
- 2000 window: **36.0** (0.1), 34.7 (0.2), 35.3 (0.3), 32.0 (0.4), 35.3 (0.5).
- 3000 window: 31.3 (0.1), 34.0 (0.2), 31.3 (0.3), 35.3 (0.4), 35.3 (0.5).
- **Trend**: Highest value at 2000 window and 0.1 ratio (36.0).
#### AMC23
- **Key Values**:
- 500 window: 86.5 (0.1), 87.5 (0.2), 85.5 (0.3), 88.0 (0.4), 86.5 (0.5).
- 2000 window: 87.5 (0.1), **88.5** (0.2), 87.5 (0.3), 88.5 (0.4), 87.5 (0.5).
- 3000 window: 85.5 (0.1), 86.5 (0.2), 85.5 (0.3), 88.5 (0.4), 88.5 (0.5).
- **Trend**: Consistent high performance across all ratios, peaking at 2000/3000 window sizes and 0.2–0.5 ratios.
#### GPQA_D
- **Key Values**:
- 500 window: 47.1 (0.1), 46.7 (0.2), 49.1 (0.3), 48.3 (0.4), 48.0 (0.5).
- 2000 window: 46.7 (0.1), **49.7** (0.2), 48.2 (0.3), 48.0 (0.4), 48.2 (0.5).
- 3000 window: 45.7 (0.1), 47.6 (0.2), 47.2 (0.3), 47.5 (0.4), 46.1 (0.5).
- **Trend**: Peak at 2000 window and 0.2 ratio (49.7).
---
### Key Observations
1. **Window Size Impact**:
- Larger window sizes (2000–3000) generally improve performance across datasets.
- Exceptions: AIME25 shows mixed results with 3000 window size.
2. **Ratio Impact**:
- Higher ratios (0.4–0.5) often correlate with better performance, except in AIME25 (0.1 ratio outperforms).
3. **Dataset Variance**:
- AMC23 consistently achieves the highest scores (85–88.5).
- GPQA_D has the lowest scores (45.7–49.7).
---
### Interpretation
- **Optimal Configuration**: For most datasets, a window size of 2000 and ratio of 0.4–0.5 yields the best results. AMC23 benefits most from larger windows, while AIME25 performs best with smaller windows (500) and low ratios (0.1).
- **Trade-offs**: Larger windows may increase computational cost but improve accuracy. GPQA_D shows diminishing returns with larger windows.
- **Anomalies**: AIME25’s 3000 window size underperforms compared to smaller windows, suggesting potential overfitting or inefficiency at extreme window sizes.
This analysis highlights the importance of tuning window size and ratio based on dataset characteristics to balance performance and resource usage.
</details>
Figure 17: Impact of different local window sizes and retention ratios of the selection window.
Local Window and Retention Ratio Analysis over Benchmarks. Figure 17 illustrates the sensitivity of model performance to variations in Local Window Size and the Retention Ratio of the Selection Window. A moderate local window (e.g., 1000–2000) typically yields optimal results, suggesting that the benefit of retaining recent context saturates relatively quickly. Furthermore, we observe that a retention ratio between $0.3$ and $0.4$ performs best across most benchmarks (e.g., AIME24, GPQA-D), where the model effectively balances compression and reasoning performance, whereas lower ratios (e.g., $0.1$) consistently degrade accuracy due to excessive information loss.
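The interplay of the two knobs can be sketched as a simplified eviction policy (an illustrative rendition, not the exact DynTS implementation): the most recent `local_window` tokens are always retained, and among the older tokens the top `retention_ratio` fraction by predicted importance survives.

```python
def select_kept_indices(importance, local_window=1000, retention_ratio=0.3):
    """Return indices of KV entries to keep: the recent local window plus
    the highest-importance fraction of the older tokens."""
    n = len(importance)
    recent = list(range(max(0, n - local_window), n))   # always kept
    older = list(range(0, max(0, n - local_window)))
    k = int(len(older) * retention_ratio)
    top_older = sorted(older, key=lambda i: importance[i], reverse=True)[:k]
    return sorted(top_older) + recent

scores = [i % 7 / 7.0 for i in range(5000)]  # synthetic importance scores
kept = select_kept_indices(scores, local_window=1000, retention_ratio=0.3)
# 1000 recent tokens + 30% of the 4000 older tokens = 2200 kept entries
assert len(kept) == 2200
```

Raising either knob keeps more entries, trading memory for context retention, which matches the accuracy trends in Figure 17.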
## Appendix E Limitations and Future Work
Currently, DynTS is implemented on top of the transformers library, and we are actively working on deploying it to other inference frameworks such as vLLM and SGLang. Additionally, our current training data focuses on mathematical reasoning, which may limit performance in other domains such as coding or abstract reasoning. In the future, we plan to expand data diversity to cover a broader range of reasoning tasks. Moreover, constrained by computational resources, we used a relatively small dataset ($\sim 7,000$ samples) for training. This scale limits us to optimizing only the importance predictor's parameters, since optimizing all parameters on so small a dataset may compromise the model's original generalization capabilities. This constraint may keep DynTS from reaching its full potential. Future work can focus on scaling up the dataset and jointly optimizing both the backbone and the predictor to elicit stronger capabilities.